Lesson 12: Deep Learning Part 2 2018 - Generative Adversarial Networks (GANs)
Chapters
0:00 Introduction
1:05 Christine Payne
7:16 Darknet
8:55 Basic Skills
11:03 Architecture
11:50 Basic Architecture
14:38 Res Blocks
16:46 Number of Channels
18:23 InPlace
21:10 Padding
22:13 One by One Conv
26:00 Wide Residual Networks
29:44 Self-Normalising
31:28 Group Layers
37:25 Adaptive Average Pooling
46:53 Strides
48:26 GANs
54:51 Generating Pictures
00:00:00.000 |
So, we're going to be talking about GANs today. 00:00:11.520 |
Very hot technology, but definitely deserving to be in the cutting-edge deep learning part 00:00:20.640 |
of the course, because they're not quite proven to be necessarily useful for anything yet, but 00:00:29.680 |
they're definitely going to get there, and we're going to focus on the things where they're 00:00:38.400 |
closest. There are a number of areas where they may turn out to be useful in practice, and 00:00:42.880 |
I think the area where they're going to be useful in practice is the kind of thing you 00:00:48.400 |
see on the left here, which is, for example, turning drawings into rendered pictures. 00:00:54.960 |
This comes from a paper that just came out two days ago. 00:00:59.480 |
So there's very active research going on in this area right now. 00:01:05.560 |
Before we get there, though, let's talk about some interesting stuff from the last class. 00:01:12.400 |
This is an interesting thing that one of our diversity fellows, Christine Payne, did. 00:01:17.800 |
Christine has a master's in medicine from Stanford, and so she obviously had an interest 00:01:24.000 |
in thinking about what it would look like if we built a language model of medicine. 00:01:30.800 |
One of the things that we briefly touched on back in lesson 4 but didn't really talk 00:01:35.560 |
much about last time is this idea that you can actually seed a generative language model, 00:01:41.720 |
which basically means you've trained a language model on some corpus, and then you're going 00:01:46.040 |
to generate some text from that language model. 00:01:49.520 |
And so you can start off by feeding it a few words to basically say here's the first few 00:01:55.360 |
words to create the hidden state in the language model, and then generate from there, please. 00:02:01.000 |
And so Christine did something clever, which was to seed it with 00:02:06.920 |
a question, and then repeat the question three times, and then let the model generate from there. 00:02:15.760 |
She fed a language model lots of different medical texts, and then fed it this question: 00:02:21.360 |
what is the prevalence of malaria, and the model said: in the US about 10% of the population 00:02:27.640 |
has the virus, but only about 1% is infected with the virus, about 50 to 80 million are... 00:02:33.440 |
She said what's the treatment for ectopic pregnancy, and it said it's a safe and safe 00:02:38.240 |
treatment for women with a history of symptoms that may have a significant impact on clinical 00:02:41.720 |
response, most important factor is development of management of ectopic pregnancy, etc. 00:02:46.720 |
And so what I find interesting about this is that, to me, as 00:02:56.000 |
somebody who doesn't have a master's in medicine from Stanford, it's pretty close to being a believable 00:03:00.600 |
answer to the question, but it really has no bearing on reality whatsoever, and I kind 00:03:06.920 |
of think it's an interesting kind of ethical and user experience quandary. 00:03:13.560 |
So actually, I'm also involved in a company called Doc.ai that's 00:03:20.480 |
doing a number of things, but in the end wants to provide an app for doctors and patients which can 00:03:25.600 |
help create a conversational user interface around helping them with their medical issues. 00:03:31.520 |
And I've been continually saying to the software engineers on that team, please don't try to 00:03:37.640 |
create a generative model using like an LSTM or something because they're going to be really 00:03:43.520 |
good at creating bad advice that sounds impressive, kind of like political pundits or tenured professors, 00:03:55.320 |
people who can say bullshit with great authority. 00:04:03.400 |
So I thought it was a really interesting experiment, and great to see what our diversity fellows 00:04:15.680 |
are working on. I suppose I shouldn't just say master's in medicine: Christine is actually also a Juilliard-trained classical 00:04:21.080 |
musician, a Princeton valedictorian in physics, and a high-performance computing expert as well. 00:04:31.120 |
So yeah, a really impressive group of people, and great to see such exciting kinds of ideas. 00:04:37.440 |
And if you're wondering, you know, I've done some interesting experiments, should I let 00:04:46.800 |
people know about them? Well, Christine mentioned this on the forum, and I went on to mention it on Twitter, to which 00:04:52.920 |
I got this response: "Are you looking for a job?" You may be wondering who Xavier Amatriain is: 00:04:59.080 |
well, he is the founder of a hot new medical AI startup. He was previously the head of 00:05:05.400 |
engineering at Quora, and before that he was the guy at Netflix who ran the data science team 00:05:10.960 |
and built their recommender systems. So this is what happens if you do something cool: 00:05:17.160 |
let people know about it and get noticed by awesome people like Xavier. 00:05:32.520 |
And the reason I'm going to talk about CIFAR-10 is that we're going to be looking at some 00:05:39.560 |
more bare-bones PyTorch stuff today to build these generative adversarial models. There's 00:05:47.040 |
no fastai support to speak of at all for GANs at the moment; I'm sure there will 00:05:53.320 |
be soon enough, but currently there isn't, so we're going to be building a lot of models from scratch. 00:05:57.200 |
It's been a while since we've done serious model building: a little bit of model building 00:06:03.240 |
I guess for our bounding box stuff, but really all the interesting stuff there was the loss function. 00:06:10.840 |
So we looked at CIFAR-10 in part 1 of the course and we built something which was 00:06:15.080 |
getting about 85% accuracy and took, if I remember correctly, a couple of hours to train. 00:06:21.800 |
Interestingly, there's a competition going on now to see who can actually train CIFAR- 00:06:25.880 |
10 the fastest, through this Stanford DAWNBench benchmark, and the goal is to get to 94% accuracy. 00:06:34.160 |
So it'd be interesting to see if we can build an architecture that can get to 94% accuracy 00:06:39.760 |
because that's a lot better than our previous attempt and so hopefully in doing so we'll 00:06:44.600 |
learn something about creating good architectures. 00:06:47.720 |
That will then be useful for looking at these GANs today, but I think it's also useful because 00:06:56.460 |
I've been looking much more deeply into the last few years' papers about different kinds 00:07:02.840 |
of CNN architectures and realized that a lot of the insights in those papers are not being 00:07:08.060 |
widely leveraged and clearly not widely understood. 00:07:11.560 |
So I want to show you what happens if we can leverage some of that understanding. 00:07:17.200 |
So I've got this notebook called CIFAR-10 Darknet. 00:07:23.200 |
That's because the architecture we're going to look at is really very close to the Darknet architecture. 00:07:29.600 |
But you'll see in the process that what we mean by the Darknet architecture here is not the whole YOLO version 00:07:34.920 |
3 end-to-end thing, but just the part of it that they pre-trained on ImageNet to do classification. 00:07:41.520 |
It's almost the most generic, simple architecture you could come up with. 00:07:48.160 |
And so it's a really great starting point for experiments. 00:07:52.360 |
So we're going to call it darknet, but it's not quite darknet and you can fiddle around 00:07:56.240 |
with it to create things that definitely aren't darknet. 00:07:58.840 |
It's really just the basis of nearly any modern ResNet-based architecture. 00:08:06.480 |
So sci-fi 10, remember, is a fairly small dataset. 00:08:14.400 |
And I think it's a really great dataset to work with because you can train it relatively quickly. 00:08:23.280 |
It's a relatively small amount of data, unlike ImageNet, and it's actually quite hard to recognize 00:08:28.040 |
the images because 32x32 is kind of too small to easily see what's going on. 00:08:35.440 |
So I think it's a really underappreciated dataset, because it's old, and who at DeepMind 00:08:42.400 |
or OpenAI wants to work with a small old dataset when they could use their entire server room on something much bigger? 00:08:50.440 |
But to me, I think this is a really great dataset to focus on. 00:08:56.640 |
So we'll go ahead and import our usual stuff, and we're going to try to build a network from scratch. 00:09:08.760 |
One thing that I think is a really good exercise for anybody who's not 100% confident with 00:09:14.120 |
their kind of broadcasting and PyTorch basic skills is to figure out how I 00:09:24.240 |
computed these numbers: these numbers are the averages for each channel and the standard deviations for each 00:09:30.680 |
channel. So try, as a bit of homework, to make sure you can recreate those numbers, and see 00:09:35.480 |
if you can do it in no more than a couple of lines of code, you know, no loops. 00:09:42.960 |
Ideally you want to do it in one go if you can. 00:09:49.160 |
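As a hedged sketch of one possible answer to that homework (assuming the images have been loaded into a NumPy array x of shape (N, 32, 32, 3), scaled 0 to 1; the name x and the exact shape are assumptions):

```python
import numpy as np

# x: all the training images stacked into one array of shape (N, 32, 32, 3) - an assumption
mean = x.mean(axis=(0, 1, 2))  # average over every axis except channels: one mean per channel
std  = x.std(axis=(0, 1, 2))   # likewise: one standard deviation per channel
print(mean, std)
```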
Because these images are fairly small, we can use a larger batch size than usual, 256, and the image size is 32. 00:09:58.940 |
Transformations - normally we have a standard set of side-on transformations that we use for photos of everyday objects. 00:10:07.640 |
We're not going to use that here though because these images are so small that trying to rotate 00:10:11.520 |
a 32x32 image a bit is going to introduce a lot of blocking kind of distortions. 00:10:19.000 |
So the kind of standard transforms that people tend to use is a random horizontal flip, and 00:10:25.040 |
then we add size divided by 8, so 4 pixels of padding on each side. 00:10:32.120 |
And one thing which I find works really well is that, by default, fastai doesn't add black padding; 00:10:39.080 |
we actually take the last 4 pixels of the existing photo, flip it, and reflect it, 00:10:44.680 |
and we find that we get much better results by using this reflection padding by default. 00:10:51.400 |
So now that we've got a 36x36 image, this set of transforms in training will randomly pick a 32x32 crop. 00:11:00.240 |
So we get a little bit of variation, but not heaps. 00:11:03.320 |
Alright, so we can use our normal from paths to grab our data. 00:11:10.360 |
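Here's a hedged sketch of those data steps against the fastai 0.7-era API (the stats values are the commonly cited CIFAR-10 per-channel mean and standard deviation, which you can recompute as per the homework above; PATH is assumed to point at the dataset):

```python
from fastai.conv_learner import *

# commonly cited CIFAR-10 per-channel stats (approximate)
stats = (np.array([0.4914, 0.4822, 0.4465]), np.array([0.2470, 0.2435, 0.2616]))
bs, sz = 256, 32

# random flip, then sz//8 = 4 pixels of (reflection) padding, then a random 32x32 crop
tfms = tfms_from_stats(stats, sz, aug_tfms=[RandomCrop(sz), RandomFlip()], pad=sz // 8)
data = ImageClassifierData.from_paths(PATH, val_name='test', tfms=tfms, bs=bs)
```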
And what we're going to do is create an architecture which fits in one screen. 00:11:18.080 |
So this is from scratch. As you can see, I'm using the predefined Conv2d, BatchNorm2d, and 00:11:27.280 |
LeakyReLU modules, but I'm not using any predefined blocks or anything; they're all being defined here. 00:11:37.840 |
So if you're ever wondering can I understand a modern good quality architecture, absolutely. 00:11:48.480 |
So my basic starting point with an architecture is to say it's a stacked bunch of layers. 00:11:57.960 |
And generally speaking there's going to be some kind of hierarchy of layers. 00:12:00.680 |
So at the very bottom level there's things like a convolutional layer and a batch norm 00:12:05.280 |
So if you're thinking anytime you have a convolution, you're probably going to have some standard 00:12:10.800 |
sequence and normally it's going to be conv, batch norm, then a nonlinear activation. 00:12:17.720 |
So I try to start right from the top by saying, okay, what are my basic units going to be? 00:12:25.600 |
And so by defining it here, that way I don't have to worry about trying to keep everything 00:12:34.520 |
consistent and it's going to make everything a lot simpler. 00:12:37.040 |
So here's my conv layer, and so anytime I say conv layer, I mean conv, batch norm, relu. 00:12:43.880 |
Now I'm not quite saying relu, I'm saying leaky relu, and I think we've briefly mentioned 00:12:53.280 |
it before, but the basic idea is that normally a relu looks like that. 00:13:18.740 |
So this part, as before, has a gradient of 1, and this part has a gradient of, it can 00:13:23.400 |
vary, but something around 0.1 or 0.01 is common. 00:13:28.640 |
And the idea behind it is that when you're in this negative zone here, you don't end 00:13:34.380 |
up with a 0 gradient, which makes it very hard to update it. 00:13:39.080 |
In practice, people have found leaky relu more useful on smaller datasets and less useful 00:13:45.480 |
on big datasets, but it's interesting that for the YOLO version 3 paper, they did use 00:13:49.720 |
a leaky relu and got great performance from it. 00:13:53.080 |
So it rarely makes things worse, and it often makes things better. 00:13:57.320 |
So if you need to create your own architecture, it's probably not a bad idea to make that your default choice. 00:14:07.560 |
You'll notice I don't define a PyTorch module here, I just go ahead and go sequential. 00:14:13.640 |
This is something that if you read other people's PyTorch code, it's really underutilized. 00:14:19.480 |
People tend to write everything as a PyTorch module with an init and a forward. 00:14:24.040 |
But if the thing you want is just a sequence of things one after the other, it's much more 00:14:29.240 |
concise and easy to understand to just make it a sequential. 00:14:32.640 |
So I've just got a simple plain function that just returns a sequential model. 00:14:38.700 |
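A minimal sketch of such a function, close in spirit to what's described here (treat the exact defaults, like the 0.1 slope, as illustrative):

```python
import torch.nn as nn

def conv_layer(ni, nf, ks=3, stride=1):
    """conv, then batch norm, then leaky ReLU, as one sequential unit."""
    return nn.Sequential(
        nn.Conv2d(ni, nf, kernel_size=ks, stride=stride,
                  padding=ks // 2, bias=False),   # bias off: the batch norm adds one anyway
        nn.BatchNorm2d(nf),
        nn.LeakyReLU(negative_slope=0.1, inplace=True))
```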
So I mentioned that there's generally a number of hierarchies of units in most modern networks. 00:14:49.960 |
And I think we know now that the next level in this unit hierarchy for ResNets is the 00:15:01.640 |
res block or the residual block, I call it here a res layer. 00:15:10.520 |
And back when we last did CIFAR-10, I oversimplified this; I cheated a little bit. 00:15:18.560 |
We had x coming in, and we put that through a conv, and then we added the result back up to x. 00:15:30.720 |
So in general, your output is equal to your input plus some function of your input: y = x + f(x). 00:15:45.080 |
And the thing we did last year was we made f a single 2D conv. 00:15:52.240 |
Whereas actually, in the real res block, there are two of them. 00:16:14.480 |
And when I say conv, I'm using this as a shortcut for our conv layer. 00:16:24.800 |
So you can see here, I've created two convs, and here it is. 00:16:28.680 |
I take my x, put it through the first conv, put it through the second conv, and add it 00:16:33.100 |
back up to my input again to get my basic res block. 00:16:40.000 |
So, one interesting insight here is what are the number of channels in these convolutions? 00:16:59.960 |
So we've got coming in some number of input filters. 00:17:09.040 |
The way that the darknet folks set things up is they said we're going to make every one 00:17:13.320 |
of these res layers spit out the same number of channels that came in. 00:17:18.840 |
And I kind of like that, that's why I used it here, because it makes life simpler. 00:17:23.120 |
And so what they did is they said let's have the first conv halve the number of channels. 00:17:31.480 |
So ni goes to ni/2, and then ni/2 goes back to ni. 00:17:36.180 |
So you've kind of got this funneling thing where, if you've got 64 channels coming 00:17:42.720 |
in, it gets squished down by the first conv to 32 channels, and then taken back up to 64 again. 00:17:56.000 |
Why is inplace=True in the leaky ReLU? 00:18:01.560 |
A lot of people forget this or don't know about it. 00:18:05.640 |
But this is a really important memory technique. 00:18:11.120 |
If you think about it, this conv layer is like the lowest-level thing, so pretty much 00:18:15.360 |
everything in our ResNet, once it's all put together, is going to be conv layers, conv layers, conv layers. 00:18:23.680 |
If you don't have inplace=True, it's going to create a whole separate piece of memory for the output of the activation. 00:18:34.720 |
So it's going to allocate a whole bunch of memory that's totally unnecessary. 00:18:39.280 |
And actually, since I wrote this, I came up with another idea the other day, which I'll 00:18:45.800 |
now implement, which is that you can do the same thing for the res layer. Rather than going 00:18:50.280 |
x plus that, you can actually do the addition in place as well. 00:19:01.360 |
Hopefully some of you might remember that in PyTorch, pretty much every function has 00:19:06.400 |
an underscore suffix version which says do that inplace. 00:19:11.160 |
So as well as plus, there's an add method, and add_ with the underscore is add in place. 00:19:18.680 |
And so that's now suddenly reduced my memory there as well. 00:19:26.920 |
And I actually forgot the inplace equals true at first for this, and I literally was having 00:19:31.000 |
to decrease my batch size to much lower amounts than I knew should be possible, and it was 00:19:34.720 |
driving me crazy, and then I realized that that was missing. 00:19:39.560 |
You can also do that with dropout, by the way, if you have dropout. 00:19:42.880 |
So dropout and all the activation functions you can do inplace, and then generally any 00:19:50.280 |
arithmetic operation you can do inplace as well. 00:19:54.360 |
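Putting those pieces together, a hedged sketch of what the res layer might look like, using the conv_layer function sketched above and the in-place add_ just discussed:

```python
class ResLayer(nn.Module):
    def __init__(self, ni):
        super().__init__()
        self.conv1 = conv_layer(ni, ni // 2, ks=1)  # squish the channels down...
        self.conv2 = conv_layer(ni // 2, ni, ks=3)  # ...and take them back up again

    def forward(self, x):
        # in-place add: no extra memory allocated for the result
        return x.add_(self.conv2(self.conv1(x)))
```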
Why is bias usually set to False in the conv layer in ResNets? 00:20:02.400 |
If you're watching the video, pause now and see if you can figure this out, because this is a really interesting question. 00:20:16.060 |
So if you've figured it out, here's the thing, immediately after the conv is a batch norm. 00:20:24.000 |
And remember, batch norm has two learnable parameters for each activation: 00:20:31.280 |
the thing you multiply by and the thing you add. 00:20:34.800 |
So if we had bias here to add, and then we add another thing here, we're adding two things 00:20:41.120 |
which is totally pointless, like that's two weights where one would do. 00:20:44.440 |
So if you have a batch norm after a conv, then you can either say to the batch norm, 00:20:51.240 |
don't include the add bit there please, or, easier, just say don't include the bias in the conv. 00:21:00.200 |
There's no particular harm in having it, but again, it's going to take more memory, because that's more gradients to keep track of. 00:21:12.400 |
Also another thing, a little trick, is most people's conv layers have padding as a parameter, 00:21:19.320 |
but generally speaking you should be able to calculate the padding easily enough. 00:21:23.600 |
And I see people try to implement special same padding modules and all kinds of stuff 00:21:30.760 |
But if you've got a stride of 1 and a kernel size of 3, then obviously the kernel is 00:21:44.000 |
going to overlap by one unit on each side, so we want padding of 1. 00:21:50.680 |
Whereas if it's kernel size 1, then we don't need any padding. 00:21:54.480 |
So in general, padding of kernel size integer-divided by 2 (ks//2) is what you need. 00:22:00.640 |
There's some tweaks sometimes, but in this case this works perfectly well. 00:22:05.400 |
So again, trying to simplify my code by having the computer calculate stuff for me, rather than doing it myself. 00:22:15.120 |
Another thing to notice with these two conv layers, besides this idea of a bottleneck, this idea 00:22:21.240 |
of reducing the channels and then increasing them again, is the kernel sizes we use. 00:22:26.320 |
So here is a 1x1 conv, and this is again something you might want to pause the video and think about: what is a 1x1 conv actually doing? 00:22:42.920 |
So if we've got a little 4x4 grid here, and of course there's a filter or channels axis 00:22:55.200 |
as well, maybe that's like 32, and we're going to do a 1x1 conv. 00:23:01.160 |
So what's the kernel for a 1x1 conv going to look like? 00:23:14.760 |
So remember when we talk about the kernel size, we never mention that last piece, but 00:23:20.360 |
let's say it's 1x1 by 32 because that's part of the filters in and filters out. 00:23:24.840 |
So in other words then, what happens is this kernel gets placed first of all here on 00:23:31.480 |
the first cell, and we basically get a dot product of that 32-deep kernel with this 32-deep 00:23:39.320 |
bit of the input, and that's going to give us our first output. 00:23:47.920 |
And then we're going to take that same 32-deep kernel and put it on the second cell to get the second output. 00:23:52.040 |
So it's basically going to be a bunch of little dot products, one for each point in the grid. 00:24:01.640 |
So what it is, then, is basically something which is allowing us to 00:24:13.960 |
change the dimensionality in whatever way we want in the channel dimension. 00:24:28.760 |
And so in this case we're creating ni divided by 2 of these, so we're going to have ni divided 00:24:35.920 |
by 2 of these dot products, all with different weighted averages of the input channels. 00:24:42.240 |
So it basically lets us, with very little computation, add this additional step of calculation into our network. 00:24:55.720 |
So that's a cool trick, this idea of taking advantage of these 1x1 convs, creating this 00:25:01.200 |
bottleneck and then pulling it out again with 3x3 convs. 00:25:05.280 |
The 3x3 conv is what actually takes advantage of the 2D nature of the input. 00:25:12.160 |
The 1x1 conv doesn't take advantage of that at all. 00:25:17.760 |
So these two lines of code, there's not much in it, but it's a really great test of your 00:25:24.760 |
understanding and kind of your intuition about what's going on. 00:25:29.160 |
Why is it that a 1x1 conv going from ni to ni/2 channels, followed by a 3x3 conv 00:25:36.440 |
going from ni/2 to ni channels, works? Why do the tensor ranks line up? Why 00:25:42.800 |
do the dimensions all line up nicely? Why is it a good idea? What's it really doing? It's 00:25:49.280 |
a really good thing to fiddle around with, maybe create some small ones in Jupyter Notebook, 00:25:55.280 |
run them yourself, see what inputs and outputs come in and out. Really get a feel for that. 00:26:01.360 |
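Here's the kind of small Jupyter experiment that's worth running (a sketch; the shapes match the 32-channel, 4x4-grid example above):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 4, 4)                            # one image: 32 channels, 4x4 grid
conv1x1 = nn.Conv2d(32, 16, kernel_size=1, bias=False)  # ni=32 -> ni//2=16
y = conv1x1(x)
print(y.shape)                                          # torch.Size([1, 16, 4, 4])

# each output value is just a dot product over the 32 input channels at that grid cell
w = conv1x1.weight.squeeze()                            # shape (16, 32)
manual = torch.einsum('oi,bihw->bohw', w, x)
print(torch.allclose(y, manual, atol=1e-6))             # True
```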
Once you've done so, you can then play around with different things. One of the really unappreciated 00:26:13.480 |
papers is this one, Wide Residual Networks. It's really quite a simple paper, but what 00:26:25.200 |
they do is they basically fiddle around with these two lines of code. And what they do 00:26:33.360 |
is they say, well what if this wasn't divided by 2, but what if it was times 2? That would 00:26:40.600 |
be totally allowable. That's going to line up nicely. Or what if we had a 3x3 conv 00:26:52.120 |
in the middle going from ni/2 to ni/2, and then the last one goes from ni/2 back up to ni. 00:27:01.760 |
Again, that's going to work, right? Kernel sizes 1, 3, 1: halve the number of channels, 00:27:07.680 |
leave it at half and then double it again at the end. And so they come up with this 00:27:11.360 |
kind of simple notation for basically defining what this can look like. And then they show 00:27:19.680 |
lots of experiments. And basically what they show is that this approach of bottlenecking, 00:27:30.000 |
of decreasing the number of channels, which is almost universal in resnets, is probably 00:27:35.480 |
not a good idea. In fact, from their experiments, it's definitely not a good idea. Because what 00:27:39.640 |
happens is it lets you create really deep networks. The guys who created resnets got 00:27:45.360 |
particularly famous creating a 1,001-layer network. But the thing about 1,001 layers is 00:27:51.480 |
you can't calculate layer 2 until you finish layer 1. You can't calculate layer 3 until 00:27:56.600 |
you finish layer 2. So it's sequential. GPUs don't like sequential. So what they showed 00:28:03.280 |
is that if you have less layers, but with more calculations per layer, and so one easy 00:28:10.920 |
way to do that would be to remove the /2, with no other changes. Try this at home. Try 00:28:17.960 |
running CIFAR-10 and see what happens, or maybe even multiply it by 2, or fiddle around. And 00:28:24.760 |
that basically lets your GPU do more work. And it's very interesting because the vast 00:28:29.320 |
majority of papers that talk about performance of different architectures never actually 00:28:34.760 |
time how long it takes to run a batch through it. They literally say this one requires x 00:28:42.960 |
number of floating-point operations per batch, but then they never actually bother to run 00:28:48.160 |
the damn thing like a proper experimentalist and find out whether it's faster or slower. 00:28:53.000 |
And so a lot of the architectures that are really famous now turn out to be slow as molasses 00:28:59.440 |
and take craploads of memory, and are just totally useless, because the researchers never actually 00:29:06.200 |
bothered to see whether they're fast and whether they actually fit in RAM with 00:29:09.880 |
normal batch sizes. So the wide resnet paper is unusual in that it actually times how long 00:29:17.280 |
it takes, as does the YOLO version 3 paper, which made the same insight. I'm not sure 00:29:22.560 |
they might have missed the wide resnets paper because the YOLO version 3 paper came to a 00:29:27.000 |
lot of the same conclusions, but I'm not even sure they cited the wide resnets paper, so 00:29:32.120 |
they might not be aware that all that work's been done. But they're both great to see people 00:29:38.640 |
actually timing things and noticing what actually makes sense. 00:29:45.720 |
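As a concrete example of the kind of fiddling the wide ResNet paper suggests, here's a hypothetical widened variant of the ResLayer sketch from earlier; the times-2 is just one of the variants you could try:

```python
class WideResLayer(nn.Module):
    """Hypothetical variant: times 2 instead of divided by 2, so more work per layer."""
    def __init__(self, ni):
        super().__init__()
        self.conv1 = conv_layer(ni, ni * 2, ks=1)  # widen the channels...
        self.conv2 = conv_layer(ni * 2, ni, ks=3)  # ...then bring them back down

    def forward(self, x):
        return x.add_(self.conv2(self.conv1(x)))
```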
SELU looked really hot in the paper when it came out, but I noticed that you don't use it. Why is that? 00:29:52.600 |
So SELU is something largely for fully connected layers which allows you to get rid of batch 00:29:59.840 |
norm, and the basic idea is that if you use this different activation function, it's kind 00:30:06.040 |
of self-normalizing; it comes from the Self-Normalizing Neural Networks paper. So self-normalizing means that 00:30:12.440 |
it will always remain at a unit standard deviation and zero mean, and therefore you don't need 00:30:16.760 |
that batch norm. It hasn't really gone anywhere, and the reason it hasn't really gone anywhere 00:30:23.040 |
is because it's incredibly finicky. You have to use a very specific initialization, otherwise 00:30:28.680 |
it doesn't start with exactly the right standard deviation of mean. It's very hard to use it 00:30:35.680 |
with things like embeddings. If you do, then you have to use a particular kind of embedding 00:30:40.320 |
initialization which doesn't necessarily actually make sense for embeddings. 00:30:46.280 |
You do all this work very hard to get it right, and if you do finally get it right, what's 00:30:52.040 |
the point? You've managed to get rid of some batch norm layers which weren't really 00:30:56.160 |
hurting you anyway. It's interesting because that SELU paper, I think one 00:31:01.360 |
of the reasons people noticed it, or in my experience the main reason people noticed 00:31:05.120 |
it was because it was created by the inventor of LSTMs, and also it had a huge mathematical 00:31:10.800 |
appendix, and people were like, "Lots of maths from a famous guy, this must be great!" But 00:31:17.480 |
in practice I don't see anybody using it to get any state-of-the-art results or win any competitions. 00:31:30.240 |
This is some of the tiniest bits of code we've seen, but there's so much here and it's fascinating 00:31:34.080 |
to play with. Now we've got this block which is built on this block, and then we're going 00:31:40.280 |
to create another block on top of that block. We're going to call this a group layer, and 00:31:47.680 |
it's going to contain a bunch of res layers. A group layer is going to have some number 00:31:55.680 |
of channels or filters coming in, and what we're going to do is we're going to double 00:32:04.360 |
the number of channels coming in by just using a standard conv layer. Optionally, we'll halve 00:32:11.960 |
the grid size by using a stride of 2, and then we're going to do a whole bunch of res blocks, 00:32:20.600 |
a whole bunch of res layers. We can pick how many. That could be 2 or 3 or 8. Because remember, 00:32:26.560 |
these res layers don't change the grid size and they don't change the number of channels. 00:32:31.840 |
You can add as many as you like, anywhere you like, without causing any problems. It's 00:32:37.200 |
just going to use more computation and more RAM, but there's no reason other than that 00:32:42.600 |
you can't add as many as you like. A group layer, therefore, is going to end up doubling 00:32:49.720 |
the number of channels because it's this initial convolution which doubles the number of channels. 00:32:58.560 |
And depending on what we pass in a stride, it may also halve the grid size if we put 00:33:03.560 |
stride=2. And then we can do a whole bunch of res block computations as many as we like. 00:33:13.960 |
So then to define our dark net, or whatever we want to call this thing, we're just going 00:33:21.240 |
to pass in something that looks like this. And what this says is, create 5 group layers. 00:33:30.400 |
The first one will contain 1 of these extra res layers. The second will contain 2, then 00:33:36.400 |
4, then 6, then 3. And I want you to start with 32 filters. 00:33:47.560 |
So the first one of these res layers will contain 32 filters, and there will just be 00:33:57.100 |
one extra res layer. The second one is going to double the number of filters because that's 00:34:03.280 |
what we do. Each time we have a new group layer, we double the number. So the second 00:34:06.600 |
one will have 64, then 128, then 256, then 512, and then that will be it. 00:34:14.660 |
So nearly all of the network is going to be those bunches of layers. And remember, every 00:34:20.840 |
one of those group layers also has one convolution at the start. And so then, all we have is: before 00:34:29.360 |
that all happens, we're going to have one convolutional layer at the very start, and 00:34:35.640 |
at the very end we're going to do our standard adaptive average pooling, flatten, and a linear 00:34:41.360 |
layer to create the number of classes out at the end. 00:34:45.060 |
So: one convolution at one end, adaptive pooling and one linear layer at the other end, and 00:34:51.480 |
then in the middle, these group layers, each one consisting of a convolutional layer followed 00:34:57.680 |
by n number of res layers. And that's it. Again, I think we've mentioned this a few 00:35:04.840 |
times, but I'm yet to see any code out there, any examples, anything anywhere that uses 00:35:14.200 |
adaptive average pooling. Everyone I've seen writes it like this, and then puts a particular 00:35:21.240 |
number here, which means that it's now tied to a particular image size, which definitely 00:35:26.280 |
isn't what you want. So most people, even the top researchers I speak to, most of them 00:35:31.640 |
are still under the impression that a specific architecture is tied to a specific size, and 00:35:38.900 |
that's a huge problem when people think that because it really limits their ability to 00:35:44.520 |
use smaller sizes to kind of kickstart their modeling, or to use smaller sizes for inference. 00:35:51.840 |
Again, you'll notice I'm using sequential here, but a nice way to create architectures 00:35:58.240 |
is to start out by creating a list. In this case, this is a list with just one conv layer 00:36:02.040 |
in, and then my function here, make_group_layer, it just returns another list. So then I can 00:36:08.920 |
just go plus equals, appending that list to the previous list, and then I can go plus equals 00:36:14.600 |
to append this bunch of things to that list, and then finally sequential of all those layers. 00:36:20.360 |
So that's a very nice thing. So now my forward is just self.layers. 00:36:25.360 |
So this is a nice kind of picture of how to make your architectures as simple as possible. 00:36:32.720 |
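Putting all of that together, here's a hedged reconstruction of what the whole module might look like, following the description above. Flatten is a small helper assumed here, and the stride choice for the first group follows the Q&A later in this lesson; the exact channel bookkeeping may differ slightly from the real notebook:

```python
class Flatten(nn.Module):
    def forward(self, x):
        return x.view(x.size(0), -1)

class Darknet(nn.Module):
    def make_group_layer(self, ch_in, num_blocks, stride=1):
        # one conv that doubles the channels (and optionally halves the grid),
        # followed by num_blocks res layers that change neither
        return ([conv_layer(ch_in, ch_in * 2, stride=stride)] +
                [ResLayer(ch_in * 2) for _ in range(num_blocks)])

    def __init__(self, num_blocks, num_classes, nf=32):
        super().__init__()
        layers = [conv_layer(3, nf, ks=3, stride=1)]          # one conv at the very start
        for i, nb in enumerate(num_blocks):
            layers += self.make_group_layer(nf, nb, stride=1 if i == 0 else 2)
            nf *= 2
        layers += [nn.AdaptiveAvgPool2d(1), Flatten(), nn.Linear(nf, num_classes)]
        self.layers = nn.Sequential(*layers)                   # build a list, then one sequential

    def forward(self, x):
        return self.layers(x)

m = Darknet([1, 2, 4, 6, 3], num_classes=10, nf=32)
```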
So you can now go ahead and create this, and as I say, you can fiddle around. You could 00:36:38.520 |
even parameterize this to make it a number that you pass in here, to pass in different 00:36:43.720 |
numbers so it's not 2, maybe it's times 2 instead. You could pass in things that change 00:36:48.920 |
the kernel size or change the number of conv layers, fiddle around with it, and maybe you 00:36:53.880 |
can create something -- I've actually got a version of this which I'm about to run for 00:36:58.600 |
you -- which kind of implements all of the different parameters that's in that wide ResNet 00:37:04.760 |
paper, so I could fiddle around to see what worked well. 00:37:09.220 |
So once we've got that, we can use ConvLearner.from_model_data to take our PyTorch module 00:37:15.560 |
and the model data object and turn them into a learner, give it a criterion, add some metrics 00:37:21.600 |
if we like, and then we can call fit and away we go. 00:37:26.480 |
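A hedged sketch of those final steps with the fastai 0.7-era API (learning rate and schedule values are placeholders):

```python
learn = ConvLearner.from_model_data(m, data)  # wrap the module and the model data object
learn.crit = nn.CrossEntropyLoss()            # give it a criterion
learn.metrics = [accuracy]                    # add some metrics if we like
learn.fit(1e-1, 1, cycle_len=10, wds=1e-4)    # placeholder values; and away we go
```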
Could you please explain adaptive average pooling? How does setting it to 1 work? 00:37:31.560 |
Sure. Before I do, since we've only got a certain amount of time in this class, I do 00:37:45.160 |
want to see how we go with this simple network against these state-of-the-art results. So 00:37:54.120 |
to make life a little easier, we can start it running now and see how it looks later. 00:37:59.240 |
So I've got the command ready to go. So we've basically taken all that stuff and put it 00:38:05.160 |
into a simple little Python script, and I've modified some of those parameters I mentioned 00:38:10.280 |
to create something I've called a WRN22 network, which doesn't officially exist, but it's got 00:38:15.080 |
a bunch of changes to the parameters we talked about based on my experiments. 00:38:20.280 |
We're going to use the new Leslie Smith one-cycle thing. So there's quite a bunch of cool stuff 00:38:26.560 |
here. The one-cycle implementation was done by our student Sylvain Gugger, the CIFAR training 00:38:35.640 |
experiments were largely done by Brett Koonce, and stuff like getting the half-precision 00:38:42.040 |
floating-point implementation integrated into fastai was done by Andrew Shaw. So it's been 00:38:49.120 |
a cool bunch of different student projects coming together to allow us to run this. So 00:38:55.240 |
this is going to run actually on an AWS P3, which has eight GPUs. The P3 has these 00:39:04.280 |
newer Volta architecture GPUs, which actually have special support for half-precision floating 00:39:10.440 |
point. fastai is the first library I know of to actually integrate the Volta-optimized 00:39:18.280 |
half-precision floating point into the library, so we can just go learn.half now and get that 00:39:23.600 |
support automatically. And it's also the first one to integrate one-cycle, so these are the 00:39:30.160 |
parameters for the one-cycle. So we can go ahead and get this running. So what this actually 00:39:36.840 |
does is it's using PyTorch's multi-GPU support. Since there are eight GPUs, it's actually 00:39:43.800 |
going to fire off eight separate Python processes, and each one's going to train on a little 00:39:49.640 |
bit, and then at the end it's going to pass the gradient updates back to the master process 00:39:56.760 |
that's going to integrate them all together. So you'll see, here they are, lots of progress 00:40:04.600 |
bars all pop up together. And you can see it's taking three or four seconds per epoch when you 00:40:12.800 |
train it this way. When I was training earlier, I was getting about 30 seconds per epoch. So 00:40:26.240 |
doing it this way, we can kind of train things like 10 times faster or so, which is pretty 00:40:33.440 |
Okay, so we'll leave that running. So you were asking about adaptive average pooling, 00:40:38.740 |
and I think specifically what's the number 1 doing? So normally when we're doing average 00:40:49.960 |
pooling, let's say we've got 4x4. Let's say we did average pooling 2, 2. Then that creates 00:41:06.440 |
a 2x2 area and takes the average of those 4, and then we can pass in the stride. So if 00:41:22.200 |
we said stride 1, then the next one is we would look at this block of 2x2 and take that 00:41:27.480 |
average, and so forth. So that's what a normal 2x2 average pooling would be. 00:41:35.320 |
And so in that case, if we didn't have any padding, that would spit out a 3x3, because 00:41:42.640 |
it's 2 here, 2 here, 2 here. And if we added padding, we can make it 4x4. So if we wanted 00:41:52.680 |
to spit out something, we didn't want 3x3, what if we wanted 1x1? Then we could say average 00:41:59.480 |
pool 4, 4. And so that's going to do a 4x4 pool and average the whole lot, and that would spit out a 1x1. 00:42:17.360 |
But that's just one way to do it. Rather than saying the size of the pooling filter, why 00:42:25.800 |
don't we instead say, I don't care what the size of the input grid is, I always want 1x1. 00:42:32.880 |
So that's where then you say "adaptive average pool", and now you don't say what's the size 00:42:41.320 |
of the pooling filter, you instead say what's the size of the output I want. And so I want 00:42:45.920 |
something that's 1x1. And if you only put a single int, it assumes you mean 1x1. So in 00:42:52.320 |
this case, adaptive average pooling 1 with a 4x4 grid coming in is the same as average 00:43:00.200 |
pooling 4, 4. If it was a 7x7 grid coming in, it would be the same as 7, 7. 00:43:06.880 |
So it's the same operation; it's just expressing it in a way that says: regardless of the input grid size, I want an output of this size. 00:43:06.880 |
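A quick sketch of that equivalence (the 512 channels are just an example):

```python
import torch
import torch.nn as nn

x4 = torch.randn(1, 512, 4, 4)
print(nn.AvgPool2d(4)(x4).shape)          # torch.Size([1, 512, 1, 1])
print(nn.AdaptiveAvgPool2d(1)(x4).shape)  # torch.Size([1, 512, 1, 1]): same thing here

x7 = torch.randn(1, 512, 7, 7)
print(nn.AdaptiveAvgPool2d(1)(x7).shape)  # torch.Size([1, 512, 1, 1]): any input grid size
```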
We got to 94%, and it took 3 minutes and 11 seconds, and the previous state-of-the-art 00:43:34.120 |
was 1 hour and 7 minutes. So was it worth fiddling around with those parameters and learning 00:43:40.060 |
a little bit about how these architectures actually work, and not just using what came 00:43:43.280 |
out of the box? Well, holy shit, we just used a publicly available instance. We used a spot 00:43:49.760 |
instance, so that cost us $8 per hour; for 3 minutes, that means it cost us a few cents to train this 00:44:00.800 |
from scratch 20 times faster than anybody's ever done it before. 00:44:06.960 |
So that's like the most crazy state-of-the-art result we've ever seen, but this one just 00:44:12.680 |
blew it out of the water. This is partly thanks to just fiddling around with those parameters 00:44:21.440 |
of the architecture. Mainly, frankly, it was about using Leslie Smith's one-cycle thing and Sylvain's 00:44:28.360 |
implementation of it. As a reminder of what that's doing: on a plot of learning rate against batches, it 00:44:36.040 |
creates an upward path that's equally 00:44:50.360 |
long as the downward path, so it's a true CLR, a triangular cyclical learning rate. 00:44:57.520 |
As per usual, you can pick the ratio between those two numbers. So x divided by y in this 00:45:06.720 |
case is the number that you get to pick. In this case, we picked 50, so we started out 00:45:16.520 |
with a much smaller one here. And then it's got this cool idea which is you get to say 00:45:22.360 |
what percentage of your epochs then is spent going from the bottom of this down all the 00:45:28.120 |
way down pretty much to zero. That's what this second number here is. So 15% of the batches 00:45:34.720 |
is spent going from the bottom of our triangle even further. 00:45:42.720 |
So importantly though, that's not the only thing one cycle does. We also have momentum, 00:45:50.960 |
and momentum goes from 0.95 to 0.85 like this. In other words, when the learning rate is really 00:46:08.040 |
low, we use a lot of momentum, and when the learning rate is really high, we use very 00:46:11.720 |
little momentum, which makes a lot of sense. But until Leslie Smith showed this in that 00:46:16.160 |
paper, I've never seen anybody do it before, so it's a really cool trick. 00:46:23.520 |
You can now use that via the use_clr_beta parameter in fastai, and you should be able 00:46:30.840 |
to basically replicate this state-of-the-art result. You can use it on your own computer 00:46:36.280 |
or on Paperspace. Obviously the only thing you won't get is the multi-GPU piece, but 00:46:40.920 |
that makes it a bit easier to train. So on a single GPU, you should still be able to beat the previous state-of-the-art. 00:46:53.560 |
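A hedged sketch of what that call might look like against the fastai 0.7-era API. The tuple follows the numbers discussed above (a ratio of 50, 15% of the batches for the final descent, momentum going from 0.95 to 0.85), but treat the exact signature and values as assumptions:

```python
# assumed tuple layout: (peak-to-initial LR ratio, % of batches spent on the final
# descent, max momentum, min momentum); lr and n_epochs are placeholders
learn.fit(lr, 1, cycle_len=n_epochs, use_clr_beta=(50, 15, 0.95, 0.85), wds=1e-4)
```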
make_group_layer contains stride=2, so this means stride is 1 for layer 1 and 2 for everything 00:47:00.360 |
else. What's the logic behind it? Usually the strides I've seen are odd. 00:47:08.920 |
Strides are either 1 or 2, I think you're thinking of kernel sizes. So stride=2 means 00:47:14.280 |
that I jump 2 across, and so a stride of 2 means that you halve your grid size. I think 00:47:20.000 |
you might have got confused between stride and kernel size there. If we have a stride 00:47:27.120 |
of 1, the grid size doesn't change. If we have a stride of 2, then it does. In this case, 00:47:35.140 |
this is for CIFAR-10. 32x32 is small, and we don't get to halve the grid size very often, 00:47:42.100 |
because pretty quickly we're going to run out of cells. That's why the first layer has 00:47:50.960 |
a stride of 1, so we don't decrease the grid size straight away, basically. It's kind of 00:47:59.400 |
a nice way of doing it, because that's why we have a low number here, so we can start 00:48:05.520 |
out with not too much computation on the big grid, and then we can gradually do more and 00:48:12.120 |
more computation as the grids get smaller and smaller, because on a smaller grid the computation 00:48:17.800 |
will take less time. I think that's so we can do all of our scanning 00:48:30.340 |
in one go. Let's take a slightly early break and come back at 7:30. 00:48:50.400 |
So we're going to talk about generative adversarial networks, also known as GANs, and specifically 00:48:57.280 |
we're going to focus on the Wasserstein GAN paper, which included some guy called Soumith 00:49:04.400 |
Chintala, who went on to create some piece of software called PyTorch. The Wasserstein 00:49:11.440 |
GAN (I'm just going to call it WGAN to save time) was heavily influenced 00:49:15.840 |
by the DCGAN, or deep convolutional generative adversarial networks paper, which Soumith also 00:49:22.320 |
was involved with. It's a really interesting paper to read. A lot of it looks like this. 00:49:39.240 |
The good news is you can skip those bits, because there's also a bit that looks like 00:49:45.360 |
this which says do these things. Now I will say though that a lot of papers have a theoretical 00:49:55.500 |
section which seems to be there entirely to get past the reviewer's need for theory. That's 00:50:03.220 |
not true of the WGAN paper. The theory bit is actually really interesting. You don't 00:50:08.060 |
need to know it to use it, but if you want to learn about some cool ideas and see the 00:50:14.600 |
thinking behind why this particular algorithm looks the way it does, it's absolutely fascinating. Before this paper 00:50:22.720 |
came out, literally nobody I knew had studied the math that it's based on, 00:50:28.760 |
so everybody had to learn it from scratch. The paper does a pretty good job 00:50:33.520 |
of laying out all the pieces. You'll have to do a bunch of reading yourself. If you're 00:50:39.000 |
interested in digging into the deeper math behind some paper to see what it's like to 00:50:46.240 |
study it, I would pick this one. Because at the end of that theory section, you'll come 00:50:51.520 |
away saying, okay, I can see now why they made this algorithm the way it is. And then 00:51:01.600 |
having come up with that idea, the other thing is often these theoretical sections are very 00:51:05.560 |
clearly added after they come up with the algorithm. They'll come up with the algorithm 00:51:09.020 |
based on intuition and experiments, and then later on post-hoc justify it. Whereas this 00:51:14.280 |
one you can clearly see it's like, okay, let's actually think about what's going on in GANs 00:51:19.720 |
and think about what they need to do and then come up with the algorithm. 00:51:24.280 |
So the basic idea of a GAN is it's a generative model. So it's something that is going to 00:51:31.880 |
create sentences or create images. It's going to generate stuff. And it's going to try and 00:51:42.400 |
create stuff which is very hard to tell the difference between generated stuff and real 00:51:49.480 |
stuff. So a generative model could be used to face-swap a video, a very well-known controversial 00:51:58.600 |
thing of deep fakes and fake pornography and stuff happening at the moment. It could be 00:52:04.360 |
used to fake somebody's voice. It could be used to fake the answer to a medical question. 00:52:13.640 |
But in that case, it's not really a fake. It could be a generative answer to a medical 00:52:18.080 |
question that's actually a good answer. So you're generating language. You could generate 00:52:23.200 |
a caption to an image, for example. So generative models have lots of interesting applications. 00:52:35.920 |
But generally speaking, they need to be good enough that, for example, if you're using 00:52:41.240 |
it to automatically create a new scene for Carrie Fisher in the next Star Wars movies 00:52:48.640 |
and she's not around to play that part anymore, you want to try and generate an image of her 00:52:54.580 |
that looks the same; it has to fool the Star Wars audience into thinking it doesn't 00:53:00.400 |
look like some weird fake Carrie Fisher, it looks like the real Carrie Fisher. Or if you're 00:53:05.680 |
trying to generate an answer to a medical question, you want to generate English that 00:53:10.680 |
reads nicely and clearly and sounds authoritative and meaningful. So the idea of a generative 00:53:19.320 |
adversarial network is we're going to create not just a generative model to create, say, 00:53:27.140 |
the generated image, but a second model that's going to try to pick which ones are real and 00:53:33.600 |
which ones are generated. We're going to call them fake. So which ones are real and which 00:53:38.360 |
ones are fake? So we've got a generator that's going to create our fake content and a discriminator 00:53:45.500 |
that's going to try to get good at recognizing which ones are real and which ones are fake. 00:53:50.400 |
So there's going to be two models. And then there's going to be adversarial, meaning the 00:53:53.960 |
generator is going to try to keep getting better at fooling the discriminator into thinking 00:53:59.580 |
that fake is real, and the discriminator is going to try to keep getting better at discriminating 00:54:04.320 |
between the real and the fake. And they're going to go head-to-head, like that. 00:54:09.640 |
And it's basically as easy as I just described. It really is. We're just going to build two 00:54:16.640 |
models in PyTorch. We're going to create a training loop that first of all says the loss 00:54:22.440 |
function for the discriminator is can you tell the difference between real and fake, 00:54:26.400 |
and then update the weights of that. And then we're going to create a loss function for 00:54:29.880 |
the generator, which is going to say: can you generate something which fools the discriminator? 00:54:34.560 |
And update the weights from that loss. And we're going to loop through that a few times. 00:54:40.840 |
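Before we get to the paper's pseudocode and the real code, here's a hedged sketch of that loop in PyTorch-flavoured pseudocode; every name in it (nz, the optimizers, the loss functions) is a placeholder, not the notebook's actual code:

```python
for epoch in range(n_epochs):
    for real in dataloader:
        # 1. discriminator step: can you tell real from fake?
        noise = torch.randn(real.size(0), nz, 1, 1)
        fake = generator(noise).detach()       # don't backprop into the generator here
        d_loss = d_loss_fn(discriminator(real), discriminator(fake))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # 2. generator step: can you generate something that fools the discriminator?
        fake = generator(torch.randn(real.size(0), nz, 1, 1))
        g_loss = g_loss_fn(discriminator(fake))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```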
And so let's come back to the pseudocode here of the algorithm and let's read the real code 00:54:52.560 |
So there's lots of different things you can do with GANs. And we're going to do something 00:54:57.640 |
that's kind of boring but easy to understand, and it's kind of cool that it's even possible. 00:55:04.040 |
We're just going to generate some pictures from nothing. We're just going to get it to 00:55:09.040 |
draw some pictures. And specifically we're going to get it to draw pictures of bedrooms. 00:55:15.440 |
You'll find if you hopefully get a chance to play around with this during the week with 00:55:19.640 |
your own datasets: if you pick a dataset that's very varied, like ImageNet, and then get a 00:55:26.040 |
GAN to try and create ImageNet pictures, it tends not to do so well, because it's not clear enough what kind of picture you want. 00:55:35.640 |
So it's better to give it, for example, there's a dataset called CelebA, which is pictures 00:55:40.400 |
of celebrity faces. That works great with GANs. You create really clear celebrity faces 00:55:46.280 |
that don't actually exist. The bedroom dataset is also a good one: lots of pictures of the same kind of thing. 00:55:55.660 |
So there's something called the LSUN scene classification dataset. You can download it using these steps. 00:56:06.600 |
It's pretty huge. So I've actually created a Kaggle dataset of a 20% sample. So unless 00:56:13.320 |
you're really excited about generating bedroom images, you might prefer to grab the 20% sample. 00:56:20.940 |
So then we do the normal steps of creating some different paths. In this case, as 00:56:27.320 |
before, I find it much easier to go the CSV route when it comes to handling our data. 00:56:34.320 |
So I just generate a CSV with the list of files that we want and a fake label that's 00:56:41.040 |
zero because we don't really have labels for these at all. 00:56:45.120 |
So I actually create two CSV files, one that contains everything in that bedroom dataset 00:56:51.720 |
and one that just contains a random 10%. It's just nice to do that, because then I can use 00:56:57.840 |
the sample most of the time when I'm experimenting, because there are well over a million files in the full set. 00:57:11.480 |
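A hedged sketch of generating those two CSVs (the paths, folder name, and column layout are assumptions):

```python
import os
import pandas as pd

files = [f'bedroom/{f}' for f in os.listdir(PATH / 'bedroom')]  # PATH is an assumption
df = pd.DataFrame({'fname': files, 'label': 0})                 # fake label of zero
df.to_csv(PATH / 'files.csv', index=False)                      # everything
df.sample(frac=0.1).to_csv(PATH / 'files_sample.csv', index=False)  # random 10% sample
```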
So this will look pretty familiar. So here's a conv block. This was before I realized that 00:57:18.640 |
sequential models are much better; if you compare this to my previous conv block with 00:57:23.320 |
a sequential model, there are just a lot more lines of code here, but it does the same thing. 00:57:36.440 |
We calculate our padding, and here's bias=False. So this is the same as before. 00:57:48.520 |
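A hedged reconstruction of that conv block as a module, matching the description (calculated padding, bias=False, and, as discussed in a question later on, ReLU before the optional batch norm):

```python
class ConvBlock(nn.Module):
    def __init__(self, ni, no, ks, stride, bn=True):
        super().__init__()
        self.conv = nn.Conv2d(ni, no, ks, stride,
                              padding=ks // 2 // stride, bias=False)  # calculated padding
        self.bn = nn.BatchNorm2d(no) if bn else None
        self.relu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        x = self.relu(self.conv(x))
        return self.bn(x) if self.bn else x
```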
So the first thing we're going to do is build a discriminator. So a discriminator is going 00:57:53.480 |
to receive as input an image, and it's going to spit out a number. And the number is meant 00:58:02.120 |
to be lower if it thinks this image is real. Of course, the "it should be lower for real images" 00:58:11.120 |
thing doesn't appear in the architecture; that will be in the loss function. So all 00:58:15.000 |
we have to do is create something that takes an image and spits out a number. 00:58:23.560 |
So a lot of this code is borrowed from the original authors of the paper, so some of 00:58:34.200 |
the naming scheme and stuff is different to what we're used to. So sorry about that. 00:58:43.600 |
But I've tried to make it look at least somewhat familiar. I probably should have renamed things 00:58:46.800 |
a little bit. But it looks very similar to what we had before, actually. We start out 00:58:50.800 |
with a convolution; remember, conv block here is conv, then ReLU, then batch norm. 00:58:57.980 |
And then we have a bunch of extra conv layers. This is not going to use a residual. It looks 00:59:04.720 |
very similar to before, a bunch of extra layers, but these are going to be conv layers rather 00:59:08.280 |
than res layers. And then at the end, we need to append enough stride 2 conv layers that 00:59:22.120 |
we decrease the grid size down to be no bigger than 4x4. So it's going to keep using stride 00:59:29.920 |
2, divide the size by 2, stride 2, divide the size by 2, until our grid size is no bigger 00:59:36.020 |
than 4. So this is quite a nice way of creating as many layers as you need in a network to 00:59:42.240 |
handle arbitrary sized images and turn them into a fixed known grid size. 00:59:46.960 |
Yes, Rachel? Does a GAN need a lot more data than say dogs 00:59:51.600 |
versus cats or NLP, or is it comparable? Honestly, I'm kind of embarrassed to say 00:59:58.880 |
I am not an expert practitioner in GANs. The stuff I teach in part 1 is stuff I'm happy 01:00:09.120 |
to say I know the best way to do these things and so I can show you state-of-the-art results 01:00:15.520 |
like I just did with CIFAR-10, with the help of some of my students, of course. I'm not 01:00:21.680 |
there at all with GANs. So I'm not quite sure how much you need. In general, it seems you 01:00:29.760 |
need quite a lot. But remember, the only reason we didn't need too much in dogs and cats is 01:00:35.920 |
because we had a pre-trained model, and could we leverage pre-trained GAN models and fine-tune 01:00:40.880 |
them? Probably. I don't think anybody's done it as far as I know. That could be a really 01:00:48.000 |
interesting thing for people to kind of think about and experiment with. Maybe people have 01:00:52.120 |
done it and there's some literature there I haven't come across. So I'm somewhat familiar 01:00:57.120 |
with the main pieces of literature in GANs, but I don't know all of it. So maybe I've 01:01:03.040 |
missed something about transfer learning in GANs, but that would be the trick to not needing 01:01:06.880 |
too much data. Question: so the huge speed-up was a combination of the one-cycle learning rate and momentum annealing, 01:01:14.560 |
plus the 8-GPU parallel training and the half precision? Is it only possible to do the 01:01:20.360 |
half-precision calculation with a consumer GPU? Another question: why is the calculation 8 01:01:26.760 |
times faster from single to half precision, while from double to single it's only 2 times faster? 01:01:32.280 |
Okay, so for the CIFAR-10 result: it's not 8 times faster from single to half. It's about 01:01:39.160 |
2 or 3 times as fast from single to half. The Nvidia claims about the flops performance 01:01:46.400 |
of the tensor cores are academically correct but in practice meaningless because it really 01:01:54.000 |
depends on what cores you need for what pieces. So about 2 or 3x improvement for half. So 01:02:02.640 |
the half-precision helps a bit, the extra GPU helps a bit, the one cycle helps an enormous 01:02:10.720 |
amount. Then another key piece was the playing around with the parameters that I told you 01:02:16.240 |
about. So reading the wide resnet paper carefully, identifying the kinds of things that they 01:02:23.040 |
found there, and then writing a version of the architecture you just saw that made it 01:02:29.040 |
really easy for me to fiddle around with parameters. Staying up all night trying every possible 01:02:37.520 |
combination of different kernel sizes and numbers of kernels and numbers of layer groups 01:02:43.760 |
and size of layer groups. Remember we did a bottleneck but actually we tended to focus 01:02:51.560 |
not on bottlenecks but instead on widening. So we actually like things that increase the 01:02:55.760 |
size and then decrease it because it takes better advantage of the GPU. So all those 01:03:01.000 |
things combined together. I'd say the one cycle was perhaps the most critical but every 01:03:07.840 |
one of those resulted in a big speedup. That's why we were able to get this 30x improvement 01:03:13.400 |
over the state of the art. And we've got some ideas for other things to try after this DAWN 01:03:13.400 |
Bench competition finishes; maybe we'll try to go even further and see if we can beat one minute. 01:03:37.480 |
So here's our discriminator. The important thing to remember about an architecture is 01:03:42.080 |
it doesn't do anything other than have some input tensor size and rank and some output 01:03:48.080 |
tensor size and rank. You see the last conv here has one channel. This is a bit different 01:03:55.840 |
to what we're used to, because normally our last thing is a linear block. But our last 01:04:00.960 |
thing here is a com block. And it's only got one channel but it's got a grid size of something 01:04:08.240 |
around 4x4. So we're going to spit out a 4x4 by 1 tensor. 01:04:17.120 |
So what we then do is we then take the mean of that. So it goes from 4x4 by 1 to the scalar. 01:04:27.980 |
So this is kind of like the ultimate adaptive average pooling, because we've got something 01:04:32.200 |
with just one channel, we take the mean. So this is a bit different. Normally we first 01:04:37.120 |
do average pooling and then we put it through a fully connected layer to get our one thing 01:04:42.000 |
out. In this case though we're getting one channel out and then taking the mean of that. 01:04:48.720 |
I haven't fiddled around with why did we do it that way, what would instead happen if 01:04:53.160 |
we did the usual average pooling followed by a fully connected layer. Would it work better? 01:04:58.560 |
Would it not? I don't know. I rather suspect it would work better if we did it the normal 01:05:04.960 |
way, but I haven't tried it and I don't really have a good enough intuition to know whether 01:05:10.400 |
I'm missing something. It would be an interesting experiment to try. If somebody wants to stick 01:05:14.640 |
an adaptive average pooling layer here and a fully connected layer afterwards with a 01:05:17.880 |
single output, it should keep working. It should do something, and you can see whether the loss still goes down. 01:05:27.320 |
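For reference, a hedged reconstruction of such a discriminator, built from the ConvBlock sketch above (isize is the input image size, nc the input channels, ndf the number of filters after the first conv; the details are a sketch, not the notebook's exact code):

```python
class DCGAN_D(nn.Module):
    def __init__(self, isize, nc, ndf):
        super().__init__()
        assert isize % 16 == 0, 'isize should be a multiple of 16'
        layers = [ConvBlock(nc, ndf, 4, 2, bn=False)]  # no batch norm on the first layer
        csize, cndf = isize // 2, ndf
        while csize > 4:                # stride-2 convs until the grid is no bigger than 4
            layers.append(ConvBlock(cndf, cndf * 2, 4, 2))
            cndf *= 2
            csize //= 2
        # final conv: one channel out, still on the (roughly 4x4) grid
        layers.append(nn.Conv2d(cndf, 1, 3, padding=1, bias=False))
        self.main = nn.Sequential(*layers)

    def forward(self, x):
        # a 4x4-by-1 tensor comes out; its mean is the single score
        return self.main(x).mean()
```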
So that's the discriminator. There's going to be a training loop. Let's assume we've 01:05:31.640 |
already got a generator. Somebody says, "Okay Jeremy, here's a generator, it generates bedrooms. 01:05:37.760 |
I want you to build a model that can figure out which ones are real and which ones aren't. 01:05:41.160 |
So I'm going to take the data set and I'm going to basically label a bunch of images 01:05:45.680 |
which are fake bedrooms from the generator and a bunch of images of real bedrooms from 01:05:50.040 |
my LSUN dataset, stick a 1 or a 0 on each one, and then I'll try to get the discriminator 01:05:56.080 |
to tell the difference. So that's going to be simple enough. 01:06:04.000 |
But I haven't been given a generator, I need to build one. So a generator, and we haven't 01:06:09.880 |
talked about the loss function yet. We're just going to assume there's some loss function 01:06:13.400 |
that does this thing. So a generator is also an architecture which doesn't do anything 01:06:19.920 |
by itself until we have a loss function and data. But what are the ranks and sizes of 01:06:25.560 |
the tensors? The input to the generator is going to be a vector of random numbers. In 01:06:34.480 |
the paper, they call that the prior. It's going to be a vector of random numbers. How 01:06:38.440 |
big? I don't know - somewhere around 64 or 128. And the idea is that a different bunch of random numbers 01:06:46.880 |
will generate a different bedroom. So our generator has to take as input a vector, and 01:07:01.320 |
it's going to take that vector, so here's our input, and it's going to stick it through, 01:07:06.200 |
in this case a sequential model. And the sequential model is going to take that vector and it's 01:07:11.160 |
going to turn it into a rank 4 tensor, or if we take off the batch bit, a rank 3 tensor, 01:07:26.520 |
height by width by 3. So you can see at the end here, our final step here, NC, number of 01:07:40.440 |
channels. So I think that's going to have to end up being 3 because we're going to create a color image. 01:07:48.800 |
In the conv block's forward, is there a reason why BatchNorm comes after ReLU, i.e. self.bn(self.relu(x))? 01:07:57.760 |
No, there's not. It's just what they had in the code I borrowed from, I think. 01:08:05.240 |
So again, unless my intuition about GANs is all wrong and for some reason needs to be 01:08:17.280 |
different to what I'm used to, I would normally expect to go ReLU then BatchNorm. This is 01:08:29.200 |
actually the order that makes more sense to me. But I think the order I had in the darknet 01:08:35.680 |
was what they used in the darknet paper. Everybody seems to have a different order of these things. 01:08:45.200 |
And in fact, most people for CIFAR-10 have a different order again, which is they actually 01:08:52.220 |
go batch norm, then ReLU, then conv, which is kind of a quirky way of thinking about it. But 01:09:02.160 |
it turns out that often for residual blocks that works better. That's called a pre-activation 01:09:07.920 |
resnet. So if you Google for pre-activation resnet, you can see that. 01:09:13.520 |
So yeah, there's not so much papers but more blog posts out there where people have experimented 01:09:19.120 |
with different orders of those things. And yeah, it seems to depend a lot on what specific 01:09:25.200 |
data set it is and what you're doing, although in general the difference in performance 01:09:29.940 |
is small enough you won't care unless it's for a competition. 01:09:36.960 |
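To make the three orderings concrete, here they are as minimal sketches (64 channels is an arbitrary choice; the last is the pre-activation ordering):

    import torch.nn as nn

    conv_relu_bn = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(64))
    conv_bn_relu = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())
    bn_relu_conv = nn.Sequential(nn.BatchNorm2d(64), nn.ReLU(), nn.Conv2d(64, 64, 3, padding=1))  # "pre-activation"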
So the generator needs to start with a vector and end up with a rank 3 tensor. We don't 01:09:45.040 |
really know how to do that yet, so how do we do that? How do we start with a vector and get to a rank 3 tensor? 01:09:52.880 |
We need to use something called a deconvolution. And a deconvolution is, or as they call it 01:10:02.920 |
in PyTorch, a transposed convolution. Same thing, different name. And so a deconvolution 01:10:13.360 |
is something which, rather than decreasing the grid size, it increases the grid size. 01:10:22.920 |
So as with all things, it's easiest to see in an Excel spreadsheet. 01:10:28.780 |
So here's a convolution. We start with a 4x4 grid cell with a single channel, a single filter. 01:10:38.240 |
And let's put it through a 3x3 kernel again with a single output. So we've got a single 01:10:46.820 |
channel in, a single filter kernel. And so if we don't add any padding, we're going to 01:10:53.400 |
end up with 2x2, because that 3x3 can go in 1, 2, 3, 4 places. It can go in one of two 01:11:01.480 |
places across and one of two places down if there's no padding. 01:11:06.880 |
So there's our convolution. Remember the convolution is just the sum of the product of the kernel 01:11:14.960 |
and the appropriate grid cell. So there's our standard 3x3 on one channel, one filter. 01:11:25.440 |
So the idea now is I want to go the opposite direction. I want to start with my 2x2, and 01:11:34.320 |
I want to create a 4x4. And specifically, I want to create the same 4x4 that I started 01:11:41.040 |
with. And I want to do that by using a convolution. 01:11:45.840 |
So how would I do that? Well, if I have a 3x3 convolution, then if I want to create 01:11:51.200 |
a 4x4 output, I'm going to need to create this much padding. Because with this much 01:12:01.340 |
padding, I'm going to end up with 1, 2, 3, 4 by 1, 2, 3, 4. You see why that is? So this 01:12:11.180 |
filter can go in any one of four places across and four places up and down. 01:12:18.380 |
So let's say my convolutional filter was just a bunch of zeros, then I can calculate my 01:12:23.960 |
error for each cell just by taking this subtraction, and then I can get the sum of absolute values, 01:12:32.920 |
the L1 loss, by just summing up the absolute values of those errors. 01:12:38.300 |
So now I could use optimization. So in Excel, that's called Solver to do a gradient descent. 01:12:48.900 |
So I'm going to set that cell equal to a minimum, and I'll try and reduce my loss by changing 01:12:56.380 |
my filter, and I'll go Solve. And you can see it's come up with a filter such that 15.7 01:13:05.420 |
compared to 16, 17.8 compared to 18, and so on - so it's not perfect. And in general, you can't 01:13:12.540 |
assume that a deconvolution can exactly create the exact thing that you want, because there's 01:13:21.220 |
just not enough information: there are only 9 numbers here, and there are 16 things you're trying to create. 01:13:26.500 |
But it's made a pretty good attempt. So this is what a deconvolution looks like, a stride 01:13:34.140 |
1 3x3 deconvolution on a 2x2 grid cell input. How difficult is it to create a discriminator 01:13:46.060 |
to identify fake news versus real news? Well, you don't need anything special, that's just 01:13:53.220 |
a classifier. So you would just use the NLP classifier from the class before last 01:14:00.780 |
and from lesson 4. In that case, there's no generative piece, right? So you just need a dataset that 01:14:10.860 |
says these are the things that we believe are fake news, and these are the things we 01:14:13.780 |
consider to be real news. And it should actually work very well. To the best of my knowledge, 01:14:22.740 |
if you try it, you should get as good a result as anybody else has got, whether it's good 01:14:27.900 |
enough to be useful in practice, I don't know. Oh - I was going to say that it's very hard 01:14:32.580 |
using the technique you've described. Very hard. There's not a good solution that does 01:14:41.300 |
that. Well, but I don't think anybody in our course has tried, and nobody else outside 01:14:46.700 |
our course knows of this technique. Because, as we've learned, we've just had a very 01:14:53.500 |
significant jump in NLP classification capabilities. Obviously the best you could do at this stage 01:15:04.940 |
would be to generate a triage that says these things look pretty sketchy based on how they're 01:15:13.340 |
written and some human could go and fact check them. An NLP classifier and RNN can't fact 01:15:20.680 |
check things, but it could recognize like, oh, these are written in that kind of highly 01:15:29.820 |
sensationalized style which fake news is often written in, and so maybe these ones are worth 01:15:34.540 |
paying attention to. I think that would probably be the best you could hope for without bringing in human fact-checkers. 01:15:44.380 |
But it's important to remember that a discriminator is basically just a classifier and you don't 01:15:51.460 |
need any special techniques beyond what we've already learnt to do NLP classification. 01:16:01.360 |
So to do that kind of deconvolution in PyTorch, just say nn.ConvTranspose2d, and in the normal 01:16:08.780 |
way you say the number of input channels, the number of output channels, the kernel size, 01:16:14.500 |
the stride, the padding, the bias - these parameters are all the same. And the reason 01:16:19.340 |
it's called a conv transpose is because actually it turns out that this is the same as the 01:16:24.860 |
calculation of the gradient of convolution. So this is a really nice example back on the 01:16:36.140 |
old Theano website that comes from a really nice paper which actually shows you some visualizations. 01:16:42.740 |
So this is actually the one we just saw of doing a 2x2 deconvolution. If there's a stride 01:16:48.580 |
2, then you don't just have padding around the outside, but you actually have to put 01:16:52.460 |
padding in the middle as well. They're not actually quite implemented this way because 01:16:58.620 |
this is slow to do. In practice they implement them a different way, but it all happens behind 01:17:03.900 |
the scenes, we don't have to worry about it. We've talked about this convolution arithmetic 01:17:10.700 |
tutorial before, and if you're still not comfortable with convolutions and in order to get comfortable 01:17:16.580 |
with deconvolutions, this is a great site to go to. If you want to see the paper, just 01:17:22.420 |
Google for convolution arithmetic, that'll be the first thing that comes up. Let's do 01:17:27.700 |
it now so you know you've found it. Here it is. And so that Theano tutorial actually comes 01:17:38.580 |
from this paper. But the paper doesn't have the animated gifs. 01:17:48.700 |
So it's interesting then. A deconv block looks identical to a conv block, except it's got 01:17:53.000 |
the word transpose written here. We just go conv, ReLU, batch norm as before; it's got 01:17:58.220 |
input filters, output filters. The only difference is that stride 2 means that the grid size doubles. 01:18:11.140 |
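A sketch of such a deconv block, assuming the conv/ReLU/batch norm ordering used above (kernel size 4, stride 2, padding 1 are typical DCGAN choices, not necessarily the exact ones here):

    import torch
    import torch.nn as nn

    class DeconvBlock(nn.Module):
        def __init__(self, ni, no, ks=4, stride=2, pad=1):
            super().__init__()
            self.conv = nn.ConvTranspose2d(ni, no, ks, stride, pad, bias=False)
            self.bn = nn.BatchNorm2d(no)
        def forward(self, x):
            return self.bn(torch.relu(self.conv(x)))

    x = torch.randn(1, 64, 8, 8)
    print(DeconvBlock(64, 32)(x).shape)  # torch.Size([1, 32, 16, 16]) -- stride 2 doubled the grid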
Both nn.ConvTranspose2d and nn.Upsample seem to do the same thing, i.e. expand grid size, 01:18:18.700 |
height and width, from the previous layer. Can we say that ConvTranspose2d is always better 01:18:23.860 |
than Upsample, since Upsample is merely resizing and filling unknowns by zeros or interpolation? 01:18:30.980 |
No, you can't. So there's a fantastic interactive paper on distill.pub called Deconvolution and Checkerboard Artifacts. 01:18:52.340 |
But the good news is, everybody else does it too. If you have a look here, can you see these 01:19:01.460 |
checkerboard artifacts? It's all like dark blue, light blue, dark blue, light blue. So 01:19:07.620 |
these are all from actual papers, right? Basically they noticed every one of these papers with 01:19:13.860 |
generative models has these checkerboard artifacts. And what they realized is it's because when 01:19:20.820 |
you have a stride 2 convolution of size 3 kernel, they overlap. And so you basically get like 01:19:30.580 |
some grid cells getting twice as much activation as others. And so even 01:19:38.340 |
if you start with random weights, you end up with a checkerboard artifact. So you can 01:19:44.780 |
kind of see it here. And so the deeper you get, the worse it gets. Their advice is actually 01:19:58.060 |
less direct than it ought to be. I found that for most generative models, upsampling 01:20:03.620 |
is better. So if you do nn.Upsample, then all it does is basically pooling - 01:20:13.300 |
but kind of the opposite of pooling. It says let's replace this one pixel or this one 01:20:20.100 |
grid cell with four, in a 2x2. And there's a number of ways to upsample. One is just to copy 01:20:26.180 |
it across to those 4. Another is to use bilinear or bicubic interpolation. There are various 01:20:32.260 |
techniques to try and create a smooth upsampled version, and you can pretty much choose any 01:20:37.500 |
of them in PyTorch. So if you do a 2x2 upsample and then a regular stride 1 3x3 conv, that's 01:20:48.180 |
like another way of doing the same kind of thing as a conv transpose. It's doubling the 01:20:55.660 |
grid size and doing some convolutional arithmetic on it. And I found for generative models it 01:21:04.020 |
pretty much always works better. And in that distill.pub publication, they kind of indicate 01:21:11.260 |
that maybe that's a good approach, but they don't just come out and say just do this, 01:21:15.080 |
whereas I would just say just do this. Having said that, for GANs, I haven't had that much 01:21:21.700 |
success with it yet, and I think it probably requires some tweaking to get it to work. 01:21:26.540 |
I'm sure some people have got it to work. The issue I think is that in the early stages, 01:21:33.980 |
it doesn't create enough noise. I had a version actually where I tried to do it with an upsample, 01:21:47.140 |
and you could kind of see that the noise didn't look very noisy. So anyway, it's an interesting 01:21:53.420 |
version. But next week when we look at style transfer and super resolution and stuff, I 01:21:58.660 |
think you'll see nn.Upsample really come into its own. 01:22:05.100 |
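For reference, a sketch of that upsample-then-conv alternative (channel sizes and the interpolation mode are assumptions):

    import torch.nn as nn

    up_block = nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),            # copy each grid cell into a 2x2
        nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=1),  # regular stride-1 3x3 conv
    )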
So the generator, we can now basically start with a vector. We can decide and say, okay, 01:22:10.340 |
let's not think of it as a vector, but actually it's a 1x1 grid cell, and then we can turn 01:22:14.120 |
it into a 4x4 and an 8x8 and so forth. And so that's why we have to make sure it's a suitable 01:22:20.860 |
multiple so that we can actually create something of the right size. And so you can see it's 01:22:26.500 |
doing the exact opposite of before, right? It's making the grid size bigger and bigger, 01:22:31.340 |
by 2 at a time, as long as it can, until it gets to half the size that we want. And then 01:22:48.340 |
finally we add one more on at the end -- sorry, we add n more on at the end with no stride, 01:22:58.100 |
and then we add one more com transpose to finally get to the size that we wanted, and 01:23:06.020 |
we're done. Finally, we put that through a tanh, and that's going to force us to be in 01:23:13.100 |
the 0-to-1 range, because of course we don't want to spit out arbitrarily sized pixel values. 01:23:24.860 |
So we've got a generator architecture which spits out an image of some given size with 01:23:29.580 |
the correct number of channels and with values between 0 and 1. 01:23:39.420 |
So at this point we can now create our ModelData object. These things take a while to train, 01:23:47.700 |
so I just made it 128x128, so this is just a convenient way to make it a bit faster. And 01:23:58.380 |
that's going to be the size of the input, but then we're going to use transformations 01:24:01.180 |
to turn it into 64x64. There's been more recent advances which have attempted to really increase 01:24:10.060 |
this up to high resolution sizes, but they still tend to require either a batch size 01:24:15.060 |
of 1 or lots and lots of GPUs or whatever. We're trying to do things that we can do on 01:24:21.740 |
single consumer GPUs here. So here's an example of one of the 64x64 bedrooms. 01:24:31.260 |
So we're going to do pretty much everything manually, so let's go ahead and create our 01:24:36.740 |
two models, our generator and our discriminator. And as you can see, they're the DCGAN modules - in other 01:24:44.460 |
words, the same modules that appeared in that paper. So if you're interested in 01:24:52.380 |
reading the papers, it's well worth going back and looking at the DCGAN paper to see 01:24:58.740 |
what these architectures are, because it's assumed you know them when you read the Wasserstein GAN paper. 01:25:07.820 |
Shouldn't we use a sigmoid if we want values between 0 and 1? 01:25:13.460 |
I always forget which one's which. So sigmoid is 0 to 1, tanh is -1 to 1. I think what will 01:25:27.780 |
happen is -- I'm going to have to check that. I vaguely remember thinking about this when 01:25:34.860 |
I was writing this notebook and realizing that 1 to -1 made sense for some reason, but 01:25:39.540 |
I can't remember what that reason was now. So let me get back to you about that during the week. 01:25:50.180 |
So we've got our generator and our discriminator. So we need a function that returns a prior 01:25:56.060 |
vector, so a bunch of noise. So we do that by creating a bunch of zeros. nz is the size 01:26:05.020 |
of z, so very often in our code if you see a mysterious letter, it's because that's the 01:26:10.260 |
letter they used in the paper. So z is the size of our noise vector. 01:26:16.980 |
So there's the size of our noise vector, and then we use a normal distribution to generate 01:26:22.340 |
random numbers inside that. And that needs to be a Variable because it's going to be part of the computation graph. 01:26:32.780 |
So here's an example of creating some noise, and so here are four different pieces of noise. 01:26:40.060 |
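A minimal sketch of such a function, assuming nz=64 and the old-style Variable API this notebook uses:

    import torch
    from torch.autograd import Variable

    nz = 64  # size of the noise vector z

    def create_noise(b):
        # b noise vectors from a standard normal, shaped as a 1x1 "grid"
        # so the generator can treat them as its input tensor
        return Variable(torch.zeros(b, nz, 1, 1).normal_(0, 1))

    noise = create_noise(4)  # four noise vectors -> four different bedrooms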
So we need an optimizer in order to update our gradients. In the Wasserstein GAN paper, 01:26:51.480 |
they told us to use RMSprop. So that's fine. So when you see this thing saying do an RMSprop 01:26:56.820 |
update in a paper, that's nice - we can just do an RMSprop update with PyTorch. And they 01:27:05.140 |
suggested a learning rate of 5e-5. I found 1e-4 seemed to work, so I just used that. 01:27:14.820 |
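As a sketch, assuming netD and netG are the discriminator and generator modules created above:

    import torch.optim as optim

    # Plain RMSprop for both networks, as the Wasserstein GAN paper asks for
    optimizerD = optim.RMSprop(netD.parameters(), lr=1e-4)
    optimizerG = optim.RMSprop(netG.parameters(), lr=1e-4)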
So now we need a training loop. And so this is the thing that's going to implement this 01:27:18.820 |
algorithm. So a training loop is going to go through some number of epochs that we get 01:27:26.780 |
to pick, so that's going to be a parameter. And so remember, when you do everything manually, 01:27:33.860 |
you've got to remember all the manual steps to do. So one is that you have to set your 01:27:37.700 |
modules into training mode when you're training them, and into evaluation mode when you're 01:27:43.340 |
evaluating them. Because in training mode, batch norm updates happen, and dropout happens. 01:27:49.940 |
In evaluation mode, those two things get turned off. That's basically the difference. So put 01:27:54.860 |
it into training mode. We're going to grab an iterator from our training data loader. 01:28:01.900 |
We're going to see how many steps we have to go through, and then we'll use TQDM to 01:28:08.020 |
give us a progress bar, and then we're going to go through that many steps. 01:28:14.060 |
So the first step of this algorithm is to update the discriminator. So in this one -- they 01:28:37.060 |
don't call it a discriminator, they call it a critic. So w are the weights of the critic. 01:28:43.380 |
So the first step is to train our critic a little bit, and then we're going to train 01:28:48.660 |
our generator a little bit, and then we're going to go back to the top of the loop. So 01:28:53.500 |
we've got a while loop on the outside, so here's our while loop on the outside, and 01:28:58.060 |
then inside that there's another loop for the critic, and so here's our little loop 01:29:02.440 |
inside that for the critic. We call it a discriminator. 01:29:07.060 |
So what we're going to do now is we've got a generator, and at the moment it's random. 01:29:13.660 |
So our generator is going to generate stuff that looks something like this, and so we 01:29:19.360 |
need to first of all teach our discriminator to tell the difference between that and a 01:29:24.060 |
bedroom. It shouldn't be too hard, you would hope. So we just do it in basically the usual way. 01:29:34.780 |
So first of all, we're going to grab a mini-batch of real bedroom photos, so we can just grab 01:29:42.060 |
the next batch from our iterator, turn it into a variable. Then we're going to calculate 01:29:54.580 |
the loss for that. So this is going to be, how much does the discriminator think this 01:30:08.220 |
looks fake? And then we're going to create some fake images, and to do that we'll create 01:30:16.240 |
some random noise, and we'll stick it through our generator, which at this stage is just 01:30:20.940 |
a bunch of random weights, and that's going to create a mini-batch of fake images. And 01:30:27.060 |
so then we'll put that through the same discriminator module as before to get the loss for that. 01:30:34.780 |
So how fake do the fake ones look? Remember when you do everything manually, you have 01:30:39.900 |
to zero the gradients in your loop, and if you've forgotten about that, go back to the 01:30:45.780 |
Part 1 lesson where we do everything from scratch. So now finally, the total discriminator 01:30:52.820 |
loss is equal to the real loss minus the fake loss. And so you can see that here. They don't 01:31:03.140 |
talk about the loss, they actually just talk about what are the gradient updates. So this 01:31:08.140 |
here is the symbol for get the gradients. So inside here is the loss. And try to learn 01:31:16.860 |
to throw away in your head all of the boring stuff. So when you see sum over m divided 01:31:22.780 |
by m, that means take the average. So just throw that away and replace it with np.mean 01:31:27.820 |
in your head. There's another np.mean. So you want to get quick at being able to see these 01:31:33.180 |
common idioms. So anytime you see 1 over m, sum over m, you go, okay, np.mean. So we're 01:31:40.060 |
taking the mean of, and we're taking the mean of, so that's all fine. 01:31:45.580 |
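For example:

    import numpy as np

    # "1/m times the sum over m of f(x_i)" really is just a mean over the mini-batch
    fx = np.array([3.0, 5.0, 7.0])          # f applied to each of m = 3 samples
    assert fx.sum() / len(fx) == np.mean(fx)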
x_i, what's x_i? It looks like it's x to the power of i, but it's not. The math notation 01:31:52.100 |
is very overloaded. They showed us here what x_i is, and it's a set of m samples from a 01:32:01.420 |
batch of the real data. So in other words, this is a mini-batch. So when you see something 01:32:06.940 |
saying sample, it means just grab a row, grab a row, grab a row, and you can see here grab 01:32:12.180 |
at m times, and we'll call the first row x, parenthesis 1, the second row x, parenthesis 01:32:17.980 |
2. One of the annoying things about math notation is the way that we index into arrays is everybody 01:32:28.500 |
uses different approaches, subscripts, superscripts, things in brackets, combinations, commas, square 01:32:33.700 |
brackets, whatever. So you've just got to look in the paper and be like, okay, at some point 01:32:39.300 |
they're going to say take the i-th row from this matrix or the i-th image in this batch, 01:32:45.220 |
how are they going to do it? In this case, it's a superscript in parenthesis. 01:32:51.020 |
So that's all sample means, and curly brackets means it's just a set of them. This little 01:32:56.420 |
squiggle followed by something here means according to some probability distribution. 01:33:04.580 |
And so in this case, and very very often in papers, it simply means, hey, you've got a 01:33:09.420 |
bunch of data, grab a bit from it at random - the probability distribution of 01:33:09.420 |
the data you have is just the data you have. So this says grab m things at random from your 01:33:17.380 |
prior samples, and so that means in other words call create_noise to create m random vectors. 01:33:42.620 |
So now we've got m real images. Each one gets put through our discriminator. We've got m 01:33:54.380 |
bits of noise. Each one gets put through our generator to create m generated images. Each 01:34:03.500 |
one of those gets put through, look, f(w), that's the same thing, so each one of those 01:34:07.460 |
gets put through our discriminator to try and figure out whether they're fake or not. 01:34:11.820 |
And so then it's this, minus this, and the mean of that, and then finally get the gradient 01:34:18.060 |
of that in order to figure out how to use rmsprop to update our weights using some learning 01:34:27.180 |
So in PyTorch, we don't have to worry about getting the gradients. We can just specify 01:34:34.660 |
the loss bit, and then just say loss.backward, discriminator optimizer.step. 01:34:42.140 |
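Putting the critic step together, a sketch using the hypothetical names from the earlier snippets (netD, netG, create_noise, optimizerD; bs and real_batch are assumed to come from the data loader):

    netD.zero_grad()                        # remember to zero the gradients manually
    real_loss = netD(real_batch).mean()     # score for a mini-batch of real bedrooms
    fake = netG(create_noise(bs))           # a mini-batch of fakes from random noise
    fake_loss = netD(fake.detach()).mean()  # detach: don't backprop into the generator here
    lossD = real_loss - fake_loss           # the real loss minus the fake loss, as above
    lossD.backward()
    optimizerD.step()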
Now there's one key step, which is that we have to keep all of our weights, which are 01:34:56.660 |
the parameters in a PyTorch module, in this small range between -0.01 and 0.01. Why? Because 01:35:07.540 |
the mathematical assumptions that make this algorithm work only apply in like a small 01:35:15.620 |
ball. I think it's kind of interesting to understand the math of why that's the case, 01:35:23.220 |
but it's very specific to this one paper, and understanding it won't help you understand 01:35:28.420 |
any other paper. So only study it if you're interested. I think it's nicely explained, 01:35:34.140 |
I think it's fun, but it won't be information that you'll reuse elsewhere unless you get 01:35:42.380 |
I'll also mention, after the paper came out, an improved Wasserstein GAN paper came out that 01:35:47.900 |
said there are better ways to ensure that your weight space is in this tight ball, which 01:35:54.380 |
was basically to penalize gradients that are too high. So nowadays there are slightly different 01:36:02.060 |
ways to do this. Anyway, that's why this line of code there is kind of the key contribution. 01:36:08.720 |
This one line of code actually is the one line of code you add to make it a Wasserstein 01:36:13.100 |
GAN. But the work was all in knowing that that's the thing you can do that makes everything work. 01:36:20.700 |
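That one line, as a sketch (netD being the critic):

    # Clamp every weight of the critic into a small ball around zero
    for p in netD.parameters():
        p.data.clamp_(-0.01, 0.01)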
At the end of this, we've got a discriminator that can recognize the difference between real bedrooms and 01:36:25.580 |
totally random crappy generated images. So let's now try and create some better images. 01:36:33.160 |
So now set trainable on the discriminator to false, set trainable on the generator to true, and zero out the generator's gradients. 01:36:42.740 |
And now our loss again is fw, that's the discriminator of the generator applied to some more random 01:36:57.380 |
noise. So here's our random noise, here's our generator, and here's our discriminator. 01:37:09.260 |
I think I can remove that now because I think I've put it inside the discriminator but I 01:37:16.300 |
won't change it now because it's going to confuse me. So it's exactly the same as before 01:37:22.540 |
where we did generator on the noise and then pass that to discriminator, but this time 01:37:28.140 |
the thing that's trainable is the generator, not the discriminator. So in other words, 01:37:32.580 |
in this pseudocode, the thing they update is theta, which is the generator's parameters 01:37:40.900 |
rather than w, which is the discriminator's parameters. 01:37:45.860 |
And so hopefully you'll see now that this w down here is telling you these are the parameters 01:37:52.740 |
of the discriminator, this theta down here is telling you these are the parameters of 01:38:03.420 |
And again, it's not a universal mathematical notation, it's a thing they're doing in this 01:38:10.740 |
particular paper, but it's kind of nice when you see some suffix like that, try to think 01:38:21.780 |
So we take some noise, generate some images, try and figure out if they're fake or real, 01:38:28.500 |
and use that to get gradients with respect to the generator, as opposed to earlier we 01:38:35.420 |
got them with respect to the discriminator - and use that to update the generator's weights with the learning rate. 01:38:47.980 |
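A sketch of that generator step, again with the hypothetical names from before:

    netG.zero_grad()
    lossG = netD(netG(create_noise(bs))).mean()  # how fake do my fakes look to the critic?
    lossG.backward()                             # gradients flow back into the generator
    optimizerG.step()                            # only theta, the generator's parameters, get stepped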
You'll see that it's kind of unfair that the discriminator is getting trained n_critic 01:38:55.700 |
times, which they set to 5, for every time that we train the generator once. 01:39:04.700 |
And the paper talks a bit about this, but the basic idea is there's no point making 01:39:09.500 |
the generator better if the discriminator doesn't know how to discriminate yet. 01:39:18.460 |
And here's that 5, and actually something which was added in the later paper is the 01:39:28.880 |
idea that from time to time - and a bunch of times at the start - you should do more steps of training the discriminator. 01:39:40.060 |
So make sure that the discriminator is pretty capable from time to time. 01:39:46.420 |
So do a bunch of epochs of training the discriminator a bunch of times to get better at telling 01:39:51.900 |
the difference between real and fake, and then do one step with making the generator 01:39:56.860 |
being better at generating, and that is an epoch. 01:40:01.940 |
And so let's train that for one epoch, and then let's create some noise so we can generate 01:40:17.020 |
Let's first of all decrease the learning rate by 10 and do one more pass. 01:40:20.900 |
So we've now done two epochs, and now let's use our noise to pass it to our generator, 01:40:30.540 |
and then put it through our denormalization to turn it back into something we can see, and plot the results. 01:40:43.620 |
It's not real bedrooms, and some of them don't look particularly like bedrooms, but some 01:40:53.620 |
And I think the best way to think about a GAN is it's like an underlying technology 01:41:01.020 |
that you'll probably never use like this, but you'll use in lots of interesting ways. 01:41:08.820 |
For example, we're going to use it now to create a CycleGAN, and we're going to use that to turn horses into zebras. 01:41:28.040 |
You could also use it to turn Monet prints into photos, or to turn photos of Yosemite 01:41:39.860 |
One: is there any reason for using RMSprop specifically as the optimizer, as opposed to something like Adam? 01:41:46.820 |
I don't remember it being explicitly discussed in the paper; I don't know if it's just what worked experimentally. 01:41:54.220 |
Have a look in the paper and see what it says, I don't recall. 01:41:57.900 |
And what could be a reasonable way of detecting overfitting while training, or evaluating 01:42:02.420 |
the performance of one of these GAN models once we're done training? 01:42:06.060 |
In other words, how does the notion of training validation test sets translate to GANs? 01:42:18.740 |
And there's a lot of people who make jokes about how GANs is the one field where you 01:42:24.420 |
don't need a test set, and people take advantage of that by making stuff up and saying it looks great. 01:42:33.060 |
There are some pretty famous problems with GANs. 01:42:36.420 |
One of the famous problems with GANs is called mode collapse. 01:42:40.060 |
And mode collapse happens where you look at your bedrooms and it turns out that there's 01:42:44.900 |
basically only three kinds of bedrooms that every possible noise vector maps to, and 01:42:50.660 |
you look at your gallery and it turns out they're all just the same thing, or there's only a handful of variations. 01:42:56.980 |
Mode collapse is easy to see if you collapse down to a small number of modes, like three 01:43:02.700 |
But what if you have a mode collapse down to 10,000 modes, so there's only 10,000 possible 01:43:08.300 |
bedrooms that all of your noise vectors collapse to? 01:43:12.700 |
You wouldn't be able to see it here, because it's pretty unlikely you would have two identical 01:43:18.460 |
Or what if every one of these bedrooms is basically a direct copy of one of the training images? 01:43:31.680 |
And the truth is most papers don't do a good job, or sometimes any job, of checking those things. 01:43:42.440 |
So the question of how do we evaluate GANs, and even the point of maybe we should actually 01:43:50.100 |
evaluate GANs properly is something that is not widely enough understood even now. 01:44:02.460 |
So Ian Goodfellow, who a lot of you will know because he came and spoke here at a lot of 01:44:09.540 |
the book club meetings last year, and of course was the first author on the most famous deep learning book. 01:44:16.540 |
He's the inventor of GANs, and he's been sending a continuous stream of tweets reminding people 01:44:24.900 |
about the importance of testing GANs properly. 01:44:30.500 |
So if you see a paper that claims exceptional GAN results, then this is definitely something to be skeptical about. 01:44:47.100 |
So this is going to be really straightforward because it's just a neural net. 01:44:51.260 |
So all we're going to do is we're going to create an input containing lots of zebra photos, 01:44:58.780 |
and with each one we'll pair it with an equivalent horse photo, and we'll just train a neural net to go from one to the other. 01:45:08.300 |
Or you can do the same thing for every Monet painting: create a dataset containing the photo of the place he painted. 01:45:15.020 |
Oh wait, that's not possible, because the places that Monet painted aren't there anymore as he painted them. 01:45:23.700 |
And oh wait, how the hell is this going to work? 01:45:27.920 |
This seems to break everything we know about what neural nets can do and how they do them. 01:45:32.980 |
Alright Rachel, you're going to ask me a question and just spoil our whole train of thought. 01:45:49.660 |
There are some papers that try to do semi-supervised learning with GANs. 01:45:53.620 |
I haven't found any that are particularly compelling, showing state-of-the-art results on really 01:46:00.340 |
interesting datasets that have been widely studied. 01:46:08.100 |
The reason I'm a little skeptical is because in my experience, if you train a model with 01:46:12.660 |
synthetic data, the neural net will become fantastically good at recognizing the specific 01:46:20.020 |
problems of your synthetic data, and that will end up being what it learns from. 01:46:26.820 |
And there are lots of other ways of doing semi-supervised models which do work well. 01:46:34.260 |
For example, you might remember Otavio Good who created that fantastic visualization in 01:46:39.540 |
Part 1 of the Zooming ConvNet where he kind of showed a letter going through MNIST. 01:46:45.740 |
He, at least at that time, was the number one guy in autonomous remote-controlled car racing. 01:46:58.420 |
And he trained his model using synthetically augmented data where he basically took real 01:47:06.040 |
videos of a car driving around a circuit and added fake people and fake other cars and 01:47:15.020 |
And I think that worked well because he's kind of a genius and because I think he had 01:47:22.660 |
a well-defined subset that he had to work in. 01:47:32.820 |
But in general it's really hard to use synthetic data. 01:47:36.620 |
I've tried using synthetic data in models for decades now, obviously not GANs because 01:47:42.220 |
they're pretty new, but in general it's very hard to do. 01:47:53.380 |
So somehow these folks at Berkeley created a model that can turn a horse into a zebra 01:48:03.700 |
despite not having any paired photos - unless they went out there and painted horses and took 01:48:08.820 |
before-and-after shots, which I don't believe they did. 01:48:24.300 |
I will say the person I know who's doing the most interesting work with CycleGANs right now is Helena Sarin. 01:48:35.780 |
She's the only artist I know of who is a CycleGAN artist. 01:48:42.700 |
She created this little doodle in the top left, and then trained a CycleGAN to turn 01:48:49.140 |
it into this beautiful painting in the bottom right. 01:48:58.860 |
I think it's really interesting, I mentioned at the start of this class that GANs are in 01:49:04.500 |
the category of stuff that's not there yet, but it's nearly there, and in this case there's 01:49:12.140 |
at least one person in the world now who's creating beautiful and extraordinary artworks 01:49:16.840 |
using GANs - a lot of them specifically CycleGANs - and there's actually at least 01:49:22.660 |
maybe a dozen people I know of who are just doing interesting creative work with neural 01:49:27.420 |
nets more generally, and the field of creative AI is going to expand dramatically. 01:49:33.300 |
I think it's interesting with Helena, I don't know her personally, but from what I understand 01:49:38.700 |
of her background, she's a software developer, it's her full-time job, and an artist as her 01:49:45.420 |
hobby, and she's kind of started combining these two by saying, "Gosh, I wonder what this would look like." 01:49:53.660 |
And so if you follow her Twitter account, we'll make sure we add it on the wiki. 01:50:03.020 |
She basically posts a new work almost every day, and they're always pretty amazing. 01:50:11.820 |
So here's the basic trick, and this is from the CycleGAN paper. 01:50:18.620 |
We're going to have two images, assuming we're doing this with images, but the key thing 01:50:27.140 |
is they're not paired images. We don't have a data set of horses and the equivalent zebras. 01:50:34.620 |
We've got a bunch of horses, a bunch of zebras. 01:50:41.780 |
We've now got an X, let's say X is horse, and Y is zebra. 01:50:47.060 |
We're going to train a generator, and what they call here a mapping function, that turns 01:50:52.940 |
horse into zebra, we'll call that mapping function G, and we'll create one mapping function, 01:50:58.900 |
generator, that turns a zebra into a horse, and we'll call that F. 01:51:04.140 |
We'll create a discriminator, just like we did before, which is going to get as good 01:51:09.660 |
as possible at recognizing real from fake horses, so that'll be DX, and then another 01:51:16.500 |
discriminator which is going to be as good as possible and recognizing real from fake 01:51:24.740 |
So that's kind of our starting point, but then the key thing to making this work, we're 01:51:31.700 |
kind of generating a loss function here, right? Here's one bit of the loss function, here's 01:51:34.620 |
the second bit of the loss function. We're going to create something called cycle consistency 01:51:39.060 |
loss, which says: after you turn your horse into a zebra with your G generator, and check 01:51:47.900 |
whether or not we can recognize that it's real - I keep forgetting which one's horse and which 01:51:54.420 |
one's zebra, so I apologize if I get my X's and Y's backwards - 01:51:57.700 |
I turn my horse into a zebra, and then I'm going to try and turn that zebra back into 01:52:03.080 |
the same horse that I started with. So then I'm going to have another function that's 01:52:09.240 |
going to check whether this horse, which I've generated knowing nothing about X, generated 01:52:19.500 |
entirely from this zebra, is similar to the original horse or not. 01:52:24.980 |
So the idea would be if your generated zebra doesn't look anything like your original horse, 01:52:32.420 |
you've got no chance of turning it back into the original horse. So a loss, which compares 01:52:38.340 |
X hat to X, is going to be really bad unless you can go into Y and back out again. And 01:52:46.140 |
you're probably only going to be able to do that if you're able to create a zebra that 01:52:50.540 |
looks like the original horse so that you know what the original horse looked like. 01:52:55.400 |
And vice versa. Take your zebra, turn it into a fake horse, and then try and turn it back 01:53:04.880 |
into the original zebra and check that it looks like the original. So notice here, this 01:53:10.100 |
F is our zebra to horse. This G is our horse to zebra. So the G and the F are kind of doing 01:53:19.260 |
two things. They're both turning the original horse into the zebra and then turning the 01:53:25.740 |
zebra back into the original horse. So notice that there's only two generators. There isn't 01:53:31.780 |
a separate generator for the reverse mapping. You have to use the same generator that was 01:53:36.980 |
used for the original mapping. So this is the cycle consistency loss. And I just think 01:53:42.420 |
this is genius. The idea that this is a thing that could be even possible, honestly when 01:53:51.620 |
this came out, it just never occurred to me as a thing that I could even try and solve. 01:53:56.540 |
It seems so obviously impossible. And then the idea that you can solve it like this, 01:54:05.940 |
So it's good to look at the equations in this paper because they're written pretty simply. 01:54:16.660 |
It's not like some of the stuff in the Wasserstein-Gahn paper which is like lots of theoretical proofs 01:54:23.140 |
and whatever else. In this case, they're just equations that just lay out what's going on. 01:54:28.500 |
And you really want to get to a point where you can read them and understand them. So let's 01:54:35.040 |
So we've got a horse and a zebra. So for some mapping function G, which is our horse to 01:54:48.500 |
zebra mapping function, then there's a GAN loss, which is the bit we're already familiar 01:54:54.580 |
with. It says I've got a horse, a zebra, a fake zebra recognizer, and a horse to zebra 01:55:03.060 |
generator. And the loss is what we saw before. It's our ability to draw one zebra out of 01:55:14.220 |
our zebras and recognize whether it's real or fake. And then take a horse and turn it 01:55:29.700 |
into a zebra and recognize whether that's real or fake. And then you can then do one 01:55:37.300 |
minus the other. In this case, they've got a log in there. The log's not terribly important. 01:55:43.060 |
So this is the thing we just saw. So that's why we did the Wasserstein GAN first - this is that same loss. 01:55:54.780 |
All of this sounds awfully like translating in one language to another than back to the 01:55:58.580 |
original. Have GANs or any equivalent been tried in translation? 01:56:04.460 |
Not that I know of. There's this unsupervised machine translation which does kind of do 01:56:24.020 |
something like this, but I haven't looked at it closely enough to know if it's nearly 01:56:35.020 |
So to kind of back up to what I do know, normally with translation you require this kind of 01:56:40.340 |
paired input. You require parallel texts. This is the French translation of this English 01:56:47.820 |
I do know there's been a couple of recent papers that show the ability to create good 01:56:53.060 |
quality translation models without paired data. I haven't implemented them and I don't 01:57:00.860 |
understand them fully, but they may well be doing the same 01:57:04.820 |
basic idea. We'll look at it during the week and get back to you. 01:57:15.980 |
So we've got our GAN loss. The next piece is the cycle consistency loss. So the basic 01:57:22.180 |
idea here is that we start with our horse, use our zebra generator on that to create 01:57:28.980 |
a zebra, use our horse generator on that to create a horse, and then compare that to the 01:57:34.580 |
original horse. And this double lines with a 1, we've seen this before, this is the L1 01:57:41.500 |
loss. So this is the sum of the absolute value of differences. 01:57:45.580 |
Or else if this was a 2, it would be the L2 loss, or the 2-norm, which would be the sum of the squared differences. 01:57:57.480 |
And again, we now know this squiggle idea, which is from our horses, grab a horse. So 01:58:07.540 |
this is what we mean by sample from a distribution. There's all kinds of distributions, but most 01:58:12.380 |
commonly in these papers we're using an empirical distribution. In other words, we've got some data, and we sample from it. 01:58:20.060 |
So when you see this thing, squiggle, other thing, this thing here, when it says pdata, 01:58:27.340 |
that means grab something from the data, and we're going to call that thing x. So from 01:58:33.700 |
our horse's pictures, grab a horse, turn it into a zebra, turn it back into a horse, compare 01:58:39.740 |
it to the original, and sum up the absolute values. Do that for horse to zebra, do it for 01:58:44.760 |
zebra to horse as well, add the two together, and that is our cycle consistency loss. 01:58:56.420 |
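Written out in the paper's notation, that cycle consistency loss is:

    \mathcal{L}_{\mathrm{cyc}}(G,F) =
        \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
      + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big]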
So now we get our loss function, and the whole loss function depends on our horse generator, 01:59:03.460 |
our zebra generator, our horse recognizer, our zebra recognizer discriminator, and we're 01:59:09.340 |
going to add up the GAN loss for recognizing horses, the GAN loss for recognizing zebras, 01:59:17.980 |
and the cycle consistency loss for our two generators. 01:59:23.580 |
And then we've got a lambda here, which hopefully we're kind of used to this idea now, that 01:59:27.620 |
is when you've got two different kinds of loss, you chuck in a parameter there, you 01:59:32.580 |
can multiply them by so they're about the same scale. We did a similar thing with our 01:59:38.100 |
bounding box loss compared to our classifier loss when we did that localization stuff. 01:59:49.140 |
So then we're going to try to, for this loss function, maximize the capability of the discriminators 01:59:56.100 |
that are discriminating whilst minimizing that for the generators. So the generators 02:00:04.700 |
and the discriminators are going to be facing off against each other. 02:00:08.880 |
So when you see this min-max thing in papers, you'll see it a lot. It basically means this 02:00:15.500 |
idea that in your training loop, one thing is trying to make something better, the other 02:00:20.180 |
is trying to make something worse, and there's lots of ways to do it, but most commonly you 02:00:24.700 |
will alternate between the two. And you'll often see this just referred to in math papers 02:00:29.820 |
as min-max. So when you see min-max, you should immediately think, okay, adversarial training. 02:00:42.320 |
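So the full objective being described here, in the paper's notation, is the lambda-weighted sum of the two GAN losses and the cycle consistency loss, with the generators minimizing what the discriminators maximize:

    \mathcal{L}(G, F, D_X, D_Y) =
        \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y)
      + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X)
      + \lambda\, \mathcal{L}_{\mathrm{cyc}}(G, F)

    G^*, F^* = \arg\min_{G,F}\ \max_{D_X, D_Y}\ \mathcal{L}(G, F, D_X, D_Y)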
So let's look at the code. We probably won't be able to finish this today, but we're going 02:00:49.020 |
to do something almost unheard of, which is I started looking at somebody else's code, 02:00:56.660 |
and I was not so disgusted that I threw the whole thing away and did it myself. I actually 02:01:01.060 |
said I quite like this. I like it enough I'm going to show it to my students. 02:01:07.140 |
So this is where the code comes from. So this is one of the people that created the original 02:01:17.340 |
code for CycleGANs, and they've created a PyTorch version. I had to clean it up a little 02:01:29.660 |
bit, but it's actually pretty damn good. I think the first time I found code that I didn't 02:01:37.740 |
feel the need to rewrite from scratch before I showed it to you. 02:01:44.340 |
And so the cool thing about this is one of the reasons I like doing it this way, like 02:01:49.140 |
finally finding something that's not awful, is that you're now going to get to see almost 02:01:55.940 |
all the bits of fast.ai, or all the relevant bits of fast.ai, written in a different way 02:02:00.980 |
than somebody else. And so you're going to get to see how they do data sets, and data 02:02:06.140 |
loaders, and models, and training loops, and so forth. So you'll find there's a cgan directory, 02:02:17.860 |
which is basically nearly this, with some cleanups which I hope to submit as a PR sometime. 02:02:26.740 |
It was written in a way that unfortunately made it a bit over-connected to how they were 02:02:30.260 |
using it as a script. I cleaned it up a little bit so I could use it as a module, but other 02:02:35.100 |
than that it's pretty similar. So cgan is basically their code copied from their GitHub 02:02:44.540 |
repo with some minor changes. So the way the cgan mini-library has been set up is that 02:02:54.220 |
the configuration options they're assuming are being passed in to a script. So they've 02:02:59.580 |
got this train options parser method, and so you can see I'm basically passing in an 02:03:06.780 |
array of script options: where's my data? How many threads do I want? Dropout? How many 02:03:14.780 |
iterations? What am I going to call this model? Which GPU do I want to run it on? So that 02:03:22.460 |
might just be an opt object, which you can then inspect to see what it contains. You'll see it contains 02:03:32.580 |
some things I didn't mention, and that's because it's got defaults for everything else that 02:03:39.540 |
So rather than using fast.ai stuff, we're going to use cgan stuff. So the first thing 02:03:46.820 |
we're going to need is a data loader. And so this is also a great opportunity for you again 02:03:52.620 |
to practice your ability to navigate through code with your editor or IDE of choice. So 02:04:01.060 |
we're going to start with create data loader. So you should be able to use find-symbol or, 02:04:06.940 |
in vim, a tag, to jump straight to create data loader, and we can see that's creating a custom 02:04:14.260 |
dataset loader, and then we can see custom dataset loader is a base data loader. So basically 02:04:26.180 |
we can see that it's going to use a standard PyTorch data loader. So that's good. And so 02:04:32.540 |
we know if you're going to use a standard PyTorch data loader, you have to pass it a 02:04:36.540 |
dataset. And we know that a dataset is something that contains a length and an indexer. So 02:04:43.900 |
presumably when we look at create dataset, it's going to do that. Here is create dataset. 02:04:49.980 |
So this library actually does more than just CycleGAN. It handles both aligned and unaligned 02:04:55.660 |
image pairs. We know that our image pairs are unaligned. So we've got an unaligned dataset. 02:05:02.300 |
Okay, here it is. And as expected, it has a getItem and a length. Good. And so obviously 02:05:13.260 |
the length is just whatever. So A and B are our horses and zebras - we've got two sets. 02:05:21.820 |
So whichever one is longer is the length of the data loader. And so getItem is just going 02:05:26.620 |
to go ahead and randomly grab something from each of our two horses and zebras, open them 02:05:36.620 |
up with Pillow or PIL, run them through some transformations, and then we could either 02:05:43.260 |
be turning horses into zebras or zebras into horses, so there's some direction, and then 02:05:48.260 |
it will just go ahead and return our horse and our zebra and our path to the horse and 02:05:53.980 |
the path to zebra. So hopefully you can kind of see that this is looking pretty similar 02:05:59.980 |
to the kind of stuff that FastAI does. FastAI obviously does quite a lot more when it comes 02:06:06.860 |
to transforms and performance and stuff like this. But remember, this is like research 02:06:12.380 |
code for this one thing. It's pretty cool that they did all this work. 02:06:17.900 |
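The core idea of that unaligned dataset, as a sketch (names and transforms are illustrative, not the repo's exact code):

    import random
    from PIL import Image
    from torch.utils.data import Dataset

    class UnalignedDataset(Dataset):
        # Two unpaired folders of images: A (say, horses) and B (say, zebras)
        def __init__(self, paths_A, paths_B, transform):
            self.paths_A, self.paths_B, self.transform = paths_A, paths_B, transform
        def __len__(self):
            return max(len(self.paths_A), len(self.paths_B))  # whichever set is longer
        def __getitem__(self, i):
            path_A = self.paths_A[i % len(self.paths_A)]
            path_B = random.choice(self.paths_B)              # unaligned: grab a B at random
            A = self.transform(Image.open(path_A).convert('RGB'))
            B = self.transform(Image.open(path_B).convert('RGB'))
            return {'A': A, 'B': B, 'A_paths': path_A, 'B_paths': path_B}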
So we've got a data loader, so we can go and load our data into it, and so that will tell 02:06:24.860 |
us how many minibatches are in it. That's the length of the data loader in PyTorch. 02:06:31.140 |
Next step, now we've got a data loader, is to create a model. So you can go tag for create_model. 02:06:43.660 |
There it is. Same idea, we've got different kinds of models, so we're going to be doing 02:06:48.220 |
a CycleGAN. So here's our CycleGAN model. So there's quite a lot of stuff in a CycleGAN 02:06:55.180 |
model, so let's go through and find out what's going to be used. But basically at this stage, 02:07:04.540 |
we've just called initializer. So when we initialize it, you can see it's going to go through and 02:07:11.540 |
it's going to define two generators, which is not surprising, a generator for our horses 02:07:17.480 |
and a generator for our zebras. There's some way for it to generate a pool of fake data. 02:07:31.940 |
And then here we're going to grab our GAN loss, and as we talked about, our cycle consistency 02:07:39.140 |
loss is an L1 loss. That's interesting, they're going to use ADAM. So obviously for CycleGANs, 02:07:48.780 |
they found ADAM works pretty well. And so then we're going to have an optimizer for 02:07:54.020 |
our horse discriminator, an optimizer for our zebra discriminator, and an optimizer for 02:08:01.500 |
our generator. The optimizer for the generator is going to contain the parameters both for 02:08:09.880 |
the horse generator and the zebra generator all in one place. So the initializer is going 02:08:17.380 |
to set up all of the different networks and loss functions we need, and they're all going 02:08:21.100 |
to be stored inside this model. And so then it prints out and shows us exactly the PyTorch 02:08:30.940 |
models we have. And so it's interesting to see that they're using ResNets. And so you 02:08:36.100 |
can see the ResNets look pretty familiar. We've got conv, batch norm, ReLU; conv, batch norm. 02:08:45.980 |
So instance_norm is just the same as batch_norm, basically, but it applies it to one image 02:08:52.240 |
at a time. The difference isn't particularly important. And you can see they're doing reflection 02:09:00.940 |
padding just like we are. You can kind of see when you try to build everything from 02:09:08.620 |
scratch like this, it is a lot of work. And you can kind of get the nice little things 02:09:16.740 |
that fast.ai does automatically for you. You kind of have to do all of them by hand and 02:09:23.140 |
only end up with a subset of them. So over time, hopefully soon, we'll get all of this 02:09:29.300 |
GAN stuff into fast.ai and it'll be nice and easy. 02:09:34.140 |
So we've got our model, and remember the model contains the loss functions, it contains the 02:09:38.820 |
generators, it contains the discriminators, all in one convenient place. So I've gone 02:09:43.260 |
ahead and kind of copied and pasted and slightly refactored the training loop from the code 02:09:53.860 |
So this is a lot pretty familiar, right? It's a loop to go through each epoch, and a loop 02:09:58.980 |
to go through the data. Before we did this, we set up our data - this is actually not a PyTorch 02:10:08.940 |
dataset; I think it's what they use, slightly confusingly, to refer to their combined thing - 02:10:15.660 |
what we would call a model data object, I guess, or the data that they need. We'll go through 02:10:21.060 |
that with TQDM to get a progress bar, and so now we can go through and see what happens at each step. 02:10:28.740 |
So set input. So it's kind of a different approach to what we do in fast.ai. It's kind 02:10:42.540 |
of neat, it's quite specific to CycleGANs, but basically internally inside this model 02:10:48.140 |
is this idea that we're going to go into our data and grab -- we're either going horse 02:10:55.880 |
to zebra or zebra to horse, depending on which way we go. A is either the horse or the zebra, 02:11:02.140 |
and vice versa, and if necessary, put it on the appropriate GPU and then grab the appropriate paths. 02:11:11.500 |
So the model now has a mini-batch of horses and a mini-batch of zebras, and so now we can optimize the parameters. 02:11:31.300 |
So it's kind of nice to see it like this. You can see each step. First of all, try to optimize 02:11:42.000 |
the generators, then try to optimize the horse discriminator, then try to optimize the zebra 02:11:49.460 |
zero_grad is part of PyTorch; step is part of PyTorch. So the interesting bit is the actual 02:11:57.820 |
thing which does the backpropagation on the generator. 02:12:07.900 |
And let's jump to the key pieces. There's all the bits, all the formulas that we basically 02:12:12.300 |
just saw from the paper. So let's take a horse and generate a zebra. So we've now got a fake 02:12:25.580 |
zebra. And let's now use the discriminator to see if we can tell whether it's fake or 02:12:30.140 |
not. And then let's pop that into our loss function, which we set up earlier to see if 02:12:44.060 |
we can basically get a loss function based on that prediction. 02:12:51.420 |
Then let's do the same thing to do the GAN loss. So go in the opposite direction, and 02:12:57.960 |
then we need to use the opposite discriminator, and then put that through the loss function 02:13:05.100 |
And then let's do the cycle-consistency loss. So again, we take our fake, which we created 02:13:12.700 |
up here, and try and turn it back again into the original. And then let's use that cycle-consistency 02:13:23.820 |
loss function we created earlier to compare it to the real original. 02:13:29.040 |
And here's that lambda. So there's some weight that we used, and that was set up, actually. 02:13:36.420 |
We just used the default that they suggested in their options. And then do the same for 02:13:40.740 |
the opposite direction, and then add them all together. Do the backward step, and that's 02:13:49.940 |
it. So we can then do the same thing for the first discriminator. 02:13:57.980 |
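To recap the generator step just walked through, here's a sketch (G_A: horse to zebra, G_B: zebra to horse; the names follow the spirit of the repo but aren't its exact code):

    fake_B = G_A(real_A)                              # horse -> fake zebra
    loss_G_A = gan_loss(D_B(fake_B), True)            # want D_B to say "real"
    fake_A = G_B(real_B)                              # zebra -> fake horse
    loss_G_B = gan_loss(D_A(fake_A), True)            # want D_A to say "real"
    rec_A = G_B(fake_B)                               # horse -> zebra -> horse
    loss_cycle_A = l1_loss(rec_A, real_A) * lambda_A  # cycle consistency, weighted
    rec_B = G_A(fake_A)                               # zebra -> horse -> zebra
    loss_cycle_B = l1_loss(rec_B, real_B) * lambda_B
    loss_G = loss_G_A + loss_G_B + loss_cycle_A + loss_cycle_B
    loss_G.backward()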
And since basically all the work's been done now, there's much less to do here. So I won't 02:14:10.540 |
step all through it, but it's basically the same basic stuff that we've already seen. 02:14:17.300 |
So optimize_parameters is basically calculating the losses and doing the optimizer step; from 02:14:25.460 |
time to time, save and print out some results. And then from time to time, update the learning 02:14:33.260 |
rate, so they've got some learning rate annealing built in here as well. 02:14:37.980 |
It isn't very exciting, but we can take a look at it. 02:14:49.540 |
So they've basically got, a bit like fast.ai, this idea of schedulers, which 02:14:54.700 |
you can then use to update your learning rates. So I think for those of you who are interested 02:15:01.380 |
in better understanding deep learning APIs, or interested in contributing more to fast 02:15:08.900 |
AI, or interested in creating your own version of some of this stuff in some different backend, 02:15:15.940 |
it's cool to look at a second kind of API that covers some subset of some of the similar 02:15:21.620 |
things to get a sense of how are they solving some of these problems, and what are the similarities 02:15:30.420 |
So we train that for a little while, and then we can just grab a few examples, and here 02:15:39.240 |
we have them. So here are our horses, here they are as zebras, and here they are back 02:15:45.820 |
as horses again. Here's a zebra, into a horse, back into a zebra - it's kind of thrown away 02:15:51.180 |
its head for some reason, but not so much that it couldn't get it back again. This is a really 02:15:57.140 |
interesting one, like this is obviously not what zebras look like, but it's going to be 02:16:00.620 |
a zebra version of that horse. It's also interesting to see its failure situations, I guess it 02:16:05.940 |
doesn't very often see basically just an eyeball, it has no idea how to do that one. So some 02:16:13.020 |
of them don't work very well, this one's done a pretty good job. This one's interesting, 02:16:18.260 |
it's done a good job of that one and that one, but for some reason the one in the middle 02:16:21.140 |
didn't get a go. This one's a really weird shape, but it's done a reasonable job of it. 02:16:27.980 |
This one looks good; this one's pretty sloppy, but apart from the front it's not bad. So 02:16:37.060 |
it took me like 24 hours to train it even that far, so it's kind of slow. And I know 02:16:45.020 |
Helena is constantly complaining on Twitter about how long these things take, I don't 02:16:49.620 |
know how she's so productive with them. So I will mention one more thing that just came 02:16:56.980 |
out yesterday, which is that there's now multimodal image-to-image translation for unpaired data, and 02:17:05.220 |
so you can basically now create different cats, for instance, from this dog. So this 02:17:13.500 |
is basically not just creating one example of the output that you want, but creating 02:17:18.980 |
a multimodal one. So here's a house cat to big cat, and here's a big cat to house cat, 02:17:26.060 |
this is the paper. So this came out like yesterday or the day before, I think. I think it's pretty 02:17:31.980 |
amazing - here's a cat, and a dog. So you can kind of see how this technology is developing, and 02:17:37.940 |
I think there's so many opportunities to maybe do this with music, or speech, or writing, 02:17:45.740 |
or to create tools for artists, or whatever. Alright, thanks everybody, and see you next week!