
Lesson 12: Deep Learning Part 2 2018 - Generative Adversarial Networks (GANs)


Chapters

0:00 Introduction
1:05 Christine Payne
7:16 Darknet
8:55 Basic Skills
11:03 Architecture
11:50 Basic Architecture
14:38 Res Blocks
16:46 Number of Channels
18:23 Inplace Operations
21:10 Padding
22:13 One by One Conv
26:00 Wide Residual Networks
29:44 Self-Normalising
31:28 Group Layers
37:25 Adaptive Average Pooling
46:53 Strides
48:26 GANs
54:51 Generating Pictures

Whisper Transcript

00:00:00.000 | So, we're going to be talking about GANs today.
00:00:05.760 | Who has heard of GANs?
00:00:07.920 | Yeah, most of you.
00:00:11.520 | Very hot technology, but definitely deserving to be in the cutting-edge deep learning part
00:00:20.640 | of the course, because they're not quite proven to be necessarily useful for anything, but
00:00:28.320 | they're nearly there.
00:00:29.680 | They're definitely going to get there, and we're going to focus on the things where they're
00:00:35.360 | definitely going to be useful in practice.
00:00:38.400 | There are a number of areas where they may turn out to be useful in practice, but we
00:00:41.600 | don't know yet.
00:00:42.880 | So I think the area where they're going to be useful in practice is the kind of thing you
00:00:48.400 | see on the left here, which is, for example, turning drawings into rendered pictures.
00:00:54.960 | This comes from a paper that just came out two days ago.
00:00:59.480 | So there's very active research going on right now.
00:01:05.560 | Before we get there, though, let's talk about some interesting stuff from the last class.
00:01:12.400 | This is an interesting thing that one of our diversity fellows, Christine Payne, did.
00:01:17.800 | Christine has a master's in medicine from Stanford, and so she obviously had an interest
00:01:24.000 | in thinking what would it look like if we built a language model of medicine.
00:01:30.800 | One of the things that we briefly touched on back in lesson 4 but didn't really talk
00:01:35.560 | much about last time is this idea that you can actually seed a generative language model,
00:01:41.720 | which basically means you've trained a language model on some corpus, and then you're going
00:01:46.040 | to generate some text from that language model.
00:01:49.520 | And so you can start off by feeding it a few words to basically say here's the first few
00:01:55.360 | words to create the hidden state in the language model, and then generate from there, please.
00:02:01.000 | And so Christine did something clever, which was to seed it with
00:02:06.920 | a question and then repeat the question three times, and then let
00:02:12.920 | it generate from there.
00:02:15.760 | And so she fed a language model lots of different medical texts, and then fed it this question,
00:02:21.360 | what is the prevalence of malaria, and the model said in the US about 10% of the population
00:02:27.640 | has the virus, but only about 1% is infected with the virus, about 50 to 80 million are
00:02:32.440 | infected.
00:02:33.440 | She said what's the treatment for ectopic pregnancy, and it said it's a safe and safe
00:02:38.240 | treatment for women with a history of symptoms that may have a significant impact on clinical
00:02:41.720 | response, most important factor is development of management of ectopic pregnancy, etc.
00:02:46.720 | And so what I find interesting about this is it's pretty close to being a -- to me as
00:02:56.000 | somebody who doesn't have a master's in medicine from Stanford, pretty close to being a believable
00:03:00.600 | answer to the question, but it really has no bearing on reality whatsoever, and I kind
00:03:06.920 | of think it's an interesting kind of ethical and user experience quandary.
00:03:13.560 | So actually, I'm involved also in a company called Doc.ai that's trying to basically -- or
00:03:20.480 | doing a number of things, but in the end provide an app for doctors and patients which can
00:03:25.600 | help create a conversational user interface around helping them with their medical issues.
00:03:31.520 | And I've been continually saying to the software engineers on that team, please don't try to
00:03:37.640 | create a generative model using like an LSTM or something because they're going to be really
00:03:43.520 | good at creating bad advice that sounds impressive, kind of like political pundits or tenured professors,
00:03:55.320 | people who can say bullshit with great authority.
00:04:03.400 | So I thought it was a really interesting experiment, and great to see what our diversity fellows
00:04:12.000 | are doing.
00:04:13.000 | I mean, this is why we have this program.
00:04:15.680 | I suppose I shouldn't just say master's in medicine; she's also a Juilliard-trained classical
00:04:21.080 | musician, a Princeton valedictorian in physics, and a high-performance computing
00:04:27.400 | expert.
00:04:28.400 | Yeah, okay, so she does a bit of everything.
00:04:31.120 | So yeah, really impressive group of people and great to see such exciting kind of ideas
00:04:36.440 | coming out.
00:04:37.440 | And if you're wondering, you know, I've done some interesting experiments, should I let
00:04:44.160 | people know about it?
00:04:46.800 | Well, Christine mentioned this on the forum, I went on to mention it on Twitter, to which
00:04:52.920 | I got this response, you're looking for a job, you may be wondering who Xavier Amatriain is,
00:04:59.080 | well he is the founder of a hot new medical AI startup, he was previously the head of
00:05:05.400 | engineering at Quora, before that he was the guy at Netflix who ran the data science team
00:05:10.960 | and built their recommender systems, so this is what happens if you do something cool,
00:05:17.160 | let people know about it and get noticed by awesome people like Xavier.
00:05:26.520 | So let's talk about CIFAR-10.
00:05:32.520 | And the reason I'm going to talk about CIFAR-10 is that we're going to be looking at some
00:05:39.560 | more bare-bones PyTorch stuff today to build these generative adversarial models, there's
00:05:47.040 | no really fast AI support to speak of at all for GANs at the moment, I'm sure there will
00:05:53.320 | be soon enough, but currently there isn't, so we're going to be building a lot of models
00:05:56.200 | from scratch.
00:05:57.200 | It's been a while since we've done serious model building, a little bit of model building
00:06:03.240 | I guess for our bounding box stuff, but really all the interesting stuff there was the loss
00:06:09.840 | function.
00:06:10.840 | So we looked at CIFAR-10 in part 1 of the course and we built something which was
00:06:15.080 | getting about 85% accuracy and I can't remember, a couple of hours to train.
00:06:21.800 | Interestingly there's a competition going on now to see who can actually train CIFAR-10
00:06:25.880 | the fastest, through the Stanford DAWNBench benchmark, and currently the goal is to get
00:06:31.720 | it to train to 94% accuracy.
00:06:34.160 | So it'd be interesting to see if we can build an architecture that can get to 94% accuracy
00:06:39.760 | because that's a lot better than our previous attempt and so hopefully in doing so we'll
00:06:44.600 | learn something about creating good architectures.
00:06:47.720 | That will be then useful for looking at these GANs today, but I think also it's useful because
00:06:56.460 | I've been looking much more deeply into the last few years' papers about different kinds
00:07:02.840 | of CNN architectures and realized that a lot of the insights in those papers are not being
00:07:08.060 | widely leveraged and clearly not widely understood.
00:07:11.560 | So I want to show you what happens if we can leverage some of that understanding.
00:07:17.200 | So I've got this notebook called CIFAR-10 Darknet.
00:07:23.200 | That's because the architecture we're going to look at is really very close to the darknet
00:07:28.600 | architecture.
00:07:29.600 | But you'll see in the process that the darknet architecture here is not the whole YOLO version
00:07:34.920 | 3 end-to-end thing, but just the part of it that they pre-trained on ImageNet to do classification.
00:07:41.520 | It's almost like the most generic simple architecture almost you could come up with.
00:07:48.160 | And so it's a really great starting point for experiments.
00:07:52.360 | So we're going to call it darknet, but it's not quite darknet and you can fiddle around
00:07:56.240 | with it to create things that definitely aren't darknet.
00:07:58.840 | It's really just the basis of nearly any modern ResNet-based architecture.
00:08:06.480 | So CIFAR-10, remember, is a fairly small dataset.
00:08:11.080 | The images are only 32x32 in size.
00:08:14.400 | And I think it's a really great dataset to work with because you can train it relatively
00:08:21.360 | quickly, unlike ImageNet.
00:08:23.280 | It's a relatively small amount of data, unlike ImageNet, and it's actually quite hard to recognize
00:08:28.040 | the images because 32x32 is kind of too small to easily see what's going on.
00:08:32.840 | So it's somewhat challenging.
00:08:35.440 | So I think it's a really underappreciated dataset because it's old, and who at DeepMind
00:08:42.400 | or OpenAI wants to work with a small old dataset when they could use their entire server room
00:08:48.740 | to process something much bigger.
00:08:50.440 | But to me, I think this is a really great dataset to focus on.
00:08:56.640 | So we'll go ahead and import our usual stuff, and we're going to try and build a network
00:09:04.160 | from scratch to train this with.
00:09:08.760 | One thing that I think is a really good exercise for anybody who's not 100% confident with
00:09:14.120 | their kind of broadcasting and PyTorch and so forth basic skills is figure out how I
00:09:21.400 | came up with these numbers.
00:09:24.240 | So these numbers are the averages for each channel and the standard deviations for each
00:09:28.880 | channel in CIFAR-10.
00:09:30.680 | So try and as a bit of a homework, just make sure you can recreate those numbers and see
00:09:35.480 | if you can do it in no more than a couple of lines of code, you know, no loops.
00:09:42.960 | Ideally you want to do it in one go if you can.
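As a check on that exercise, here is one way it might look; this is a sketch that loads the data through torchvision rather than the fast.ai objects used in the lesson, and the `.data` attribute is a torchvision detail, not something from the notebook.

```python
import torchvision

# A rough sketch (not the notebook's code): recent torchvision versions expose
# the raw CIFAR-10 training images as a uint8 array of shape (50000, 32, 32, 3).
train = torchvision.datasets.CIFAR10(root='data', train=True, download=True)
imgs = train.data / 255.0

# One broadcasted call per statistic, no loops: reduce over images and both
# spatial dimensions, leaving one number per channel.
mean = imgs.mean(axis=(0, 1, 2))   # roughly [0.49, 0.48, 0.45]
std  = imgs.std(axis=(0, 1, 2))    # roughly [0.25, 0.24, 0.26]
print(mean, std)
```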
00:09:49.160 | Because these are fairly small, we can use a larger batch size than usual, 256, and the
00:09:54.160 | size of these images is 32.
00:09:58.940 | Transformations - normally we have this standard set of side-on transformations we use for
00:10:05.440 | photos of normal objects.
00:10:07.640 | We're not going to use that here though because these images are so small that trying to rotate
00:10:11.520 | a 32x32 image a bit is going to introduce a lot of blocking kind of distortions.
00:10:19.000 | So the kind of standard transforms that people tend to use is a random horizontal flip, and
00:10:25.040 | then we add size divided by 8, so 4 pixels of padding on each side.
00:10:32.120 | And one thing which I find works really well is by default FastAI doesn't add black padding,
00:10:37.080 | which basically every other library does.
00:10:39.080 | We actually take the last 4 pixels of the existing photo and flip it and reflect it,
00:10:44.680 | and we find that we get much better results by using this reflection padding by default.
00:10:51.400 | So now that we've got a 36x36 image, this set of transforms in training will randomly
00:10:58.160 | pick a 32x32 crop.
00:11:00.240 | So we get a little bit of variation, but not heaps.
00:11:03.320 | Alright, so we can use our normal from paths to grab our data.
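The actual calls for this are fast.ai ones in the notebook; purely as a rough, library-independent sketch of the same pipeline (random flip, 4 pixels of reflection padding, a random 32x32 crop, and normalisation by the channel stats), a torchvision version might look like this:

```python
import torch
from torchvision import datasets, transforms

stats = ((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))  # CIFAR-10 channel stats

train_tfms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4, padding_mode='reflect'),  # reflection, not black, padding
    transforms.ToTensor(),
    transforms.Normalize(*stats),
])

train_ds = datasets.CIFAR10('data', train=True, download=True, transform=train_tfms)
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=4)
```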
00:11:07.400 | So we now need an architecture.
00:11:10.360 | And what we're going to do is create an architecture which fits in one screen.
00:11:18.080 | So this is from scratch, as you can see, I'm using the predefined Conv2d, BatchNorm2d,
00:11:27.280 | LeakyReLU modules, but I'm not using any blocks or anything, they're all being defined.
00:11:36.040 | So the entire thing is here on one screen.
00:11:37.840 | So if you're ever wondering can I understand a modern good quality architecture, absolutely.
00:11:45.560 | Let's study this one.
00:11:48.480 | So my basic starting point with an architecture is to say it's a stacked bunch of layers.
00:11:57.960 | And generally speaking there's going to be some kind of hierarchy of layers.
00:12:00.680 | So at the very bottom level there's things like a convolutional layer and a batch norm
00:12:04.280 | layer.
00:12:05.280 | So if you're thinking anytime you have a convolution, you're probably going to have some standard
00:12:10.800 | sequence and normally it's going to be conv, batch norm, then a nonlinear activation.
00:12:17.720 | So I try to start right from the top by saying, okay, what are my basic units going to be?
00:12:25.600 | And so by defining it here, that way I don't have to worry about trying to keep everything
00:12:34.520 | consistent and it's going to make everything a lot simpler.
00:12:37.040 | So here's my conv layer, and so anytime I say conv layer, I mean conv, batch norm, relu.
00:12:43.880 | Now I'm not quite saying relu, I'm saying leaky relu, and I think we've briefly mentioned
00:12:53.280 | it before, but the basic idea is that normally a relu looks like that.
00:13:06.800 | Hopefully you all know that now.
00:13:10.740 | A leaky relu looks like that.
00:13:18.740 | So this part, as before, has a gradient of 1, and this part has a gradient of, it can
00:13:23.400 | vary, but something around 0.1 or 0.01 is common.
00:13:28.640 | And the idea behind it is that when you're in this negative zone here, you don't end
00:13:34.380 | up with a 0 gradient, which makes it very hard to update it.
00:13:39.080 | In practice, people have found leaky relu more useful on smaller datasets and less useful
00:13:45.480 | on big datasets, but it's interesting that for the YOLO version 3 paper, they did use
00:13:49.720 | a leaky relu and got great performance from it.
00:13:53.080 | So it rarely makes things worse, and it often makes things better.
00:13:57.320 | So it's probably not bad if you need to create your own architecture to make that your default
00:14:02.400 | go-to is to use leaky relu.
00:14:07.560 | You'll notice I don't define a PyTorch module here, I just go ahead and go sequential.
00:14:13.640 | This is something that if you read other people's PyTorch code, it's really underutilized.
00:14:19.480 | People tend to write everything as a PyTorch module with an init and a forward.
00:14:24.040 | But if the thing you want is just a sequence of things one after the other, it's much more
00:14:29.240 | concise and easy to understand to just make it a sequential.
00:14:32.640 | So I've just got a simple plain function that just returns a sequential model.
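A minimal sketch of that helper, reconstructed from the description rather than copied from the notebook:

```python
import torch.nn as nn

def conv_layer(ni, nf, ks=3, stride=1):
    # conv -> batch norm -> leaky ReLU as a single sequential unit.
    # bias=False because the batch norm that follows has its own additive term,
    # and padding=ks//2 keeps the grid size unchanged for stride 1.
    return nn.Sequential(
        nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2, bias=False),
        nn.BatchNorm2d(nf),
        nn.LeakyReLU(negative_slope=0.1, inplace=True),
    )
```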
00:14:38.700 | So I mentioned that there's generally a number of hierarchies of units in most modern networks.
00:14:49.960 | And I think we know now that the next level in this unit hierarchy for ResNets is the
00:15:01.640 | res block or the residual block, I call it here a res layer.
00:15:10.520 | And back when we last did CIFAR-10, I oversimplified this, I cheated a little bit.
00:15:18.560 | We had x coming in, and we put that through a conv, and then we added it back up to x
00:15:27.400 | to go out.
00:15:30.720 | So in general, we've got your output is equal to your input plus some function of your input.
00:15:45.080 | And the thing we did last year was we made f a 2D conv.
00:15:52.240 | And actually, in the real res block, there are two of them.
00:16:05.040 | So it's actually conv of conv of x.
00:16:14.480 | And when I say conv, I'm using this as a shortcut for our conv layer.
00:16:24.800 | So you can see here, I've created two convs, and here it is.
00:16:28.680 | I take my x, put it through the first conv, put it through the second conv, and add it
00:16:33.100 | back up to my input again to get my basic res block.
00:16:40.000 | So, one interesting insight here is what are the number of channels in these convolutions?
00:16:59.960 | So we've got coming in some number of input filters.
00:17:09.040 | The way that the darknet folks set things up is they said we're going to make every one
00:17:13.320 | of these res layers spit out the same number of channels that came in.
00:17:18.840 | And I kind of like that, that's why I used it here, because it makes life simpler.
00:17:23.120 | And so what they did is they said let's have the first conv halve the number of channels,
00:17:29.120 | and then the second conv double it again.
00:17:31.480 | So ni goes to ni/2, and then ni/2 goes back to ni.
00:17:36.180 | So you've kind of got this funneling thing where if you've got like 64 channels coming
00:17:42.720 | in, it kind of gets squished down with a first conv down to 32 channels, and then taken back
00:17:49.280 | up again to 64 channels coming out.
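Using the conv_layer helper sketched above, the res layer being described might look roughly like this (again a reconstruction, not the notebook's exact code):

```python
import torch.nn as nn

class ResLayer(nn.Module):
    # Squeeze the channels with a 1x1 conv, bring them back up with a 3x3 conv,
    # then add the result back onto the input.
    def __init__(self, ni):
        super().__init__()
        self.conv1 = conv_layer(ni, ni // 2, ks=1)
        self.conv2 = conv_layer(ni // 2, ni, ks=3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))
```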
00:17:53.040 | Yes, Rachel?
00:17:56.000 | Why is inplace equals true in the leaky ReLU?
00:17:59.000 | Oh, thanks for asking.
00:18:01.560 | A lot of people forget this or don't know about it.
00:18:05.640 | But this is a really important memory technique.
00:18:11.120 | If you think about it, this conv layer is like the lowest level thing, so pretty much
00:18:15.360 | everything in our resnet once it's all put together is going to be conv layers, conv layers,
00:18:21.760 | conv layers.
00:18:23.680 | If you don't have inplace equals true, it's going to create a whole separate piece of
00:18:29.800 | memory for the output of the ReLU.
00:18:34.720 | So like it's going to allocate a whole bunch of memory that's totally unnecessary.
00:18:39.280 | And actually, since I wrote this, I came up with another idea the other day, which I'll
00:18:45.800 | now implement, which is you can do the same thing for the res layer, rather than going
00:18:50.280 | -- let's just reorder this to say x plus that -- you can actually do the same thing here.
00:19:01.360 | Hopefully some of you might remember that in PyTorch, pretty much every function has
00:19:06.400 | an underscore suffix version which says do that inplace.
00:19:11.160 | So as well as plus, there's also add, and add with an underscore is the in-place version.
00:19:18.680 | And so that's now suddenly reduced my memory there as well.
00:19:24.640 | So these are really handy little tricks.
00:19:26.920 | And I actually forgot the inplace equals true at first for this, and I literally was having
00:19:31.000 | to decrease my batch size to much lower amounts than I knew should be possible, and it was
00:19:34.720 | driving me crazy, and then I realized that that was missing.
00:19:39.560 | You can also do that with dropout, by the way, if you have dropout.
00:19:42.880 | So dropout and all the activation functions you can do inplace, and then generally any
00:19:50.280 | arithmetic operation you can do inplace as well.
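Here is the difference at the tensor level, as a tiny sketch; note that autograd will raise an error if an in-place op overwrites a value that some earlier operation saved for its backward pass, so in a real res layer this is something to test rather than assume:

```python
import torch

x = torch.randn(4, 64, 32, 32)
y = torch.randn(4, 64, 32, 32)

out = x + y   # allocates a brand new tensor for the result
x.add_(y)     # the underscore version writes the result into x's existing storage

# In the res layer forward, the equivalent change would be roughly
#   return self.conv2(self.conv1(x)).add_(x)
# provided autograd doesn't need the overwritten activation for the backward pass.
```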
00:19:54.360 | Why is bias usually in ResNet set to false in the conv layer?
00:20:02.400 | If you're watching the video, pause now and see if you can figure this out, because this
00:20:08.080 | is a really interesting question.
00:20:10.280 | Why don't we need bias?
00:20:12.240 | So I'll wait for you to pause.
00:20:15.060 | Welcome back.
00:20:16.060 | So if you've figured it out, here's the thing, immediately after the conv is a batch norm.
00:20:24.000 | And remember batch norm has two learnable parameters for each activation, the kind of
00:20:31.280 | the thing you multiply by and the thing you add.
00:20:34.800 | So if we had bias here to add, and then we add another thing here, we're adding two things
00:20:41.120 | which is totally pointless, like that's two weights where one would do.
00:20:44.440 | So if you have a batch norm after a conv, then you can either say in the batch norm,
00:20:51.240 | don't include the add bit there please, or easier is just to say don't include the bias
00:20:55.920 | in the conv.
00:21:00.200 | There's no particular harm, but again, it's going to take more memory because that's more
00:21:04.860 | gradients that it has to keep track of.
00:21:08.920 | So best to avoid.
00:21:12.400 | Also another thing, a little trick, is most people's conv layers have padding as a parameter,
00:21:19.320 | but generally speaking you should be able to calculate the padding easily enough.
00:21:23.600 | And I see people try to implement special same padding modules and all kinds of stuff
00:21:29.760 | like that.
00:21:30.760 | But if you've got a stride 1, and you've got a kernel size of 3, then obviously that's
00:21:44.000 | going to overlap by one unit on each side, so we want padding of 1.
00:21:50.680 | Whereas if it's a kernel size of 1, then we don't need any padding.
00:21:54.480 | So in general, padding of kernel size integer divided by 2 is what you need.
00:22:00.640 | There's some tweaks sometimes, but in this case this works perfectly well.
00:22:05.400 | So again, trying to simplify my code by having the computer calculate stuff for me rather
00:22:10.760 | than me having to do it myself.
00:22:15.120 | Another thing here with the two conv layers, we have this idea of a bottleneck, this idea
00:22:21.240 | of reducing the channels and then increasing them again, is also what kernel size we use.
00:22:26.320 | So here is a 1x1 conv, and this is again something you might want to pause the video now and
00:22:32.000 | think about what's a 1x1 conv really?
00:22:36.060 | What actually happens in a 1x1 conv?
00:22:42.920 | So if we've got a little 4x4 grid here, and of course there's a filter or channels axis
00:22:55.200 | as well, maybe that's like 32, and we're going to do a 1x1 conv.
00:23:01.160 | So what's the kernel for a 1x1 conv going to look like?
00:23:06.640 | It's going to be 1 by 1.
00:23:14.760 | So remember when we talk about the kernel size, we never mention that last piece, but
00:23:20.360 | let's say it's 1x1 by 32 because that's part of the filters in and filters out.
00:23:24.840 | So in other words then, what happens is this one thing gets placed first of all here on
00:23:31.480 | the first cell, and we basically get a dot product of that 32 deep bit with this 32 bit
00:23:39.320 | deep bit, and that's going to give us our first output.
00:23:47.920 | And then we're going to take that 32 bit bit and put it with the second one to get the
00:23:51.040 | second output.
00:23:52.040 | So it's basically going to be a bunch of little dot products for each point in the grid.
00:24:01.640 | So what it basically is then is basically something which is allowing us to kind of
00:24:13.960 | change the dimensionality in whatever way we want in the channel dimension.
00:24:23.300 | And so that would be one of our filters.
00:24:28.760 | And so in this case we're creating ni divided by 2 of these, so we're going to have ni divided
00:24:35.920 | by 2 of these dot products, all with different weighted averages of the input channels.
00:24:42.240 | So it basically lets us, with very little computation, add this additional step of calculations
00:24:52.640 | and non-linearities.
00:24:55.720 | So that's a cool trick, this idea of taking advantage of these 1x1 convs, creating this
00:25:01.200 | bottleneck and then pulling it out again with 3x3 convs.
00:25:05.280 | So that's actually going to take advantage of the 2D nature of the input properly.
00:25:12.160 | The 1x1 conv doesn't take advantage of that at all.
00:25:17.760 | So these two lines of code, there's not much in it, but it's a really great test of your
00:25:24.760 | understanding and kind of your intuition about what's going on.
00:25:29.160 | Why is it that a 1x1 conv going from ni to ni over 2 channels, followed by a 3x3 conv
00:25:36.440 | going from ni over 2 to ni channels? Why does it work? Why do the tensor ranks line up? Why
00:25:42.800 | do the dimensions all line up nicely? Why is it a good idea? What's it really doing? It's
00:25:49.280 | a really good thing to fiddle around with, maybe create some small ones in Jupyter Notebook,
00:25:55.280 | run them yourself, see what inputs and outputs come in and out. Really get a feel for that.
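Here is one way that notebook fiddling might look, as a small sketch: a 1x1 conv from 32 channels to 16 is just 16 dot products over the channel dimension at every grid cell, and the grid size never changes.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 4, 4)                    # batch of 1, 32 channels, 4x4 grid
conv1x1 = nn.Conv2d(32, 16, kernel_size=1, bias=False)

print(conv1x1.weight.shape)                     # torch.Size([16, 32, 1, 1])
print(conv1x1(x).shape)                         # torch.Size([1, 16, 4, 4]) - grid unchanged

# The same thing computed by hand: 16 weighted sums over the 32 input channels,
# applied independently at each of the 4x4 grid cells.
w = conv1x1.weight.view(16, 32)
manual = torch.einsum('oi,bihw->bohw', w, x)
print(torch.allclose(manual, conv1x1(x), atol=1e-5))  # True
```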
00:26:01.360 | Once you've done so, you can then play around with different things. One of the really unappreciated
00:26:13.480 | papers is this one, Wide Residual Networks. It's really quite a simple paper, but what
00:26:25.200 | they do is they basically fiddle around with these two lines of code. And what they do
00:26:33.360 | is they say, well what if this wasn't divided by 2, but what if it was times 2? That would
00:26:40.600 | be totally allowable. That's going to line up nicely. Or what if we had another kernel size 3 conv
00:26:52.120 | after this, so this one was actually ni over 2 to ni over 2, and then this one is ni over 2 to ni.
00:27:01.760 | Again that's going to work, right? Kernel size 1, 3, 1, going to halve the number of kernels,
00:27:07.680 | leave it at half and then double it again at the end. And so they come up with this
00:27:11.360 | kind of simple notation for basically defining what this can look like. And then they show
00:27:19.680 | lots of experiments. And basically what they show is that this approach of bottlenecking,
00:27:30.000 | of decreasing the number of channels, which is almost universal in resnets, is probably
00:27:35.480 | not a good idea. In fact from the experiment, it's definitely not a good idea. Because what
00:27:39.640 | happens is it lets you create really deep networks. The guys who created resnets got
00:27:45.360 | particularly famous creating a 1,001-layer network. But the thing about 1,001 layers is
00:27:51.480 | you can't calculate layer 2 until you finish layer 1. You can't calculate layer 3 until
00:27:56.600 | you finish layer 2. So it's sequential. GPUs don't like sequential. So what they showed
00:28:03.280 | is that if you have less layers, but with more calculations per layer, and so one easy
00:28:10.920 | way to do that would be to remove the /2. No other changes. Like try this at home. Try
00:28:17.960 | running CIFAR-10 and see what happens. Or maybe even multiply it by 2 or fiddle around. And
00:28:24.760 | that basically lets your GPU do more work. And it's very interesting because the vast
00:28:29.320 | majority of papers that talk about performance of different architectures never actually
00:28:34.760 | time how long it takes to run a batch through it. They literally say this one requires x
00:28:42.960 | number of floating-point operations per batch, but then they never actually bother to run
00:28:48.160 | the damn thing like a proper experimentalist and find out whether it's faster or slower.
00:28:53.000 | And so a lot of the architectures that are really famous now turn out to be slow as molasses
00:28:59.440 | and take craploads of memory and just totally useless because the researchers never actually
00:29:06.200 | bother to see whether they're fast and to actually see whether they fit in RAM with
00:29:09.880 | normal batch sizes. So the wide resnet paper is unusual in that it actually times how long
00:29:17.280 | it takes, as does the YOLO version 3 paper, which made the same insight. I'm not sure
00:29:22.560 | they might have missed the wide resnets paper because the YOLO version 3 paper came to a
00:29:27.000 | lot of the same conclusions, but I'm not even sure they cited the wide resnets paper, so
00:29:32.120 | they might not be aware that all that work's been done. But they're both great to see people
00:29:38.640 | actually timing things and noticing what actually makes sense.
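As a sketch of the "times 2 instead of divided by 2" experiment mentioned above (the class name is mine), the only change from the earlier res layer is the direction of the channel change:

```python
import torch.nn as nn

class WideResLayer(nn.Module):
    # Widen inside the block instead of bottlenecking: ni -> ni*2 -> ni.
    def __init__(self, ni):
        super().__init__()
        self.conv1 = conv_layer(ni, ni * 2, ks=1)
        self.conv2 = conv_layer(ni * 2, ni, ks=3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))
```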
00:29:43.280 | Yes, Rich?
00:29:45.720 | SELU looked really hot in the paper which came out, but I noticed that you don't use
00:29:49.240 | it. What's your opinion on SELU?
00:29:52.600 | So SELU is something largely for fully connected layers which allows you to get rid of batch
00:29:59.840 | norm, and the basic idea is that if you use this different activation function, it's kind
00:30:06.040 | of self-normalizing. That's what the S in SELU stands for. So self-normalizing means that
00:30:12.440 | it will always remain at a unit standard deviation and zero mean, and therefore you don't need
00:30:16.760 | that batch norm. It hasn't really gone anywhere, and the reason it hasn't really gone anywhere
00:30:23.040 | is because it's incredibly finicky. You have to use a very specific initialization, otherwise
00:30:28.680 | it doesn't start with exactly the right standard deviation and mean. It's very hard to use it
00:30:35.680 | with things like embeddings. If you do, then you have to use a particular kind of embedding
00:30:40.320 | initialization which doesn't necessarily actually make sense for embeddings.
00:30:46.280 | You work very hard to get it right, and if you do finally get it right, what's
00:30:52.040 | the point? You've managed to get rid of some batch norm layers which weren't really
00:30:56.160 | hurting you anyway. It's interesting because that paper, that SELU paper, I think one
00:31:01.360 | of the reasons people noticed it, or in my experience the main reason people noticed
00:31:05.120 | it was because it was created by the inventor of LSTMs, and also it had a huge mathematical
00:31:10.800 | appendix and people were like "Lots of maths from a famous guy, this must be great!" But
00:31:17.480 | in practice I don't see anybody using it to get any state-of-the-art results or win any
00:31:23.880 | competitions or anything like that.
00:31:30.240 | This is some of the tiniest bits of code we've seen, but there's so much here and it's fascinating
00:31:34.080 | to play with. Now we've got this block which is built on this block, and then we're going
00:31:40.280 | to create another block on top of that block. We're going to call this a group layer, and
00:31:47.680 | it's going to contain a bunch of res layers. A group layer is going to have some number
00:31:55.680 | of channels or filters coming in, and what we're going to do is we're going to double
00:32:04.360 | the number of channels coming in by just using a standard conv layer. Optionally, we'll halve
00:32:11.960 | the grid size by using a stride of 2, and then we're going to do a whole bunch of res blocks,
00:32:20.600 | a whole bunch of res layers. We can pick how many. That could be 2 or 3 or 8. Because remember,
00:32:26.560 | these res layers don't change the grid size and they don't change the number of channels.
00:32:31.840 | You can add as many as you like, anywhere you like, without causing any problems. It's
00:32:37.200 | just going to use more computation and more RAM, but there's no reason other than that
00:32:42.600 | you can't add as many as you like. A group layer, therefore, is going to end up doubling
00:32:49.720 | the number of channels because it's this initial convolution which doubles the number of channels.
00:32:58.560 | And depending on what we pass in a stride, it may also halve the grid size if we put
00:33:03.560 | stride=2. And then we can do a whole bunch of res block computations as many as we like.
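A sketch of that group layer as a plain function returning a list of modules, using the helpers sketched earlier (in the notebook this logic sits on the Darknet class itself):

```python
def make_group_layer(ni, num_blocks, stride=1):
    # One conv that doubles the channels (and halves the grid if stride=2),
    # followed by num_blocks res layers that change neither size nor channels.
    return [conv_layer(ni, ni * 2, stride=stride)] + \
           [ResLayer(ni * 2) for _ in range(num_blocks)]
```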
00:33:13.960 | So then to define our dark net, or whatever we want to call this thing, we're just going
00:33:21.240 | to pass in something that looks like this. And what this says is, create 5 group layers.
00:33:30.400 | The first one will contain 1 of these extra res layers. The second will contain 2, then
00:33:36.400 | 4, then 6, then 3. And I want you to start with 32 filters.
00:33:47.560 | So the first one of these res layers will contain 32 filters, and there will just be
00:33:57.100 | one extra res layer. The second one is going to double the number of filters because that's
00:34:03.280 | what we do. Each time we have a new group layer, we double the number. So the second
00:34:06.600 | one will have 64, then 128, then 256, then 512, and then that will be it.
00:34:14.660 | So nearly all of the network is going to be those bunches of layers. And remember, every
00:34:20.840 | one of those group layers also has one convolution of the start. And so then all we have is before
00:34:29.360 | that all happens, we're going to have one convolutional layer at the very start, and
00:34:35.640 | at the very end we're going to do our standard adaptive average pooling, flatten, and a linear
00:34:41.360 | layer to create the number of classes out at the end.
00:34:45.060 | So one convolution at the end, adaptive pooling, and one linear layer at the other end, and
00:34:51.480 | then in the middle, these group layers, each one consisting of a convolutional layer followed
00:34:57.680 | by n number of res layers. And that's it. Again, I think we've mentioned this a few
00:35:04.840 | times, but I'm yet to see any code out there, any examples, anything anywhere that uses
00:35:14.200 | adaptive average pooling. Everyone I've seen writes it like this, and then puts a particular
00:35:21.240 | number here, which means that it's now tied to a particular image size, which definitely
00:35:26.280 | isn't what you want. So most people, even the top researchers I speak to, most of them
00:35:31.640 | are still under the impression that a specific architecture is tied to a specific size, and
00:35:38.900 | that's a huge problem when people think that because it really limits their ability to
00:35:44.520 | use smaller sizes to kind of kickstart their modeling or to use smaller sizes for doing
00:35:48.840 | experiments and stuff like that.
00:35:51.840 | Again, you'll notice I'm using sequential here, but a nice way to create architectures
00:35:58.240 | is to start out by creating a list. In this case, this is a list with just one conv layer
00:36:02.040 | in, and then my function here, make_group_layer, it just returns another list. So then I can
00:36:08.920 | just go plus equals, appending that list to the previous list, and then I can go plus equals
00:36:14.600 | to append this bunch of things to that list, and then finally sequential of all those layers.
00:36:20.360 | So that's a very nice thing. So now my forward is just self.layers.
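Putting the pieces together, here is a sketch of the whole network built that way, using the helpers from the earlier sketches; it is a reconstruction from the description, so details such as the exact stride schedule and final channel count may differ from the notebook:

```python
import torch.nn as nn

class Darknet(nn.Module):
    def __init__(self, num_blocks, num_classes=10, nf=32):
        super().__init__()
        # Build a plain Python list of layers, then wrap it in nn.Sequential.
        layers = [conv_layer(3, nf, ks=3, stride=1)]
        for i, nb in enumerate(num_blocks):
            # first group keeps the 32x32 grid, later groups halve it
            layers += make_group_layer(nf, nb, stride=1 if i == 0 else 2)
            nf *= 2
        layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(nf, num_classes)]
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

m = Darknet([1, 2, 4, 6, 3], num_classes=10, nf=32)
```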
00:36:25.360 | So this is a nice kind of picture of how to make your architectures as simple as possible.
00:36:32.720 | So you can now go ahead and create this, and as I say, you can fiddle around. You could
00:36:38.520 | even parameterize this to make it a number that you pass in here, to pass in different
00:36:43.720 | numbers so it's not 2, maybe it's times 2 instead. You could pass in things that change
00:36:48.920 | the kernel size or change the number of conv layers, fiddle around with it, and maybe you
00:36:53.880 | can create something -- I've actually got a version of this which I'm about to run for
00:36:58.600 | you -- which kind of implements all of the different parameters that's in that wide ResNet
00:37:04.760 | paper, so I could fiddle around to see what worked well.
00:37:09.220 | So once we've got that, we can use ConvLearner.from_model_data to take our PyTorch module
00:37:15.560 | and the ModelData object and turn them into a learner, give it a criterion, add some metrics
00:37:21.600 | if we like, and then we can call fit and away we go.
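In the old fast.ai (0.7) API this lesson uses, those steps look roughly like the following; this is written from memory, so treat the exact imports and argument names as approximate rather than definitive.

```python
from fastai.conv_learner import ConvLearner
from fastai.metrics import accuracy
import torch.nn as nn

# m is the Darknet module sketched above; data is the ModelData object built earlier
learn = ConvLearner.from_model_data(m, data)
learn.crit = nn.CrossEntropyLoss()
learn.metrics = [accuracy]
learn.fit(1e-2, 1, cycle_len=10)   # learning rate, number of cycles, epochs per cycle
```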
00:37:26.480 | Could you please explain adaptive average pooling? How does setting it to 1 work?
00:37:31.560 | Sure. Before I do, since we've only got a certain amount of time in this class, I do
00:37:45.160 | want to see how we go with this simple network against these state-of-the-art results. So
00:37:54.120 | to make life a little easier, we can start it running now and see how it looks later.
00:37:59.240 | So I've got the command ready to go. So we've basically taken all that stuff and put it
00:38:05.160 | into a simple little Python script, and I've modified some of those parameters I mentioned
00:38:10.280 | to create something I've called a WRN22 network, which doesn't officially exist, but it's got
00:38:15.080 | a bunch of changes to the parameters we talked about based on my experiments.
00:38:20.280 | We're going to use the new Leslie Smith one-cycle thing. So there's quite a bunch of cool stuff
00:38:26.560 | here. So the one-cycle implementation was done by our student, Sylvain Gugger, the CIFAR-10
00:38:35.640 | training experiments were largely done by Brett Koonce, and stuff like getting the half-precision
00:38:42.040 | floating-point implementation integrated into fast.ai was done by Andrew Shaw. So it's been
00:38:49.120 | a cool bunch of different student projects coming together to allow us to run this. So
00:38:55.240 | this is going to run actually on an AWS, Amazon AWS P3, which has eight GPUs. The P3 has these
00:39:04.280 | newer Volta architecture GPUs, which actually have special support for half-precision floating
00:39:10.440 | point. Fast.ai is the first library I know of to actually integrate the Volta-optimized
00:39:18.280 | half-precision floating point into the library, so we can just go learn.half now and get that
00:39:23.600 | support automatically. And it's also the first one to integrate one-cycle, so these are the
00:39:30.160 | parameters for the one-cycle. So we can go ahead and get this running. So what this actually
00:39:36.840 | does is it's using PyTorch's multi-GPU support. Since there are eight GPUs, it's actually
00:39:43.800 | going to fire off eight separate Python processes, and each one's going to train on a little
00:39:49.640 | bit, and then at the end it's going to pass the gradient updates back to the master process
00:39:56.760 | that's going to integrate them all together. So you'll see, here they are, lots of progress
00:40:04.600 | bars all pop up together. And you can see it's training at three or four seconds per epoch when you
00:40:12.800 | do it this way. When I was training earlier, I was getting about 30 seconds per epoch. So
00:40:26.240 | doing it this way, we can kind of train things like 10 times faster or so, which is pretty
00:40:32.440 | cool.
00:40:33.440 | Okay, so we'll leave that running. So you were asking about adaptive average pooling,
00:40:38.740 | and I think specifically what's the number 1 doing? So normally when we're doing average
00:40:49.960 | pooling, let's say we've got 4x4. Let's say we did average pooling 2, 2. Then that creates
00:41:06.440 | a 2x2 area and takes the average of those 4, and then we can pass in the stride. So if
00:41:22.200 | we said stride 1, then the next one is we would look at this block of 2x2 and take that
00:41:27.480 | average, and so forth. So that's what a normal 2x2 average pooling would be.
00:41:35.320 | And so in that case, if we didn't have any padding, that would spit out a 3x3, because
00:41:42.640 | it's 2 here, 2 here, 2 here. And if we added padding, we can make it 4x4. So if we wanted
00:41:52.680 | to spit out something, we didn't want 3x3, what if we wanted 1x1? Then we could say average
00:41:59.480 | pool 4, 4. And so that's going to do 4, 4, and average the whole lot. And that would spit
00:42:13.320 | out 1x1.
00:42:17.360 | But that's just one way to do it. Rather than saying the size of the pooling filter, why
00:42:25.800 | don't we instead say, I don't care what the size of the input grid is, I always want 1x1.
00:42:32.880 | So that's where then you say "adaptive average pool", and now you don't say what's the size
00:42:41.320 | of the pooling filter, you instead say what's the size of the output I want. And so I want
00:42:45.920 | something that's 1x1. And if you only put a single int, it assumes you mean 1x1. So in
00:42:52.320 | this case, adaptive average pooling 1 with a 4x4 grid coming in is the same as average
00:43:00.200 | pooling 4, 4. If it was a 7x7 grid coming in, it would be the same as 7, 7.
00:43:06.880 | So it's the same operation, it's just expressing it in a way that says regardless of the input,
00:43:11.500 | I want something of that size to output.
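A tiny sketch you can run to see that equivalence:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 4, 4)

fixed    = nn.AvgPool2d(4)          # "average a 4x4 window"
adaptive = nn.AdaptiveAvgPool2d(1)  # "whatever comes in, give me 1x1 out"

print(fixed(x).shape)                          # torch.Size([1, 512, 1, 1])
print(adaptive(x).shape)                       # torch.Size([1, 512, 1, 1])
print(torch.allclose(fixed(x), adaptive(x)))   # True on a 4x4 input

# On a 7x7 input, AvgPool2d(4) would only average one 4x4 window, whereas the
# adaptive version still averages the whole grid down to 1x1.
y = torch.randn(1, 512, 7, 7)
print(adaptive(y).shape)                       # torch.Size([1, 512, 1, 1])
```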
00:43:26.840 | We got to 94%, and it took 3 minutes and 11 seconds, and the previous state-of-the-art
00:43:34.120 | was 1 hour and 7 minutes. So was it worth fiddling around with those parameters and learning
00:43:40.060 | a little bit about how these architectures actually work and not just using what came
00:43:43.280 | out of the box? Well, holy shit, we just used a publicly available instance. We used a spot
00:43:49.760 | instance so that cost us $8 per hour for 3 minutes. It cost us a few cents to train this
00:44:00.800 | from scratch 20 times faster than anybody's ever done it before.
00:44:06.960 | So that's like the most crazy state-of-the-art result we've ever seen, but this one just
00:44:12.680 | blew it out of the water. This is partly thanks to just fiddling around with those parameters
00:44:21.440 | of the architecture. Mainly, frankly, about using Leslie Smith's one-cycle thing and Sylvain's
00:44:28.360 | implementation of that. Just as a reminder of what that's doing, it's basically
00:44:36.040 | saying this axis is batches, and this is learning rate. It creates an upward path that's equally
00:44:50.360 | long as the downward path, so it's a true CLR, a triangular cyclical learning rate.
00:44:57.520 | As per usual, you can pick the ratio between those two numbers. So x divided by y in this
00:45:06.720 | case is the number that you get to pick. In this case, we picked 50, so we started out
00:45:16.520 | with a much smaller one here. And then it's got this cool idea which is you get to say
00:45:22.360 | what percentage of your epochs then is spent going from the bottom of this down all the
00:45:28.120 | way down pretty much to zero. That's what this second number here is. So 15% of the batches
00:45:34.720 | is spent going from the bottom of our triangle even further.
00:45:42.720 | So importantly though, that's not the only thing one cycle does. We also have momentum,
00:45:50.960 | and momentum goes from 0.95 to 0.85 like this. In other words, when the learning rate is really
00:46:08.040 | low, we use a lot of momentum, and when the learning rate is really high, we use very
00:46:11.720 | little momentum, which makes a lot of sense. But until Leslie Smith showed this in that
00:46:16.160 | paper, I've never seen anybody do it before, so it's a really cool trick.
00:46:23.520 | You can now use that by using the use_clr_beta parameter in fast.ai, and you should be able
00:46:30.840 | to basically replicate this state-of-the-art result. You can use it on your own computer
00:46:36.280 | or your paper space. Obviously the only thing you won't get is the multi-GPU piece, but
00:46:40.920 | that makes it a bit easier to train. So on a single GPU, you should be able to beat this
00:46:50.440 | on a single GPU.
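As a library-independent sketch of the schedule just described (the function and parameter names are mine, not fast.ai's internals): the learning rate ramps up and back down in two equal legs starting from max_lr divided by some factor, the last slice of iterations decays towards zero, and momentum mirrors the learning rate.

```python
import numpy as np

def one_cycle(n_iter, max_lr, div=50, end_pct=0.15, moms=(0.95, 0.85)):
    n_end = int(n_iter * end_pct)          # e.g. the last 15% of batches
    n_leg = (n_iter - n_end) // 2          # equal up and down legs
    lo = max_lr / div

    lrs = np.concatenate([
        np.linspace(lo, max_lr, n_leg),                 # ramp up
        np.linspace(max_lr, lo, n_leg),                 # ramp down
        np.linspace(lo, lo / 100, n_iter - 2 * n_leg),  # final decay towards zero
    ])
    momentum = np.concatenate([
        np.linspace(moms[0], moms[1], n_leg),   # momentum falls as the LR rises
        np.linspace(moms[1], moms[0], n_leg),   # and rises again as the LR falls
        np.full(n_iter - 2 * n_leg, moms[0]),
    ])
    return lrs, momentum

lrs, momentum = one_cycle(n_iter=1000, max_lr=1e-2)
```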
00:46:53.560 | make_group_layer contains stride=2, so this means stride is 1 for layer 1 and 2 for everything
00:47:00.360 | else. What's the logic behind it? Usually the strides I've seen are odd.
00:47:08.920 | Strides are either 1 or 2, I think you're thinking of kernel sizes. So stride=2 means
00:47:14.280 | that I jump 2 across, and so a stride of 2 means that you halve your grid size. I think
00:47:20.000 | you might have got confused between stride and kernel size there. If we have a stride
00:47:27.120 | of 1, the grid size doesn't change. If we have a stride of 2, then it does. In this case,
00:47:35.140 | this is for CIFAR-10. 32x32 is small, and we don't get to halve the grid size very often,
00:47:42.100 | because pretty quickly we're going to run out of cells. That's why the first layer has
00:47:50.960 | a stride of 1, so we don't decrease the grid size straight away, basically. It's kind of
00:47:59.400 | a nice way of doing it, because that's why we have a low number here, so we can start
00:48:05.520 | out with not too much computation on the big grid, and then we can gradually do more and
00:48:12.120 | more computation as the grids get smaller and smaller, because on the smaller grid the computation
00:48:17.800 | will take less time. I think so that we can do all of our scanning
00:48:30.340 | in one go. Let's take a slightly early break and come back at 7.30.
00:48:50.400 | So we're going to talk about generative adversarial networks, also known as GANs, and specifically
00:48:57.280 | we're going to focus on the Wasserstein GAN paper, which included some guy called Soumith
00:49:04.400 | Chintala, who went on to create some piece of software called PyTorch. The Wasserstein
00:49:11.440 | GAN was heavily influenced by the - so I'm just going to call this WGAN to save time
00:49:15.840 | - the DCGAN, or deep convolutional generative adversarial networks paper, which also Soumith
00:49:22.320 | was involved with. It's a really interesting paper to read. A lot of it looks like this.
00:49:39.240 | The good news is you can skip those bits, because there's also a bit that looks like
00:49:45.360 | this which says do these things. Now I will say though that a lot of papers have a theoretical
00:49:55.500 | section which seems to be there entirely to get past the reviewer's need for theory. That's
00:50:03.220 | not true of the WGAN paper. The theory bit is actually really interesting. You don't
00:50:08.060 | need to know it to use it, but if you want to learn about some cool ideas and see the
00:50:14.600 | thinking behind why this particular algorithm, it's absolutely fascinating. Before this paper
00:50:22.720 | came out, literally I knew nobody who had studied the math that it's based on,
00:50:28.760 | so everybody had to learn the math it was based on. The paper does a pretty good job
00:50:33.520 | of laying out all the pieces. You'll have to do a bunch of reading yourself. If you're
00:50:39.000 | interested in digging into the deeper math behind some paper to see what it's like to
00:50:46.240 | study it, I would pick this one. Because at the end of that theory section, you'll come
00:50:51.520 | away saying, okay, I can see now why they made this algorithm the way it is. And then
00:51:01.600 | having come up with that idea, the other thing is often these theoretical sections are very
00:51:05.560 | clearly added after they come up with the algorithm. They'll come up with the algorithm
00:51:09.020 | based on intuition and experiments, and then later on post-hoc justify it. Whereas this
00:51:14.280 | one you can clearly see it's like, okay, let's actually think about what's going on in GANs
00:51:19.720 | and think about what they need to do and then come up with the algorithm.
00:51:24.280 | So the basic idea of a GAN is it's a generative model. So it's something that is going to
00:51:31.880 | create sentences or create images. It's going to generate stuff. And it's going to try and
00:51:42.400 | create stuff which is very hard to tell the difference between generated stuff and real
00:51:49.480 | stuff. So a generative model could be used to face-swap a video, a very well-known controversial
00:51:58.600 | thing of deep fakes and fake pornography and stuff happening at the moment. It could be
00:52:04.360 | used to fake somebody's voice. It could be used to fake the answer to a medical question.
00:52:13.640 | But in that case, it's not really a fake. It could be a generative answer to a medical
00:52:18.080 | question that's actually a good answer. So you're generating language. You could generate
00:52:23.200 | a caption to an image, for example. So generative models have lots of interesting applications.
00:52:35.920 | But generally speaking, they need to be good enough that, for example, if you're using
00:52:41.240 | it to automatically create a new scene for Carrie Fisher in the next Star Wars movies
00:52:48.640 | and she's not around to play that part anymore, you want to try and generate an image of her
00:52:54.580 | that looks the same, then it has to fool the Star Wars audience into thinking that doesn't
00:53:00.400 | look like some weird Carrie Fisher, that looks like the real Carrie Fisher. Or if you're
00:53:05.680 | trying to generate an answer to a medical question, you want to generate English that
00:53:10.680 | reads nicely and clearly and sounds authoritative and meaningful. So the idea of a generative
00:53:19.320 | adversarial network is we're going to create not just a generative model to create, say,
00:53:27.140 | the generated image, but a second model that's going to try to pick which ones are real and
00:53:33.600 | which ones are generated. We're going to call them fake. So which ones are real and which
00:53:38.360 | ones are fake? So we've got a generator that's going to create our fake content and a discriminator
00:53:45.500 | that's going to try to get good at recognizing which ones are real and which ones are fake.
00:53:50.400 | So there's going to be two models. And then there's going to be adversarial, meaning the
00:53:53.960 | generator is going to try to keep getting better at fooling the discriminator into thinking
00:53:59.580 | that fake is real, and the discriminator is going to try to keep getting better at discriminating
00:54:04.320 | between the real and the fake. And they're going to go head-to-head, like that.
00:54:09.640 | And it's basically as easy as I just described. It really is. We're just going to build two
00:54:16.640 | models in PyTorch. We're going to create a training loop that first of all says the loss
00:54:22.440 | function for the discriminator is can you tell the difference between real and fake,
00:54:26.400 | and then update the weights of that. And then we're going to create a loss function for
00:54:29.880 | the generator, which is going to say can you generate something which fools the discriminator,
00:54:34.560 | and update the weights from that loss. And we're going to look through that a few times
00:54:38.960 | and see what happens.
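As a bare-bones sketch of that alternating loop, written in the WGAN style we are about to look at, where the critic outputs a raw score and the loss treats lower as more real; netG, netD, the optimisers and the dataloader are assumed to already exist, and the real notebook code differs in details such as training the critic several times per generator step:

```python
import torch

def train_gan(netG, netD, real_dl, opt_g, opt_d, nz, device, n_epochs=1):
    for epoch in range(n_epochs):
        for real, _ in real_dl:                  # labels are ignored
            real = real.to(device)
            bs = real.size(0)

            # 1) Discriminator step: push real scores down, fake scores up.
            opt_d.zero_grad()
            fake = netG(torch.randn(bs, nz, 1, 1, device=device)).detach()
            loss_d = netD(real).mean() - netD(fake).mean()
            loss_d.backward()
            opt_d.step()
            for p in netD.parameters():          # WGAN weight clipping
                p.data.clamp_(-0.01, 0.01)

            # 2) Generator step: make the discriminator score fakes as "real" (low).
            opt_g.zero_grad()
            loss_g = netD(netG(torch.randn(bs, nz, 1, 1, device=device))).mean()
            loss_g.backward()
            opt_g.step()
```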
00:54:40.840 | And so let's come back to the pseudocode here of the algorithm and let's read the real code
00:54:47.760 | first.
00:54:52.560 | So there's lots of different things you can do with GANs. And we're going to do something
00:54:57.640 | that's kind of boring but easy to understand, and it's kind of cool that it's even possible.
00:55:04.040 | We're just going to generate some pictures from nothing. We're just going to get it to
00:55:09.040 | draw some pictures. And specifically we're going to get it to draw pictures of bedrooms.
00:55:15.440 | You'll find if you hopefully get a chance to play around with this during the week with
00:55:19.640 | your own datasets, if you pick a dataset that's very varied, like ImageNet, and then get a
00:55:26.040 | GAN to try and create ImageNet pictures, it tends not to do so well because it's not really
00:55:32.720 | clear enough what you want a picture of.
00:55:35.640 | So it's better to give it, for example, there's a dataset called CelebA, which is pictures
00:55:40.400 | of celebrity faces. That works great with GANs. You create really clear celebrity faces
00:55:46.280 | that don't actually exist. The bedroom dataset, also a good one. Lots of pictures of the same
00:55:51.560 | kind of thing. So that's just a suggestion.
00:55:55.660 | So there's something called the LSUN scene classification dataset. You can download it using these steps.
00:56:06.600 | It's pretty huge. So I've actually created a Kaggle dataset of a 20% sample. So unless
00:56:13.320 | you're really excited about generating bedroom images, you might prefer to grab the 20% sample.
00:56:20.940 | So then we do the normal steps of creating some different paths. In this case, as we
00:56:27.320 | do before, I find it much easier to go the CSV route when it comes to handling our data.
00:56:34.320 | So I just generate a CSV with the list of files that we want and a fake label that's
00:56:41.040 | zero because we don't really have labels for these at all.
00:56:45.120 | So I actually create two CSV files, one that contains everything in that bedroom dataset
00:56:51.720 | and one that just contains a random 10%. It's just nice to do that because then I can most
00:56:57.840 | of the time use the sample when I'm experimenting. Because there's well over a million files,
00:57:04.680 | even just reading in the list takes a while.
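A small sketch of how those two CSVs might be written; the paths and column names here are my own, not necessarily the notebook's:

```python
import pandas as pd
from pathlib import Path

PATH = Path('data/lsun')                         # hypothetical location of the unpacked images
files = [p.relative_to(PATH) for p in (PATH / 'bedroom').glob('**/*.jpg')]

df = pd.DataFrame({'fn': files, 'label': 0})     # dummy label of zero for every file
df.to_csv(PATH / 'files.csv', index=False)
df.sample(frac=0.1).to_csv(PATH / 'files_sample.csv', index=False)  # 10% sample for experiments
```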
00:57:11.480 | So this will look pretty familiar. So here's a conv block. This is before I realized that
00:57:18.640 | sequential models are much better. So if you compare this to my previous conv block with
00:57:23.320 | a sequential model, there's just a lot more lines of code here. But it does the same thing
00:57:30.240 | of doing conv, ReLU, batch norm.
00:57:36.440 | And we calculate our padding, and here's bias=False. So this is the same as before
00:57:40.360 | basically, but with a little bit more code.
00:57:48.520 | So the first thing we're going to do is build a discriminator. So a discriminator is going
00:57:53.480 | to receive as input an image, and it's going to spit out a number. And the number is meant
00:58:02.120 | to be lower if it thinks this image is real. Of course, what does it do for a lower number
00:58:11.120 | thing doesn't appear in the architecture, that will be in the loss function. So all
00:58:15.000 | we have to do is create something that takes an image and spits out a number.
00:58:23.560 | So a lot of this code is borrowed from the original authors of the paper, so some of
00:58:34.200 | the naming scheme and stuff is different to what we're used to. So sorry about that.
00:58:43.600 | But I've tried to make it look at least somewhat familiar. I probably should have renamed things
00:58:46.800 | a little bit. But it looks very similar to actually what we had before. We start out
00:58:50.800 | with a convolution, so remember the conv block is conv, ReLU, batch norm.
00:58:57.980 | And then we have a bunch of extra conv layers. This is not going to use a residual. It looks
00:59:04.720 | very similar to before, a bunch of extra layers, but these are going to be conv layers rather
00:59:08.280 | than res layers. And then at the end, we need to append enough stride 2 conv layers that
00:59:22.120 | we decrease the grid size down to be no bigger than 4x4. So it's going to keep using stride
00:59:29.920 | 2, divide the size by 2, stride 2, divide by size by 2, until our grid size is no bigger
00:59:36.020 | than 4. So this is quite a nice way of creating as many layers as you need in a network to
00:59:42.240 | handle arbitrary sized images and turn them into a fixed known grid size.
00:59:46.960 | Yes, Rachel? Does a GAN need a lot more data than say dogs
00:59:51.600 | versus cats or NLP, or is it comparable? Honestly, I'm kind of embarrassed to say
00:59:58.880 | I am not an expert practitioner in GANs. The stuff I teach in part 1 is stuff I'm happy
01:00:09.120 | to say I know the best way to do these things and so I can show you state-of-the-art results
01:00:15.520 | like I just did with CIFAR-10 with the help of some of my students, of course. I'm not
01:00:21.680 | there at all with GANs. So I'm not quite sure how much you need. In general, it seems you
01:00:29.760 | need quite a lot. But remember, the only reason we didn't need too much in dogs and cats is
01:00:35.920 | because we had a pre-trained model, and could we leverage pre-trained GAN models and fine-tune
01:00:40.880 | them? Probably. I don't think anybody's done it as far as I know. That could be a really
01:00:48.000 | interesting thing for people to kind of think about and experiment with. Maybe people have
01:00:52.120 | done it and there's some literature there I haven't come across. So I'm somewhat familiar
01:00:57.120 | with the main pieces of literature in GANs, but I don't know all of it. So maybe I've
01:01:03.040 | missed something about transfer learning in GANs, but that would be the trick to not needing
01:01:06.880 | too much data. So it's the huge speed-up combination of one cycle learning rate and momentum annealing
01:01:14.560 | plus the 8 GPU parallel training and the half precision. Is it only possible to do the
01:01:20.360 | half-precision calculation with a consumer GPU? Another question, why is the calculation 8
01:01:26.760 | times faster from single to half-precision while from double to single is only 2 times
01:01:31.280 | faster?
01:01:32.280 | Okay, so the CIFAR-10 result, it's not 8 times faster from single to half. It's about
01:01:39.160 | 2 or 3 times as fast from single to half. The Nvidia claims about the flops performance
01:01:46.400 | of the tensor cores are academically correct but in practice meaningless because it really
01:01:54.000 | depends on what cores you need for what pieces. So about 2 or 3x improvement for half. So
01:02:02.640 | the half-precision helps a bit, the extra GPU helps a bit, the one cycle helps an enormous
01:02:10.720 | amount. Then another key piece was the playing around with the parameters that I told you
01:02:16.240 | about. So reading the wide resnet paper carefully, identifying the kinds of things that they
01:02:23.040 | found there, and then writing a version of the architecture you just saw that made it
01:02:29.040 | really easy for me to fiddle around with parameters. Staying up all night trying every possible
01:02:37.520 | combination of different kernel sizes and numbers of kernels and numbers of layer groups
01:02:43.760 | and size of layer groups. Remember we did a bottleneck but actually we tended to focus
01:02:51.560 | not on bottlenecks but instead on widening. So we actually like things that increase the
01:02:55.760 | size and then decrease it because it takes better advantage of the GPU. So all those
01:03:01.000 | things combined together. I'd say the one cycle was perhaps the most critical but every
01:03:07.840 | one of those resulted in a big speedup. That's why we were able to get this 30x improvement
01:03:13.400 | over the state of the art. And we got some ideas for other things to try after this
01:03:24.000 | DAWNBench competition finishes. Maybe we'll try and go even further and see if we can beat one minute
01:03:28.880 | one day. That'll be fun.
01:03:37.480 | So here's our discriminator. The important thing to remember about an architecture is
01:03:42.080 | it doesn't do anything other than have some input tensor size and rank and some output
01:03:48.080 | tensor size and rank. You see the last conv here has one channel. This is a bit different
01:03:55.840 | to what we're used to, because normally our last thing is a linear block. But our last
01:04:00.960 | thing here is a conv block. And it's only got one channel but it's got a grid size of something
01:04:08.240 | around 4x4. So we're going to spit out a 4x4 by 1 tensor.
01:04:17.120 | So what we then do is we then take the mean of that. So it goes from 4x4 by 1 to the scalar.
01:04:27.980 | So this is kind of like the ultimate adaptive average pooling, because we've got something
01:04:32.200 | with just one channel, we take the mean. So this is a bit different. Normally we first
01:04:37.120 | do average pooling and then we put it through a fully connected layer to get our one thing
01:04:42.000 | out. In this case though we're getting one channel out and then taking the mean of that.
01:04:48.720 | I haven't fiddled around with why did we do it that way, what would instead happen if
01:04:53.160 | we did the usual average pooling followed by a fully connected layer. Would it work better?
01:04:58.560 | Would it not? I don't know. I rather suspect it would work better if we did it the normal
01:05:04.960 | way, but I haven't tried it and I don't really have a good enough intuition to know whether
01:05:10.400 | I'm missing something. It would be an interesting experiment to try. If somebody wants to stick
01:05:14.640 | an adaptive average pooling layer here and a fully connected layer afterwards with a
01:05:17.880 | single output, it should keep working. It should do something, and you can watch the loss go down to see
01:05:25.000 | whether it works.
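If you want to try that experiment, here is a hedged sketch of the two alternative heads (the 512 input channels are just an assumption about the width of the last conv layer):

```python
import torch.nn as nn

# DCGAN-style head: a final conv with a single output channel over the ~4x4 grid,
# whose mean is then taken in the forward pass to give one scalar per image
dcgan_head = nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=0, bias=False)

# The "usual" classifier-style head to experiment with instead
classifier_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # average pool down to 1x1
    nn.Flatten(),
    nn.Linear(512, 1))         # a single output
```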
01:05:27.320 | So that's the discriminator. There's going to be a training loop. Let's assume we've
01:05:31.640 | already got a generator. Somebody says, "Okay Jeremy, here's a generator, it generates bedrooms.
01:05:37.760 | I want you to build a model that can figure out which ones are real and which ones aren't.
01:05:41.160 | So I'm going to take the data set and I'm going to basically label a bunch of images
01:05:45.680 | which are fake bedrooms from the generator and a bunch of images of real bedrooms from
01:05:50.040 | my LSUN dataset, stick a 1 or a 0 on each one, and then I'll try to get the discriminator
01:05:56.080 | to tell the difference. So that's going to be simple enough.
01:06:04.000 | But I haven't been given a generator, I need to build one. So a generator, and we haven't
01:06:09.880 | talked about the loss function yet. We're just going to assume there's some loss function
01:06:13.400 | that does this thing. So a generator is also an architecture which doesn't do anything
01:06:19.920 | by itself until we have a loss function and data. But what are the ranks and sizes of
01:06:25.560 | the tensors? The input to the generator is going to be a vector of random numbers. In
01:06:34.480 | the paper, they call that the prior. It's going to be a vector of random numbers. How
01:06:38.440 | big? I don't know. Something like 64 or 128. And the idea is that a different bunch of random numbers
01:06:46.880 | will generate a different bedroom. So our generator has to take as input a vector, and
01:07:01.320 | it's going to take that vector, so here's our input, and it's going to stick it through,
01:07:06.200 | in this case a sequential model. And the sequential model is going to take that vector and it's
01:07:11.160 | going to turn it into a rank 4 tensor, or if we take off the batch bit, a rank 3 tensor,
01:07:26.520 | height by width by 3. So you can see at the end here, our final step here, NC, number of
01:07:40.440 | channels. So I think that's going to have to end up being 3 because we're going to create
01:07:43.560 | a 3-channel image of some size.
01:07:48.800 | In ConvBlock's forward, is there a reason why BatchNorm comes after ReLU, i.e. self.bn(self.relu(...))?
01:07:57.760 | No, there's not. It's just what they had in the code I borrowed from, I think.
01:08:05.240 | So again, unless my intuition about GANs is all wrong and for some reason needs to be
01:08:17.280 | different to what I'm used to, I would normally expect to go ReLU then BatchNorm. This is
01:08:29.200 | actually the order that makes more sense to me. But I think the order I had in the darknet
01:08:35.680 | was what they used in the darknet paper. Everybody seems to have a different order of these things.
01:08:45.200 | And in fact, most people for CIFAR-10 have a different order again, which is they actually
01:08:52.220 | go bn, then ReLU, then conv, which is kind of a quirky way of thinking about it. But
01:09:02.160 | it turns out that often for residual blocks that works better. That's called a pre-activation
01:09:07.920 | resnet. So if you Google for pre-activation resnet, you can see that.
01:09:13.520 | So yeah, there are not so many papers but more blog posts out there where people have experimented
01:09:19.120 | with different orders of those things. And yeah, it seems to depend a lot on what specific
01:09:25.200 | data set it is and what you're doing with it, although in general the difference in performance
01:09:29.940 | is small enough you won't care unless it's for a competition.
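For reference, here is a minimal sketch of a pre-activation style residual block, roughly in the BN, then ReLU, then conv ordering just described (channel counts and details are simplified assumptions):

```python
import torch
import torch.nn as nn

class PreActResBlock(nn.Module):
    """Pre-activation ordering: BN, then ReLU, then conv (twice), plus the identity path."""
    def __init__(self, nf):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(nf)
        self.conv1 = nn.Conv2d(nf, nf, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(nf)
        self.conv2 = nn.Conv2d(nf, nf, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out
```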
01:09:36.960 | So the generator needs to start with a vector and end up with a rank 3 tensor. We don't
01:09:45.040 | really know how to do that yet, so how do we do that? How do we start with a vector
01:09:50.520 | and turn it into a rank 3 tensor?
01:09:52.880 | We need to use something called a deconvolution. And a deconvolution is, or as they call it
01:10:02.920 | in PyTorch, a transposed convolution. Same thing, different name. And so a deconvolution
01:10:13.360 | is something which, rather than decreasing the grid size, it increases the grid size.
01:10:22.920 | So as with all things, it's easiest to see in an Excel spreadsheet.
01:10:28.780 | So here's a convolution. We start with a 4x4 grid cell with a single channel, a single filter.
01:10:38.240 | And let's put it through a 3x3 kernel again with a single output. So we've got a single
01:10:46.820 | channel in, a single filter kernel. And so if we don't add any padding, we're going to
01:10:53.400 | end up with 2x2, because that 3x3 can go in 1, 2, 3, 4 places. It can go in one of two
01:11:01.480 | places across and one of two places down if there's no padding.
01:11:06.880 | So there's our convolution. Remember the convolution is just the sum of the product of the kernel
01:11:14.960 | and the appropriate grid cell. So there's our standard 3x3 on one channel, one filter.
01:11:25.440 | So the idea now is I want to go the opposite direction. I want to start with my 2x2, and
01:11:34.320 | I want to create a 4x4. And specifically, I want to create the same 4x4 that I started
01:11:41.040 | with. And I want to do that by using a convolution.
01:11:45.840 | So how would I do that? Well, if I have a 3x3 convolution, then if I want to create
01:11:51.200 | a 4x4 output, I'm going to need to create this much padding. Because with this much
01:12:01.340 | padding, I'm going to end up with 1, 2, 3, 4 by 1, 2, 3, 4. You see why that is? So this
01:12:11.180 | filter can go in any one of four places across and four places up and down.
01:12:18.380 | So let's say my convolutional filter was just a bunch of zeros, then I can calculate my
01:12:23.960 | error for each cell just by taking this subtraction, and then I can get the sum of absolute values,
01:12:32.920 | the L1 loss, by just summing up the absolute values of those errors.
01:12:38.300 | So now I can use optimization, which in Excel is called Solver, to do a gradient descent.
01:12:48.900 | So I'm going to tell Solver to minimize that cell, and it'll try to reduce my loss by changing
01:12:56.380 | my filter, and I'll go Solve. And you can see it's come up with a filter such that 15.7
01:13:05.420 | is close to 16, 17 is right, 17.8 is close to 18, and so on, so it's not perfect. And in general, you can't
01:13:12.540 | assume that a deconvolution can exactly create the exact thing that you want, because there's
01:13:21.220 | just not enough information. There are only 9 numbers here, and there are 16 things you're trying to create.
01:13:26.500 | But it's made a pretty good attempt. So this is what a deconvolution looks like, a stride
01:13:34.140 | 1 3x3 deconvolution on a 2x2 grid cell input. How difficult is it to create a discriminator
01:13:46.060 | to identify fake news versus real news? Well, you don't need anything special, that's just
01:13:53.220 | a classifier. So you would just use the NLP classifier from the class before last
01:14:00.780 | and lesson 4. In that case, there's no generative piece, right? So you just need a dataset that
01:14:10.860 | says these are the things that we believe are fake news, and these are the things we
01:14:13.780 | consider to be real news. And it should actually work very well. To the best of my knowledge,
01:14:22.740 | if you try it, you should get as good a result as anybody else has got, whether it's good
01:14:27.900 | enough to be useful in practice, I don't know. Oh, I was going to say that it's very hard
01:14:32.580 | using the technique you've described. Very hard. There's not a good solution that does
01:14:41.300 | that. Well, but I don't think anybody in our course has tried, and nobody else outside
01:14:46.700 | our course knows of this technique. So there's been, as we've learned, we've just had a very
01:14:53.500 | significant jump in NLP classification capabilities. Obviously the best you could do at this stage
01:15:04.940 | would be to generate a triage that says these things look pretty sketchy based on how they're
01:15:13.340 | written and some human could go and fact check them. An NLP classifier and RNN can't fact
01:15:20.680 | check things, but it could recognize like, oh, these are written in that kind of highly
01:15:29.820 | popularized style in which fake news is often written, and so maybe these ones are worth
01:15:34.540 | paying attention to. I think that would probably be the best you could hope for without drawing
01:15:41.460 | on some kind of external data sources.
01:15:44.380 | But it's important to remember that a discriminator is basically just a classifier and you don't
01:15:51.460 | need any special techniques beyond what we've already learnt to do NLP classification.
01:16:01.360 | So to do that kind of deconvolution in PyTorch, you just say nn.ConvTranspose2d, and in the normal
01:16:08.780 | way you say the number of input channels, the number of output channels, the kernel size,
01:16:14.500 | the stride, the padding, the bias, so these parameters are all the same. And the reason
01:16:19.340 | it's called a conv transpose is because actually it turns out that this is the same as the
01:16:24.860 | calculation of the gradient of convolution. So this is a really nice example back on the
01:16:36.140 | old Theano website that comes from a really nice paper which actually shows you some visualizations.
01:16:42.740 | So this is actually the one we just saw of doing a 2x2 deconvolution. If there's a stride
01:16:48.580 | 2, then you don't just have padding around the outside, but you actually have to put
01:16:52.460 | padding in the middle as well. They're not actually quite implemented this way because
01:16:58.620 | this is slow to do. In practice they implement them a different way, but it all happens behind
01:17:03.900 | the scenes, we don't have to worry about it. We've talked about this convolution arithmetic
01:17:10.700 | tutorial before, and if you're still not comfortable with convolutions and in order to get comfortable
01:17:16.580 | with deconvolutions, this is a great site to go to. If you want to see the paper, just
01:17:22.420 | Google for convolution arithmetic, that'll be the first thing that comes up. Let's do
01:17:27.700 | it now so you know you've found it. Here it is. And so that Theano tutorial actually comes
01:17:38.580 | from this paper. But the paper doesn't have the animated gifs.
01:17:48.700 | So it's interesting then. A deconv block looks identical to a conv block, except it's got
01:17:53.000 | the word transpose written here. We just go conv, ReLU, batch norm as before, it's got
01:17:58.220 | input filters, output filters. The only difference is that stride 2 means that the grid size
01:18:06.580 | will double rather than halve.
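A hedged sketch of such a deconv block, mirroring the conv block but with ConvTranspose2d (the kernel size 4, stride 2, padding 1 combination is the standard DCGAN-style choice that exactly doubles the grid; it isn't necessarily the notebook's exact setting):

```python
import torch.nn as nn

def deconv_block(ni, nf, ks=4, stride=2, pad=1):
    # a stride-2 ConvTranspose2d with these settings doubles the grid size
    # (e.g. 4x4 -> 8x8) instead of halving it like a stride-2 conv
    return nn.Sequential(
        nn.ConvTranspose2d(ni, nf, ks, stride=stride, padding=pad, bias=False),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(nf))
```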
01:18:11.140 | Both nn.ConvTranspose2d and nn.Upsample seem to do the same thing, i.e. expand the grid size,
01:18:18.700 | height and width, from the previous layer. Can we say that ConvTranspose2d is always better
01:18:23.860 | than Upsample, since Upsample is merely resizing and filling unknowns by zeros or interpolation?
01:18:30.980 | No, you can't. So there's a fantastic interactive paper on distill.pub called Deconvolution and Checkerboard Artifacts.
01:18:52.340 | But the good news is everybody else does it. If you have a look here, can you see these
01:19:01.460 | checkerboard artifacts? It's all like dark blue, light blue, dark blue, light blue. So
01:19:07.620 | these are all from actual papers, right? Basically they noticed every one of these papers with
01:19:13.860 | generative models has these checkerboard artifacts. And what they realized is it's because when
01:19:20.820 | you have a stride 2 convolution with a size 3 kernel, the kernels overlap. And so basically
01:19:30.580 | some pixels, some grid cells, get twice as much activation as others. And so even
01:19:38.340 | if you start with random weights, you end up with a checkerboard artifact. So you can
01:19:44.780 | kind of see it here. And so the deeper you get, the worse it gets. Their advice is actually
01:19:58.060 | less direct than it ought to be. I found that for most generative models, upsampling
01:20:03.620 | is better. So if you do nn.Upsample, then all it does is basically a kind of pooling.
01:20:13.300 | But it's kind of the opposite of pooling. It says let's replace this one pixel or this one
01:20:20.100 | grid cell with 4, 2x2. And there's a number of ways to upsample. One is just to copy
01:20:26.180 | it across to those 4. Another is to use bilinear or bicubic interpolation. There are various
01:20:32.260 | techniques to try and create a smooth upsampled version, and you can pretty much choose any
01:20:37.500 | of them in PyTorch. So if you do a 2x2 upsample and then a regular stride 1 3x3 conv, that's
01:20:48.180 | like another way of doing the same kind of thing as a conv transpose. It's doubling the
01:20:55.660 | grid size and doing some convolutional arithmetic on it. And I found for generative models it
01:21:04.020 | pretty much always works better. And in that distill.pub publication, they kind of indicate
01:21:11.260 | that maybe that's a good approach, but they don't just come out and say just do this,
01:21:15.080 | whereas I would just say just do this. Having said that, for GANs, I haven't had that much
01:21:21.700 | success with it yet, and I think it probably requires some tweaking to get it to work.
01:21:26.540 | I'm sure some people have got it to work. The issue I think is that in the early stages,
01:21:33.980 | it doesn't create enough noise. I had a version actually where I tried to do it with an upsample,
01:21:47.140 | and you could kind of see that the noise didn't look very noisy. So anyway, it's an interesting
01:21:53.420 | version. But next week when we look at style transfer and super resolution and stuff, I
01:21:58.660 | think you'll see nn.Upsample really comes into its own.
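As a sketch, the upsample-then-conv alternative described above might look like this (the interpolation mode and padding choices are assumptions):

```python
import torch.nn as nn

def upsample_block(ni, nf):
    # double the grid size by interpolation, then apply a regular stride-1 conv,
    # as an alternative to a stride-2 transposed convolution
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(ni, nf, 3, stride=1, padding=1, bias=False),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(nf))
```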
01:22:05.100 | So the generator, we can now basically start with a vector. We can decide and say, okay,
01:22:10.340 | let's not think of it as a vector, but actually it's a 1x1 grid cell, and then we can turn
01:22:14.120 | it into a 4x4 and an 8x8 and so forth. And so that's why we have to make sure it's a suitable
01:22:20.860 | multiple so that we can actually create something of the right size. And so you can see it's
01:22:26.500 | doing the exact opposite of before, right? It's doubling the grid size,
01:22:31.340 | 2x at a time, as long as it can, until it gets to half the size that we want. And then
01:22:48.340 | finally we add one more on at the end -- sorry, we add n more on at the end with no stride,
01:22:58.100 | and then we add one more com transpose to finally get to the size that we wanted, and
01:23:06.020 | we're done. Finally, we put that through a tanh, and that's going to force us to be in
01:23:13.100 | the 0-to-1 range, because of course we don't want to spit out arbitrary size pixel values.
01:23:24.860 | So we've got a generator architecture which spits out an image of some given size with
01:23:29.580 | the correct number of channels and with values between 0 and 1.
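Putting those pieces together, here is a rough, fixed-size sketch of a DCGAN-style generator for 64x64 images (layer counts and widths are illustrative, and note that Tanh actually outputs values in [-1, 1], which the question a little further on picks up):

```python
import torch.nn as nn

def simple_generator(nz=64, nc=3, ngf=64):
    # nz: length of the noise vector, treated as an nz x 1 x 1 "image"
    # nc: number of output channels (3 for an RGB image)
    return nn.Sequential(
        # 1x1 -> 4x4
        nn.ConvTranspose2d(nz, ngf * 8, 4, stride=1, padding=0, bias=False),
        nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
        # 4x4 -> 8x8
        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
        # 8x8 -> 16x16
        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
        # 16x16 -> 32x32
        nn.ConvTranspose2d(ngf * 2, ngf, 4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(ngf), nn.ReLU(True),
        # 32x32 -> 64x64, nc channels, squashed by tanh
        nn.ConvTranspose2d(ngf, nc, 4, stride=2, padding=1, bias=False),
        nn.Tanh())
```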
01:23:39.420 | So at this point we can now create our ModelData object. These things take a while to train,
01:23:47.700 | so I just made it 128x128, so this is just a convenient way to make it a bit faster. And
01:23:58.380 | that's going to be the size of the input, but then we're going to use transformations
01:24:01.180 | to turn it into 64x64. There's been more recent advances which have attempted to really increase
01:24:10.060 | this up to high resolution sizes, but they still tend to require either a batch size
01:24:15.060 | of 1 or lots and lots of GPUs or whatever. We're trying to do things that we can do on
01:24:21.740 | single consumer GPUs here. So here's an example of one of the 64x64 bedrooms.
01:24:31.260 | So we're going to do pretty much everything manually, so let's go ahead and create our
01:24:36.740 | two models, our generator and our discriminator. And as you can see, they're DCGAN models, in other
01:24:44.460 | words the same modules that appeared in this paper. So if you're interested in
01:24:52.380 | reading the papers, it's well worth going back and looking at the DCGAN paper to see
01:24:58.740 | what these architectures are, because it's assumed that when you read the Wasserstein
01:25:02.340 | GAN paper that you already know that.
01:25:07.820 | Shouldn't we use a sigmoid if we want values between 0 and 1?
01:25:13.460 | I always forget which one's which. So sigmoid is 0 to 1, tanh is -1 to 1. I think what will
01:25:27.780 | happen is -- I'm going to have to check that. I vaguely remember thinking about this when
01:25:34.860 | I was writing this notebook and realizing that 1 to -1 made sense for some reason, but
01:25:39.540 | I can't remember what that reason was now. So let me get back to you about that during
01:25:43.900 | the week and remind me if I forget.
01:25:46.980 | Good question, thank you.
01:25:50.180 | So we've got our generator and our discriminator. So we need a function that returns a prior
01:25:56.060 | vector, so a bunch of noise. So we do that by creating a bunch of zeros. nz is the size
01:26:05.020 | of z, so very often in our code if you see a mysterious letter, it's because that's the
01:26:10.260 | letter they used in the paper. So z is the size of our noise vector.
01:26:16.980 | So there's the size of our noise vector, and then we use a normal distribution to generate
01:26:22.340 | random numbers inside that. And that needs to be a variable because it's going to be
01:26:26.840 | participating in the gradient updates.
01:26:32.780 | So here's an example of creating some noise, and so here are four different pieces of noise.
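A hedged sketch of such a create_noise helper (the notebook wraps the result in a Variable so it can participate in the gradient computation; in current PyTorch, Variable has been merged into Tensor, so a plain tensor is enough):

```python
import torch

nz = 64  # size of the noise vector z (an illustrative choice)

def create_noise(bs):
    # one nz-long noise vector per item in the batch, shaped bs x nz x 1 x 1,
    # so it can be fed straight into the generator's transposed convolutions
    return torch.zeros(bs, nz, 1, 1).normal_(0, 1)

noise = create_noise(4)  # e.g. four different noise vectors
```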
01:26:40.060 | So we need an optimizer in order to update our gradients. In the Wasserstein GAN paper,
01:26:51.480 | they told us to use RMSProp. So that's fine. So when you see this thing saying do an RMSProp
01:26:56.820 | update in a paper, that's nice. We can just do an RMSProp update with PyTorch. And they
01:27:05.140 | suggested a learning rate of 5e-5. I think I found 1e-4 seemed to work, so I just
01:27:12.260 | made it a bit bigger.
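A minimal sketch of that optimizer setup (netG and netD here are placeholder stand-ins for the real generator and discriminator modules):

```python
import torch.nn as nn
import torch.optim as optim

# placeholder stand-ins for the generator and discriminator built earlier
netG = nn.Sequential(nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())
netD = nn.Sequential(nn.Conv2d(3, 1, 4, 2, 1))

lr = 1e-4  # the paper suggests 5e-5; 1e-4 seemed to work here
optimizerD = optim.RMSprop(netD.parameters(), lr=lr)
optimizerG = optim.RMSprop(netG.parameters(), lr=lr)
```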
01:27:14.820 | So now we need a training loop. And so this is the thing that's going to implement this
01:27:18.820 | algorithm. So a training loop is going to go through some number of epochs that we get
01:27:26.780 | to pick, so that's going to be a parameter. And so remember, when you do everything manually,
01:27:33.860 | you've got to remember all the manual steps to do. So one is that you have to set your
01:27:37.700 | modules into training mode when you're training them, and into evaluation mode when you're
01:27:43.340 | evaluating them. Because in training mode, batch norm updates happen, and dropout happens.
01:27:49.940 | In evaluation mode, those two things get turned off. That's basically the difference. So put
01:27:54.860 | it into training mode. We're going to grab an iterator from our training data loader.
01:28:01.900 | We're going to see how many steps we have to go through, and then we'll use TQDM to
01:28:08.020 | give us a progress bar, and then we're going to go through that many steps.
01:28:14.060 | So the first step of this algorithm is to update the discriminator. So in this one -- they
01:28:37.060 | don't call it a discriminator, they call it a critic. So w are the weights of the critic.
01:28:43.380 | So the first step is to train our critic a little bit, and then we're going to train
01:28:48.660 | our generator a little bit, and then we're going to go back to the top of the loop. So
01:28:53.500 | we've got a while loop on the outside, so here's our while loop on the outside, and
01:28:58.060 | then inside that there's another loop for the critic, and so here's our little loop
01:29:02.440 | inside that for the critic. We call it a discriminator.
01:29:07.060 | So what we're going to do now is we've got a generator, and at the moment it's random.
01:29:13.660 | So our generator is going to generate stuff that looks something like this, and so we
01:29:19.360 | need to first of all teach our discriminator to tell the difference between that and a
01:29:24.060 | bedroom. It shouldn't be too hard, you would hope. So we just do it in basically the usual
01:29:31.420 | way, but there's a few little tweaks.
01:29:34.780 | So first of all, we're going to grab a mini-batch of real bedroom photos, so we can just grab
01:29:42.060 | the next batch from our iterator, turn it into a variable. Then we're going to calculate
01:29:54.580 | the loss for that. So this is going to be, how much does the discriminator think this
01:30:08.220 | looks fake? And then we're going to create some fake images, and to do that we'll create
01:30:16.240 | some random noise, and we'll stick it through our generator, which at this stage is just
01:30:20.940 | a bunch of random weights, and that's going to create a mini-batch of fake images. And
01:30:27.060 | so then we'll put that through the same discriminator module as before to get the loss for that.
01:30:34.780 | So how fake do the fake ones look? Remember when you do everything manually, you have
01:30:39.900 | to zero the gradients in your loop, and if you've forgotten about that, go back to the
01:30:45.780 | Part 1 lesson where we do everything from scratch. So now finally, the total discriminator
01:30:52.820 | loss is equal to the real loss minus the fake loss. And so you can see that here. They don't
01:31:03.140 | talk about the loss, they actually just talk about what are the gradient updates. So this
01:31:08.140 | here is the symbol for get the gradients. So inside here is the loss. And try to learn
01:31:16.860 | to throw away in your head all of the boring stuff. So when you see sum over m divided
01:31:22.780 | by m, that means take the average. So just throw that away and replace it with np.mean
01:31:27.820 | in your head. There's another np.mean. So you want to get quick at being able to see these
01:31:33.180 | common idioms. So anytime you see 1 over m, sum over m, you go, okay, np.mean. So we're
01:31:40.060 | taking the mean of, and we're taking the mean of, so that's all fine.
01:31:45.580 | x_i, what's x_i? It looks like it's x to the power of i, but it's not. The math notation
01:31:52.100 | is very overloaded. They showed us here what x_i is, and it's a set of m samples from a
01:32:01.420 | batch of the real data. So in other words, this is a mini-batch. So when you see something
01:32:06.940 | saying sample, it means just grab a row, grab a row, grab a row, and you can see here grab
01:32:12.180 | it m times, and we'll call the first row x, parenthesis 1, the second row x, parenthesis
01:32:17.980 | 2. One of the annoying things about math notation is the way that we index into arrays is everybody
01:32:28.500 | uses different approaches, subscripts, superscripts, things in brackets, combinations, commas, square
01:32:33.700 | brackets, whatever. So you've just got to look in the paper and be like, okay, at some point
01:32:39.300 | they're going to say take the i-th row from this matrix or the i-th image in this batch,
01:32:45.220 | how are they going to do it? In this case, it's a superscript in parenthesis.
01:32:51.020 | So that's all sample means, and curly brackets means it's just a set of them. This little
01:32:56.420 | squiggle followed by something here means according to some probability distribution.
01:33:04.580 | And so in this case, and very very often in papers, it simply means, hey, you've got a
01:33:09.420 | bunch of data, grab a bit from it at random. So that's the probability distribution of
01:33:17.380 | the data you have is the data you have. So this says grab m things at random from your
01:33:24.620 | prior samples, and so that means in other words call create_noise to create m random vectors.
01:33:42.620 | So now we've got m real images. Each one gets put through our discriminator. We've got m
01:33:54.380 | bits of noise. Each one gets put through our generator to create m generated images. Each
01:34:03.500 | one of those gets put through, look, f(w), that's the same thing, so each one of those
01:34:07.460 | gets put through our discriminator to try and figure out whether they're fake or not.
01:34:11.820 | And so then it's this, minus this, and the mean of that, and then finally get the gradient
01:34:18.060 | of that in order to figure out how to use RMSProp to update our weights using some learning
01:34:25.260 | rate.
01:34:27.180 | So in PyTorch, we don't have to worry about getting the gradients. We can just specify
01:34:34.660 | the loss bit, and then just say loss.backward, discriminator optimizer.step.
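A hedged sketch of that whole discriminator step, written as a function so the pieces it needs are explicit (the actual notebook code differs in detail, for example it uses Variable where this uses detach):

```python
def train_discriminator_step(netD, netG, optimizerD, real_batch, create_noise):
    # one critic step: real loss minus fake loss, then an RMSprop update
    netD.zero_grad()                                 # zero the gradients manually
    real_loss = netD(real_batch).mean()              # score for the real bedrooms
    fake = netG(create_noise(real_batch.size(0)))    # a mini-batch of fakes
    fake_loss = netD(fake.detach()).mean()           # score for the generated ones
    lossD = real_loss - fake_loss
    lossD.backward()
    optimizerD.step()
    return lossD
```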
01:34:42.140 | Now there's one key step, which is that we have to keep all of our weights, which are
01:34:56.660 | the parameters in a PyTorch module, in this small range between -0.01 and 0.01. Why? Because
01:35:07.540 | the mathematical assumptions that make this algorithm work only apply in like a small
01:35:15.620 | ball. I think it's kind of interesting to understand the math of why that's the case,
01:35:23.220 | but it's very specific to this one paper, and understanding it won't help you understand
01:35:28.420 | any other paper. So only study it if you're interested. I think it's nicely explained,
01:35:34.140 | I think it's fun, but it won't be information that you'll reuse elsewhere unless you get
01:35:40.620 | super into GANs.
01:35:42.380 | I'll also mention, after the paper came out, an improved Wasserstein GAN came out that
01:35:47.900 | said there are better ways to ensure that your weight space is in this tight ball, which
01:35:54.380 | was basically to penalize gradients that are too high. So nowadays there are slightly different
01:36:02.060 | ways to do this. Anyway, that's why this line of code there is kind of the key contribution.
01:36:08.720 | This one line of code actually is the one line of code you add to make it a Wasserstein
01:36:13.100 | GAN. But the work was all in knowing that that's the thing you can do that makes everything
01:36:19.700 | work better.
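The line in question is roughly this (netD being the critic module from the sketches above):

```python
# clip every parameter of the critic into a small ball, as the WGAN paper requires
for p in netD.parameters():
    p.data.clamp_(-0.01, 0.01)
```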
01:36:20.700 | At the end of this, we've got a discriminator that can recognize the difference between real bedrooms and
01:36:25.580 | our totally random crappy generated images. So let's now try and create some better images.
01:36:33.160 | So now set trainable to false for the discriminator, set trainable to true for the generator, and zero out the gradients
01:36:40.380 | of the generator.
01:36:42.740 | And now our loss again is fw, that's the discriminator of the generator applied to some more random
01:36:57.380 | noise. So here's our random noise, here's our generator, and here's our discriminator.
01:37:09.260 | I think I can remove that now because I think I've put it inside the discriminator but I
01:37:16.300 | won't change it now because it's going to confuse me. So it's exactly the same as before
01:37:22.540 | where we did generator on the noise and then pass that to discriminator, but this time
01:37:28.140 | the thing that's trainable is the generator, not the discriminator. So in other words,
01:37:32.580 | in this pseudocode, the thing they update is theta, which is the generator's parameters
01:37:40.900 | rather than w, which is the discriminator's parameters.
01:37:45.860 | And so hopefully you'll see now that this w down here is telling you these are the parameters
01:37:52.740 | of the discriminator, this theta down here is telling you these are the parameters of
01:38:02.420 | the generator.
01:38:03.420 | And again, it's not a universal mathematical notation, it's a thing they're doing in this
01:38:10.740 | particular paper, but it's kind of nice when you see some suffix like that, try to think
01:38:18.180 | about what it's telling you.
01:38:21.780 | So we take some noise, generate some images, try and figure out if they're fake or real,
01:38:28.500 | and use that to get gradients with respect to the generator, as opposed to earlier we
01:38:35.420 | got them with respect to the discriminator, and use that to update their weights with
01:38:40.020 | RMSProp with an alpha learning rate.
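And a hedged sketch of the corresponding generator step, again as a function (freezing the critic here is one way to express the "set trainable" idea mentioned above; the notebook uses its own helper for that):

```python
def train_generator_step(netD, netG, optimizerG, create_noise, bs):
    # freeze the critic so only the generator's parameters get gradients
    for p in netD.parameters():
        p.requires_grad_(False)
    netG.zero_grad()
    lossG = netD(netG(create_noise(bs))).mean()   # the critic's score for fresh fakes
    lossG.backward()
    optimizerG.step()
    for p in netD.parameters():                   # unfreeze for the next critic steps
        p.requires_grad_(True)
    return lossG
```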
01:38:47.980 | You'll see that it's kind of unfair that the discriminator is getting trained n critic
01:38:55.700 | times, which they set to 5, for every time that we train the generator once.
01:39:04.700 | And the paper talks a bit about this, but the basic idea is there's no point making
01:39:09.500 | the generator better if the discriminator doesn't know how to discriminate yet.
01:39:15.020 | So that's why we've got this while loop.
01:39:18.460 | And here's that 5, and actually something which was added in the later paper is the
01:39:28.880 | idea that from time to time, and a bunch of times at the start, you should do more steps
01:39:39.060 | of the discriminator.
01:39:40.060 | So make sure that the discriminator is pretty capable from time to time.
01:39:46.420 | So do a bunch of epochs of training the discriminator a bunch of times to get better at telling
01:39:51.900 | the difference between real and fake, and then do one step with making the generator
01:39:56.860 | being better at generating, and that is an epoch.
01:40:01.940 | And so let's train that for one epoch, and then let's create some noise so we can generate
01:40:11.620 | some examples.
01:40:16.020 | So we're going to do that later.
01:40:17.020 | Let's first of all decrease the learning rate by 10 and do one more pass.
01:40:20.900 | So we've now done two epochs, and now let's use our noise to pass it to our generator,
01:40:30.540 | and then put it through our denormalization to turn it back into something we can see,
01:40:37.580 | and then plot it, and we have some bedrooms.
01:40:43.620 | It's not real bedrooms, and some of them don't look particularly like bedrooms, but some
01:40:46.940 | of them look a lot like bedrooms.
01:40:49.540 | So that's the idea, that's a GAN.
01:40:53.620 | And I think the best way to think about a GAN is it's like an underlying technology
01:41:01.020 | that you'll probably never use like this, but you'll use in lots of interesting ways.
01:41:08.820 | For example, we're going to use it to create now a CycleGAN, and we're going to use the
01:41:20.900 | CycleGAN to turn horses into zebras.
01:41:28.040 | You could also use it to turn Monet prints into photos, or to turn photos of Yosemite
01:41:33.020 | in summer into winter.
01:41:36.200 | So it's going to be pretty, yes, Rachel?
01:41:38.860 | Two questions.
01:41:39.860 | One, is there any reason for using RMSProp specifically as the optimizer, as opposed to
01:41:45.820 | Adam?
01:41:46.820 | I don't remember it being explicitly discussed in the paper, I don't know if it's just experimental
01:41:52.740 | or whether there's a theoretical reason.
01:41:54.220 | Have a look in the paper and see what it says, I don't recall.
01:41:57.900 | And what would be a reasonable way of detecting overfitting while training, or of evaluating
01:42:02.420 | the performance of one of these GAN models once we're done training?
01:42:06.060 | In other words, how does the notion of training validation test sets translate to GANs?
01:42:15.340 | That's an awesome question.
01:42:18.740 | And there's a lot of people who make jokes about how GANs is the one field where you
01:42:24.420 | don't need a test set, and people take advantage of that by making stuff up and saying it looks
01:42:30.620 | great.
01:42:33.060 | There are some pretty famous problems with GANs.
01:42:36.420 | One of the famous problems with GANs is called mode collapse.
01:42:40.060 | And mode collapse happens where you look at your bedrooms and it turns out that there's
01:42:44.900 | basically only three kinds of bedrooms that every possible noise vector mapped to, and
01:42:50.660 | you look at your gallery and it turns out they're all just the same thing, or there's
01:42:54.620 | just three different things.
01:42:56.980 | Mode collapse is easy to see if you collapse down to a small number of modes, like three
01:43:01.700 | or four.
01:43:02.700 | But what if you have a mode collapse down to 10,000 modes, so there's only 10,000 possible
01:43:08.300 | bedrooms that all of your noise vectors collapse to?
01:43:12.700 | You wouldn't be able to see it here, because it's pretty unlikely you would have two identical
01:43:16.060 | bedrooms out of 10,000.
01:43:18.460 | Or what if every one of these bedrooms is basically a direct copy of one of the -- it
01:43:24.980 | basically memorized some input.
01:43:29.900 | Could that be happening?
01:43:31.680 | And the truth is most papers don't do a good job or sometimes any job of checking those
01:43:39.260 | things.
01:43:42.440 | So the question of how do we evaluate GANs, and even the point of maybe we should actually
01:43:50.100 | evaluate GANs properly is something that is not widely enough understood even now.
01:43:59.780 | And some people are trying to really push.
01:44:02.460 | So Ian Goodfellow, who a lot of you will know because he came and spoke here at a lot of
01:44:09.540 | the book club meetings last year, and of course was the first author on the most famous deep
01:44:15.540 | learning book.
01:44:16.540 | He's the inventor of GANs, and he's been sending a continuous stream of tweets reminding people
01:44:24.900 | about the importance of testing GANs properly.
01:44:30.500 | So if you see a paper that claims exceptional GAN results, then this is definitely something
01:44:37.100 | to look at.
01:44:38.800 | Have they talked about mode collapse?
01:44:40.380 | Have they talked about memorization?
01:44:47.100 | So this is going to be really straightforward because it's just a neural net.
01:44:51.260 | So all we're going to do is we're going to create an input containing lots of zebra photos,
01:44:58.780 | and with each one we'll pair it with an equivalent horse photo, and we'll just train a neural
01:45:06.660 | net that goes from one to the other.
01:45:08.300 | Or you can do the same thing for every Monet painting, create a dataset containing the
01:45:13.060 | photo of the place.
01:45:15.020 | Oh wait, that's not possible because the places that Monet painted aren't there anymore, and
01:45:20.900 | there aren't exact zebra versions of horses.
01:45:23.700 | And oh wait, how the hell is this going to work?
01:45:27.920 | This seems to break everything we know about what neural nets can do and how they do them.
01:45:32.980 | Alright Rachel, you're going to ask me a question, just spoil our whole train of thought.
01:45:36.940 | Come on, better be good.
01:45:38.500 | Can GANs be used for data augmentation?
01:45:41.380 | Yeah, absolutely.
01:45:43.140 | You can use a GAN for data augmentation.
01:45:45.540 | Should you?
01:45:47.740 | I don't know.
01:45:49.660 | There are some papers that try to do semi-supervised learning with GANs.
01:45:53.620 | I haven't found any that are particularly compelling, showing state-of-the-art results on really
01:46:00.340 | interesting datasets that have been widely studied.
01:46:04.940 | I'm a little skeptical.
01:46:08.100 | The reason I'm a little skeptical is because in my experience, if you train a model with
01:46:12.660 | synthetic data, the neural net will become fantastically good at recognizing the specific
01:46:20.020 | problems of your synthetic data, and that will end up being what it learns from.
01:46:26.820 | And there are lots of other ways of doing semi-supervised models which do work well.
01:46:32.340 | There are some places that it can work.
01:46:34.260 | For example, you might remember Otavio Good who created that fantastic visualization in
01:46:39.540 | Part 1 of the Zooming ConvNet where he kind of showed a letter going through MNIST.
01:46:45.740 | He at least at that time was the number one autonomous remote-controlled car guy in autonomous
01:46:56.220 | remote-controlled car competitions.
01:46:58.420 | And he trained his model using synthetically augmented data where he basically took real
01:47:06.040 | videos of a car driving around a circuit and added fake people and fake other cars and
01:47:13.860 | stuff like that.
01:47:15.020 | And I think that worked well because he's kind of a genius and because I think he had
01:47:22.660 | a well-defined subset that he had to work in.
01:47:32.820 | But in general it's really hard to use synthetic data.
01:47:36.620 | I've tried using synthetic data in models for decades now, obviously not GANs because
01:47:42.220 | they're pretty new, but in general it's very hard to do.
01:47:46.980 | Very interesting research question.
01:47:53.380 | So somehow these folks at Berkeley created a model that can turn a horse into a zebra
01:48:03.700 | despite not having any photos unless they went out there and painted horses and took
01:48:08.820 | before and after shots, but I believe they did it.
01:48:14.860 | So how the hell did they do this?
01:48:18.840 | It's kind of genius.
01:48:24.300 | I will say the person I know who's doing the most interesting practice of CycleGAN right
01:48:31.020 | now is one of our students, Helena Sarin.
01:48:35.780 | She's the only artist I know of who is a CycleGAN artist.
01:48:41.220 | Here's an example I love.
01:48:42.700 | She created this little doodle in the top left, and then trained a CycleGAN to turn
01:48:49.140 | it into this beautiful painting in the bottom right.
01:48:54.220 | Here are some more of her amazing works.
01:48:58.860 | I think it's really interesting, I mentioned at the start of this class that GANs are in
01:49:04.500 | the category of stuff that's not there yet, but it's nearly there, and in this case there's
01:49:12.140 | at least one person in the world now who's creating beautiful and extraordinary artworks
01:49:16.840 | using GANs, and specifically CycleGANs, and there's actually at least
01:49:22.660 | maybe a dozen people I know of who are just doing interesting creative work with neural
01:49:27.420 | nets more generally, and the field of creative AI is going to expand dramatically.
01:49:33.300 | I think it's interesting with Helena, I don't know her personally, but from what I understand
01:49:38.700 | of her background, she's a software developer, it's her full-time job, and an artist as her
01:49:45.420 | hobby, and she's kind of started combining these two by saying, "Gosh, I wonder what
01:49:50.380 | this particular tool could bring to my art?"
01:49:53.660 | And so if you follow her Twitter account, we'll make sure we add it on the wiki.
01:49:57.540 | If somebody can find it, it's Helena Sarin.
01:50:03.020 | She basically posts a new work almost every day, and they're always pretty amazing.
01:50:11.820 | So here's the basic trick, and this is from the CycleGAN paper.
01:50:18.620 | We're going to have two images, assuming we're doing this with images, but the key thing
01:50:27.140 | is they're not paired images. We don't have a data set of horses and the equivalent zebras.
01:50:34.620 | We've got a bunch of horses, a bunch of zebras.
01:50:38.460 | Grab one horse, grab one zebra.
01:50:41.780 | We've now got an X, let's say X is horse, and Y is zebra.
01:50:47.060 | We're going to train a generator, and what they call here a mapping function, that turns
01:50:52.940 | horse into zebra, we'll call that mapping function G, and we'll create one mapping function,
01:50:58.900 | generator, that turns a zebra into a horse, and we'll call that F.
01:51:04.140 | We'll create a discriminator, just like we did before, which is going to get as good
01:51:09.660 | as possible at recognizing real from fake horses, so that'll be DX, and then another
01:51:16.500 | discriminator which is going to be as good as possible and recognizing real from fake
01:51:20.820 | zebras, we'll call that DY.
01:51:24.740 | So that's kind of our starting point, but then the key thing to making this work, we're
01:51:31.700 | kind of generating a loss function here, right? Here's one bit of the loss function, here's
01:51:34.620 | the second bit of the loss function. We're going to create something called cycle consistency
01:51:39.060 | loss which says after you turn your horse into a zebra with your G generator and check
01:51:47.900 | whether or not I can recognize that it's real. I keep forgetting which one's horse and which
01:51:54.420 | one's zebra, I apologize if I get my X's and Y's backwards.
01:51:57.700 | I turn my horse into a zebra, and then I'm going to try and turn that zebra back into
01:52:03.080 | the same horse that I started with. So then I'm going to have another function that's
01:52:09.240 | going to check whether this horse, which I've generated knowing nothing about X, generated
01:52:19.500 | entirely from this zebra, is similar to the original horse or not.
01:52:24.980 | So the idea would be if your generated zebra doesn't look anything like your original horse,
01:52:32.420 | you've got no chance of turning it back into the original horse. So a loss, which compares
01:52:38.340 | X hat to X, is going to be really bad unless you can go into Y and back out again. And
01:52:46.140 | you're probably only going to be able to do that if you're able to create a zebra that
01:52:50.540 | looks like the original horse so that you know what the original horse looked like.
01:52:55.400 | And vice versa. Take your zebra, turn it into a fake horse, and then try and turn it back
01:53:04.880 | into the original zebra and check that it looks like the original. So notice here, this
01:53:10.100 | F is our zebra to horse. This G is our horse to zebra. So the G and the F are kind of doing
01:53:19.260 | two things. They're both turning the original horse into the zebra and then turning the
01:53:25.740 | zebra back into the original horse. So notice that there's only two generators. There isn't
01:53:31.780 | a separate generator for the reverse mapping. You have to use the same generator that was
01:53:36.980 | used for the original mapping. So this is the cycle consistency loss. And I just think
01:53:42.420 | this is genius. The idea that this is a thing that could be even possible, honestly when
01:53:51.620 | this came out, it just never occurred to me as a thing that I could even try and solve.
01:53:56.540 | It seems so obviously impossible. And then the idea that you can solve it like this,
01:54:00.860 | I just think it's so damn smart.
01:54:05.940 | So it's good to look at the equations in this paper because they're written pretty simply.
01:54:16.660 | It's not like some of the stuff in the Wasserstein-Gahn paper which is like lots of theoretical proofs
01:54:23.140 | and whatever else. In this case, they're just equations that just lay out what's going on.
01:54:28.500 | And you really want to get to a point where you can read them and understand them. So let's
01:54:32.500 | kind of start talking through them.
01:54:35.040 | So we've got a horse and a zebra. So for some mapping function G, which is our horse to
01:54:48.500 | zebra mapping function, then there's a GAN loss, which is the bit we're already familiar
01:54:54.580 | with. It says I've got a horse, a zebra, a fake zebra recognizer, and a horse to zebra
01:55:03.060 | generator. And the loss is what we saw before. It's our ability to draw one zebra out of
01:55:14.220 | our zebras and recognize whether it's real or fake. And then take a horse and turn it
01:55:29.700 | into a zebra and recognize whether that's real or fake. And then you can then do one
01:55:37.300 | minus the other. In this case, they've got a log in there. The log's not terribly important.
01:55:43.060 | So this is the thing we just saw. So that's why we did Wasserstein GAN first. This is
01:55:47.740 | just a standard GAN loss in math form.
01:55:52.200 | Did you have a question, Rachel?
01:55:54.780 | All of this sounds awfully like translating in one language to another than back to the
01:55:58.580 | original. Have GANs or any equivalent been tried in translation?
01:56:04.460 | Not that I know of. There's this unsupervised machine translation which does kind of do
01:56:24.020 | something like this, but I haven't looked at it closely enough to know if it's nearly
01:56:29.820 | identical or if it's just vaguely similar.
01:56:35.020 | So to kind of back up to what I do know, normally with translation you require this kind of
01:56:40.340 | paired input. You require parallel texts. This is the French translation of this English
01:56:45.140 | sentence.
01:56:47.820 | I do know there's been a couple of recent papers that show the ability to create good
01:56:53.060 | quality translation models without paired data. I haven't implemented them and I don't
01:57:00.860 | understand them fully, but they may well be doing the same
01:57:04.820 | basic idea. We'll look at it during the week and get back to you.
01:57:15.980 | So we've got our GAN loss. The next piece is the cycle consistency loss. So the basic
01:57:22.180 | idea here is that we start with our horse, use our zebra generator on that to create
01:57:28.980 | a zebra, use our horse generator on that to create a horse, and then compare that to the
01:57:34.580 | original horse. And this double lines with a 1, we've seen this before, this is the L1
01:57:41.500 | loss. So this is the sum of the absolute value of differences.
01:57:45.580 | Or else if this was a 2, it would be the L2 loss, or the 2-norm, which would be the square root
01:57:51.900 | of the sum of the squares, actually.
01:57:57.480 | And again, we now know this squiggle idea, which is from our horses, grab a horse. So
01:58:07.540 | this is what we mean by sample from a distribution. There's all kinds of distributions, but most
01:58:12.380 | commonly in these papers we're using an empirical distribution. In other words, we've got some
01:58:17.540 | rows of data, grab a row.
01:58:20.060 | So when you see this thing, squiggle, other thing, this thing here, when it says pdata,
01:58:27.340 | that means grab something from the data, and we're going to call that thing x. So from
01:58:33.700 | our horse's pictures, grab a horse, turn it into a zebra, turn it back into a horse, compare
01:58:39.740 | it to the original, and sum up the absolute values. Do that for horse to zebra, do it for
01:58:44.760 | zebra to horse as well, add the two together, and that is our cycle consistency loss.
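Written out in the paper's notation, that cycle consistency loss is:

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \big[ \lVert F(G(x)) - x \rVert_1 \big]
+ \mathbb{E}_{y \sim p_{\mathrm{data}}(y)} \big[ \lVert G(F(y)) - y \rVert_1 \big]
```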
01:58:56.420 | So now we get our loss function, and the whole loss function depends on our horse generator,
01:59:03.460 | our zebra generator, our horse recognizer, our zebra recognizer discriminator, and we're
01:59:09.340 | going to add up the GAN loss for recognizing horses, the GAN loss for recognizing zebras,
01:59:17.980 | and the cycle consistency loss for our two generators.
01:59:23.580 | And then we've got a lambda here, which hopefully we're kind of used to this idea now, that
01:59:27.620 | is when you've got two different kinds of loss, you chuck in a parameter there, you
01:59:32.580 | can multiply them by so they're about the same scale. We did a similar thing with our
01:59:38.100 | bounding box loss compared to our classifier loss when we did that localization stuff.
01:59:49.140 | So then we're going to try to, for this loss function, maximize the capability of the discriminators
01:59:56.100 | that are discriminating whilst minimizing that for the generators. So the generators
02:00:04.700 | and the discriminators are going to be facing off against each other.
02:00:08.880 | So when you see this min-max thing in papers, you'll see it a lot. It basically means this
02:00:15.500 | idea that in your training loop, one thing is trying to make something better, the other
02:00:20.180 | is trying to make something worse, and there's lots of ways to do it, but most commonly you
02:00:24.700 | will alternate between the two. And you'll often see this just referred to in math papers
02:00:29.820 | as min-max. So when you see min-max, you should immediately think, okay, adversarial training.
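And the full objective, with its min-max game, as it appears in the CycleGAN paper:

```latex
\mathcal{L}(G, F, D_X, D_Y) =
  \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y)
+ \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X)
+ \lambda \, \mathcal{L}_{\mathrm{cyc}}(G, F)

G^*, F^* = \arg \min_{G, F} \; \max_{D_X, D_Y} \; \mathcal{L}(G, F, D_X, D_Y)
```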
02:00:42.320 | So let's look at the code. We probably won't be able to finish this today, but we're going
02:00:49.020 | to do something almost unheard of, which is I started looking at somebody else's code,
02:00:56.660 | and I was not so disgusted that I threw the whole thing away and did it myself. I actually
02:01:01.060 | said I quite like this. I like it enough I'm going to show it to my students.
02:01:07.140 | So this is where the code comes from. So this is one of the people that created the original
02:01:17.340 | code for CycleGANs, and they've created a PyTorch version. I had to clean it up a little
02:01:29.660 | bit, but it's actually pretty damn good. I think the first time I found code that I didn't
02:01:37.740 | feel the need to rewrite from scratch before I showed it to you.
02:01:44.340 | And so the cool thing about this is one of the reasons I like doing it this way, like
02:01:49.140 | finally finding something that's not awful, is that you're now going to get to see almost
02:01:55.940 | all the bits of fast.ai, or all the relevant bits of fast.ai, written in a different way
02:02:00.980 | than somebody else. And so you're going to get to see how they do data sets, and data
02:02:06.140 | loaders, and models, and training loops, and so forth. So you'll find there's a cgan directory,
02:02:17.860 | which is basically nearly this, with some cleanups which I hope to submit as a PR sometime.
02:02:26.740 | It was written in a way that unfortunately made it a bit over-connected to how they were
02:02:30.260 | using it as a script. I cleaned it up a little bit so I could use it as a module, but other
02:02:35.100 | than that it's pretty similar. So cgan is basically their code copied from their GitHub
02:02:44.540 | repo with some minor changes. So the way the cgan mini-library has been set up is that
02:02:54.220 | the configuration options they're assuming are being passed in to a script. So they've
02:02:59.580 | got this train options parser method, and so you can see I'm basically passing in an
02:03:06.780 | array of script options. Where's my data? How many threads do I want to drop out? How many
02:03:14.780 | iterations? What am I going to call this model? Which GPU do I want to write it on? So that
02:03:22.460 | might just be an opt object, which you can then see what it contains. You'll see it contains
02:03:32.580 | some things I didn't mention, and that's because it's got defaults for everything else that
02:03:36.780 | I didn't mention.
02:03:39.540 | So rather than using fast.ai stuff, we're going to use cgan stuff. So the first thing
02:03:46.820 | we're going to need is a data loader. And so this is also a great opportunity for you again
02:03:52.620 | to practice your ability to navigate through code with your editor or IDE of choice. So
02:04:01.060 | we're going to start with create data loader. So you should be able to go find symbol or
02:04:06.940 | in vim tag to jump straight to create data loader, and we can see that's creating a custom
02:04:14.260 | dataset loader, and then we can see custom dataset loader is a base data loader. So basically
02:04:26.180 | we can see that it's going to use a standard PyTorch data loader. So that's good. And so
02:04:32.540 | we know if you're going to use a standard PyTorch data loader, you have to pass it a
02:04:36.540 | dataset. And we know that a dataset is something that contains a length and an indexer. So
02:04:43.900 | presumably when we look at create dataset, it's going to do that. Here is create dataset.
02:04:49.980 | So this library actually does more than just CycleGAN. It handles both aligned and unaligned
02:04:55.660 | image pairs. We know that our image pairs are unaligned. So we've got an unaligned dataset.
02:05:02.300 | Okay, here it is. And as expected, it has a getItem and a length. Good. And so obviously
02:05:13.260 | the length is just whatever. So A and B is our horses and zebras. We've got two sets.
02:05:21.820 | So whichever one is longer is the length of the data loader. And so getItem is just going
02:05:26.620 | to go ahead and randomly grab something from each of our two horses and zebras, open them
02:05:36.620 | up with Pillow or PIL, run them through some transformations, and then we could either
02:05:43.260 | be turning horses into zebras or zebras into horses, so there's some direction, and then
02:05:48.260 | it will just go ahead and return our horse and our zebra and our path to the horse and
02:05:53.980 | the path to zebra. So hopefully you can kind of see that this is looking pretty similar
02:05:59.980 | to the kind of stuff that FastAI does. FastAI obviously does quite a lot more when it comes
02:06:06.860 | to transforms and performance and stuff like this. But remember, this is like research
02:06:12.380 | code for this one thing. It's pretty cool that they did all this work.
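As a heavily simplified, hedged sketch of that kind of unaligned dataset (the real one in the library handles the options object, richer transforms and the direction flag; the file layout and transforms here are assumptions):

```python
import random
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class SimpleUnalignedDataset(Dataset):
    """Pairs a random image from domain A (e.g. horses) with one from B (e.g. zebras)."""
    def __init__(self, dir_a, dir_b, size=64):
        self.paths_a = sorted(Path(dir_a).glob('*.jpg'))
        self.paths_b = sorted(Path(dir_b).glob('*.jpg'))
        self.tfm = transforms.Compose([
            transforms.Resize(size), transforms.CenterCrop(size),
            transforms.ToTensor()])

    def __len__(self):
        # whichever domain is longer defines the length
        return max(len(self.paths_a), len(self.paths_b))

    def __getitem__(self, i):
        path_a = self.paths_a[i % len(self.paths_a)]
        path_b = random.choice(self.paths_b)   # unaligned: any B image will do
        a = self.tfm(Image.open(path_a).convert('RGB'))
        b = self.tfm(Image.open(path_b).convert('RGB'))
        return {'A': a, 'B': b, 'A_paths': str(path_a), 'B_paths': str(path_b)}
```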
02:06:17.900 | So we've got a data loader, so we can go and load our data into it, and so that will tell
02:06:24.860 | us how many minibatches are in it. That's the length of the data loader in PyTorch.
02:06:31.140 | Next step, we've got a data loader is to create a model. So you can go tag for create_model.
02:06:43.660 | There it is. Same idea, we've got different kinds of models, so we're going to be doing
02:06:48.220 | a CycleGAN. So here's our CycleGAN model. So there's quite a lot of stuff in a CycleGAN
02:06:55.180 | model, so let's go through and find out what's going to be used. But basically at this stage,
02:07:04.540 | we've just called initializer. So when we initialize it, you can see it's going to go through and
02:07:11.540 | it's going to define two generators, which is not surprising, a generator for our horses
02:07:17.480 | and a generator for our zebras. There's some way for it to generate a pool of fake data.
02:07:31.940 | And then here we're going to grab our GAN loss, and as we talked about, our cycle consistency
02:07:39.140 | loss is an L1 loss. That's interesting, they're going to use Adam. So obviously for CycleGANs,
02:07:48.780 | they found Adam works pretty well. And so then we're going to have an optimizer for
02:07:54.020 | our horse discriminator, an optimizer for our zebra discriminator, and an optimizer for
02:08:01.500 | our generator. The optimizer for the generator is going to contain the parameters both for
02:08:09.880 | the horse generator and the zebra generator all in one place. So the initializer is going
02:08:17.380 | to set up all of the different networks and loss functions we need, and they're all going
02:08:21.100 | to be stored inside this model. And so then it prints out and shows us exactly the PyTorch
02:08:30.940 | modules we have. And so it's interesting to see that they're using ResNets. And so you
02:08:36.100 | can see the ResNet blocks look pretty familiar: conv, batch norm, ReLU; conv, batch norm.
02:08:45.980 | So instance norm is basically the same as batch norm, but it's applied to one image
02:08:52.240 | at a time. The difference isn't particularly important. And you can see they're doing reflection
02:09:00.940 | padding just like we are. You can kind of see that when you try to build everything from
02:09:08.620 | scratch like this, it is a lot of work. And all the nice little things
02:09:16.740 | that fast.ai does automatically for you, you kind of have to do by hand, and you
02:09:23.140 | only end up with a subset of them. So over time, hopefully soon, we'll get all of this
02:09:29.300 | GAN stuff into fast.ai and it'll be nice and easy.
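Just to recap what that initializer sets up, here's a stripped-down sketch of the pieces. build_generator and build_discriminator are placeholders standing in for the repo's ResNet generator and PatchGAN discriminator, and the hyperparameters are illustrative rather than the repo's exact defaults.

    import itertools
    import torch
    from torch import nn

    # Placeholders: a single conv stands in for each full network here.
    def build_generator():     return nn.Conv2d(3, 3, 3, padding=1)
    def build_discriminator(): return nn.Conv2d(3, 1, 3, padding=1)

    G_A, G_B = build_generator(), build_generator()          # horse->zebra, zebra->horse
    D_A, D_B = build_discriminator(), build_discriminator()  # horse critic, zebra critic

    gan_loss   = nn.MSELoss()   # LSGAN-style criterion; the repo wraps this in a GANLoss class
    cycle_loss = nn.L1Loss()    # cycle-consistency loss

    # One optimizer over both generators' parameters, one per discriminator.
    opt_G   = torch.optim.Adam(itertools.chain(G_A.parameters(), G_B.parameters()),
                               lr=2e-4, betas=(0.5, 0.999))
    opt_D_A = torch.optim.Adam(D_A.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_D_B = torch.optim.Adam(D_B.parameters(), lr=2e-4, betas=(0.5, 0.999))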
02:09:34.140 | So we've got our model, and remember the model contains the loss functions, it contains the
02:09:38.820 | generators, it contains the discriminators, all in one convenient place. So I've gone
02:09:43.260 | ahead and kind of copied and pasted and slightly refactored the training loop from the code
02:09:50.580 | so that we can run it inside the notebook.
02:09:53.860 | So this is all pretty familiar, right? It's a loop to go through each epoch, and a loop
02:09:58.980 | to go through the data. Before we did this, we set up our dataset -- this is actually not a PyTorch
02:10:08.940 | dataset, I think that's just the word they use, slightly confusingly, to refer to their combined
02:10:15.660 | data, what we would call a model data object, I guess, or the data that they need. We'll wrap
02:10:21.060 | that with tqdm to get a progress bar, and so now we can go through and see what happens
02:10:26.860 | in the model.
02:10:28.740 | So set_input. It's kind of a different approach from what we do in fast.ai. It's kind
02:10:42.540 | of neat, it's quite specific to CycleGANs, but basically internally inside this model
02:10:48.140 | is this idea that we're going to go into our data and grab -- we're either going horse
02:10:55.880 | to zebra or zebra to horse, depending on which way we go. A is either the horse or the zebra,
02:11:02.140 | and vice versa, and if necessary, put it on the appropriate GPU and then grab the appropriate
02:11:09.460 | path.
02:11:11.500 | So the model now has a mini-batch of horses and a mini-batch of zebras, and so now we
02:11:20.500 | optimize the parameters.
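Pulled out of the repo's script, the loop the notebook runs amounts to something like this sketch. set_input and optimize_parameters are the model methods just described; the epoch count is an assumption.

    from tqdm import tqdm

    def train(model, dataloader, n_epochs=200):
        # Simplified shape of the training loop: for each epoch, walk the data,
        # hand a minibatch to the model, and let the model run one optimization
        # step over the generators and then both discriminators.
        for epoch in range(n_epochs):
            for data in tqdm(dataloader):
                model.set_input(data)        # stash real_A / real_B (move to GPU if needed)
                model.optimize_parameters()  # generator step, then discriminator steps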
02:11:31.300 | So it's kind of nice to see it like this. You can see each step. First of all, try to optimize
02:11:42.000 | the generators, then try to optimize the horse discriminator, then try to optimize the zebra
02:11:48.460 | discriminator.
02:11:49.460 | zero_grad is part of PyTorch. step is part of PyTorch. So the interesting bit is the actual
02:11:57.820 | thing which does the backpropagation on the generator.
02:12:04.620 | So here it is.
02:12:07.900 | And let's jump to the key pieces. There's all the bits, all the formulas that we basically
02:12:12.300 | just saw from the paper. So let's take a horse and generate a zebra. So we've now got a fake
02:12:25.580 | zebra. And let's now use the discriminator to see if we can tell whether it's fake or
02:12:30.140 | not. And then let's pop that into the loss function, which we set up earlier, to
02:12:44.060 | basically get a loss based on that prediction.
02:12:51.420 | Then let's do the same thing for the other GAN loss. So go in the opposite direction, and
02:12:57.960 | then we need to use the opposite discriminator, and then put that through the loss function
02:13:03.580 | again.
02:13:05.100 | And then let's do the cycle-consistency loss. So again, we take our fake, which we created
02:13:12.700 | up here, and try and turn it back again into the original. And then let's use that cycle-consistency
02:13:23.820 | loss function we created earlier to compare it to the real original.
02:13:29.040 | And here's that lambda. So there's some weight that we use, and that was set up earlier.
02:13:36.420 | We just used the default that they suggest in their options. And then do the same for
02:13:40.740 | the opposite direction, and then add them all together. Do the backward step, and that's
02:13:49.940 | it. So we can then do the same thing for the first discriminator.
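Before moving on to the discriminators, here's a rough sketch pulling those generator-side steps together, using the placeholder names from the earlier sketch rather than the repo's actual attribute names; lam is the cycle-consistency weight.

    import torch

    def generator_step(G_A, G_B, D_A, D_B, real_A, real_B,
                       gan_loss, cycle_loss, opt_G, lam=10.0):
        # One generator update, following the steps described above.
        opt_G.zero_grad()

        fake_B = G_A(real_A)                 # horse -> fake zebra
        pred   = D_B(fake_B)                 # can the zebra critic tell it's fake?
        loss_G_A = gan_loss(pred, torch.ones_like(pred))   # generator wants "real"

        fake_A = G_B(real_B)                 # zebra -> fake horse, opposite direction
        pred   = D_A(fake_A)
        loss_G_B = gan_loss(pred, torch.ones_like(pred))

        rec_A = G_B(fake_B)                  # fake zebra -> back to a horse
        rec_B = G_A(fake_A)                  # fake horse -> back to a zebra
        loss_cycle = lam * (cycle_loss(rec_A, real_A) + cycle_loss(rec_B, real_B))

        loss = loss_G_A + loss_G_B + loss_cycle
        loss.backward()
        opt_G.step()
        return fake_A, fake_B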
02:13:57.980 | And since basically all the work's been done now, there's much less to do here. So I won't
02:14:10.540 | step all through it, but it's basically the same basic stuff that we've already seen.
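For reference, a discriminator update in this style looks roughly like this, again with the placeholder names from the sketches above; the repo also samples fakes from an image history pool, which is omitted here.

    import torch

    def discriminator_step(D, real, fake, gan_loss, opt_D):
        # Push predictions on real images towards 1 and on (detached) fakes towards 0.
        opt_D.zero_grad()
        pred_real = D(real)
        pred_fake = D(fake.detach())         # don't backprop into the generator
        loss = 0.5 * (gan_loss(pred_real, torch.ones_like(pred_real)) +
                      gan_loss(pred_fake, torch.zeros_like(pred_fake)))
        loss.backward()
        opt_D.step()
        return loss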
02:14:17.300 | So optimize_parameters is basically calculating the losses and doing the optimizer steps. From
02:14:25.460 | time to time, the script saves and prints out some results, and from time to time it updates the learning
02:14:33.260 | rate, so they've got some learning rate annealing built in here as well.
02:14:37.980 | It isn't very exciting, but we can take a look at it.
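In plain PyTorch, that kind of flat-then-linearly-decaying schedule can be expressed with a LambdaLR; the epoch counts below are illustrative, not necessarily the repo's defaults.

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    # Sketch of a "flat, then linear decay" schedule applied to an optimizer.
    n_flat, n_decay = 2, 3

    def lr_lambda(epoch):
        # multiplier on the base lr: 1.0 for the first n_flat epochs,
        # then linearly down towards 0 over the next n_decay epochs
        return max(0.0, 1.0 - max(0, epoch - n_flat) / float(n_decay))

    params = [torch.nn.Parameter(torch.zeros(1))]
    opt = torch.optim.Adam(params, lr=2e-4)
    sched = LambdaLR(opt, lr_lambda=lr_lambda)

    for epoch in range(5):
        opt.step()                               # stands in for a full epoch of training
        sched.step()                             # anneal the learning rate
        print(epoch, opt.param_groups[0]['lr'])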
02:14:49.540 | So, a bit like fast.ai, they've got this idea of schedulers which
02:14:54.700 | you can then use to update your learning rates. So I think for those of you who are interested
02:15:01.380 | in better understanding deep learning APIs, or interested in contributing more to fast.ai,
02:15:08.900 | or interested in creating your own version of some of this stuff in some different backend,
02:15:15.940 | it's cool to look at a second kind of API that covers some subset of some of the similar
02:15:21.620 | things to get a sense of how are they solving some of these problems, and what are the similarities
02:15:26.740 | and what are the differences.
02:15:30.420 | So we train that for a little while, and then we can just grab a few examples, and here
02:15:39.240 | we have them. So here are our horses, here they are as zebras, and here they are back
02:15:45.820 | as horses again. Here's a zebra, into a horse, and back into a zebra; it's kind of thrown away
02:15:51.180 | its head for some reason, but not so much that it couldn't get it back again. This is a really
02:15:57.140 | interesting one, like this is obviously not what zebras look like, but it's going to be
02:16:00.620 | a zebra version of that horse. It's also interesting to see its failure cases; I guess it
02:16:05.940 | doesn't very often see basically just an eyeball, so it has no idea how to do that one. So some
02:16:13.020 | of them don't work very well, this one's done a pretty good job. This one's interesting,
02:16:18.260 | it's done a good job of that one and that one, but for some reason the one in the middle
02:16:21.140 | didn't get a go. This one's a really weird shape, but it's done a reasonable job of it.
02:16:27.980 | This one looks good, this one's pretty sloppy, again the fork just ahead, it's not bad. So
02:16:37.060 | it took me like 24 hours to train it even that far, so it's kind of slow. And I know
02:16:45.020 | Helena is constantly complaining on Twitter about how long these things take, I don't
02:16:49.620 | know how she's so productive with them. So I will mention one more thing that just came
02:16:56.980 | out yesterday, which is that there's now multimodal unpaired image-to-image translation, and
02:17:05.220 | so you can basically now create lots of different cats, for instance, from this dog. So this
02:17:13.500 | is basically not just creating one example of the output that you want, but creating
02:17:18.980 | many possible outputs. So here's a house cat to big cat, and here's a big cat to house cat,
02:17:26.060 | and this is the paper. So this came out like yesterday or the day before, I think. I think it's pretty
02:17:31.980 | amazing -- a cat and a dog. So you can kind of see how this technology is developing, and
02:17:37.940 | I think there's so many opportunities to maybe do this with music, or speech, or writing,
02:17:45.740 | or to create tools for artists, or whatever. Alright, thanks everybody, and see you next
02:17:51.100 | week.
02:17:51.300 | (audience applauds)