Back to Index

Lesson 12: Deep Learning Part 2 2018 - Generative Adversarial Networks (GANs)


Chapters

0:00 Introduction
1:05 Christine Payne
7:16 Darknet
8:55 Basic Skills
11:03 Architecture
11:50 Basic Architecture
14:38 Res Blocks
16:46 Number of Channels
18:23 InPlace
21:10 Padding
22:13 One by One Conv
26:00 Wide Residual Networks
29:44 Self-Normalising
31:28 Group Layers
37:25 Adaptive Average Pooling
46:53 Strides
48:26 GANs
54:51 Generating Pictures

Transcript

So, we're going to be talking about GANs today. Who has heard of GANs? Yeah, most of you. Very hot technology, but definitely deserving to be in the cutting-edge deep learning part of the course, because they're not quite proven to be necessarily useful for anything, but they're nearly there. They're definitely going to get there, and we're going to focus on the things where they're definitely going to be useful in practice.

There are a number of areas where they may turn out to be useful in practice, but we don't know yet. The area where I think they're definitely going to be useful in practice is the kind of thing you see on the left here, which is, for example, turning drawings into rendered pictures.

This comes from a paper that just came out two days ago, so there's very active research going on right now. Before we get there, though, let's talk about some interesting stuff from the last class. This is an interesting thing that one of our diversity fellows, Christine Payne, did.

Christine has a master's in medicine from Stanford, and so she obviously had an interest in thinking what would it look like if we built a language model of medicine. One of the things that we briefly touched on back in lesson 4 but didn't really talk much about last time is this idea that you can actually seed a generative language model, which basically means you've trained a language model on some corpus, and then you're going to generate some text from that language model.

And so you can start off by feeding it a few words, basically saying, here are the first few words to create the hidden state in the language model - now generate from there, please. And so Christine did something clever, which was to seed it with a question, repeat the question three times, and then let it generate from there.

And so she fed a language model lots of different medical texts, and then fed it this question, what is the prevalence of malaria, and the model said in the US about 10% of the population has the virus, but only about 1% is infected with the virus, about 50 to 80 million are infected.

She asked what's the treatment for ectopic pregnancy, and it said it's a safe and safe treatment for women with a history of symptoms that may have a significant impact on clinical response, most important factor is development of management of ectopic pregnancy, etc. And what I find interesting about this is that, to me as somebody who doesn't have a master's in medicine from Stanford, it's pretty close to being a believable answer to the question, but it really has no bearing on reality whatsoever. I think that's an interesting kind of ethical and user experience quandary.

So actually, I'm also involved in a company called Doc.ai that's doing a number of things, but ultimately aims to provide an app for doctors and patients which can help create a conversational user interface around helping them with their medical issues. And I've been continually saying to the software engineers on that team, please don't try to create a generative model using an LSTM or something, because they're going to be really good at creating bad advice that sounds impressive - kind of like political pundits or tenured professors, people who can say bullshit with great authority.

So I thought it was a really interesting experiment, and great to see what our diversity fellows are doing. I mean, this is why we have this program. And I suppose I shouldn't just say master's in medicine - Christine is also a Juilliard-trained classical musician, a Princeton valedictorian in physics, and a high-performance computing expert.

Yeah, okay, so she does a bit of everything. So yeah, a really impressive group of people, and great to see such exciting ideas coming out. And if you're wondering, you know, "I've done some interesting experiments, should I let people know about it?" - well, Christine mentioned this on the forum, I went on to mention it on Twitter, and I got this response asking if she's looking for a job. You may be wondering who Xavier Amatriain is. He is the founder of a hot new medical AI startup; he was previously the head of engineering at Quora, and before that he was the guy at Netflix who ran the data science team and built their recommender systems. So this is what happens if you do something cool: let people know about it, and get noticed by awesome people like Xavier.

So let's talk about CIFAR-10. And the reason I'm going to talk about CIFAR-10 is that we're going to be looking at some more bare-bones PyTorch stuff today to build these generative adversarial models. There's no real fastai support to speak of at all for GANs at the moment - I'm sure there will be soon enough, but currently there isn't - so we're going to be building a lot of models from scratch.

It's been a while since we've done serious model building - a little bit of model building I guess for our bounding box stuff, but really all the interesting stuff there was the loss function. So we looked at CIFAR-10 in part 1 of the course, and we built something which was getting about 85% accuracy and took, I can't remember, a couple of hours to train.

Interestingly, there's a competition going on now through the Stanford DAWNBench to see who can actually train CIFAR-10 the fastest, where the goal is to get it to train to 94% accuracy. So it'd be interesting to see if we can build an architecture that can get to 94% accuracy, because that's a lot better than our previous attempt, and hopefully in doing so we'll learn something about creating good architectures.

That will be then useful for looking at these GANs today, but I think also it's useful because I've been looking much more deeply into the last few years' papers about different kinds of CNN architectures and realized that a lot of the insights in those papers are not being widely leveraged and clearly not widely understood.

So I want to show you what happens if we can leverage some of that understanding. So I've got this notebook called CIFAR-10 Darknet. That's because the architecture we're going to look at is really very close to the darknet architecture. But you'll see in the process that what we're taking from darknet is not the whole YOLO v3 end-to-end thing, but just the part of it that they pre-trained on ImageNet to do classification.

It's almost the most generic, simple architecture you could come up with, and so it's a really great starting point for experiments. So we're going to call it darknet, but it's not quite darknet, and you can fiddle around with it to create things that definitely aren't darknet. It's really just the basis of nearly any modern ResNet-based architecture.

So CIFAR-10, remember, is a fairly small dataset. The images are only 32x32 in size. And I think it's a really great dataset to work with because you can train it relatively quickly, unlike ImageNet. It's a relatively small amount of data, unlike ImageNet, and it's actually quite hard to recognize the images because 32x32 is kind of too small to easily see what's going on.

So it's somewhat challenging. So I think it's a really underappreciated dataset because it's old, and who at DeepMind or OpenAI wants to work with a small old dataset when they could use their entire server room to process something much bigger. But to me, I think this is a really great dataset to focus on.

So we'll go ahead and import our usual stuff, and we're going to try and build a network from scratch to train this with. One thing that I think is a really good exercise for anybody who's not 100% confident with their broadcasting and basic PyTorch skills is to figure out how I came up with these numbers.

So these numbers are the averages for each channel and the standard deviations for each channel in CIFAR-10. So as a bit of homework, just make sure you can recreate those numbers, and see if you can do it in no more than a couple of lines of code - no loops.
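If you want to check your answer afterwards, here's a minimal sketch of one way to do it, assuming the training images have already been loaded into a single NumPy array of shape (N, 32, 32, 3) with values scaled to [0, 1] (the array below is just a placeholder):

```python
import numpy as np

# Assume `imgs` holds every CIFAR-10 training image, shape (N, 32, 32, 3), scaled to [0, 1].
imgs = np.random.rand(50000, 32, 32, 3)   # placeholder -- load the real data here

# Mean and standard deviation over every axis except the channel axis: no loops needed.
channel_means = imgs.mean(axis=(0, 1, 2))
channel_stds = imgs.std(axis=(0, 1, 2))

print(channel_means, channel_stds)
```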

Ideally you want to do it in one go if you can. Because these are fairly small, we can use a larger batch size than usual, 256, and the size of these images is 32. Transformations - normally we have this standard set of side-on transformations we use for photos of normal objects.

We're not going to use that here though because these images are so small that trying to rotate a 32x32 image a bit is going to introduce a lot of blocking kind of distortions. So the kind of standard transforms that people tend to use is a random horizontal flip, and then we add size divided by 8, so 4 pixels of padding on each side.

And one thing which I find works really well is by default FastAI doesn't add black padding, which basically every other library does. We actually take the last 4 pixels of the existing photo and flip it and reflect it, and we find that we get much better results by using this reflection padding by default.

So now that we've got a 36x36 image, this set of transforms in training will randomly pick a 32x32 crop. So we get a little bit of variation, but not heaps. Alright, so we can use our normal from paths to grab our data. So we now need an architecture. And what we're going to do is create an architecture which fits in one screen.
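Before moving on to the architecture, here's a rough torchvision equivalent of that augmentation pipeline, as a sketch (the lesson itself uses fastai's own transforms, and the stats below are just the commonly quoted approximate CIFAR-10 values - substitute the ones you computed):

```python
import torchvision.transforms as T

# Approximate per-channel CIFAR-10 stats -- plug in the ones you computed as homework.
stats = ([0.491, 0.482, 0.447], [0.247, 0.243, 0.262])

train_tfms = T.Compose([
    T.RandomCrop(32, padding=4, padding_mode='reflect'),  # pad by size/8 = 4 with reflection, then random 32x32 crop
    T.RandomHorizontalFlip(),                             # the standard side-on flip
    T.ToTensor(),
    T.Normalize(*stats),
])
```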

So this is from scratch. As you can see, I'm using the predefined Conv2d, BatchNorm2d and LeakyReLU modules, but I'm not using any predefined blocks or anything - they're all being defined here. So the entire thing is here on one screen. So if you're ever wondering, can I understand a modern, good-quality architecture? Absolutely.

Let's study this one. So my basic starting point with an architecture is to say it's a stacked bunch of layers. And generally speaking there's going to be some kind of hierarchy of layers. So at the very bottom level there's things like a convolutional layer and a batch norm layer.

So if you're thinking anytime you have a convolution, you're probably going to have some standard sequence and normally it's going to be conv, batch norm, then a nonlinear activation. So I try to start right from the top by saying, okay, what are my basic units going to be? And so by defining it here, that way I don't have to worry about trying to keep everything consistent and it's going to make everything a lot simpler.

So here's my conv layer, and so anytime I say conv layer, I mean conv, batch norm, relu. Now I'm not quite saying relu, I'm saying leaky relu, and I think we've briefly mentioned it before, but the basic idea is that normally a relu looks like that. Hopefully you all know that now.

A leaky relu looks like that. So this part, as before, has a gradient of 1, and this part has a gradient that can vary, but something around 0.1 or 0.01 is common. And the idea behind it is that when you're in this negative zone here, you don't end up with a zero gradient, which would make it very hard to update.

In practice, people have found leaky relu more useful on smaller datasets and less useful on big datasets, but it's interesting that for the YOLO version 3 paper, they did use a leaky relu and got great performance from it. So it rarely makes things worse, and it often makes things better.

So if you need to create your own architecture, it's probably not a bad default go-to to use leaky relu. You'll notice I don't define a PyTorch module here; I just go ahead and use Sequential. This is something that, if you read other people's PyTorch code, is really underutilized.

People tend to write everything as a PyTorch module with an init and a forward. But if the thing you want is just a sequence of things one after the other, it's much more concise and easy to understand to just make it a sequential. So I've just got a simple plain function that just returns a sequential model.
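As a sketch, the conv layer function described here looks something like this (the names and defaults are my reconstruction of what's on screen):

```python
import torch.nn as nn

def conv_layer(ni, nf, ks=3, stride=1):
    # conv -> batch norm -> leaky ReLU, returned as a plain Sequential rather than a custom Module
    return nn.Sequential(
        nn.Conv2d(ni, nf, kernel_size=ks, stride=stride,
                  padding=ks // 2, bias=False),          # bias is redundant when batch norm follows
        nn.BatchNorm2d(nf),
        nn.LeakyReLU(negative_slope=0.1, inplace=True))  # inplace saves a chunk of memory
```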

So I mentioned that there's generally a number of hierarchies of units in most modern networks. And I think we know now that the next level in this unit hierarchy for ResNets is the res block, or the residual block - I call it here a res layer. And back when we last did CIFAR-10, I oversimplified this - I cheated a little bit.

We had x coming in, and we put that through a conv, and then we added it back up to x to go out. So in general, your output is equal to your input plus some function of your input. And the thing we did last time was we made f a 2D conv.

And actually, in the real res block, there's actually two of them. So it's actually conv of conv of x. And when I say conv, I'm using this as a shortcut for our conv layer. So you can see here, I've created two convs, and here it is. I take my x, put it through the first conv, put it through the second conv, and add it back up to my input again to get my basic res block.
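Here's a sketch of that res layer, reusing the conv_layer above and the bottleneck channel counts that are explained just below (ni down to ni/2 and back up to ni):

```python
import torch.nn as nn

class ResLayer(nn.Module):
    """Residual block: x plus conv(conv(x)), leaving channel count and grid size unchanged."""
    def __init__(self, ni):
        super().__init__()
        self.conv1 = conv_layer(ni, ni // 2, ks=1)  # 1x1 conv squishes the channels down
        self.conv2 = conv_layer(ni // 2, ni, ks=3)  # 3x3 conv takes them back up

    def forward(self, x):
        # The lesson mentions an in-place add_ here to save memory; the plain +
        # keeps this sketch simple and always autograd-safe.
        return x + self.conv2(self.conv1(x))
```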

So, one interesting insight here is what are the number of channels in these convolutions? So we've got coming in some number of input filters. The way that the darknet folks set things up is they said we're going to make every one of these res layers spit out the same number of channels that came in.

And I kind of like that - that's why I used it here, because it makes life simpler. And so what they did is they said, let's have the first conv halve the number of channels, and then the second conv double it again. So ni goes to ni/2, and then ni/2 goes back to ni.

So you've kind of got this funneling thing where if you've got, say, 64 channels coming in, it gets squished down by the first conv to 32 channels, and then taken back up again to 64 channels coming out. Yes, Rachel? Why is inplace=True in the leaky ReLU?

Oh, thanks for asking. A lot of people forget this or don't know about it. But this is a really important memory technique. If you think about it, this conv layer is like the lowest level thing, so pretty much everything in our resnet once it's all put together is going to be conv layers, conv layers, conv layers.

If you don't have inplace=True, it's going to create a whole separate piece of memory for the output of the ReLU. So it's going to allocate a whole bunch of memory that's totally unnecessary. And actually, since I wrote this, I came up with another idea the other day, which I'll now implement, which is you can do the same thing for the res layer - let's just reorder this to say x plus that - you can actually do the same thing here.

Hopefully some of you might remember that in PyTorch, pretty much every function has an underscore-suffix version which says do that in place. So as well as plus, there's also add, and add_ is the in-place version. And so that's now suddenly reduced my memory there as well. So these are really handy little tricks.

And I actually forgot the inplace equals true at first for this, and I literally was having to decrease my batch size to much lower amounts than I knew should be possible, and it was driving me crazy, and then I realized that that was missing. You can also do that with dropout, by the way, if you have dropout.

So dropout and all the activation functions you can do inplace, and then generally any arithmetic operation you can do inplace as well. Why is bias usually in ResNet set to false in the conv layer? If you're watching the video, pause now and see if you can figure this out, because this is a really interesting question.

Why don't we need bias? So I'll wait for you to pause. Welcome back. So if you've figured it out, here's the thing, immediately after the conv is a batch norm. And remember batch norm has two learnable parameters for each activation, the kind of the thing you multiply by and the thing you add.

So if we had a bias here to add, and then the batch norm adds another thing, we're adding two things, which is totally pointless - that's two weights where one would do. So if you have a batch norm after a conv, then you can either tell the batch norm, don't include the add bit please, or - easier - just say don't include the bias in the conv.

There's no particular harm, but again, it's going to take more memory because that's more gradients that it has to keep track of. So best to avoid. Also another thing, a little trick, is most people's conv layers have padding as a parameter, but generally speaking you should be able to calculate the padding easily enough.

And I see people try to implement special same padding modules and all kinds of stuff like that. But if you've got a stride 1, and you've got a kernel size of 3, then obviously that's going to overlap by one unit on each side, so we want padding of 1.

Whereas if the kernel size is 1, then we don't need any padding. So in general, padding of kernel size integer-divided by 2 is what you need. There are some tweaks sometimes, but in this case this works perfectly well. So again, I'm trying to simplify my code by having the computer calculate stuff for me rather than me having to do it myself.

Another thing to notice with these two conv layers, besides this idea of a bottleneck - reducing the channels and then increasing them again - is what kernel size we use. So here is a 1x1 conv, and this is again something you might want to pause the video now and think about: what's a 1x1 conv really?

What actually happens in a 1x1 conv? So if we've got a little 4x4 grid here, and of course there's a filter or channels axis as well, maybe that's like 32, and we're going to do a 1x1 conv. So what's the kernel for a 1x1 conv going to look like?

It's going to be 1x1 by 32. So remember, when we talk about the kernel size we never mention that last piece, but let's say it's 1x1 by 32, because that's part of the filters in and filters out. So in other words, what happens is this kernel gets placed first of all here on the first cell, and we basically get a dot product of that 32-deep kernel with this 32-deep bit of the grid, and that's going to give us our first output.

And then we're going to take that 32-deep kernel and put it on the second cell to get the second output. So it's basically going to be a bunch of little dot products, one for each point in the grid. So what it basically is, then, is something which allows us to change the dimensionality in whatever way we want in the channel dimension.

And so that would be one of our filters. And so in this case we're creating ni divided by 2 of these, so we're going to have ni divided by 2 of these dot products, all with different weighted averages of the input channels. So it basically lets us, with very little computation, add this additional step of calculations and non-linearities.

So that's a cool trick, this idea of taking advantage of these 1x1 convs, creating this bottleneck and then pulling it out again with 3x3 convs, which actually take advantage of the 2D nature of the input properly. The 1x1 conv doesn't take advantage of that at all.

So these two lines of code - there's not much in it, but it's a really great test of your understanding and your intuition about what's going on. Why is it a 1x1 conv going from ni to ni/2 channels, followed by a 3x3 conv going from ni/2 back to ni channels?

Why does it work? Why do the tensor ranks line up? Why do the dimensions all line up nicely? Why is it a good idea? What's it really doing? It's a really good thing to fiddle around with, maybe create some small ones in Jupyter Notebook, run them yourself, see what inputs and outputs come in and out.
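For example, a quick shape check along those lines might look like this (the channel counts are just illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 4, 4)                          # batch of 1, 64 channels, 4x4 grid

squeeze = nn.Conv2d(64, 32, kernel_size=1)            # 1x1 conv: mixes channels only
expand = nn.Conv2d(32, 64, kernel_size=3, padding=1)  # 3x3 conv: looks at spatial neighbours too

print(squeeze(x).shape)                               # torch.Size([1, 32, 4, 4])
print(expand(squeeze(x)).shape)                       # torch.Size([1, 64, 4, 4]) -- same as x, so the residual add lines up
```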

Really get a feel for that. Once you've done so, you can then play around with different things. One of the really unappreciated papers is this one, Wide Residual Networks. It's really quite a simple paper, but what they do is they basically fiddle around with these two lines of code.

And what they do is they say, well, what if this wasn't divided by 2, but what if it was times 2? That would be totally allowable - that's going to line up nicely. Or what if we added another conv after this, so this 3x3 was actually ni/2 to ni/2, and then the new one goes from ni/2 back to ni.

Again, that's going to work, right? Kernel sizes 1, 3, 1 - halve the number of filters, leave it at half, and then double it again at the end. And so they come up with this kind of simple notation for defining what this block can look like, and then they show lots of experiments.

And basically what they show is that this approach of bottlenecking, of decreasing the number of channels, which is almost universal in resnets, is probably not a good idea. In fact from the experiment, it's definitely not a good idea. Because what happens is it lets you create really deep networks.

The guys who created ResNets got particularly famous for creating a 1,001-layer network. But the thing about 1,001 layers is you can't calculate layer 2 until you finish layer 1; you can't calculate layer 3 until you finish layer 2. So it's sequential, and GPUs don't like sequential. So what they showed is that you're better off with fewer layers but more calculations per layer - and one easy way to do that would be to remove the /2.

No other changes. Like, try this at home. Try running CIFAR-10 and see what happens. Or maybe even multiply it by 2, or fiddle around. And that basically lets your GPU do more work. And it's very interesting, because the vast majority of papers that talk about the performance of different architectures never actually time how long it takes to run a batch through them.

They literally say this one requires x number of floating-point operations per batch, but then they never actually bother to run the damn thing like a proper experimentalist and find out whether it's faster or slower. And so a lot of the architectures that are really famous now turn out to be slow as molasses, take craploads of memory, and are just totally useless, because the researchers never actually bothered to see whether they're fast and whether they fit in RAM with normal batch sizes.

So the wide resnet paper is unusual in that it actually times how long it takes, as does the YOLO version 3 paper, which made the same insight. I'm not sure they might have missed the wide resnets paper because the YOLO version 3 paper came to a lot of the same conclusions, but I'm not even sure they cited the wide resnets paper, so they might not be aware that all that work's been done.

But it's great to see both of them actually timing things and noticing what actually makes sense. Yes, Rich? SELU looked really hot in the paper which came out, but I noticed that you don't use it. What's your opinion on SELU? So SELU is something largely for fully connected layers which allows you to get rid of batch norm, and the basic idea is that if you use this different activation function, it's self-normalizing.

Self-normalizing means that it will always remain at a unit standard deviation and zero mean, and therefore you don't need batch norm - that's the idea behind SELU, which comes from the Self-Normalizing Neural Networks paper. It hasn't really gone anywhere, and the reason it hasn't really gone anywhere is because it's incredibly finicky. You have to use a very specific initialization, otherwise it doesn't start with exactly the right standard deviation and mean.

It's very hard to use it with things like embeddings. If you do, then you have to use a particular kind of embedding initialization which doesn't necessarily make sense for embeddings. You do all this work very hard to get it right, and if you do finally get it right, what's the point? You've managed to get rid of some batch norm layers which weren't really hurting you anyway.

It's interesting, because that SELU paper - I think one of the reasons people noticed it, or in my experience the main reason people noticed it, was because it was created by the inventor of LSTMs, and also it had a huge mathematical appendix, and people were like, "Lots of maths from a famous guy, this must be great!" But in practice I don't see anybody using it to get any state-of-the-art results or win any competitions or anything like that.

This is some of the tiniest bits of code we've seen, but there's so much here and it's fascinating to play with. Now we've got this block which is built on this block, and then we're going to create another block on top of that block. We're going to call this a group layer, and it's going to contain a bunch of res layers.

A group layer is going to have some number of channels or filters coming in, and what we're going to do is we're going to double the number of channels coming in by just using a standard conv layer. Optionally, we'll halve the grid size by using a stride of 2, and then we're going to do a whole bunch of res blocks, a whole bunch of res layers.

We can pick how many. That could be 2 or 3 or 8. Because remember, these res layers don't change the grid size and they don't change the number of channels. You can add as many as you like, anywhere you like, without causing any problems. It's just going to use more computation and more RAM, but there's no reason other than that you can't add as many as you like.

A group layer, therefore, is going to end up doubling the number of channels because it's this initial convolution which doubles the number of channels. And depending on what we pass in a stride, it may also halve the grid size if we put stride=2. And then we can do a whole bunch of res block computations as many as we like.
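A sketch of that group layer as a plain function returning a list of modules, in the spirit of what's described (the exact signature is my reconstruction):

```python
def make_group_layer(ch_in, num_blocks, stride=1):
    # One conv that doubles the channels (and halves the grid if stride=2),
    # followed by num_blocks res layers, which change neither.
    return [conv_layer(ch_in, ch_in * 2, stride=stride)] + \
           [ResLayer(ch_in * 2) for _ in range(num_blocks)]
```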

So then to define our dark net, or whatever we want to call this thing, we're just going to pass in something that looks like this. And what this says is, create 5 group layers. The first one will contain 1 of these extra res layers. The second will contain 2, then 4, then 6, then 3.

And I want you to start with 32 filters. So the first one of these res layers will contain 32 filters, and there will just be one extra res layer. The second one is going to double the number of filters because that's what we do. Each time we have a new group layer, we double the number.

So the second one will have 64, then 128, then 256, then 512, and then that will be it. So nearly all of the network is going to be those bunches of layers. And remember, every one of those group layers also has one convolution at the start. And then, apart from all that, we're going to have one convolutional layer at the very start, and at the very end we're going to do our standard adaptive average pooling, flatten, and a linear layer to create the number of classes out at the end.

So one convolution at one end; adaptive pooling and one linear layer at the other end; and in the middle, these group layers, each one consisting of a convolutional layer followed by some number of res layers. And that's it. Again, I think we've mentioned this a few times, but I'm yet to see any code out there, any examples, anything anywhere, that uses adaptive average pooling.

Everyone I've seen writes it like this, and puts a particular number here, which means that it's now tied to a particular image size, which definitely isn't what you want. So most people, even the top researchers I speak to, are still under the impression that a specific architecture is tied to a specific size, and that's a huge problem, because it really limits their ability to use smaller sizes to kickstart their modeling, or to use smaller sizes for doing experiments and stuff like that.

Again, you'll notice I'm using sequential here, but a nice way to create architectures is to start out by creating a list. In this case, this is a list with just one conv layer in, and then my function here, make_group_layer, it just returns another list. So then I can just go plus equals, appending that list to the previous list, and then I can go plus equals to append this bunch of things to that list, and then finally sequential of all those layers.

So that's a very nice thing. So now my forward is just self.layers(x). So this is a nice kind of picture of how to make your architectures as simple as possible. So you can now go ahead and create this, and as I say, you can fiddle around. You could even parameterize that divide-by-2 to be a number you pass in, so maybe it's times 2 instead.

You could pass in things that change the kernel size or change the number of conv layers, fiddle around with it, and maybe you can create something -- I've actually got a version of this which I'm about to run for you -- which kind of implements all of the different parameters that's in that wide ResNet paper, so I could fiddle around to see what worked well.
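Putting it all together, here's a sketch of the whole thing along the lines described (the stride choice for the first group layer, and nn.Flatten in place of fastai's Flatten, are my assumptions; the real notebook may differ in detail):

```python
import torch.nn as nn

class Darknet(nn.Module):
    # One initial conv, then a series of group layers (each doubling the channels),
    # then adaptive average pooling, flatten, and a linear classifier head.
    def __init__(self, num_blocks, num_classes=10, nf=32):
        super().__init__()
        layers = [conv_layer(3, nf, ks=3, stride=1)]
        for i, nb in enumerate(num_blocks):
            # keep the grid at 32x32 for the first group, then halve it each time
            layers += make_group_layer(nf, nb, stride=1 if i == 0 else 2)
            nf *= 2
        layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(nf, num_classes)]
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

model = Darknet([1, 2, 4, 6, 3], num_classes=10, nf=32)
```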

So once we've got that, we can use ConvLearner.from_model_data to take our PyTorch module and the model data object and turn them into a learner, give it a criterion, add some metrics if we like, and then we can call fit and away we go. Could you please explain adaptive average pooling? How does setting it to 1 work?

How does setting to one work? Sure. Before I do, since we've only got a certain amount of time in this class, I do want to see how we go with this simple network against these state-of-the-art results. So to make life a little easier, we can start it running now and see how it looks later.

So I've got the command ready to go. So we've basically taken all that stuff and put it into a simple little Python script, and I've modified some of those parameters I mentioned to create something I've called a WRN22 network, which doesn't officially exist, but it's got a bunch of changes to the parameters we talked about based on my experiments.

We're going to use the new Leslie Smith one-cycle thing. So there's quite a bunch of cool stuff here. The one-cycle implementation was done by our student Sylvain Gugger, the CIFAR training experiments were largely done by Brett Koonce, and stuff like getting the half-precision floating-point implementation integrated into fastai was done by Andrew Shaw.

So it's been a cool bunch of different student projects coming together to allow us to run this. So this is going to run on an AWS P3, which has eight GPUs. The P3 has these newer Volta architecture GPUs, which actually have special support for half-precision floating point.

Fastai is the first library I know of to actually integrate the Volta-optimized half-precision floating point into the library, so we can just go learn.half now and get that support automatically. And it's also the first one to integrate one-cycle. So these are the parameters for the one-cycle, and we can go ahead and get this running.

So what this actually does is it's using PyTorch's multi-GPU support. Since there are eight GPUs, it's actually going to fire off eight separate Python processes, and each one's going to train on a little bit, and then at the end it's going to pass the gradient updates back to the master process that's going to integrate them all together.

So you'll see, here they are, lots of progress bars all popping up together. And you can see it's taking three or four seconds per epoch when you do it this way. When I was training earlier, I was getting about 30 seconds per epoch. So doing it this way, we can train things something like 10 times faster, which is pretty cool.

Okay, so we'll leave that running. So you were asking about adaptive average pooling, and I think specifically what's the number 1 doing? So normally when we're doing average pooling, let's say we've got 4x4. Let's say we did average pooling 2, 2. Then that creates a 2x2 area and takes the average of those 4, and then we can pass in the stride.

So if we said stride 1, then the next one is we would look at this block of 2x2 and take that average, and so forth. So that's what a normal 2x2 average pooling would be. And so in that case, if we didn't have any padding, that would spit out a 3x3, because it's 2 here, 2 here, 2 here.

And if we added padding, we can make it 4x4. So if we wanted to spit out something, we didn't want 3x3, what if we wanted 1x1? Then we could say average pool 4, 4. And so that's going to do 4, 4, and average the whole lot. And that would spit out 1x1.

But that's just one way to do it. Rather than saying the size of the pooling filter, why don't we instead say, I don't care what the size of the input grid is, I always want 1x1. So that's where then you say "adaptive average pool", and now you don't say what's the size of the pooling filter, you instead say what's the size of the output I want.

And so I want something that's 1x1. And if you only put a single int, it assumes you mean 1x1. So in this case, adaptive average pooling 1 with a 4x4 grid coming in is the same as average pooling 4, 4. If it was a 7x7 grid coming in, it would be the same as 7, 7.
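A quick illustrative comparison of the two in PyTorch (the shapes are just for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 4, 4)            # 512 channels on a 4x4 grid

print(nn.AvgPool2d(4)(x).shape)          # torch.Size([1, 512, 1, 1]) -- filter size tied to the input grid
print(nn.AdaptiveAvgPool2d(1)(x).shape)  # torch.Size([1, 512, 1, 1]) -- you say the output size you want instead

y = torch.randn(1, 512, 7, 7)
print(nn.AdaptiveAvgPool2d(1)(y).shape)  # torch.Size([1, 512, 1, 1]) -- the same layer works on a 7x7 grid too
```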

So it's the same operation, it's just expressing it in a way that says regardless of the input, I want something of that size to output. We got to 94, and it took 3 minutes and 11 seconds, and the previous state-of-the-art was 1 hour and 7 minutes. So was it worth fiddling around with those parameters and learning a little bit about how these architectures actually work and not just using what came out of the box?

Well, holy shit - we just used a publicly available instance, a spot instance, so it cost us $8 per hour for 3 minutes. It cost us a few cents to train this from scratch, 20 times faster than anybody's ever done it before. That previous state-of-the-art result was already crazy, but this one just blew it out of the water.

This is partly thanks to fiddling around with those parameters of the architecture. Mainly, frankly, it's about using Leslie Smith's one-cycle thing and Sylvain's implementation of it. As a reminder of what that's doing: this axis is batches, and this is the learning rate. It creates an upward path that's equally long as the downward path, so it's a true CLR, a triangular cyclical learning rate.

As per usual, you can pick the ratio between those two numbers. So x divided by y in this case is the number that you get to pick. In this case, we picked 50, so we started out with a much smaller one here. And then it's got this cool idea which is you get to say what percentage of your epochs then is spent going from the bottom of this down all the way down pretty much to zero.

That's what this second number here is. So 15% of the batches is spent going from the bottom of our triangle even further. So importantly though, that's not the only thing one cycle does. We also have momentum, and momentum goes from 0.95 to 0.85 like this. In other words, when the learning rate is really low, we use a lot of momentum, and when the learning rate is really high, we use very little momentum, which makes a lot of sense.

But until Leslie Smith showed this in that paper, I'd never seen anybody do it before, so it's a really cool trick. You can now use that by using the use_clr_beta parameter in fastai, and you should be able to basically replicate this state-of-the-art result. You can use it on your own computer or your Paperspace machine.
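To make the shape of that schedule concrete, here's a small self-contained sketch that computes (learning rate, momentum) pairs following the description above - a simplified illustration, not fastai's actual implementation:

```python
def one_cycle(n_iter, max_lr, div=50, ann_pct=0.15, moms=(0.95, 0.85)):
    """Linear ramp up to max_lr and back down (equal lengths), then a final
    ann_pct of iterations annealing towards zero; momentum moves opposite
    to the learning rate, as described above."""
    ann = int(n_iter * ann_pct)            # final annealing phase
    half = (n_iter - ann) // 2             # up-ramp and down-ramp are the same length
    lo = max_lr / div                      # starting (and pre-anneal ending) learning rate
    sched = []
    for i in range(n_iter):
        if i < half:                                   # LR up, momentum down
            t = i / half
            lr, mom = lo + t * (max_lr - lo), moms[0] + t * (moms[1] - moms[0])
        elif i < 2 * half:                             # LR down, momentum back up
            t = (i - half) / half
            lr, mom = max_lr - t * (max_lr - lo), moms[1] + t * (moms[0] - moms[1])
        else:                                          # anneal towards zero
            t = (i - 2 * half) / max(ann, 1)
            lr, mom = lo * (1 - t), moms[0]
        sched.append((lr, mom))
    return sched

# e.g. plot [lr for lr, _ in one_cycle(1000, 1.0)] to see the triangle plus the final tail
```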

Obviously the only thing you won't get is the multi-GPU piece, but that makes it a bit easier to train anyway, so you should be able to beat this on a single GPU. make_group_layer contains stride=2, so this means stride is 1 for layer 1 and 2 for everything else. What's the logic behind it? Usually the strides I've seen are odd.

What's the logic behind it? Usually the strides I've seen are odd. Strides are either 1 or 2, I think you're thinking of kernel sizes. So stride=2 means that I jump 2 across, and so a stride of 2 means that you halve your grid size. I think you might have got confused between stride and kernel size there.

If we have a stride of 1, the grid size doesn't change. If we have a stride of 2, then it does. In this case, this is for CIFAR-10. 32x32 is small, and we don't get to halve the grid size very often, because pretty quickly we're going to run out of cells.

That's why the first layer has a stride of 1, so we don't decrease the grid size straight away, basically. It's kind of a nice way of doing it, because that's why we have a low number here, so we can start out with not too much computation on the big grid, and then we can gradually do more and more computation as the grids get smaller and smaller.

Because with a smaller grid, the computation will take less time. Let's take a slightly early break and come back at 7:30. So we're going to talk about generative adversarial networks, also known as GANs, and specifically we're going to focus on the Wasserstein GAN paper, which included some guy called Soumith Chintala, who went on to create some piece of software called PyTorch.

The Wasserstein GAN - I'm just going to call it WGAN to save time - was heavily influenced by the DCGAN, or deep convolutional generative adversarial networks, paper, which Soumith was also involved with. It's a really interesting paper to read. A lot of it looks like this.

The good news is you can skip those bits, because there's also a bit that looks like this which says do these things. Now I will say though that a lot of papers have a theoretical section which seems to be there entirely to get past the reviewer's need for theory.

That's not true of the WGAN paper. The theory bit is actually really interesting. You don't need to know it to use it, but if you want to learn about some cool ideas and see the thinking behind why this particular algorithm looks the way it does, it's absolutely fascinating. Before this paper came out, I literally knew nobody who had studied the math that it's based on, so everybody had to learn it.

The paper does a pretty good job of laying out all the pieces. You'll have to do a bunch of reading yourself. If you're interested in digging into the deeper math behind some paper to see what it's like to study it, I would pick this one. Because at the end of that theory section, you'll come away saying, okay, I can see now why they made this algorithm the way it is.

And then having come up with that idea, the other thing is often these theoretical sections are very clearly added after they come up with the algorithm. They'll come up with the algorithm based on intuition and experiments, and then later on post-hoc justify it. Whereas this one you can clearly see it's like, okay, let's actually think about what's going on in GANs and think about what they need to do and then come up with the algorithm.

So the basic idea of a GAN is it's a generative model. So it's something that is going to create sentences or create images. It's going to generate stuff. And it's going to try and create stuff which is very hard to tell the difference between generated stuff and real stuff.

So a generative model could be used to face-swap a video, a very well-known controversial thing of deep fakes and fake pornography and stuff happening at the moment. It could be used to fake somebody's voice. It could be used to fake the answer to a medical question. But in that case, it's not really a fake.

It could be a generative answer to a medical question that's actually a good answer - so you're generating language. You could generate a caption for an image, for example. So generative models have lots of interesting applications. But generally speaking, they need to be good enough that, for example, if you're using one to automatically create a new scene for Carrie Fisher in the next Star Wars movie, and she's not around to play that part anymore, and you want to generate an image of her that looks the same, then it has to fool the Star Wars audience into thinking it doesn't look like some weird Carrie Fisher - it looks like the real Carrie Fisher.

Or if you're trying to generate an answer to a medical question, you want to generate English that reads nicely and clearly and sounds authoritative and meaningful. So the idea of a generative adversarial network is we're going to create not just a generative model to create, say, the generated image, but a second model that's going to try to pick which ones are real and which ones are generated.

We're going to call them fake. So which ones are real and which ones are fake? So we've got a generator that's going to create our fake content and a discriminator that's going to try to get good at recognizing which ones are real and which ones are fake. So there's going to be two models.

And then there's going to be adversarial, meaning the generator is going to try to keep getting better at fooling the discriminator into thinking that fake is real, and the discriminator is going to try to keep getting better at discriminating between the real and the fake. And they're going to go head-to-head, like that.

And it's basically as easy as I just described. It really is. We're just going to build two models in PyTorch. We're going to create a training loop that first of all says the loss function for the discriminator is can you tell the difference between real and fake, and then update the weights of that.

And then we're going to create a loss function for the generator, which is going to say: can you generate something which fools the discriminator? And update the weights from that loss. And we're going to loop through that a few times and see what happens. And so let's come back to the pseudocode here of the algorithm, and let's read the real code first.
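As a heavily simplified sketch of that alternating loop, in the WGAN style used here (the critic outputs a plain score, lower meaning "looks real"); the real algorithm also trains the critic several steps per generator step and clips its weights, which is left out, and all the names below are placeholders:

```python
import torch

def train_gan(netD, netG, data_loader, opt_D, opt_G, nz=100, n_epochs=1, device='cpu'):
    for epoch in range(n_epochs):
        for real, _ in data_loader:
            real = real.to(device)
            bs = real.size(0)

            # 1) Discriminator/critic step: push real scores down, fake scores up.
            opt_D.zero_grad()
            noise = torch.randn(bs, nz, 1, 1, device=device)
            fake = netG(noise).detach()                 # don't backprop into the generator here
            loss_D = netD(real).mean() - netD(fake).mean()
            loss_D.backward()
            opt_D.step()

            # 2) Generator step: make the critic give the fakes a low ("real-looking") score.
            opt_G.zero_grad()
            noise = torch.randn(bs, nz, 1, 1, device=device)
            loss_G = netD(netG(noise)).mean()
            loss_G.backward()
            opt_G.step()
```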

So there's lots of different things you can do with GANs. And we're going to do something that's kind of boring but easy to understand, and it's kind of cool that it's even possible. We're just going to generate some pictures from nothing. We're just going to get it to draw some pictures.

And specifically we're going to get it to draw pictures of bedrooms. You'll find if you hopefully get a chance to play around with this during the week with your own datasets, if you pick a dataset that's very varied, like ImageNet, and then get a GAN to try and create ImageNet pictures, it tends not to do so well because it's not really clear enough what you want a picture of.

So it's better to give it, for example, there's a dataset called CelebA, which is pictures of celebrity faces. That works great with GANs. You create really clear celebrity faces that don't actually exist. The bedroom dataset, also a good one. Lots of pictures of the same kind of thing. So that's just a suggestion.

So there's something called the LSUN scene classification dataset. You can download it using these steps. It's pretty huge, so I've actually created a Kaggle dataset of a 20% sample. So unless you're really excited about generating bedroom images, you might prefer to grab the 20% sample. So then we do the normal steps of creating some different paths.

In this case, as we do before, I find it much easier to go the CSV route when it comes to handling our data. So I just generate a CSV with the list of files that we want and a fake label that's zero because we don't really have labels for these at all.

So I actually create two CSV files, one that contains everything in that bedroom dataset and one that just contains a random 10%. It's just nice to do that because then I can most of the time use the sample when I'm experimenting. Because there's well over a million files, even just reading in the list takes a while.
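A small sketch of that CSV step - the paths here are hypothetical, so point them at wherever you unpacked the bedroom images:

```python
import os
import pandas as pd

PATH = 'data/lsun/'                         # hypothetical location of the data
files = [f for f in os.listdir(os.path.join(PATH, 'bedroom')) if f.endswith('.jpg')]

# One CSV with everything, using a dummy label of 0 since there are no real labels.
df = pd.DataFrame({'fn': ['bedroom/' + f for f in files], 'label': 0})
df.to_csv(os.path.join(PATH, 'files.csv'), index=False, header=False)

# ...and a random 10% sample for quicker experiments.
df.sample(frac=0.1).to_csv(os.path.join(PATH, 'files_sample.csv'), index=False, header=False)
```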

So this will look pretty familiar. So here's a conv block. This is from before I realized that sequential models are much better, so if you compare this to my previous conv block with a sequential model, there are just a lot more lines of code here. But it does the same thing: conv, ReLU, batch norm.

And we calculate our padding, and here's bias=False. So this is basically the same as before, but with a little bit more code. So the first thing we're going to do is build a discriminator. So a discriminator is going to receive as input an image, and it's going to spit out a number.

And the number is meant to be lower if it thinks this image is real. Of course, the "what do we do with a lower number" part doesn't appear in the architecture - that will be in the loss function. So all we have to do is create something that takes an image and spits out a number.

So a lot of this code is borrowed from the original authors of the paper, so some of the naming scheme and stuff is different to what we're used to. So sorry about that. But I've tried to make it look at least somewhat familiar. I probably should have renamed things a little bit.

But it looks very similar to what we actually had before. We start out with a convolution - remember, a conv block is conv, ReLU, batch norm - and then we have a bunch of extra conv layers. This is not going to use residuals. It looks very similar to before, a bunch of extra layers, but these are going to be conv layers rather than res layers.

And then at the end, we need to append enough stride-2 conv layers that we decrease the grid size down to be no bigger than 4x4. So it's going to keep using stride 2, divide the size by 2, stride 2, divide the size by 2, until our grid size is no bigger than 4.
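Here's a simplified sketch of that kind of discriminator, in the DCGAN style (the 64x64 input size, the channel counts, and the omission of the extra 3x3 layers are my simplifications, not the notebook's exact code):

```python
import torch.nn as nn

def conv_block(ni, nf, ks=4, stride=2, pad=1):
    # conv -> batch norm -> leaky ReLU; kernel 4, stride 2, pad 1 exactly halves the grid
    return nn.Sequential(
        nn.Conv2d(ni, nf, ks, stride, pad, bias=False),
        nn.BatchNorm2d(nf),
        nn.LeakyReLU(0.2, inplace=True))

class Discriminator(nn.Module):
    # Keep stacking stride-2 convs until the grid is no bigger than 4x4,
    # then one final conv down to a single channel, and take the mean as the score.
    def __init__(self, in_size=64, nc=3, ndf=64):
        super().__init__()
        layers = [conv_block(nc, ndf)]
        size, nf = in_size // 2, ndf
        while size > 4:
            layers.append(conv_block(nf, nf * 2))   # halve the grid, double the channels
            nf *= 2
            size //= 2
        layers.append(nn.Conv2d(nf, 1, kernel_size=3, padding=1, bias=False))  # one output channel, grid stays ~4x4
        self.features = nn.Sequential(*layers)

    def forward(self, x):
        # (batch, 1, 4, 4) -> one score per image, by averaging over the grid; lower means "looks real"
        return self.features(x).mean(dim=(1, 2, 3))
```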

So this is quite a nice way of creating as many layers as you need in a network to handle arbitrary sized images and turn them into a fixed known grid size. Yes, Rachel? Does a GAN need a lot more data than say dogs versus cats or NLP, or is it comparable?

Honestly, I'm kind of embarrassed to say I am not an expert practitioner in GANs. The stuff I teach in part 1 is stuff I'm happy to say I know the best way to do, and so I can show you state-of-the-art results, like I just did with CIFAR-10, with the help of some of my students, of course.

I'm not there at all with GANs. So I'm not quite sure how much you need. In general, it seems you need quite a lot. But remember, the only reason we didn't need too much in dogs and cats is because we had a pre-trained model, and could we leverage pre-trained GAN models and fine-tune them?

Probably. I don't think anybody's done it as far as I know. That could be a really interesting thing for people to kind of think about and experiment with. Maybe people have done it and there's some literature there I haven't come across. So I'm somewhat familiar with the main pieces of literature in GANs, but I don't know all of it.

So maybe I've missed something about transfer learning in GANs, but that would be the trick to not needing too much data. So the huge speed-up was a combination of the one-cycle learning rate and momentum annealing, plus the eight-GPU parallel training and the half precision? Is it only possible to do the half-precision calculation with a consumer GPU?

Another question: why is the calculation 8 times faster from single to half precision, while from double to single it's only 2 times faster? Okay, so for the CIFAR-10 result, it's not 8 times faster from single to half - it's about 2 or 3 times as fast from single to half. The Nvidia claims about the flops performance of the tensor cores are academically correct but in practice meaningless, because it really depends on what cores you need for what pieces.

So about 2 or 3x improvement for half. So the half-precision helps a bit, the extra GPU helps a bit, the one cycle helps an enormous amount. Then another key piece was the playing around with the parameters that I told you about. So reading the wide resnet paper carefully, identifying the kinds of things that they found there, and then writing a version of the architecture you just saw that made it really easy for me to fiddle around with parameters.

Staying up all night trying every possible combination of different kernel sizes and numbers of kernels and numbers of layer groups and size of layer groups. Remember we did a bottleneck but actually we tended to focus not on bottlenecks but instead on widening. So we actually like things that increase the size and then decrease it because it takes better advantage of the GPU.

So all those things combined together. I'd say the one cycle was perhaps the most critical but every one of those resulted in a big speedup. That's why we were able to get this 30x improvement over the state of the art. And we got some ideas for other things like after this Dawn Bench finishes.

Maybe we'll try and go even further and see if we can beat one minute one day. That'll be fun. So here's our discriminator. The important thing to remember about an architecture is it doesn't do anything other than have some input tensor size and rank and some output tensor size and rank.

You see the last conv here has one channel. This is a bit different to what we're used to, because normally our last thing is a linear block. But our last thing here is a conv block, and it's only got one channel, but it's got a grid size of something around 4x4.

So we're going to spit out a 4x4 by 1 tensor. So what we then do is we then take the mean of that. So it goes from 4x4 by 1 to the scalar. So this is kind of like the ultimate adaptive average pooling, because we've got something with just one channel, we take the mean.

So this is a bit different. Normally we first do average pooling and then we put it through a fully connected layer to get our one thing out. In this case though we're getting one channel out and then taking the mean of that. I haven't fiddled around with why did we do it that way, what would instead happen if we did the usual average pooling followed by a fully connected layer.

Would it work better? Would it not? I don't know. I rather suspect it would work better if we did it the normal way, but I haven't tried it and I don't really have a good enough intuition to know whether I'm missing something. It would be an interesting experiment to try.

If somebody wants to stick an adaptive average pooling layer here and a fully connected layer afterwards with a single output, it should keep working. It should do something - see whether the loss goes down and whether it works. So that's the discriminator. There's going to be a training loop. Let's assume we've already got a generator.

Somebody says, "Okay Jeremy, here's a generator, it generates bedrooms. I want you to build a model that can figure out which ones are real and which ones aren't. So I'm going to take the data set and I'm going to basically label a bunch of images which are fake bedrooms from the generator and a bunch of images of real bedrooms from my else-on data set to stick a 1 or a 0 in each one and then I'll try to get the discriminator to tell the difference.

So that's going to be simple enough. But I haven't been given a generator, I need to build one. So a generator, and we haven't talked about the loss function yet. We're just going to assume there's some loss function that does this thing. So a generator is also an architecture which doesn't do anything by itself until we have a loss function and data.

But what are the ranks and sizes of the tensors? The input to the generator is going to be a vector of random numbers. In the paper, they call that the prior. It's going to be a vector of random numbers. How big? I don't know. Some big. 64, 128. And the idea is that a different bunch of random numbers will generate a different bedroom.

So our generator has to take as input a vector, and it's going to take that vector, so here's our input, and it's going to stick it through, in this case a sequential model. And the sequential model is going to take that vector and it's going to turn it into a rank 4 tensor, or if we take off the batch bit, a rank 3 tensor, height by width by 3.

So you can see at the end here, our final step is nc, the number of channels. So I think that's going to have to end up being 3, because we're going to create a 3-channel image of some size. In the ConvBlock forward, is there a reason why BatchNorm comes after ReLU, i.e.

self.bn(self.relu(...))? No, there's not. It's just what they had in the code I borrowed from, I think. So again, unless my intuition about GANs is all wrong and for some reason they need to be different to what I'm used to, I would normally expect to go ReLU then BatchNorm. This is actually the order that makes more sense to me.

But I think the order I had in the darknet was what they used in the darknet paper. Everybody seems to have a different order of these things. And in fact, most people for sci-fi 10 have a different order again, which is they actually go bn, then ReLU, then conv, which is kind of a quirky way of thinking about it.

But it turns out that often for residual blocks that works better. That's called a pre-activation resnet. So if you Google for pre-activation resnet, you can see that. So yeah, there's not so much papers but more blog posts out there where people have experimented with different orders of those things.

And yeah, it seems to depend a lot on what specific data set it is and what you're doing with, although in general the difference in performance is small enough you won't care unless it's for a competition. So the generator needs to start with a vector and end up with a rank 3 tensor.

We don't really know how to do that yet, so how do we do that? How do we start with a vector and turn it into a rank 3 tensor? We need to use something called a deconvolution. And a deconvolution is, or as they call it in PyTorch, a transposed convolution.

Same thing, different name. And so a deconvolution is something which, rather than decreasing the grid size, it increases the grid size. So as with all things, it's easiest to see in an Excel spreadsheet. So here's a convolution. We start with a 4x4 grid cell with a single channel, a single filter.

And let's put it through a 3x3 kernel again with a single output. So we've got a single channel in, a single filter kernel. And so if we don't add any padding, we're going to end up with 2x2, because that 3x3 can go in 1, 2, 3, 4 places. It can go in one of two places across and one of two places down if there's no padding.

So there's our convolution. Remember the convolution is just the sum of the product of the kernel and the appropriate grid cell. So there's our standard 3x3 on one channel, one filter. So the idea now is I want to go the opposite direction. I want to start with my 2x2, and I want to create a 4x4.

And specifically, I want to create the same 4x4 that I started with. And I want to do that by using a convolution. So how would I do that? Well, if I have a 3x3 convolution, then if I want to create a 4x4 output, I'm going to need to create this much padding.

Because with this much padding, I'm going to end up with 1, 2, 3, 4 by 1, 2, 3, 4 - you see why that is? This filter can go in any one of four places across and four places up and down. So let's say my convolutional filter was just a bunch of zeros; then I can calculate my error for each cell just by taking this subtraction, and then I can get the sum of absolute values, the L1 loss, by just summing up the absolute values of those errors.

So now I can use optimization - in Excel, that's called Solver - to do a gradient descent. So I'm going to set that cell to be minimized and try to reduce my loss by changing my filter, and I'll go Solve. And you can see it's come up with a filter where the results are close to the targets - 15.7 compared to 16, and so on - so it's not perfect.

And in general, you can't assume that a deconvolution can exactly create the exact thing that you want, because there's just not enough. There's only 9 things here, and there's 16 things you're trying to create. But it's made a pretty good attempt. So this is what a deconvolution looks like, a stride 1 3x3 deconvolution on a 2x2 grid cell input.
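The same round trip is easy to check in PyTorch (single channel, stride 1, no padding, just like the spreadsheet):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)                        # one channel, 4x4 grid

conv = nn.Conv2d(1, 1, kernel_size=3)              # no padding: 4x4 -> 2x2
y = conv(x)
print(y.shape)                                     # torch.Size([1, 1, 2, 2])

deconv = nn.ConvTranspose2d(1, 1, kernel_size=3)   # goes the other way: 2x2 -> 4x4
print(deconv(y).shape)                             # torch.Size([1, 1, 4, 4])
```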

How difficult is it to create a discriminator to identify fake news versus real news? Well, you don't need anything special, that's just a classifier. So you would just use the NLP classifier from previous to previous class and lesson 4. In that case, there's no generative piece, right? So you just need a dataset that says these are the things that we believe are fake news, and these are the things we consider to be real news.

And it should actually work very well. To the best of my knowledge, if you try it, you should get as good a result as anybody else has got; whether it's good enough to be useful in practice, I don't know. Oh, I was going to say that it's very hard using the technique you've described.

Very hard. There's not a good solution that does that. Well, but I don't think anybody in our course has tried, and nobody else outside our course knows of this technique. So there's been, as we've learned, we've just had a very significant jump in NLP classification capabilities. Obviously the best you could do at this stage would be to generate a triage that says these things look pretty sketchy based on how they're written and some human could go and fact check them.

An NLP classifier and RNN can't fact check things, but it could recognize like, oh, these are written in that kind of highly popularized style which often fake news is written in, and so maybe these ones are worth paying attention to. I think that would probably be the best you could hope for without drawing on some kind of external data sources.

But it's important to remember that a discriminator is basically just a classifier, and you don't need any special techniques beyond what we've already learnt to do NLP classification. So to do that kind of deconvolution in PyTorch, you just say nn.ConvTranspose2d, and in the normal way you say the number of input channels, the number of output channels, the kernel size, the stride, the padding, the bias -- these parameters are all the same.
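As a quick sketch of what that call looks like (illustrative channel counts, not the notebook's exact values):

```python
import torch
import torch.nn as nn

# Same argument order as nn.Conv2d; the difference is that stride 2 now
# doubles the grid size instead of halving it.
deconv = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                            kernel_size=4, stride=2, padding=1, bias=False)

x = torch.randn(1, 64, 8, 8)
print(deconv(x).shape)      # torch.Size([1, 32, 16, 16])
```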

And the reason it's called a conv transpose is because it actually turns out that this is the same as the calculation of the gradient of a convolution. There's a really nice example back on the old Theano website, which comes from a really nice paper that actually shows you some visualizations.

So this is actually the one we just saw of doing a 2x2 deconvolution. If there's a stride 2, then you don't just have padding around the outside, but you actually have to put padding in the middle as well. They're not actually quite implemented this way because this is slow to do.

In practice they implement them a different way, but it all happens behind the scenes, we don't have to worry about it. We've talked about this convolution arithmetic tutorial before, and if you're still not comfortable with convolutions and in order to get comfortable with deconvolutions, this is a great site to go to.

If you want to see the paper, just Google for convolution arithmetic, that'll be the first thing that comes up. Let's do it now so you know you've found it. Here it is. And so that Theano tutorial actually comes from this paper. But the paper doesn't have the animated gifs.

So it's interesting then. A deconv block looks identical to a conv block, except it's got the word transpose written here. We just go conv, ReLU, batch norm as before; it's got input filters and output filters. The only difference is that stride 2 means that the grid size will double rather than halve.
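A hedged sketch of such a deconv block, mirroring the conv, ReLU, batch norm ordering described above (illustrative, not fastai's exact code):

```python
import torch.nn as nn

# Identical to a conv block except for the ConvTranspose2d, so stride 2
# doubles the grid size rather than halving it.
class DeconvBlock(nn.Module):
    def __init__(self, ni, nf, ks=4, stride=2, pad=1):
        super().__init__()
        self.conv = nn.ConvTranspose2d(ni, nf, ks, stride, pad, bias=False)
        self.bn   = nn.BatchNorm2d(nf)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.bn(self.relu(self.conv(x)))
```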

Both nn.ConvTranspose2d and nn.Upsample seem to do the same thing, i.e. expand the grid size (height and width) from the previous layer. Can we say that nn.ConvTranspose2d is always better than nn.Upsample, since upsample is merely resizing and filling unknowns by zeros or interpolation? No, you can't. There's a fantastic interactive paper on distill.pub called Deconvolution and Checkerboard Artifacts, and the good news, if you can call it that, is that everybody else has the same problem.

If you have a look here, can you see these checkerboard artifacts? It's all like dark blue, light blue, dark blue, light blue. So these are all from actual papers, right? Basically they noticed every one of these papers with generative models has these checkerboard artifacts. And what they realized is it's because when you have a stride 2 convolution of size 3 kernel, they overlap.

And so you basically get like some pixels get twice as much, some grid cells get twice as much activation. And so even if you start with random weights, you end up with a checkerboard artifact. So you can kind of see it here. And so the deeper you get, the worse it gets.

Their advice is actually less direct than it ought to be. I found that for most generative models, upsampling is better. If you do nn.Upsample, then all it does is basically the opposite of pooling: it says let's replace this one pixel, or this one grid cell, with 4, a 2x2.

And there's a number of ways to upsample. One is just to copy it across to those 4. Another is to use bilinear or bicubic interpolation. There are various techniques to try and create a smooth upsampled version, and you can pretty much choose any of them in PyTorch. So if you do a 2x2 upsample and then a regular stride 1 3x3 conv, that's like another way of doing the same kind of thing as a conv transpose.

It's doubling the grid size and doing some convolutional arithmetic on it. And I found for generative models it pretty much always works better. And in that distill.pub publication, they kind of indicate that maybe that's a good approach, but they don't just come out and say just do this, whereas I would just say just do this.
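Here's a sketch of that upsample-then-conv alternative (again illustrative, not fastai's exact code):

```python
import torch.nn as nn

# Double the grid size with nn.Upsample, then apply a regular stride-1
# 3x3 convolution -- the alternative to ConvTranspose2d described above.
class UpsampleBlock(nn.Module):
    def __init__(self, ni, nf):
        super().__init__()
        self.up   = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn   = nn.BatchNorm2d(nf)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.bn(self.relu(self.conv(self.up(x))))
```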

Having said that, for GANs, I haven't had that much success with it yet, and I think it probably requires some tweaking to get it to work. I'm sure some people have got it to work. The issue I think is that in the early stages, it doesn't create enough noise.

I had a version actually where I tried to do it with an upsample, and you could kind of see that the noise didn't look very noisy. So anyway, it's an interesting version. But next week when we look at style transfer and super resolution and stuff, I think you'll see nn.Upsample really comes into its own.

So the generator, we can now basically start with a vector. We can decide and say, okay, let's not think of it as a vector, but actually it's a 1x1 grid cell, and then we can turn it into a 4x4 and an 8x8 and so forth. And so that's why we have to make sure it's a suitable multiple so that we can actually create something of the right size.

And so you can see it's doing the exact opposite of before, right? It's making the grid size smaller and smaller by 2 at a time, for as long as it can, until it gets to half the size that we want. And then finally we add one more on at the end -- sorry, we add n more on at the end with no stride, and then we add one more conv transpose to finally get to the size that we wanted, and we're done.

Finally, we put that through a tanh, and that's going to force us to be in the 0-to-1 range, because of course we don't want to spit out arbitrarily sized pixel values. So we've got a generator architecture which spits out an image of some given size with the correct number of channels and with values between 0 and 1.
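As a rough picture of the kind of architecture being described, here is a minimal DCGAN-style generator sketch with illustrative sizes (not the notebook's exact DCGAN_G):

```python
import torch.nn as nn

# Treat the noise vector as a 1x1 grid and keep doubling the grid size with
# stride-2 transposed convs until we reach 64x64, finishing with a tanh.
nz, ngf, nc = 100, 64, 3    # noise size, base number of filters, output channels (illustrative)

netG_sketch = nn.Sequential(
    nn.ConvTranspose2d(nz,    ngf*8, 4, 1, 0, bias=False), nn.BatchNorm2d(ngf*8), nn.ReLU(True),  # 1x1   -> 4x4
    nn.ConvTranspose2d(ngf*8, ngf*4, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf*4), nn.ReLU(True),  # 4x4   -> 8x8
    nn.ConvTranspose2d(ngf*4, ngf*2, 4, 2, 1, bias=False), nn.BatchNorm2d(ngf*2), nn.ReLU(True),  # 8x8   -> 16x16
    nn.ConvTranspose2d(ngf*2, ngf,   4, 2, 1, bias=False), nn.BatchNorm2d(ngf),   nn.ReLU(True),  # 16x16 -> 32x32
    nn.ConvTranspose2d(ngf,   nc,    4, 2, 1, bias=False), nn.Tanh(),                             # 32x32 -> 64x64
)
```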

So at this point we can now create our ModelData object. These things take a while to train, so I just made it 128x128, so this is just a convenient way to make it a bit faster. And that's going to be the size of the input, but then we're going to use transformations to turn it into 64x64.

There's been more recent advances which have attempted to really increase this up to high resolution sizes, but they still tend to require either a batch size of 1 or lots and lots of GPUs or whatever. We're trying to do things that we can do on single consumer GPUs here.

So here's an example of one of the 64x64 bedrooms. So we're going to do pretty much everything manually, so let's go ahead and create our two models, our generator and our discriminator. And as you can see, they're DCGAN modules -- in other words, the same modules that appeared in this paper.

So if you're interested in reading the papers, it's well worth going back and looking at the DCGAN paper to see what these architectures are, because it's assumed that when you read the Wasserstein GAN paper that you already know that. Shouldn't we use a sigmoid if we want values between 0 and 1?

I always forget which one's which. So sigmoid is 0 to 1, tanh is -1 to 1. I think what will happen is -- I'm going to have to check that. I vaguely remember thinking about this when I was writing this notebook and realizing that -1 to 1 made sense for some reason, but I can't remember what that reason was now.

So let me get back to you about that during the week and remind me if I forget. Good question, thank you. So we've got our generator and our discriminator. So we need a function that returns a prior vector, so a bunch of noise. So we do that by creating a bunch of zeros.

nz is the size of z, so very often in our code if you see a mysterious letter, it's because that's the letter they used in the paper. So z is the size of our noise vector. So there's the size of our noise vector, and then we use a normal distribution to generate random numbers inside that.

And that needs to be a variable because it's going to be participating in the gradient updates. So here's an example of creating some noise, and so here are four different pieces of noise. So we need an optimizer in order to update our gradients. In the Wasserstein GAN paper, they told us to use rmsprop.

So that's fine. So when you see this thing saying do an RMSprop update in a paper, that's nice -- we can just do an RMSprop update with PyTorch. And they suggested a learning rate of 5e-5. I think I found 1e-4 seemed to work, so I just made it a bit bigger.
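Putting those two pieces together, a sketch of the noise function and the RMSprop optimizers might look like this; nz and bs are illustrative values, and netG and netD are assumed to be the generator and discriminator modules created above:

```python
import torch

nz, bs = 100, 64          # size of the noise vector z, and batch size (illustrative)

def create_noise(b):
    # b noise vectors drawn from a normal distribution, shaped as 1x1 "images";
    # in current PyTorch there's no need to wrap this in a Variable any more
    return torch.zeros(b, nz, 1, 1).normal_(0, 1)

# RMSprop optimizers for the critic (netD) and generator (netG), which are
# assumed to be defined already; 1e-4 is the slightly larger learning rate
# mentioned above (the paper suggests 5e-5)
optimizerD = torch.optim.RMSprop(netD.parameters(), lr=1e-4)
optimizerG = torch.optim.RMSprop(netG.parameters(), lr=1e-4)
```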

So now we need a training loop. And so this is the thing that's going to implement this algorithm. So a training loop is going to go through some number of epochs that we get to pick, so that's going to be a parameter. And so remember, when you do everything manually, you've got to remember all the manual steps to do.

So one is that you have to set your modules into training mode when you're training them, and into evaluation mode when you're evaluating them. Because in training mode, batch norm updates happen, and dropout happens. In evaluation mode, those two things get turned off. That's basically the difference. So put it into training mode.

We're going to grab an iterator from our training data loader. We're going to see how many steps we have to go through, and then we'll use TQDM to give us a progress bar, and then we're going to go through that many steps. So the first step of this algorithm is to update the discriminator.

So in this one -- they don't call it a discriminator, they call it a critic. So w are the weights of the critic. So the first step is to train our critic a little bit, and then we're going to train our generator a little bit, and then we're going to go back to the top of the loop.

So we've got a while loop on the outside, so here's our while loop on the outside, and then inside that there's another loop for the critic, and so here's our little loop inside that for the critic. We call it a discriminator. So what we're going to do now is we've got a generator, and at the moment it's random.

So our generator is going to generate stuff that looks something like this, and so we need to first of all teach our discriminator to tell the difference between that and a bedroom. It shouldn't be too hard, you would hope. So we just do it in basically the usual way, but there's a few little tweaks.

So first of all, we're going to grab a mini-batch of real bedroom photos, so we can just grab the next batch from our iterator, turn it into a variable. Then we're going to calculate the loss for that. So this is going to be, how much does the discriminator think this looks fake?

And then we're going to create some fake images, and to do that we'll create some random noise, and we'll stick it through our generator, which at this stage is just a bunch of random weights, and that's going to create a mini-batch of fake images. And so then we'll put that through the same discriminator module as before to get the loss for that.

So how fake do the fake ones look? Remember when you do everything manually, you have to zero the gradients in your loop, and if you've forgotten about that, go back to the Part 1 lesson where we do everything from scratch. So now finally, the total discriminator loss is equal to the real loss minus the fake loss.
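Spelled out, that sequence of steps looks roughly like the following sketch. It assumes netD, netG, create_noise and optimizerD are defined as above, plus some iterator over the real images, and it reads netD as a "how fake" score, so the critic wants real images to score low and generated ones to score high:

```python
real, _ = next(iter_dl)                 # a mini-batch of real bedrooms (iterator assumed)
real_loss = netD(real).mean()

noise = create_noise(real.size(0))
with torch.no_grad():
    fake = netG(noise)                  # no gradients flow through the generator here
fake_loss = netD(fake).mean()

netD.zero_grad()
lossD = real_loss - fake_loss           # the Wasserstein critic loss: real minus fake
lossD.backward()
optimizerD.step()
```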

And so you can see that here. They don't talk about the loss, they actually just talk about what are the gradient updates. So this here is the symbol for get the gradients. So inside here is the loss. And try to learn to throw away in your head all of the boring stuff.

So when you see sum over m divided by m, that means take the average. So just throw that away and replace it with np.mean in your head. There's another np.mean. So you want to get quick at being able to see these common idioms. So anytime you see 1 over m, sum over m, you go, okay, np.mean.

So we're taking the mean of, and we're taking the mean of, so that's all fine. x_i, what's x_i? It looks like it's x to the power of i, but it's not. The math notation is very overloaded. They showed us here what x_i is, and it's a set of m samples from a batch of the real data.

So in other words, this is a mini-batch. So when you see something saying sample, it means just grab a row, grab a row, grab a row, and you can see here grab at m times, and we'll call the first row x, parenthesis 1, the second row x, parenthesis 2.

One of the annoying things about math notation is the way that we index into arrays is everybody uses different approaches, subscripts, superscripts, things in brackets, combinations, commas, square brackets, whatever. So you've just got to look in the paper and be like, okay, at some point they're going to say take the i-th row from this matrix or the i-th image in this batch, how are they going to do it?

In this case, it's a superscript in parenthesis. So that's all sample means, and curly brackets means it's just a set of them. This little squiggle followed by something here means according to some probability distribution. And so in this case, and very very often in papers, it simply means, hey, you've got a bunch of data, grab a bit from it at random.

So that's the probability distribution of the data you have is the data you have. So this says grab m things at random from your prior samples, and so that means in other words call create_noise to create m random vectors. So now we've got m real images. Each one gets put through our discriminator.

We've got m bits of noise. Each one gets put through our generator to create m generated images. Each one of those gets put through, look, f(w), that's the same thing, so each one of those gets put through our discriminator to try and figure out whether they're fake or not.

And so then it's this minus this, and the mean of that, and then finally we get the gradient of that in order to figure out how to use RMSprop to update our weights using some learning rate. So in PyTorch, we don't have to worry about getting the gradients. We can just specify the loss bit, and then just say loss.backward and then the discriminator optimizer's step.
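Written out in the paper's notation, the critic update being described is roughly the following (this is a reconstruction, so check the paper for the exact form):

$$ g_w \leftarrow \nabla_w \left[ \frac{1}{m}\sum_{i=1}^{m} f_w\big(x^{(i)}\big) \;-\; \frac{1}{m}\sum_{i=1}^{m} f_w\big(g_\theta(z^{(i)})\big) \right], \qquad w \leftarrow w + \alpha \cdot \mathrm{RMSProp}(w,\, g_w) $$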

Now there's one key step, which is that we have to keep all of our weights, which are the parameters in a PyTorch module, in this small range between -0.01 and 0.01. Why? Because the mathematical assumptions that make this algorithm work only apply in like a small ball. I think it's kind of interesting to understand the math of why that's the case, but it's very specific to this one paper, and understanding it won't help you understand any other paper.

So only study it if you're interested. I think it's nicely explained, I think it's fun, but it won't be information that you'll reuse elsewhere unless you get super into GANs. I'll also mention that after the paper came out, an improved Wasserstein GAN paper came out that said there are better ways to ensure that your weight space is in this tight ball, which was basically to penalize gradients that are too high.

So nowadays there are slightly different ways to do this. Anyway, that's why that line of code there is kind of the key contribution. This one line of code actually is the one line of code you add to make it a Wasserstein GAN. But the work was all in knowing that that's the thing you can do that makes everything work better.
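That line is essentially just the following, applied to every parameter of the critic:

```python
# Keep every critic parameter inside the small ball [-0.01, 0.01]
for p in netD.parameters():
    p.data.clamp_(-0.01, 0.01)
```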

At the end of this, we've got a discriminator that can recognize the difference between real bedrooms and our totally random crappy generated images. So let's now try and create some better images. So now set trainable to false for the discriminator, set trainable to true for the generator, and zero out the gradients of the generator. And now our loss again is fw, the discriminator, applied to the generator applied to some more random noise.

So here's our random noise, here's our generator, and here's our discriminator. I think I can remove that now because I think I've put it inside the discriminator but I won't change it now because it's going to confuse me. So it's exactly the same as before where we did generator on the noise and then pass that to discriminator, but this time the thing that's trainable is the generator, not the discriminator.

So in other words, in this pseudocode, the thing they update is theta, which is the generator's parameters rather than w, which is the discriminator's parameters. And so hopefully you'll see now that this w down here is telling you these are the parameters of the discriminator, this theta down here is telling you these are the parameters of the generator.

And again, it's not a universal mathematical notation, it's a thing they're doing in this particular paper, but it's kind of nice when you see some suffix like that, try to think about what it's telling you. So we take some noise, generate some images, try and figure out if they're fake or real, and use that to get gradients with respect to the generator, as opposed to earlier where we got them with respect to the discriminator, and use that to update the weights with RMSprop with an alpha learning rate.
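A sketch of that generator update, assuming a set_trainable helper that freezes or unfreezes a module's parameters, and the netG, netD, create_noise, bs and optimizerG defined above:

```python
set_trainable(netD, False)   # freeze the critic (assumed helper)
set_trainable(netG, True)    # unfreeze the generator
netG.zero_grad()

lossG = netD(netG(create_noise(bs))).mean()   # how fake does the critic think our samples are?
lossG.backward()
optimizerG.step()
```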

You'll see that it's kind of unfair that the discriminator is getting trained n critic times, which they set to 5, for every time that we train the generator once. And the paper talks a bit about this, but the basic idea is there's no point making the generator better if the discriminator doesn't know how to discriminate yet.

So that's why we've got this while loop. And here's that 5, and actually something which was added in the later paper is the idea that from time to time, and a bunch of times at the start, you should do more steps of the discriminator. So make sure that the discriminator is pretty capable from time to time.

So do a bunch of epochs of training the discriminator a bunch of times to get better at telling the difference between real and fake, and then do one step with making the generator being better at generating, and that is an epoch. And so let's train that for one epoch, and then let's create some noise so we can generate some examples.

So we're going to do that later. Let's first of all decrease the learning rate by 10 and do one more pass. So we've now done two epochs, and now let's use our noise to pass it to our generator, and then put it through our denormalization to turn it back into something we can see, and then plot it, and we have some bedrooms.

It's not real bedrooms, and some of them don't look particularly like bedrooms, but some of them look a lot like bedrooms. So that's the idea, that's a GAN. And I think the best way to think about a GAN is it's like an underlying technology that you'll probably never use like this, but you'll use in lots of interesting ways.

For example, we're going to use it to create now a CycleGAN, and we're going to use the CycleGAN to turn horses into zebras. You could also use it to turn Monet prints into photos, or to turn photos of Yosemite in summer into winter. So it's going to be pretty, yes, Rachel?

Two questions. One, is there any reason for using RMSprop specifically as the optimizer, as opposed to Adam? I don't remember it being explicitly discussed in the paper; I don't know if it's just empirical or whether there's a theoretical reason. Have a look in the paper and see what it says, I don't recall.

And what would be a reasonable way of detecting overfitting while training, or evaluating the performance of one of these GAN models once we're done training? In other words, how does the notion of training, validation and test sets translate to GANs? That's an awesome question. And there are a lot of people who make jokes about how GANs are the one field where you don't need a test set, and people take advantage of that by making stuff up and saying it looks great.

There are some pretty famous problems with GANs. One of the famous problems with GANs is called mode collapse. And mode collapse happens where you look at your bedrooms and it turns out that there's basically only three kinds of bedrooms that every possible noise vector mapped to, and you look at your gallery and it turns out they're all just the same thing, or there's just three different things.

Mode collapse is easy to see if you collapse down to a small number of modes, like three or four. But what if you have a mode collapse down to 10,000 modes, so there's only 10,000 possible bedrooms that all of your noise vectors collapse to? You wouldn't be able to see it here, because it's pretty unlikely you would have two identical bedrooms out of 10,000.

Or what if every one of these bedrooms is basically a direct copy of one of the -- it basically memorized some input. Could that be happening? And the truth is most papers don't do a good job or sometimes any job of checking those things. So the question of how do we evaluate GANs, and even the point of maybe we should actually evaluate GANs properly is something that is not widely enough understood even now.

And some people are trying to really push. So Ian Goodfellow, who a lot of you will know because he came and spoke here at a lot of the book club meetings last year, and of course was the first author on the most famous deep learning book. He's the inventor of GANs, and he's been sending a continuous stream of tweets reminding people about the importance of testing GANs properly.

So if you see a paper that claims exceptional GAN results, then this is definitely something to look at. Have they talked about mode collapse? Have they talked about memorization? So this is going to be really straightforward because it's just a neural net. So all we're going to do is we're going to create an input containing lots of zebra photos, and with each one we'll pair it with an equivalent horse photo, and we'll just train a neural net that goes from one to the other.

Or you can do the same thing for every Monet painting, create a dataset containing the photo of the place. Oh wait, that's not possible because the places that Monet painted aren't there anymore, and there aren't exact zebra versions of horses. And oh wait, how the hell is this going to work?

This seems to break everything we know about what neural nets can do and how they do it. Alright Rachel, you're going to ask me a question and spoil our whole train of thought. Come on, it had better be good. Can GANs be used for data augmentation? Yeah, absolutely. You can use a GAN for data augmentation.

Should you? I don't know. There are some papers that try to do semi-supervised learning with GANs. I haven't found any that are particularly compelling, showing state-of-the-art results on really interesting datasets that have been widely studied. I'm a little skeptical. The reason I'm a little skeptical is because in my experience, if you train a model with synthetic data, the neural net will become fantastically good at recognizing the specific problems of your synthetic data, and that will end up being what it learns from.

And there are lots of other ways of doing semi-supervised models which do work well. There are some places that it can work. For example, you might remember Otavio Good who created that fantastic visualization in Part 1 of the Zooming ConvNet where he kind of showed a letter going through MNIST.

At least at that time, he was the number one guy in autonomous remote-controlled car competitions. And he trained his model using synthetically augmented data, where he basically took real videos of a car driving around a circuit and added fake people and fake other cars and stuff like that.

And I think that worked well because he's kind of a genius and because I think he had a well-defined subset that he had to work in. But in general it's really hard to use synthetic data. I've tried using synthetic data in models for decades now, obviously not GANs because they're pretty new, but in general it's very hard to do.

Very interesting research question. So somehow these folks at Berkeley created a model that can turn a horse into a zebra, despite not having any paired photos -- unless they went out there and painted horses and took before-and-after shots, which I don't believe they did. So how the hell did they do this?

It's kind of genius. I will say the person I know who's doing the most interesting practice of CycleGAN right now is one of our students, Helena Sarin. She's the only artist I know of who is a CycleGAN artist. Here's an example I love. She created this little doodle in the top left, and then trained a CycleGAN to turn it into this beautiful painting in the bottom right.

Here are some more of her amazing works. I think it's really interesting. I mentioned at the start of this class that GANs are in the category of stuff that's not quite there yet, but it's nearly there; and in this case there's at least one person in the world now who's creating beautiful and extraordinary artworks using GANs, and specifically CycleGANs. There are actually at least a dozen people I know of who are doing interesting creative work with neural nets more generally, and the field of creative AI is going to expand dramatically.

I think it's interesting with Helena, I don't know her personally, but from what I understand of her background, she's a software developer, it's her full-time job, and an artist as her hobby, and she's kind of started combining these two by saying, "Gosh, I wonder what this particular tool could bring to my art?" And so if you follow her Twitter account, we'll make sure we add it on the wiki.

If somebody can find it, it's Helena Sarin. She basically posts a new work almost every day, and they're always pretty amazing. So here's the basic trick, and this is from the CycleGAN paper. We're going to have two images, assuming we're doing this with images, but the key thing is they're not paired images.

We don't have a data set of horses and the equivalent zebras. We've got a bunch of horses, a bunch of zebras. Grab one horse, grab one zebra. We've now got an X, let's say X is horse, and Y is zebra. We're going to train a generator, and what they call here a mapping function, that turns horse into zebra, we'll call that mapping function G, and we'll create one mapping function, generator, that turns a zebra into a horse, and we'll call that F.

We'll create a discriminator, just like we did before, which is going to get as good as possible at recognizing real from fake horses, so that'll be DX, and then another discriminator which is going to get as good as possible at recognizing real from fake zebras; we'll call that DY.

So that's kind of our starting point, but then the key thing to making this work -- we're kind of building up a loss function here, right? Here's one bit of the loss function, and here's the second bit of the loss function. We're going to create something called cycle consistency loss, which says that after you turn your horse into a zebra with your G generator, and check whether or not you can recognize that it's real, you then try to turn it back again.

I keep forgetting which one's horse and which one's zebra, I apologize if I get my X's and Y's backwards. I turn my horse into a zebra, and then I'm going to try and turn that zebra back into the same horse that I started with. So then I'm going to have another function that's going to check whether this horse, which I've generated knowing nothing about X, generated entirely from this zebra, is similar to the original horse or not.

So the idea would be if your generated zebra doesn't look anything like your original horse, you've got no chance of turning it back into the original horse. So a loss, which compares X hat to X, is going to be really bad unless you can go into Y and back out again.

And you're probably only going to be able to do that if you're able to create a zebra that looks like the original horse so that you know what the original horse looked like. And vice versa. Take your zebra, turn it into a fake horse, and then try and turn it back into the original zebra and check that it looks like the original.

So notice here, this F is our zebra to horse. This G is our horse to zebra. So the G and the F are kind of doing two things. They're both turning the original horse into the zebra and then turning the zebra back into the original horse. So notice that there's only two generators.

There isn't a separate generator for the reverse mapping. You have to use the same generator that was used for the original mapping. So this is the cycle consistency loss. And I just think this is genius. The idea that this is a thing that could be even possible, honestly when this came out, it just never occurred to me as a thing that I could even try and solve.

It seems so obviously impossible. And then the idea that you can solve it like this -- I just think it's so damn smart. So it's good to look at the equations in this paper because they're written pretty simply. It's not like some of the stuff in the Wasserstein GAN paper, which has lots of theoretical proofs and whatever else.

In this case, they're just equations that just lay out what's going on. And you really want to get to a point where you can read them and understand them. So let's kind of start talking through them. So we've got a horse and a zebra. So for some mapping function G, which is our horse to zebra mapping function, then there's a GAN loss, which is the bit we're already familiar with.

It says I've got a horse, a zebra, a fake zebra recognizer, and a horse to zebra generator. And the loss is what we saw before. It's our ability to draw one zebra out of our zebras and recognize whether it's real or fake. And then take a horse and turn it into a zebra and recognize whether that's real or fake.

And then you can then do one minus the other. In this case, they've got a log in there. The log's not terribly important. So this is the thing we just saw. So that's why we did Wasserstein GAN first. This is just a standard GAN loss in math form. Did you have a question, Rachel?

All of this sounds awfully like translating from one language to another and then back to the original. Have GANs or any equivalent been tried in translation? Not that I know of. There's this unsupervised machine translation work which does kind of do something like this, but I haven't looked at it closely enough to know if it's nearly identical or just vaguely similar.

So to kind of back up to what I do know, normally with translation you require this kind of paired input. You require parallel texts. This is the French translation of this English sentence. I do know there's been a couple of recent papers that show the ability to create good quality translation models without paired data.

I haven't implemented them, and I don't understand them in detail, but they may well be using the same basic idea. We'll look at it during the week and get back to you. So we've got our GAN loss. The next piece is the cycle consistency loss. The basic idea here is that we start with our horse, use our zebra generator on that to create a zebra, use our horse generator on that to create a horse, and then compare that to the original horse.

And this double-lines-with-a-1 notation we've seen before: this is the L1 loss, the sum of the absolute values of the differences. If it were a 2, it would be the L2 loss, or 2-norm, which is the square root of the sum of the squares.

And again, we now know this squiggle idea, which is: from our horses, grab a horse. So this is what we mean by sample from a distribution. There's all kinds of distributions, but most commonly in these papers we're using an empirical distribution. In other words, we've got some rows of data; grab a row.

So when you see this thing, squiggle, other thing, this thing here, when it says pdata, that means grab something from the data, and we're going to call that thing x. So from our horse's pictures, grab a horse, turn it into a zebra, turn it back into a horse, compare it to the original, and sum up the absolute values.

Do that for horse to zebra, do it for zebra to horse as well, add the two together, and that is our cycle consistency loss. So now we get our loss function, and the whole loss function depends on our horse generator, our zebra generator, our horse recognizer, our zebra recognizer discriminator, and we're going to add up the GAN loss for recognizing horses, the GAN loss for recognizing zebras, and the cycle consistency loss for our two generators.

And then we've got a lambda here, which hopefully we're kind of used to this idea now, that is when you've got two different kinds of loss, you chuck in a parameter there, you can multiply them by so they're about the same scale. We did a similar thing with our bounding box loss compared to our classifier loss when we did that localization stuff.
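For reference, those pieces in roughly the paper's notation (a reconstruction, so check the paper for the exact form) are:

$$ \mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim p_{data}(x)}\big[\log\big(1 - D_Y(G(x))\big)\big] $$

$$ \mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim p_{data}(y)}\big[\lVert G(F(y)) - y \rVert_1\big] $$

$$ \mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda\, \mathcal{L}_{cyc}(G, F) $$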

So then we're going to try to, for this loss function, maximize the capability of the discriminators that are discriminating whilst minimizing that for the generators. So the generators and the discriminators are going to be facing off against each other. So when you see this min-max thing in papers, you'll see it a lot.

It basically means this idea that in your training loop, one thing is trying to make something better, the other is trying to make something worse, and there's lots of ways to do it, but most commonly you will alternate between the two. And you'll often see this just referred to in math papers as min-max.
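So the overall objective, in that notation, is the min-max problem:

$$ G^{*}, F^{*} = \arg\min_{G,\,F}\; \max_{D_X,\,D_Y}\; \mathcal{L}(G, F, D_X, D_Y) $$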

So when you see min-max, you should immediately think, okay, adversarial training. So let's look at the code. We probably won't be able to finish this today, but we're going to do something almost unheard of, which is I started looking at somebody else's code, and I was not so disgusted that I threw the whole thing away and did it myself.

I actually said I quite like this. I like it enough I'm going to show it to my students. So this is where the code comes from. So this is one of the people that created the original code for CycleGANs, and they've created a PyTorch version. I had to clean it up a little bit, but it's actually pretty damn good.

I think the first time I found code that I didn't feel the need to rewrite from scratch before I showed it to you. And so the cool thing about this is one of the reasons I like doing it this way, like finally finding something that's not awful, is that you're now going to get to see almost all the bits of fast.ai, or all the relevant bits of fast.ai, written in a different way than somebody else.

And so you're going to get to see how they do data sets, and data loaders, and models, and training loops, and so forth. So you'll find there's a cgan directory, which is basically nearly this, with some cleanups which I hope to submit as a PR sometime. It was written in a way that unfortunately made it a bit over-connected to how they were using it as a script.

I cleaned it up a little bit so I could use it as a module, but other than that it's pretty similar. So cgan is basically their code copied from their GitHub repo with some minor changes. The way the cgan mini-library has been set up is that the configuration options are assumed to be passed in to a script.

So they've got this train options parser method, and you can see I'm basically passing in an array of script options: where's my data, how many threads, do I want dropout, how many iterations, what am I going to call this model, which GPU do I want to run it on?

So that gives us an opt object, and you can then look at what it contains. You'll see it contains some things I didn't mention, and that's because it's got defaults for everything else. So rather than using fast.ai stuff, we're going to use the cgan stuff. So the first thing we're going to need is a data loader.

And so this is also a great opportunity for you again to practice your ability to navigate through code with your editor or IDE of choice. So we're going to start with create data loader. So you should be able to go find symbol or in vim tag to jump straight to create data loader, and we can see that's creating a custom dataset loader, and then we can see custom dataset loader is a base data loader.

So basically we can see that it's going to use a standard PyTorch data loader. So that's good. And so we know if you're going to use a standard PyTorch data loader, you have to pass it a dataset. And we know that a dataset is something that contains a length and an indexer.

So presumably when we look at create dataset, it's going to do that. Here is create dataset. So this library actually does more than just CycleGAN. It handles both aligned and unaligned image pairs. We know that our image pairs are unaligned. So we've got an unaligned dataset. Okay, here it is.

And as expected, it has a __getitem__ and a __len__. Good. A and B are our horses and zebras -- we've got two sets -- so whichever one is longer is the length of the dataset. And __getitem__ is just going to go ahead and randomly grab something from each of our two sets of horses and zebras, open them up with Pillow (PIL), run them through some transformations -- and we could either be turning horses into zebras or zebras into horses, so there's some direction -- and then it will return our horse and our zebra, and the path to the horse and the path to the zebra.
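A stripped-down sketch of what an unaligned dataset like that does (not the library's exact code; the image paths and transform are assumed to be provided):

```python
import random
from PIL import Image
from torch.utils.data import Dataset

# Pair each image from set A with a randomly chosen image from set B:
# "unaligned" means there is no fixed horse/zebra pairing.
class UnalignedDataset(Dataset):
    def __init__(self, paths_A, paths_B, transform):
        self.paths_A, self.paths_B, self.transform = paths_A, paths_B, transform

    def __len__(self):
        return max(len(self.paths_A), len(self.paths_B))

    def __getitem__(self, i):
        path_A = self.paths_A[i % len(self.paths_A)]
        path_B = random.choice(self.paths_B)
        A = self.transform(Image.open(path_A).convert('RGB'))
        B = self.transform(Image.open(path_B).convert('RGB'))
        return {'A': A, 'B': B, 'A_paths': path_A, 'B_paths': path_B}
```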

So hopefully you can kind of see that this is looking pretty similar to the kind of stuff that FastAI does. FastAI obviously does quite a lot more when it comes to transforms and performance and stuff like this. But remember, this is like research code for this one thing. It's pretty cool that they did all this work.

So we've got a data loader, so we can go and load our data into it, and that will tell us how many mini-batches are in it; that's the length of the data loader in PyTorch. The next step, now that we've got a data loader, is to create a model. So you can go tag for create_model.

There it is. Same idea, we've got different kinds of models, so we're going to be doing a CycleGAN. So here's our CycleGAN model. So there's quite a lot of stuff in a CycleGAN model, so let's go through and find out what's going to be used. But basically at this stage, we've just called initializer.

So when we initialize it, you can see it's going to go through and it's going to define two generators, which is not surprising, a generator for our horses and a generator for our zebras. There's some way for it to generate a pool of fake data. And then here we're going to grab our GAN loss, and as we talked about, our cycle consistency loss is an L1 loss.

That's interesting, they're going to use ADAM. So obviously for CycleGANs, they found ADAM works pretty well. And so then we're going to have an optimizer for our horse discriminator, an optimizer for our zebra discriminator, and an optimizer for our generator. The optimizer for the generator is going to contain the parameters both for the horse generator and the zebra generator all in one place.

So the initializer is going to set up all of the different networks and loss functions we need, and they're all going to be stored inside this model. And then it prints out and shows us exactly the PyTorch models we have. And it's interesting to see that they're using ResNets.

And so you can see the ResNets look pretty familiar: we've got conv, batch norm, ReLU, then conv, batch norm. Instance norm is basically just the same as batch norm, but it applies to one image at a time; the difference isn't particularly important. And you can see they're doing reflection padding just like we are. You can kind of see that when you try to build everything from scratch like this, it is a lot of work.

All the nice little things that fast.ai does automatically for you, you kind of have to do by hand, and you only end up with a subset of them. So over time, hopefully soon, we'll get all of this GAN stuff into fast.ai and it'll be nice and easy.
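For reference, the kind of ResNet block just described, with reflection padding and instance norm, looks roughly like this (an illustrative sketch, not the library's exact code):

```python
import torch.nn as nn

# Reflection padding, conv, instance norm, ReLU, repeated, then add the
# input back on -- a typical CycleGAN-generator-style residual block.
class ResnetBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(dim, dim, 3), nn.InstanceNorm2d(dim), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(dim, dim, 3), nn.InstanceNorm2d(dim),
        )

    def forward(self, x):
        return x + self.block(x)
```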

So we've got our model, and remember the model contains the loss functions, it contains the generators, it contains the discriminators, all in one convenient place. So I've gone ahead and kind of copied and pasted and slightly refactored the training loop from the code so that we can run it inside the notebook.

So this is all pretty familiar, right? It's a loop to go through each epoch, and a loop to go through the data. Before we did this, we set up our dataset -- this is actually not a PyTorch dataset; I think it's what they use, slightly confusingly, to refer to their combined data: what we would call a model data object, I guess, or just the data that they need.

We'll go through that with TQDM to get a progress bar, and so now we can go through and see what happens in the model. So set input. So it's kind of a different approach to what we do in fast.ai. It's kind of neat, it's quite specific to CycleGANs, but basically internally inside this model is this idea that we're going to go into our data and grab -- we're either going horse to zebra or zebra to horse, depending on which way we go.

A is either the horse or the zebra, and vice versa, and if necessary, put it on the appropriate GPU and then grab the appropriate path. So the model now has a mini-batch of horses and a mini-batch of zebras, and so now we optimize the parameters. So it's kind of nice to see it like this.
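The optimisation step being described looks roughly like this simplified sketch, written in the spirit of the library's optimize_parameters (the real version also runs a forward pass first, and the attribute names here follow the library but are not a quote of it):

```python
def optimize_parameters(model):
    # Generators first: GAN losses plus cycle-consistency losses
    model.optimizer_G.zero_grad()
    model.backward_G()
    model.optimizer_G.step()

    # Then each discriminator in turn
    model.optimizer_D_A.zero_grad()
    model.backward_D_A()
    model.optimizer_D_A.step()

    model.optimizer_D_B.zero_grad()
    model.backward_D_B()
    model.optimizer_D_B.step()
```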

You can see each step. First of all, try to optimize the generators, then try to optimize the horse discriminator, then try to optimize the zebra discriminator. zero_grad is part of PyTorch; step is part of PyTorch. So the interesting bit is the actual thing which does the backpropagation on the generator.

So here it is. And let's jump to the key pieces. There's all the bits, all the formulas that we basically just saw from the paper. So let's take a horse and generate a zebra. So we've now got a fake zebra. And let's now use the discriminator to see if we can tell whether it's fake or not.

And then let's pop that into our loss function, which we set up earlier to see if we can basically get a loss function based on that prediction. Then let's do the same thing to do the GAN loss. So go in the opposite direction, and then we need to use the opposite discriminator, and then put that through the loss function again.

And then let's do the cycle-consistency loss. So again, we take our fake, which we created up here, and try and turn it back again into the original. And then let's use that cycle-consistency loss function we created earlier to compare it to the real original. And here's that lambda. So there's some weight that we used, and that was set up, actually.

We just used the default that I suggested in their options. And then do the same for the opposite direction, and then add them all together. Do the backward step, and that's it. So we can then do the same thing for the first discriminator. And since basically all the work's been done now, there's much less to do here.
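Pulling together the generator pieces just walked through, the backward pass looks roughly like the following sketch; names like netG_A, netD_A, criterionGAN, criterionCycle and lambda_A follow the spirit of the library's backward_G but are illustrative rather than exact:

```python
fake_B = netG_A(real_A)                                  # horse -> fake zebra
loss_G_A = criterionGAN(netD_A(fake_B), True)            # can the zebra critic be fooled?

fake_A = netG_B(real_B)                                  # zebra -> fake horse
loss_G_B = criterionGAN(netD_B(fake_A), True)

rec_A = netG_B(fake_B)                                   # fake zebra -> reconstructed horse
loss_cycle_A = criterionCycle(rec_A, real_A) * lambda_A  # L1 cycle-consistency, weighted

rec_B = netG_A(fake_A)                                   # fake horse -> reconstructed zebra
loss_cycle_B = criterionCycle(rec_B, real_B) * lambda_B

loss_G = loss_G_A + loss_G_B + loss_cycle_A + loss_cycle_B
loss_G.backward()
```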

So I won't step all through it, but it's basically the same basic stuff that we've already seen. So optimized parameters basically is calculating the losses and doing the optimizer step from time to time, save and print out some results. And then from time to time, update the learning rate, so they've got some learning rate annealing built in here as well.

It isn't very exciting, but we can take a look at it. They've basically got, a bit like fast.ai, this idea of schedulers which you can then use to update your learning rates. So for those of you who are interested in better understanding deep learning APIs, or interested in contributing more to fast.ai, or interested in creating your own version of some of this stuff in some different backend, it's cool to look at a second API that covers a subset of similar things, to get a sense of how they're solving some of these problems, and what the similarities and differences are.

So we train that for a little while, and then we can just grab a few examples, and here we have them. So here are our horses, here they are as zebras, and here they are back as horses again. Here's a zebra, into a horse, back on a zebra, it's kind of thrown away its head for some reason, but not so much it could get it back again.

This is a really interesting one, like this is obviously not what zebras look like, but it's going to be a zebra version of that horse. It's also interesting to see its failure situations, I guess it doesn't very often see basically just an eyeball, it has no idea how to do that one.

So some of them don't work very well, this one's done a pretty good job. This one's interesting, it's done a good job of that one and that one, but for some reason the one in the middle didn't get a go. This one's a really weird shape, but it's done a reasonable job of it.

This one looks good, this one's pretty sloppy, again the fork just ahead, it's not bad. So it took me like 24 hours to train it even that far, so it's kind of slow. And I know Helena is constantly complaining on Twitter about how long these things take, I don't know how she's so productive with them.

So I will mention one more thing that just came out yesterday, which is that there's now multimodal unpaired image-to-image translation, and so you can basically now create multiple different cats, for instance, from this dog. So this is not just creating one example of the output that you want, but multiple ones.

So here's a house cat to a big cat, and here's a big cat to a house cat -- this is the paper. It came out like yesterday or the day before, I think, and I think the cat-and-dog examples are pretty amazing. So you can kind of see how this technology is developing, and I think there are so many opportunities to maybe do this with music, or speech, or writing, or to create tools for artists, or whatever.

Alright, thanks everybody, and see you next week. (audience applauds)