Back to Index

Lesson 10: Cutting Edge Deep Learning for Coders


Chapters

0:00 Introduction
1:00 Slav Ivanov's Blog Post
4:35 Overshoot
7:40 Study Groups
9:15 Last Week Recap
10:40 Resize Images
12:43 Center Cropping
15:02 Parallel Processing
17:39 General Approach
19:03 Append Image
20:10 Threading Local
22:54 Results
26:38 Preprocessing
27:40 Finetuning
38:39 Bcolz
43:47 Linear Model

Transcript

Some really fun stuff appearing on the forums this week. One of the really great projects was created by, I believe, our sole Bulgarian participant in the course, Slav Ivanov, who wrote a great post about picking an optimizer for style transfer. This post came from a forum discussion in which I made an off-hand remark about how I know that in theory BFGS is a deterministic optimizer, it uses a line search, it approximates the Hessian, and it ought to work better on this kind of deterministic problem, but I hadn't tried it myself and I hadn't seen anybody try it, so maybe somebody should try it.

I don't know if you've noticed, but pretty much every week I say something like that a number of times and every time I do I'm always hoping that somebody might go, "Oh, I wonder as well." And so Slav did wonder and so he posted a really interesting blog post about that exact question.

I was thrilled to see that the blog post got a lot of pick-up on the machine learning Reddit. It got 55 upvotes, which for that subreddit put it in second place on the front page. It also got picked up by the WildML mailing list's weekly summary of interesting things in AI, as the second post listed.

So that was great. For those of you that have looked at it and kind of wondered what is it about this post that causes it to get noticed whereas other ones don't, I'm not sure I know the secret, but as soon as I read it I kind of thought, "Okay, I think a lot of people are going to read this." It gives some background, it assumes an intelligent reader, but it assumes an intelligent reader who doesn't necessarily know all about this, something like you guys six months ago.

And so it describes what this is and where this kind of thing is used, gives some examples, then sets up the question of comparing different optimization algorithms, and then shows lots of examples of both learning curves and the pictures that come out of these different experiments.

And I think hopefully it's been a great experience for Slav as well, because in the Reddit thread there are all kinds of folks pointing out other things that he could try and questions that weren't quite clear, and so now there's, summarized in that thread, a whole list of things that could perhaps be done next, opening up a whole set of interesting questions.

Another post, which I'm not even sure is officially posted yet, but I got the early bird version from Brad, is this crazy thing. Here is Kanye drawn using a brush of Captain Jean-Luc Picard. In case you're wondering whether that's really him, I will show you the zoomed-in version.

It really is Jean-Luc Picard. And this is a really interesting idea, because he points out that generally speaking, when you try to use a non-artwork as your style image, it doesn't actually give very good results. Here's another example with a non-artwork: it doesn't give good results. It's kind of interesting, but it's not quite what he was looking for.

But if you tile it, it totally works, so here's Kanye using a Nintendo game controller brush. So then he tried out this Jean-Luc Picard and got okay results, and kind of realized that actually the size of the texture is pretty critical. I've never seen anybody do this before, so I think when this image gets shared on Twitter it's going to go everywhere, because it's just the freakiest thing.

Freaky is good. So I think I warned you guys about your projects when I first mentioned them as being something that's very easy to overshoot a little bit and spend weeks and weeks talking about what you're eventually going to do. You've had a couple of weeks. Really it would have been nice to have something done by now rather than spending a couple of weeks wondering about what to do.

So if your team is being a bit slow agreeing on something, just start working on something yourself. Or as a team, just pick something that you can do by next Monday and write up something brief about it. So for example, maybe you're thinking, okay, we might do the $1 million Data Science Bowl.

That's fine. You're not going to finish it by Monday, but maybe by Monday you could have written a blog post introducing what you can learn in a week about medical imaging. Oh, it turns out it uses something called DICOM. Here are the Python DICOM libraries, and we tried to use them, and these were the things that got us kind of confused, and these are the ways that we solved them.

And here's a Python notebook which shows you some of the main ways you can look at these DICOM files, for instance. So split up your project into little pieces. It's like when you enter a Kaggle competition: I always tell people, submit every single day and try to put in at least half an hour a day to make it slightly better than yesterday.

So how do you put in the first day's submission? What I always do on the first day is to submit the benchmark script, which is generally like all zeroes. And then the next day I try to improve it, so I'll put in all 0.5s, and the next day I'll try to improve it.

I'll be like, "Okay, what's the average for cats? The average for dogs?" I'll submit that. And if you do that every day for 90 days, you'll be amazed at how much you can achieve. Whereas if you wait two months and spend all that time reading papers and theorizing and thinking about the best possible approach, you'll discover that you haven't got any submissions in yet.

Or you finally get your perfect submission in and it goes terribly and now you don't have time to make it better. I think those tips are equally useful for Kaggle competitions as well as for making sure that at the end of this part of the course you have something that you're proud of, something that you feel you did a good job in a small amount of time.

If you try and publish something every week on the same topic, you'll be able to keep going further and further on that thing. I don't know what Slav's plans are, but maybe next week he'll follow up on some of the interesting research angles that came up on Reddit, or maybe Brad will follow up on some of his additional ideas from his post.

There's a lesson 10 wiki up already which has the notebooks, and just do a git pull on the GitHub repo to get the most up-to-date Python notebooks. Another thing that I wanted to point out is that in study groups — so we've been having study groups each Friday here, and I know some of you have had study groups elsewhere around the Bay Area.

Somebody there said to me, "I don't understand this Gram matrix stuff. I don't get what's going on. I understand the symbols, I understand the math, but what's going on?" I said maybe if you had a spreadsheet, it would all make sense. "Maybe I'll create a spreadsheet." Yes, do that! And 20 minutes later I turned to him and I said, "So how do you feel about Gram matrices now?" And he goes, "I totally understand them." And I looked over and he had created a spreadsheet.

This was the spreadsheet he created. It's a very simple spreadsheet where it's like here's an image where the pixels are just 1, -1, and 0. It has two filters, either 1 or -1. He has the flattened convolutions next to each other, and then he's created the little dot product matrix.
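If you'd rather see the same idea in code than in Excel, here's a minimal NumPy sketch of what that spreadsheet does — the toy image and filter values are made up, and conv2d_valid is just a naive loop for clarity:

```python
import numpy as np

# Toy version of the spreadsheet: a tiny image of 1/-1/0 pixels, two small
# filters of 1s and -1s, convolve each, flatten the outputs, and take the
# matrix of dot products between the flattened outputs (the Gram matrix).
img = np.array([[ 1, -1,  0,  1],
                [ 0,  1,  1, -1],
                [ 1,  0, -1,  1],
                [-1,  1,  0,  0]], dtype=float)

filters = [np.array([[ 1, -1], [-1,  1]], dtype=float),
           np.array([[ 1,  1], [-1, -1]], dtype=float)]

def conv2d_valid(x, f):
    # naive "valid" convolution (really cross-correlation), no padding
    h = x.shape[0] - f.shape[0] + 1
    w = x.shape[1] - f.shape[1] + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (x[i:i+f.shape[0], j:j+f.shape[1]] * f).sum()
    return out

# one flattened row per filter output, stacked side by side
flat = np.stack([conv2d_valid(img, f).ravel() for f in filters])  # shape (2, 9)
gram = flat @ flat.T                                              # shape (2, 2)
print(gram)
```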

I haven't been doing so much Excel stuff myself, but I think you'll learn a lot more by trying it yourself. Particularly if you try it yourself and can't figure out how to do it in Excel, then ask about it on the forums. I love Excel, so if you ask me questions about Excel, I will have a great time.

So last week we talked about the idea of learning with larger datasets. Our goal was to try and replicate the DeViSE paper. To remind you, the DeViSE paper is the one where we do a regular CNN, but the thing that we're trying to predict is not a one-hot encoding of the category, it's the word vector of the category.

So it's an interesting problem, but one of the things that's interesting about it is we have to use all of ImageNet, which has its own challenges. So last week we got to the point where we had created the word vectors. To remind you, we then had to map the word vectors to ImageNet categories.

There are 1000 ImageNet categories, so we had to create the word vector for each one. We didn't quite get all of them to match, but something like 2/3 of them matched, so we're working on 2/3 of ImageNet. We've got as far as reading all the file names for ImageNet, and then we're going to resize our images to 224x224.

I think it's a good idea to do some of this pre-processing upfront. Something that TensorFlow and PyTorch both do and Keras recently started doing is that if you use a generator, it actually does the image pre-processing in a number of separate threads in parallel behind the scenes. So some of this is a little less important than it was 6 months ago when Keras didn't do that.

It used to be that we had to spend a long time waiting for our data to get processed before it could get into the CNN. Having said that, particularly image resizing, when you've got large JPEGs, just reading them off the hard disk and resizing them can take quite a long time.

So I always like to do all that resizing upfront and end up with something in a nice convenient bcolz array. Amongst other things, it means that unless you have enough money to have a huge NVMe or SSD drive which you can put the entirety of ImageNet on, you probably have your big datasets on some kind of pretty slow spinning disk or slow RAID array.

One of the nice things about doing the resizing first is that it makes it a lot smaller, and you probably can then fit it on your SSD. There's lots of reasons that I think this is good. I'm going to resize all of the ImageNet images and put them in a bcolz array on my SSD.

So here's the path, and dpath is the path to my fast SSD mount point. We talked briefly about the resizing, and we're going to do a different kind of resizing. In the past, we've done the same kind of resizing that Keras does, which is to add a black border.

If you start with something that's not square and you make it square, you resize the largest axis to be the size of your square, which means you're left with a black border. I was concerned that a) any model where you have that is going to have to learn to model the black border, and b) you're kind of throwing away information.

You're not using the full size of the image. And indeed, every other library or pretty much every paper I've seen uses a different approach, which is to resize the smaller side of the image to the size of the square. Now the larger side is too big for your square, so you crop off the top and bottom, or crop off the left and right.

So this is called a center-cropping approach. Okay, that's true. But with the black border you're also throwing away compute. With the center-crop, you have a complete 224x224 full of meaningful pixels, whereas with a black border you have a 180-by-224 region with meaningful pixels and a whole bunch of black pixels.

Yeah, that can be a problem. It works well for ImageNet because ImageNet things are generally somewhat centered. You may need to do some kind of initial step to do a heat map or something like we did in lesson 7 to figure out roughly where the thing is before you decide where to center the crop.
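For reference, here's a minimal sketch of that center-cropping resize using PIL — the function name and the bilinear filter choice are mine, not necessarily what the course notebook uses:

```python
from PIL import Image
import numpy as np

def center_crop_resize(fname, size=224):
    # Resize the *smaller* side to `size`, then crop the larger side to a square
    im = Image.open(fname).convert('RGB')
    w, h = im.size
    scale = size / min(w, h)
    im = im.resize((int(round(w * scale)), int(round(h * scale))), Image.BILINEAR)
    w, h = im.size
    left, top = (w - size) // 2, (h - size) // 2   # crop equally off both ends
    return np.asarray(im.crop((left, top, left + size, top + size)))  # (size, size, 3)
```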

So these things are all compromises. But I got to say, since I switched to using this approach, I feel like my models have trained a lot faster and given better results, certainly the super resolution. I said last week that we were going to start looking at parallel processing. If you're wondering about last week's homework, we're going to get there, but some of the techniques we're about to learn, we're going to use to do last week's homework even better.

So what I want to do is I've got a CPU with something like 10 cores on it, and then each of those cores have hyperthreading, so each of those cores can do kind of two things at once. So I really want to be able to have a couple of dozen processes going on, each one resizing an image.

That's called parallel processing. Just to remind you, this is as opposed to vectorization, or SIMD, which is where a single thread operates on a bunch of things at a time. So we learned that to get SIMD working, you just have to install Pillow-SIMD, and it just happens — 600% speedup.

I tried it, it works. Now we're going to, as well as the 600% speedup, also get another 10 or 20x speedup by doing parallel processing. The basic approach to parallel processing in Python 3 is to set up something called either a process pool or a thread pool. So the idea here is that we've got a number of little programs running, threads or processes, and when we set up that pool, we say how many of those little programs do we want to fire up.

And then what we do is we say, okay, now I want you to use workers. I want you to use all of those workers to do some thing. And the easiest way to do a thing in Python 3 is to use Map. How many of you have used Map before?

So for those of you who haven't, Map is a very common functional programming construct that's found its way into lots of other languages, which simply says, loop through a collection and call a function on everything in that collection and return a new collection, which is the result of calling that function on that thing.

In our case, the function is resize, and the collection is imageNet_images. In fact, the collection is a bunch of numbers, 0, 1, 2, 3, 4, and so forth, and what the resize image is going to do is it's going to open that image off disk. So it's turning the number 3 into the third image resized, 224x224, and we'll return that.

So the general approach here — this is basically what it looks like to do parallel processing in Python. It may look a bit weird. We write result = exec.map(...): this is the function I want, this is the thing to map over, and then I say, for each thing in that list, do something.

Now this might make you think, well wait, does that mean this list has to have enough memory for every single resized image? And the answer is no, no it doesn't. One of the things that Python 3 uses a lot more is using these things they call generators, which is basically, it's something that looks like a list, but it's lazy.

It only creates that thing when you ask for it. So as I append each image, it's going to give me that image. And if this mapping is not yet finished creating it, it will wait. So this approach looks like it's going to use heaps of memory, but it doesn't.

It uses only the minimum amount of memory necessary and it does everything in parallel. So resizeImage is something which is going to open up the image, it's going to turn it into a NumPy array, and then it's going to resize it. And so then the resize does the center cropping we just mentioned, and then after it's resized it's going to get appended.
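Here's a minimal sketch of that pattern with Python 3's concurrent.futures — fnames, center_crop_resize and append_image (the helper discussed next) are stand-ins for the notebook's own names:

```python
from concurrent.futures import ThreadPoolExecutor  # or ProcessPoolExecutor

def resize_image(i):
    # turn an index into the i-th image, resized to 224x224, then append it
    append_image(center_crop_resize(fnames[i]))

with ThreadPoolExecutor(max_workers=16) as exec:
    # map() returns a lazy generator, so nothing is held in memory longer
    # than necessary; iterating it just waits for the workers to finish
    for _ in exec.map(resize_image, range(len(fnames))):
        pass
```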

What does appendImage do? So this is a bit weird. What's going on here? What it does is it's going to actually stick it into what we call a pre-allocated array. We're learning a lot of computer science concepts here. Anybody that's done computer science before will be familiar with all of this already.

If you haven't, you probably won't. But it's important to know that the slowest thing in your computer, generally speaking, is allocating memory. It's finding some memory, it's reading stuff from that memory, it's writing to that memory, unless of course it's like cache or something. And generally speaking, if you create lots and lots of arrays and then throw them away again, that's likely to be really, really slow.

So what I wanted to do was create a single 224x224 array which is going to contain my resized image, and then I'm going to append that to my bcolz array. The way you do that in Python is wonderfully easy. You can create a variable from this thing called threading.local.

It's basically something that looks a bit like a dictionary, but it's a very special kind of dictionary. It's going to create a separate copy of it for every thread or process. Normally when you've got lots of things happening at once, it's going to be a real pain because if two things try to use it at the same time, you get bad results or even crashes.

But if you allocate a variable like this, it automatically creates a separate copy in every thread. You don't have to worry about locks, you don't have to worry about race conditions, whatever. Once I've created this special threading.local variable, I then create a placeholder inside it which is just an array of zeros of size 224x224x3.

So then later on, I create my bcolz array, which is where I'm going to put everything eventually, and to append the image, I grab the bit of the image that I want, I put it into that preallocated thread-local variable, and then I append that to my bcolz array.
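Here's a hedged sketch of that thread-local placeholder idea — the lazy hasattr initialisation is my own way of making sure each worker thread gets its own buffer, and arr is assumed to be the bcolz array created earlier:

```python
import threading
import numpy as np

tl = threading.local()    # attributes set on this object are private to each thread

def append_image(img):
    # create this thread's 224x224x3 placeholder the first time it calls in
    if not hasattr(tl, 'place'):
        tl.place = np.zeros((224, 224, 3), dtype=np.float32)
    tl.place[:] = img       # reuse the buffer rather than allocating a new array
    arr.append(tl.place)    # arr: the bcolz array of resized images (assumed name)
```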

So there's lots of detail here in terms of using parallel processing effectively. I wanted to briefly mention it not because I think somebody who hasn't studied computer science is now going to go, "Okay, I totally understood all that," but to give you some of the things to like search for and learn about over the next week if you haven't done any parallel programming before.

You're going to need to understand thread local storage and race conditions. In Python, there's something called the global interpreter lock, which is one of the many awful things about Python, which is that in theory two things can't happen at the same time because Python wasn't really written in a thread-safe way.

The good news is that lots of libraries are written in a thread-safe way. So if you're using a library where most of its work is being done in C, as is the case with Pillow-SIMD, you actually don't have to worry about that. And I can prove it to you even, because I drew a little picture.

Here is the result of serial versus parallel. The serial version without SIMD is about 6 times bigger than this: the default Python code you would have written maybe before today's course would have taken 120 seconds to process 2,000 images. With SIMD, it's 25 seconds. With the process pool, it's 8 seconds for 3 workers and 5 seconds for 6 workers.

The thread pool is even better: 3.6 seconds for 12 workers and 3.2 seconds for 16 workers. Your mileage will vary depending on what CPU you have. Given that probably quite a lot of you are still using the P2, unless you've got your deep learning box up and running, you'll have the same performance as other people using the P2.

You should try something like this, which is to try different numbers of workers and see what's the optimal for that particular CPU. Now once you've done that, you know. Once I went beyond 16, I didn't really get improvements. So I know that on that computer, a thread pull of size 16 is a pretty good choice.

As you can see, once you get into the right general vicinity, it doesn't vary too much, so as long as you're roughly okay. (Just behind you, Rachel.) So that's the general approach here: run through something in parallel, each time appending it to my bcolz array. And at the end of that, I've got a bcolz array which I can use again and again.

So I don't re-run that code very often anymore. I've got all of ImageNet resized to each of 72x72, 224x224, and 288x288, and I give them different names and just use them like this. In fact, I think that's what Keras does now — I think it squishes.

Okay, so here's one of those things I'm not quite sure about. My guess is that I don't think it's a good idea, because you're now going to have dogs of various different squish levels, and your CNN is going to have to learn that. It's got another type of symmetry to learn about: level of squishiness.

Whereas if we keep everything of the same aspect ratio, I think it's going to be easier to learn so we'll get better results with less epochs of training. That's my theory and I'd be fascinated for somebody to do a really in-depth analysis of black borders versus center cropping versus squishing with image net.

So for now we can just open the bcolz array and there we go. So we're now ready to create our model. I'll run through this pretty quickly because most of it's pretty boring. The basic idea here is that we need to create an array of labels, which I call vecs, which contains, for every image in my bcolz array, the target word vector for that image.

Just to remind you, last week we randomly ordered the file names, so this big holes array is in random order. We've got our labels, which is the word vectors for every image. We need to do our normal pre-processing. This is a handy way to pre-process in the new version of Keras.

We're using the normal Keras ResNet model, the one that comes in keras.applications. It doesn't do the pre-processing for you, but if you create a lambda layer that does the pre-processing then you can use that lambda layer as the input tensor. So this whole thing now will do the pre-processing automatically without you having to worry about it.

So that's a good little trick. I'm not sure it's quite as neat as what we did in part 1 where we put it in the model itself, but at least this way we don't have to maintain a whole separate version of all of the models. So that's kind of what I'm doing nowadays.
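A minimal sketch of that trick — the mean values are the standard ImageNet channel means, and the exact numbers and API details may differ from the course notebook:

```python
import numpy as np
from keras.layers import Input, Lambda
from keras.applications.resnet50 import ResNet50

# Mean-subtract and swap RGB->BGR inside the graph, so the model can be fed
# raw pixel values and the preprocessing travels with it
rn_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)
preproc = lambda x: (x - rn_mean)[:, :, :, ::-1]

inp = Input((224, 224, 3))
rn_model = ResNet50(include_top=False, input_tensor=Lambda(preproc)(inp))
```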

When you're working on really big datasets, you don't want to process things any more than necessary and any more times than necessary. I know ahead of time that I'm going to want to do some fine-tuning. What I decided to do was I decided this is the particular layer where I'm going to do my fine-tuning.

So I decided to first of all create a model which started at the input and went as far as this layer. So my first step was to create that model and save the results of that. The next step will be to take that intermediate step and take it to the next stage I want to fine-tune to and save that.

So it's a little shortcut. There's a couple of really important intricacies to be aware of here though. The first one is you'll notice that ResNet and Inception are not used very often for transfer learning. This is something which I've not seen studied, and I actually think this is a really important thing to study.

Which of these things work best for transfer learning? But I think one of the difficulties is that ResNet and Inception are harder. The reason they're harder is that if you look at ResNet, you've got lots and lots of layers which make no sense on their own. Ditto for Inception.

They keep on splitting into 2 bits and then merging again. So what I did was I looked at the Keras source code to find out how each block is named. What I wanted to do was to say we've got a ResNet block, we've just had a merge, and then it goes out and it does a couple of convolutions, and then it comes back and does an addition.

Basically I want to get one of these. Unfortunately, for some reason, Keras does not name these merge layers. So what I had to do was get the next layer and then go back by 1. It kind of shows you how little people have been working with ResNet for transfer learning.
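In code, that "go back by one" looks something like the sketch below — the particular layer name is just an illustration of picking the first named layer of the following block and stepping back to the unnamed merge/activation before it:

```python
# Find the (unnamed) end of a ResNet block by locating the first conv of the
# *next* block, which does have a name, and stepping back by one layer
layer_names = [l.name for l in rn_model.layers]
idx = layer_names.index('res5a_branch2a')        # illustrative: the next block's first conv
mid_output = rn_model.layers[idx - 1].output     # the merge/activation just before it
```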

Literally the only bits of it that make sense to transfer learn from are nameless in one of the most popular things for transfer learning, Keras. There's a second complexity when working with ResNet. We haven't discussed this much, but ResNet actually has two kinds of ResNet blocks. One is this kind, which is an identity block, and the second kind is a ResNet convolution block, which they also call a bottleneck block.

It's pretty similar: one path goes up through a couple of convolutions and then gets added together, but the other side is not an identity. The other side is a single convolution. In ResNet they throw in one of these every half a dozen blocks or so.

Why is that? The reason is that if you only have identity blocks, then all it can really do is to continually fine-tune where it's at so far. We've learned quite a few times now that these identity blocks map to the residual, so they keep trying to fine-tune the types of features that we have so far.

Whereas these bottleneck blocks actually force it from time to time to create a whole different type of features because there is no identity path through here. The shortest path still goes through a single convolution. When you think about transfer learning from ResNet, you kind of need to think about should I transfer learn from an identity block before or after or from a bottleneck block before or after.

Again, I don't think anybody has studied this or at least I haven't seen anybody write it down. I've played around with it a bit and I'm not sure I have a totally decisive suggestion for you. Clearly my guess is that the best point to grab in ResNet is the end of the block immediately before a bottleneck block.

And the reason for that is that at that level of receptive field, obviously because each bottleneck block is changing the receptive field, and at that level of semantic complexity, this is the most sophisticated version of it because it's been through a whole bunch of identity blocks to get there.

So my belief is that you want to get just before that bottleneck is the best place to transfer learn from. So that's what this is. This is the spot just before the last bottleneck layer in ResNet. So it's pretty late, and so as we know very well from part 1 with transfer learning, when you're doing something which is not too different, and in this case we're switching from one-hot encoding to word vectors, which is not too different.

You probably don't want to transfer learn from too early, so that's why I picked this fairly late stage, which is just before the final bottleneck block. So the second complexity here is that this bottleneck block has these dimensions. The output is 14x14x1024. So we have about a million images, so a million by 14x14x1024 is more than I wanted to deal with.

So I did something very simple, which was I popped in one more layer after this, which is an average pooling layer, 7x7. So that's going to take my 14x14 output and turn it into a 2x2 output. So let's say one of those activations was looking for bird's eyeballs, then it's saying in each of the 14x14 spots, how likely is it that this is a bird's eyeball?

And so after this it's now saying in each of these 4 spots, on average, how much were those cells looking like bird's eyeballs? This is losing information. If I had a bigger SSD and more time, I wouldn't have done this. But it's a good trick when you're working with these fully convolutional architectures.
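For example, a hedged sketch of that pooling trick, reusing mid_output and rn_model from the earlier sketch:

```python
from keras.layers import AveragePooling2D
from keras.models import Model

# 7x7 average pooling over a 14x14x1024 activation leaves 2x2x1024, which is
# small enough to precompute and store for roughly a million images
x = AveragePooling2D((7, 7))(mid_output)
rn_top = Model(rn_model.input, x)
```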

You can pop an average pooling layer anywhere and decrease the resolution to something that you feel like you can deal with. So in this case, my decision was to go to 2x2 by 1024. We had a question. I was going to ask, have we talked about why we do the merge operation in some of these more complex models?

We have, quite a few times — basically the merge is the thing which does the plus here. That's the trick to making it into a ResNet block: having the addition of the identity with the result of the convolutions. So recently I was trying to go from many filters.

So you kind of just talked about downsizing the size of the geometry. Is there a good best practice on going from, let's say, like 512 filters down to less? Or is it just as simple as doing convolution with less filters? Yeah, there's not exactly a best practice for that.

But in a sense, every single successful architecture gives you some insights about that, because every one of them eventually has to end up with 1,000 categories if it's ResNet, or three channels of continuous 0–255 values if it's generative. So the best thing you can really do — well, there's two things. One is to kind of look at the successful architectures.

Another thing is, although this week is kind of the last week where we're mainly going to be looking at images, I am going to briefly next week open with a quick run through some of the things that you can look at to learn more. And one of them is going to be a paper.

In fact, two different papers which have best practices — really nice descriptions of these hundred different things, these hundred different results. But all this stuff is still pretty artisanal. Good question. So we initially resized images to 224, right? And it ended up being in a bcolz array already, right?

Yes. So it's like 50 gigs or something? Yes — and that's compressed; uncompressed it's like a couple of hundred gigs. But, well, if you load it into memory... I'm not going to load it into memory, you'll see. So what you do is kind of lazily load it?

It's getting there. Yeah. So that's exactly the right segue I was looking for, so thank you. So what we're going to do now is we want to run this model we just built, just call basically dot predict on it and save the predictions. The problem is that the size of those predictions is going to be bigger than the amount of RAM I have, so I need to do it a batch at a time and save it a batch at a time.

We've got a million things, each one with this many activations. And this is going to happen quite often, right? You're either working on a smaller computer, or you're working with a bigger dataset, or you're working with a dataset where you're using a larger number of activations. This is actually very easy to handle.

You just create your bcolz array where you're going to store it. And then all I do is I go from 0 to the length of my source array, a batch at a time. So this is creating the numbers 0, 128, 256, and so on and so forth.

And then I take the slice of my source array, first from 0 to 128, then from 128 to 256, and so forth. So this is now going to contain a slice of my source bcolz array. This is going to create a generator which is going to have all of those slices, and of course, being a generator, it's going to be lazy.

So I can then enumerate through each of those slices, and I can append to my bcolz array the result of predicting on just that one batch. So you've seen predict and evaluate and fit and so forth, and the generator versions. Also in Keras there's generally an on-batch version, so there's a train_on_batch and a predict_on_batch.

What these do is they basically have no smarts to them at all. This is like the most basic thing. So this is just going to take whatever you give it and call predict on this thing. It won't shuffle it, it won't batch it, it's just going to throw it directly into the computation graph.

So I'm just going to take a model, and it's going to call predict on just this batch of data. And then from time to time I print out how far I've gone, just so that I know how I'm going. Also from time to time I call .flush — that's the thing in bcolz that actually writes it to disk.
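Here's a hedged sketch of that save-as-you-go loop; rn_top and the resized-image array arr come from the earlier sketches, and the rootdir path is illustrative:

```python
import bcolz
import numpy as np

batch_size = 128
feats = bcolz.carray(np.empty((0, 2, 2, 1024), dtype=np.float32),
                     rootdir='results/feats.bc', mode='w')

for i in range(0, len(arr), batch_size):
    batch = arr[i:i + batch_size]                    # one slice of the source bcolz array
    feats.append(rn_top.predict_on_batch(batch))     # no shuffling, no extra smarts
    if i % (batch_size * 100) == 0:
        print(i)                                     # crude progress indicator
        feats.flush()                                # actually write to disk
feats.flush()
```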

So this thing doesn't actually take very long to run. And one of the nice things I can do here is I can do some data augmentation as well. So I've added a direction parameter, and what I'm going to do is I'm going to have a second copy of all of my images which is flipped horizontally.

So to flip things horizontally — that's interesting, I think I screwed this up. The axes are batch, then height (rows), and then this one is columns. So if we pass in a -1 step on that axis, it's going to flip it horizontally. That explains why some of my results haven't been quite as good as I hoped.
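Concretely, with (batch, rows, columns, channels) axes, the two flips look like this — mixing them up is exactly the bug being described:

```python
flipped_lr = batch[:, :, ::-1]   # reverse the columns axis: horizontal (left-right) flip
flipped_ud = batch[:, ::-1]      # reverse the rows axis: vertical flip - not what we want here
```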

So when you run this, we're going to end up with a big bcolz array that's going to contain, for two copies of every resized ImageNet image, the activations at the layer that we chose, one layer before this. So I call it once with direction forwards and once with direction backwards.

So at the end of that, I've now got nearly 2 million activations of 2x2x1024. So that's pretty close to the end of ResNet. I've then just copied and pasted from the Keras code the last few steps of ResNet. So this is the last few blocks. I added in one extra identity block just because I had a feeling that might help things along a little bit.

Again, people have not really studied this yet, so I haven't had a chance to properly experiment, but it seemed to work quite well. This is basically copied and pasted from Keras's code. I then need to copy the weights from Keras for those last few layers of ResNet. So now I'm going to repeat the same process again, which is to predict on these last few layers.
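Copying those weights across can be done by matching layer names, something like this hedged sketch (rn_full and top_model are placeholder names for the stock ResNet50 and the hand-built final blocks):

```python
# Copy weights for every layer in the hand-built top model that shares a name
# with a layer in the stock ResNet50; freshly added layers (e.g. the extra
# identity block) keep their random initialisation
src_layers = {l.name: l for l in rn_full.layers}
for l in top_model.layers:
    if l.name in src_layers and l.get_weights():
        l.set_weights(src_layers[l.name].get_weights())
```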

The input will be the output from the previous one. So we went like 2/3 of the way into ResNet and got those activations and put those activations into the last few stages of ResNet to get those activations. Now the outputs from this are actually just a vector of length 2048, which does fit in my RAM, so I didn't bother with calling predict on batch, I can just call .predict.

If you try this at home and don't have enough memory, you can use the predict on batch trick again. Any time you ran out of memory when calling predict, you can always just use this pattern. So at the end of all that, I've now got the activations from the penultimate layer of ResNet, and so I can do a usual transfer learning trick of creating a linear model.

My linear model is now going to try to use the number of dimensions in my word vectors as its output, and you'll see it doesn't have any activation function. That's because I'm not doing one hot encoding, my word vectors could be any size numbers, so I just leave it as linear.

And then I compile it, and then I fit it, and so this linear model is now my very first — this is almost the same as what we did in Lesson 1, Dogs vs. Cats. We're fine-tuning a model to a slightly different target to what it was originally trained with.

It's just that we're doing it with a lot more data, so we have to be a bit more thoughtful. There's one other difference here, which is I'm using a custom loss function. And the loss function I'm using is cosine distance. You can look that up at home if you're not familiar with it, but basically cosine distance says, for these two points in space, what's the angle between them, rather than how far away are they?

The reason we're doing that is because we're about to start using k nearest neighbors. So k nearest neighbors, we're going to basically say here's the word vector we predicted, which is the word vector which is closest to it. It turns out that in really really high dimensional space, the concept of how far away something is, is nearly meaningless.

And the reason why is that in really really high dimensional space, everything sits on the edge of that space. Basically because you can imagine as you add each additional dimension, the probability that something is on the edge in that dimension, let's say the probability that it's right on the edge is like 1/10.

Then if you've only got one dimension, you've got a probability of 1/10 that it's on the edge in that dimension. If you've got two dimensions, the probability that it's not near the edge in either of them decreases multiplicatively, and keeps shrinking with each extra dimension. So in a few-hundred-dimensional space, everything is on the edge. And when everything's on the edge, everything is kind of an equal distance away from each other, more or less.

And so distances aren't very helpful. But the angle between things varies. So when you're doing anything with trying to find nearest neighbors, it's a really good idea to train things using cosine distance. And this is the formula for cosine distance. Again, this is one of these things where I'm skipping over something that you'd probably spend a week in undergrad studying.

There's heaps of information about cosine distance on the web. So for those of you already familiar with it, I won't waste your time. For those of you not, it's a very very good idea to become familiar with this. And feel free to ask on the forums if you can't find any material that makes sense.
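A minimal sketch of that loss and the linear model it's used with — the word-vector dimensionality and the 2048-dimensional input are assumptions based on the description above:

```python
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

def cos_distance(y_true, y_pred):
    # 1 minus cosine similarity, averaged over the batch
    y_true = K.l2_normalize(y_true, axis=-1)
    y_pred = K.l2_normalize(y_pred, axis=-1)
    return K.mean(1. - K.sum(y_true * y_pred, axis=-1))

n_dims = 300   # dimensionality of the word vectors (illustrative)
lin_model = Sequential([Dense(n_dims, input_shape=(2048,))])   # no activation: plain linear
lin_model.compile(optimizer='adam', loss=cos_distance)
```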

So we've fitted our linear model. As per usual, we save our weights. And we can see how we're going. So what we've got now is something where we can fit in an image, and it will spit out a word vector. But it's something that looks like a word vector.

It has the same dimensionality as a word vector. But it's very unlikely that it's going to be the exact same vector as one of our thousand target word vectors. So if the word vector for a pug is this list of 200 floats, even if we have a perfectly puggy pug, we're not going to get that exact list of 200 floats.

We'll have something that is similar. And when we say similar, we probably mean that the cosine distance between the perfect platonic pug and our pug is pretty small. So that's why after we get our predictions, we then have to use nearest neighbors as a second step to basically say, for each of those predictions, what are the three word vectors that are the closest to that prediction?
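That nearest-neighbour step can be as simple as this sketch with scikit-learn — brute force with the cosine metric is plenty fast for only 1,000 target vectors (variable names are illustrative):

```python
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=3, metric='cosine', algorithm='brute')
nn.fit(imagenet_word_vecs)               # (1000, n_dims): one vector per ImageNet category
dists, idxs = nn.kneighbors(pred_vecs)   # pred_vecs: the model's predicted vectors
# idxs[i] holds the indices of the 3 closest categories for image i
```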

So we can now take those nearest neighbors and find out for a bunch of our images what are the three things it thinks it might be. For example, for this image here, its best guess was trombone, next was flute, and third was cello. This gives us some hope that this approach seems to be working okay.

It's not great yet, but it's recognized these things are musical instruments, and its third guess was in fact the correct musical instrument. So we know what to do next. What we do next is to fine-tune more layers. And because we have already saved the intermediate results from an earlier layer, that fine-tuning is going to be much faster to do.

Two more things I briefly mentioned. One is that there's a couple of different ways to do nearest neighbors. One is what's called the brute force approach, which is literally to go through everyone and see how far away it is. There's another approach which is approximate nearest neighbors. And when you've got lots and lots of things, you're trying to look for nearest neighbors, the brute force approach is going to be n^2 time.

It's going to be super slow. Approximate nearest neighbors are generally n log n time. So orders of magnitude faster if you've got a large dataset. The particular approach I'm using here is something called locality-sensitive hashing. It's a fascinating and wonderful algorithm. Anybody who's interested in algorithms, I strongly recommend you go read about it.

Let me know if you need a hand with it. My favorite kind of algorithms are these approximate algorithms. In data science, you almost never need to know something exactly, yet nearly every algorithm that people learn at university and certainly at high school are exact. We learn exact nearest neighbor algorithms and exact indexing algorithms and exact median algorithms.

Pretty much for every algorithm out there, there's an approximate version that runs an order of magnitude or more faster. One of the cool things is that once you start realizing that, you suddenly discover that all of the libraries you've been using for ages were written by people who didn't know this.

And then you realize that for every sub-algorithm they've written, they could have used an approximate version. The next thing you know, you've got something that runs a thousand times faster. The other cool thing about approximate algorithms is that they're generally written to be provably accurate to within some bound.

And it can tell you with your parameters how close is so close, which means that if you want to make it more accurate, you run it more times with different random seeds. This thing called LSH forest is a locality-sensitive hashing forest which means it creates a bunch of these locality-sensitive hashes.

And the amazingly great thing about approximate algorithms is that each time you create another version of it, you're multiplicatively decreasing the error but only linearly increasing the time. So if the error on one call of LSH was e, then the error on two calls is e², i.e. the accuracy is 1 - e².

On 3 calls the accuracy is 1 - e³, and the time you're taking is now just 2n and 3n. So when you've got something where you can make it as accurate as you like with only a linear increase in time, that's incredibly powerful. This is a great approximation algorithm. I wish we had more time — I'd love to tell you all about it.

So I generally use LSH forest when I'm doing nearest neighbors because it's arbitrarily close and much faster when you've got lots of word vectors. The time that becomes important is when I move beyond ImageNet, which I'm going to do now. So let's say I've got a picture, and I don't just want to say which one of the thousand ImageNet categories is it.

Which one of the 100,000 WordNet nouns is it? That's a much harder thing to do. And that's something that no previous model could do. When you trained an ImageNet model, the only thing you could do is recognize pictures of things that were in ImageNet. But now we've got a word vector model, and so we can put in an image that spits out a word vector, and that word vector could be closer to things that are not in ImageNet at all.

Or it could be some higher level of the hierarchy, so we could look for a dog rather than a pug, or a plane rather than a 747. So here we bring in the entire set of word vectors. I'll have to remember to share these with you because these are actually quite hard to create.

And this is where I definitely want LSHForest, because this is going to be pretty slow. And we can now do the same thing. And not surprisingly, it's got worse. The thing that was actually a cello — now cello is not even in the top 3. So this is a harder problem.
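With ~100,000 noun vectors the approximate version pays off; here's a hedged sketch using scikit-learn's LSHForest (note it only exists in older scikit-learn releases, and the parameters are illustrative):

```python
from sklearn.neighbors import LSHForest   # present in older scikit-learn versions

lshf = LSHForest(n_estimators=20, n_neighbors=3)
lshf.fit(all_noun_vecs)                    # (~100000, n_dims) word2vec vectors for WordNet nouns
dists, idxs = lshf.kneighbors(pred_vecs)   # approximate 3 nearest nouns per image
```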

So let's try fine-tuning. So fine-tuning is the final trick I'm going to show you. (Just behind you, Rachel.) You might remember last week we looked at creating our word vectors, and what we did was actually I created a list: I went to WordNet and I downloaded the whole of WordNet, and then I figured out which things were nouns, and then I used a regex to parse those out, and then I saved that.

So we actually have the entirety of WordNet nouns. Because it's not a good enough model yet. So now that there's 80,000 nouns, there's a lot more ways to be wrong. So when it only has to say which of these thousand things is it, that's pretty easy. Which of these 80,000 things is it?

It's pretty hard. To fine-tune it, it looks very similar to our usual way of fine-tuning things, which is that we take our two models and stick them back to back, and we're now going to train the whole thing rather than just the linear model. The problem is that the input to this model is too big to fit in RAM.

So how are we going to call fit or fit_generator when we have an array that's too big to fit in RAM? Well, one obvious thing to do would be to pass in the bcolz array, because to most things in Python, a bcolz array looks just like a regular array.

It doesn't really look any different. A bcolz array is actually stored in a directory, as I'm sure you've noticed. And in that directory, it's got something called a chunk length — I set it to 32 when I created these bcolz arrays. What it does is it takes every 32 images and puts them into a separate file.

Each one of these has 32 images in it, or 32 of the leading axis of the array. Now if you then try to take this whole array and pass it to .fit in Keras with shuffle, it's going to try and grab one thing from here and one thing from here and one thing from here.

Here's the bad news. For bcolz to get one thing out of a chunk, it has to read and decompress the whole thing. It has to read and decompress 32 images in order to give you the one image you asked for. That would be a disaster — that would be ridiculously, horribly slow.

We didn't have to worry about that when we called predict_on_batch. We weren't shuffling; we were going in order, so it was never grabbing a single image out of the middle of a chunk. But now that we want to shuffle, it would. So what we've done — somebody very helpfully, actually on a Kaggle forum, provided something called a bcolz array iterator.

The bcolz array iterator, which was kindly discovered on the forums by somebody named MP Janssen and originally written by this fellow, provides a Keras-compatible generator which grabs an entire chunk at a time. So it's a little bit less random, but given that this has got 2 million images in it and the chunk length is 32, it's basically going to create a batch of chunks rather than a batch of images.

And so that means we have none of the performance problems, particularly because we randomly shuffled our files, so this whole thing is randomly shuffled anyway. So this is a good trick. You'll find the bcolz array iterator on GitHub. Feel free to take a look at the code.

It's pretty straightforward. There were a few issues with the original version, so MP Janssen and I have tried to fix it up and I've written some tests for it and he's written some documentation for it. But if you just want to use it, then it's as simple as writing this.

You write something like it = BcolzArrayIterator(your_data, your_labels, shuffle=True, batch_size=whatever), and then you can just call fit_generator as per usual, passing in that iterator and that iterator's number of items. So to all of you guys who have been asking how to deal with data that's bigger than memory, this is how you do it.
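In code, the usage pattern is roughly the sketch below — treat the constructor arguments and attribute names as illustrative and check the class on GitHub for the exact signature (for instance, the batch size is expected to play nicely with the chunk length):

```python
from bcolz_array_iterator import BcolzArrayIterator   # module name as in the course repo (assumed)

it = BcolzArrayIterator(features, labels, shuffle=True,
                        batch_size=features.chunklen * 4)    # a multiple of the chunk length
model.fit_generator(it, it.N, nb_epoch=1)                    # Keras 1-era call; newer Keras uses steps
```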

So hopefully that will make life easier for a lot of people. So we fine-tune it for a while, we do some learning annealing for a while, and this basically runs overnight for me. It takes about 6 hours to run. And so I come back the next morning and I just copy and paste my k nearest neighbors, so I get my predicted word vectors.

For each word vector, I then pass it into nearest neighbors. This is my just 1000 categories. And lo and behold, we now have cello in the top spot as we hoped. How did it go in the harder problem of looking at the 100,000 or so nouns in English? Pretty good.

I've got this one right. And just to pick another one at random, let's pick the first one. It said throne. That sure looks like a throne. So looking pretty good. So here's something interesting. Now that we have brought images and words into the same space, let's play with it some more.

So why don't we use nearest neighbors with those predictions against the word vectors which Google created — but the subset of those which are nouns according to WordNet, mapped to their synset IDs. The word vectors are just the word2vec vectors that we can download off the internet. They were pre-trained by Google.

We're saying here is this image spits out a vector from a thing we just trained. We have 100,000 word2vec vectors for all the nouns in English. Which one of those is the closest to the thing that came out of our model? And the answer was throne. Hold that thought.

We'll be doing language translation starting next week. No, we don't quite do it that way, but you can think of it like that. So let's do something interesting. Let's create a nearest neighbors not for all of the word2vec vectors, but for all of our image-predicted vectors. And now we can do the opposite.

Let's take a word — we pick it at random. Let's look it up in our word2vec dictionary, and let's find the nearest neighbors for that in our images. There it is. So this is pretty interesting. You can now find the images that are the most like whatever word you come up with.

Okay, that's crazy, but we can do crazier. Here is a random thing I picked. Now notice I picked it from the validation set of ImageNet, so we've never seen this image before. And honestly when I opened it up, my heart sank because I don't know what it is. So this is a problem.

What is that? So what we can do is we can call.predict on that image, and we can then do a nearest neighbors of all of our other images. There's the first, there's the second, and the third one is even somebody putting their hand on it, which is slightly crazy, but that was what the original one looked like.

In fact, if I can find it, I ran it again on a different image. I actually looked around for something weird. This is pretty weird, right? Is this a net or is it a fish? So when we then ask for nearest neighbors, we get fish in nets. So it's like, I don't know, sometimes deep learning is so magic you just kind of go wow. (Just behind you, Rachel.)

Only a little bit, and maybe in a future course we might look at Dask. I think maybe even in your numerical and algebra course you might be looking at Dask. I don't think we'll cover it this course. But do look at Dask, D-A-S-K, it's super cool. No, not at all.

So these were actually labeled as this particular kind of fish. In fact that's the other thing is it's not only found fish in nets, but it's actually found more or less the same breed of fish in the nets. But when we called dot predict on those, it created a word vector which was probably like halfway between that kind of fish and a net because it doesn't know what to do, right?

So sometimes when it sees things like that, it would have been marked in imageNet as a net, and sometimes it would have been a fish. So the best way to minimize the loss function would have been to kind of hedge. So it hedged and as a result the images that were closest were the ones which actually were halfway between the two themselves.

So it's kind of a convenient accident. You absolutely can and I have, but really for nearest neighbors I haven't found anything nearly as good as cosine and that's true in all of the things I looked up as well. By the way, I should mention when you use locality-sensitive hashing in Python, by default it uses something that's equivalent to the cosine metric, so that's why the nearest neighbors work.

So starting next week we're going to be learning about sequence-to-sequence models and memory and attention methods. They're going to show us how we can take an input such as a sentence in English and an output such as a sentence in French, which is the particular case study we're going to be spending 2 or 3 weeks on.

When you combine that with this, you get image captioning. I'm not sure if we're going to have time to do it ourselves, but it will literally be trivial for you guys to take the two things and combine them and do image captioning. It's just those two techniques together. So we're now going to switch to -- actually before we take a break, I want to show you the homework.

Hopefully you guys noticed I gave you some tips because it was a really challenging one. Even though in a sense it was kind of straightforward, which was take everything that we've already learned about super-resolution and slightly change the loss function so that it does perceptual losses for style transfer instead, the details were tricky.

I'm going to quickly show you two things. First of all, I'm going to show you how I did the homework, because I actually hadn't done it last week. Luckily I have enough RAM that I could read the two things all into memory — don't forget you can just do that with a bcolz array by indexing it with [:] to turn it into a NumPy array in memory.

So one thing I did was I created my up-sampling block to get rid of the checkerboard patterns. That was literally as simple as saying up-sampling 2D and then a 1x1 conv. So that got rid of my checkerboard patterns. The next thing I did was I changed my loss function and I decided before I tried to do style transfer with perceptual losses, let's try and do super-resolution with multiple content-loss layers.

That's one thing I'm going to have to do for style transfer is be able to use multiple layers. So I always like to start with something that works and make small little changes so it keeps working at every point. So in this case, I thought, let's first of all slightly change the loss function for super-resolution so that it uses multiple layers.

So here's how I did that. I changed my get_output layer — sorry, I changed my VGG content model so it created a list of outputs: conv1 from each of the first, second and third blocks. And then I changed my loss function so it went through and added up the mean squared difference for each of those three layers.

I also decided to add a weight just for fun. So I decided to go 0.1, 0.8, 0.1 because this is the layer that they used in the paper. But let's have a little bit of more precise super-resolution and a little bit of more semantic super-resolution and see how it goes.
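Here's a hedged sketch of what that weighted, multi-layer content loss might look like with the Keras backend — outs and targs stand for the lists of three activation tensors for the network output and the target image:

```python
from keras import backend as K

w = [0.1, 0.8, 0.1]   # weights for block1/2/3 conv1, as discussed above

def mean_sqr(diff):
    # mean squared error over every axis except the batch axis
    dims = list(range(1, K.ndim(diff)))
    return K.mean(K.square(diff), dims)

def content_loss(outs, targs):
    loss = 0
    for o, t, wi in zip(outs, targs, w):
        loss = loss + wi * mean_sqr(o - t)
    return loss
```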

I created this function to do a more general mean squared error. And that was basically it. Other than that line everything else was the same, so that gave me super-resolution working on multiple layers. One of the things I found fascinating is that this is the original low-res, and it's done a good job of upscaling it, but it's also fixed up the weird white balance, which really surprised me.

It's taken this obviously over-yellow shot, and this is what ceramic should look like, it should be white. And somehow it's adjusted everything, so the serviette or whatever it is in the background has gone from a yellowy-brown to a nice white, as with these cups here. It's figured out that these slightly pixelated things are actually meant to be upside-down handles.

This is on only 20,000 images. I'm very surprised that it's fixing the color because we never asked it to, but I guess it knows what a cup is meant to look like, and so this is what it's decided to do, is to make a cup the way it thinks it's meant to look.

So that was pretty cool. So then to go from there to style-transfer was pretty straightforward. I had to read in my style as before. This is the code to do this special kind of resnet block where we use valid convolutions, which means we lose two pixels each time, and so therefore we have to do a center crop.

So don't forget, lambda layers are great for this kind of thing. Whatever code you can write, chuck it in a lambda layer, and suddenly it's a Keras layer. So I do my center crop. This is now a resnet block which does valid convolutions. This is basically all exactly the same.
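Here's a hedged sketch of such a block in the Keras 1-era functional API (newer Keras renames Convolution2D and merge); it assumes channels-last ordering, and the notebook's own version may differ in details:

```python
from keras.layers import Lambda, Convolution2D, merge   # Keras 1-era API

def res_crop_block(ip, nf=64):
    # two 3x3 "valid" convs lose 2 pixels per side in total, so centre-crop the
    # identity path by 2 pixels per side before adding it back on
    x = Convolution2D(nf, 3, 3, border_mode='valid', activation='relu')(ip)
    x = Convolution2D(nf, 3, 3, border_mode='valid')(x)
    ip_crop = Lambda(lambda t: t[:, 2:-2, 2:-2])(ip)     # crop height and width to match
    return merge([x, ip_crop], mode='sum')
```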

We have to do a few downsamplings, and then the computation, and our upsampling, just like the paper's supplementary material. So the loss function looks a lot like the loss function did before, but we've got two extra things. One is the Gram matrix. So here is a version of the Gram matrix which works a batch at a time.
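A hedged sketch of a batch-wise Gram matrix using the Keras backend — the normalisation constant is one common choice, and details may differ from the notebook:

```python
from keras import backend as K

def gram_matrix_b(x):
    # x: (batch, height, width, channels). Flatten each image's spatial grid and
    # take channel-by-channel dot products for the whole batch in one batch_dot.
    x = K.permute_dimensions(x, (0, 3, 1, 2))            # (batch, ch, h, w)
    s = K.shape(x)
    feat = K.reshape(x, (s[0], s[1], s[2] * s[3]))       # (batch, ch, h*w)
    gram = K.batch_dot(feat, K.permute_dimensions(feat, (0, 2, 1)))
    return gram / K.cast(s[1] * s[2] * s[3], 'float32')  # normalise by element count
```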

If any of you tried to do this a single image at a time, you would have gone crazy with how slow it took. I saw a few of you trying to do that. So here's the batch-wise version of Gram matrix. And then the second thing I needed to do was somehow feed in my style target.

Another thing I saw some of you do was feed the style target array into your loss function every time. You can obviously calculate your style target by just calling .predict with the thing which gives you all your different style target layers, but the problem is that this returns a NumPy array.

It's a pretty big NumPy array, which means that then when you want to use it as a style target in training, it has to copy that back to the GPU. And copying to the GPU is very, very slow. And this is a really big thing to copy to the GPU.

So any of you who tried this — and I saw some of you try it — it took forever. So here's the trick: call .variable on it. Turning something into a variable puts it on the GPU for you. So once you've done that, you can now treat this as a list of symbolic entities which are the GPU versions of this.

So I can now use this inside my GPU code. So here are my style targets I can use inside my loss function, and it doesn't have to do any copying backwards and forwards. So there's a subtlety, but if you don't get that subtlety right, you're going to be waiting for a week or so for your code to finish.
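A minimal sketch of that trick — style_model here is assumed to be a model with one output per chosen style layer, and style_img the style image array:

```python
import numpy as np
from keras import backend as K

# .predict gives NumPy arrays (one per output); wrapping them in K.variable
# copies them to the GPU once, so the loss function can use them without
# re-copying every batch
style_targ_arrays = style_model.predict(np.expand_dims(style_img, 0))
style_targs = [K.variable(a) for a in style_targ_arrays]
```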

So those were the little subtleties which were necessary to get this to work. And once you get it to work, it does exactly the same thing, basically, as before. So where this gets combined with DeViSE is I wanted to try something interesting, which is that in the original Perceptual Losses paper, they trained it on the COCO dataset, which has 80,000 images, which didn't seem like many.

I wanted to know what would happen if we trained it on all of ImageNet. So I did. I decided to train a super-resolution network on all of ImageNet. And the code's all identical, so I'm not going to explain it, other than you'll notice we don't have the [:] here anymore, because we don't want to try and read the entirety of ImageNet into RAM.

So these are still bcolz arrays. All the other code is identical until we get to here. So I use a bcolz array iterator. I can't just call .fit, because .fit or .fit_generator assumes that your iterator is returning your data and your labels. In our case, we don't have data and labels.

We have two things that both get fed in as two inputs, and our labels are just a list of zeros. So here's a good trick. This answers your earlier question about how you do multi-input models on large datasets: the answer is to create your own training loop which loops through a bunch of iterations, and then you can grab as many batches of data from as many different iterators as you like, and then call train_on_batch.

So in my case, my bcolz array iterator is going to return my high resolution and low resolution batches of images. So I go through a bunch of iterations, grab one batch of high res and low res images, and pass them as my two inputs to train_on_batch. So this is the only code I changed, other than changing .fit_generator to actually calling train.
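A minimal sketch of that kind of hand-rolled loop (bc_iter, model and the target shape are assumptions; the idea is just "grab batches yourself, call train_on_batch yourself"):

```python
import numpy as np

def train(bc_iter, model, niter, batch_size=16):
    # The model's output is already the loss value, so the target is just zeros;
    # its exact shape depends on how the loss output is defined in your model.
    targ = np.zeros((batch_size, 1))
    for i in range(niter):
        hr, lr = next(bc_iter)                  # one batch of high-res and low-res images
        model.train_on_batch([lr, hr], targ)    # two inputs, dummy target of zeros
```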

So as you can see, this took me 4.5 hours to train and I then decreased the learning rate and I trained for another 4.5 hours. Actually, I did it overnight last night and I only had enough time to do about half of ImageNet, so this isn't even the whole thing.

But check this out. So take that model and we're going to call .predict. This is the original high res image. Here's the low res version. And here's the version that we've created. And as you can see, it's done a pretty extraordinarily good job. When you look at the original ball, there was this kind of vague yellow thing here.

It's kind of turned it into a nice little version. You can see that her eyes were like two grey blobs; it's kind of turned them into some eyes. You could just tell that that's an A, maybe, if you looked carefully. Now it's very clearly an A. So you can see it does an amazing job of upscaling this.

Better still, this is a fully convolutional net and therefore is not specific to any particular input resolution. So what I can do is create another version of the model using our high res images as the input. So now we're going to call .predict with the high res input, and that's what we get back.
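A rough sketch of that trick (get_output is an assumed helper that wires the same upsampling architecture onto a given input tensor, and trained_model, hr_batch are placeholder names; the point is just that the same weights work at any resolution):

```python
from keras.layers import Input
from keras.models import Model

# Fully convolutional, so nothing ties the network to low-res inputs: rebuild it
# around a larger Input and copy the trained weights across (the layer stack must
# be identical for set_weights to line up).
inp_hr = Input((288, 288, 3))                       # illustrative high-res shape
big_model = Model(inp_hr, get_output(inp_hr))
big_model.set_weights(trained_model.get_weights())
preds = big_model.predict(hr_batch)                 # hr_batch: a batch of high-res images
```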

So look at that, we can now see all of this detail on the basketball, which simply, none of that really existed here. It was there, but pretty hard to see what it was. And look at her hair, this kind of grey blob here. Here you can see it knows it's like little bits of pulled back hair.

So we can take any sized image and make it bigger. This to me is one of the most amazing results I've seen in deep learning. When we train something on nearly all of ImageNet, it's a single epoch, so there's definitely no overfitting. And the fact that it's able to recognize what hair pulled back into a bun is meant to look like is a pretty extraordinary result, I think.

Something else which I only realized later is that it's all a bit fuzzy, right? And there's this arm in the background that's a bit fuzzy. The model knows that that is meant to stay fuzzy. It knows what out-of-focus things look like. Equally cool is not just how that A is now incredibly precise and accurate, but the fact that it knows that blurry things need to stay blurry.

I don't know if you're as amazed as this as I am, but I thought this was a pretty cool result. We could run this over a 24-hour period on maybe two epochs of all of ImageNet, and presumably it would get even better still. Okay, so let's take a 7-minute break and see you back here at 5 past 8.

Okay, thanks everybody. That was fun. So we're going to do something else fun. And that is to look at -- oh, before I continue, I did want to mention one thing in the homework that I changed, which is I realized in my manually created loss function, I was already doing a mean squared error in the loss function.

But then when I told Keras to make that thing as close to 0 as possible, I had to also give it a loss function, and I was giving it MSE. And effectively that was kind of squaring my squared errors, which seemed wrong. So I've changed it to MAE, mean absolute error.
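So the compile call now looks something like this (the optimizer choice here is illustrative; the point is just the loss):

```python
# The model's output is already a (squared) error term, so we drive it towards zero
# with mean absolute error rather than squaring it a second time.
model.compile(optimizer='adam', loss='mae')
```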

So when you look back over the notebooks, that's why: this is just to say, hey, get the loss as close to 0 as possible; I didn't really want to re-square it, that didn't make any sense. So that's why you'll see that minor change. The other thing to mention is that I noticed that when I retrained my super resolution on my new images that didn't have the black border, it gave good results much, much faster.

And so I really think that learning to put the black border back in seemed to take quite a lot of effort for it. So again, hopefully some of you are going to look into that in more detail. So we're going to learn about generative adversarial networks. This will kind of close off our deep dive into generative models as applied to images.

And just to remind you, the purpose of this has been to learn about generative models, not to specifically learn about super resolution or artistic style. But remember, these things can be used to create all kinds of images. So one of the groups is interested in taking a 2D photo and trying to turn it into something that you can rotate in 3D, or at least show a different angle of that 2D photo.

And that's a great example of something that this should totally work for. It's just a mapping from one image to some different image, which is like what would this image look like from above versus from the front. So keep in mind the purpose of this is just like in Part 1, we learned about classification, which you can use for 1000 things.

Now we're learning about generative models, which you can use for a different 1000 things. Now, any generative model you build, you can make it better by adding a generative adversarial network on top of it. And this is something I don't really feel has been fully appreciated. People I've seen generally treat GANs as a different way of creating a generative model.

But I think of this more as like, why not create a generative model using the kind of techniques we've been talking about. But then think of it this way. Think of all the artistic style stuff we were doing in my terrible attempt at a Simpsons cartoon version of a picture.

It looked nothing like a Simpsons. So what would be one way to improve that? One way to improve that would be to create two networks. There would be one network that takes our picture, which is actually not the Simpsons, and takes another picture that actually is the Simpsons. And maybe we can train a neural network that takes those two images and spits out something saying, Is that a real Simpsons image or not?

And this thing we'll call the discriminator. So we could easily train a discriminator right now. It's just a classification network. Just use the same techniques we used in Part 1. We feed it the two images, and it's going to spit out a 1 if it's a real Simpsons cartoon, and a 0 if it's Jeremy's crappy generative model of Simpsons.

That's easy, right? We know how to do that right now. Now, go and build another model. There's two images as inputs. So you would feed it one thing that's a Simpsons and one thing that's a generative output. It's up to you to feed it one of each. Or alternatively, you could feed it one thing.

In fact, probably easier is to just feed it one thing and it spits out, Is it the Simpsons or isn't it the Simpsons? And you could just mix them and match them. Actually, it's the latter that we're going to do, so that's probably easier. We're going to have one thing which is either not a Simpsons or it is a Simpsons, and we're going to have a mix of 50/50 of those two, and we're going to have something come out saying, "What do you think?

Is it real or not?" So this thing, this discriminator, from now on we'll probably generally be calling it D. So there's a thing called D. And we can think of that as a function. D is a function that takes some input, x, which is an image, and spits out a 1 or a 0, or maybe a probability.

So what we could now do is create another neural network. And what this neural network is going to do is it's going to take as input some random noise, just like all of our generators have so far. And it's going to spit out an image. And the loss function is going to be if you take that image and stick it through D, did you manage to fool it?

So could you create something where in fact we wanted to say, "Oh yeah, totally, that's a real Simpsons." So if that was our loss function, we're going to call the generator, we'll call it G. It's just something exactly like our perceptual losses style transfer model. It could be exactly the same model.

But the loss function is now going to be take the output of that and stick it through D, the discriminator, and try to trick it. So the generator is doing well if the discriminator is getting it wrong. So one way to do this would be to take our discriminator and train it as best as we can to recognize the difference between our crappy Simpsons and real Simpsons, and then get a generator and train it to trick that discriminator.

But now at the end of that, it's probably still not very good because you realize that actually the discriminator didn't have to be very good before because my Simpsons generators were so bad. So I could now go back and retrain the discriminator based on my better generated images, and then I could go back and retrain the generator.

And back and forth I go. And that is the general approach of a GAN, is to keep going back between two things, which is training a discriminator and training a generator using a discriminator as a loss function. So we've got one thing which is discriminator on some image, and another thing which is a discriminator on a generator on some noise.

In practice, these things are going to spit out probabilities. So that's the general idea. In practice, they found it very difficult to do this "train the discriminator as best as we can, stop; train the generator as best as we can, stop" and so on and so forth. So instead they do something a little different. The original GAN paper is called Generative Adversarial Nets.

And here you can see they've actually specified the loss function. So here it is in notation. They call it minimizing over the generator whilst maximizing over the discriminator; that's what the min max is referring to (it's written out below). What they do in practice is a batch at a time: they have a loop, and each time through the loop they take a single batch, put it through the discriminator, then take that same batch and stick it through the generator, and so on, a batch at a time.
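For reference, the objective they write down is the standard min-max game between the discriminator D and the generator G:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```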

So let's look at that. So here's the original GAN from that paper, and we're going to do it on MNIST. And what we're going to do is we're going to see if we can start from scratch to create something which can create images which the discriminator cannot tell whether they're real or fake.

And it's a discriminator that has learned to be good at discriminating real from fake pictures of MNIST digits. So we've loaded in MNIST, and the first thing they do in the paper is just use a standard multilayer perceptron. So I'm going to skip over the data setup and get straight to the perceptron.

So here's our generator. It's just a standard multilayer perceptron. And here's our discriminator, which is also a standard multilayer perceptron. The generator has a sigmoid activation, so in other words, we're going to spit out an image where all of the pixels are between 0 and 1. So if you want to print it out, we'll just multiply it by 255, I guess.
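As a very rough sketch of what those two MLPs might look like (layer sizes are illustrative, not the lesson's exact ones; images are assumed flattened to 784 values in [0, 1]):

```python
from keras.models import Sequential
from keras.layers import Dense

# Generator: 100 random numbers in, a 784-pixel "image" out, squashed to [0, 1].
G = Sequential([Dense(200, input_dim=100, activation='relu'),
                Dense(400, activation='relu'),
                Dense(784, activation='sigmoid')])

# Discriminator: a flattened image in, P(fake) out.
D = Sequential([Dense(300, input_dim=784, activation='relu'),
                Dense(1, activation='sigmoid')])
D.compile(optimizer='adam', loss='binary_crossentropy')
```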

So there's our generator, there's our discriminator. So there's then the combination of the two. So take the generator and stick it into the discriminator. We can just use sequential for that. And this is actually therefore the loss function that I want on my generator. Generate something and then see if you can fool the discriminator.
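Continuing the rough sketch above, stacking the two is just one line; this combined model is the thing we train the generator through:

```python
# Generate something, then ask the discriminator whether it looks real.
gan = Sequential([G, D])
gan.compile(optimizer='adam', loss='binary_crossentropy')
```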

So there's all my architectures set up. So the next thing I need to do is set up this thing called train, which is going to do this adversarial training. Let's go back and have a look at train. So what train is going to do is go through a bunch of epochs.

And notice here I wrap it in this TQDM. This is the thing that creates a nice little progress bar. Doesn't do anything else, it just creates a little progress bar. We learned about that last week. So the first thing I need to do is to generate some data to feed the discriminator.

So I've created a little function for that. And here's my little function. So it's going to create a little bit of data that's real and a little bit of data that's fake. So my real data is okay, let's go into my actual training set and grab some randomly selected MNIST digits.

So that's my real bit. And then let's create some fake. So noise is a function that I've just created up here, which creates 100 random numbers. So let's create some noise and call G.predict on it. And then I'll concatenate the two together. So now I've got some real data and some fake data.

And so this is going to try and predict whether or not something is fake. So 1 means fake, 0 means real. So I'm going to return my data and my labels, which is a bunch of 0s to say they're all real and a bunch of 1s to say they're all fake.
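A rough sketch of that data function (X_train is assumed to be the flattened MNIST training images, G the generator, and 100 the assumed noise dimension):

```python
import numpy as np

def noise(bs):
    # bs vectors of 100 random numbers each to seed the generator
    return np.random.rand(bs, 100)

def data_D(sz, G):
    # Half real MNIST digits, half generator output. Labels: 0 = real, 1 = fake.
    real_img = X_train[np.random.randint(0, len(X_train), size=sz)]
    fake_img = G.predict(noise(sz))
    X = np.concatenate((real_img, fake_img))
    y = np.array([0] * sz + [1] * sz)
    return X, y
```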

So that's my discriminator's data. So go ahead and create a set of data for the discriminator, and then do one batch of training. Now I'm going to do the same thing for the generator. But when I train the generator, I don't want to change the discriminator's weights. So make_trainable simply goes through each layer and says it's not trainable.

So make my discriminator non-trainable and do one batch of training where I'm taking noise as my inputs. And my goal is to get the discriminator to think that they are actually real. So that's why I'm passing in a bunch of 0s, because remember 0 means real. And that's it.

And then make the discriminator trainable again. So we keep looping through this: train the discriminator on a batch of half real, half fake, and then train the generator to try and trick the discriminator using all fake. Repeat. So that's the training loop. That's a basic GAN. Because we use tqdm, we get a nice little progress bar.
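Putting the loop together, a minimal sketch (reusing the noise and data_D helpers sketched above; nb_iter and sz are illustrative):

```python
import numpy as np
from tqdm import tqdm

def make_trainable(net, val):
    # Freeze or unfreeze every layer of a model.
    net.trainable = val
    for l in net.layers:
        l.trainable = val

def train(D, G, gan, nb_iter=5000, sz=128):
    for i in tqdm(range(nb_iter)):
        # One batch for the discriminator: half real, half fake.
        X, y = data_D(sz, G)
        D.train_on_batch(X, y)
        # One batch for the generator: freeze D and ask the combined model to
        # make D output 0 ("real") for purely fake inputs.
        make_trainable(D, False)
        gan.train_on_batch(noise(sz), np.zeros([sz, 1]))
        make_trainable(D, True)
```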

I kept track of the loss at each step, so there's our loss for the discriminator, and there's our loss for the generator. So our question is, what do these loss curves mean? Are they good or bad? How do we know? And the answer is, for this kind of GAN, they mean nothing at all.

The generator's loss could look fantastic, but it could be because the discriminator is terrible. We don't really know whether either one is good or not, so even the order of magnitude of both of them is meaningless. So these curves mean nothing. The direction of the curves means nothing. And this is one of the real difficulties with training GANs.

And here's what happens when I plot the generator's output for 12 randomly selected noise vectors. We have not got things that look terribly like MNIST digits, and they also don't look like they have much variety. This is called mode collapse. It's a very common problem when training GANs.

And what it means is that the generator and the discriminator have kind of reached a stalemate where neither of them basically knows how to go from here. And in terms of optimization, we've basically found a local minimum. So okay, that was not very successful. Can we do better? So the next major paper that came along was this one.

Let's go to the top so you can see it: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. So this created something that they called DCGANs. And the main page that you want to look at here is page 3, where they say, "Core to our approach is doing these three things." And basically what they do is they just do exactly the same thing as GANs, but they do three things.

One is to use, well, in fact all of them are the tricks that we've been learning for generative models: use an all-convolutional net, get rid of max pooling and use strided convolutions instead, get rid of fully connected layers and use lots of convolutional features instead, and add in batch norm.

And then use a CNN rather than an MLP. So here is that. This will look very familiar; it looks just like last lesson's stuff. So the generator is going to take in a random grid of inputs. It's going to do a batch norm, upsample -- you'll notice that I'm doing something even newer than this paper, I'm doing the upsampling approach, because we know that's better.

Upsample, 1x1 conv, batch norm; upsample, 1x1 conv, batch norm; and then a final conv layer. The discriminator basically does the opposite, which is some 2x2 subsampling, so downsampling in the discriminator. Another trick, which I think is mentioned in the paper, is that before you start the back and forth of a batch for the discriminator and a batch for the generator, you train the discriminator on its own for a fraction of an epoch, like a few batches.

So at least it knows how to recognize the difference between a random image and a real image a little bit. So you can see here I actually just start by calling discriminator.fit with just a very small amount of data. So this is kind of like bootstrapping the discriminator. And then I just go ahead and call the same train as we had before with my better architectures.
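In code, that bootstrapping step is just a brief ordinary fit on a small chunk of half real / half fake data, reusing the data_D helper sketched earlier (sizes and the Keras-1-era argument names here are illustrative):

```python
# Give the discriminator a head start before the adversarial loop begins.
X, y = data_D(256, G)
D.fit(X, y, batch_size=128, nb_epoch=1, verbose=2)
```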

And again, these curves are totally meaningless. But we have something which, if you squint, you could almost convince yourself is a digit. So until a week or two before this course started, this was kind of about as good as we had. People were much better at the artisanal details of this than I was, and indeed there's a whole page called GAN Hacks which has lots of tips.

But then, a couple of weeks before this class started, as I mentioned in the first class, along came the Wasserstein GAN. And the Wasserstein GAN got rid of all of these problems. And here is the Wasserstein GAN paper. And this paper is quite an extraordinary paper. It's particularly extraordinary because, and I think I mentioned this in the first class of this part, most papers tend to either be math theory that goes nowhere, or kind of nice experiments and engineering where the theory bit is hacked on at the end and kind of meaningless.

This paper is entirely driven by theory, and then the theory goes on to show this is what the theory means, this is what we do, and suddenly all the problems go away. The loss curves are going to actually mean something, and we're going to be able to do what I said we wanted to do right at the start of this GAN section, which is to train the discriminator a whole bunch of steps and then do a generator, and then discriminator a whole bunch of steps and do the generator.

And all that is going to suddenly start working. How do we get it to work? In fact, despite the fact that this paper is both long and full of equations and theorems and proofs, and there's a whole bunch of appendices at the back with more theorems and proofs, there's actually only two things we need to do.

One is remove the log from the loss function. So rather than using cross-entropy loss, we're just going to use mean squared error. That's one change. The second change is we're going to constrain the weights so that they lie between -0.01 and +0.01. So we're going to constrain the weights to make them small.
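As a rough sketch of what those two changes might look like in Keras terms (D and gan are placeholder names; RMSprop here just follows the paper's advice to avoid momentum-based optimizers):

```python
import numpy as np

# Change 1: drop the log-based loss. As described above, compile with mean squared
# error instead of cross-entropy for both the discriminator and the combined model.
D.compile(optimizer='rmsprop', loss='mse')
gan.compile(optimizer='rmsprop', loss='mse')

# Change 2: after every discriminator update, clip all of its weights into a small
# box around zero.
def clip_weights(net, c=0.01):
    for l in net.layers:
        l.set_weights([np.clip(w, -c, c) for w in l.get_weights()])
```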

Now saying that's all we're going to do kind of massively undersells this paper, because what this paper did is figure out that that's what we need to do. On the forums, some of you have been reading through this paper, and I've already given you some tips pointing to a really great walkthrough.

I'll put it on our wiki; it explains all the math from scratch. But basically what the math says is this: the loss function for a GAN is not really the loss function you put into Keras. We thought we were just putting in a cross-entropy loss function, but in fact what we really care about is the difference between two distributions, the distribution of the real data and the distribution the generator produces.

And the difference between two distributions has a very different shape from a loss function on its own. So it turns out that the difference implied by the two cross-entropy loss functions is something called the Jensen-Shannon distance. And this paper shows that that loss function is hideous: it is not nicely differentiable, and it does not have a nice smooth shape at all.
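For reference, the Jensen-Shannon divergence between the real data distribution P_r and the generator's distribution P_g is defined as:

```latex
\mathrm{JS}(P_r, P_g) =
  \tfrac{1}{2}\,\mathrm{KL}\!\left(P_r \,\middle\|\, \tfrac{P_r + P_g}{2}\right)
  + \tfrac{1}{2}\,\mathrm{KL}\!\left(P_g \,\middle\|\, \tfrac{P_r + P_g}{2}\right)
```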

So it kind of explains why it is that we kept getting this mode collapse and failing to find nice minimums. Mathematically, this loss function does not behave the way a good loss function should. And previously we've not come across anything like this because we've been training a single function at a time.

We really understand those loss functions, mean squared error, cross-entropy. Even though we haven't always derived the math in detail ourselves, plenty of people have. We know that they're kind of nice and smooth and that they have pretty nice shapes and they do what we want them to do. In this case, by training two things adversarially against each other, we're actually doing something quite different.

This paper just absolutely fantastically shows, with both examples and with theory, why that's just never going to work. The cosine distance is a distance between two vectors, whereas the distances that we're talking about here are distances between two distributions, which is a much trickier problem to deal with.

The cosine distance, actually if you look at the notebook during the week, you'll see it's basically the same as the Euclidean distance, but you normalize the data first. So it has all the same nice properties that the Euclidean distance did. The authors of this paper released their code in PyTorch.

Luckily, the first kind of pre-release of PyTorch came out in mid-January. You won't be surprised to hear that one of the authors of the paper is the main author of PyTorch. So he was writing this before he had even released the code. There's lots of reasons we want to learn PyTorch anyway, so here's a good reason.

So let's look at the Wasserstein GAN in PyTorch. Most of the code, in fact other than this pretty much all the code I'm showing you in this part of the course, is very loosely based on lots of bits of other code, which I had to massively rewrite because all of it was wrong and hideous.

This code actually I only did some minor refactoring to simplify things, so this is actually very close to their code. So it was a very nice paper with very nice code, so that's a great thing. So before we look at the Wasserstein GAN in PyTorch, let's look briefly at PyTorch.

Basically what you're going to see is that PyTorch looks a lot like NumPy, which is nice. We don't have to create a computational graph using variables and placeholders and later on run it in a session. I'm sure you've seen by now with Keras and TensorFlow that if you try to print out some intermediate output, it just prints something like "Tensor" and tells you how many dimensions it has.

And that's because all that thing is is a symbolic part of a computational graph. PyTorch doesn't work that way. PyTorch is what's called a defined-by-run framework. It's basically designed to be so fast to take your code and compile it that you don't have to create that graph in advance.

Every time you run a piece of code, it puts it on the GPU, runs it, sends it back all in one go. So it makes things look very simple. So this is a slightly cut-down version of the PyTorch tutorial that PyTorch provides on their website. So you can grab that from there.

So rather than creating np.array, you create torch.tensor. But other than that, it's identical. So here's a random torch.tensor. APIs are all a little bit different. Rather than dot shape, it's dot size. But you can see it looks very similar. And so unlike in TensorFlow or Theano, we can just say x + y, and there it is.

We don't have to say z = x + y, then f = function with x and y as inputs and z as the output, and then call f.eval(). No, you just go x + y, and there it is. So you can see why it's called defined-by-run. We just provide the code and it just runs it.

Generally speaking, most operations in Torch, as well as having this infix version, also have a prefix version, so torch.add(x, y) is exactly the same thing. You can nearly always add an out= argument too, and that puts the result into preallocated memory. We've already talked about why it's really important to preallocate memory.

It's particularly important on GPUs. So if you write your own algorithms in PyTorch, you'll need to be very careful of this. Perhaps the best trick is that you can stick an underscore on the end of most things, and that makes the operation happen in place: y.add_(x) is basically y += x, and that's what the trailing underscore means.
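Putting those pieces together, a tiny sketch of the basics:

```python
import torch

x = torch.rand(5, 3)
y = torch.rand(5, 3)

print(x.size())            # .size() rather than .shape
print(x + y)               # defined-by-run: evaluated immediately, no session needed
print(torch.add(x, y))     # the prefix form of the same operation

result = torch.Tensor(5, 3)        # preallocated output buffer
torch.add(x, y, out=result)        # write the result into it, no new allocation

y.add_(x)                  # trailing underscore = in place, i.e. y += x
```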

So there are some good little tricks. You can do slicing just like NumPy. You can turn Torch tensors into NumPy arrays and vice versa by simply calling .numpy(). One thing to be very aware of is that a and b are then referring to the same memory.

So if I now do an in-place a.add_(1), it also changes b. Vice versa, you can turn NumPy into Torch by calling torch.from_numpy. And again, same thing: if you change the NumPy array, it changes the Torch tensor. All of that so far has been running on the CPU.
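For example:

```python
import torch
import numpy as np

a = torch.ones(5)
b = a.numpy()              # b shares a's memory
a.add_(1)
print(b)                   # [2. 2. 2. 2. 2.]  -- changing a changed b

c = np.ones(3)
d = torch.from_numpy(c)    # same deal in the other direction
np.add(c, 1, out=c)
print(d)                   # the torch tensor has changed too
```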

To turn anything into something that runs on the GPU, you chuck .cuda() on the end of it. So this x + y just ran on the GPU. So where things get cool is that something like this knows not just how to do that piece of arithmetic, but it also knows how to take the gradient of it.

To make anything into something which calculates gradients, you just take your Torch tensor, wrap it in Variable, and add the requires_grad=True parameter to it. From now on, anything I do to x, it's going to remember what I did, so that it can take the gradient of it. For example, I can do x + 2 and get a result back just like with a normal tensor.

So a variable and a tensor have the same API, except that I can keep doing things to it: square it, multiply by 3, take the mean. Later on, I can call .backward() and then .grad, and I can get the gradient. So that's the critical difference between a tensor and a variable.

They have exactly the same API, except a variable also has .backward(), and that gets you the gradient. When I read .grad, the reason that it's d(out)/dx is because I called backward on out; out is the thing we're taking the derivative of, with respect to x.
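A minimal example of that (this is essentially the standard PyTorch autograd tutorial snippet of the time):

```python
import torch
from torch.autograd import Variable

# Wrap a tensor in a Variable with requires_grad=True and PyTorch records every
# operation applied to it, so we can ask for gradients later.
x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()

out.backward()             # compute d(out)/dx
print(x.grad)              # every entry is 3 * 2 * (1 + 2) / 4 = 4.5
```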

This is kind of crazy: you can do things like while loops and get the gradients of them. That kind of thing is pretty tricky to do with TensorFlow or Theano and these computation-graph approaches. So it gives you a whole lot of flexibility to define things in much more natural ways. You can really write PyTorch just like you're writing regular old NumPy stuff.

It has plenty of libraries, so if you want to create a neural network, here's how you do a CNN. I warned you early on that if you don't know about OO in Python, you need to learn it. So here's why. Because in PyTorch, everything's kind of done using OO.

I really like this. In TensorFlow, they kind of invent their own weird way of programming rather than use Python OO. Whereas PyTorch just goes, "Oh, we already have these features in the language. Let's just use them." So it's way easier, in my opinion. So to create a neural net, you create a new class, you derive from Module, and then in the constructor, you create all of the things that have weights.

So conv1 is now something that has some weights; it's a 2D conv. conv2 is something with some weights. Fully connected 1 (fc1) is something with some weights. So there are all of your layers, and then you get to say exactly what happens in your forward pass. Because max pooling doesn't have any weights, and ReLU doesn't have any weights, there's no need to define them in the initializer.

You can just call them as functions. But these things have weights, so they need to be kind of stateful and persistent. So in my forward pass, you literally just define what are the things that happen. .view is the same as reshape. The whole API has different names for everything, which is mildly annoying for the first week, but you kind of get used to it.
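A minimal sketch of a CNN in that OO style (shapes assume a 28x28 single-channel input; layer sizes are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Anything with weights lives in the constructor...
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 10)

    def forward(self, x):
        # ...while stateless ops (pooling, relu) are just called as functions.
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(x.size(0), -1)          # .view is PyTorch's reshape
        return self.fc1(x)
```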

.reshape is called .view. During the week, if you try to use PyTorch and you're like, "How do you say blah in PyTorch?" and you can't find it, feel free to post on the forum. Having said that, PyTorch has its own Discourse-based forums. And as you can see, it is just as busy and friendly as our forums.

People are posting on these all the time, so I find it a really great, helpful community. So feel free to ask over there or over here. You can then put all of that computation onto the GPU by calling .cuda(). You can then take some input and put that on the GPU with .cuda() too.

You can then calculate your loss, calculate your derivatives, and then optimize. This is just one step of the optimizer, so we have to put that in a loop. So those are the basic pieces; one step looks roughly like the sketch below. At the end here there's a complete process, but I think it'll be more fun to see the process in the Wasserstein GAN.
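A minimal sketch of a single optimisation step, continuing the Net sketch above (inputs and labels are assumed to be a batch of 28x28 image tensors and a LongTensor of class indices):

```python
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable

net = Net().cuda()                           # move the whole model to the GPU
opt = optim.SGD(net.parameters(), lr=0.01)
crit = nn.CrossEntropyLoss()

# One step: move the batch to the GPU, zero old gradients, forward, loss,
# backward, then update the weights.
x, y = Variable(inputs.cuda()), Variable(labels.cuda())
opt.zero_grad()
loss = crit(net(x), y)
loss.backward()
opt.step()
```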

So here it is. I've kind of got this TorchUtils thing which you'll find in GitHub which has the basic stuff you'll want for Torch all there, so you can just import that. So we set up the batch size, the size of each image, the size of our noise vector.

And look how cool it is. I really like this. This is how you import datasets. There's a datasets module already in the torchvision library. Here's the CIFAR-10 dataset. It will automatically download it to this path for you if you say download=True. And rather than having to figure out how to do the preprocessing, you can create a list of transforms.

So I think this is a really lovely API. The reason that this is so new yet has such a nice API is because it comes from a Lua library called Torch that's been around for many years, and so these guys basically started off by copying what they already had and what already works well.

So I think this is very elegant. So I've got two different things you can look at here. They're both in the paper. One is CIFAR-10, which is these tiny little images. Another is something we haven't seen before, called LSUN, which is a really nice dataset. It's a huge dataset with millions of images, 3 million bedroom images, for example.

We can use either one. This is pretty cool. We can then create a data loader and say how many workers to use. We already know what workers are. This is all built into the framework. Now that you know how many workers your CPU likes to use, you can just go ahead and put that number in here, and it will use your CPU to load this data in parallel in the background.
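Roughly, that setup looks like this (paths and sizes are illustrative; transforms.Scale was the name at the time, and newer torchvision calls it Resize):

```python
import torch
from torchvision import datasets, transforms

# Download CIFAR-10 and describe the preprocessing as a list of transforms.
tfm = transforms.Compose([
    transforms.Scale(64),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
dataset = datasets.CIFAR10(root='data/cifar10/', download=True, transform=tfm)

# Workers load and preprocess batches in parallel on the CPU, in the background.
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64,
                                         shuffle=True, num_workers=4)
```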

We're going to start with CIFAR-10. We've got 47,000 of those images. We'll skip over this very quickly because it's really straightforward. Here's a conv block that consists of a Conv2d, a BatchNorm2d, and a LeakyReLU. In my initializer, I can go ahead and say, "Okay, we'll start with a conv block.

Optionally have a few extra conv blocks." This is really nice. Here's a while loop that says keep adding more down sampling blocks until you've got as many as you need. That's a really nice kind of use of a while loop to simplify creating our architecture. And then a final conv block at the end to actually create the thing we want.

And then this is pretty nifty. If you pass in ngpu greater than 1, then it will call nn.parallel.data_parallel, passing in those GPU IDs, and it will do automatic multi-GPU training. This is by far the easiest multi-GPU training I've ever seen. That's it; that's the forward pass right here.

We'll learn more about this over the next couple of weeks. In fact, given we're a little short of time, let's discuss that next week, and let me know if you think we haven't covered it. Here's the generator. It looks very, very similar. Again, there's a while loop to make sure we've gone through the right number of deconv blocks.

This is actually interesting. This would probably be better off with an up-sampling block followed by a one-by-one convolution. Maybe at home you could try this and see if you get better results because this has probably got the checkerboard pattern problem. This is our generator and our discriminator. It's only 75 lines of code, nice and easy.

Everything's a little bit different in PyTorch. If we want to say what initializer to use, again it's a little bit more decoupled. Maybe at first it's a little more complex, but there are fewer things you have to learn. In this case we can call something called .apply, which takes some function and applies it to everything in our architecture.

This function is something that says, "Is this a Conv2d or a ConvTranspose2d? If so, use this initialization function." Or if it's a batch norm, use this initialization function. Everything's a little bit different; there isn't a separate initializer parameter. This is, in my opinion, much more flexible. I really like it.
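Roughly what that looks like (netG and netD being the generator and discriminator modules; the numbers follow the DCGAN-style initialisation the WGAN code uses):

```python
# Custom initialisation via .apply: the function gets called on every module in the
# network, so we can dispatch on the layer type.
def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)

netG.apply(weights_init)
netD.apply(weights_init)
```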

As before, we need something that creates some noise. Let's go ahead and create some fixed noise. We're going to have an optimizer for the discriminator. We've got an optimizer for the generator. Here is something that does one step of the discriminator. We're going to call the forward pass, then we call the backward pass, then we return the error.

Just like before, we've got something called make_trainable. This is how we make something trainable or not trainable in PyTorch. Just like before, we have a train loop. The train loop has got a little bit more going on, partly because of the Wasserstein GAN, partly because of PyTorch. But the basic idea is the same.

For each epoch, for each batch, make the discriminator trainable, and then this is the number of iterations to train the discriminator for. Remember I told you one of the nice things about the Wasserstein GAN is that we don't have to do one batch discriminator, one batch generator, one batch discriminator, one batch generator; we can actually train the discriminator properly for a bunch of batches.

In the paper, they suggest using 5 batches of discriminator training each time through the loop, unless you're still in the first 25 iterations. They say if you're in the first 25 iterations, do 100 batches. And then they also say, from time to time, do 100 batches. So it's kind of nice that, by having the flexibility here to really change things, we can do exactly what the paper wants us to do.

So basically at first we're going to train the discriminator carefully, and also from time to time, train the discriminator very carefully. Otherwise we'll just do 5 batches. So this is where we go ahead and train the discriminator. And you'll see here, we clamp -- this is the same as clip -- the weights in the discriminator to fall in this range.

And if you're interested in reading the paper, the paper explains that basically the reason for this is that their assumptions are only true in this kind of small area. So that's why we have to make sure that the weights stay in this small area. So then we go ahead and do a single step with the discriminator.

Then we create some noise and do a single step with the generator. We get our fake data for the discriminator. Then we can subtract the fake from the real to get our error for the discriminator. So there's one step with the discriminator. We do that either 5 or 100 times.

Make our discriminator not trainable, and then do one step of the generator. You can see here, we call the generator with some noise, and then pass it into the discriminator to see if we tricked it or not. Put together, the whole schedule looks roughly like the sketch below.
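A rough sketch of that training schedule (not the authors' exact code; netD, netG, optD, optG, dataloader, nz, bs and niter are assumed to already exist, .cuda() calls are omitted for brevity, and the critic loss is written as something to minimise):

```python
import torch
from torch.autograd import Variable

gen_iterations = 0
for epoch in range(niter):
    data_iter = iter(dataloader)
    while True:
        # Train the critic: 5 batches normally, 100 early on and from time to time.
        d_iters = 100 if gen_iterations < 25 or gen_iterations % 500 == 0 else 5
        try:
            for _ in range(d_iters):
                # Clamp (clip) the critic's weights into a small box.
                for p in netD.parameters():
                    p.data.clamp_(-0.01, 0.01)
                real = Variable(next(data_iter)[0])
                noisev = Variable(torch.randn(real.size(0), nz, 1, 1))
                fake = Variable(netG(noisev).data)     # treat G's output as fixed here
                lossD = netD(fake).mean() - netD(real).mean()
                netD.zero_grad()
                lossD.backward()
                optD.step()
        except StopIteration:
            break                                      # epoch finished
        # One generator step: try to make the critic score fakes as real.
        noisev = Variable(torch.randn(bs, nz, 1, 1))
        lossG = -netD(netG(noisev)).mean()
        netG.zero_grad()
        lossG.backward()
        optG.step()
        gen_iterations += 1
```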

During the week, you can look at the two different versions, and you'll see basically the PyTorch and the Keras version of the same thing. The only difference is two things: one is the presence of this clamping, and the second is that the loss function is mean squared error rather than cross-entropy. So let's see what happens. Here are some examples from CIFAR-10. They're certainly a lot better than our crappy DCGAN MNIST examples, but they're not great.

Why are they not great? Probably the reason they're not great is because CIFAR-10 has quite a few different categories of quite different kinds of things. So it doesn't really know what it's meant to be drawing a picture of. Sometimes I guess it kind of figures it out.

This must be a plane, I think. But a lot of the time it hedges and kind of draws a picture of something that looks like it might be a reasonable picture, but it's not a picture of anything in particular. On the other hand, the Lsun dataset has 3 million bedrooms.

So we would hope that when we train the Wasserstein GAN on LSUN bedrooms, we might get better results. Here's the real CIFAR-10, by the way. Here are our fake bedrooms, and they are pretty freaking awesome. So literally they started out as random noise, and every one has been turned into something like that.

It's definitely a bedroom. They're all definitely bedrooms. And then here are the real bedrooms to compare. Imagine if you took this and stuck it on the end of any kind of generator; I think you could really use it to make your generator much more believable.

Any time you kind of look at it and you say, "Oh, that doesn't look like the real X," maybe you could try using a WGAN to try to make it look more like a real X. So this paper is so important. Here's the other thing. The loss function for these actually makes sense.

The discriminator and the generator loss functions actually decrease as they get better. So you can actually tell if your thing is training properly. You can't exactly compare two different architectures to each other still, but you can certainly see that the training curves are working. So now that we have, in my opinion, a GAN that actually really works reliably for the first time ever, I feel like this changes the whole equation for what generators can and can't do.

And this has not been applied to anything yet. So you can take any old paper that produces 3D outputs or segmentations or depth outputs or colorization or whatever and add this. And it would be great to see what happens, because none of that has been done before. It's not been done before because we haven't had a good way to train GANs before.

So this is kind of, I think, something where anybody who's interested in a project, yeah, this would be a great project and something that maybe you can do reasonably quickly. Another thing you could do as a project is to convert this into Keras. So you could take the Keras DCGAN notebook that we've already got, change the loss function, add the weight clipping, try training on this LSUN bedrooms dataset, and you should get the same results.

And then you can add this on top of any of your Keras stuff. So there's so much you could do this week. I don't feel like I want to give you an assignment per se, because there's a thousand assignments you could do. I think as per usual, you should go back and look at the papers.

The original GAN paper is a fairly easy read. There's a section called Theoretical Results, which is kind of like the pointless math bit. Here's some theoretical stuff. It's actually interesting to read this now because you go back and you look at this stuff where they prove various nice things about their GAN.

So they're talking about how the generative model perfectly replicates the data generating process. It's interesting to go back and look and say, okay, so they've proved these things, but it turned out to be kind of pointless; it still didn't really work. Which is not to say this isn't a good paper, it is a good paper, but it is interesting to see when the theoretical stuff is useful and when it's not.

Then you look at the Wasserstein GAN theoretical sections, and it spends a lot of time talking about why their theory actually matters. So they have this really cool example where they say, let's create something really simple. What if you want to learn just parallel lines, and they show why it is that the old way of doing GANs can't learn parallel lines, and then they show how their different objective function can learn parallel lines.

So for anybody who's interested in getting into the theory a little bit, it's very interesting to look at why the original paper's proof of convergence showed something that turned out not to really matter, whereas in this paper the theory turned out to be super important and basically created something that allowed GANs to work reliably for the first time.

So there's lots of stuff you can get out of these papers if you're interested. In terms of the notation, we might look at some of the notation a little bit more next week. But if we look, for example, at the algorithm sections, I think in general the bit I find the most useful is the bit where they actually write the pseudocode.

Even then, it's useful to learn some of the nomenclature. For each iteration, for each step, what does this mean? "Noise samples from the noise prior." There's a lot of probability nomenclature which you can very quickly translate. A prior simply means np.random.something; in this case, probably something like np.random.normal. So this just means some random number generator that you get to pick.

This one here, "sample from the data generating distribution", means randomly pick some stuff from your array. So those are the two steps: generate some random numbers, and then randomly select some things from your array.
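In NumPy terms, the two sampling steps look something like this (m, nz and X_train are placeholder names for the minibatch size, the noise dimension and the training data):

```python
import numpy as np

# "Sample a minibatch of m noise samples from the noise prior p(z)":
z = np.random.normal(size=(m, nz))

# "Sample a minibatch of m examples from the data generating distribution":
x = X_train[np.random.randint(0, len(X_train), size=m)]
```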

The bit where it talks about the gradient you can largely ignore, except that the bit in the middle is your loss function. You can see here: this thing is your noise. So noise, then generator on noise, then discriminator on generator on noise. So there's the bit involving trying to fool the discriminator, and that's why we do the 1 minus. And then here's the bit getting the discriminator to be accurate, because these x's are the real data.

So that's the math version of what we just learned. The Wasserstein GAN also has an algorithm section, so it's kind of interesting to compare the two. So here we go with the Wasserstein GAN, here's the algorithm, and basically this says exactly the same thing as the last one said, but I actually find this one a bit clearer.

Sample from the real data, sample from your priors. So hopefully that's enough to get going; I look forward to talking on the forums and seeing how everybody gets along. Thanks everybody. (audience applauds)