Lesson 10: Cutting Edge Deep Learning for Coders
Chapters
0:00 Introduction
1:00 Slav Ivanov's Optimizer Post
4:35 Overshoot
7:40 Study Groups
9:15 Last Week Recap
10:40 Resize Images
12:43 Center Cropping
15:02 Parallel Processing
17:39 General Approach
19:03 Append Image
20:10 Threading Local
22:54 Results
26:38 Preprocessing
27:40 Finetuning
38:39 Bcolz Arrays
43:47 Linear Model
00:00:00.000 |
Some really fun stuff appeared on the forums this week, and one of the really great projects 00:00:12.600 |
was created by, I believe, our sole Bulgarian participant in the course, Slav Ivanov, who wrote 00:00:19.080 |
a great post about picking an optimizer for style transfer. 00:00:24.380 |
This post came from a forum discussion in which I made an off-hand remark about how 00:00:31.800 |
I know that in theory BFGS is a deterministic optimizer, it uses a line search, it approximates 00:00:39.360 |
the Hessian, it ought to work on this kind of deterministic problem better, but I hadn't 00:00:44.600 |
tried it myself and I hadn't seen anybody try it and so maybe somebody should try it. 00:00:49.480 |
I don't know if you've noticed, but pretty much every week I say something like that 00:00:52.480 |
a number of times, and every time I do I'm always hoping that somebody might go, "Oh, I wonder..." 00:00:57.000 |
And so Slav did wonder, and he posted a really interesting blog post about that exact question. 00:01:06.720 |
I was thrilled to see that the blog post got a lot of pick-up on the machine learning Reddit. 00:01:16.080 |
It got 55 upvotes, which for that subreddit put it in second place on the front page. 00:01:25.240 |
It also got picked up by the WildML mailing list's weekly summary of interesting things. 00:01:43.360 |
For those of you that have looked at it and kind of wondered what is it about this post 00:01:47.160 |
that causes it to get noticed whereas other ones don't, I'm not sure I know the secret, 00:01:54.160 |
but as soon as I read it I kind of thought, "Okay, I think a lot of people are going to 00:01:57.840 |
It gives some background, it assumes an intelligent reader, but it assumes an intelligent reader 00:02:03.600 |
who doesn't necessarily know all about this, something like you guys six months ago. 00:02:10.760 |
And so it describes this is what it is and this is where this kind of thing is used and 00:02:16.120 |
gives some examples and then goes ahead and sets up the question of different optimization 00:02:25.320 |
algorithms and then shows lots of examples of both learning curves as well as pictures 00:02:33.140 |
that come out of these different experiments. 00:02:36.560 |
And I think hopefully it's been a great experience for Slav as well because in the Reddit thread 00:02:42.120 |
there's all kinds of folks pointing out other things that he could try, questions that weren't 00:02:49.760 |
quite clear, and so now there's, actually summarized in that thread, 00:02:55.640 |
a whole list of things that perhaps could be done next, opening up a whole set of interesting questions. 00:03:03.480 |
There's another post which I'm not even sure is officially posted yet, but I got an early preview. 00:03:14.200 |
Here is Kanye drawn using a brush of Captain Jean-Luc Picard. 00:03:19.000 |
In case you're wondering, is that really him, I will show you his zoomed in version. 00:03:30.640 |
And this is a really interesting idea because he points out that generally speaking when 00:03:36.600 |
you try to use a non-artwork as your style image, it doesn't actually give very good results. 00:03:46.200 |
It's another example of a non-artwork, it doesn't give good results. 00:03:52.800 |
It's kind of interesting, but it's not quite what he was looking for. 00:03:56.200 |
But if you tile it, you totally get it, so here's Kanye using a Nintendo game controller 00:04:07.880 |
So then he tried out this Jean-Luc Picard and got okay results and kind of realized that 00:04:16.760 |
actually the size of the texture is pretty critical. 00:04:20.480 |
And I've never seen anybody do this before, so I think when this image gets shared on 00:04:28.100 |
Twitter it's going to go everywhere because it's just the freakiest thing. 00:04:36.520 |
So I think I warned you guys about your projects when I first mentioned them as being something 00:04:42.500 |
that's very easy to overshoot a little bit and spend weeks and weeks talking about what to do. 00:04:55.040 |
Really it would have been nice to have something done by now rather than spending a couple of weeks deciding. 00:05:00.180 |
So if your team is being a bit slow agreeing on something, just start working on something 00:05:04.840 |
Or as a team, just pick something that you can do by next Monday and write up something 00:05:14.520 |
So for example, if you're thinking, okay, we might do the $1 million Data Science Bowl. 00:05:21.280 |
You're not going to finish it by Monday, but maybe by Monday you could have written a blog 00:05:24.740 |
post introducing what you can learn in a week about medical imaging. 00:05:30.280 |
Oh, it turns out it uses something called DICOM. 00:05:32.360 |
Here are the Python DICOM libraries, and we tried to use them, and these were the things 00:05:36.240 |
that got us kind of confused, and these are the ways that we solved them. 00:05:40.240 |
And here's a Python notebook which shows you some of the main ways you can look at these images. 00:05:49.960 |
It's like when you enter a Kaggle competition, I always tell people submit every single day 00:05:58.160 |
and try and put in at least half an hour a day to make it slightly better than yesterday. 00:06:02.720 |
So how do you put in the first day's submission? 00:06:06.080 |
What I always do on the first day is to submit the benchmark script, which is generally like 00:06:12.520 |
And then the next day I try to improve it, so I'll put in all 0.5s, and the next day 00:06:18.720 |
I'll be like, "Okay, what's the average for cats? 00:06:22.240 |
And if you do that every day for 90 days, you'll be amazed at how much you can achieve. 00:06:28.320 |
Whereas if you wait two months and spend all that time reading papers and theorizing and 00:06:33.480 |
thinking about the best possible approach, you'll discover that you don't get any submissions 00:06:38.240 |
Or you finally get your perfect submission in and it goes terribly and now you don't 00:06:46.120 |
I think those tips are equally useful for Kaggle competitions as well as for making 00:06:51.120 |
sure that at the end of this part of the course you have something that you're proud of, something 00:06:58.320 |
that you feel you did a good job in a small amount of time. 00:07:04.680 |
If you try and publish something every week on the same topic, you'll be able to keep 00:07:11.120 |
I don't know what Slav's plans are, but maybe next week he'll follow up on some of the interesting 00:07:18.200 |
research angles that came up on Reddit, or maybe Brad will follow up on some of his additional 00:07:26.600 |
There's a lesson 10 wiki up already which has the notebooks, and just do a git pull 00:07:35.620 |
on the GitHub repo to get the most up-to-date Python notebooks. 00:07:40.600 |
Another thing that I wanted to point out is that in study groups, so we've been having 00:07:47.520 |
study groups each Friday here, and I know some of you have had study groups elsewhere 00:07:51.880 |
around the Bay Area. At one of them, somebody said: I don't understand this Gram matrix stuff. I don't get what's going 00:08:00.520 |
on. I understand the symbols, I understand the math, but what's going on? 00:08:05.680 |
I said maybe if you had a spreadsheet, it would all make sense. Maybe I'll create a 00:08:19.240 |
spreadsheet. Yes, do that! And 20 minutes later I turned to him and I said, "So how do you 00:08:24.600 |
feel about Gram matrices now?" And he goes, "I totally understand them." And I looked 00:08:28.440 |
over and he created a spreadsheet. This was the spreadsheet he created. It's a very simple 00:08:34.200 |
spreadsheet where it's like here's an image where the pixels are just 1, -1, and 0. It 00:08:39.480 |
has two filters, either 1 or -1. He has the flattened convolutions next to each other, 00:08:48.520 |
and then he's created the little dot product matrix. 00:08:54.160 |
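In code, that spreadsheet corresponds to something like the following minimal NumPy sketch of a Gram matrix (the toy pixel values and filter count are made up purely for illustration):

```python
import numpy as np

# Two toy "filter activation" channels, each 3x3, with values of just 1, -1 and 0
acts = np.array([[[ 1, -1,  0],
                  [ 0,  1, -1],
                  [-1,  0,  1]],
                 [[ 1,  1, -1],
                  [-1,  0,  0],
                  [ 0, -1,  1]]], dtype=np.float32)

# Flatten each channel into a row, like laying the flattened convolutions side by side
flat = acts.reshape(acts.shape[0], -1)   # shape (2, 9)

# The Gram matrix is just the dot product of those flattened channels with each other
gram = flat @ flat.T                     # shape (2, 2)
print(gram)
```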
I haven't been doing so much Excel stuff myself, but I think you'll learn a lot more by trying 00:09:01.120 |
it yourself. Particularly if you try it yourself and can't figure out how to do it in Excel, 00:09:06.520 |
then we have programs. I love Excel, so if you ask me questions about Excel, I will have 00:09:29.440 |
So last week we talked about the idea of learning with larger datasets. Our goal was to try 00:09:46.320 |
and replicate the DeViSE paper. To remind you, the DeViSE paper is the one where we do 00:09:53.700 |
a regular CNN, but the thing that we're trying to predict is not a one-hot encoding of the 00:10:01.520 |
category, but it's the word vector of the category. 00:10:07.480 |
So it's an interesting problem, but one of the things interesting about it is we have 00:10:13.280 |
to use all of ImageNet, which has its own challenges. So last week we got to the point 00:10:21.120 |
where we had created the word vectors. And remember, we then had 00:10:26.640 |
to map the word vectors to ImageNet categories. There are 1000 ImageNet categories, so we had to 00:10:31.320 |
create the word vector for each one. We didn't quite get all of them to match, but something 00:10:35.720 |
like 2/3 of them matched, so we're working on 2/3 of ImageNet. We've got as far as reading 00:10:43.080 |
all the file names for ImageNet, and then we're going to resize our images to 224x224. 00:10:58.640 |
I think it's a good idea to do some of this pre-processing upfront. Something that TensorFlow 00:11:06.720 |
and PyTorch both do and Keras recently started doing is that if you use a generator, it actually 00:11:13.040 |
does the image pre-processing in a number of separate threads in parallel behind the 00:11:19.700 |
scenes. So some of this is a little less important than it was 6 months ago when Keras didn't 00:11:25.520 |
do that. It used to be that we had to spend a long time waiting for our data to get processed 00:11:32.720 |
before it could get into the CNN. Having said that, particularly image resizing, when you've 00:11:42.780 |
got large JPEGs, just reading them off the hard disk and resizing them can take quite 00:11:49.200 |
a long time. So I always like to do all that resizing upfront and end up with 00:11:54.700 |
something in a nice convenient bcolz array. 00:11:59.500 |
Amongst other things, it means that unless you have enough money to have a huge NVMe or 00:12:06.600 |
SSD drive, which you can put the entirety of ImageNet on, you probably have your big 00:12:12.480 |
data sets on some kind of pretty slow spinning disk or slow RAID array. One of the nice things 00:12:19.360 |
about doing the resizing first is that it makes it a lot smaller, and you probably can 00:12:22.960 |
then fit it on your SSD. There's lots of reasons that I think this is good. I'm going to resize 00:12:29.480 |
all of the ImageNet images, put them in a bcolz array on my SSD. So here's the path, 00:12:37.840 |
and dpath is the path to my fast SSD mount point. We talked briefly about the resizing, 00:12:50.760 |
and we're going to do a different kind of resizing. In the past, we've done the same 00:12:54.160 |
kind of resizing that Keras does, which is to add a black border. If you start with something 00:12:59.200 |
that's not square, and you make it square, you resize the largest axis to be the size 00:13:06.120 |
of your square, which means you're left with a black border. I was concerned that any model 00:13:14.280 |
where you have that is going to have to learn to model the black border, a) and b) that 00:13:20.880 |
you're kind of throwing away information. You're not using the full size of the image. 00:13:25.400 |
And indeed, pretty much every other library or paper I've seen uses a different approach, 00:13:32.560 |
which is to resize the smaller side of the image to the square. The larger side is 00:13:40.440 |
now too big for your square, so you crop off the top and bottom, or crop off the left and 00:13:45.760 |
right. So this is called a center-cropping approach. 00:13:59.760 |
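A minimal sketch of that center-cropping resize, assuming PIL/Pillow and NumPy (the function name and default argument are my own choices, not the lesson notebook's):

```python
import numpy as np
from PIL import Image

def center_crop_resize(path, size=224):
    """Resize the shorter side to `size`, then crop the longer side to a square."""
    im = Image.open(path).convert('RGB')
    w, h = im.size
    scale = size / min(w, h)                                   # shorter side becomes `size`
    im = im.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    arr = np.asarray(im)
    top = (arr.shape[0] - size) // 2                           # crop rows equally top/bottom
    left = (arr.shape[1] - size) // 2                          # crop columns equally left/right
    return arr[top:top + size, left:left + size]
```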
Okay, that's true. What you're doing is you're throwing away compute. Like with the one 00:14:15.720 |
where you do center-crop, you have a complete 224 thing full of meaningful pixels. Whereas 00:14:20.440 |
with a black border, you have a 180 by 224 bit with meaningful pixels and a whole bunch 00:14:25.060 |
of black pixels. Yeah, that can be a problem. It works well for ImageNet because ImageNet 00:14:37.720 |
things are generally somewhat centered. You may need to do some kind of initial step to 00:14:44.800 |
do a heat map or something like we did in lesson 7 to figure out roughly where the thing 00:14:48.440 |
is before you decide where to center the crop. So these things are all compromises. But I 00:14:53.520 |
got to say, since I switched to using this approach, I feel like my models have trained 00:14:57.680 |
a lot faster and given better results, certainly the super resolution. 00:15:03.320 |
I said last week that we were going to start looking at parallel processing. If you're 00:15:09.040 |
wondering about last week's homework, we're going to get there, but some of the techniques 00:15:12.680 |
we're about to learn, we're going to use to do last week's homework even better. So what 00:15:19.880 |
I want to do is I've got a CPU with something like 10 cores on it, and then each of those 00:15:28.800 |
cores have hyperthreading, so each of those cores can do kind of two things at once. So 00:15:33.680 |
I really want to be able to have a couple of dozen processes going on, each one resizing an image at the same time. 00:15:44.440 |
Just to remind you, this is as opposed to vectorization, or SIMD, which is where a single 00:15:49.560 |
thread operates on a bunch of things at a time. So we learned that to get SIMD working, 00:15:54.480 |
you just have to install Pillow-SIMD, and it just happens: 600% speedup. I tried it, 00:16:02.160 |
it works. Now we're going to, as well as the 600% speedup, also get another 10 or 20x speedup from parallel processing. 00:16:12.280 |
The basic approach to parallel processing in Python 3 is to set up something called either 00:16:19.240 |
a process pool or a thread pool. So the idea here is that we've got a number of little 00:16:25.340 |
programs running, threads or processes, and when we set up that pool, we say how many 00:16:31.120 |
of those little programs do we want to fire up. And then what we do is we say, okay, now 00:16:37.840 |
I want you to use workers. I want you to use all of those workers to do some thing. And 00:16:46.240 |
the easiest way to do a thing in Python 3 is to use Map. How many of you have used Map before? 00:16:53.800 |
So for those of you who haven't, Map is a very common functional programming construct 00:16:58.480 |
that's found its way into lots of other languages, which simply says, loop through a collection 00:17:03.680 |
and call a function on everything in that collection and return a new collection, which 00:17:09.360 |
is the result of calling that function on that thing. In our case, the function is resize, 00:17:16.440 |
In fact, the collection is a bunch of numbers, 0, 1, 2, 3, 4, and so forth, and what the 00:17:25.240 |
resize image is going to do is it's going to open that image off disk. So it's turning 00:17:32.560 |
the number 3 into the third image resized, 224x224, and we'll return that. 00:17:39.880 |
So the general approach here, this is basically what it looks like to do parallel processing 00:17:46.440 |
in Python. It may look a bit weird. We're going result equals exec.map. This is a function 00:17:54.960 |
I want, this is the thing to map over, and then I'm saying for each thing in that list, 00:18:00.040 |
do something. Now this might make you think, well wait, does that mean this list has to 00:18:05.480 |
have enough memory for every single resized image? And the answer is no, no it doesn't. 00:18:12.520 |
One of the things that Python 3 uses a lot more is using these things they call generators, 00:18:19.800 |
which is basically, it's something that looks like a list, but it's lazy. It only creates 00:18:24.800 |
that thing when you ask for it. So as I append each image, it's going to give me that image. 00:18:30.920 |
And if this mapping is not yet finished creating it, it will wait. So this approach looks like 00:18:37.080 |
it's going to use heaps of memory, but it doesn't. It uses only the minimum amount of 00:18:42.480 |
memory necessary and it does everything in parallel. 00:18:47.280 |
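Here is a rough sketch of that pattern with Python 3's concurrent.futures; `fnames`, `center_crop_resize` and `append_image` stand in for the pieces described elsewhere in this lesson rather than the notebook's exact names:

```python
from concurrent.futures import ThreadPoolExecutor  # or ProcessPoolExecutor

def resize_image(i):
    # turn the number i into the i-th image, read off disk and center-crop resized
    return center_crop_resize(fnames[i])

with ThreadPoolExecutor(max_workers=16) as exec:
    # map returns a lazy iterator over the results; we consume each image as it's ready
    results = exec.map(resize_image, range(len(fnames)))
    for img in results:
        append_image(img)   # e.g. append it to a bcolz array
```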
So resizeImage is something which is going to open up the image, it's going to turn it 00:18:56.040 |
into a NumPy array, and then it's going to resize it. And so then the resize does the 00:19:01.520 |
center cropping we just mentioned, and then after it's resized it's going to get appended. 00:19:07.520 |
What does appendImage do? So this is a bit weird. What's going on here? What it does 00:19:15.600 |
is it's going to actually stick it into what we call a pre-allocated array. We're learning 00:19:23.120 |
a lot of computer science concepts here. Anybody that's done computer science before will be 00:19:26.800 |
familiar with all of this already. If you haven't, you probably won't. But it's important 00:19:32.040 |
to know that the slowest thing in your computer, generally speaking, is allocating memory. 00:19:39.660 |
It's finding some memory, it's reading stuff from that memory, it's writing to that memory, 00:19:44.040 |
unless of course it's like cache or something. And generally speaking, if you create lots 00:19:49.800 |
and lots of arrays and then throw them away again, that's likely to be really, really 00:19:54.280 |
slow. So what I wanted to do was create a single 224x224 array which is going to contain 00:20:01.000 |
my resized image, and then I'm going to append that to my bcolz array. 00:20:08.320 |
So the way you do that in Python, it's wonderfully easy. You can create a variable from this thing 00:20:20.960 |
called threading.local. It's basically something that looks a bit like a dictionary, but it's 00:20:28.800 |
a very special kind of dictionary. It's going to create a separate copy of it for every 00:20:33.280 |
thread or process. Normally when you've got lots of things happening at once, it's going 00:20:39.360 |
to be a real pain because if two things try to use it at the same time, you get bad results 00:20:45.520 |
or even crashes. But if you allocate a variable like this, it automatically creates a separate 00:20:52.200 |
copy in every thread. You don't have to worry about locks, you don't have to worry about 00:20:56.320 |
race conditions, whatever. Once I've created this special threading.local variable, I then 00:21:04.120 |
create a placeholder inside it which is just an array of zeros of size 224x224x3. 00:21:12.000 |
So then later on, I create my bcolz array, which is where I'm going to put everything 00:21:17.040 |
eventually, and to append the image, I grab the bit of the image that I want and I put 00:21:23.600 |
it into that preallocated thread local variable, and then I append that to my bcolz array. 00:21:34.200 |
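As a sketch, the thread-local placeholder plus the bcolz append could look roughly like this (the path, dtype and the lazy-initialisation check are my own choices, not necessarily the notebook's):

```python
import threading
import numpy as np
import bcolz

loc = threading.local()   # each thread sees its own attributes on this object

# The on-disk bcolz array that everything gets appended to
arr = bcolz.carray(np.empty((0, 224, 224, 3), dtype='uint8'),
                   chunklen=32, mode='w', rootdir='results/trn_resized_224.bc')

def append_image(img):
    # lazily create one preallocated 224x224x3 buffer per thread, so we never allocate
    # a fresh array per image and never share a buffer between threads
    buf = getattr(loc, 'buf', None)
    if buf is None:
        buf = loc.buf = np.zeros((224, 224, 3), dtype='uint8')
    buf[:] = img
    arr.append(buf[None])   # append as a batch of one
```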
So there's lots of detail here in terms of using parallel processing effectively. I wanted 00:21:43.200 |
to briefly mention it not because I think somebody who hasn't studied computer science 00:21:47.480 |
is now going to go, "Okay, I totally understood all that," but to give you some of the things 00:21:51.240 |
to like search for and learn about over the next week if you haven't done any parallel 00:21:56.200 |
programming before. You're going to need to understand thread local storage and race conditions. 00:22:03.880 |
In Python, there's something called the global interpreter lock, which is one of the many 00:22:22.160 |
awful things about Python, which is that in theory two things can't happen at the same 00:22:28.760 |
time because Python wasn't really written in a thread-safe way. The good news is that 00:22:37.040 |
lots of libraries are written in a thread-safe way. So if you're using a library where most 00:22:42.960 |
of its work is being done in C, as is the case with Pillow-SIMD, actually you don't have 00:22:48.960 |
to worry about that. And I can prove it to you even because I drew a little picture. 00:22:55.080 |
Where is the result of serial versus parallel? The serial without SIMD version is 6 times 00:23:04.320 |
slower, so the default Python code you would have written maybe before today's 00:23:10.600 |
course would have taken 120 seconds to process 2000 images. With SIMD, it's 25 seconds. With 00:23:21.320 |
a process pool, it's 8 seconds for 3 workers and 5 seconds for 6 workers. The thread pool 00:23:28.800 |
is even better: 3.6 seconds for 12 workers, 3.2 seconds for 16 workers. 00:23:36.160 |
Your mileage will vary depending on what CPU you have. Given that probably quite a lot of 00:23:42.200 |
you are using the P2 still, unless you've got your deep learning box up and running, 00:23:46.040 |
you'll have the same performance as other people using the P2. You should try something 00:23:50.120 |
like this, which is to try different numbers of workers and see what's the optimal for 00:23:56.040 |
that particular CPU. Now once you've done that, you know. Once I went beyond 16, I didn't 00:24:01.360 |
really get improvements. So I know that on that computer, a thread pool of size 16 is 00:24:07.520 |
a pretty good choice. As you can see, once you get into the right general vicinity, it 00:24:12.480 |
doesn't vary too much. So as long as you're roughly okay, just behind you, Rachel. 00:24:23.040 |
So that's the general approach here, is run through something in parallel, each time append 00:24:27.400 |
it to my bcolz array. And at the end of that, I've got a bcolz array which I can 00:24:32.680 |
use again and again. So I don't re-run that code very often anymore. I've got all of the 00:24:36.880 |
ImageNet resized to each of 72x72, 224x224, and 288x288, and I give them different names. 00:24:49.000 |
In fact, I think that's what Keras does now. I think it squishes. Okay, so here's one of 00:25:10.440 |
these things. I'm not quite sure. My guess was that I don't think it's a good idea because 00:25:16.720 |
you're now going to have dogs of various different squish levels and your neural net is going 00:25:21.520 |
to have to learn that thing. It's got another type of symmetry to learn about, level of 00:25:31.360 |
squishiness. Whereas if we keep everything of the same aspect ratio, I think it's going 00:25:40.120 |
to be easier to learn so we'll get better results with less epochs of training. 00:25:45.960 |
That's my theory and I'd be fascinated for somebody to do a really in-depth analysis 00:25:49.160 |
of black borders versus center cropping versus squishing with image net. 00:25:57.620 |
So for now we can just open the bcolz array and there we go. So we're now ready 00:26:03.260 |
to create our model. I'll run through this pretty quickly because most of it's pretty 00:26:07.320 |
boring. The basic idea here is that we need to create an array of labels, called 00:26:13.000 |
vecs, which for every image in my bcolz array contains the target word vector. 00:26:26.200 |
Just to remind you, last week we randomly ordered the file names, so this bcolz 00:26:35.200 |
array is in random order. We've got our labels, which is the word vectors for every image. 00:26:44.360 |
We need to do our normal pre-processing. This is a handy way to pre-process in the new version 00:26:53.520 |
of Keras. We're using the normal Keras ResNet model, the one that comes in keras.applications. 00:27:02.000 |
It doesn't do the pre-processing for you, but if you create a lambda layer that does 00:27:08.200 |
the pre-processing then you can use that lambda layer as the input tensor. So this whole thing 00:27:16.080 |
now will do the pre-processing automatically without you having to worry about it. So that's 00:27:21.760 |
a good little trick. I'm not sure it's quite as neat as what we did in part 1 where we 00:27:26.800 |
put it in the model itself, but at least this way we don't have to maintain a whole separate 00:27:32.680 |
version of all of the models. So that's kind of what I'm doing nowadays. 00:27:44.840 |
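A minimal sketch of that Lambda-layer preprocessing trick (the mean values are the standard ImageNet channel means; exact import paths and argument names depend on your Keras version):

```python
import numpy as np
from keras.layers import Input, Lambda
from keras.applications.resnet50 import ResNet50

# Subtract the ImageNet channel means and flip RGB -> BGR, which is what the
# stock Keras ResNet50 weights expect
rn_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)
preproc = Lambda(lambda x: (x - rn_mean)[:, :, :, ::-1])

inp = Input((224, 224, 3))
resnet = ResNet50(include_top=False, input_tensor=preproc(inp))
```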
When you're working on really big datasets, you don't want to process things any more 00:27:50.560 |
than necessary and any more times than necessary. I know ahead of time that I'm going to want 00:27:55.640 |
to do some fine-tuning. What I decided to do was I decided this is the particular layer 00:28:04.200 |
where I'm going to do my fine-tuning. So I decided to first of all create a model which 00:28:09.480 |
started at the input and went as far as this layer. So my first step was to create that 00:28:18.680 |
model and save the results of that. The next step will be to take that intermediate step 00:28:27.160 |
and take it to the next stage I want to fine-tune to and save that. So it's a little shortcut. 00:28:33.920 |
There's a couple of really important intricacies to be aware of here though. The first one 00:28:39.480 |
is you'll notice that ResNet and Inception are not used very often for transfer learning. 00:28:50.680 |
This is something which I've not seen studied, and I actually think this is a really important 00:28:54.360 |
thing to study. Which of these things work best for transfer learning? But I think one 00:28:59.280 |
of the difficulties is that ResNet and Inception are harder. The reason they're harder is that 00:29:05.240 |
if you look at ResNet, you've got lots and lots of layers which make no sense on their 00:29:11.260 |
own. Ditto for Inception. They keep on splitting into 2 bits and then merging again. So what 00:29:21.160 |
I did was I looked at the Keras source code to find out how each block is named. What 00:29:29.720 |
I wanted to do was to say we've got a ResNet block, we've just had a merge, and then it 00:29:37.320 |
goes out and it does a couple of convolutions, and then it comes back and does an addition. 00:29:44.840 |
Basically I want to get one of these. Unfortunately for some reason Keras does not name these 00:29:53.960 |
merge layers. So what I had to do was get the next layer and then go back by 1. So it kind 00:30:04.640 |
of shows you how little people have been working with ResNet with transfer learning. Literally 00:30:09.480 |
the only bits of it that make sense to transfer learn from are nameless in one of the most 00:30:15.240 |
popular things for transfer learning, Keras. There's a second complexity when working with 00:30:26.000 |
ResNet. We haven't discussed this much, but ResNet actually has two kinds of ResNet blocks. 00:30:33.080 |
One is this kind, which is an identity block, and the second kind is a ResNet convolution 00:30:43.280 |
block, which they also call a bottleneck block. What this is is it's pretty similar. One thing 00:30:56.120 |
that's going up through a couple of convolutions and then goes and gets added together, but 00:31:00.160 |
the other side is not an identity. The other side is a single convolution. In ResNet they 00:31:09.360 |
throw in one of these every half a dozen blocks or so. Why is that? The reason is that if you 00:31:18.200 |
only have identity blocks, then all it can really do is to continually fine-tune where 00:31:26.000 |
it's at so far. We've learned quite a few times now that these identity blocks map to 00:31:32.760 |
the residual, so they keep trying to fine-tune the types of features that we have so far. 00:31:39.440 |
Whereas these bottleneck blocks actually force it from time to time to create a whole different 00:31:45.240 |
type of features because there is no identity path through here. The shortest path still 00:31:51.240 |
goes through a single convolution. When you think about transfer learning from ResNet, 00:31:57.360 |
you kind of need to think about should I transfer learn from an identity block before or after 00:32:03.480 |
or from a bottleneck block before or after. Again, I don't think anybody has studied this 00:32:10.080 |
or at least I haven't seen anybody write it down. I've played around with it a bit and 00:32:14.840 |
I'm not sure I have a totally decisive suggestion for you. But my guess is that the best 00:32:28.440 |
point to grab in ResNet is the end of the block immediately before a bottleneck block. 00:32:36.480 |
And the reason for that is that at that level of receptive field, obviously because each 00:32:42.760 |
bottleneck block is changing the receptive field, and at that level of semantic complexity, 00:32:50.440 |
this is the most sophisticated version of it because it's been through a whole bunch 00:32:54.160 |
of identity blocks to get there. So my belief is that just before that bottleneck 00:33:06.840 |
is the best place to transfer learn from. So that's what this is. This is the spot just 00:33:16.920 |
before the last bottleneck layer in ResNet. So it's pretty late, and so as we know very 00:33:26.120 |
well from part 1 with transfer learning, when you're doing something which is not too different, 00:33:31.520 |
and in this case we're switching from one-hot encoding to word vectors, which is not too 00:33:35.360 |
different. You probably don't want to transfer learn from too early, so that's why I picked 00:33:41.920 |
this fairly late stage, which is just before the final bottleneck block. 00:33:50.880 |
So the second complexity here is that this bottleneck block has these dimensions. The 00:33:57.880 |
output is 14x14x1024. So we have about a million images, so a million by 14x14x1024 is more 00:34:08.760 |
than I wanted to deal with. So I did something very simple, which was I popped in one more 00:34:17.600 |
layer after this, which is an average pooling layer, 7x7. So that's going to take my 14x14 down to 2x2. 00:34:31.520 |
So let's say one of those activations was looking for bird's eyeballs, then it's saying 00:34:37.880 |
in each of the 14x14 spots, how likely is it that this is a bird's eyeball? And so after 00:34:44.000 |
this it's now saying in each of these 4 spots, on average, how much were those cells looking 00:34:50.980 |
like bird's eyeballs? This is losing information. If I had a bigger SSD and more time, I wouldn't 00:35:03.120 |
have done this. But it's a good trick when you're working with these fully convolutional 00:35:07.160 |
architectures. You can pop an average pooling layer anywhere and decrease the resolution 00:35:13.320 |
to something that you feel like you can deal with. So in this case, my decision was to pool down to 2x2. 00:35:22.000 |
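A rough sketch of cutting the model at that intermediate layer and shrinking the grid (the layer index below is purely illustrative; you would inspect resnet.layers yourself to find the output of the block just before the last bottleneck block):

```python
from keras.models import Model
from keras.layers import AveragePooling2D

mid_layer = resnet.layers[-30]                          # hypothetical: the block output before the last bottleneck
mid_out = AveragePooling2D((7, 7))(mid_layer.output)    # 14x14x1024 -> 2x2x1024
mid_model = Model(resnet.input, mid_out)
```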
We had a question. I was going to ask, have we talked about why we do the merge operation 00:35:31.960 |
We have quite a few times, which is basically the merge was the thing which does the plus 00:35:37.680 |
here. That's the trick to making it into a ResNet block, is having the addition of the 00:35:45.240 |
identity with the result of the convolutions. 00:35:55.520 |
So recently I was trying to go from many filters. So you kind of just talked about downsizing 00:36:01.600 |
the size of the geometry. Is there a good best practice on going from, let's say, like 00:36:07.360 |
512 filters down to less? Or is it just as simple as doing convolution with less filters? 00:36:14.880 |
Yeah, there's not exactly a best practice for that. But in a sense, every single successful 00:36:27.800 |
architecture gives you some insights about that. Because every one of them eventually 00:36:31.720 |
has to end up with 1,000 categories if it's ResNet or three channels of 0-255 continuous values 00:36:41.440 |
if it's generative. So the best thing you can really do is, well, there's two things 00:36:46.440 |
one is to kind of look at the successful architectures. Another thing is, although this week is kind 00:36:52.120 |
of the last week where we're mainly going to be looking at images, I am going to briefly 00:36:56.200 |
next week open with a quick run through some of the things that you can look at to learn 00:37:01.200 |
more. And one of them is going to be a paper. In fact, two different papers which have like 00:37:06.360 |
best practices, you know, really nice kind of descriptions of these hundred different 00:37:11.960 |
things, these hundred different results. But all this stuff, it's still pretty artisanal. 00:37:19.520 |
Good question. So we initially resized images to 224, right? And it ended up being a 00:37:32.840 |
bcolz array already, right? Yes. So it's like 50 gig or something. Yes. And that's 00:37:44.080 |
compressed; uncompressed it's like a couple of hundred gig. But, well, if you load it 00:37:50.280 |
into memory... I'm not going to load it into memory, you'll see. So what you do is kind 00:37:54.360 |
of load it lazily. We're getting there. Yeah. So that's exactly the right segue I was looking for. 00:38:03.440 |
So what we're going to do now is we want to run this model we just built, just call basically 00:38:09.640 |
dot predict on it and save the predictions. The problem is that the size of those predictions 00:38:15.960 |
is going to be bigger than the amount of RAM I have, so I need to do it a batch at a time 00:38:21.360 |
and save it a batch at a time. They've got a million things, each one with this many 00:38:26.480 |
activations. And this is going to happen quite often, right? You're either working on a smaller 00:38:32.000 |
computer or you're working with a bigger dataset, or you're working with a dataset where you're 00:38:40.480 |
This is actually very easy to handle. You just create your bcols array where you're 00:38:45.720 |
going to store it. And then all I do is I go from 0 to the length of my array, my source 00:38:55.480 |
array, a batch at a time. So this is creating the numbers 0, 0 plus 128, 128 plus 128, and 00:39:04.360 |
so on and so forth. And then I take the slice of my source array from originally 0 to 128, 00:39:12.000 |
then from 128 to 256 and so forth. So this is now going to contain a slice of my source 00:39:19.760 |
bcols array. This is going to create a generator which is going to have all of those slices, 00:39:29.520 |
and of course being a generator it's going to be lazy. So I can then enumerate through 00:39:33.800 |
each of those slices, and I can append to my bcols array the result of predicting just 00:39:45.240 |
So you've seen like predict and evaluate and fit and so forth, and the generator versions. 00:39:55.280 |
Also in Keras there's generally an on-batch version, so there's a train on-batch and a 00:40:01.280 |
predict on-batch. What these do is they basically have no smarts to them at all. This is like 00:40:07.480 |
the most basic thing. So this is just going to take whatever you give it and call predict 00:40:11.720 |
on this thing. It won't shuffle it, it won't batch it, it's just going to throw it directly 00:40:18.280 |
So I'm just going to take a model, it's going to call predict on just this batch of data. 00:40:25.000 |
And then from time to time I print out how far I've gone just so that I know how I'm 00:40:29.480 |
going. Also from time to time I call .flush, that's the thing in bcols that actually writes 00:40:36.240 |
it to disk. So this thing doesn't actually take very long to run. And one of the nice 00:40:44.080 |
things I can do here is I can do some data augmentation as well. So I've added a direction 00:40:49.800 |
parameter, and what I'm going to do is I'm going to have a second copy of all of my images 00:40:55.200 |
which is flipped horizontally. So to flip things horizontally, that's interesting, I think I 00:41:04.040 |
screwed this up. To flip things horizontally, you've got batch, height, and then this is 00:41:20.120 |
columns. So if we pass in a -1 here, then it's going to flip it horizontally. That explains 00:41:31.600 |
why some of my results haven't been quite as good as I hoped. 00:41:36.280 |
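Putting those pieces together, here is a sketch of the batch-at-a-time prediction loop with the horizontal-flip pass (array names and paths are illustrative; `mid_model` stands for the cut-down ResNet described above):

```python
import numpy as np
import bcolz

def save_features(src_arr, model, fname, batch_size=128, direction=1):
    out = bcolz.carray(np.empty((0,) + model.output_shape[1:], dtype=np.float32),
                       chunklen=32, mode='w', rootdir=fname)
    for i in range(0, len(src_arr), batch_size):
        batch = src_arr[i:i + batch_size]       # a slice of the bcolz source array
        if direction == -1:
            batch = batch[:, :, ::-1]           # reverse the column axis: horizontal flip
        out.append(model.predict_on_batch(batch))
        if i % (batch_size * 100) == 0:
            print(i)                            # occasional progress report
            out.flush()                         # write what we have so far to disk
    out.flush()
    return out

feat_fwd = save_features(src_arr, mid_model, 'results/feat_fwd.bc', direction=1)
feat_bwd = save_features(src_arr, mid_model, 'results/feat_bwd.bc', direction=-1)
```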
So when you run this, we're going to end up with a big bcolz array that's going to 00:41:41.560 |
contain, for two copies of every resized ImageNet image, the activations at the layer that we have, 00:41:50.640 |
one layer before this. So I call it once with direction forwards and once with direction 00:41:57.960 |
backwards. So at the end of that, I've now got nearly 2 million activations of 2x2x1024. 00:42:08.280 |
So that's pretty close to the end of ResNet. I've 00:42:19.120 |
then just copied and pasted from the Keras code the last few steps of ResNet. So this 00:42:24.720 |
is the last few blocks. I added in one extra identity block just because I had a feeling 00:42:31.160 |
that might help things along a little bit. Again, people have not really studied this 00:42:35.120 |
yet, so I haven't had a chance to properly experiment, but it seemed to work quite well. 00:42:42.080 |
This is basically copied and pasted from Keras's code. I then need to copy the weights from 00:42:48.600 |
Keras for those last few layers of ResNet. So now I'm going to repeat the same process 00:42:54.240 |
again, which is to predict on these last few layers. The input will be the output from 00:43:01.280 |
the previous one. So we went like 2/3 of the way into ResNet and got those activations 00:43:07.880 |
and put those activations into the last few stages of ResNet to get those activations. 00:43:14.000 |
Now the outputs from this are actually just a vector of length 2048, which does fit in 00:43:23.280 |
my RAM, so I didn't bother with calling predict on batch, I can just call .predict. If you 00:43:29.880 |
try this at home and don't have enough memory, you can use the predict on batch trick again. 00:43:36.280 |
Any time you ran out of memory when calling predict, you can always just use this pattern. 00:43:47.560 |
So at the end of all that, I've now got the activations from the penultimate layer of 00:43:54.400 |
ResNet, and so I can do a usual transfer learning trick of creating a linear model. My linear 00:44:04.160 |
model is now going to try to use the number of dimensions in my word vectors as its output, 00:44:12.520 |
and you'll see it doesn't have any activation function. That's because I'm not doing one 00:44:18.480 |
hot encoding, my word vectors could be any size numbers, so I just leave it as linear. 00:44:25.920 |
And then I compile it, and then I fit it, and so this linear model is now my very first 00:44:31.640 |
-- this is almost the same as what we did in Lesson 1, dogs vs. cats. We're fine 00:44:38.520 |
tuning a model to a slightly different target to what it was originally trained with. It's 00:44:48.040 |
just that we're doing it with a lot more data, so we have to be a bit more thoughtful. 00:44:53.480 |
There's one other difference here, which is I'm using a custom loss function. And the 00:44:58.560 |
loss function I'm using is cosine distance. You can look that up at home if you're not 00:45:04.240 |
familiar with it, but basically cosine distance says for these two points in space, what's 00:45:09.600 |
the angle between them, rather than how far away are they? The reason we're doing that 00:45:14.560 |
is because we're about to start using k nearest neighbors. So k nearest neighbors, we're going 00:45:19.440 |
to basically say here's the word vector we predicted, which is the word vector which 00:45:24.600 |
is closest to it. It turns out that in really really high dimensional space, the concept 00:45:30.520 |
of how far away something is, is nearly meaningless. And the reason why is that in really really 00:45:36.440 |
high dimensional space, everything sits on the edge of that space. Basically because 00:45:43.960 |
you can imagine as you add each additional dimension, the probability that something 00:45:48.480 |
is on the edge in that dimension, let's say the probability that it's right on the edge 00:45:53.280 |
is like 1/10. Then if you've only got one dimension, you've got a probability of 1/10 00:45:58.000 |
that it's on the edge in that dimension. If you've got two dimensions, the probability that it's 00:46:03.800 |
not on the edge in either shrinks multiplicatively. So in a few hundred dimensional spaces, everything 00:46:09.440 |
is on the edge. And when everything's on the edge, everything is kind of an equal distance 00:46:14.480 |
away from each other, more or less. And so distances aren't very helpful. But the angle 00:46:19.560 |
between things varies. So when you're doing anything with trying to find nearest neighbors, 00:46:28.360 |
it's a really good idea to train things using cosine distance. And this is the formula for 00:46:34.920 |
cosine distance. Again, this is one of these things where I'm skipping over something that 00:46:40.460 |
you'd probably spend a week in undergrad studying. There's heaps of information about cosine distance 00:46:46.660 |
on the web. So for those of you already familiar with it, I won't waste your time. For those 00:46:50.640 |
of you not, it's a very very good idea to become familiar with this. And feel free to 00:46:56.360 |
ask on the forums if you can't find any material that makes sense. 00:47:02.000 |
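A minimal sketch of a cosine-distance loss and the linear head described here (the feature size of 2048 and word-vector size of 300 are assumptions; use whatever your activations and word vectors actually are):

```python
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

def cos_distance(y_true, y_pred):
    # 1 - cosine similarity: small when the vectors point the same way, whatever their length
    y_true = K.l2_normalize(y_true, axis=-1)
    y_pred = K.l2_normalize(y_pred, axis=-1)
    return K.mean(1 - K.sum(y_true * y_pred, axis=-1))

n_wv = 300   # dimensionality of the word vectors
lin_model = Sequential([Dense(n_wv, input_shape=(2048,))])   # linear: no activation
lin_model.compile(optimizer='adam', loss=cos_distance)
```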
So we've fitted our linear model. As per usual, we save our weights. And we can see how we're 00:47:07.940 |
going. So what we've got now is something where we can feed in an image, and it will 00:47:14.720 |
spit out a word vector. But it's something that looks like a word vector. It has the 00:47:20.600 |
same dimensionality as a word vector. But it's very unlikely that it's going to be the 00:47:25.120 |
exact same vector as one of our thousand target word vectors. So if the word vector for a 00:47:32.800 |
pug is this list of 200 floats, even if we have a perfectly puggy pug, we're not going 00:47:40.680 |
to get that exact list of 200 floats. We'll have something that is similar. And when we 00:47:46.120 |
say similar, we probably mean that the cosine distance between the perfect platonic pug 00:47:51.560 |
and our pug is pretty small. So that's why after we get our predictions, we then have 00:48:05.800 |
to use nearest neighbors as a second step to basically say, for each of those predictions, 00:48:11.520 |
what are the three word vectors that are the closest to that prediction? 00:48:18.440 |
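A sketch of that nearest-neighbour step with scikit-learn; `wordvecs` is the 1000 x n_wv matrix of target word vectors and `preds` the model's predictions. (LSHForest was later deprecated in scikit-learn; on a recent version you would use NearestNeighbors with metric='cosine' instead.)

```python
from sklearn.neighbors import LSHForest

nn = LSHForest(n_estimators=20, n_neighbors=3)
nn.fit(wordvecs)                        # index the target word vectors
dists, idxs = nn.kneighbors(preds)      # the 3 closest word vectors for each prediction
```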
So we can now take those nearest neighbors and find out for a bunch of our images what 00:48:25.100 |
are the three things it thinks it might be. For example, for this image here, its best 00:48:32.080 |
guess was trombone, next was flute, and third was cello. This gives us some hope that this 00:48:39.520 |
approach seems to be working okay. It's not great yet, but it's recognized these things 00:48:44.560 |
are musical instruments, and its third guess was in fact the correct musical instrument. 00:48:49.320 |
So we know what to do next. What we do next is to fine-tune more layers. And because we 00:48:55.240 |
have already saved the intermediate results from an earlier layer, that fine-tuning is going to be fast. 00:49:02.960 |
Two more things I briefly mentioned. One is that there's a couple of different ways to 00:49:08.920 |
do nearest neighbors. One is what's called the brute force approach, which is literally 00:49:13.200 |
to go through everyone and see how far away it is. There's another approach which is approximate 00:49:23.960 |
nearest neighbors. And when you've got lots and lots of things, you're trying to look 00:49:28.640 |
for nearest neighbors, the brute force approach is going to be n^2 time. It's going to be super 00:49:35.000 |
slow. Approximate nearest neighbors are generally n log n time. So orders of magnitude faster 00:49:48.360 |
The particular approach I'm using here is something called locality-sensitive hashing. 00:49:53.120 |
It's a fascinating and wonderful algorithm. Anybody who's interested in algorithms, I 00:49:57.840 |
strongly recommend you go read about it. Let me know if you need a hand with it. My favorite 00:50:04.900 |
kind of algorithms are these approximate algorithms. In data science, you almost never need to 00:50:13.280 |
know something exactly, yet nearly every algorithm that people learn at university and certainly 00:50:18.640 |
at high school are exact. We learn exact nearest neighbor algorithms and exact indexing algorithms 00:50:24.520 |
and exact median algorithms. Pretty much for every algorithm out there, there's an approximate 00:50:29.800 |
version that runs on the order of n, or n over log n, times faster. One of the cool things is that once 00:50:29.800 |
you start realizing that, you suddenly discover that all of the libraries you've been using 00:50:42.400 |
for ages were written by people who didn't know this. And then you realize that every 00:50:47.520 |
sub-algorithm they've written, they could have used an approximate version. The next 00:50:51.160 |
thing you've got to know, you've got something that runs a thousand times faster. 00:50:55.040 |
The other cool thing about approximate algorithms is that they're generally written to provably 00:51:00.440 |
be accurate to within so close. And it can tell you with your parameters how close is 00:51:05.760 |
so close, which means that if you want to make it more accurate, you run it more times 00:51:12.400 |
with different random seeds. This thing called LSH forest is a locality-sensitive hashing 00:51:18.680 |
forest which means it creates a bunch of these locality-sensitive hashes. And the amazingly 00:51:24.000 |
great thing about approximate algorithms is that each time you create another version 00:51:28.360 |
of it, you're exponentially increasing the accuracy, or multiplicatively increasing the 00:51:34.200 |
accuracy, but only linearly increasing the time. So if the error on one call of LSH was 00:51:43.480 |
e, then the error on two calls is e^2 and on 3 calls it's e^3, so the accuracy is 1 - e^2 and then 1 - e^3. And the time you're 00:51:53.600 |
taking is now 2n and 3n. So when you've got something where you can make it as accurate 00:52:01.160 |
as you like with only linear increasing time, this is incredibly powerful. This is a great 00:52:08.440 |
approximation algorithm. I wish we had more time, so I'd love to tell you all about it. 00:52:16.860 |
So I generally use LSH forest when I'm doing nearest neighbors because it's arbitrarily 00:52:21.400 |
close and much faster when you've got lots of word vectors. The time that becomes important 00:52:29.640 |
is when I move beyond ImageNet, which I'm going to do now. 00:52:34.440 |
So let's say I've got a picture, and I don't just want to say which one of the thousand 00:52:40.620 |
ImageNet categories is it. Which one of the 100,000 WordNet nouns is it? That's a much 00:52:48.240 |
harder thing to do. And that's something that no previous model could do. When you trained 00:52:54.880 |
an ImageNet model, the only thing you could do is recognize pictures of things that were 00:52:58.720 |
in ImageNet. But now we've got a word vector model, and so we can put in an image that 00:53:06.520 |
spits out a word vector, and that word vector could be closer to things that are not in 00:53:11.480 |
ImageNet at all. Or it could be some higher level of the hierarchy, so we could look for 00:53:16.840 |
a dog rather than a pug, or a plane rather than a 747. 00:53:24.220 |
So here we bring in the entire set of word vectors. I'll have to remember to share these 00:53:30.600 |
with you because these are actually quite hard to create. And this is where I definitely 00:53:36.040 |
want LSHForest because this is going to be pretty slow. And we can now do the same thing. 00:53:43.280 |
And not surprisingly, it's got worse. The thing that was actually cello, now cello is not 00:53:47.680 |
even in the top 3. So this is a harder problem. So let's try fine-tuning. So fine-tuning is 00:53:56.360 |
the final trick I'm going to show you. Just behind you, Rachel. 00:54:00.960 |
You might remember last week we looked at creating our word vectors, and what we did 00:54:17.800 |
was actually I created a list. I went to WordNet and I downloaded the whole of WordNet, and 00:54:32.680 |
then I figured out which things were nouns, and then I used a regex to parse out those, 00:54:37.480 |
and then I saved that. So we actually have the entirety of WordNet nouns. 00:54:45.320 |
Because it's not a good enough model yet. So now that there's 80,000 nouns, there's a lot 00:55:00.040 |
more ways to be wrong. So when it only has to say which of these thousand things is it, 00:55:05.600 |
that's pretty easy. Which of these 80,000 things is it? It's pretty hard. To fine-tune it, it 00:55:19.680 |
looks very similar to our usual way of fine-tuning things, which is that we take our two models 00:55:27.360 |
and stick them back to back, and we're now going to train the whole thing rather than 00:55:37.320 |
The problem is that the input to this model is too big to fit in RAM. So how are we going 00:55:46.880 |
to call fit or fit generator when we have an array that's too big to fit in RAM? Well, 00:55:54.840 |
one obvious thing to do would be to pass in the bcolz array. Because to most things in 00:56:00.280 |
Python, a bcolz array looks just like a regular array. It doesn't really look any different. 00:56:07.500 |
A bcolz array is actually stored in a directory, as I'm sure 00:56:21.720 |
you've noticed. And in that directory, it's got something called chunk length, which I set 00:56:33.960 |
to 32 when I created these bcolz arrays. What it does is it takes every 32 images and it 00:56:42.280 |
puts them into a separate file. Each one of these has 32 images in it, or 32 of the leading 00:56:58.400 |
Now if you then try to take this whole array and pass it to .fit in Keras with shuffle, 00:57:08.600 |
it's going to try and grab one thing from here and one thing from here and one thing 00:57:12.720 |
from here. Here's the bad news. For bcolz to get one thing out of a chunk, it has to read 00:57:20.320 |
and decompress the whole thing. It has to read and decompress 32 images in order to give 00:57:25.800 |
you the one image you asked for. That would be a disaster. That would be ridiculously 00:57:30.640 |
horribly slow. We didn't have to worry about that when we called predict on batch. We were 00:57:39.960 |
going not shuffling, but we were going in order. So it was just grabbing one. It was 00:57:49.520 |
never grabbing a single image out of a chunk. But now that we want to shuffle, it would. 00:57:58.240 |
So what we've done is somebody very helpfully actually on a Kaggle forum provided something 00:58:07.080 |
called a bcolz array iterator. The bcolz array iterator, which was kindly discovered on the 00:58:19.000 |
forums by somebody named MP Janssen, originally written by this fellow, provides a Keras-compatible 00:58:34.120 |
generator which grabs an entire chunk at a time. So it's a little bit less random, but 00:58:43.680 |
given that if this has got 2 million images in and the chunk length is 32, then it's going 00:58:49.480 |
to basically create a batch of chunks rather than a batch of images. And so that means 00:58:56.700 |
we have none of the performance problems, and particularly because we randomly shuffled 00:59:03.120 |
our files. So this whole thing is randomly shuffled anyway. So this is a good trick. 00:59:10.140 |
So you'll find the bcolz array iterator on GitHub. Feel free to take a look at the code. 00:59:17.000 |
It's pretty straightforward. There were a few issues with the original version, so MP 00:59:25.560 |
Janssen and I have tried to fix it up and I've written some tests for it and he's written 00:59:29.320 |
some documentation for it. But if you just want to use it, then it's as simple as writing 00:59:36.520 |
this. Blah equals bcolz array iterator, this is your data, these are your labels, shuffle 00:59:45.240 |
equals true, batch size equals whatever, and then you can just call fit generator as per 00:59:49.640 |
usual passing in that iterator and that iterator's number of items. 00:59:58.200 |
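In other words, something roughly like this (the BcolzArrayIterator lives in the course repo; the attribute holding the number of items and the fit_generator argument names differ between Keras 1 and 2, e.g. samples_per_epoch vs steps_per_epoch):

```python
from bcolz_array_iterator import BcolzArrayIterator

it = BcolzArrayIterator(features_arr, labels_arr, shuffle=True, batch_size=128)
# Keras 1-style call; it.N is assumed to hold the iterator's number of items
model.fit_generator(it, samples_per_epoch=it.N, nb_epoch=1)
```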
So to all of you guys who have been asking how to deal with data that's bigger than memory, 01:00:06.240 |
this is how you do it. So hopefully that will make life easier for a lot of people. 01:00:14.000 |
So we fine-tune it for a while, we do some learning rate annealing for a while, and this basically 01:00:21.880 |
runs overnight for me. It takes about 6 hours to run. And so I come back the next morning 01:00:29.280 |
and I just copy and paste my k nearest neighbors, so I get my predicted word vectors. For each 01:00:38.040 |
word vector, I then pass it into nearest neighbors. This is with just my 1000 categories. And lo and 01:00:45.280 |
behold, we now have cello in the top spot as we hoped. 01:00:50.880 |
How did it go in the harder problem of looking at the 100,000 or so nouns in English? Pretty 01:00:56.520 |
good. I've got this one right. And just to pick another one at random, let's pick the 01:01:01.680 |
first one. It said throne. That sure looks like a throne. So looking pretty good. 01:01:09.120 |
So here's something interesting. Now that we have brought images and words into the 01:01:14.440 |
same space, let's play with it some more. So why don't we use nearest neighbors with 01:01:22.400 |
those predictions? These are the word vectors which Google created, but the subset of those which 01:01:47.200 |
are nouns according to WordNet, mapped to their synset IDs. 01:02:00.200 |
The word vectors are just the word2vec vectors that we can download off the internet. They 01:02:05.640 |
were pre-trained by Google. We're saying here is this image spits out a vector from a thing 01:02:30.280 |
we just trained. We have 100,000 word2vec vectors for all the nouns in English. Which 01:02:38.480 |
one of those is the closest to the thing that came out of our model? And the answer was 01:02:45.520 |
Hold that thought. We'll be doing language translation starting next week. No, we don't 01:02:59.780 |
quite do it that way, but you can think of it like that. 01:03:02.800 |
So let's do something interesting. Let's create a nearest neighbors not for all of the word2vec 01:03:11.120 |
vectors, but for all of our image-predicted vectors. And now we can do the opposite. Let's 01:03:17.360 |
take a word, we pick it random. Let's look it up in our word2vec dictionary, and let's 01:03:23.120 |
find the nearest neighbors for that in our images. There it is. So this is pretty interesting. 01:03:36.200 |
You can now find the images that are the most like whatever word you come up with. Okay, 01:03:46.560 |
that's crazy, but we can do crazier. Here is a random thing I picked. Now notice I picked 01:03:52.520 |
it from the validation set of ImageNet, so we've never seen this image before. And honestly 01:03:57.100 |
when I opened it up, my heart sank because I don't know what it is. So this is a problem. 01:04:02.800 |
What is that? So what we can do is we can call .predict on that image, and we can then 01:04:15.720 |
do a nearest neighbors of all of our other images. There's the first, there's the second, 01:04:23.880 |
and the third one is even somebody putting their hand on it, which is slightly crazy, 01:04:30.440 |
but that was what the original one looked like. In fact, if I can find it, I ran it 01:04:40.600 |
again on a different image. I actually looked around for something weird. This is pretty 01:04:54.200 |
weird, right? Is this a net or is it a fish? So when we then ask for nearest neighbors, 01:04:59.840 |
we get fish in nets. So it's like, I don't know, sometimes deep learning is so magic 01:05:07.040 |
you just kind of go "wow". Just behind you, Rachel. 01:05:14.240 |
Only a little bit, and maybe in a future course we might look at Dask. I think maybe even 01:05:33.000 |
in your numerical and algebra course you might be looking at Dask. I don't think we'll cover 01:05:37.360 |
it this course. But do look at Dask, D-A-S-K, it's super cool. 01:05:48.400 |
No, not at all. So these were actually labeled as this particular kind of fish. In fact that's 01:06:01.960 |
the other thing is it's not only found fish in nets, but it's actually found more or less 01:06:05.960 |
the same breed of fish in the nets. But when we called dot predict on those, it created 01:06:17.520 |
a word vector which was probably like halfway between that kind of fish and a net because 01:06:27.080 |
it doesn't know what to do, right? So sometimes when it sees things like that, it would have 01:06:30.800 |
been marked in imageNet as a net, and sometimes it would have been a fish. So the best way 01:06:35.200 |
to minimize the loss function would have been to kind of hedge. So it hedged and as a result 01:06:40.760 |
the images that were closest were the ones which actually were halfway between the two 01:06:45.080 |
themselves. So it's kind of a convenient accident. 01:06:48.840 |
You absolutely can and I have, but really for nearest neighbors I haven't found anything 01:07:18.680 |
nearly as good as cosine and that's true in all of the things I looked up as well. By 01:07:25.640 |
the way, I should mention when you use locality-sensitive hashing in Python, by default it uses something 01:07:32.280 |
that's equivalent to the cosine metric, so that's why the nearest neighbors work. 01:07:36.520 |
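As a rough sketch of what that lookup can look like (the names `img_vecs`, `word_vecs` and `word_id` are placeholders, not the notebook's actual variables):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Index the image-predicted vectors with a brute-force cosine metric.
nn = NearestNeighbors(n_neighbors=3, metric='cosine', algorithm='brute').fit(img_vecs)

# Query with a single word2vec vector to get the indices of the closest images.
dists, idxs = nn.kneighbors(word_vecs[word_id].reshape(1, -1))
```

Swapping which set is indexed and which is used as the query gives the reverse direction, image to word.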
So starting next week we're going to be learning about sequence-to-sequence models and memory 01:07:50.880 |
and attention methods. They're going to show us how we can take an input such as a sentence 01:07:56.960 |
in English and an output such as a sentence in French, which is the particular case study 01:08:02.280 |
we're going to be spending 2 or 3 weeks on. When you combine that with this, you get image 01:08:07.680 |
captioning. I'm not sure if we're going to have time to do it ourselves, but it will 01:08:12.480 |
literally be trivial for you guys to take the two things and combine them and do image 01:08:18.680 |
captioning. It's just those two techniques together. 01:08:25.940 |
So we're now going to switch to -- actually before we take a break, I want to show you 01:08:35.760 |
the homework. Hopefully you guys noticed I gave you some tips because it was a really 01:08:42.440 |
challenging one. Even though in a sense it was kind of straightforward, which was take 01:08:48.240 |
everything that we've already learned about super-resolution and slightly change the loss 01:08:51.960 |
function so that it does perceptual losses for style transfer instead, the details were fiddly. 01:08:57.840 |
I'm going to quickly show you two things. First of all, I'm going to show you how I 01:09:01.600 |
did the homework because I actually hadn't done it last week. Luckily I have enough RAM 01:09:07.640 |
that I could read the two things all into memory, so don't forget you can just do that 01:09:11.680 |
with a bcolz array: slicing it with [:] returns it as a NumPy array in memory. 01:09:17.520 |
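That looks something like this (the file paths are placeholders, not the notebook's actual names):

```python
import bcolz

# Open the on-disk bcolz arrays and slice with [:] to pull them into RAM
# as ordinary NumPy arrays.
arr_hr = bcolz.open('data/trn_hi_res.bc')[:]
arr_lr = bcolz.open('data/trn_lo_res.bc')[:]
```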
So one thing I did was I created my up-sampling block to get rid of the checkerboard patterns. 01:09:22.920 |
That was literally as simple as saying UpSampling2D and then a 1x1 conv, and that got rid of my checkerboard patterns. 01:09:29.040 |
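A minimal sketch of that kind of block (Keras 2 layer names; the batch norm and ReLU are common additions and may differ from the notebook):

```python
from keras.layers import UpSampling2D, Conv2D, BatchNormalization, Activation

def up_block(x, filters):
    # Nearest-neighbour upsample, then a convolution: this avoids the
    # checkerboard artifacts that Deconvolution2D tends to produce.
    x = UpSampling2D()(x)
    x = Conv2D(filters, (1, 1), padding='same')(x)  # the lecture mentions a 1x1 conv here
    x = BatchNormalization()(x)
    return Activation('relu')(x)
```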
The next thing I did was I changed my loss function, and I decided 01:09:38.420 |
before I tried to do style transfer with perceptual losses, let's try and do super-resolution with 01:09:46.400 |
multiple content-loss layers. That's one thing I'm going to have to do for style transfer 01:09:52.560 |
is be able to use multiple layers. So I always like to start with something that works and 01:09:57.720 |
make small little changes so it keeps working at every point. 01:10:01.640 |
So in this case, I thought, let's first of all slightly change the loss function for 01:10:08.760 |
super-resolution so that it uses multiple layers. So here's how I did that. I changed 01:10:14.720 |
my get_output layer. Sorry, I changed my VGG content function so it created a list of outputs: conv1 01:10:24.920 |
from each of the first, second and third blocks. And then I changed my loss function so it 01:10:32.800 |
went through and added the mean squared difference for each of those three layers. I also decided 01:10:40.000 |
to add a weight just for fun. So I decided to go 0.1, 0.8, 0.1 because this is the layer 01:10:46.280 |
that they used in the paper. But let's have a little bit more precise super-resolution 01:10:53.320 |
and a little bit more semantic super-resolution and see how it goes. I created this function 01:10:59.880 |
to do a more general mean squared error. And that was basically it. Other than that line 01:11:07.960 |
everything else was the same, so that gave me super-resolution working on multiple layers. 01:11:15.960 |
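In outline, the weighted multi-layer content loss looks something like this (a sketch with placeholder names; `outs` and `targs` stand for the lists of VGG activations from the network output and the hi-res target):

```python
from keras import backend as K

w = [0.1, 0.8, 0.1]  # block1, block2 (the layer the paper used), block3

def multi_content_loss(outs, targs):
    # Weighted sum of per-layer mean squared differences, one term per VGG block.
    loss = 0
    for o, t, wi in zip(outs, targs, w):
        loss += wi * K.mean(K.square(o - t), axis=[1, 2, 3])
    return loss
```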
One of the things I found fascinating is that this is the original low-res, and it's done 01:11:22.560 |
a good job of upscaling it, but it's also fixed up the weird white balance, which really 01:11:27.880 |
surprised me. It's taken this obviously over-yellow shot, and this is what ceramic should look 01:11:34.880 |
like, it should be white. And somehow it's adjusted everything, so the serviette or whatever 01:11:40.040 |
it is in the background has gone from a yellowy-brown to a nice white, as with these cups here. 01:11:45.560 |
It's figured out that these slightly pixelated things are actually meant to be upside-down 01:11:49.080 |
handles. This is on only 20,000 images. I'm very surprised that it's fixing the color 01:11:59.200 |
because we never asked it to, but I guess it knows what a cup is meant to look like, 01:12:06.800 |
and so this is what it's decided to do: make a cup the way it thinks it's meant to look. 01:12:18.640 |
So then to go from there to style-transfer was pretty straightforward. I had to read 01:12:23.040 |
in my style as before. This is the code to do this special kind of resnet block where 01:12:30.560 |
we use valid convolutions, which means we lose two pixels each time, and so therefore 01:12:36.920 |
we have to do a center crop. So don't forget, lambda layers are great for this kind of thing. 01:12:43.080 |
Whatever code you can write, chuck it in a lambda layer, and suddenly it's a Keras layer. 01:12:47.560 |
So do my center crop. This is now a resnet block which does valid convs. This is basically 01:12:54.920 |
all exactly the same. We have to do a few downsamplings, and then the computation, and 01:13:00.720 |
our upsampling, just like in the paper's supplementary material. 01:13:05.420 |
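Roughly, the valid-convolution residual block with the centre crop looks like this (a sketch assuming channels-last tensors and that the block input already has `filters` channels; names are placeholders):

```python
from keras.layers import Conv2D, BatchNormalization, Activation, Lambda, add

def conv_block(x, filters, act=True):
    x = Conv2D(filters, (3, 3), padding='valid')(x)  # 'valid' shaves 1 pixel off each edge
    x = BatchNormalization()(x)
    return Activation('relu')(x) if act else x

def res_crop_block(ip, filters):
    x = conv_block(ip, filters)
    x = conv_block(x, filters, act=False)
    # Two valid 3x3 convs have removed 2 pixels from each edge, so centre-crop
    # the input to match before adding; the Lambda makes the crop a Keras layer.
    ip_crop = Lambda(lambda t: t[:, 2:-2, 2:-2, :])(ip)
    return add([x, ip_crop])
```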
So the loss function looks a lot like the loss function did before, but we've got two 01:13:10.760 |
extra things. One is the Gram matrix. So here is a version of the Gram matrix which works 01:13:17.160 |
a batch at a time. If any of you tried to do this a single image at a time, you would have 01:13:21.640 |
gone crazy with how slow it was. I saw a few of you trying to do that. So here's the batch-at-a-time version. 01:13:30.040 |
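A sketch of a batchwise Gram matrix in Keras backend code (channels-last assumed; the normalisation constant is just one common choice):

```python
from keras import backend as K

def gram_matrix_b(x):
    # x: (batch, height, width, channels). Move channels first, flatten the
    # spatial dims, then take all channel-by-channel dot products per sample
    # in a single batch_dot call instead of looping over images.
    x = K.permute_dimensions(x, (0, 3, 1, 2))
    s = K.shape(x)
    feat = K.reshape(x, (s[0], s[1], -1))
    gram = K.batch_dot(feat, K.permute_dimensions(feat, (0, 2, 1)))
    return gram / K.cast(s[1] * s[2] * s[3], K.floatx())
```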
And then the second thing I needed to do was somehow feed in my style target. Another thing 01:13:35.280 |
I saw some of you do was feed that style target array into 01:13:45.320 |
your loss function. You can obviously calculate your style target by just calling .predict 01:13:53.640 |
with the thing which gives you all your different style target layers, but the problem is this 01:13:58.880 |
thing here returns a NumPy array. It's a pretty big NumPy array, which means that then when 01:14:04.600 |
you want to use it as a style target in training, it has to copy that back to the GPU. And copying 01:14:11.680 |
to the GPU is very, very slow. And this is a really big thing to copy to the GPU. So 01:14:16.760 |
any of you who tried this, and I saw some of you try it, it took forever. 01:14:21.760 |
So here's the trick. Call .variable on it. Turning something into a variable puts it 01:14:29.440 |
on the GPU for you. So once you've done that, you can now treat this as a list of symbolic 01:14:37.640 |
entities which are the GPU versions of this. So I can now use this inside my GPU code. 01:14:46.600 |
So here are my style targets I can use inside my loss function, and it doesn't have to do 01:14:57.800 |
any copying backwards and forwards. So there's a subtlety, but if you don't get that subtlety 01:15:03.520 |
right, you're going to be waiting for a week or so for your code to finish. 01:15:10.000 |
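The trick itself is essentially one line. A sketch, where `style_layers_model` is assumed to be a model mapping an image to the chosen VGG style layers and `style_img` is the preprocessed style image:

```python
from keras import backend as K

# .predict returns NumPy arrays; wrapping each one in K.variable puts the values
# on the GPU once, so the loss function can use them symbolically without
# copying them over again on every batch.
style_targs = [K.variable(o) for o in style_layers_model.predict(style_img)]
```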
So those were the little subtleties which were necessary to get this to work. And once 01:15:15.400 |
you get it to work, it does exactly the same thing basically as before. 01:15:20.200 |
So where this gets combined with big datasets is that I wanted to try something interesting, which 01:15:26.920 |
is in the original Perceptual Losses paper, they trained it on the COCO dataset which 01:15:34.480 |
has 80,000 images, which didn't seem like many. I wanted to know what would happen if 01:15:39.680 |
we trained it on all of ImageNet. So I did. So I decided to train a super-resolution network 01:15:49.120 |
on all of ImageNet. And the code's all identical, so I'm not going to explain it. Other than, 01:15:57.320 |
you'll notice we don't have the [:] here anymore because 01:16:02.480 |
we don't want to try and read in the entirety of ImageNet into RAM. So these are still 01:16:06.600 |
bcolz arrays. All the other code is identical until we get to here. So I use a bcolz array 01:16:19.760 |
iterator. I can't just call .fit because .fit or .fit_generator assumes that your iterator 01:16:33.160 |
is returning your data and your labels. In our case, we don't have data and labels. We 01:16:40.920 |
have two things that both get fed in as two inputs, and our labels are just a list of zeros. 01:16:47.720 |
So here's a good trick. This answers your earlier question about how do you do multi-input 01:16:56.600 |
models on large datasets. The answer is create your own training loop which loops through 01:17:05.200 |
a bunch of iterations, and then you can grab as many batches of data from as many different 01:17:11.160 |
iterators as you like, and then call train_on_batch. So in my case, my bcolz array iterator 01:17:18.720 |
is going to return my high resolution and low resolution batch of images. So I go through 01:17:24.000 |
a bunch of iterations, grab one batch of high res and low res images, and pass them in as my two inputs. 01:17:34.920 |
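A compact sketch of that loop (the iterator, the input order and the batch size here are assumptions, not the notebook's exact code; the zeros are the dummy targets for a model that outputs its own loss):

```python
import numpy as np

def train(model, bc_it, n_iter, batch_size=16):
    targ = np.zeros((batch_size, 1))
    for i in range(n_iter):
        hr, lr = next(bc_it)                  # one batch from the bcolz iterator
        model.train_on_batch([lr, hr], targ)  # two inputs, dummy zero targets
```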
So this is the only code I changed, other than changing .fit_generator to actually calling 01:17:44.600 |
train. So as you can see, this took me 4.5 hours to train and I then decreased the learning 01:17:55.440 |
rate and I trained for another 4.5 hours. Actually, I did it overnight last night and I only had 01:18:00.080 |
enough time to do about half of ImageNet, so this isn't even the whole thing. But check this out. 01:18:06.040 |
So take that model and we're going to call .predict. This is the original high res image. 01:18:14.920 |
Here's the low res version. And here's the version that we've created. And as you can 01:18:22.360 |
see, it's done a pretty extraordinarily good job. When you look at the original ball, there 01:18:29.800 |
was this kind of vague yellow thing here. It's kind of turned it into a nice little 01:18:34.000 |
version. You can see that her eyes were like two grey blobs. It's kind of turned them into 01:18:40.520 |
some eyes. You could just tell that that's an A, maybe if you look carefully. Now it's 01:18:48.520 |
very clearly an A. So you can see it does an amazing job of upscaling this. 01:18:58.120 |
Better still, this is a fully convolutional net and therefore is not specific to any particular 01:19:03.680 |
input resolution. So what I can do is I can create another version of the model using 01:19:11.280 |
our high res as the input. So now we're going to call .predict with the high res input, 01:19:21.520 |
and that's what we get back. So look at that, we can now see all of this detail on the basketball, 01:19:32.440 |
which simply, none of that really existed here. It was there, but pretty hard to see 01:19:39.160 |
what it was. And look at her hair, this kind of grey blob here. Here you can see it knows 01:19:51.440 |
it's like little bits of pulled back hair. So we can take any sized image and make it 01:19:59.080 |
bigger. This to me is one of the most amazing results I've seen in deep learning. That something we 01:20:08.280 |
trained on nearly all of ImageNet, for just a single epoch, so there's definitely 01:20:13.000 |
no overfitting, is able to recognize what hair is meant to look like when pulled 01:20:18.480 |
back into a bun is a pretty extraordinary result, I think. Something else which I only 01:20:25.960 |
realized later is that it's all a bit fuzzy, right? And there's this arm in the background 01:20:35.440 |
that's a bit fuzzy. The model knows that that is meant to stay fuzzy. It knows what out-of-focus 01:20:44.680 |
things look like. Equally cool is not just how that A is now incredibly precise and accurate, 01:20:55.320 |
but the fact that it knows that blurry things need to stay blurry. I don't know if you're 01:21:01.800 |
as amazed as this as I am, but I thought this was a pretty cool result. We could run this 01:21:08.240 |
over a 24-hour period on maybe two epochs of all of ImageNet, and presumably it would 01:21:15.160 |
get even better still. Okay, so let's take a 7-minute break and see you back here at 01:21:19.240 |
5 past 8. Okay, thanks everybody. That was fun. So we're going to do something else fun. 01:21:41.840 |
And that is to look at -- oh, before I continue, I did want to mention one thing in the homework 01:21:51.600 |
that I changed, which is I realized in my manually created loss function, I was already 01:22:04.400 |
doing a mean squared error in the loss function. But then when I told Keras to make that thing 01:22:15.400 |
as close to 0 as possible, I had to also give it a loss function, and I was giving it MSE. 01:22:21.440 |
And effectively that was kind of squaring my squared errors, which seemed wrong. So I've 01:22:25.640 |
changed it to M-A-E, mean absolute error. So when you look back over the notebooks, that's 01:22:31.840 |
why, because this is just to say, hey, get the loss as close to 0 as possible. I didn't 01:22:39.240 |
really want to re-square it. That didn't make any sense. So that's why you'll see that minor 01:22:46.280 |
change. The other thing to mention is I did notice that when I retrained my super resolution 01:22:55.040 |
on my new images that didn't have the black border, it gave good results much, much faster. 01:23:01.800 |
And so I really think that thing of learning to put the black border back in seemed to 01:23:05.720 |
take quite a lot of effort for it. So again, hopefully some of you are going to look into that. 01:23:16.200 |
So we're going to learn about generative adversarial networks. This will kind of close off our 01:23:22.600 |
deep dive into generative models as applied to images. And just to remind you, the purpose 01:23:30.880 |
of this has been to learn about generative models, not to specifically learn about super 01:23:35.880 |
resolution or artistic style. But remember, these things can be used to create all kinds 01:23:42.600 |
of images. So one of the groups is interested in taking a 2D photo and trying to turn it 01:23:48.320 |
into something that you can rotate in 3D, or at least show a different angle of that 01:23:52.720 |
2D photo. And that's a great example of something that this should totally work for. It's just 01:23:59.240 |
a mapping from one image to some different image, which is like: what would this image look like from a different angle? 01:24:07.000 |
So keep in mind the purpose of this is just like in Part 1, we learned about classification, 01:24:14.320 |
which you can use for 1000 things. Now we're learning about generative models that you can use for another 1000 things. 01:24:20.800 |
Now any generative model you build, you can make it better by adding on top of it 01:24:29.040 |
a generative adversarial network. And this is something I don't really feel like has 01:24:33.880 |
been fully appreciated. People I've seen generally treat GANs as a different way of creating 01:24:39.480 |
a generative model. But I think of this more as like, why not create a generative model 01:24:45.820 |
using the kind of techniques we've been talking about. But then think of it this way. Think 01:24:50.800 |
of all the artistic style stuff we were doing in my terrible attempt at a Simpsons cartoon 01:25:01.000 |
version of a picture. It looked nothing like a Simpsons. So what would be one way to improve that? 01:25:10.120 |
One way to improve that would be to create two networks. There would be one network that 01:25:18.200 |
takes our picture, which is actually not the Simpsons, and takes another picture that actually 01:25:25.280 |
is the Simpsons. And maybe we can train a neural network that takes those two images 01:25:32.720 |
and spits out something saying, Is that a real Simpsons image or not? And this thing 01:25:41.760 |
we'll call the discriminator. So we could easily train a discriminator right now. It's 01:25:52.560 |
just a classification network. Just use the same techniques we used in Part 1. We feed 01:25:57.720 |
it the two images, and it's going to spit out a 1 if it's a real Simpsons cartoon, and 01:26:04.040 |
a 0 if it's Jeremy's crappy generative model of Simpsons. That's easy, right? We know how to do that. 01:26:14.360 |
Now, go and build another model. There's two images as inputs. So you would feed it one 01:26:37.520 |
thing that's a Simpsons and one thing that's a generative output. It's up to you to feed 01:26:43.560 |
it one of each. Or alternatively, you could feed it one thing. In fact, probably easier 01:26:52.120 |
is to just feed it one thing and it spits out, Is it the Simpsons or isn't it the Simpsons? 01:26:57.360 |
And you could just mix them and match them. Actually, it's the latter that we're going 01:27:00.720 |
to do, so that's probably easier. We're going to have one thing which is either not a Simpsons 01:27:11.720 |
or it is a Simpsons, and we're going to have a mix of 50/50 of those two, and we're going 01:27:18.360 |
to have something come out saying, "What do you think? Is it real or not?" So this thing, 01:27:25.480 |
this discriminator, from now on we'll probably generally be calling it D. So there's a thing 01:27:30.400 |
called D. And we can think of that as a function. D is a function that takes some input, x, which 01:27:38.760 |
is an image, and spits out a 1 or a 0, or maybe a probability. 01:27:48.360 |
So what we could now do is create another neural network. And what this neural network 01:27:55.200 |
is going to do is it's going to take as input some random noise, just like all of our generators 01:28:03.440 |
have so far. And it's going to spit out an image. And the loss function is going to be 01:28:14.120 |
if you take that image and stick it through D, did you manage to fool it? So could you 01:28:24.440 |
create something where in fact it would say, "Oh yeah, totally, that's a real Simpsons." 01:28:32.160 |
So if that was our loss function, we're going to call the generator, we'll call it G. It's 01:28:37.440 |
just something exactly like our perceptual losses style transfer model. It could be exactly 01:28:42.960 |
the same model. But the loss function is now going to be take the output of that and stick 01:28:48.840 |
it through D, the discriminator, and try to trick it. So the generator is doing well if it manages to fool the discriminator. 01:28:59.320 |
So one way to do this would be to take our discriminator and train it as best as we can 01:29:05.960 |
to recognize the difference between our crappy Simpsons and real Simpsons, and then get a 01:29:11.680 |
generator and train it to trick that discriminator. But now at the end of that, it's probably 01:29:17.560 |
still not very good because you realize that actually the discriminator didn't have to 01:29:21.320 |
be very good before because my Simpsons generators were so bad. So I could now go back and retrain 01:29:27.200 |
the discriminator based on my better generated images, and then I could go back and retrain the generator again. 01:29:37.560 |
And that is the general approach of a GAN, is to keep going back between two things, 01:29:43.440 |
which is training a discriminator and training a generator using a discriminator as a loss 01:29:50.400 |
function. So we've got one thing which is discriminator on some image, and another thing 01:29:59.200 |
which is a discriminator on a generator on some noise. 01:30:17.480 |
In practice, these things are going to spit out probabilities. So that's the general idea. 01:30:29.960 |
In practice, they found it very difficult to do this: train the discriminator as 01:30:36.560 |
best as we can, stop; train the generator as best as we can, stop; and so on and so forth. 01:30:43.000 |
So instead, the original GAN paper is called Generative Adversarial Nets. And here you 01:30:56.240 |
can see they've actually specified this loss function. So here it is in notation. They 01:31:03.240 |
call it minimizing the generator whilst maximizing the discriminator. This is what the min-max refers to. 01:31:14.040 |
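For reference, the objective from the paper is:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

The discriminator D is pushed to score real images high and generated ones low, while the generator G is pushed the other way.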
What they do in practice is they do it a batch at a time. So they have a loop: go 01:31:18.480 |
through the loop, do a single batch through the discriminator, then stick a batch 01:31:22.760 |
through the generator, and so on, a batch at a time. 01:31:28.000 |
So let's look at that. So here's the original GAN from that paper, and we're going to do 01:31:35.040 |
it on MNIST. And what we're going to do is we're going to see if we can start from scratch 01:31:39.880 |
to create something which can create images which the discriminator cannot tell whether 01:31:49.120 |
they're real or fake. And it's a discriminator that has learned to be good at discriminating real from fake. 01:31:58.800 |
So we've loaded in MNIST, and the first thing they do in the paper is just use a standard 01:32:04.120 |
multilayer perceptron. So I'm just going to skip over that and let's get to the perceptron. 01:32:12.800 |
So here's our generator. It's just a standard multilayer perceptron. And here's our discriminator, 01:32:19.840 |
which is also a standard multilayer perceptron. The generator has a sigmoid activation, so 01:32:26.400 |
in other words, we're going to spit out an image where all of the pixels are between 01:32:30.680 |
0 and 1. So if you want to print it out, we'll just multiply it by 255, I guess. 01:32:36.240 |
So there's our generator, there's our discriminator. So there's then the combination of the two. 01:32:42.540 |
So take the generator and stick it into the discriminator. We can just use sequential 01:32:46.360 |
for that. And this is actually therefore the loss function that I want on my generator. 01:32:52.720 |
Generate something and then see if you can fool the discriminator. 01:32:56.520 |
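A minimal sketch of those three pieces in Keras (layer sizes are placeholders; the convention matches the lecture, with the discriminator outputting 1 for fake and 0 for real):

```python
from keras.models import Sequential
from keras.layers import Dense

G = Sequential([Dense(256, activation='relu', input_shape=(100,)),
                Dense(28 * 28, activation='sigmoid')])          # pixels in [0, 1]

D = Sequential([Dense(256, activation='relu', input_shape=(28 * 28,)),
                Dense(1, activation='sigmoid')])                # P(fake)
D.compile(optimizer='adam', loss='binary_crossentropy')

GD = Sequential([G, D])  # generator followed by discriminator
GD.compile(optimizer='adam', loss='binary_crossentropy')
```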
So there's all my architectures set up. So the next thing I need to do is set up this 01:33:02.980 |
thing called train, which is going to do this adversarial training. Let's go back and have 01:33:08.040 |
a look at train. So what train is going to do is go through a bunch of epochs. And notice 01:33:15.260 |
here I wrap it in this TQDM. This is the thing that creates a nice little progress bar. Doesn't 01:33:20.360 |
do anything else, it just creates a little progress bar. We learned about that last week. 01:33:25.140 |
So the first thing I need to do is to generate some data to feed the discriminator. So I've 01:33:31.480 |
created a little function for that. And here's my little function. So it's going to create 01:33:36.380 |
a little bit of data that's real and a little bit of data that's fake. 01:33:40.780 |
So my real data is okay, let's go into my actual training set and grab some randomly 01:33:47.080 |
selected MNIST digits. So that's my real bit. And then let's create some fake. So noise 01:33:57.460 |
is a function that I've just created up here, which creates 100 random numbers. So let's 01:34:02.900 |
create some noise and call g.predict on it. And then I'll concatenate the two together. So now I've 01:34:09.920 |
got some real data and some fake data. And so this is going to try and predict whether 01:34:16.700 |
or not something is fake. So 1 means fake, 0 means real. So I'm going to return my data 01:34:26.320 |
and my labels, which is a bunch of 0s to say they're all real and a bunch of 1s to say 01:34:30.900 |
they're all fake. So that's my discriminator's data. 01:34:34.620 |
So go ahead and create a set of data for the discriminator, and then do one batch of training. 01:34:45.500 |
Now I'm going to do the same thing for the generator. But when I train the generator, 01:34:50.220 |
I don't want to change the discriminator's weights. So make_trainable simply goes through 01:34:56.100 |
each layer and says it's not trainable. So make my discriminator non-trainable and do 01:35:01.640 |
one batch of training where I'm taking noise as my inputs. And my goal is to get the discriminator 01:35:10.740 |
to think that they are actually real. So that's why I'm passing in a bunch of 0s, because 01:35:17.260 |
remember 0 means real. And that's it. And then make discriminator trainable again. 01:35:23.120 |
So keep looking through this. Train the discriminator on a batch of half real, half fake. And then 01:35:30.860 |
train the generator to try and trick the discriminator using all fake. Repeat. 01:35:39.340 |
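That loop, as a sketch built on the G, D and GD models above (X_train is assumed flattened to (n, 784) and scaled to [0, 1]; depending on your Keras version you may need to re-compile after toggling trainable):

```python
import numpy as np

def noise(bs):
    return np.random.rand(bs, 100)

def make_trainable(net, val):
    net.trainable = val
    for l in net.layers:
        l.trainable = val

def train(X_train, n_iter=2000, bs=128):
    for i in range(n_iter):
        # Discriminator step: half real, half fake (0 = real, 1 = fake).
        real = X_train[np.random.randint(0, len(X_train), bs)]
        fake = G.predict(noise(bs))
        D.train_on_batch(np.concatenate([real, fake]),
                         np.concatenate([np.zeros((bs, 1)), np.ones((bs, 1))]))
        # Generator step: freeze D and train G (via GD) to make D say "real" (0).
        make_trainable(D, False)
        GD.train_on_batch(noise(bs), np.zeros((bs, 1)))
        make_trainable(D, True)
```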
So that's the training loop. That's a basic GAN. Because we use TQDM, we get a nice little 01:35:45.220 |
progress bar. I kept track of the loss at each step, so there's our loss for the discriminator, 01:35:54.820 |
and there's our loss for the generator. So our question is, what do these loss curves 01:35:59.180 |
mean? Are they good or bad? How do we know? And the answer is, for this kind of GAN, they 01:36:06.780 |
mean nothing at all. The generator could get fantastic, but it could be because the discriminator 01:36:12.580 |
is terrible. And you don't really know whether each one is good or not, so even the order 01:36:18.300 |
of magnitude of both of them is meaningless. So these curves mean nothing. The direction 01:36:23.340 |
of the curves mean nothing. And this is one of the real difficulties with training GANs. 01:36:29.980 |
And here's what happens when I plot 12 randomly selected random noise vectors stuck through 01:36:36.940 |
there. And we have not got things that look terribly like MNIST digits and they also don't 01:36:41.380 |
look terribly much like they have a lot of variety. This is called mode collapse. Very common 01:36:52.920 |
problem when training GANs. And what it means is that the generator and the discriminator 01:36:59.060 |
have kind of reached a stalemate where neither of them basically knows how to go from here. 01:37:06.440 |
And in terms of optimization, we've basically found a local minimum. So okay, that was not great. 01:37:15.500 |
So the next major paper that came along was this one. Let's go to the top so you can see 01:37:25.580 |
it. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. 01:37:31.780 |
So this created something that they called DCGANs. And the main page that you want to 01:37:39.020 |
look at here is page 3 where they say, "Core to our approach is doing these three things." 01:37:46.900 |
And basically what they do is they just do exactly the same thing as GANs, but they do 01:37:50.860 |
three things. One is to use the kinds of -- well, in fact all of them are the tricks 01:37:56.980 |
that we've been learning for generative models. Use an all-convolutional net, get rid of max 01:38:03.060 |
pooling and use strided convolutions instead, get rid of fully connected layers and use 01:38:08.260 |
lots of convolutional features instead, and add in batch norm. And then use a CNN rather 01:38:13.580 |
than MLP. So here is that. This will look very familiar, it looks just like last lesson stuff. 01:38:24.260 |
So the generator is going to take in a random grid of inputs. It's going to do a batch norm, 01:38:34.500 |
up sample -- you'll notice that I'm doing even newer than this paper, I'm doing the 01:38:38.460 |
up sampling approach because we know that's better. Up sample, 1x1 conv, batch norm, up 01:38:43.960 |
sample, 1x1 conv, batch norm, and then a final conv layer. The discriminator basically does 01:38:51.860 |
the opposite, which is some 2x2 sub-samplings, so down sampling in the discriminator. 01:39:00.660 |
Another trick that I think is mentioned in the paper is, before you do the back and 01:39:06.820 |
forth of a batch for the discriminator and a batch for the generator, to train the discriminator 01:39:13.460 |
for a fraction of an epoch, like do a few batches through the discriminator. So at least 01:39:17.500 |
it knows how to recognize the difference between a random image and a real image a little bit. 01:39:23.820 |
So you can see here I actually just start by calling discriminator.fit with just a very 01:39:29.140 |
small amount of data. So this is kind of like bootstrapping the discriminator. And then 01:39:36.300 |
I just go ahead and call the same train as we had before with my better architectures. 01:39:44.340 |
And again, these curves are totally meaningless. But we have something which if you squint, 01:39:51.260 |
you could almost convince yourself that that's a five. 01:39:55.660 |
So until a week or two before this course started, this was kind of about as good as we had. 01:40:06.260 |
People were much better at the artisanal details of this than I was, and indeed there's a whole 01:40:11.700 |
page called GANhacks, which had lots of tips. But then, a couple of weeks before this class 01:40:22.060 |
started, as I mentioned in the first class, along came the Wasserstein GAN. And the Wasserstein 01:40:29.220 |
GAN got rid of all of these problems. And here is the Wasserstein GAN paper. And this 01:40:40.820 |
paper is quite an extraordinary paper. And it's particularly extraordinary because -- and 01:40:48.820 |
I think I mentioned this in the first class of this part -- most papers tend to either 01:40:55.060 |
be math theory that goes nowhere, or kind of nice experiments in engineering where the 01:41:01.540 |
theory bit is kind of hacked on at the end and kind of meaningless. 01:41:06.220 |
This paper is entirely driven by theory, and then the theory goes on to show this is what 01:41:14.380 |
the theory means, this is what we do, and suddenly all the problems go away. The loss 01:41:18.340 |
curves are going to actually mean something, and we're going to be able to do what I said 01:41:22.220 |
we wanted to do right at the start of this GAN section, which is to train the discriminator 01:41:29.500 |
a whole bunch of steps and then do a generator, and then discriminator a whole bunch of steps 01:41:33.820 |
and do the generator. And all that is going to suddenly start working. 01:41:38.780 |
How do we get it to work? In fact, despite the fact that this paper is both long and 01:41:47.260 |
full of equations and theorems and proofs, and there's a whole bunch of appendices at 01:41:53.300 |
the back with more theorems and proofs, there's actually only two things we need to do. One 01:41:58.180 |
is remove the log from the loss function. So rather than using cross-entropy loss, we're 01:42:05.220 |
just going to use mean squared error. That's one change. 01:42:08.180 |
The second change is we're going to constrain the weights so that they lie between -0.01 01:42:16.180 |
and +0.01. So we're going to constrain the weights to make them small. 01:42:20.820 |
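In Keras terms, the two changes amount to something like this (a sketch re-using the D and GD names from the earlier sketch; 'mse' stands in for removing the log, as the lecture describes it):

```python
import numpy as np

# Change 1: swap the cross-entropy loss for mean squared error.
D.compile(optimizer='rmsprop', loss='mse')
GD.compile(optimizer='rmsprop', loss='mse')

# Change 2: after each discriminator update, clip its weights to a small box.
def clip_weights(net, c=0.01):
    for layer in net.layers:
        layer.set_weights([np.clip(w, -c, c) for w in layer.get_weights()])
```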
Now in the process of saying that's all we're going to do is to kind of massively not give 01:42:26.980 |
credit to this paper, because what this paper is is they figured out that that's what we 01:42:30.620 |
need to do. On the forums, some of you have been reading through this paper and I've already 01:42:36.820 |
given you some tips as to a really great walkthrough, which I'll put on our wiki, that 01:42:44.420 |
explains all the math from scratch. But basically what the math says is this: the loss function 01:42:52.820 |
for a GAN is not really the loss function you put into Keras. We thought we were just 01:42:58.300 |
putting in a cross-entropy loss function, but in fact what we really care about is the difference 01:43:04.380 |
between two distributions: the distribution of the real data and the distribution the generator produces. 01:43:09.460 |
And a difference between two distributions has a very different shape from a loss function 01:43:15.300 |
on its own. So it turns out that this difference, for the two cross-entropy loss functions, 01:43:22.460 |
is something called the Jensen-Shannon distance. And this paper shows that that loss function 01:43:32.620 |
is hideous. It is not differentiable, and it does not have a nice smooth shape at all. 01:43:44.340 |
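For reference, the Jensen-Shannon divergence between the real distribution $p_r$ and the generated distribution $p_g$ is

$$\mathrm{JS}(p_r, p_g) = \tfrac{1}{2}\,\mathrm{KL}\!\Big(p_r \,\Big\|\, \tfrac{p_r + p_g}{2}\Big) + \tfrac{1}{2}\,\mathrm{KL}\!\Big(p_g \,\Big\|\, \tfrac{p_r + p_g}{2}\Big)$$

and the Wasserstein GAN paper's argument is essentially that this quantity behaves badly as a training signal when the two distributions barely overlap.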
So it kind of explains why it is that we kept getting this mode collapse and failing to 01:43:49.620 |
find nice minimums. Mathematically, this loss function does not behave the way a good loss 01:43:55.980 |
function should. And previously we've not come across anything like this because we've 01:44:02.500 |
been training a single function at a time. We really understand those loss functions, 01:44:08.980 |
mean squared error, cross-entropy. Even though we haven't always derived the math 01:44:14.160 |
in detail, plenty of people have. We know that they're kind of nice and smooth and that 01:44:18.620 |
they have pretty nice shapes and they do what we want them to do. In this case, by training 01:44:23.620 |
two things kind of adversarially to each other, we're actually doing something quite different. 01:44:29.540 |
This paper just absolutely fantastically shows, with both examples and with theory, why that's a problem. 01:44:49.620 |
So the cosine distance is the difference between two things, whereas these distances that we're 01:45:02.020 |
talking about here are the distances between two distributions, which is a much more tricky 01:45:07.180 |
problem to deal with. The cosine distance, actually if you look at the notebook during 01:45:12.940 |
the week, you'll see it's basically the same as the Euclidean distance, but you normalize 01:45:20.020 |
the data first. So it has all the same nice properties that the Euclidean distance did. 01:45:29.460 |
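Concretely, once the vectors are normalised to unit length the two are monotonically related:

$$\|u - v\|^2 = 2\,\big(1 - \cos(u, v)\big) \quad \text{for } \|u\| = \|v\| = 1$$

so nearest neighbours under cosine distance and under Euclidean distance on the normalised data come out the same.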
The authors of this paper released their code in PyTorch. Luckily, PyTorch, the first kind 01:45:39.820 |
of pre-release came out in mid-January. You won't be surprised to hear that one of the 01:45:45.300 |
authors of the paper is the main author of PyTorch. So he was writing this before he 01:45:51.580 |
even released the code. There's lots of reasons we want to learn PyTorch anyway, so here's a good excuse to do so. 01:45:58.820 |
So let's look at the Wasserstein GAN in PyTorch. Most of the code, in fact other than this 01:46:05.140 |
pretty much all the code I'm showing you in this part of the course, is very loosely based 01:46:11.700 |
on lots of bits of other code, which I had to massively rewrite because all of it was 01:46:15.500 |
wrong and hideous. This code actually I only did some minor refactoring to simplify things, 01:46:21.620 |
so this is actually very close to their code. So it was a very nice paper with very nice 01:46:30.460 |
So before we look at the Wasserstein GAN in PyTorch, let's look briefly at PyTorch. Basically 01:46:42.460 |
what you're going to see is that PyTorch looks a lot like NumPy, which is nice. We don't 01:46:49.100 |
have to create a computational graph using variables and placeholders and later on run 01:46:56.120 |
in a session. I'm sure you've seen by now Keras with TensorFlow, you try to print something 01:47:04.540 |
out with some intermediate output, it just prints out like Tensor and tells you how many 01:47:08.540 |
dimensions it has. And that's because all that thing is is a symbolic part of a computational 01:47:14.180 |
graph. PyTorch doesn't work that way. PyTorch is what's called a defined-by-run framework. 01:47:21.780 |
It's basically designed to be so fast to take your code and compile it that you don't have 01:47:31.100 |
to create that graph in advance. Every time you run a piece of code, it puts it on the 01:47:36.580 |
GPU, runs it, sends it back all in one go. So it makes things look very simple. So this 01:47:43.260 |
is a slightly cut-down version of the PyTorch tutorial that PyTorch provides on their website. 01:47:49.620 |
So you can grab that from there. So rather than creating np.array, you create torch.tensor. 01:47:57.780 |
But other than that, it's identical. So here's a random torch.tensor. APIs are all a little 01:48:10.940 |
bit different. Rather than dot shape, it's dot size. But you can see it looks very similar. 01:48:18.500 |
And so unlike in TensorFlow or Theano, we can just say x + y, and there it is. We don't 01:48:25.260 |
have to say z = x + y, f = function, x and y as inputs, z as output, and function dot 01:48:33.660 |
eval. No, you just go x + y, and there it is. So you can see why it's called defined-by-run. 01:48:40.780 |
We just provide the code and it just runs it. Generally speaking, most operations in 01:48:46.440 |
Torch, as well as having this infix version, also have a prefix version, so this is exactly 01:48:53.180 |
the same thing. You can often in fact nearly always add an out equals, and that puts the 01:48:59.900 |
result in this preallocated memory. We've already talked about why it's really important 01:49:04.340 |
to preallocate memory. It's particularly important on GPUs. So if you write your own algorithms 01:49:10.580 |
in PyTorch, you'll need to be very careful of this. Perhaps the best trick is that you 01:49:15.740 |
can stick an underscore on the end of most things, and it causes it to operate in place. This 01:49:20.140 |
is basically y += x. That's what this underscore at the end means. 01:49:26.380 |
So there's some good little tricks. You can do slicing just like numpy. You can turn Torch 01:49:33.300 |
tensors into numpy stuff and vice versa by simply going .numpy(). One thing to be very 01:49:41.820 |
aware of is that A and B are now referring to the same thing. So if I now do an in-place 01:49:52.660 |
A += 1 (add_), it also changes B. Vice versa, you can turn numpy into Torch by calling 01:50:02.860 |
torch.from_numpy. And again, same thing. If you change the numpy, it changes the Torch. All of that 01:50:11.800 |
so far has been running on the CPU. To turn anything into something that runs on the GPU, 01:50:16.900 |
you chuck dot CUDA at the end of it. So this x + y just ran on the GPU. 01:50:25.280 |
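The flavour of all of that, in a few lines (current PyTorch API; older releases spell a few of these slightly differently):

```python
import torch

x = torch.rand(5, 3)              # like np.random.rand(5, 3)
y = torch.rand(5, 3)
print(x.size())                   # .size() rather than .shape
print(x + y)                      # runs immediately; no session, no graph

res = torch.zeros(5, 3)
torch.add(x, y, out=res)          # prefix form, writing into preallocated memory
y.add_(x)                         # trailing underscore = in place, i.e. y += x

a = x.numpy()                     # shares memory with x
b = torch.from_numpy(a)           # and back again; changing one changes the other

if torch.cuda.is_available():
    print(x.cuda() + y.cuda())    # .cuda() puts a tensor on the GPU
```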
So where things get cool is that something like this knows not just how to do that piece 01:50:32.220 |
of arithmetic, but it also knows how to take the gradient of that. To make anything into 01:50:37.700 |
something which calculates gradients, you just take your Torch tensor, wrap it in variable, 01:50:44.900 |
and add this parameter to it. From now on, anything I do to x, it's going to remember 01:50:49.860 |
what I did so that it can take the gradient of it. For example, x + 2: I get threes, just like 01:50:58.140 |
a normal tensor. So a variable and a tensor have the same API except that I can keep doing 01:51:04.740 |
things to it. Square times 3, dot mean. Later on, I can go dot backward and dot grad and 01:51:14.620 |
I can get the gradient. So that's the critical difference between a tensor and a variable. 01:51:21.780 |
They have exactly the same API except variable also has dot backward and that gets you the 01:51:28.740 |
gradient. When I say .grad, the reason that this is d(out)/dx is because I typed out 01:51:35.720 |
.backward(). So out is the thing we take the derivative of, with respect to x. 01:51:44.900 |
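The same example in current PyTorch, where the old Variable wrapper is folded into Tensor and requires_grad plays its role:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)  # remember every operation applied to x
y = x + 2
z = y * y * 3
out = z.mean()

out.backward()                            # compute d(out)/dx
print(x.grad)                             # 4.5 in every position for this example
```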
So this is kind of crazy. You can do things like while loops and get the gradients of 01:51:50.260 |
them. It's this kind of thing pretty tricky to do with TensorFlow or Theano, these kind 01:51:56.580 |
of computation graph approaches. So it gives you a whole lot of flexibility to define things 01:52:03.260 |
in much more natural ways. So you can really write PyTorch just like you're writing regular 01:52:09.660 |
old NumPy stuff. It has plenty of libraries, so if you want to create a neural network, 01:52:19.700 |
here's how you do a CNN. I warned you early on that if you don't know about OO in Python, 01:52:26.340 |
you need to learn it. So here's why. Because in PyTorch, everything's kind of done using 01:52:32.100 |
OO. I really like this. In TensorFlow, they kind of invent their own weird way of programming 01:52:43.740 |
rather than use Python OO. Whereas PyTorch just goes, "Oh, we already have these features 01:52:48.980 |
in the language. Let's just use them." So it's way easier, in my opinion. 01:52:54.980 |
So to create a neural net, you create a new class, you derive from module, and then in 01:53:01.380 |
the constructor, you create all of the things that have weights. So conv1 is now something 01:53:11.980 |
that has some weights. It's a 2D conv. Conv2 is something with some weights. FullyConnected1 01:53:16.580 |
is something with some weights. So there's all of your layers, and then you get to say 01:53:23.820 |
exactly what happens in your forward pass. Because MaxPool2D doesn't have any weights, 01:53:30.900 |
and ReLU doesn't have any weights, there's no need to define them in the initializer. 01:53:36.900 |
You can just call them as functions. But these things have weights, so they need to be kind 01:53:42.500 |
of stateful and persistent. So in my forward pass, you literally just define what are the 01:53:49.380 |
things that happen. .view is the same as reshape. The whole API has different names for everything, 01:54:00.460 |
which is mildly annoying for the first week, but you kind of get used to it: .reshape is called .view. 01:54:04.180 |
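A small sketch in the same spirit as the PyTorch tutorial network (the sizes assume 28x28 single-channel inputs):

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Things with weights live in the constructor...
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 10)

    def forward(self, x):
        # ...while weight-free ops (pooling, ReLU) are just called as functions.
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(x.size(0), -1)   # .view is PyTorch's reshape
        return self.fc1(x)
```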
During the week, if you try to use PyTorch and you're like, "How do you say 01:54:09.380 |
blah in PyTorch?" and you can't find it, feel free to post on the forum. Having said that, 01:54:15.780 |
PyTorch has its own discourse-based forums. And as you can see, it is just as busy and 01:54:23.980 |
friendly as our forums. People are posting on these all the time. So I find it a really 01:54:31.060 |
great, helpful community. So feel free to ask over there or over here. 01:54:45.360 |
You can then put all of that computation onto the GPU by calling .cuda(). You can then take 01:54:54.740 |
some input, put that on the GPU with .cuda(). You can then calculate your derivatives, calculate 01:55:04.140 |
your loss, and then later on you can optimize it. This is just one step of the optimizer, 01:55:16.240 |
so we have to kind of put that in a loop. So there's the basic pieces. At the end here 01:55:21.180 |
there's a complete process, but I think more fun will be to see the process in the Wasserstein GAN. 01:55:27.740 |
So here it is. I've kind of got this TorchUtils thing which you'll find in GitHub which has 01:55:35.100 |
the basic stuff you'll want for Torch all there, so you can just import that. So we set 01:55:45.980 |
up the batch size, the size of each image, the size of our noise vector. And look how 01:55:52.700 |
cool it is. I really like this. This is how you import datasets. It has a datasets module 01:55:59.000 |
already in the TorchVision library. Here's the CIFAR-10 dataset. It will automatically 01:56:06.940 |
download it to this path for you if you say download equals true. And rather than having 01:56:11.660 |
to figure out how to do the preprocessing, you can create a list of transforms. 01:56:20.260 |
So I think this is a really lovely API. The reason that this is so new yet has such a 01:56:25.540 |
nice API is because this comes from a Lua library called Torch that's been around for 01:56:30.220 |
many years, and so these guys are basically started off by copying what they already had 01:56:35.500 |
and what already works well. So I think this is very elegant. So I've got two different 01:56:43.620 |
things you can look at here. They're both in the paper. One is CIFAR-10, which are these 01:56:47.740 |
tiny little images. Another is something we haven't seen before, which is called LSUN, 01:56:53.700 |
which is a really nice dataset. It's a huge dataset with millions of images, 3 million 01:57:03.340 |
bedroom images, for example. We can use either one. This is pretty cool. We can then create 01:57:14.200 |
a data loader, say how many workers to use. We already know what workers are from earlier. 01:57:21.700 |
Now that you know how many workers your CPU likes to use, you can just go ahead and put 01:57:26.420 |
that number in here. Use your CPU to load in this data in parallel in the background. 01:57:34.620 |
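Putting those pieces together looks roughly like this (current torchvision names; older versions call Resize "Scale", and the exact transforms here are just an example):

```python
import torch
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

ds = datasets.CIFAR10('data/', download=True, transform=tfm)   # fetched automatically
dl = torch.utils.data.DataLoader(ds, batch_size=64, shuffle=True, num_workers=4)
```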
We're going to start with CIFAR-10. We've got 47,000 of those images. We'll skip over very 01:57:52.660 |
quickly because it's really straightforward. Here's a conv block that consists of a conv2D, 01:57:59.340 |
a batchnorm2D, and a leakyrelu. In my initializer, I can go ahead and say, "Okay, we'll start 01:58:07.020 |
with a conv block. Optionally have a few extra conv blocks." This is really nice. Here's 01:58:16.260 |
a while loop that says keep adding more down sampling blocks until you've got as many as 01:58:27.980 |
you need. That's a really nice kind of use of a while loop to simplify creating our architecture. 01:58:36.660 |
And then a final conv block at the end to actually create the thing we want. 01:58:42.340 |
And then this is pretty nifty. If you pass in n GPU greater than 1, then it will call 01:58:51.780 |
parallel.data_parallel, passing in those GPU IDs, and it will do automatic multi-GPU training. 01:59:00.140 |
This is by far the easiest multi-GPU training I've ever seen. That's it. That's the forward 01:59:08.540 |
pass behind here. We'll learn more about this over the next couple of weeks. In fact, given 01:59:26.260 |
we're a little short of time, let's discuss that next week and let me know if you don't 01:59:31.820 |
think we cover it. Here's the generator. It looks very, very similar. Again, there's a 01:59:37.740 |
while loop to make sure we've gone through the right number of decom blocks. This is 01:59:45.740 |
actually interesting. This would probably be better off with an up-sampling block followed 01:59:50.020 |
by a one-by-one convolution. Maybe at home you could try this and see if you get better 01:59:54.420 |
results because this has probably got the checkerboard pattern problem. 01:59:58.300 |
This is our generator and our discriminator. It's only 75 lines of code, nice and easy. 02:00:10.020 |
Everything's a little bit different in PyTorch. If we want to say what initializer to use, 02:00:14.360 |
again it's a little bit more decoupled. Maybe at first it's a little more 02:00:21.740 |
complex but there's less things you have to learn. In this case we can call something 02:00:26.300 |
called apply, which takes some function and passes it to everything in our architecture. 02:00:34.060 |
This function is something that says, "Is this a conv2D or a convtranspose2D? If so, 02:00:40.740 |
use this initialization function." Or if it's a batch norm, use this initialization function. 02:00:46.180 |
Everything's a little bit different. There isn't a separate initializer parameter. This 02:00:52.980 |
is, in my opinion, much more flexible. I really like it. 02:01:03.300 |
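A sketch of that pattern (netG and netD are placeholder names for whatever your generator and discriminator are called; the 0.02 standard deviation is the DCGAN paper's choice):

```python
import torch.nn as nn

def weights_init(m):
    # net.apply(weights_init) calls this on every submodule.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, 0.0, 0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, 1.0, 0.02)
        nn.init.constant_(m.bias, 0.0)

netG.apply(weights_init)
netD.apply(weights_init)
```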
As before, we need something that creates some noise. Let's go ahead and create some 02:01:10.580 |
fixed noise. We're going to have an optimizer for the discriminator. We've got an optimizer 02:01:14.580 |
for the generator. Here is something that does one step of the discriminator. We're 02:01:20.060 |
going to call the forward pass, then we call the backward pass, then we return the error. 02:01:26.420 |
Just like before, we've got something called make_trainable. This is how we make something 02:01:32.060 |
trainable or not trainable in PyTorch. Just like before, we have a train loop. The train 02:01:38.540 |
loop has got a little bit more going on, partly because of the Wasserstein GAN, partly because 02:01:45.140 |
of PyTorch. But the basic idea is the same. For each epoch, for each batch, make the discriminator 02:01:58.260 |
trainable, and then this is the number of iterations to train the discriminator for. 02:02:07.340 |
Remember I told you one of the nice things about the Wasserstein GAN is that we don't have 02:02:12.100 |
to do one batch discriminator, one batch generator, one batch discriminator, one batch generator, 02:02:16.100 |
but we can actually train the discriminator properly for a bunch of batches. In the paper, 02:02:22.620 |
they suggest using 5 batches of discriminator training each time through the loop, unless 02:02:35.420 |
you're still in the first 25 iterations. They say if you're in the first 25 iterations, 02:02:42.580 |
do 100 batches. And then they also say from time to time, do 100 batches. So it's kind 02:02:49.580 |
of nice by having the flexibility here to really change things, we can do exactly what 02:02:56.580 |
So basically at first we're going to train the discriminator carefully, and also from 02:03:03.340 |
time to time, train the discriminator very carefully. Otherwise we'll just do 5 batches. 02:03:09.860 |
So this is where we go ahead and train the discriminator. And you'll see here, we clamp 02:03:16.700 |
-- this is the same as clip -- the weights in the discriminator to fall in this range. 02:03:24.540 |
And if you're interested in reading the paper, the paper explains that basically the reason 02:03:28.780 |
for this is that their assumptions are only true in this kind of small area. So that's 02:03:39.060 |
why we have to make sure that the weights stay in this small area. 02:03:43.900 |
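In PyTorch that clipping is a one-liner over the discriminator's parameters (netD is a placeholder name):

```python
# clamp_ is the in-place clamp; keep every weight inside [-0.01, 0.01].
for p in netD.parameters():
    p.data.clamp_(-0.01, 0.01)
```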
So then we go ahead and do a single step with the discriminator. Then we create some noise 02:03:50.460 |
and run it through the generator to get our fake data for the discriminator. Then 02:04:01.580 |
we can subtract the fake from the real to get our error for the discriminator. So there's 02:04:06.380 |
one step with the discriminator. We do that either 5 or 100 times. Make our discriminator 02:04:17.900 |
not trainable, and then do one step of the generator. You can see here, we call the generator 02:04:24.500 |
with some noise, and then pass it into the discriminator to see if we tricked it or not. 02:04:30.140 |
During the week, you can look at these two different versions and you're going to see 02:04:35.740 |
basically the PyTorch and the Keras version of basically the same thing. The only differences 02:04:41.720 |
are two things. One is the presence of this clamping, and the second is that the 02:04:48.940 |
loss function is mean squared error rather than cross-entropy. 02:04:55.620 |
So let's see what happens. Here are some examples from CIFAR-10. They're certainly a lot better 02:05:09.540 |
than our crappy DC GAN MNIST examples, but they're not great. Why are they not great? 02:05:20.740 |
So probably the reason they're not great is because CIFAR-10 has quite a few different 02:05:27.860 |
kinds of categories of different kinds of things. So it doesn't really know what it's 02:05:32.340 |
meant to be drawing a picture of. Sometimes I guess it kind of figures it out. This must 02:05:36.660 |
be a plane, I think. But a lot of the time it hedges and kind of draws a picture of something 02:05:43.580 |
that looks like it might be a reasonable picture, but it's not a picture of anything in particular. 02:05:48.180 |
On the other hand, the LSUN dataset has 3 million bedrooms. So we would hope that when 02:05:57.540 |
we train the Wasserstein GAN on LSUN bedrooms, we might get better results. Here's the real bedrooms. 02:06:11.700 |
Here are our fake bedrooms, and they are pretty freaking awesome. So literally they started 02:06:21.420 |
out as random noise, and every one has been turned into something like that. It's definitely a bedroom. 02:06:29.220 |
They're all definitely bedrooms. And then here is the real bedrooms to compare. 02:06:36.860 |
You can kind of see here that imagine if you took this and stuck it on the end of any kind 02:06:46.860 |
of generator. I think you could really use this to make your generator much more believable. 02:06:58.020 |
Any time you kind of look at it and you say, "Oh, that doesn't look like the real X," maybe 02:07:01.740 |
you could try using a WGAN to try to make it look more like a real X. 02:07:12.700 |
So this paper is so important. Here's the other thing. The loss function for these actually 02:07:27.440 |
makes sense. The discriminator and the generator loss functions actually decrease as they get 02:07:33.300 |
better. So you can actually tell if your thing is training properly. You can't exactly compare 02:07:40.580 |
two different architectures to each other still, but you can certainly see that the training is actually making progress. 02:07:48.100 |
So now that we have, in my opinion, a GAN that actually really works reliably for the 02:07:56.720 |
first time ever, I feel like this changes the whole equation for what generators can 02:08:05.060 |
and can't do. And this has not been applied to anything yet. So you can take any old paper 02:08:14.400 |
that produces 3D outputs or segmentations or depth outputs or colorization or whatever 02:08:22.380 |
and add this. And it would be great to see what happens, because none of that has been 02:08:28.580 |
done before. It's not been done before because we haven't had a good way to train GANs before. 02:08:34.900 |
So this is kind of, I think, something where anybody who's interested in a project, yeah, 02:08:47.220 |
this would be a great project and something that maybe you can do reasonably quickly. 02:08:53.940 |
Another thing you could do as a project is to convert this into Keras. So you can take 02:09:01.020 |
the Keras DCGAN notebook that we've already got, change the loss function, add the weight 02:09:07.300 |
clipping, try training on this LSUN bedroom dataset, and you should get the same results. 02:09:14.800 |
And then you can add this on top of any of your Keras stuff. 02:09:19.460 |
So there's so much you could do this week. I don't feel like I want to give you an assignment 02:09:27.220 |
per se, because there's a thousand assignments you could do. I think as per usual, you should 02:09:33.100 |
go back and look at the papers. The original GAN paper is a fairly easy read. There's a 02:09:41.900 |
section called Theoretical Results, which is kind of like the pointless math bit. Here's 02:09:49.900 |
some theoretical stuff. It's actually interesting to read this now because you go back and you 02:09:54.860 |
look at this stuff where they prove various nice things about their GAN. So they're talking 02:10:01.860 |
about how the generative model perfectly replicates the data generating process. It's interesting 02:10:06.460 |
to go back and look and say, okay, so they've proved these things, but it turned out to 02:10:13.740 |
be totally pointless. It still didn't work. It didn't really work. So it's kind of interesting 02:10:20.460 |
to look back and say, which is not to say this isn't a good paper, it is a good paper, 02:10:25.900 |
but it is interesting to see when is the theoretical stuff useful and when not. Then you look at 02:10:31.900 |
the Wasserstein GAN theoretical sections, and it spends a lot of time talking about 02:10:38.780 |
why their theory actually matters. So they have this really cool example where they say, 02:10:45.340 |
let's create something really simple. What if you want to learn just parallel lines, 02:10:51.140 |
and they show why it is that the old way of doing GANs can't learn parallel lines, and 02:10:58.060 |
then they show how their different objective function can learn parallel lines. So I think 02:11:04.300 |
anybody who's interested in getting into the theory a little bit, it's very interesting 02:11:11.380 |
to look at why the proof of convergence there showed something that didn't really turn out 02:11:20.180 |
to matter, whereas in this paper the theory turned out to be 02:11:25.120 |
super important and basically created something that allowed GANs to work for the first time. 02:11:30.980 |
So there's lots of stuff you can get out of these papers if you're interested. In terms 02:11:37.060 |
of the notation, we might look at some of the notation a little bit more next week. 02:11:45.020 |
But if we look, for example, at the algorithm sections, I think in general the bit I find 02:12:02.580 |
the most useful is the bit where they actually write the pseudocode. Even that, it's useful 02:12:09.060 |
to learn some kind of nomenclature. For each iteration, for each step, what does this mean? 02:12:21.000 |
Noise samples from noise prior. There's a lot of probability nomenclature which you 02:12:28.460 |
can very quickly translate. A prior simply means np.random.something. In this case, we're 02:12:40.380 |
probably looking at np.random.normal. So this just means some random number generator that you use. 02:12:49.500 |
This one here, sample from a data generating distribution, that means randomly picks some 02:12:55.180 |
stuff from your array. So these are the two steps. Generate some random numbers, and then 02:13:01.940 |
randomly select some things from your array. The bit where it talks about the gradient 02:13:10.140 |
you can kind of largely ignore, except the bit in the middle is your loss function. You 02:13:15.580 |
can see here, these things here are your noise, that's your noise. So: noise, generator on the noise, discriminator on that. 02:13:26.220 |
So there's the bit where we're trying to fool the discriminator, and we're trying to make 02:13:30.660 |
that trick it, so that's why we do the 1-minus. And then here's getting the discriminator 02:13:35.620 |
to be accurate, because these x's are the real data. So that's the math version of what we just built. 02:13:45.180 |
The Wasserstein GAN paper also has an algorithm section, so it's kind of interesting to compare 02:13:54.540 |
the two. So here we go with the Wasserstein GAN: here's the algorithm, and basically this says 02:14:01.740 |
exactly the same thing as the last one said, but I actually find this one a bit clearer. 02:14:08.220 |
Sample from the real data, sample from your priors. So hopefully that's enough to get 02:14:15.340 |
going, and I look forward to talking on the forums and seeing how everybody gets along. Thanks everybody.