Lesson 7: Practical Deep Learning for Coders
00:00:00.000 |
This is week 7 of 7, although in a sense it's week 7 of 14. 00:00:09.380 |
No pressure and no commitment, but how many of you are thinking you might want to come back for part 2? 00:00:18.100 |
When we started this, I thought if 1 in 5 people come back for part 2, I'd be happy. 00:00:26.200 |
So that's the best thing I've ever seen, thank you so much. 00:00:31.140 |
In that case, that's perfect because today I'm going to show you, and I think you'll 00:00:39.480 |
be surprised and maybe a little overwhelmed, at what you can do with this little set of tools we've already covered. 00:00:47.600 |
So this is going to be part 1 of this lesson, it's going to be a whirlwind tour of a bunch 00:00:55.720 |
of different architectures, and different architectures are not just different because 00:01:00.840 |
some of them will be better at doing what they're doing, but some of them will be doing quite different things. 00:01:07.880 |
And I want to set your expectations and say that looking at an architecture and understanding 00:01:17.040 |
how it does what it does is something that took me quite a few weeks to just get an intuitive 00:01:22.280 |
feel for it, so don't feel bad, because as you'll see, it's like unprogrammed, it's like 00:01:28.920 |
we're going to describe something we would think would be great if the model knew how 00:01:34.040 |
to do it, and then we'll say fit, and suddenly the model knows how to do it, and we'll look 00:01:38.920 |
at it and try to figure out how it's doing it. 00:01:42.720 |
The other thing I want to mention is, having said that, everything we're about to see uses 00:01:47.920 |
only the things we've done. In fact, in the first half we're only going to use CNNs. There's 00:01:53.360 |
going to be no cropping of images, there's going to be no filtering, there's going to 00:01:58.760 |
be nothing hand-tuned, it's just going to be a bunch of convolutional and dense layers, but we're 00:02:04.920 |
going to put them together in some interesting ways. 00:02:10.720 |
So let me start with one of the most important developments of the last year or two, which 00:02:19.720 |
is called ResNet. ResNet won the 2015 ImageNet competition. I was delighted that it won it 00:02:29.240 |
because it's an incredibly simple and intuitively understandable concept, and it's very simple 00:02:37.440 |
to implement. In fact, what I would like to do is to show you. 00:02:49.200 |
So let me describe as best as I can how ResNet works. In fact, before I describe how it works, 00:02:57.460 |
I will show you why you should care that it works. So let's for now just put aside the 00:03:04.080 |
idea that there's a thing called ResNet. It's another architecture, a lot like VGG, that's 00:03:08.580 |
used for image classification or other CNN type things. It's actually broader than just 00:03:15.840 |
image classification. And we use it just the same way as we use the VGG-16 class you're 00:03:20.080 |
familiar with. We just say create something in ResNet, and again there's different size 00:03:29.080 |
of ResNet. I'm going to use 50 because it's the smallest one and it works super well. 00:03:34.640 |
I've started adding a parameter to my versions of these networks. I've added it to the new 00:03:40.240 |
VGG as well, which is include_top. It's actually the same thing the Keras author does in 00:03:45.960 |
his models. Basically the idea is that if you say include_top = false, you don't have 00:03:52.360 |
to go model.pop afterwards to remove the layers if you want to fine-tune. Include_top = false 00:03:58.280 |
means only include the convolutional layers basically, and I'm going to stick my own final 00:04:07.640 |
classification layers on top of that. So when I do this, it's not going to give me the last 00:04:14.080 |
few layers. Maybe the best way to explain that is to show you when I create this network, 00:04:24.320 |
I've got this thing at the end that says if include_top, and if so then we add the last 00:04:29.760 |
few layers, with this last dense fully connected layer that predicts the thousand ImageNet 00:04:33.860 |
categories. If include_top is false, then we don't add these additional 00:04:38.400 |
layers. So this is just a thing which means you can load in a model which is specifically 00:04:45.700 |
designed for fine-tuning. It's a little shortcut. And as you'll see shortly, it has some really 00:04:50.160 |
helpful properties. We're in the Cats and Dogs competition here. The winner of the Cats and 00:04:58.600 |
Dogs competition had an accuracy of 0.985 on the public leaderboard and 0.989 on the 00:05:04.840 |
private leaderboard. We use this ResNet model in the same way as usual. We grab our batches, 00:05:11.560 |
we can pre-compute some features. And in fact, every single CNN model I'm going to show you, 00:05:17.600 |
we're always going to pre-compute the convolutional features. So everything we see today will 00:05:22.040 |
be things you can do without retraining any of the convolutional layers. So you'll see 00:05:26.720 |
pretty much everything I train will train in a small number of seconds. And that's because 00:05:32.160 |
in my experience when you're working with photos, it's almost never helpful to retrain the convolutional layers. 00:05:40.640 |
So we can stick something on top of our ResNet in the usual way. And we can say go ahead 00:05:50.760 |
and compile and fit it. And in 48 seconds, it's created a model with a 0.986 accuracy, 00:05:59.840 |
which would put it right around the winners on the private leaderboard. 00:06:05.380 |
So that's pretty impressive. I'm going to show you how this works in a moment. ResNet's actually 00:06:12.740 |
designed to not be used with a standard bunch of dense layers, but it's designed to be used 00:06:18.400 |
with something called a global average pooling layer, which I'm about to describe to you. 00:06:23.340 |
So for now, let me just show you what happens if instead of the previous model, I use this 00:06:27.160 |
model, which has 3 layers, and compile and fit it, I get 0.9875 in 3 seconds. In fact, 00:06:40.400 |
I can even tell it that I don't want to use 224x224 images but I want to use 400x400 images. 00:06:50.740 |
And if I do that, and then I get batches, I say I want to create 400x400 images, and 00:06:58.220 |
create those features, compile and fit, I get 99.3. So this is kind of off the charts 00:07:06.040 |
to go from somewhere around 98.5 to 99.3, we're cutting the amount of error roughly in half. 00:07:15.780 |
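A minimal sketch of the workflow just described, written against the modern tf.keras API rather than the lesson's own resnet50.py; the variable names trn_imgs, val_imgs, trn_labels and val_labels are assumed to be preprocessed arrays and one-hot labels.

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Input, GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

# Convolutional-only ResNet: include_top=False drops the final pooling/dense layers,
# and with no dense layers we are free to pick a 400x400 input instead of 224x224.
base = ResNet50(include_top=False, weights='imagenet', input_shape=(400, 400, 3))

# Pre-compute the convolutional features once (trn_imgs / val_imgs are assumed to be
# preprocessed image arrays); only the tiny head below ever gets trained.
trn_feat = base.predict(trn_imgs, batch_size=64)
val_feat = base.predict(val_imgs, batch_size=64)

# A tiny head: global average pooling plus a 2-way softmax for cats vs dogs.
inp = Input(shape=trn_feat.shape[1:])
out = Dense(2, activation='softmax')(GlobalAveragePooling2D()(inp))
head = Model(inp, out)
head.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
head.fit(trn_feat, trn_labels, epochs=3, validation_data=(val_feat, val_labels))
```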
So this is why you should be interested in ResNet. It's incredibly accurate. We're using 00:07:22.260 |
it for the thing it's best at, which was originally, this ResNet was trained on ImageNet and the 00:07:30.540 |
Dogs and Cats competition looks a lot like ImageNet images. They're single pictures of 00:07:35.420 |
a single thing that's kind of reasonably large in the picture, they're not very big images 00:07:40.020 |
on the whole. So this is something which this ResNet approach is particularly good for. 00:07:47.180 |
So I do actually want to show you how it works, because I think it's fascinating and awesome. 00:07:51.920 |
And I'm going to stick to the same approach that we've used so far when we've talked about 00:07:58.300 |
architectures, which is that any shape represents a matrix of activations, and any 00:08:14.100 |
arrow represents a layer operation. So that is a convolution or a dense layer with an activation 00:08:21.300 |
function. ResNet looks a lot like VGG. So I've mentioned that there's some part of the model 00:08:32.340 |
down here that we're not going to worry about too much. We're kind of halfway through the 00:08:37.020 |
model and there's some hidden activation layer that we've got to. With VGG, the approach 00:08:46.020 |
is generally to go, the layers are basically a 3x3 conv, that gives you some activations, 00:08:55.540 |
another 3x3 conv, that gives you some activations, another 3x3 conv, that gives you some activations, 00:09:01.220 |
and then from time to time, it also does a max pooling. So each of these is representing 00:09:09.220 |
a conv layer. ResNet looks a lot like this. In fact, it has exactly that path, which is 00:09:18.820 |
a bunch of convs and ReLUs stacked on top of each other. But it does something else, which is 00:09:25.980 |
this bit that comes out, and remember, when we have two arrows coming into a shape, that means the two results get combined. 00:09:40.100 |
You'll notice here, in fact, there's no shapes anywhere on the way here. In fact, this arrow 00:09:49.940 |
does not represent a conv, it does not represent a dense layer, it actually represents identity. 00:09:56.060 |
In other words, we do nothing at all. And this whole thing here is called a ResNet block. 00:10:08.700 |
And so ResNet, basically if we represented a ResNet block as a square, ResNet is just 00:10:13.460 |
a whole bunch of these blocks basically stacked on top of each other. And then there's an 00:10:19.300 |
input which is the input data, and then the output at the end. 00:10:26.980 |
Another way of looking at this is just to look at the code. I think the code is nice 00:10:32.500 |
and kind of intuitive to understand. Let's take a look at this thing, they call it an 00:10:41.140 |
identity block. So here's the code for what I just described, it's here. You might notice 00:10:49.180 |
that everything I just selected here looks like a totally standard VGG block. I've got 00:10:55.660 |
a conv2d, a batch normalization, and an activation function. I guess it looks like our improved 00:11:02.740 |
VGG because it's got batch normalization. Another conv2d, another batch norm, but then this 00:11:12.580 |
is the magic that makes it ResNet, this single line of code. And it does something incredibly 00:11:17.340 |
simple. It takes the result of all those 3 convolutions and adds it to our original input. 00:11:29.100 |
So normally, we have the output of some block is equal to a convolution of some input to 00:11:50.500 |
that block. But we're doing something different. We're saying the output to a block, so let's 00:12:00.500 |
call this "hidden state at time t + 1" is equal to the convolutions of hidden state 00:12:08.380 |
at time t plus the hidden state at time t. That is the magic which makes it ResNet. 00:12:19.860 |
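A minimal sketch of an identity block in modern tf.keras, just to show where that single line sits. The real ResNet-50 blocks use a 1x1/3x3/1x1 bottleneck and different filter counts, so treat the details here as assumptions; only the add() is the point.

```python
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, add
from tensorflow.keras.models import Model

def identity_block(x, filters):
    shortcut = x                                   # the block's input, kept untouched
    y = Conv2D(filters, (3, 3), padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    y = BatchNormalization()(y)
    # The single line that makes it ResNet: add the block's input back onto the result.
    y = add([y, shortcut])
    return Activation('relu')(y)

# Stacking a few of these blocks gives the "squares stacked on top of each other" picture.
inp = Input(shape=(56, 56, 64))
x = identity_block(inp, 64)
x = identity_block(x, 64)
model = Model(inp, x)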
So why is it that that can give us this huge improvement in the state of the art in such 00:12:28.820 |
a short period of time? And this is actually interestingly something that is somewhat controversial. 00:12:38.020 |
The authors of this paper that originally developed this describe it a number of ways. 00:12:45.060 |
They basically gave 2 main reasons. The first is they claim that you can create much deeper 00:12:50.700 |
networks this way, because when you're backpropagating the weights, backpropagating through an identity 00:12:57.060 |
is easy. You're never going to have an explosion of gradients or an explosion of activations. 00:13:03.940 |
And indeed, this did turn out to be true. The authors created a ResNet with over a thousand 00:13:10.860 |
layers and got very good results. But it also turned out to be a bit of a red herring. A 00:13:18.420 |
few months ago, some other folks created a ResNet which was not at all deep. It had like 00:13:25.500 |
40 or 50 layers, but instead it's very wide and had a lot of activations. And that did 00:13:30.460 |
even better. So it's one of these funny things that seems even the original authors might 00:13:36.020 |
have been wrong about why they built what they built. 00:13:39.140 |
The second reason why they built what they built seems to have stood the test of time, 00:13:44.180 |
which is that if we take this equation and rejig it, let's subtract that from both sides. 00:13:51.180 |
And that gives us h(t+1) - h(t). So the hidden activations at the next time period 00:13:59.620 |
minus the hidden activations at the previous time period equals (and I'm going to replace 00:14:04.380 |
all this with R for ResNet block) a convolution of convolution of convolution applied to the hidden state at time t. 00:14:15.460 |
When you write it like that, it might make you realize something, which is all of the 00:14:26.420 |
weights we're learning are here. So we're learning a bunch of weights which allow us 00:14:31.740 |
to make our previous guess as to the predictions a little bit better. We're basically saying 00:14:40.140 |
let's take the previous predictions we've got, however we got to them, and try and build 00:14:44.980 |
a set of things which makes them a little bit better. 00:14:47.980 |
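Written out compactly, the relationship just described and its rearrangement look like this, where R denotes the convolution/batchnorm stack inside the block and h_t is the block's input:

```latex
h_{t+1} = R(h_t) + h_t
\quad\Longleftrightarrow\quad
h_{t+1} - h_t = R(h_t)
```

All of the learned weights live inside R, so R is being trained to model the difference between the block's input and its output, i.e. the residual.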
In statistics, this thing is called the residual. The residual is the difference between the 00:14:53.420 |
thing you're trying to predict and your current prediction. So what they basically did here was 00:15:01.700 |
design an architecture which without us having to do anything special automatically learns 00:15:07.740 |
how to model the residuals. It learns how to build a bunch of layers which continually improve on the previous layers' predictions. 00:15:14.780 |
For those of you who have more of a machine learning background, you would recognize this 00:15:20.420 |
as essentially being boosting. Boosting refers to the idea of having a bunch of models where 00:15:28.420 |
each model tries to predict the errors of the previous model. If you have a whole chain 00:15:33.240 |
of those, you can then predict the errors on top of the errors, and add them all together, 00:15:38.460 |
and boosting is a way of getting much improved ensembles. 00:15:43.780 |
So this ResNet is not manually doing boosting, it's not manually doing anything. It's just 00:15:50.300 |
this single one extra line of code. It's all in the architecture. 00:15:58.660 |
A question about dimensionality. I would have assumed that by the time we were close to 00:16:04.980 |
output, the dimensions would be so different that element-wise addition wouldn't be possible? 00:16:11.980 |
It's important to note that this input tensor is the input tensor to the block. So you'll 00:16:19.940 |
see there's no max pooling inside here, so the dimensionality remains constant throughout 00:16:27.820 |
all of these lines of code, so we can add them up. And then we can do our strides or 00:16:32.580 |
max pooling, and then we do another identity block. So we're only adding that to the input 00:16:37.780 |
of the block, not the input of the original image. 00:16:40.900 |
And that's what we want. We want to say the input to each block is our best prediction so far. 00:16:50.740 |
Then qualitatively, how does this compare to dropout? 00:16:53.780 |
In some ways, in most ways, it's unrelated to dropout. And indeed you can add dropout 00:17:02.180 |
to ResNet. At the end of a ResNet block, after this merge, you can add dropout. So ResNet 00:17:09.420 |
is not a regularization technique per se. Having said that, it does seem to have excellent 00:17:17.580 |
generalization characteristics, and if memory serves correctly, I just searched this entire 00:17:24.180 |
code base for dropout, and it didn't appear. So the ImageNet ResNet didn't use any dropout, 00:17:29.140 |
they didn't find it as necessary. But this is very problem-dependent. If you have only 00:17:34.900 |
a small amount of data, you may well need dropout. And I'll explain another reason that 00:17:40.700 |
we don't need dropout for this in just a moment. 00:17:43.580 |
In fact, I'll do that right now, which is, remember what I did here at the end was I 00:17:51.180 |
created a model which had a special kind of layer called a global average pooling layer. 00:18:00.140 |
This is the next key thing I teach you about today. It's a really important concept, it's 00:18:04.900 |
going to come up a couple more times during today's class. 00:18:09.820 |
Let's describe what this is. It's actually very simple. Here is the output of the pre-computed 00:18:23.460 |
ResNet features. With the 400x400 images (rather than 224x224), the pre-computed ResNet blocks give us a 13x13 output with 2048 filters. 00:18:49.100 |
One way of thinking about this would be to say, well, each of these 13x13 blocks could 00:18:57.180 |
potentially try to say how catty or how doggy each of those grid cells is. 00:19:03.940 |
And so rather than max pooling, which is take the maximum of that grid, we could do average 00:19:10.980 |
pooling. Across those 13x13 areas, what is the average amount of doggyness in each one, 00:19:20.100 |
what is the average amount of cattyness in each one? 00:19:25.300 |
And that's actually what global average pooling does. What global average pooling does is it's 00:19:29.980 |
identical to an average pooling layer with a 13x13 pool size, because the input to it is 13x13. So in other words, 00:19:49.060 |
whatever the input to a global average pooling layer is, it will take all of the x and all 00:19:55.020 |
of the y coordinates and just take the average for every one of these 2048 filters. 00:20:02.960 |
So let's take a look here. So what this is doing is it's taking an input of 2048 by 13x13 00:20:17.860 |
and it's going to return an output which is just a single vector of 2048. And that vector 00:20:24.460 |
is, on average, how much does this whole image have of each of those 2048 features. 00:20:33.740 |
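A tiny check of what global average pooling does to the shapes being described; the 13x13x2048 grid is the assumed output of the 400x400 ResNet features, and the input is just random numbers for illustration.

```python
import numpy as np
from tensorflow.keras.layers import Input, GlobalAveragePooling2D
from tensorflow.keras.models import Model

inp = Input(shape=(13, 13, 2048))                 # the 13x13x2048 ResNet features
out = GlobalAveragePooling2D()(inp)               # average over the whole 13x13 grid
m = Model(inp, out)

features = np.random.rand(1, 13, 13, 2048).astype('float32')
print(m.predict(features).shape)                  # (1, 2048): one average per feature map
```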
And because ResNet was originally trained with global average pooling 2D, so you can 00:20:39.900 |
see that this is the ResNet code. In fact, it's 7x7. So this was actually written before 00:20:57.740 |
the global average pooling 2D layer existed, so they just did it manually, they just put 00:21:02.660 |
an average pooling 7x7 here. So because ResNet was trained originally with this layer here, 00:21:16.300 |
that means that it was trained such that the last identity block was basically creating 00:21:22.180 |
features that were designed to be average together. And so that means that when we used 00:21:28.720 |
this tiny little architecture, we got the best results, because that was how ResNet was originally trained. 00:21:38.220 |
If you had a wider network without the input fed forward to the output activation, couldn't 00:21:43.500 |
you get the same result? The extra activations in the wider network could pass the input through unchanged. 00:21:50.740 |
Well, you can in theory have convolutional filters that don't do anything, but the point 00:22:03.620 |
is that having to learn that means learning lots and lots of filters just to do nothing. And 00:22:11.260 |
so maybe the best way I can describe this is everything I'm telling you about architectures 00:22:18.180 |
is in some ways irrelevant. You could create nothing but dense layers at every level of 00:22:23.900 |
your model. And dense layers have every input connected to every output, so every architecture 00:22:31.660 |
I'm telling you about is just a simplified version of that, we're just deleting some of those connections. 00:22:39.220 |
But it's really helpful to do that. It's really helpful to help our SGD optimizer by giving 00:22:47.660 |
it, by making it so that the default thing it can do is the thing we want. So yes, in 00:22:54.460 |
theory, a convnet or a plain fully connected net could learn to do the same thing that 00:22:59.540 |
ResNet does. In practice, it would take a lot of parameters for it to do so, and time to 00:23:08.180 |
do so, and so this is why we care about architectures. In practice, having a good architecture makes 00:23:14.980 |
a huge difference. That's a good question, very good question. 00:23:22.180 |
Another question, would it be fair to say that if VGG was trained with average pooling, it wouldn't have needed dropout? 00:23:29.500 |
I'm not sure, so let's talk about that a little bit. One of the reasons, or maybe the main 00:23:38.900 |
reason that ResNet didn't need to drop out is because we're using global average pooling, 00:23:44.940 |
there's a hell of a lot less parameters in this model. Remember, the vast majority of 00:23:51.940 |
the parameters in the model are in the dense layers, because if you've got 'n' inputs and 00:23:56.700 |
'm' outputs, you have 'n' times 'm' connections. 00:24:00.560 |
So in VGG, I can't quite remember exactly, but that first dense layer has something like 100 million 00:24:06.780 |
weights, because it connects every feature of the final convolutional layer, all 7x7 by 00:24:16.060 |
512 of them, to every one of the 4,096 outputs, so it just has a huge number of parameters 00:24:21.020 |
and is very easy to overfit. 00:24:25.540 |
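A quick back-of-the-envelope check of that claim, using the standard VGG16 shapes (7x7x512 convolutional output feeding a 4096-unit dense layer):

```python
# Why dense layers dominate the parameter count, versus global average pooling,
# which adds no weights at all.
conv_outputs = 7 * 7 * 512          # 25,088 flattened convolutional features
dense_units = 4096
print(conv_outputs * dense_units)   # 102,760,448 weights in that single dense layer
print(0)                            # weights added by a global average pooling layer
```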
With global average pooling and indeed not having any dense layers, we have a lot less 00:24:31.460 |
parameters, so it's going to generalize better. It also generalizes better because we're treating 00:24:37.420 |
every one of those 7x7 or 13x13 areas in the same way. We're saying how doggy or catty 00:24:46.540 |
are each of these, we're just averaging them. It turns out that these global average pooling 00:24:53.760 |
layer models do seem to generalize very well, and we're going to be seeing more of that in a moment. 00:24:59.260 |
Why do we use global average pooling instead of max pooling? 00:25:04.620 |
You wouldn't want to max pool over, well, it depends. You can try both. In this case, the 00:25:31.920 |
cat or dog takes up most of the image, so averaging over the whole grid works fine. On the other hand, in the 00:25:45.800 |
fisheries competition, the fish is generally a very small part of each image. So maybe in 00:25:52.780 |
the fisheries competition you should use a global max pooling layer, give it a try and 00:25:58.620 |
tell us how it goes. Because in that case, you actually don't care about all the parts 00:26:02.700 |
of the image, which have nothing to do with fish. So that would be a very interesting thing to try. 00:26:11.900 |
ResNet is very powerful, but it has not been studied much at all for transfer learning. 00:26:26.540 |
This is not to say it won't work well for transfer learning, I just literally haven't 00:26:30.740 |
found a single paper yet where somebody has analyzed its effectiveness for transfer learning. 00:26:37.580 |
And to me, 99.9999% of what you will work on will be transfer learning. Because if you're 00:26:44.700 |
not using transfer learning, it means you're looking at a data set that is so different 00:26:48.180 |
to anything that anybody has looked at before that none of the features in any pre-trained model were 00:26:52.420 |
remotely helpful for you, which is going to be rare. 00:27:02.220 |
Particularly, all of the work I've seen on transfer learning, both in terms of Kaggle winners 00:27:07.180 |
and in terms of papers, uses VGG. And I think one of the reasons for that is, as we talked 00:27:12.620 |
about in lesson 1, the VGG architecture really is designed to create layers of gradually 00:27:23.980 |
increasing semantic complexity. All the work I've seen on visualizing layers tends to use 00:27:30.980 |
VGG or something similar to that as well, like that Matt Zeiler stuff we saw, or those 00:27:35.260 |
Jason Yosinski videos we saw. And so we've seen how the VGG network, those kinds of networks, 00:27:42.540 |
create gradually more complex representations, which is exactly what we want to transfer. 00:27:48.660 |
Because let's just say, how different is this new domain to the previous domain, and then 00:27:55.580 |
we can pick a layer far enough back, we can try a few, where the features seem to work well. 00:28:03.260 |
So for that reason, we're going to go back to looking at VGG for the rest of these architectures. 00:28:09.980 |
And I'm going to look at the fisheries competition. The fisheries competition is actually very 00:28:17.300 |
interesting. The pictures are from a dozen boats, and each one of these boats has a fixed 00:28:26.940 |
camera, and they can do daytime and nighttime shots. And so every picture has the same basic 00:28:35.540 |
shape and structure for each of the 12 boats, because it's a fixed camera. And then somewhere 00:28:40.500 |
in there, most of the time, there's one or more fish. And your job is to say what kind of fish it is. 00:28:51.020 |
And so one of the things that makes this interesting is that this is the kind of somewhat weird, 00:28:58.540 |
kind of complex, different thing to ImageNet, which is exactly the kind of stuff that you're 00:29:04.260 |
going to have to deal with any time you're doing some kind of computer vision problem 00:29:08.180 |
or any kind of CNN problem. It's very likely that the thing you're doing won't be quite 00:29:12.980 |
the same as what other academics have been looking at. So trying to figure out how to 00:29:16.300 |
do a good job of the fisheries competition is a great example. 00:29:23.080 |
So when I started on the fisheries competition, I just did the usual thing, which was to create 00:29:26.580 |
a VGG-16 model, fine-tuned it to have just 8 outputs, because we had to say which of 00:29:32.260 |
8 types of fish do we see in it. And then I, as per usual, pre-computed the convolutional 00:29:41.740 |
layers using the pre-trained VGG network, and then everything after that I just used 00:29:46.760 |
those pre-computed convolutional layers. And as per usual, the first thing I did was to 00:29:51.380 |
stick a few dense layers on top and see how that goes. 00:29:56.740 |
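A minimal sketch of the pre-computed-features workflow used here and throughout, written against the modern tf.keras API rather than the course's vgg16 class; trn_imgs, val_imgs and the one-hot labels are assumed to exist already.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Input, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.models import Model

# Conv-only VGG, pre-trained on ImageNet.
conv_model = VGG16(include_top=False, weights='imagenet', input_shape=(224, 224, 3))

# Run the slow convolutional layers exactly once and cache the result.
conv_trn_feat = conv_model.predict(trn_imgs, batch_size=32)
conv_val_feat = conv_model.predict(val_imgs, batch_size=32)

# A few dense layers trained only on the cached features: each epoch takes seconds,
# because the convolutional layers are never re-run.
inp = Input(shape=conv_trn_feat.shape[1:])
x = Flatten()(inp)
x = Dense(512, activation='relu')(x)
x = BatchNormalization()(x)
x = Dropout(0.5)(x)
out = Dense(8, activation='softmax')(x)            # 8 fish classes
head = Model(inp, out)
head.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
head.fit(conv_trn_feat, trn_labels, epochs=3,
         validation_data=(conv_val_feat, val_labels))
```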
So the nice thing about this is you can see each epoch takes less than a second to run. 00:30:02.340 |
So when people talk about needing lots of data or lots of time, it's not really true 00:30:08.140 |
because for most stuff you do in real life, you're only using pre-computed convolutional 00:30:12.260 |
features. And in our validation set, we get an accuracy of 96.2% and a loss of 00:30:20.420 |
0.18. That's pretty good - it seems to be recognising the fish pretty well. 00:30:28.060 |
But here's the problem. There is all kinds of data leakage going on, and this is one 00:30:35.380 |
of the most important concepts to understand when it comes to building any kind of model 00:30:45.020 |
or any kind of machine learning project: leakage. 00:30:51.780 |
There was a paper, I think it actually won the KDD Best Paper Award a couple of years 00:30:57.420 |
ago from Claudia Perlich and some of her colleagues, which studied data leakage. Data leakage occurs 00:31:05.500 |
when something about the target you're trying to predict is encoded in the things that you're 00:31:12.980 |
predicting with, but that information is either not going to be available or it won't be helpful 00:31:19.140 |
in practice when you're going to use the model. 00:31:22.740 |
For example, in the fisheries competition, different boats fish in different parts of 00:31:29.380 |
the sea. Different parts of the sea have different fish in them, and so in the fisheries competition, 00:31:36.820 |
if you just use something representing which boat did the image come from, you can get 00:31:43.340 |
a pretty good, accurate validation set result. What I mean by that, for example, is here's 00:31:52.060 |
something which is very cheeky. This is a list of the size of each photo, along with 00:32:00.780 |
how many times that appears. You can see it's gone through every photo and opened it using 00:32:08.020 |
PIL, which is the Python imaging library, and grabbed the size. You can see that there's 00:32:13.540 |
basically a small number of sizes that appear. 00:32:16.580 |
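A minimal sketch of that check: open every training photo and count the distinct sizes. The 'train' directory layout is an assumption.

```python
from collections import Counter
from pathlib import Path
from PIL import Image

# Count how many photos there are of each distinct (width, height) size.
sizes = Counter(Image.open(f).size for f in Path('train').rglob('*.jpg'))
print(sizes.most_common())   # a handful of distinct sizes, one per boat camera
```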
It turns out that if you create a simple linear model that says any image of size 1192 x 670, 00:32:27.080 |
what kind of fish is that? Anything with 1280 x 720, what kind of fish is that? You get 00:32:31.700 |
a pretty accurate model because these are the different ships. The different ships have 00:32:37.820 |
different cameras and different cameras have different resolutions. This isn't helpful 00:32:41.860 |
in practice because what the fisheries people actually wanted to do was to use this to find 00:32:46.460 |
out when people are illegally or accidentally overfishing or fishing in the wrong way. So 00:32:54.060 |
if they're bringing up dolphins or something, they wouldn't know about it. 00:32:58.260 |
So any model that says I know what kind of fish this is because I know what the boat 00:33:02.500 |
is is entirely useless. So this is an example of leakage. In this particular paper I mentioned, 00:33:14.260 |
the authors looked at machine learning competitions and discovered that over 50% of them had some 00:33:20.300 |
kind of data leakage. I spoke to Claudia after she presented that paper, and I asked her 00:33:28.500 |
if she thought that regular machine learning projects inside companies would have more 00:33:33.700 |
or less leakage than that, and she said a lot more. In competitions, people have tried 00:33:44.100 |
really hard to clean up the data ahead of time because they know that lots and lots of 00:33:47.140 |
people are going to be looking at it. And if there is leakage, you're almost certain 00:33:50.900 |
that somebody's going to find it because it's a competition. Whereas if you have leakage 00:33:55.260 |
in your data set, it's very likely you won't even know about it until you try to put the 00:34:00.580 |
model into production and discover that it doesn't work as well as you thought it would. 00:34:04.820 |
Oh, and I was just going to add that it might not even help you in the competition if your 00:34:11.460 |
test set is brand new boats that weren't in your training set. 00:34:17.940 |
So let's talk about that. So trying to win a Kaggle competition and trying to do a good 00:34:28.580 |
job is somewhat independent. So when I'm working on Kaggle, I focus on trying to win a Kaggle 00:34:34.740 |
competition. I have a clear metric and I try to optimize the metric. And sometimes that 00:34:39.620 |
means finding leakage and taking advantage of it. 00:34:43.220 |
So in this case, step number 1 for me in the fisheries competition was to say, "Can I take 00:34:48.580 |
advantage of this leakage?" I want to be very clear. This is the exact opposite of what 00:34:53.820 |
you would want to do if you were actually trying to help the fisheries people create 00:34:57.060 |
a good model. Having said that, there's $150,000 at stake and I could donate that to the Fred 00:35:02.660 |
Hollows Foundation and get lots of people their sight back. So winning this would be good. 00:35:08.020 |
So let me show you how I try to take advantage of this leakage, which is totally legal in 00:35:13.660 |
a Kaggle competition, and see what happened. And then I'll talk more about Rachel's issue in a moment. 00:35:22.260 |
So the first thing I did was I made a list for every file of how big it was and what 00:35:27.380 |
the image dimensions were. And I did that for the validation of the training set. I 00:35:33.180 |
normalized them by subtracting the mean and dividing by the standard deviation. And then I created 00:35:38.260 |
an almost exact copy of the previous model I showed you, this one. But this time, rather 00:35:45.820 |
than using the sequential API, I used the functional API. But other than that, this is almost identical. 00:35:52.620 |
The only difference is in this line, what I've done is I've taken not just the input which 00:36:03.700 |
is the output of the last convolutional layer of my VGG model, but I have a second input. 00:36:11.300 |
And the second input is what size image is it. I should mention I have one-hot encoded 00:36:22.420 |
those image sizes, so they're treated as categories. 00:36:27.660 |
So I now have an additional input. One is the output of the VGG convolutional layers. 00:36:33.620 |
One is the one-hot encoded image size. I batch-normalized that, obviously. And then right at the very 00:36:41.660 |
last step, I concatenate the two together. So my model is basically a standard last few 00:36:51.740 |
layers of VGG model, so three dense layers. And then I have my input, and then I have 00:37:04.180 |
another input. The two get concatenated at the end, and that creates an output. 00:37:13.900 |
So what this can do now is that the last dense layer can learn to combine the image features 00:37:22.340 |
along with this metadata. This is useful for all kinds of things other than taking advantage 00:37:29.860 |
in a dastardly way of leakage. For example, if you were doing a collaborative filtering model, 00:37:36.740 |
you might have information about the user, such as their age, their gender, their favorite 00:37:44.300 |
genres that they gave in a survey. This is how you incorporate that kind of metadata into your model. 00:37:55.820 |
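A minimal sketch of that two-input model with the functional API, using modern tf.keras names; the 7x7x512 feature shape, the number of distinct image sizes, and the variable names in the commented fit call are assumptions.

```python
from tensorflow.keras.layers import (Input, Flatten, Dense, BatchNormalization,
                                     Dropout, concatenate)
from tensorflow.keras.models import Model

n_sizes = 10                                       # number of distinct one-hot image sizes (assumed)

img_inp = Input(shape=(7, 7, 512), name='conv_features')   # pre-computed VGG conv output
sz_inp = Input(shape=(n_sizes,), name='image_size')        # one-hot image size (boat proxy)
sz = BatchNormalization()(sz_inp)

x = Flatten()(img_inp)
for _ in range(2):                                 # a couple of standard VGG-style dense blocks
    x = Dense(512, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)

# Right at the last step, concatenate the image features with the metadata so the
# final dense layer can combine both.
x = concatenate([x, sz])
out = Dense(8, activation='softmax')(x)

model = Model([img_inp, sz_inp], out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit([conv_trn_feat, trn_sizes], trn_labels, ...)
```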
So I put the two together and run it. Initially it's looking encouraging. If we go back and 00:38:02.580 |
look at the standard model, we've got 0.84, 0.94, 0.95. This multi-input model is a little 00:38:14.300 |
better, 0.86, 0.95, 0.96. So that's encouraging. But interestingly, the model without using 00:38:25.420 |
the leakage gets somewhere around 96.5, 97.5, maybe 98. It's kind of all over the place, 00:38:34.860 |
which isn't a great sign, but let's say somewhere around 97, 97.5. This multi-input model, on 00:38:42.420 |
the other hand, does not get better than that. Its best is also around 97.5. Why is that? 00:38:52.820 |
This is very common when people try and utilize metadata in deep learning models. It often 00:38:58.420 |
turns out that the main thing you're looking at, in this case the image, already encodes 00:39:04.100 |
everything that your metadata has anyway. In this case, yeah, the size of the image 00:39:08.980 |
tells us what boat it comes from, but you can also just look at the picture and see what boat it 00:39:12.020 |
comes from. So by the later epochs, the convolutional model has already learnt to figure out which 00:39:17.340 |
boat it comes from, so the leakage actually turned out not to be helpful anyway. 00:39:22.740 |
So it's amazing how often people assume they need to find metadata and incorporate it into 00:39:31.920 |
their model, and how often it turns out to be a waste of time. Because the raw, real 00:39:37.820 |
data or the audio or the pictures or the language or whatever turns out to encode all of that 00:39:44.540 |
in it. Finally, I wanted to go back to what Rachel was talking about, which is what would 00:39:54.380 |
have happened if this did work. Let's say that actually this gave us a much better validation 00:40:00.380 |
result than the non-linkage model. If I then submitted it to Kaggle and my leaderboard 00:40:07.740 |
result was great, that would tell me that I have found leakage, that the Kaggle competition 00:40:13.980 |
administrators didn't, and that possibly nobody else in the competition is aware of. Having said that, 00:40:20.140 |
the Kaggle competition administrators first and foremost try to avoid leakage, and indeed 00:40:26.860 |
if you do try and submit this to the leaderboard, you'll find it doesn't do that great. I haven't 00:40:33.260 |
really looked into it yet, but somehow the competition administrators have made 00:40:38.540 |
some attempt to remove the leakage. The kind of ways that we did that when I was at Kaggle 00:40:45.020 |
would be to do things like some kind of stratified sampling where it would say there's way more 00:40:50.300 |
albacore from this ship than that ship. Let's enforce that every ship has to have the same 00:40:56.700 |
number, same kind of fish, or something like that. But honestly, it's a very difficult 00:41:06.620 |
thing to do, and this impacts a lot more than just machine learning competitions. Every one 00:41:11.820 |
of your real-world projects, you're going to have to think long and hard about how can 00:41:17.020 |
you replicate real-world conditions in your test set. Maybe the best example I can come 00:41:23.740 |
up with is when you put your model into production, it will probably be a few months after you 00:41:29.980 |
grabbed the data and trained it. How much has the world changed? Therefore, wouldn't it 00:41:36.780 |
be great if instead you could create a test set that had data from a few months later 00:41:41.820 |
than your training set? And again, you're really trying to replicate the situation that 00:41:48.140 |
you actually have when you put your model into production. 00:41:51.420 |
Two questions. One is just a note that they're releasing another test set later on in the 00:41:57.340 |
fishery competition. Question: did you do two classifications, one for the boats and one for the fish? 00:42:08.900 |
I have two inputs, not two outputs. My input is the one hot encoded size of the image, 00:42:17.980 |
which I assumed is a proxy for the boat ID. Some discussion on the Kaggle forum suggested 00:42:26.220 |
that's a pretty safe assumption. We're going to look at multi-output in a moment, in fact. 00:42:39.660 |
Another question, can you find a good way of isolating the fish in the images and then training on just those? 00:42:46.860 |
Let's do that now, shall we? This is my lunch. All right, multi-output. There's a lot of 00:43:06.420 |
nice things about how Kaggle competitions are structured, and one of the things I really 00:43:09.820 |
like is that in most of them you can create or find your own data sources as long as you 00:43:17.580 |
share them with the community. So one of the people in the fisheries competition has gone 00:43:22.500 |
through and by hand put a little square around every fish, which is called annotating the 00:43:29.500 |
dataset. Specifically, this kind of annotation is called a bounding box. The bounding box 00:43:36.060 |
is a box which contains just your object. Because of the rules of Kaggle, you have to make that 00:43:42.860 |
available to everybody in the Kaggle community, so he provided a link on the Kaggle forum. 00:43:48.580 |
So I'm going to go ahead and download those. There are a bunch of JSON files that basically 00:43:52.460 |
look like this. So for each image, for each fish in that image, it had the height, width, 00:43:58.540 |
and x and y. So the details of the code don't matter too much, but I basically just went 00:44:04.620 |
and found the largest fish in each image and created a list of them. So I've got now my 00:44:14.580 |
training bounding boxes and my validation bounding boxes. For things that didn't have a fish, 00:44:19.260 |
I just had 0, 0, 0, 0. This is my empty bounding box here. 00:44:25.060 |
So as always, when I want to understand new data, the first thing to do is to look at 00:44:28.900 |
it. When we're doing computer vision problems, it's very easy to look at data because it's 00:44:32.420 |
pictures. So I went ahead and created this little show bounding box thing, and I tried 00:44:38.980 |
it on an image, and here is the fish, and here is the bounding box. 00:44:44.660 |
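A minimal sketch of that kind of check: load the hand-annotated boxes and draw the largest one over its image. The file name and the JSON field names ('annotations', 'filename', 'x', 'y', 'width', 'height') follow the description above but are assumptions about the exact format.

```python
import json
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from PIL import Image

annos = json.load(open('alb_labels.json'))         # one list entry per annotated image
anno = annos[0]
# Pick the largest fish in this image, as described above.
box = max(anno['annotations'], key=lambda a: a['width'] * a['height'])

img = Image.open('train/ALB/' + anno['filename'])
plt.imshow(img)
plt.gca().add_patch(Rectangle((box['x'], box['y']), box['width'], box['height'],
                              fill=False, edgecolor='red', lw=3))
plt.show()
```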
There are two questions, although I didn't know if you wanted to get to a good stopping 00:44:50.300 |
point on your thought. One is, adding metadata, is that not useful for both CNNs and RNNs 00:44:58.940 |
or just for CNNs? And the other one is, VGG requires images to all be the same size in training. 00:45:08.300 |
In the fisheries case, are there different sized images being used for training and how 00:45:13.220 |
do you train a model on images with different dimensions? 00:45:16.740 |
Regarding whether metadata is useful for RNNs or CNNs, it's got nothing to do with the architecture. 00:45:28.580 |
It's entirely about the semantics of the data. If your text or audio or whatever unstructured 00:45:34.940 |
data in some way kind of encodes the same information that is in the metadata, the metadata 00:45:41.220 |
is unlikely to be helpful. For example, in the Netflix prize, in the early stages of 00:45:47.260 |
the competition, people found that it was helpful to link to IMDb and bring in information 00:45:53.500 |
about the movies. In later stages, they found it wasn't. The reason why is because in later 00:45:59.500 |
stages they had figured out how to extract that information from the ratings themselves, which basically 00:46:06.580 |
contained implicitly all the same information. 00:46:12.180 |
How do we deal with different sized images? I'm about to show you some tricks, but so 00:46:17.780 |
far throughout this course, we have always resized everything to 224x224. Whenever you 00:46:25.140 |
use get_batches, I default to resizing to 224x224 because that's what ImageNet did, 00:46:31.260 |
with the exception that in my previous ResNet model, I showed you resizing to 400x400 instead. 00:46:39.740 |
So far, and in fact everything we're doing this year, we're going to resize everything to a fixed size. 00:46:49.220 |
So I had a question about the 400x400, is that because there are two different ResNet models? 00:46:57.620 |
Two different ResNet models? No, it's not. I'll show you how that happened in a moment. 00:47:03.020 |
We're going to get to that. It's kind of a little sneak peek at what we're coming to. 00:47:10.700 |
So now that we've got these bounding boxes, here is a complexity, both a practical one 00:47:17.140 |
and a Kaggle one. The Kaggle complexity is the rules say you're not allowed to manually 00:47:21.660 |
annotate the test set, so we can't put bounding boxes on the test set. So if, for example, 00:47:26.780 |
we want to go through and crop out just the fish in every image and just train on them, 00:47:33.340 |
this is not enough to do that because we can't do that on the test set, because we don't have bounding boxes for it. 00:47:38.860 |
The practical meaning of this is in practice, they're trying to create an automatic warning 00:47:44.580 |
system to let them know if somebody is taking the wrong kind of fish, they don't want to 00:47:50.260 |
have somebody drawing a box in every one. So what we're going to do is build a model 00:47:56.620 |
that can find these bounding boxes automatically. And how do we do that? It may surprise you 00:48:02.140 |
to know we use exactly the same techniques that we've always used. Here is the exact 00:48:08.220 |
same model again. This time, as well as having something at the end which has 8 softmax outputs, 00:48:17.900 |
we also have something which has 4 linear outputs, i.e. 4 outputs with no activation 00:48:24.500 |
function. What this is saying, and then what we're going to do is when we train this model, 00:48:31.140 |
we now have 2 outputs, so when we compile it, we're going to say this model has 2 outputs. 00:48:37.700 |
One is the 4 outputs with no activation function, one is the 8 softmax. When I compile it, the 00:48:46.060 |
first of those I want you to optimize for mean squared error, and the second of those 00:48:50.820 |
I want you to optimize for cross entropy loss. And the first of them I want you to multiply 00:48:56.940 |
the loss by 0.001 because the mean squared error of finding the location of an image 00:49:04.700 |
is going to be a much bigger number than the categorical cross entropy, so it's making 00:49:08.620 |
them about the same size. And then when you train it, I want you to use the bounding boxes 00:49:14.500 |
as the labels for the first output and the fish types as the labels for the second output. 00:49:22.260 |
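A minimal sketch of that two-output model in modern tf.keras; the dense layer sizes and variable names are assumptions, but the structure (4 linear outputs with MSE, 8 softmax outputs with cross entropy, box loss scaled by 0.001) follows the description above.

```python
from tensorflow.keras.layers import Input, Flatten, Dense, BatchNormalization, Dropout
from tensorflow.keras.models import Model

inp = Input(shape=(7, 7, 512))                    # pre-computed VGG conv features
x = Flatten()(inp)
for _ in range(2):
    x = Dense(512, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)

bb_out = Dense(4, name='bb')(x)                   # bounding box: 4 linear outputs, MSE loss
cls_out = Dense(8, activation='softmax', name='fish')(x)   # 8-way fish classifier

model = Model(inp, [bb_out, cls_out])
model.compile(optimizer='adam',
              loss=['mse', 'categorical_crossentropy'],
              loss_weights=[0.001, 1.0],          # scale the (much larger) box loss down
              metrics=['accuracy'])
# model.fit(conv_trn_feat, [trn_bboxes, trn_labels],
#           validation_data=(conv_val_feat, [val_bboxes, val_labels]))
```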
And so what this is going to have to do is it's going to have to figure out how to come 00:49:25.260 |
up with a bunch of dense layers which is capable of doing these 2 things simultaneously. So 00:49:32.500 |
in other words, we now have something that looks like this, 2 outputs, 1 input. And notice 00:49:56.820 |
that the 2 outputs, you don't have to do it this way, but in the way I've got it, the 00:50:01.460 |
outputs both come out, both are just their own dense layer. It would be possible to do 00:50:09.980 |
it like this instead. That is to say, each of the 2 outputs could have a dense layer or two 00:50:30.460 |
of their own first. We'll talk about the pros and cons of that. In this case though, 00:50:36.820 |
both of my last layers are going to have to use the same set of features to generate 00:50:42.940 |
both the bounding boxes and the fish classes. So let's give this a go. We'll just go fit as 00:50:50.540 |
usual, but now that we have 2 outputs, we get a lot more information. We get the bounding 00:50:56.020 |
box loss, we get the fishy classification loss, we get the total loss, which is equal 00:51:04.060 |
to 0.001 times the bounding box loss plus the classification loss, because you can see this one is over 1000 times bigger than that, 00:51:09.020 |
so you can see why I've got the 0.001. So that's the two added together, weighted that way. Then we 00:51:15.660 |
get the validation loss, total bounding box loss, and the validation classification loss. 00:51:23.520 |
So here is something pretty interesting. The first thing I want to point out is that after 00:51:27.380 |
I train it a little bit, we actually get a much better accuracy. Now maybe this is counter-intuitive, 00:51:38.020 |
because we're now saying our model has exactly the same capacity as before. Our previous 00:51:43.100 |
dense layer is of size 512. And before, that last layer only had to do one thing, which 00:51:49.060 |
is to tell us what kind of fish it was. Now it has to do 2 things. It has to tell us where 00:51:53.580 |
the fish is and what kind of fish it is. But yet it's still done better. Why is it done 00:52:01.540 |
better? Well the reason it's done better is because by telling it we want you to use those 00:52:06.300 |
features to figure out where the fish is, we've given it a hint about what to look for. 00:52:12.540 |
We've really given it more information about what to work on. So interestingly, even if 00:52:17.860 |
we didn't use the bounding box for anything else, and just threw it away at this point, 00:52:22.300 |
we already have a much better model. And do you notice also the model is much more stable 00:52:26.860 |
- 97.8, 98, 98, 98.2 - whereas before, our accuracy was all over the place. So by having multiple 00:52:34.840 |
outputs, we've created a much more stable, resilient and accurate classification model. 00:52:42.900 |
And we also have bounding boxes. The best way to look at how accurate the bounding boxes are is to draw them. 00:52:53.460 |
So I do a prediction for the first 10 validation examples. You're supposed to use the validation set 00:53:00.500 |
any time you're looking at how good your model is. This time I slightly extended the function 00:53:07.740 |
to show the bounding boxes to now create a yellow box for my prediction and a default 00:53:13.180 |
red box for my actual. So I just want to make it very clear here. We haven't done anything 00:53:23.300 |
clever. We didn't do anything to program this. We just said there is an output which has 00:53:33.540 |
4 values and no activation function. And I want you to use mean squared error to 00:53:41.100 |
find a set of weights such that the actual bounding boxes 00:53:46.860 |
and your predictions are as close as possible. And somehow it has done that. 00:53:55.740 |
So that is to say, very often if you're trying to get a neural net to do something, your 00:54:03.380 |
first step before you create some complex programming heuristic thing is just to ask the neural net to do it. 00:54:13.180 |
Why do both in the same fitting instead of training the boxes first and feeding that into a second model? 00:54:20.900 |
Well, we can, right? But the first thing I want to point out is even then I would still 00:54:26.220 |
have the first stage do both at the same time because the more compatible tasks you can 00:54:32.860 |
give it, so like where is the fish and what kind of fish it is, the more it can create 00:54:38.540 |
an internal representation that is as appropriate as possible. 00:54:43.780 |
Now if you now want to go away over the next couple of weeks and crop out these fish and 00:54:50.580 |
create the second model, I can almost guarantee you will get into the top ten of this competition. 00:55:00.800 |
And the reason I can almost guarantee that is because there was quite a similar competition 00:55:03.740 |
on Kaggle last year, or maybe earlier this year, which was trying to identify particular 00:55:11.140 |
whales and literally saying which individual whale is it, and all of the top three in that 00:55:20.580 |
competition did some kind of bounding box prediction and some kind of cropping and then 00:55:24.620 |
trained a second model on the cropped images. 00:55:28.300 |
Are the four bounding box outputs the vertical and horizontal size of the box and the two coordinates of the centre? 00:55:35.980 |
It's whatever we were given, which was not quite that, it was the height, width, x and y of the box. 00:55:44.100 |
So how many of the people in this Kaggle competition are using this sort of model? And if you came 00:55:56.980 |
up with this with a bit of tinkering, do you think that you would actually stay in the 00:56:05.700 |
top ten or would this just be sort of like an obvious thing that people would tend to 00:56:10.700 |
do, and so your ranking would basically drop over time as everyone else incorporates this? 00:56:17.980 |
So I'm going to show you a few techniques that I used this week, a few techniques I 00:56:31.420 |
used this week, but they're all very basic, they're very normal. We're at a point now 00:56:38.620 |
in this $150,000 competition where over 500 people have entered, and I am currently 20th. 00:56:47.420 |
So no, the stuff that you're learning in this course is not at all well known. There's never 00:56:55.180 |
been an applied deep learning course like this before. So the people who are above me in the competition 00:57:01.420 |
are people who have figured these things out over time and read lots of papers and studied a lot. 00:57:10.420 |
So I definitely think that people in this course, particularly if somebody teamed up 00:57:15.420 |
together would have a very good chance of winning this competition because it's a perfect fit 00:57:21.340 |
for everything we've been talking about, and particularly you can collaborate on the forums 00:57:29.420 |
I should mention, I haven't done any cropping yet. This is just using the whole image, which 00:57:36.140 |
is clearly not the right way to tackle this. I was actually intentionally trying not to 00:57:42.780 |
do too well because I'm going to have to release this to everybody on the Kaggle forum and 00:57:48.860 |
say I've done this and here's a notebook because it's $150,000. I didn't want to say here's 00:57:54.540 |
a way to get in the top 10 because that's not fair to everybody else. 00:57:58.540 |
So I think to answer your question, by the end of the competition, to win one of these 00:58:04.420 |
things, you've got to do everything right at every point. Every time you fail, you have 00:58:10.380 |
to keep trying again. Tenacity is part of winning these things. I know from experience 00:58:14.740 |
the feeling of being on top of the leaderboard and waking up the next day and finding that 00:58:19.140 |
five people have passed you. But the thing is, you then know they have found something 00:58:25.060 |
that is there and you haven't found it. That's part of what makes competing in the Kaggle 00:58:29.780 |
competition so different to doing academic papers or looking at old Kaggle competitions 00:58:35.780 |
that are long gone. It's a really great test of your own processes and your own grit. What 00:58:45.740 |
you'll probably find yourself doing is repeatedly fucking around with hyperparameters and minor 00:58:53.020 |
architectural details because it's just so addictive until eventually you go away and 00:58:58.300 |
go 'okay, what's a totally different way of thinking about this problem?' 00:59:03.340 |
So I hope some of you will consider seriously investing in putting an hour a day into a 00:59:10.700 |
competition because I learned far more doing that than everything else I've ever done in 00:59:16.660 |
machine learning. It's totally different to just playing around. And after it, it's something 00:59:25.660 |
that every real-world project I've done has benefited from greatly. 00:59:32.660 |
To give you a sense of this, here's number 6. I can't even see that fish, but it's done 00:59:46.260 |
a pretty good job. And I think maybe it kind of knows that people tend to float around 00:59:52.220 |
where the fish is or something, because it's pretty hard to see. As you can see, this is 00:59:55.860 |
just a 224x224 image. So this model is doing a pretty great job, and the amount of time it took to train was tiny. 01:00:06.340 |
I've got a section here on data augmentation. Before we look at finding things without manually 01:00:32.180 |
annotating bounding boxes, I'd like to talk more about different size images. 01:00:36.660 |
So let's talk about sizes. Let's specifically talk about in which situations is our model 01:00:46.300 |
going to be sensitive to the size of the input, like a pre-trained model with pre-trained weights. 01:00:56.020 |
And it's all about what are these layer operations exactly? If it's a dense layer, then there's 01:01:05.420 |
a weight going from every input to every output. And so if you have a different sized input, 01:01:14.420 |
then that's not going to work at all, because the weight matrix for your dense layer is 01:01:19.020 |
just simply of the wrong size. Who knows what it should do. What if it's a convolutional 01:01:26.700 |
layer? If it's a convolutional layer, then we have a little set of weights for each 3x3 01:01:37.260 |
block for each different feature, and then that 3x3 block is going to be slid over to 01:01:42.060 |
create the outputs. If the image is bigger, it doesn't change the number of weights. It 01:01:49.820 |
just means that block is going to be slid around more, and the output will be bigger. 01:01:56.300 |
A max pooling layer doesn't have any weights. A batch normalization layer simply cares about 01:02:03.340 |
the number of filters in the previous layer. So really, when you think about it, the only 01:02:08.700 |
layer that really cares about what size your input is is a dense layer. And remember that 01:02:14.820 |
with VGG, nearly all of the layers are convolutional layers. 01:02:20.300 |
So that's why it is that we can not only say include_top = false, 01:02:30.260 |
but we can also choose what size we want. So if you look at my new version 01:02:41.780 |
of the VGG model, I've actually got something here that says if size is not equal to 224 01:02:51.740 |
then don't try to add the fully connected blocks at all, just return that. 01:03:06.400 |
So in other words, if we cut off whatever our architecture is before any dense layers 01:03:12.940 |
happen, then we're going to be able to use it on any size input, to at least create those convolutional activations. 01:03:30.380 |
There's no particular reason it has to be fixed. A dense layer has to be fixed because 01:03:35.380 |
a dense layer has a specific weight matrix. And the input to that weight matrix generally 01:03:41.000 |
is the flattened out version of the previous convolutional layer, and the size of that 01:03:46.660 |
depends on the size of the image. But the convolutional weight matrix simply depends on the filter size and the number of filters. 01:03:58.540 |
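A quick check of that claim: the same conv layer applied to two different image sizes has exactly the same number of weights, and only its output shape changes. The two sizes here are just for illustration.

```python
from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

# A 3x3, 64-filter conv layer has 3*3*3*64 + 64 = 1,792 weights regardless of input size.
for h, w in [(224, 224), (360, 640)]:
    inp = Input(shape=(h, w, 3))
    m = Model(inp, Conv2D(64, (3, 3), padding='same')(inp))
    print((h, w), m.count_params(), m.output_shape)
```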
So let's try it. And specifically we're going to try building something called a fully convolutional 01:04:06.220 |
net, which is going to have no dense layers at all. So the input, as usual, will be the 01:04:12.780 |
output of the last VGG convolutional layer. But this time, when we create our VGG 16 model, 01:04:31.660 |
we're going to tell it we want it to be 640 by 360. 01:04:35.980 |
Now be careful here. When we talk about matrices, we talk about rows by columns. When we talk 01:04:42.880 |
about images, we talk about columns by rows. So a 640 by 360 image is a 360 by 640 matrix. 01:04:52.700 |
I mention this because I screwed it up. But I knew I screwed it up because I always draw 01:04:57.100 |
pictures. So when I drew the picture and saw this little squashed boat, I knew that I'd got the dimensions the wrong way around. 01:05:06.180 |
This is the exact same VGG-16 network we've been using since I added batch norm. So nothing's 01:05:16.540 |
been changed other than this one piece of code I just showed you which says you can 01:05:21.740 |
use different sizes, and if you do, don't add the fully connected layers. 01:05:31.380 |
So now that I've got this VGG model which is expecting a 640 by 360 input, I can then 01:05:40.980 |
add to it my top layers. And this time, my top layers are going to get in an input which 01:05:49.380 |
is of size 22 by 40. So normally, our VGG's final layer is 14 by 14, or if you include 01:05:59.460 |
the final max pooling, it's 7 by 7. In this case, it's 22 by 40, and that's because we've 01:06:07.180 |
told it we're not going to pass it a 224 by 224, we're going to pass it a 640 by 360. 01:06:13.540 |
So this is what happens. We end up with a different output shape. So if we now try to 01:06:18.540 |
pass that to the same dense layer we used before, it wouldn't work, so it would be the 01:06:21.900 |
wrong size. But we're actually going to do something very different anyway, we're not 01:06:26.620 |
going to use any pre-trained fully connected weights. We're instead going to have, in fact, 01:06:34.220 |
no dense layers at all. Instead, we're going to go conv, max pool, conv, max pool, conv, max pool. 01:06:47.100 |
So the best way to look at that is to see what's happening to our shape. So it goes 01:06:53.900 |
in 22 by 40 until the max pooling, 11 by 20 until the max pooling, 5 by 10. And then because 01:07:01.340 |
this is rectangular, the last max pooling I did with a (1, 2) shape, so that gives me a square 5 by 5 grid. 01:07:12.740 |
Then I do a convolutional layer in which I have just 8 filters. And remember, there are 01:07:19.380 |
8 types of fish. There are no other weights after this. And in fact, even the dropout 01:07:25.020 |
is not doing anything because I've set my p value to 0. So ignore that dropout layer. 01:07:31.220 |
So we're going straight from a convolutional layer, which is going to be grid size 5 by 01:07:35.700 |
5, and have 8 filters, and then we're going to average across the 5 by 5, and that's going to give us our 8 predictions. 01:07:47.780 |
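A sketch of that fully convolutional top might look like the following (the filter count of 128 and the variable names are my own choices, not necessarily the notebook's; the input is the 512 x 22 x 40 VGG convolutional output):

```python
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Dropout, Activation
from keras.layers import GlobalAveragePooling2D
from keras.layers.normalization import BatchNormalization

p = 0.0   # dropout switched off, as mentioned above

fcn_top = Sequential([
    BatchNormalization(axis=1, input_shape=(512, 22, 40)),
    Convolution2D(128, 3, 3, activation='relu', border_mode='same'),
    MaxPooling2D(),                              # 22x40 -> 11x20
    Convolution2D(128, 3, 3, activation='relu', border_mode='same'),
    MaxPooling2D(),                              # 11x20 -> 5x10
    Convolution2D(128, 3, 3, activation='relu', border_mode='same'),
    MaxPooling2D((1, 2)),                        # 5x10  -> 5x5
    Convolution2D(8, 3, 3, border_mode='same'),  # one filter per fish class
    Dropout(p),
    GlobalAveragePooling2D(),                    # average each 5x5 map to one number
    Activation('softmax'),
])
```

Compiled with categorical cross-entropy and fitted on the precomputed convolutional features, that is the entire model.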
So if we now say, please train this model, and please try and make these 8 things equal 01:07:53.420 |
to the classes of fish. Now you have to think backwards. How would it do that? If it was 01:07:59.940 |
to do that for us, and it will because it's going to use SGD, what would it have to do? 01:08:05.940 |
Well it has no ability to use any weights to get to this point, so it has to do everything 01:08:13.300 |
by the time it gets to this point. Which means this convolution2D layer is going to have 01:08:18.380 |
to have in each of its 5 grid areas something saying, how fishy is that area? Because that's 01:08:24.260 |
all it can do. After that, all it can do is to average them together. 01:08:28.860 |
So we haven't done anything specifically to calculate it that way, we just created an 01:08:33.900 |
architecture that has to do that. Now my feeling is that ought to work pretty well because 01:08:38.940 |
as we saw in that earlier picture, the fish only appears in one little spot. And indeed 01:08:44.260 |
as we discussed earlier, maybe a global max pooling could be even better. 01:08:50.500 |
So let's try this. We can fit it as per usual, and you can see here even without using bounding 01:08:56.140 |
boxes, we've got a pretty stable and pretty good result in about 30 seconds, 97.6. 01:09:05.420 |
When I then tried this on the Kaggle leaderboard, I got a much better result. In fact to show 01:09:12.100 |
you my submissions, the 20th place was me just averaging together 4 different models, 01:09:25.300 |
4 of the models that I'm showing you today. But this one on its own was 0.986, which would 01:09:42.100 |
be 22nd place. So this model on its own would get us 22nd position. And no data augmentation, 01:09:55.820 |
no pseudo-labeling, we're not using the validation set to help us, which you should do when you make a serious entry. 01:10:05.060 |
So you can get 22nd position with this very simple approach, which is to use a slightly 01:10:10.420 |
larger image and use a fully convolutional network. There's something else cool about 01:10:16.220 |
this fully convolutional network, which can get us into 22nd position. And that is 01:10:21.060 |
that we can actually look at the output of this layer, and remember it's 5x5. 01:10:40.620 |
VGG, as always before, is the input to this model. So first of all, for every 01:10:47.060 |
single model I'm showing you today, I pre-computed the output of the last convolutional layer. 01:11:00.180 |
So I go get data, and I say I want to get a 360, 640 sized data, and so that gives me 01:11:08.100 |
my image, and then I -- this is data augmentation which I'm not doing at the moment, I then 01:11:19.580 |
create my model, pop off the last layer, because I don't want the last max pooling layer, so 01:11:25.780 |
that's the size, and then call predict to get the features from that last layer. 01:11:37.960 |
So it's what we always do, it's just the only difference is that we passed 360, 640 to our 01:11:44.140 |
constructor for the model, and we passed 360, 640 to the get data command. 01:11:53.500 |
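Something along these lines, where Vgg16BN and get_data stand in for the course's helper class and utility function (their exact signatures here are assumptions on my part):

```python
vgg640 = Vgg16BN((360, 640)).model       # conv-only model at the larger size
vgg640.pop()                             # drop the final max pooling layer
# vgg640.output_shape is now (None, 512, 22, 40)

trn = get_data(path + 'train', (360, 640))
val = get_data(path + 'valid', (360, 640))
conv_trn_feat = vgg640.predict(trn, batch_size=32, verbose=1)
conv_val_feat = vgg640.predict(val, batch_size=32, verbose=1)
```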
I'm always skipping that bit, but everything I'm showing you today is taking as input the output of those VGG convolutional layers. 01:12:15.580 |
A couple of reasons why. The first because the authors of the paper which created the 01:12:20.060 |
fully convolutional net found that it worked pretty well. The global average pooling 2D 01:12:27.500 |
layer, as has been discussed, turns out to have excellent generalization characteristics. 01:12:31.100 |
So you'll notice here we have no dropout, and yet we're in 22nd place on the leaderboard 01:12:42.380 |
And then the final reason is the thing I'm about to show you, which is that we basically 01:12:46.700 |
have maintained a sense of kind of x-y coordinates all the way through, which means that we can 01:12:57.100 |
actually now visualize this last layer. And I want to do that before I take the next question. 01:13:05.380 |
So I can say, let's create a function which takes our model's input as input and our fourth 01:13:14.060 |
from last layer as output, that is that convolutional layer that I showed you. 01:13:19.740 |
And then I'm going to take that and I'm going to pass into it the features of my first validation 01:13:28.780 |
image and draw a picture of it for this picture, and here is my picture. And so you can see 01:13:38.900 |
it's done exactly what we thought it would do, which is it's had to figure out that there's 01:13:42.380 |
a fishy bit here. So these fully convolutional networks have a nice side effect, which is 01:13:51.900 |
that they allow us to find whereabouts the interesting parts are. 01:13:56.940 |
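Pulling those activations out can be done with a Keras backend function; here is a sketch, reusing the fcn_top names from the earlier snippet (the layer index of -4 assumes the layer ordering shown there):

```python
from keras import backend as K
import matplotlib.pyplot as plt

inp = fcn_top.layers[0].input
conv_out = fcn_top.layers[-4].output          # the 8-filter convolutional layer
get_maps = K.function([inp, K.learning_phase()], conv_out)

# feed it the precomputed VGG features of one validation image (0 = test mode)
maps = get_maps([conv_val_feat[0:1], 0])      # shape (1, 8, 5, 5)
plt.imshow(maps[0, 0])                        # heat map for class 0
```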
The default parameters for max pooling are 2,2, so it's taking each 2x2 square and replacing 01:14:13.820 |
it with the largest value in that 2x2 square. So this is not the most high-res heat map 01:14:21.300 |
we've ever seen. So the obvious thing to make it all more high-res would be to remove all 01:14:26.460 |
the max pooling layers. So here's exactly the same thing as before, but I've removed them all. 01:14:34.060 |
So that means that my model now remains at 22x40 all the way through, everything else 01:14:41.220 |
is the same. And that indeed does not give quite as accurate a result, we get 95.2 rather 01:14:50.540 |
than 97.6. On the other hand, we do have a much higher resolution grid, so if we now 01:15:01.020 |
do exactly the same thing to create the heat map, and the other thing we're going to do 01:15:06.140 |
is resize the heat map to 360x640, and by default, this resize command will try and 01:15:13.300 |
interpolate. So it's going to replace big pixels with interpolated small pixels. 01:15:20.420 |
And that gives us, for this image, this answer, which is much more interesting. And so now 01:15:28.780 |
we can stick one on top of the other, like so. And this tells us a lot. It tells us that 01:15:38.340 |
on the whole, this is doing a good job of saying the thing that mattered, the fishy 01:15:42.860 |
thing, the albacore thing specifically, because we're asking here for the albacore class. Remember, 01:15:50.140 |
the layer, that layer of the model is 8x22x40, so we have to ask how much like albacore is 01:15:59.820 |
each of those areas, or how much like shark is each of those areas. 01:16:03.460 |
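A sketch of the resize-and-overlay step (scipy's imresize with bilinear interpolation is one way to do it; the 0.5 alpha and the colour map are arbitrary choices). Here heat is one class's 22x40 map from the version of the model without max pooling, and im is the original 360x640 image:

```python
from scipy.misc import imresize
import matplotlib.pyplot as plt

big_heat = imresize(heat, (360, 640), interp='bilinear')  # upsample 22x40 -> 360x640
plt.imshow(im)                                            # the photo underneath
plt.imshow(big_heat, cmap='cool', alpha=0.5)              # heat map on top
```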
So when we called this function, it returned basically a heat map for every type of fish, 01:16:13.620 |
and so we can pass in 0 for albacore, or here's a cool one. Class number 4 is nofish. So one 01:16:23.940 |
of the classes you have to predict in this competition is nofish. So we could say, tell 01:16:28.540 |
us how much each part of this picture looks like the nofish class. What happens is if 01:16:34.860 |
you look at the nofish version, it's basically the exact opposite of this. You get a big 01:16:39.620 |
blue spot here, and pink around it. The other thing I wanted to point out here is these 01:16:48.380 |
areas of pinkishness that are not where the fish is. This is telling me that our model 01:16:54.500 |
is not currently just looking for fish. It's also looking, if we look at this pink here, 01:17:01.100 |
it's looking for particular characteristics of the boat. 01:17:06.860 |
So this is suggesting to me that since it's not all concentrated on the fish, I do think 01:17:12.740 |
that there's some data leakage still coming through. 01:17:17.180 |
I think we know everything about why it's working. We have set up a model where we've 01:17:37.700 |
said we want you to predict each of the 8 fish classes. We have set it up such that 01:17:48.860 |
the last layer simply averages the answers from the previous layer. The previous layer 01:17:55.300 |
we have set up so it has the 8 classes we need. So that's obviously the only way you 01:18:00.500 |
can average and get the right number of classes. We know that SGD is a general optimization 01:18:07.780 |
approach which will find a set of parameters which solves the problem that you give it 01:18:15.740 |
So really, when you think of it that way, unless it failed to train, which it could 01:18:23.500 |
for all kinds of reasons, unless it failed to train, it could only get a decent answer 01:18:31.360 |
if it solved it in this way. If it actually looked at each area and figured out how fishy that area is. 01:18:36.940 |
We're not doing attention models in this part of the course, per se. I would say for now, 01:18:51.940 |
the simple attention model that I would do would be to find the largest area of the heat 01:18:58.980 |
map and crop that, and maybe compare that to the bounding boxes and make sure they look 01:19:05.860 |
about the same and those that don't, you might want to hand fix. And if you hand fix them, 01:19:10.860 |
you have to give that back to the Kaggle community of course because that's hand labeling. 01:19:17.700 |
And honestly, that's the state of the art. In terms of who wins the money in Kaggle, 01:19:25.540 |
that's how the Kaggle winners have won these kinds of competitions is by having a two-stage 01:19:31.220 |
pipeline where first of all they find the thing of interest and then they zoom into it. 01:19:40.220 |
Actually the other thing that you might want to do is to orient the fish so that the tail 01:19:47.420 |
is kind of in the same place and the head is in the same place. Make it as easy as possible 01:19:52.100 |
basically for your convnet to do what it needs to do. 01:20:03.020 |
You guys might have heard of another architecture called Inception. A combination of Inception 01:20:09.580 |
plus ResNet won this year's ImageNet competition. And I want to give you a very quick hint as 01:20:21.300 |
to how it works. I have built the world's tiniest little Inception network here in this screen. 01:20:31.700 |
One of the reasons I want to show it to you is because it actually uses the same technique 01:20:36.060 |
that we heard from Ben Bowles that he used. Do you remember in his language model, Quid, 01:20:42.940 |
Ben used a trick where he had multiple different convolution filter sizes and ran all of them 01:20:49.680 |
and concatenated them together? That's actually what the Inception network does. 01:21:09.880 |
To align the head and tail, the easiest way would be to hand annotate the head and hand 01:21:14.300 |
annotate the tail. That was what was done in the whale competition. 01:21:30.540 |
Hand labeling always has errors, and indeed there are quite a few people in the forum 01:21:35.340 |
who have various bounding boxes that they don't think are correct. It's great to have 01:21:39.780 |
an automatic approach which ought to give about the same answer as the hand approach, 01:21:44.740 |
and you can then compare the two and use the best of both worlds. 01:21:50.460 |
And in general, this idea of combining human intelligence and machine intelligence seems 01:21:55.640 |
to be a great approach, particularly early on. You can do that for the first few bounding 01:22:01.140 |
boxes to improve your bounding box model and then use that to gradually make the model 01:22:11.380 |
have to ask you less and less for your input. 01:22:40.080 |
The heatmap you don't need to. The heatmap was just visualizing one of the layers of 01:22:50.700 |
the network. We didn't use the bounding boxes, we didn't do anything special. It's just a 01:22:56.620 |
side effect of this kind of model. You can visualize the last convolutional layer and see where it's looking. 01:23:14.820 |
There's so many ways of interpreting neural nets, and one of them is to draw pictures 01:23:19.160 |
of the intermediate activations. You can also draw pictures of the intermediate gradients. 01:23:24.520 |
There's all kinds of things you can draw pictures of. 01:23:38.300 |
The Inception network is going to use this trick where we're going to use multiple different 01:23:56.120 |
convolutional filter sizes. Just like in ResNet, there's this idea of a ResNet block which is 01:24:10.520 |
repeated again and again. In the Inception network, there's an Inception block which 01:24:14.620 |
is repeated again and again. I've created a version of one here. I have one thing which 01:24:22.400 |
takes my input and does a 1x1 convolution. I've got one thing that takes the input and 01:24:27.320 |
does a 5x5 convolution. I've got one thing that takes the input and does 2 3x3 convolutions. 01:24:33.560 |
I've got one thing that takes the input and just average pools it. And then we concatenate them all together. 01:24:40.640 |
So what this is doing is each Inception block is basically able to look for things at various 01:24:46.560 |
different scales and create a single feature map at the end which adds all those things together. 01:24:54.600 |
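Here's a toy Inception-style block written with the Keras 1 functional API, just to show the shape of the idea (the filter counts are arbitrary, and the channel axis of 1 assumes Theano dimension ordering; this is not the paper's exact block):

```python
from keras.layers import Convolution2D, AveragePooling2D, merge

def conv(x, nf, sz):
    return Convolution2D(nf, sz, sz, activation='relu', border_mode='same')(x)

def inception_block(x):
    b1 = conv(x, 32, 1)                      # 1x1 convolution
    b5 = conv(x, 32, 5)                      # 5x5 convolution
    b3 = conv(conv(x, 32, 3), 32, 3)         # two stacked 3x3 convolutions
    bp = AveragePooling2D((2, 2), strides=(1, 1), border_mode='same')(x)
    # concatenate along the channel axis rather than adding, so each branch's
    # features stay separate in the output
    return merge([b1, b5, b3, bp], mode='concat', concat_axis=1)
```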
So once I've defined that, I can create a model that just goes Inception block, Inception 01:24:58.760 |
block, Inception block, Comm2D, global average pulling 2D, output. I haven't managed to get 01:25:05.600 |
this to work terribly well yet. I've got the same kind of results. I haven't actually tried 01:25:11.840 |
submitting this to Kaggle. Part of the purpose of this is to give you guys a sense of the 01:25:23.040 |
kinds of things we'll be doing next year. This idea of we've built the basic pieces 01:25:28.920 |
now of convolutions, fully connected layers, activation functions, SGD, and really from 01:25:38.800 |
here, deep learning is putting these pieces together. What are the ways people have learned 01:25:43.400 |
about putting these things together in ways that solve problems as well as possible? 01:25:50.200 |
And so the Inception network is one of these ways. And the other thing I wanted to do was 01:25:54.000 |
to give you plenty of things to think about over the next couple of months and play with. 01:25:58.840 |
So hopefully this notebook is going to be full of things you can experiment with and 01:26:03.800 |
maybe even try submitting some Kaggle results. 01:26:10.240 |
I guess the warnings about the Inception network are a bit similar to the warnings about the 01:26:13.440 |
ResNet network. Like ResNet, the Inception network is available in Keras, actually. I haven't 01:26:20.440 |
converted one to my standard approach, but Keras has an Inception network that you can 01:26:25.960 |
download and use. It hasn't been well-studied in terms of its transfer learning capabilities. 01:26:36.040 |
Again I haven't seen people who have won Kaggle competitions using transfer learning of Inception 01:26:42.160 |
network, so it's just a little bit less well-studied. But like ResNet, the combination of Inception 01:26:50.120 |
plus ResNet is the most recent ImageNet winner. So if you are looking to really start with 01:26:56.760 |
the most predictive model, this is where you would want to start. 01:27:02.600 |
So I want to finish off on a very different note, which is looking at RNNs. I've spent 01:27:16.840 |
much more time on CNNs than RNNs. The reason is that this course is really all about being 01:27:23.200 |
pragmatic. It's about teaching you the stuff that works, and in the vast majority of areas 01:27:29.080 |
where I see people using deep learning to solve their problems, they're using CNNs. 01:27:38.520 |
Having said that, some of the most challenging problems are now being solved with RNNs like 01:27:45.160 |
speech recognition and language translation. So when you use Google Translate now, you're 01:27:51.000 |
using RNNs. My suspicion is you're going to come across these kinds of problems a lot 01:27:58.120 |
less often, but I also suspect that in a business context, a very common kind of problem is a 01:28:06.400 |
time series problem, like looking at the time series of click events on your website or 01:28:13.080 |
e-commerce transactions or logistics or whatever. 01:28:20.000 |
These sequence-to-sequence RNNs we've been looking at, which we've been using to create 01:28:25.240 |
Nietzschean philosophy, are identical to the ones you would use to analyze a sequence of 01:28:31.680 |
e-commerce transactions and try to find anomalies. 01:28:35.400 |
So I think CNNs are more practically important for most people in most organizations right 01:28:43.320 |
now, but RNNs also have a lot of opportunities, and of course we'll also be looking at them 01:28:50.880 |
when it comes to attentional models next year, which is figuring out, in a really big image, which part to focus on. 01:28:58.200 |
Question - Does Inception have the merge characteristic? 01:29:05.040 |
The Inception merge is a concat rather than an add, which is the same as what we saw when 01:29:09.720 |
we looked at Ben Bowles' quid NLP model. We're taking multiple convolution filter sizes and 01:29:20.160 |
we're sticking them next to each other. So that feature basically contains information 01:29:27.240 |
about 5x5 features and 3x3 features and 1x1 features. 01:29:32.920 |
And so when you add them together, you lose that information. ResNet does that for a very 01:29:37.800 |
specific reason, which is we want it to calculate the residuals. In Inception, we don't 01:29:43.600 |
want that. In Inception, we want to keep them all in the feature space. 01:29:55.280 |
The other reason I wanted to look at RNNs is that last week we looked at building an 01:30:02.600 |
RNN nearly from scratch in Theano. And I say nearly from scratch because there was one 01:30:09.460 |
key step which it did for us, which was the gradients. Really understanding how the gradients 01:30:19.040 |
are calculated is not something you would probably ever have to do by hand, but I think 01:30:25.600 |
it can be very helpful to your intuition of training neural networks to be able to trace 01:30:33.280 |
And so for that reason, this is kind of the one time in this course over this year and 01:30:37.640 |
next year's course where we're going to really go through and actually calculate the gradients 01:30:42.840 |
ourselves. So here is a recurrent neural network in pure Python. And the reason I'm doing a 01:30:49.720 |
recurrent neural network in pure Python is this is kind of the hardest. RNNs are the 01:30:54.200 |
hardest thing to get your head around backpropagating gradients. So if you look at this and study 01:31:00.720 |
this and step through this over the next couple of months, you will really be able to get 01:31:04.720 |
a great understanding of what a neural net is really doing. There's going to be no magic 01:31:09.160 |
or mystery because this whole thing is going to be every line of code, something that you 01:31:15.560 |
So if we're going to do it all ourselves, we have to write everything ourselves. So 01:31:21.560 |
if we want a sigmoid function, we have to write the sigmoid function. Any time we write 01:31:25.680 |
any function, we also have to create this derivative. So I'm going to use this approach 01:31:31.520 |
where _d is the derivative function. So I'm going to have relu and the derivative of relu. 01:31:40.280 |
And I'll just kind of check myself as I go along that they look reasonable. The Euclidean 01:31:45.040 |
distance and the derivative of the Euclidean distance. 01:31:50.520 |
The cross entropy and the derivative of the cross entropy. And note here that I am clipping 01:31:56.560 |
my predictions because if you have zeros or ones there, you're going to get infinities 01:32:04.800 |
So you have to be careful of this. This did actually happen. I didn't have this clipping 01:32:08.800 |
at first and I was starting to get infinities, and this is necessary. My softmax and its derivative. 01:32:18.560 |
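For reference, the paired function-and-derivative style described here looks roughly like this in numpy (standard formulas, using the _d suffix convention; not necessarily the notebook's exact code):

```python
import numpy as np

def relu(x):    return np.maximum(0., x)
def relu_d(x):  return (x > 0.).astype(np.float64)      # derivative of relu

def crossentropy(pred, actual):
    pred = np.clip(pred, 1e-7, 1 - 1e-7)     # avoid log(0) giving infinities
    return -np.sum(actual * np.log(pred))

def crossentropy_d(pred, actual):
    pred = np.clip(pred, 1e-7, 1 - 1e-7)
    return -actual / pred

def softmax(x):
    e = np.exp(x - x.max())                  # shift for numerical stability
    return e / e.sum()

def softmax_d(x):
    s = softmax(x)
    return s * (1 - s)                       # diagonal of the softmax Jacobian
```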
So then I basically go through and I double check that the answers I get with my versions 01:32:22.200 |
are the same as the answers I get with the theana versions to make sure they're all correct 01:32:32.160 |
So I am going to use as my activation function relu, which means the derivative is relu derivative 01:32:37.800 |
and my loss function is cross entropy, so its derivative is the cross entropy derivative. I also have to write my own scan. So you guys 01:32:46.120 |
remember scan. Scan is this thing where we go through a sequence one step at a time, 01:32:51.560 |
calling a function on each element of the sequence. And each time the function is going to get 01:32:56.040 |
two things, it's going to get the next element of the sequence as well as the previous result 01:33:02.640 |
So for example, scan of add two things together on the integers from 0 to 5 is going to give 01:33:12.560 |
us the cumulative sum. And remember the reason we do this is because GPUs don't know how to 01:33:19.560 |
do loops, so our Theano version used a scan. And I wanted to make this as close to the Theano version as possible. 01:33:26.320 |
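A pure-Python scan along those lines is just a loop that threads the previous result through (a sketch, nothing like Theano's actual implementation):

```python
def scan(fn, start, seq):
    res, prev = [], start
    for x in seq:
        prev = fn(prev, x)        # each step sees the previous result and the next element
        res.append(prev)
    return res

scan(lambda prev, x: prev + x, 0, range(5))   # -> [0, 1, 3, 6, 10], the cumulative sum
```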
In theano, scan is not implemented like this with a for loop. In theano, they use a very 01:33:31.840 |
clever approach which basically creates a tree where it does a whole lot of the things 01:33:36.320 |
kind of simultaneously and gradually combines them together. Next year we may even look at how that works. 01:33:47.040 |
So in order to create our Nietzschean philosophy, we need an input and an output. So we have 01:33:53.560 |
the eight character sequences, one hot encoded for our inputs, and the eight character sequences 01:34:01.600 |
moved across by one, one hot encoded for our outputs. And we've got our vocab size, which 01:34:08.040 |
is 86 characters. So here's our input and output shapes, 75,000 phrases, each one has 01:34:16.880 |
eight characters in, and each of those eight characters is a one-hot encoded vector of length 86. 01:34:24.440 |
So we first of all need to do the forward pass. So the forward pass is to scan through 01:34:34.400 |
all of the characters in the nth phrase, the input and output, calling some function. And 01:34:42.160 |
so here is the forward pass. And this is basically identical to what we saw in Theano. 01:34:50.000 |
So to create the hidden state, we have to take the dot product of x with its weight 01:34:55.120 |
matrix and the dot product of the hidden with its weight matrix, and then we have to put 01:35:00.760 |
all that through the activation function. And then to create the predictions, we have to 01:35:06.360 |
take the dot product of the hidden with its weight matrix and then put that through softmax. 01:35:15.240 |
And so we have to make sure we keep track of all of the state that it needs, so at the 01:35:18.880 |
end we will return the loss, the pre-hidden and pre-pred, because we're going to use them 01:35:31.120 |
each time we go through. In the back prop, we'll be using those. We need to know the 01:35:36.800 |
hidden state, of course, we have to keep track of that because we're going to be using it 01:35:40.720 |
the next time through the RNN. And of course, we're going to need our actual predictions. 01:35:46.960 |
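A sketch of that forward pass, reusing the scan and the activation helpers from the earlier snippets (w_x, w_h, w_y, n_hidden, inp and outp are assumed to exist already; the names are mine, not necessarily the notebook's):

```python
def one_char(prev, item):
    total_loss, pre_hidden, pre_pred, hidden, ypred = prev
    x, y = item
    pre_hidden = np.dot(x, w_x) + np.dot(hidden, w_h)
    hidden = relu(pre_hidden)                 # the activation function
    pre_pred = np.dot(hidden, w_y)
    ypred = softmax(pre_pred)                 # prediction for the next character
    return (total_loss + crossentropy(ypred, y),
            pre_hidden, pre_pred, hidden, ypred)

def one_fwd(n):
    # initial state: zero loss and a zeroed-out hidden state
    init = (0., 0., 0., np.zeros(n_hidden), 0.)
    return scan(one_char, init, zip(inp[n], outp[n]))
```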
So that's the forward pass, very similar to the other one. The backward pass is the bit 01:35:52.560 |
I wanted to show you, and I want to show you how I think about it. 01:36:04.360 |
This is how I think about it. All of my arrows, I've reversed their direction. And the reason 01:36:10.480 |
for that is that when we create a derivative, we're really saying how does the input change, 01:36:16.720 |
how does a change in the input impact the output? And to do that, we have to use the 01:36:21.560 |
chain rule, we have to go back from the end all the way back to the start. 01:36:25.920 |
So this is our output last hidden layer activation matrix. This is our loss, which is adding together 01:36:37.360 |
all of the losses of each of the characters. If we want the derivative of the loss with 01:36:43.280 |
respect to this hidden activation, we would have to take the derivative of the loss with 01:36:47.560 |
respect to this output activation and multiply it by the derivative of this output activation 01:36:53.520 |
with respect to this hidden activation. We have to then multiply them together because 01:36:58.280 |
that's the chain rule. The chain rule basically tells you to go from some function of some 01:37:06.960 |
other function of x, the derivative is the product of the derivatives of those functions. 01:37:17.640 |
So I find it really helpful to literally draw the arrows. So let's draw the arrow from the 01:37:24.160 |
loss function to each of the outputs as well. And so to calculate the derivatives, we basically 01:37:32.280 |
have to go through and undo each of those steps. In order to figure out how that input 01:37:39.560 |
would change that output, we have to basically undo it. We have to go back along the arrow 01:37:46.680 |
So how do we get from the loss to the output? So to do that, we need the derivative of the 01:37:55.720 |
loss function. If we're going to go back to the activation function, we're going to need 01:38:02.880 |
the derivative of the activation function as well. So you can see it here. This is a 01:38:08.280 |
single backward pass. We grab one of our inputs, one of our outputs, and then we go backwards 01:38:18.560 |
through each one, each of the 8 characters from the end to the start. So grab our input 01:38:23.840 |
character and our output character, and the first thing you want is the derivative of 01:38:29.960 |
pre-pred. Remember pre-pred was the prediction prior to putting it through the softmax. So 01:38:38.800 |
that was the bit I just showed you. It's the derivative of the softmax times the derivative 01:38:47.320 |
So the derivative of the loss is going to get us from here back to here, and then derivative 01:38:53.600 |
of the softmax gets us from here back to the other side of the activation function. That 01:38:59.520 |
basically gets us to here. So that's what that gets us to. So we want to keep going 01:39:09.920 |
further, which is we want to get back to the other side of the hidden. We want to get all the way back through that weight matrix. 01:39:25.320 |
For those of you that haven't done vector calculus, which I'm sure is many of you, just 01:39:30.360 |
take my word for it. The derivative of a matrix multiplication is the multiplication with 01:39:37.480 |
the transpose of that matrix. So in order to take the derivative of the pre-hidden times 01:39:45.920 |
its weights, we simply multiply it by the transpose of its weights. So this is the derivative with respect to the hidden state. 01:39:55.880 |
And remember the hidden, we've actually got 2 arrows coming back out of it, and also we've 01:40:02.160 |
got 2 arrows coming into it. So we're going to have to add together that derivative and this one. 01:40:10.860 |
So here is the second part. So there it is with respect to the outputs, and there it 01:40:16.920 |
is with respect to the hidden. And then finally, we have to undo the activation function. So 01:40:24.080 |
multiply it by the derivative of the activation function. So that's the chain rule that gets 01:40:34.480 |
So now that we've got those two pieces of information, we can update our weights. So 01:40:41.160 |
we can now say for the blue line, what are these weights now going to equal? So we basically 01:40:48.800 |
have to take the derivative that we got to at this point, which we called dprered. We 01:40:56.080 |
have to multiply by our learning rate, which we're calling alpha. And then we have to undo 01:41:02.960 |
the multiplication by the hidden state to get the derivative with respect to the weights. 01:41:08.360 |
And I created this little columnify function to do that. So it's turning a vector into 01:41:14.200 |
a column, so essentially taking its transpose if you like. 01:41:17.720 |
So that gives me my new output weights. My new hidden weights are basically the same 01:41:23.040 |
thing. It's the learning rate times the derivative that we just calculated, and then we have 01:41:29.240 |
to undo its weights. And our new input weights, again, it's the learning rate times the pre-hidden derivative, undoing the multiplication by the input. 01:41:40.080 |
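Put together, a single backward step looks roughly like this. It is my own simplified version rather than the notebook's exact code: the softmax and cross-entropy derivatives are collapsed into the standard ypred - y shortcut, and each weight update uses an outer product against that layer's input, which is the "undo the multiplication with the transpose" idea above:

```python
def one_bkwd(fwd_states, x_seq, y_seq, alpha=0.01):
    global w_x, w_h, w_y
    d_pre_hidden = np.zeros(n_hidden)
    for p in reversed(range(len(x_seq))):     # from the last character back to the first
        loss, pre_hidden, pre_pred, hidden, ypred = fwd_states[p]
        prev_hidden = fwd_states[p - 1][3] if p > 0 else np.zeros(n_hidden)
        # gradient at pre_pred: softmax plus cross-entropy collapse to (ypred - y)
        d_pre_pred = ypred - y_seq[p]
        # two arrows come back into the hidden state: one from this step's output,
        # one from the next time step; then back through the relu
        d_pre_hidden = (np.dot(d_pre_hidden, w_h.T)
                        + np.dot(d_pre_pred, w_y.T)) * relu_d(pre_hidden)
        # SGD updates: undo each matrix multiply with an outer product against its input
        w_y -= alpha * np.outer(hidden, d_pre_pred)
        w_h -= alpha * np.outer(prev_hidden, d_pre_hidden)
        w_x -= alpha * np.outer(x_seq[p], d_pre_hidden)
```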
So I'll go through that very quickly. The details aren't important, but if you're interested 01:41:45.720 |
it might be fun to look at it over the Christmas break or the next few days. Because you can 01:41:52.120 |
see in this here is all of the steps necessary to do that through an RNN, which is also why 01:42:01.960 |
we would never want to do this by hand again. 01:42:04.600 |
So when I wrote this code, luckily I did it before I got my cold. You can see I've written 01:42:13.040 |
after every one the dimensions of each matrix and vector because it just makes your head 01:42:17.200 |
hurt. So thank God, Theano does this for us. But I think it's useful to see it. 01:42:27.640 |
So finally, I now just have to create my initial weight matrices, which are normally distributed 01:42:34.360 |
matrices where, for the normal distribution, I'm going to use the square root of 2 divided 01:42:39.120 |
by the number of inputs because that's that Glorot initialization thing, ditto for my y matrix, and remember 01:42:45.000 |
for my hidden matrix for a simple RNN, we will use the identity matrix to initialize it. 01:42:57.880 |
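In numpy that initialization is a couple of lines (the 256 hidden units here is an assumption; the vocab size of 86 is from above):

```python
vocab_size, n_hidden = 86, 256
w_x = np.random.normal(scale=np.sqrt(2. / vocab_size), size=(vocab_size, n_hidden))
w_y = np.random.normal(scale=np.sqrt(2. / n_hidden),   size=(n_hidden, vocab_size))
w_h = np.eye(n_hidden)        # identity initialization for the hidden-to-hidden matrix
```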
We haven't got to that bit yet, so it depends how we use this. At this stage all we've done 01:43:02.600 |
is we've defined the matrices and we've defined the transitions. And whether we maintain state 01:43:09.000 |
will depend entirely on what we do next, which is the loop. So here is our loop. In our loop 01:43:16.440 |
we're going to go through a bunch of examples, we should really go through all of them, but 01:43:19.480 |
I was too lazy to wait. Run one forward step, and then one backward step, and then from 01:43:30.120 |
time to time print out how we're getting along. 01:43:34.960 |
So in this case, the forward step is passing to scan an initial state which is a whole bunch 01:43:46.400 |
of zeros. So currently this is resetting the state, it's not doing it statefully. If you 01:43:55.360 |
wanted to do it statefully, it would be pretty easy to change. You would have to have the 01:43:59.800 |
final state returned by this and keep track of it and then feed it back the next time 01:44:06.040 |
If you're interested, maybe you could try that. Having said that, you probably won't 01:44:09.680 |
get great results because remember that when you do things statefully, you're much more 01:44:14.240 |
likely to have gradients and activations explode unless you do a GRU or an LSTM. So my guess 01:44:22.320 |
is it probably won't work very well. So that was a very quick fly-through and really more 01:44:33.320 |
showing you around the code so that if you're interested, you can check it out. 01:44:39.880 |
What I really wanted to do was get onto this more interesting type of RNN, which is actually 01:44:47.640 |
two interesting types of RNN called Long Short-Term Memory and Gated Recurrent Unit. Many of you 01:44:57.120 |
will have heard of the one on the left, LSTM. 01:45:11.120 |
For stateful RNNs, you can't exactly have mini-batches because you're doing one at a 01:45:18.480 |
time. In our case, we were going through it in order. Using mini-batches is a great way 01:45:29.520 |
to parallelize things on the GPU and make things run faster, but we have to be careful about how the state is carried across them. 01:45:41.400 |
So LSTMs a lot of you will have heard about because they've been pretty popular over the 01:45:44.840 |
last couple of years for all kinds of cool stuff that Google does. On the right, however, 01:45:51.000 |
is the GRU, which is simpler and better than the LSTM. So I'm not going to talk about the 01:45:59.720 |
LSTM, I'm going to talk about the GRU. They're both techniques for building your recurrent 01:46:06.640 |
neural network where your gradients are much less likely to explode. 01:46:12.360 |
They're another great interesting example of a clever architecture, but it's just going 01:46:19.440 |
to be more of using the same ideas that we've seen again and again. 01:46:25.600 |
What we have here on the right-hand side is this box. It's basically zooming into what's 01:46:37.320 |
going on inside one of these circles in a GRU. So normally in our standard RNN, what's 01:46:43.560 |
going on in here is pretty simple, which is we do a multiplication by this WH weight matrix 01:46:49.080 |
and stick it through an activation function, and we grab our input, do a multiplication 01:46:54.520 |
by its weight matrix, put that through its activation function, and we add the two together. 01:47:02.240 |
A GRU, though, is going to do something more complex. We still have the input coming 01:47:06.720 |
in and the output going out, so that's what these arrows are. They're representing our 01:47:11.360 |
new input character and our prediction. But what's going on in the middle is more complex. 01:47:18.360 |
We still have our hidden state, just like before. But in a normal RNN, the hidden state 01:47:27.720 |
each time simply updates itself. It just goes through a weight matrix and an activation 01:47:39.560 |
But in this case, you can see that the loop looks like it's going back to connect directly 01:47:43.640 |
with itself, but then there's this gate here. So it's actually not just a self-loop, there's 01:47:52.520 |
something more complicated. So in order to understand what's going on, we're going to 01:47:55.600 |
have to follow across to the right hand side. 01:47:58.880 |
So on the right hand side, you can see that the hidden state is going to go through another 01:48:03.440 |
gate. So what's a gate? A gate is simply a little mini-neural network which is going 01:48:15.000 |
to output a bunch of numbers between 0 and 1, which we're going to multiply by its input. 01:48:22.080 |
In this particular one, the R stands for reset. And so the numbers between 0 and 1, if they 01:48:29.880 |
were all 0, then the thing coming out of the reset gate would be just a big bunch of 0's. 01:48:35.560 |
In other words, it would allow this network to forget the hidden state. Or it could be 01:48:42.960 |
a big bunch of 1's which would allow the network to remember all of the hidden state. Do we 01:48:51.400 |
want it to remember or forget? We don't know, which is why we implement this gate using 01:48:57.200 |
a little neural network. And this little neural network is going to have two inputs, which 01:49:02.840 |
is the input to the gate, the input to the GRU unit, and the current hidden state. 01:49:11.240 |
And so it's going to learn a set of weights that it's going to use to decide when to forget. 01:49:18.400 |
So it's now got the ability to forget what it knows. And that's what the reset gate does. 01:49:25.200 |
So assuming that the reset gate has at least some non-zero entries, which it most surely 01:49:30.160 |
will most of the time, then whatever comes through we're going to call h_tilde, or in 01:49:35.440 |
my code I call it h_new. So this is the new value of the hidden state after being reset. 01:49:47.220 |
And so then finally, that goes up to this top bit here. The original hidden state goes 01:49:53.200 |
up to this top bit here. And then there's a gate which decides how much of each one should 01:50:00.020 |
we have. So this is an update gate. This update gate is going to decide if it's 1, we'll take 01:50:12.160 |
more from this side. If it's 0, we'll take more from this side. And again, that's implemented 01:50:17.360 |
as a little neural network. I think the easiest way to understand this is probably to look 01:50:23.380 |
at the code. So I have implemented this in Theano. You can use a GRU in Keras by simply 01:50:30.200 |
replacing the words SimpleRNN with GRU. So you don't really need to know this to use 01:50:36.040 |
it, and you get pretty good results. But here's what it looks like when implemented. We don't 01:50:47.560 |
just have a hidden input weight matrix and an output weight matrix anymore, we also have 01:50:54.240 |
a hidden input weight matrix for our little reset gate, and for our update gate. 01:51:04.400 |
So here is the definition of a gate. A gate is something which takes its inputs, its hidden 01:51:10.080 |
state, its hidden state weights, its input weights, and its biases. It does a dot product 01:51:17.940 |
of the x with w_x, a dot product of h with w_h, and adds the biases and sticks it 01:51:23.180 |
through a sigmoid activation. So that's what I meant by a mini neural net. It's hardly a neural net, 01:51:28.880 |
it's just got one layer. So that's the definition of the reset gate and the update gate. 01:51:37.920 |
And so in our step function, this is the thing that runs each time on the scan, it looks 01:51:45.320 |
exactly the same as what we looked at last week. The output equals the hidden state times 01:51:51.600 |
the hidden weight matrix plus the hidden biases. The new hidden state equals our inputs times 01:52:01.520 |
its weights and the hidden state times its weights plus the biases, but this time the 01:52:07.440 |
hidden weights are multiplied by the reset gate. And the reset gate is just a little 01:52:15.320 |
neural net. So now that we have h new, our actual new hidden state is equal to that times 01:52:26.960 |
1 minus the update gate plus our previous hidden state times the update gate. So you 01:52:32.560 |
can see that update plus 1 minus update will add to 1. So you can see why it's been drawn 01:52:41.560 |
like so, which is that this can really be anywhere at either end or somewhere in between. 01:52:49.040 |
So the update gate decides how much of h new is going to go into the new hidden state. 01:52:59.560 |
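A sketch of the gate and the GRU step in Theano, along the lines just described (the weight names are mine, and I've used tanh for the candidate hidden state, which is the standard choice; the notebook's code may differ in detail):

```python
import theano.tensor as T

def gate(x, h, w_x, w_h, b):
    # the "mini neural net": one dot product each for the input and the hidden
    # state, add the biases, and squash through a sigmoid to get values in (0, 1)
    return T.nnet.sigmoid(T.dot(x, w_x) + T.dot(h, w_h) + b)

def gru_step(x, h,
             w_x, w_h, b_h,      # candidate hidden state weights
             w_y, b_y,           # output weights
             rw_x, rw_h, rb,     # reset gate weights
             uw_x, uw_h, ub):    # update gate weights
    reset  = gate(x, h, rw_x, rw_h, rb)
    update = gate(x, h, uw_x, uw_h, ub)
    # candidate hidden state: the old state is multiplied by the reset gate
    # before it is allowed to contribute
    h_new = T.tanh(T.dot(x, w_x) + T.dot(h * reset, w_h) + b_h)
    # mix the old state and the candidate according to the update gate
    h = update * h + (1 - update) * h_new
    y = T.nnet.softmax(T.dot(h, w_y) + b_y)
    return h, T.flatten(y, 1)
```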
So actually, although people tend to talk about LSTMs and GRUs as being pretty complex, 01:53:04.640 |
it really wasn't that hard to write. The key outcome of this though is that because we 01:53:14.000 |
now have these reset and update gates, it has the ability to learn these special 01:53:19.580 |
sets of weights to make sure that it throws away state when that's a good idea, or to 01:53:27.120 |
ignore state when that's a good idea. And so these extra degrees of freedom allow SGD to find good solutions more easily. 01:53:37.120 |
And so again, this is one of these things where we're coming up with architectures which 01:53:42.600 |
just try to make it easier for the optimizer to come up with good answers. Everything after 01:53:49.000 |
this is identical to what we looked at last week. That goes into the scan function, to 01:53:53.240 |
calculate the loss, we calculate the gradients, we do the SGD updates, and we chuck it into a training loop. 01:54:03.920 |
So I think really the main reason I wanted to do all that today was to show you the backdrop 01:54:14.480 |
example. I know some learning styles are more detail oriented as well, and so I think some 01:54:22.200 |
of you hopefully will have found that helpful. Any time you find yourself wondering how the 01:54:29.360 |
hell did this neural network do this, you can come back to this piece of code, and that's all that it's doing. 01:54:40.280 |
That's one way of thinking about it. Where you really get successful with neural nets 01:54:44.520 |
though is when you go to a whole other level and you don't think of it at that level anymore, 01:54:49.320 |
but instead you start thinking, if I'm an optimizer and I'm given an architecture like 01:54:56.240 |
this, what would I have to do in order to optimize it? 01:55:01.200 |
And once you start thinking like that, then you can start thinking in this kind of like 01:55:06.920 |
upside down way that is necessary to come up with good architectures. You can start 01:55:12.600 |
to understand why it is that this convolution layer followed by this average pooling layer 01:55:20.200 |
gives the answers that it does. Why does it work? You get that real intuition for what's going on. 01:55:26.700 |
So there's two ways, two levels at which you need to think about neural nets. The sooner 01:55:36.560 |
you can think of it at this super high level, I feel like the sooner you'll do well with 01:55:40.680 |
them. One of the best ways to do that is to, over the next couple of weeks, run this FISH 01:55:48.640 |
notebook yourself and screw around with it a lot. Make sure that you know how to do 01:55:54.960 |
these things that I did where I actually create a little function that allows me to spit out 01:56:03.680 |
the output of any of the layers and visualize it. Make sure you kind of know how to inspect 01:56:09.000 |
it and really look at the inputs and outputs. I think that's the best way to get an intuition. 01:56:18.400 |
So this was kind of like, particularly the first half of this class was a bit of a preview 01:56:25.160 |
of next year. In the first 6 weeks, you learn all the pieces. And then today, we very rapidly 01:56:33.320 |
tried putting those pieces together in a thousand different ways and saw what happened. There's 01:56:40.640 |
a million more ways that we know of, and probably a billion more ways we don't know of. So knowing 01:56:47.680 |
this little set of tools, convolutions, fully connected layers, activation functions, SGD, 01:56:58.280 |
you're now able to be an architect, create these architectures. Keras' functional API 01:57:05.800 |
makes it ridiculously easy. I created all of the architectures you see today, this week, 01:57:13.760 |
while I was sick and my baby wasn't sleeping. My brain was not even working, that's how 01:57:18.960 |
easy Keras makes this. It takes a few weeks to build your comfort level up, but hopefully 01:57:31.000 |
you can try that. And most importantly, over the next few weeks, as Rachel and I, maybe 01:57:39.240 |
with some of your help, start to develop the MOOC, you guys can stay talking on the forums 01:57:46.880 |
and keep working through whatever problems you're interested in. Whether it be the projects 01:57:52.080 |
that you want to apply these things to in your own organizations or your personal passion 01:57:56.240 |
projects or if you want to try and win a competition or two. Rachel and I are still going to be 01:58:02.880 |
on the forums. And then in a few weeks time, when the MOOC goes online, hopefully there's 01:58:11.000 |
going to be thousands of people joining this community. So we'll be like the seed. So I 01:58:18.640 |
really hope you guys will stay a part of it and help. Can you imagine that first day when 01:58:26.680 |
half the people still think that a python is a snake and don't know how to connect to 01:58:33.280 |
an AWS instance? You'll all be able to say, read the wiki, here's the page, oh yeah, I 01:58:39.200 |
had that problem too. And hopefully our goal here is to create a new generation of deep 01:58:46.560 |
learning practitioners, people who have useful problems that they're trying to solve and 01:58:54.000 |
can use this tool to solve them, rather than create more and more exclusive, heavily mathematical 01:59:03.880 |
content that's designed to put people off. So that's our hope. That's really why we're 01:59:11.040 |
doing this. Rachel, anything else that we should add before we wrap up? 01:59:18.280 |
Okay, well thank you so much. It really has been a genuine pleasure and I'm so happy to 01:59:24.840 |
hear that most of you are planning to come back next year. You guys obviously will get 01:59:32.560 |
first dibs on places for next year's course. If the MOOC is successful, next year's course 01:59:39.400 |
could be quite popular, so I do suggest that you do nonetheless get your applications in 01:59:44.680 |
not too late. They'll certainly go through with priority. 01:59:50.680 |
Be aware if you're not already, we don't send email much, really the forums is our main 01:59:59.840 |
way to communicate and Slack to some extent. So if you want to see what's going on, that's 02:00:04.840 |
the places to look. And of course, our wiki is the knowledge base that we're creating 02:00:11.520 |
for everybody. So anytime you see something missing on the wiki or something you think 02:00:15.000 |
could be improved, edit it. Even if you're not sure if you're saying the right thing, 02:00:19.320 |
you can add a little comment after it saying "I'm not sure if this is correct", and the next person can fix it up. 02:00:25.200 |
Thanks so much everybody. I hope you all have a great vacation season.