Lesson 7: Deep Learning 2
Chapters
0:00
1:00 Part Two
1:18 Generative Modeling
3:05 Standard Fully Connected Network
13:26 Repackage Variable
50:50 Update Gate
57:38 Cosine Annealing Callback
64:24 Need for Rigor in Experiments in Deep Learning
67:20 Create a Model from Scratch
70:30 Create a Learn Object from a Custom Model
73:59 Convolution
75:37 Stride 2 Convolution
77:12 Adaptive Max Pooling
80:59 Learning Rate Finder
85:43 Batch Normalization
86:21 Batch Norm
88:20 Normalizing the Inputs
111:00 Increasing the Depth of the Model
113:22 Resnet Block
119:06 Bottleneck Layer
121:25 The Transformer Architecture
129:11 Class Activation Maps
00:00:00.000 |
The last class of Part 1, I guess the theme of Part 1 is classification and regression 00:00:12.200 |
with deep learning, and specifically it's about identifying and learning the best practices 00:00:21.920 |
We started out with, here are three lines of code to do image classification, and gradually 00:00:28.600 |
we've been, well the first four lessons were then kind of going through NLP, structured 00:00:35.480 |
data, collaborative filtering and kind of understanding some of the key pieces, and most importantly 00:00:39.920 |
understanding how to actually make these things work well in practice. 00:00:45.520 |
And then the last three lessons are then kind of going back over all of those topics in 00:00:50.240 |
kind of reverse order to understand more detail about what was going on and understanding 00:00:55.680 |
what the code looks like behind the scenes and wanting to write them from scratch. 00:01:02.400 |
Part 2 of the course will move from a focus on classification and regression, which is 00:01:09.920 |
kind of predicting 'a' thing, like 'a' number, or at most a small number of things, like 00:01:17.800 |
And we'll focus more on generative modelling. 00:01:21.040 |
Generative modelling means predicting lots of things. 00:01:25.560 |
For example, creating a sentence, such as in neural translation, or image captioning, 00:01:31.640 |
or question-answering, or creating an image, such as in style transfer, super-resolution, 00:01:44.240 |
And then in Part 2, it'll move away from being just, here are some best practices, established 00:01:51.960 |
best practices either through people that have written papers or through research that 00:01:56.280 |
Fast AI has done and kind of got convinced that these are best practices, to some stuff 00:02:04.560 |
Some stuff which is maybe recent papers that haven't been fully tested yet, and sometimes 00:02:11.600 |
in Part 2, papers will come out in the middle of the course, and we'll change direction 00:02:16.040 |
with the course and study that paper because it's just interesting. 00:02:19.880 |
And so if you're interested in learning a bit more about how to read a paper and how 00:02:26.360 |
to implement it from scratch and so forth, then that's another good reason to do Part 2. 00:02:32.560 |
It still doesn't assume any particular math background, but it does assume that you're 00:02:40.720 |
prepared to spend time digging through the notation and understanding it and converting it into code. 00:02:49.520 |
Alright, so where we're up to is RNNs at the moment. 00:02:54.880 |
I think one of the issues I find most with teaching RNNs is trying to ensure that people 00:03:00.840 |
understand that they're not in any way different or unusual or magical, they're just a standard fully connected network. 00:03:11.320 |
Let's go back to the standard fully connected network which looks like this. 00:03:15.200 |
To remind you, the arrows represent one or more layer operations, generally speaking 00:03:25.160 |
a linear layer followed by a nonlinearity. In this case, they're matrix multiplications, followed by ReLU or tanh. 00:03:32.800 |
The arrows of the same color represent exactly the same weight matrix being used. 00:03:41.760 |
And so one thing which was just slightly different from previous fully connected networks we've 00:03:46.320 |
seen is that we have an input coming in not just at the first layer but also at the second layer. 00:03:55.040 |
And we tried a couple of approaches, one was concatenating the inputs and one was adding them. 00:04:00.540 |
But there was nothing at all conceptually different about this. 00:04:09.760 |
We had a model where we basically defined the three arrow colors we had as three different linear layers. 00:04:22.600 |
And by using the linear class, we got actually both the weight matrix and the bias vector 00:04:31.920 |
And then we went through and we did each of our embeddings, put it through our first linear 00:04:37.360 |
layer and then we did each of our, we call them hidden, I think they were orange arrows. 00:04:48.640 |
And in order to avoid the fact that there's no orange arrow coming into the first one, 00:04:54.880 |
we decided to invent an empty matrix and that way every one of these rows looked the same. 00:05:01.400 |
And so then we did exactly the same thing except we used a loop just to refactor the 00:05:12.360 |
So it was just a code refactoring, there was no change of anything conceptually. 00:05:18.720 |
And since we were doing a refactoring, we took advantage of that to increase the number 00:05:23.240 |
of characters to 8 because I was too lazy to type 8 linear layers, but I'm quite happy to write a loop. 00:05:31.720 |
So this now loops through this exact same thing, but we had 8 of these rather than 3. 00:05:43.280 |
So then we refactored that again by taking advantage of nn.RNN, which basically puts 00:05:49.880 |
that loop together for us and keeps track of this h as it goes along for us. 00:06:00.400 |
And so by using that we were able to replace the loop with a single call. 00:06:06.340 |
And so again, that's just a refactoring, doing exactly the same thing. 00:06:15.640 |
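As a rough sketch of what that refactoring looks like, here's a plain-PyTorch single-output character model built around nn.RNN; this is not the notebook's exact cell, and the sizes are just illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, n_fac, n_hidden, bptt, bs = 85, 42, 256, 8, 512  # illustrative sizes

    class CharRnn(nn.Module):
        def __init__(self):
            super().__init__()
            self.e = nn.Embedding(vocab_size, n_fac)      # characters -> embeddings
            self.rnn = nn.RNN(n_fac, n_hidden)            # hides the for loop and tracks h for us
            self.l_out = nn.Linear(n_hidden, vocab_size)  # hidden state -> output

        def forward(self, cs, h):
            inp = self.e(cs)                 # (seq_len, batch, n_fac)
            outp, h = self.rnn(inp, h)       # one call replaces the explicit loop over characters
            # single-output version: predict only the character after the whole sequence
            return F.log_softmax(self.l_out(outp[-1]), dim=-1), h

    m = CharRnn()
    cs = torch.randint(0, vocab_size, (bptt, bs))   # a mini-batch of 8-character sequences
    h = torch.zeros(1, bs, n_hidden)
    preds, h = m(cs, h)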
So then we looked at something which was mainly designed to save some training time, which 00:06:23.360 |
was previously, if we had a big piece of text, so we've got a movie review, we were basically 00:06:38.560 |
splitting it up into 8-character segments, and we'd grab segment number 1 and use that to predict the next character. 00:06:51.600 |
But in order to make sure we used all of the data, we didn't just split it up like that, 00:06:56.560 |
we actually said here's our whole thing, the first will be to grab this section, the second 00:07:05.440 |
will be to grab that section, then that section, then that section, and each time we're predicting the next character. 00:07:14.560 |
And so I was a bit concerned that that seems pretty wasteful because as we calculate this 00:07:21.560 |
section, nearly all of it overlaps with the previous section. 00:07:26.440 |
So instead what we did was we said what if we actually did split it into non-overlapping 00:07:34.360 |
pieces and we said let's grab this section here and use it to predict every one of the characters, offset by one. 00:07:49.200 |
And then let's grab this section here and use it to predict every one of the characters in it. 00:07:54.600 |
So after we look at the first character in, we try to predict the second character. 00:07:59.240 |
And then after we look at the second character, we try to predict the third character. 00:08:04.240 |
And then one of you perceptive folks asked a really interesting question, or expressed 00:08:11.280 |
a concern, which was, after we got through the first point here, we kind of threw away 00:08:27.040 |
our H activations and started a new one, which meant that when it was trying to use character 00:08:34.760 |
1 to predict character 2, it's got nothing to go on. 00:08:40.360 |
It's only done one linear layer, and so that seems like a problem, which indeed it is. 00:08:49.200 |
So we're going to do the obvious thing, which is let's not throw away h. 00:09:04.920 |
Every time we call forward, in other words every time we do a new mini-batch, we're creating 00:09:11.640 |
our hidden state, which remember is the orange circles, we're resetting it back to a bunch of zeros. 00:09:20.400 |
And so as we go to the next non-overlapping section, we're saying forget everything that's come before. 00:09:25.880 |
But in fact, the whole point is we know exactly where we are, we're at the end of the previous 00:09:30.860 |
section and about to start the next contiguous section, so let's not throw it away. 00:09:35.460 |
So instead the idea would be to cut this out, move it up to here, store it away in self, 00:09:51.640 |
So we're going to do that, and there's going to be some minor details to get right. 00:10:01.960 |
So here's the model, it's nearly identical, but I've got, as expected, one more line in 00:10:15.560 |
my constructor where I call something called init_hidden, and as expected init_hidden sets self.h to a bunch of zeros. 00:10:32.720 |
And then as you can see our RNN now takes in self.h, and it, as before, spits out our output and our new hidden state. 00:10:44.120 |
And so now the trick is to now store that away inside self.h. 00:10:53.720 |
If you think about it, if I was to simply do it like that, and now I train this on a 00:11:02.960 |
document that's a million characters long, then the size of this unrolled RNN is a million layers. 00:11:17.840 |
And so that's fine going forwards, but when I finally get to the end and I say here's 00:11:22.720 |
my character, and actually remember we're doing multi-output now, so multi-output looks like this. 00:11:29.440 |
Or if we were to draw the unrolled version of multi-output, we would have a triangle 00:11:39.320 |
So the problem is then when we do backpropagation, we're calculating how much does the error 00:11:47.080 |
at character 1 impact the final answer, how much does the error at character 2 impact the final answer, and so on. 00:11:55.640 |
And so we need to go back through and say how do we have to update our weights based on all of those errors. 00:12:04.680 |
And so if there are a million characters, my unrolled RNN is a million layers long, 00:12:11.520 |
I have a 1 million layer fully connected network. 00:12:17.160 |
And I didn't have to write the million layers because I have the for loop and the for loop 00:12:20.800 |
is hidden away behind the self dot RNN, but it's still there. 00:12:28.720 |
So this is actually a 1 million layer fully connected network. 00:12:32.600 |
And so the problem with that is it's going to be very memory intensive because in order 00:12:37.160 |
to do the chain rule, I have to be able to multiply at every step, like f'(u) times g'(x). 00:12:46.680 |
So I have to remember those values u, the value of every set of layers, so I'm going 00:12:53.160 |
to have to remember all those million layers, and I'm going to have to do a million multiplications, 00:12:57.680 |
and I'm going to have to do that every batch. 00:13:03.360 |
So to avoid that, we basically say from time to time, I want you to forget your history. 00:13:12.760 |
So we can still remember the state, which is to remember what's the actual values in 00:13:17.200 |
our hidden matrix, but we can remember the state without remembering everything about how we got there. 00:13:24.720 |
So there's a little function called repackage_var, which literally is just this. 00:13:37.400 |
It just simply says, grab the tensor out of it, because remember the tensor itself doesn't 00:13:45.400 |
have any concept of history, and create a new variable out of that. 00:13:50.200 |
And so this variable is going to have the same value, but no history of operations, 00:13:56.320 |
and therefore when it tries to backpropagate, it'll stop there. 00:14:01.160 |
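For reference, the fastai helper is roughly this (a sketch from memory of the 0.7-era library, written against the old torch.autograd.Variable API that PyTorch 0.3 used; on current PyTorch the same effect is simply h.detach()):

    from torch.autograd import Variable

    def repackage_var(h):
        # Return new Variables with the same data but no history of operations,
        # so backpropagation stops here. Handles tuples of state (e.g. an LSTM's).
        if type(h) == Variable:
            return Variable(h.data)
        return tuple(repackage_var(v) for v in h)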
So basically what we're going to do then is we're going to call this in our forward. 00:14:05.800 |
So that means it's going to do 8 characters, it's going to backpropagate through 8 layers, 00:14:13.680 |
it's going to keep track of the actual values in our hidden state, but it's going to throw 00:14:18.480 |
away at the end of those 8 its history of operations. 00:14:24.040 |
So this approach is called backprop through time, and when you read about it online, people 00:14:31.880 |
make it sound like a different algorithm, or some big insight or something, but it's 00:14:39.360 |
It's just saying hey, after our for loop, just throw away your history of operations 00:14:47.480 |
So we're keeping our hidden state, but we're not keeping our hidden state's history. 00:14:55.400 |
So that's wrinkle number 1, that's what this repackage_var is doing. 00:15:00.340 |
So when you see bptt, that's referring to backprop through time, and you might remember 00:15:07.320 |
we saw that in our original RNN lesson, we had a variable called bptt = 70, and so when 00:15:15.880 |
we set that, we're actually saying how many layers to backprop through. 00:15:21.200 |
Another good reason not to backprop through too many layers is if you have any kind of 00:15:25.760 |
gradient instability like gradient explosion or gradient vanishing, the more layers you have, the harder that is to deal with. 00:15:39.040 |
On the other hand, a longer value for bptt means that you're able to explicitly capture a longer memory, more state. 00:15:52.600 |
So that's something that you get to tune when you create your RNN. 00:16:02.320 |
Wrinkle number 2 is how are we going to put the data into this. 00:16:10.600 |
It's all very well the way I described it just now where we said we could do this, and 00:16:23.880 |
we can first of all look at this section, then this section, then this section, but 00:16:30.080 |
we want to do a mini-batch at a time, we want to do a bunch at a time. 00:16:36.920 |
So in other words, we want to say let's do it like this. 00:16:51.200 |
So mini-batch number 1 would say let's look at this section and predict that section. 00:16:58.080 |
And at the same time in parallel, let's look at this totally different section and predict 00:17:04.040 |
And at the same time in parallel, let's look at this totally different section and predict 00:17:10.600 |
And so then, because remember in our hidden state, we have a vector of hidden state for 00:17:18.080 |
everything in our mini-batch, so it's going to keep track of at the end of this is going 00:17:22.200 |
to be a vector here, a vector here, a vector here, and then we can move across to the next 00:17:27.320 |
one and say okay, for this part of the mini-batch, use this to predict that, and use this to predict 00:17:38.000 |
So you can see that we've got a number of totally separate bits of our text that we're 00:17:47.440 |
So hopefully this is going to ring a few bells for you, because what happened was back when 00:17:55.720 |
we started looking at TorchText for the first time, we started talking about how it creates 00:18:01.720 |
And I said what happened was we took our whole big long document consisting of the entire 00:18:10.720 |
works of Nietzsche, or all of the IMDb reviews concatenated together, or whatever, and a lot 00:18:17.320 |
of you, not surprisingly, because this is really weird at first, a lot of you didn't 00:18:23.120 |
What I said was we split this into 64 equal-sized chunks, and a lot of your brains went, "Jermi 00:18:31.160 |
just said we split this into chunks of size 64." 00:18:36.360 |
Jermi said we split it into 64 equal-sized chunks. 00:18:41.300 |
So if this whole thing was length 64 million, which would be a reasonable sized corpus, 00:18:50.360 |
then each of our 64 chunks would have been of length 1 million. 00:18:57.440 |
And so then what we did was we took the first chunk of 1 million and we put it here. 00:19:03.360 |
And then we took the second chunk of 1 million and we put it here. 00:19:06.920 |
The third chunk of 1 million, we put it here. 00:19:14.880 |
And then each mini-batch consisted of us going, "Let's split this down here, and here, and 00:19:25.120 |
And each of these is of size BPTT, which I think we had something like 70. 00:19:36.640 |
And so what happened was we said, "All right, let's look at our first mini-batch is all 00:19:44.040 |
So we do all of those at once and predict everything offset by 1. 00:19:52.320 |
And then at the end of that first mini-batch, we went to the second chunk and used each 00:19:58.000 |
one of these to predict the next one offset by 1. 00:20:02.760 |
So that's why we did that slightly weird thing, is that we wanted to have a bunch of things 00:20:08.960 |
we can look through in parallel, each of which hopefully are far enough away from each other 00:20:16.240 |
that we don't have to worry about the fact that the truth is the start of this million 00:20:21.200 |
characters was actually in the middle of a sentence, but who cares? 00:20:26.480 |
Because it only happens once every million characters. 00:20:30.800 |
I was wondering if you could talk a little bit more about augmentation for this kind of data? 00:20:44.080 |
No, I can't because I don't really know a good way. 00:20:48.640 |
It's one of the things I'm going to be studying between now and Part 2. 00:20:54.920 |
There have been some recent developments, particularly something we talked about in the 00:20:59.640 |
machine learning course, which I think we briefly mentioned here, which was somebody 00:21:03.640 |
for a recent Kaggle competition won it by doing data augmentation by randomly inserting 00:21:16.640 |
Something like that may be useful here, and I've seen some papers that do something like 00:21:20.680 |
that, but I haven't seen any kind of recent-ish state-of-the-art NLP papers that are doing 00:21:32.240 |
this kind of data augmentation, so it's something we're planning to work on. 00:21:47.000 |
So there's a couple of things to think about when you pick your BPTT. 00:21:49.840 |
The first is that you'll note that the matrix size for a mini-batch is bptt by batch size. 00:22:07.280 |
So one issue is your GPU RAM needs to be able to fit that by your embedding matrix, because 00:22:14.960 |
every one of these is going to be of length embedding, plus all of the hidden state. 00:22:21.320 |
So one thing is if you get a CUDA out of memory error, you need to reduce one of those. 00:22:29.920 |
If your training is very unstable, like your loss is shooting off to NAN suddenly, then 00:22:38.080 |
you could try decreasing your bptt because you've got fewer layers to gradient explode through. 00:22:46.160 |
If it's too slow, you could try decreasing your bptt because it's got to do one of those steps at a time, and that can't be parallelized. 00:22:59.160 |
There's a recent thing called QRNN, which we'll hopefully talk about in Part 2 which 00:23:04.040 |
kind of does parallelize it, but the versions we're looking at don't parallelize it. 00:23:08.160 |
So that would be the main issues, look at performance, look at memory, and look at stability, 00:23:13.880 |
and try and find a number that's as high as you can make it, but all of those things 00:23:24.680 |
So trying to get all that chunking and lining up to work is more code than I want to write, 00:23:33.260 |
so for this section we're going to go back and use Torch Text again. 00:23:41.880 |
When you're using APIs like FastAI and Torch Text, which in this case these two APIs are 00:23:47.440 |
designed to work together, you often have a choice which is like, okay, this API has 00:23:55.320 |
a number of methods that expect the data in this kind of format, and you can either change 00:24:01.080 |
your data to fit that format, or you can write your own data set subclass to handle the format 00:24:11.000 |
I've noticed on the forum a lot of you are spending a lot of time writing your own data 00:24:16.080 |
set classes, whereas I am way lazier than you and I spend my time instead changing my 00:24:25.960 |
Either is fine, and if you realize there's a kind of a format of data that me and other 00:24:34.760 |
people are likely to be seeing quite often and it's not in the FastAI library, then by 00:24:38.720 |
all means write the data set subclass, submit it as a PR, and then everybody can benefit. 00:24:45.640 |
In this case, I just thought I want to have some Nietzsche data fed into Torch Text, I'm just 00:24:55.400 |
going to put it in the format that Torch Text kind of already supports. 00:24:58.740 |
So Torch Text already has, or at least the FastAI wrapper around Torch Text, already 00:25:03.840 |
has something where you can have a training path and a validation path and one or more 00:25:09.880 |
text files in each path containing a bunch of stuff that's concatenated together for 00:25:16.040 |
So in this case, all I did was I made a copy of my Nietzsche file, copied it into training, 00:25:22.480 |
made another copy, stuck it into the validation, and then in the training set, I deleted the 00:25:29.560 |
last 20% of rows, and in the validation set, I deleted all except for the last 20% of rows. 00:25:38.800 |
In this case, I found that easier than writing a custom data set class. 00:25:44.400 |
The other benefit of doing it that way was that I felt like it was more realistic to 00:25:49.240 |
have a validation set that wasn't a random shuffled set of rows of text, but was like 00:25:55.880 |
a totally separate part of the corpus, because I feel like in practice you're very often 00:26:01.140 |
going to be saying, "Oh, I've got these books or these authors I'm learning from, and then 00:26:06.880 |
I want to apply it to these different books and these different authors." 00:26:10.080 |
So I felt like to get a more realistic validation of my Nietzsche model, I should use a whole separate 00:26:18.280 |
piece of the text, so in this case it was the last 20% of the rows of the corpus. 00:26:25.760 |
So I haven't created this for you intentionally, because this is the kind of stuff I want you 00:26:32.480 |
practicing is making sure that you're familiar enough, comfortable enough with bash or whatever 00:26:37.760 |
you can create these, and that you understand what they need to look like and so forth. 00:26:43.560 |
So in this case, you can see I've now got a train and a validation here, and then I could look inside the training folder. 00:26:56.920 |
So you can see I've literally just got one file in it, because when you're doing a language 00:27:01.400 |
model, i.e. predicting the next character or predicting the next word, you don't really need separate files. 00:27:08.160 |
It's fine if you do have separate files, but they just get concatenated together anyway. 00:27:14.240 |
So that's my source data, and so here is the same lines of code that we've seen before, 00:27:20.720 |
and let's go over them again because it's a couple of lessons ago. 00:27:23.880 |
So in Torch Text, we create this thing called a field, and a field initially is just a description of how to go about pre-processing the text. 00:27:36.440 |
In this case, I'm going to say lowercase it, because I don't -- now I think about it, there's 00:27:43.920 |
no particular reason to have done this lowercase, uppercase would work fine too. 00:27:49.960 |
And so you might remember last time we used a tokenization function which largely split 00:27:55.800 |
on white space and tried to do clever things with punctuation, and that gave us a word-level model. 00:28:01.160 |
In this case, I want a character model, so I actually want every character put into a separate token. 00:28:07.660 |
So I can just use the function list in Python, because list in Python does that. 00:28:18.320 |
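A sketch of that field definition (this uses the torchtext 0.2-era Field API that the course was built on; in recent torchtext this class has moved or been removed):

    from torchtext import data

    # lowercase everything, and tokenize with Python's built-in list(),
    # which turns a string into a list of its individual characters
    TEXT = data.Field(lower=True, tokenize=list)

    list("chaos")   # ['c', 'h', 'a', 'o', 's'] -- every character becomes a token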
So this is where you can kind of see like, understanding how libraries like Torch Text 00:28:24.780 |
and FastAI are designed to be extended can make your life a lot easier. 00:28:29.900 |
So when you realize that very often, both of these libraries kind of expect you to pass 00:28:36.400 |
a function that does something, and then you realize, oh, I can write any function I like. 00:28:44.680 |
So this is now going to mean that each mini-batch is going to contain a list of characters. 00:28:51.480 |
And so here's where we get to define all our different parameters. 00:28:56.080 |
And so to make it the same as previous sections of this notebook, I'm going to use the same 00:29:00.960 |
batch size, the same number of characters, and I'm going to rename it to bptt since we 00:29:09.360 |
The number of the size of the embedding, and the size of our hidden state. 00:29:15.440 |
Remembering the size of our hidden state simply means going all the way back to the start, 00:29:24.760 |
and then hidden simply means the size of the state that's created by each of those orange circles. 00:29:39.240 |
So having done that, we can then create a little dictionary saying what's our training, validation and test set. 00:29:45.080 |
In this case, I don't have a separate test set, so I'll just use the same thing. 00:29:49.760 |
And then I can say I want a language model data subclass with model data, I'm going to 00:29:55.320 |
grab it from text files, and this is my path, and this is my field, which I defined earlier, 00:30:05.360 |
and these are my files, and these are my hyperparameters. 00:30:11.640 |
min_freq is not going to do anything actually in this case because I don't think there's 00:30:15.120 |
going to be any character that appears less than 3 times, so that's probably redundant. 00:30:22.180 |
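Put together, the setup looks roughly like this. LanguageModelData.from_text_files and its argument names are from my recollection of the 0.7-era fastai API, and the sizes are illustrative, so treat this as a sketch rather than the notebook itself:

    from fastai.nlp import *        # 2018-era fastai (0.7); import path from memory
    from torchtext import data

    PATH = 'data/nietzsche/'
    bs, bptt, n_fac, n_hidden = 512, 8, 42, 256    # illustrative hyperparameters

    TEXT = data.Field(lower=True, tokenize=list)
    FILES = dict(train='trn', validation='val', test='val')  # no separate test set, so reuse val

    md = LanguageModelData.from_text_files(PATH, TEXT, **FILES,
                                           bs=bs, bptt=bptt, min_freq=3)

    len(md.trn_dl), md.nt   # number of mini-batches per epoch, size of the vocabulary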
So at the end of that, it says there's going to be 963 batches to go through. 00:30:28.760 |
And so if you think about it, that should be equal to the number of tokens divided by 00:30:34.880 |
the batch size divided by bptt, because that's the size of each of those rectangles. 00:30:46.120 |
You'll find that in practice it's not exactly that, and the reason it's not exactly that 00:30:51.200 |
is that the authors of TorchText did something pretty smart, which I think we've briefly 00:30:58.600 |
They said we can't shuffle the data, like with images we like to shuffle the order so 00:31:03.240 |
every time we see them in a different order, so there's a bit more randomness. 00:31:06.480 |
We can't shuffle because we need to be contiguous, but what we could do is basically randomize 00:31:20.480 |
It's not always going to give us exactly 8 characters long, 5% of the time it'll actually 00:31:27.800 |
cut it in half, and then it's going to add on a small little standard deviation to make it slightly bigger or smaller than 8. 00:31:37.400 |
So it's going to be slightly different to 8 on average. 00:31:44.600 |
So a mini-batch needs to do a matrix multiplication, and the mini-batch size has to remain constant 00:32:08.960 |
because we've got this h-weight matrix that has to line up in size with the size of the 00:32:20.720 |
But the sequence length can change, no problem. 00:32:33.560 |
So that's why we have 963, so the length of a data loader is how many mini-batches are in it. 00:32:41.760 |
Number of tokens is how many unique things are in the vocabulary. 00:32:45.960 |
And remember, after we run this line, text now does not just contain a description of 00:32:55.240 |
what we want, but it also contains an extra attribute called vocab, which contains stuff 00:33:04.040 |
like a list of all of the unique items in the vocabulary and a reverse mapping from each item to its number. 00:33:18.960 |
So that text object is now an important thing to keep track of. 00:33:35.480 |
So the class is exactly the same as the class we've had before. 00:33:39.120 |
The only key difference is that the constructor calls init_hidden, which sets up self.h. 00:33:44.240 |
So h is not something we pass around anymore, it's now an attribute: self.h is a variable containing the hidden state. 00:33:53.320 |
Now I mentioned that batch size remains constant each time, but unfortunately when I said that 00:34:02.760 |
And the way that I lied to you is that the very last mini-batch will be shorter. 00:34:09.960 |
The very last mini-batch is actually going to have less than 64 -- it might be exactly 00:34:14.080 |
the right size if it so happens that this data set is exactly divisible by bptt times 00:34:20.800 |
But it probably isn't, so the last batch will probably have a little bit less. 00:34:26.320 |
And so that's why I do a little check here that says let's check that the batch size 00:34:31.000 |
inside self.h is right: the height is the number of activations, and the width is the mini-batch size. 00:34:48.080 |
Check that that's equal to the actual batch size length that we've received. 00:34:58.160 |
And if they're not the same, then set it back to zeroes again. 00:35:03.120 |
So this is just a minor little wrinkle: basically at the end of each epoch, it's going to be left with a hidden state sized for that last, shorter batch. 00:35:12.600 |
And so then as soon as it starts the next epoch, it's going to see that they're not 00:35:16.360 |
the same again, and it will reinitialize it to the correct full batch size. 00:35:20.960 |
So that's why if you're wondering, there's an init hidden not just in the constructor, 00:35:26.240 |
but also inside forward, it's to handle this end of each epoch, start of each epoch difference. 00:35:34.680 |
Not an important point by any means, but potentially confusing when you see it. 00:35:47.640 |
The last wrinkle is something that slightly sucks about PyTorch, and maybe somebody can 00:35:55.000 |
be nice enough to try and fix it with a PR if anybody feels like it, which is that the 00:36:00.400 |
loss functions such as softmax are not happy receiving a rank 3 tensor. 00:36:10.040 |
Remember a rank 3 tensor is just another way of saying a dimension 3 array. 00:36:17.040 |
There's no particular reason they ought to not be happy receiving a rank 3 tensor. 00:36:21.520 |
Like somebody could write some code to say hey, a rank 3 tensor is probably a sequence 00:36:25.960 |
length by batch size by results thing, and so you should just do it for each of those first two axes. 00:36:42.920 |
Funnily enough, it can handle rank 2 or rank 4, but not rank 3. 00:36:49.640 |
So we've got a rank 3 tensor containing, for each time 00:37:04.920 |
period (I can't remember which way around the axes are, but whatever) for each time 00:37:11.340 |
period for each batch, we've got our predictions. 00:37:19.920 |
And then we've got our actuals: for each time period, for each batch, we've got the actual answer, 00:37:33.800 |
And so we just want to check whether they're the same. 00:37:36.400 |
And so in an ideal world, our loss function would check item 1 1, then item 1 2, and then 00:37:42.320 |
item 1 3, but since that hasn't been written, we just have to flatten them both out. 00:37:48.320 |
We can literally just flatten them out, put rows to rows. 00:37:53.160 |
And so that's why here I have to use .view, and so .view says the number of columns will 00:38:04.800 |
be equal to the size of the vocab, because remember we're going to end up with a probability 00:38:11.120 |
And then the number of rows is however big is necessary, which will be equal to batch size times sequence length. 00:38:22.780 |
And then you may be wondering where I do that for the target, and the answer is torch text 00:38:31.000 |
knows that the target needs to look like that, so torch text has already done that for us. 00:38:35.840 |
So torch text automatically changes the target to be flattened out. 00:38:40.320 |
And you might actually remember if you go back to lesson 4 when we actually looked at 00:38:45.920 |
a mini-batch that spat out of torch text, we noticed actually that it was flattened, and 00:38:51.640 |
I said we'll learn about why later, and so later is now arrived. 00:39:09.080 |
Recreate the hidden state if the batch size changes, flatten the output out, and then let torchtext flatten the target for us. 00:39:22.200 |
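Putting those three wrinkles together, a stateful character model looks roughly like this; this is a plain-PyTorch sketch written in today's API (using .detach() where the lesson's code calls repackage_var), with illustrative sizes:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, n_fac, n_hidden, bs = 56, 42, 256, 512   # illustrative sizes

    class CharSeqStatefulRnn(nn.Module):
        def __init__(self, vocab_size, n_fac, bs):
            super().__init__()
            self.vocab_size = vocab_size
            self.e = nn.Embedding(vocab_size, n_fac)
            self.rnn = nn.RNN(n_fac, n_hidden)
            self.l_out = nn.Linear(n_hidden, vocab_size)
            self.init_hidden(bs)

        def forward(self, cs):
            bs = cs.size(1)                       # torchtext hands us (seq_len, batch)
            if self.h.size(1) != bs:              # the last, shorter batch of the epoch
                self.init_hidden(bs)
            outp, h = self.rnn(self.e(cs), self.h)
            self.h = h.detach()                   # keep the state, drop the history (repackage_var)
            out = F.log_softmax(self.l_out(outp), dim=-1)
            return out.view(-1, self.vocab_size)  # flatten (seq, batch, vocab) to rank 2 for the loss

        def init_hidden(self, bs):
            self.h = torch.zeros(1, bs, n_hidden)

    m = CharSeqStatefulRnn(vocab_size, n_fac, bs)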
So once we do those things, we can then create our model, create our optimizer with that model's 00:39:38.600 |
One thing to be careful of here is that softmax now, as of PyTorch 0.3, requires that we pass 00:39:56.000 |
in a number here saying which axis do we want to do the softmax over. 00:40:03.040 |
So at this point, this is a 3-dimensional tensor, and so we want to do the softmax over the final axis. 00:40:11.160 |
So when I say which axis do we do the softmax over, remember we divide by a sum: we go e to the activation divided by the sum of e to all the activations. 00:40:21.400 |
So it's saying which axis do we sum over, so which axis do we want to sum to 1. 00:40:26.520 |
And so in this case, clearly we want to do it over the last axis, because the last axis 00:40:31.200 |
is the one that contains the probability per letter of the alphabet, and we want all of those probabilities to sum to 1. 00:40:40.880 |
So therefore, to run this notebook, you're going to need PyTorch 0.3, which just came 00:40:50.040 |
So if you're doing this on the MOOC, you're fine, I'm sure you've got at least 0.3 or later. 00:41:02.760 |
The really great news is that 0.3, although it does not yet officially support Windows, 00:41:11.000 |
I successfully installed 0.3 from Conda yesterday by typing Conda install PyTorch in Windows. 00:41:18.520 |
I then attempted to use the entirety of Lesson 1, and every single part worked. 00:41:27.520 |
So for those who are interested in doing deep learning on their laptop, I can definitely 00:41:36.000 |
The New Surface Book 15" has a GTX 1060 6GB GPU in it, and it was running about 3 times 00:41:47.800 |
slower than my 1080Ti, which I think means it's about the same speed as an AWS P2 instance. 00:42:00.360 |
And as you can see, it's also a nice convertible tablet that you can write on, and it's thin 00:42:05.240 |
and light, so I've never seen such a good deep learning box. 00:42:11.160 |
Also I successfully installed Linux on it, and all of the fastai stuff worked on the 00:42:16.920 |
Linux as well, so a really good option if you're interested in a laptop that can run deep learning. 00:42:27.720 |
So that's what to be aware of with this dim=-1. 00:42:31.840 |
So then we can go ahead and construct this, and we can call fit, and we're basically going 00:42:37.520 |
to get pretty similar results to what we got before. 00:42:45.120 |
So then we can go a bit further with our RNN by just unpacking it a bit more. 00:42:54.000 |
And so this is now exactly the same thing, gives exactly the same answers, but I have written more of it myself. 00:43:08.300 |
And so this is just something, I won't spend time on it, but you can check it out. 00:43:12.280 |
So instead, I've now defined RNN as RNN cell, and I've copied and pasted the code above. 00:43:18.880 |
Don't run it, this is just for your reference, from PyTorch. 00:43:22.640 |
This is the definition of RNN cell in PyTorch. 00:43:26.040 |
And I want you to see that you can now read PyTorch source code and understand it. 00:43:31.980 |
Not only that, you'll recognize it as something we've done before. 00:43:35.040 |
It's a matrix multiplication of the weights by the inputs plus biases. 00:43:41.400 |
So F.linear simply does a matrix product followed by an addition. 00:43:46.940 |
And interestingly, you'll see they do not concatenate the input bit and the hidden bit, 00:43:55.080 |
they sum them together, which is our first approach. 00:43:59.480 |
As I said, you can do either, neither one is right or wrong, but it's interesting to 00:44:05.520 |
Can you give us an insight about why they're using that particular activation function? 00:44:14.800 |
I think we might have briefly covered this last week, but very happy to do it again if you like. 00:44:39.080 |
So in other words, tanh is a sigmoid function, double the height, minus 1 - literally, they're equal. 00:44:47.720 |
So it's a nice function in that it's forcing it to be no smaller than -1, no bigger than 1. 00:44:57.220 |
And since we're multiplying by this weight matrix again and again and again and again, 00:45:03.040 |
we might worry that a ReLU, because it's unbounded, might have more of a gradient explosion problem. 00:45:12.600 |
Having said that, you can actually ask PyTorch for an RNN cell which uses a different nonlinearity. 00:45:25.240 |
So you can see by default it uses tanh, but you can ask for a ReLU as well. 00:45:30.560 |
But most people seem to, pretty much everybody still seems to use tanh as far as I can tell. 00:45:37.260 |
So you can basically see here, this is all the same except now I've got an RNN cell, 00:45:41.120 |
which means now I need to put my for loop back. 00:45:44.280 |
And you can see every time I call my little linear function, I just append the result onto a list. 00:45:53.880 |
And at the end, the result is all of that stacked up together. 00:45:59.320 |
So I'm just trying to show you how nothing inside PyTorch is mysterious, you should find 00:46:06.360 |
you get basically exactly the same answer from this as the previous one. 00:46:13.720 |
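A sketch of what putting the loop back looks like, paraphrasing rather than reproducing the notebook (sizes illustrative); the torch.stack at the end is the "all stacked up together" step:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, n_fac, n_hidden = 56, 42, 256   # illustrative sizes

    class CharSeqRnnCell(nn.Module):
        def __init__(self, vocab_size, n_fac):
            super().__init__()
            self.e = nn.Embedding(vocab_size, n_fac)
            self.rnn = nn.RNNCell(n_fac, n_hidden)       # one step of the RNN at a time
            self.l_out = nn.Linear(n_hidden, vocab_size)

        def forward(self, cs, h):
            outp = []
            for c in cs:                   # the for loop that nn.RNN normally hides from us
                h = self.rnn(self.e(c), h)
                outp.append(h)             # append each step's hidden state to a list
            outp = torch.stack(outp)       # stack back into (seq_len, batch, n_hidden)
            return F.log_softmax(self.l_out(outp), dim=-1), h

    m = CharSeqRnnCell(vocab_size, n_fac)
    cs = torch.randint(0, vocab_size, (8, 4))   # 8 time steps, batch of 4
    h = torch.zeros(4, n_hidden)                # RNNCell keeps (batch, n_hidden) state
    preds, h = m(cs, h)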
In practice you would never write it like this, but what you may well find in practice 00:46:17.780 |
is that somebody will come up with a new kind of RNN cell, or a different way of keeping 00:46:23.440 |
track of things over time, or a different way of doing regularization. 00:46:27.080 |
And so inside fastai's code, you will find that we do this by hand because we use some 00:46:38.960 |
regularization approaches that aren't supported by PyTorch. 00:46:45.400 |
So another thing I'm not going to spend much time on but I'll mention briefly is that nobody really uses this RNN cell in practice. 00:46:54.480 |
And the reason we don't use that RNN cell in practice is even though the tanh is here, 00:47:00.420 |
you do tend to find gradient explosions are still a problem, so we have to use pretty 00:47:06.400 |
low learning rates to get these to train, and pretty small values for bptt to get them to train successfully. 00:47:15.640 |
So what we do instead is we replace the RNN cell with something like this. 00:47:21.280 |
This is called a GRU cell, and here's a picture of it, and there's the equations for it. 00:47:38.320 |
So basically I'll show you both quickly, but we'll talk about it much more in Part 2. 00:47:44.200 |
We've got our input, and our input normally gets multiplied by a weight matrix to create our new activations, 00:48:00.400 |
and then of course we add it to the existing activations. That's not what happens here. 00:48:07.720 |
In this case, our input goes into this h_tilde temporary thing, and it doesn't just get added 00:48:16.000 |
to our previous activations, but our previous activations get multiplied by this value R. 00:48:29.240 |
And how do we calculate this value r, which goes between 0 and 1, in our reset gate? 00:48:35.760 |
Well the answer is, it's simply equal to a matrix product between some weight matrix 00:48:42.280 |
and the concatenation of our previous hidden state and our new input. 00:48:47.600 |
In other words, this is a little one hidden layer neural net. 00:48:52.920 |
And in particular it's a neural net because we then put it through a nonlinearity, a sigmoid. 00:48:58.760 |
One of the things I hate about mathematical notation is symbols are overloaded a lot. 00:49:03.800 |
When you see sigma, that means standard deviation. 00:49:06.640 |
When you see it next to a parenthesis like this, it means the sigmoid function. 00:49:11.440 |
So in other words, that sigma next to a parenthesis is the function 1/(1+e^-x), which squashes its input to between 0 and 1. 00:49:26.120 |
So this is like a little mini-neuronet with no hidden layers, so to think of it another 00:49:32.640 |
And I mentioned this briefly because it's going to come up a lot in part 2, so it's 00:49:39.440 |
It's this idea that in the very learning itself, you can have little mini-neuronets inside 00:49:48.840 |
And so this little mini-neuronet is going to be used to decide how much of my hidden state to keep around. 00:49:57.320 |
And so it might learn that in this particular situation, forget everything you know. 00:50:04.600 |
When you see a full stop, you should throw away nearly all of your hidden state. 00:50:09.800 |
That is probably something it would learn, and that's very easy for it to learn using this little mini-neural net. 00:50:16.200 |
And so that goes through to create my new hidden state along with the input. 00:50:22.040 |
And then there's a second thing that happens, which is there's this gate here called z. 00:50:27.200 |
And what z says is you've got some amount of your previous hidden state plus your new 00:50:34.240 |
input, and it's going to go through to create your new state. 00:50:38.880 |
And I'm going to let you decide to what degree do you use this new input version of your 00:50:45.760 |
hidden state, and to what degree will you just leave the hidden state the same as before. 00:50:50.300 |
So this thing here is called the update gate. 00:50:54.920 |
So there are two decisions: the first is to throw away some hidden state when deciding how much of it to incorporate 00:50:59.800 |
versus my new input, and the second is how much to update my hidden state versus just leave it exactly as it was. 00:51:07.580 |
And the equation hopefully is going to look pretty familiar to you, which is check this 00:51:14.080 |
Remember how I said you want to start to recognize some common ways of looking at things? 00:51:21.120 |
Well here I have a 1 minus something by a thing, and a something without the 1 minus 00:51:30.080 |
by a thing, which remember is a linear interpolation. 00:51:35.120 |
So in other words, the value of z is going to decide to what degree do I have keep the 00:51:42.680 |
previous hidden state, and to what degree do I use the new hidden state. 00:51:48.800 |
So that's why they draw it here as this kind of like, it's not actually a switch, but you can think of it as one. 00:51:56.880 |
You can be like, oh it's here, or it's here, or it's here to decide how much to update. 00:52:06.520 |
It's a little mini-neuronet with its own weight matrix to decide how much to update, a little 00:52:10.520 |
mini-neuronet with its own weight matrix to decide how much to reset, and then that's 00:52:14.520 |
used to do an interpolation between the two hidden states. 00:52:18.440 |
So that's called a GRU, a gated recurrent unit. 00:52:24.600 |
There's the definition from the PyTorch source code. 00:52:28.440 |
They have some slight optimizations here that if you're interested in we can talk about 00:52:32.560 |
them on the forum, but it's exactly the same formula we just saw. 00:52:38.580 |
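For reference, here's the formula just described written out as a single GRU step; the weight names are made up for illustration and biases are omitted, so it's a sketch of the idea rather than PyTorch's exact implementation:

    import torch

    def gru_step(x, h, W_xr, W_hr, W_xz, W_hz, W_xh, W_hh):
        r = torch.sigmoid(x @ W_xr + h @ W_hr)            # reset gate: how much of h to keep
        z = torch.sigmoid(x @ W_xz + h @ W_hz)            # update gate: how much to update
        h_tilde = torch.tanh(x @ W_xh + (r * h) @ W_hh)   # candidate new hidden state
        return (1 - z) * h_tilde + z * h                  # linear interpolation old vs new

    # usage with random weights: batch of 4, 10 inputs, 20 hidden units
    x, h = torch.randn(4, 10), torch.randn(4, 20)
    Ws = [torch.randn(10, 20), torch.randn(20, 20),
          torch.randn(10, 20), torch.randn(20, 20),
          torch.randn(10, 20), torch.randn(20, 20)]
    h_new = gru_step(x, h, *Ws)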
And so if you go nn.GRU, then it uses this same code, but it replaces the RNN cell with a GRU cell. 00:52:49.560 |
And as a result, rather than having something where we're getting a 1.54, we're now getting 00:52:59.600 |
down to 1.40, and we can keep training even more, get right down to 1.36. 00:53:05.640 |
So in practice, a GRU, or very nearly equivalently, we'll see in a moment, an LSTM, is what you use in practice. 00:53:16.780 |
So the RT and HT are ultimately scalars after they go through the sigmoid, but they're applied 00:53:27.600 |
Yes, although of course one for each mini-batch. 00:53:42.880 |
On the excellent Chris Olah's blog, there's an understanding LSTM networks post, which 00:53:50.840 |
you can read all about this in much more detail if you're interested. 00:53:54.600 |
And also, the other one I was drawing from here is WildML, which also has a good blog post on this. 00:54:00.280 |
If somebody wants to be helpful, feel free to put them in the lesson wiki. 00:54:09.960 |
So then putting it all together, I'm now going to replace my GRU with an LSTM. 00:54:16.040 |
I'm not going to bother showing you the cell for this, it's very similar to GRU. 00:54:20.240 |
But the LSTM has one more piece of state in it called the cell state, not just the hidden 00:54:27.200 |
So if you do use an LSTM, you now inside your init_hidden have to return a tuple of matrices. 00:54:33.720 |
They're exactly the same size as the hidden state, but you just have to return the tuple. 00:54:40.080 |
The details don't matter too much, but we can talk about it during the week if you're interested. 00:54:47.760 |
When you pass in, you still pass in self.h, it still returns a new value of h, you still store it back in self.h. 00:54:54.020 |
So this code is identical to the code before. 00:54:57.400 |
One thing I've done though is I've added dropout inside my RNN, which you can do with the PyTorch 00:55:05.000 |
RNN function, so that's going to do dropout after each time step. 00:55:09.680 |
And I've doubled the size of my hidden layer since I've now added 0.5 dropout, and so my 00:55:14.400 |
hope was that this would be able to learn more but be more resilient as it does so. 00:55:24.440 |
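A minimal sketch of those two changes, the tuple of state in init_hidden and the dropout argument on the RNN constructor, written against the plain PyTorch nn.LSTM API with illustrative sizes (the forward method is the same as the RNN version sketched earlier, except that both elements of the tuple get detached):

    import torch
    import torch.nn as nn

    vocab_size, n_fac, n_hidden, nl, bs = 56, 42, 512, 2, 512   # doubled hidden size, 2 layers

    class CharSeqStatefulLSTM(nn.Module):
        def __init__(self, vocab_size, n_fac, bs, nl):
            super().__init__()
            self.vocab_size, self.nl = vocab_size, nl
            self.e = nn.Embedding(vocab_size, n_fac)
            self.rnn = nn.LSTM(n_fac, n_hidden, nl, dropout=0.5)  # dropout inside the RNN
            self.l_out = nn.Linear(n_hidden, vocab_size)
            self.init_hidden(bs)

        def init_hidden(self, bs):
            # an LSTM carries two pieces of state (hidden state and cell state),
            # so init_hidden has to set up a tuple of two matrices of the same size
            self.h = (torch.zeros(self.nl, bs, n_hidden),
                      torch.zeros(self.nl, bs, n_hidden))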
So then I wanted to show you how to take advantage of a little bit more fast.ai magic without 00:55:35.400 |
And so I'm going to show you how to use callbacks, and specifically we're going to do SGDR without using the Learner class. 00:55:47.760 |
So to do that, we create our model again, just a standard PyTorch model. 00:55:52.680 |
And this time, rather than going, remember the usual PyTorch approach is opt = optim.Adam 00:56:00.200 |
and you pass in the parameters and the learning rate, I'm not going to do that, I'm going to 00:56:04.400 |
use the fast.ai layer optimizer class, which takes my optim class constructor from PyTorch. 00:56:16.360 |
It takes my model, it takes my learning rate, and optionally takes weight decay. 00:56:24.720 |
And so this class is tiny, it doesn't do very much at all. 00:56:29.700 |
The key reason it exists is to do differential learning rates and differential weight decay. 00:56:35.960 |
But the reason we need to use it is that all of the mechanics inside fast.ai assumes that you're using one of these. 00:56:43.300 |
So if you want to use callbacks or SGDR or whatever in code where you're not using the 00:56:50.040 |
learner class, then you need to use, rather than saying opt = optim.Adam and here are my parameters, a layer optimizer. 00:57:03.440 |
So that gives us a layer optimizer object, and if you're interested, basically behind 00:57:09.840 |
the scenes, you can now grab a .opt property which actually gives you the optimizer. 00:57:20.160 |
You don't have to worry about that yourself, but that's basically what happens behind the 00:57:24.240 |
The key thing we can now do is that when we call fit, we can pass in that optimizer, and a list of callbacks. 00:57:36.240 |
And specifically we're going to use the cosine annealing callback. 00:57:41.940 |
And so the cosine annealing callback requires a layer optimizer object. 00:57:47.560 |
And so what this is going to do is it's going to do cosine annealing by changing the learning rates inside this layer optimizer object. 00:57:57.000 |
So the details aren't terribly important, we can talk about them on the forum, it's really this idea 00:58:03.800 |
Which is that now that we've done this, we can say create a cosine annealing callback 00:58:09.120 |
which is going to update the learning rates in this layer optimizer. 00:58:14.560 |
The length of an epoch is equal to this here. 00:58:20.920 |
Well it's whatever the length of this data loader is, because it's going to be doing the 00:58:25.680 |
cosine annealing, it needs to know how often to reset. 00:58:31.480 |
And then you can pass in the cycle_mult in the usual way. 00:58:34.800 |
And then we can even save our model automatically, like remember how there was that cycle_save_name 00:58:41.760 |
parameter that we can pass to learn.fit? 00:58:46.480 |
It sets an on_cycle_end callback, and so here I've defined that callback as something that saves the model. 00:58:57.080 |
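Roughly, the pieces fit together like this; LayerOptimizer, CosAnneal, and the fit signature shown here are from my recollection of the 0.7-era fastai library, so double-check the names against the version you're running:

    from fastai.learner import *   # 2018-era fastai (0.7); exact import paths from memory
    import torch.optim as optim
    import torch.nn.functional as F

    m = CharSeqStatefulLSTM(md.nt, n_fac, 512, 2).cuda()   # the model sketched above

    # LayerOptimizer wraps the optimizer so fastai's callbacks know how to adjust its learning
    # rates; arguments are the optimizer constructor, the model, the learning rate, weight decay
    lo = LayerOptimizer(optim.Adam, m, 1e-2, 1e-5)

    # cosine annealing: len(md.trn_dl) tells the callback how many mini-batches make one cycle
    cb = [CosAnneal(lo, len(md.trn_dl), cycle_mult=2)]

    # fit outside the Learner class: model, model data, epochs, optimizer, loss, callbacks
    fit(m, md, 2, lo.opt, F.nll_loss, callbacks=cb)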
So there's quite a lot of cool stuff that you can do with callbacks. 00:59:02.400 |
Callbacks are basically things where you can define at the start of training, or at the 00:59:06.320 |
start of an epoch, or at the start of a batch, or at the end of training, or at the end of 00:59:09.920 |
an epoch, or at the end of a batch, please call this code. 00:59:13.640 |
And so we've written some for you, including SGDR, which is the cosine annealing callback. 00:59:21.760 |
And then Sahar recently wrote a new callback to implement the new approach to decoupled weight decay. 00:59:28.760 |
We use callbacks to draw those little graphs of the loss over time, so there's lots of things you can do with them. 00:59:36.600 |
So in this case, by passing in that callback, we're getting SGDR, and that's able to get 00:59:43.480 |
us down to 1.31 here, and then we can train a little bit more, and eventually get down to 1.25. 00:59:59.520 |
And so if we pass in a few characters of text, we get, not surprisingly, an 'e' after 'for thos'. 01:00:07.160 |
Let's then do 400 characters, and now we have our own Nietzsche. 01:00:12.700 |
So Nietzsche tends to start his sections with a number and a dot. 01:00:15.600 |
So 293, perhaps that every life of values of blood, of intercourse, when it senses there 01:00:22.000 |
is unscrupulous, his very rights and still impulse love. 01:00:25.920 |
So it's slightly less clear than Nietzsche normally, but it gets the tone right. 01:00:33.240 |
And it's actually quite interesting to play around with training these character-based 01:00:39.220 |
language models, to run this at different levels of loss, to get a sense of what the output looks like. 01:00:46.160 |
You really notice that this is like 1.25, and at slightly worse, like 1.3, this looks like junk. 01:00:57.440 |
There's punctuation in random places and nothing makes sense. 01:01:02.560 |
And you start to realize that the difference between Nietzsche and random junk is not that much. 01:01:12.120 |
And so if you train this for a little bit longer, you'll suddenly find it's making more 01:01:18.360 |
So if you are playing around with NLP stuff, particularly generative stuff like this, and 01:01:24.000 |
you're like, the results are kind of okay but not great, don't be disheartened because 01:01:30.600 |
that means you're actually very very nearly there. 01:01:33.520 |
The difference between something which is starting to create something which almost 01:01:37.880 |
vaguely looks English if you squint, and something that's actually a very good generation, is not that big. 01:01:48.800 |
So let's take a 5-minute break, we'll come back at 7.45 and we're going to go back to computer vision. 01:02:09.660 |
So now we're looking at the lesson 7 CIFAR-10 notebook. 01:02:20.280 |
It's a really well-known dataset in academia. 01:02:24.480 |
And it's actually pretty old by computer vision standards, well before ImageNet was around, 01:02:34.760 |
You might wonder why we're going to be looking at such an old dataset, and actually I think 01:02:41.080 |
small datasets are much more interesting than ImageNet. 01:02:46.840 |
Because most of the time you're likely to be working with stuff with a small number 01:02:51.340 |
of thousands of images rather than 1.5 million images. 01:02:56.040 |
Some of you will work with 1.5 million images, but most of you won't. 01:02:59.520 |
So learning how to use these kind of datasets I think is much more interesting. 01:03:03.660 |
Often also a lot of the stuff we're looking at in medical imaging, we're looking at the 01:03:08.120 |
specific area where there's a lung nodule, you're probably looking at 32x32 pixels at 01:03:14.520 |
most as being the area where that lung nodule actually exists. 01:03:18.720 |
And so CIFAR-10 is small both in terms of it doesn't have many images, and the images 01:03:22.760 |
are very small, and so therefore I think in a lot of ways it's much more challenging than ImageNet. 01:03:33.920 |
And also, most importantly, you can run stuff much more quickly on it, so it's much better 01:03:38.580 |
to test out your algorithms with something you can run quickly, and it's still challenging. 01:03:44.960 |
And so I hear a lot of researchers complain about how they can't afford to study all the 01:03:50.920 |
different versions of their algorithm properly because it's too expensive, and they're doing it all on ImageNet. 01:03:56.720 |
So it's literally a week of expensive GPU work for every study they do, and I don't understand 01:04:03.920 |
why you would do that kind of study on ImageNet, it doesn't make sense. 01:04:08.800 |
And so there's been a lot of debate about this this week because a really interesting 01:04:18.200 |
researcher named Ali Rahimi at NIPS this week gave a talk, a really great talk about the 01:04:24.480 |
need for rigor in experiments in deep learning, and he felt like there's a lack of rigor. 01:04:31.000 |
And I've talked to him about it quite a bit since that time, and I'm not sure we yet quite 01:04:38.880 |
understand each other as to where we're coming from, but we have very similar kinds of concerns, 01:04:44.160 |
which is basically people aren't doing carefully tuned, carefully thought about experiments, 01:04:50.280 |
but instead they throw lots of GPUs and lots of data at it and call it a day. 01:04:55.320 |
And so this idea of saying, well, is my algorithm meant to be good at small images, at small 01:05:04.200 |
data sets, well if so, let's study it on CIFAR-10 rather than studying it on ImageNet 01:05:09.280 |
and then do more studies of different versions of the algorithm, turning different bits on 01:05:13.520 |
and off, understand which parts are actually important, and so forth. 01:05:19.600 |
People also complain a lot about MNIST, which we've looked at before, and I would say the 01:05:24.520 |
same thing about MNIST, which is like if you're actually trying to understand which parts 01:05:28.360 |
of your algorithm make a difference and why, using MNIST for that kind of study is a perfectly reasonable thing to do. 01:05:34.120 |
And all these people who complain about MNIST, I think they're just showing off. 01:05:38.320 |
They're saying, I work at Google and I have a pod of TPUs and I have $100,000 a week of compute to spend. 01:05:47.120 |
But I think that's all it is, it's just signaling rather than actually academically rigorous. 01:05:53.560 |
Okay, so CIFAR-10, you can download from here. 01:05:58.000 |
This person has very kindly made it available in image form. 01:06:03.600 |
If you Google for CIFAR-10, you'll find a much less convenient form, so please use this one. 01:06:10.360 |
Once you download it, you can use it in the usual way. 01:06:16.840 |
So here's a list of the classes that are there. 01:06:21.760 |
Now you'll see here I've created this thing called stats. 01:06:25.320 |
Normally when we've been using pre-trained models, we have been saying transforms from 01:06:33.040 |
model, and that's actually created the necessary transforms to convert our dataset into a normalized 01:06:41.960 |
dataset based on the means and standard deviations of each channel in the original model that 01:06:48.880 |
In our case, this time we've got to train a model from scratch, so we have no such thing. 01:06:54.760 |
So we actually need to tell it the mean and standard deviation of our data to normalize with. 01:07:02.160 |
And so in this case, I haven't included the code here to do it. 01:07:05.000 |
You should try this yourself to confirm that you can do it and understand where these numbers come from. 01:07:09.960 |
But this is just the mean per channel and the standard deviation per channel of all the images in the training set. 01:07:19.080 |
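One way to compute those numbers yourself, as a sketch, assuming you've loaded the training images into a single NumPy array of shape (n_images, height, width, 3) with values scaled to 0-1:

    import numpy as np

    def channel_stats(imgs):
        mean = imgs.mean(axis=(0, 1, 2))   # one mean per channel (R, G, B)
        std = imgs.std(axis=(0, 1, 2))     # one standard deviation per channel
        return mean, std

    # the commonly quoted CIFAR-10 per-channel statistics, approximately
    stats = (np.array([0.4914, 0.4822, 0.4465]), np.array([0.2470, 0.2435, 0.2616]))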
So we're going to try and create a model from scratch. 01:07:25.300 |
And so the first thing we need is some transformations. 01:07:28.680 |
So for CIFAR-10, people generally do data augmentation of simply flipping randomly horizontally. 01:07:37.980 |
So here's how we can create a specific list of augmentations to use. 01:07:44.400 |
And then they also tend to add a little bit of black padding around the edge and then 01:07:49.560 |
randomly pick a 32x32 spot from within that padded image. 01:07:54.800 |
So if you add the pad parameter to any of the fastai transform creators, it'll do both of those things for you. 01:08:04.080 |
And so in this case, I'm just going to add 4 pixels around each side. 01:08:12.200 |
And so now that I've got my transforms, I can go ahead and create my image classifier data object. 01:08:21.160 |
I'm going to use a batch size of 256 because these are pretty small, so it's going to let me do a bit more at a time. 01:08:33.120 |
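In code, the data setup looks roughly like this; tfms_from_stats, RandomFlip, and ImageClassifierData.from_paths are the 0.7-era fastai names as I remember them, so treat this as a sketch:

    from fastai.conv_learner import *   # 2018-era fastai (0.7); names from memory
    import numpy as np

    PATH = 'data/cifar10/'
    sz, bs = 32, 256

    stats = (np.array([0.4914, 0.4822, 0.4465]), np.array([0.2470, 0.2435, 0.2616]))

    # random horizontal flips, plus 4 pixels of padding and a random 32x32 crop
    tfms = tfms_from_stats(stats, sz, aug_tfms=[RandomFlip()], pad=sz // 8)
    data = ImageClassifierData.from_paths(PATH, val_name='test', tfms=tfms, bs=bs)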
And just to show you how tough this is, what's that? 01:08:43.360 |
So I guess it's this big thing, whatever the thing is called, there's your frog. 01:08:49.880 |
So these are the kinds of things that we want to look at. 01:08:54.440 |
So I'm going to start out, so our student, Karim, we saw one of his posts earlier in 01:08:59.560 |
this course, he made this really cool notebook which shows how different optimizers work. 01:09:11.300 |
So Karim made this really cool notebook, I think it was maybe last week, in which he 01:09:15.920 |
showed how to create various different optimizers from scratch. 01:09:19.920 |
So this is kind of like the Excel thing I had, but this is the Python version of Momentum 01:09:24.480 |
and Adam and Nesterov and Adagrad, all written from scratch, which is very cool. 01:09:29.340 |
One of the nice things he did was he showed a tiny little general-purpose fully connected network class. 01:09:41.660 |
So here's a simple class which has a list of fully connected layers. 01:09:50.200 |
Whenever you create a list of layers in PyTorch, you have to wrap it in an nn.ModuleList just 01:09:55.840 |
to tell PyTorch to register these as attributes. 01:10:01.800 |
And so then we just go ahead and flatten the data that comes in, because it's fully connected 01:10:05.960 |
layers, and then go through each layer, call that linear layer, apply the ReLU to it, and at the end do a log_softmax. 01:10:16.400 |
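Here's a sketch of that class, close to Karim's version as I remember it from the notebook:

    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleNet(nn.Module):
        def __init__(self, layers):
            super().__init__()
            # ModuleList registers each Linear layer as a child of the module
            self.layers = nn.ModuleList(
                [nn.Linear(layers[i], layers[i + 1]) for i in range(len(layers) - 1)])

        def forward(self, x):
            x = x.view(x.size(0), -1)          # flatten the incoming image
            for l in self.layers:
                l_x = l(x)
                x = F.relu(l_x)
            return F.log_softmax(l_x, dim=-1)  # softmax over the final layer's raw output

    # a model with one hidden layer of 40 units for 32x32x3 inputs and 10 classes
    net = SimpleNet([32 * 32 * 3, 40, 10])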
So there's a really simple approach, and so we can now take that model and now I'm going 01:10:23.160 |
to show you how to step up one level of the API higher. 01:10:26.720 |
Rather than calling the fit function, we're going to create a learn object, but we're 01:10:30.500 |
going to create a learn object from a custom model. 01:10:34.720 |
And so we can do that by saying we want a convolutional learner, we want to create it 01:10:38.880 |
from a model and from some data, and the model is this one. 01:10:44.880 |
This is just a general PyTorch model, and this is a model data object of the usual kind. 01:10:53.240 |
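The step up the API looks roughly like this (again 0.7-era fastai names from memory, with an illustrative learning rate); the learner then gives you lr_find, fit with cycle_len, and so on for free:

    from fastai.conv_learner import *   # 2018-era fastai (0.7)

    learn = ConvLearner.from_model_data(SimpleNet([32 * 32 * 3, 40, 10]), data)

    learn                    # prints the wrapped PyTorch model
    learn.summary()          # layer-by-layer detail
    learn.lr_find()
    learn.fit(1e-2, 2)                   # two epochs at a fixed learning rate
    learn.fit(1e-2, 2, cycle_len=1)      # then with cosine-annealed restarts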
So this is a bit easier than what we just saw with the RNN -- we don't have to fiddle 01:10:57.080 |
around with layer optimizers and cosine annealing callbacks and whatever. 01:11:01.640 |
This is now a learner that we can do all the usual stuff with, but we can do it with any 01:11:10.500 |
So if we just go Learn, that will go ahead and print it out. 01:11:15.520 |
You can see we've got 3,072 features coming in because we've got 32 by 32 pixels by 3 channels. 01:11:23.120 |
And then we've got 40 features coming out of the first layer, that's going to go into 01:11:26.840 |
the second layer, 10 features coming out because we've got the 10 CIFAR-10 categories. 01:11:34.920 |
You can call dot summary to see that in a little bit more detail. 01:11:38.520 |
We can do lr_find, we can plot that, we can then go fit, and we can use cycle lengths and all the usual stuff. 01:11:47.320 |
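For reference, the calls look roughly like this in the old fastai (0.7) API used in this course; I'm writing this from memory, so treat the exact names and the learning rate as assumptions.

```python
# Assumed fastai 0.7-style usage: wrap any custom PyTorch model in a learner,
# then use the usual training utilities on it.
model = SimpleNet([32 * 32 * 3, 40, 10])
learn = ConvLearner.from_model_data(model, data)   # data is a model data object

learn.lr_find()                   # learning rate finder
learn.sched.plot()                # plot loss against learning rate
learn.fit(1e-2, 2)                # a couple of epochs (the rate here is illustrative)
learn.fit(1e-2, 2, cycle_len=1)   # then with cosine-annealed cycles
```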
So with this simple model -- how many hidden layers do we have? 01:11:52.720 |
One hidden layer and one output layer, so it's a one-hidden-layer model. 01:11:59.840 |
And here we can see the number of parameters we have is over 120,000. 01:12:12.220 |
So not great, so let's kind of try and improve it. 01:12:17.280 |
And so the goal here is we're going to try and eventually replicate the basic architecture of a ResNet. 01:12:24.960 |
So that's where we're going to try and get to here, to gradually build up to a ResNet. 01:12:30.560 |
So the first step is to replace our fully connected model with a convolutional model. 01:12:37.080 |
So to remind you, a fully connected layer is simply doing a dot product. 01:12:49.560 |
So if we had all of these data points and all of these weights, then we basically do 01:13:01.040 |
a sum product of all of those together; in other words, it's a matrix multiply. 01:13:09.520 |
And so the weight matrix is going to contain a weight for every element of the input for every element of the output. 01:13:18.360 |
So that's why we have here a pretty big weight matrix. 01:13:26.020 |
And so that's why despite the fact that we have such a crappy accuracy, we have a lot 01:13:31.420 |
of parameters, because in this very first layer we've got 3,072 coming in and 40 coming out, which is over 120,000 weights just there. 01:13:44.400 |
And so we end up not using them very efficiently, because we're basically saying every single pixel gets its own separate weight. 01:13:51.720 |
And of course what we really want to do is find groups of 3x3 pixels that have particular 01:13:57.160 |
patterns to them, and remember we call that a convolution. 01:14:12.420 |
We have a little 3x3 section of our image and a corresponding 3x3 set of filters, or 01:14:21.840 |
a filter with a 3x3 kernel, and we just do a sum product of just that 3x3 by that 3x3. 01:14:31.000 |
And then we do that for every single part of our image. 01:14:36.580 |
And so when we do that across the whole image, that's called a convolution. 01:14:40.760 |
And remember, in this case we actually had multiple filters, so the result of that convolution 01:14:46.920 |
actually had a tensor with an additional third dimension to it effectively. 01:14:56.920 |
So let's take exactly the same code that we had before, but we're going to replace nn.Linear with nn.Conv2d. 01:15:08.040 |
Now what I want to do in this case is, each time I have a layer, I want to make the next layer's grid size smaller. 01:15:16.540 |
And so the way I did that in my Excel example was I used max_pooling. 01:15:23.080 |
So max_pooling took every 2x2 section and replaced it with its maximum value. 01:15:31.020 |
Nowadays we don't use that kind of max_pooling much at all. 01:15:36.120 |
Instead nowadays what we tend to do is do what's called a Stride 2 convolution. 01:15:40.960 |
A Stride 2 convolution, rather than saying let's go through every single 3x3, skips every second one. 01:15:55.600 |
So rather than moving this 3x3 1 to the right, we move it 2 to the right. 01:16:01.040 |
And then when we get to the end of the row, rather than moving one row down, we move two rows down. 01:16:09.740 |
And so a Stride 2 convolution has the same kind of effect as a max_pooling, which is 01:16:15.100 |
you end up halving the resolution in each dimension. 01:16:23.960 |
We can say we want it to be 3x3 by saying kernel size, and then the first two parameters 01:16:28.440 |
are exactly the same as nn.Linear: they're the number of features coming in and the number of features coming out. 01:16:35.720 |
So we create a module list of those layers, and then at the very end of that, so in this 01:16:43.400 |
case I'm going to say I've got three channels coming in, and the first layer will come out with 20 filters, then 40, then 80. 01:16:52.200 |
So if we look at the summary, we're going to start with a 32x32, and we're going to spit out a 15x15, then smaller and smaller after each stride 2 convolution. 01:17:04.020 |
And so what do we do now to get that down to a prediction of one of 10 classes? 01:17:11.200 |
What we do is we do something called adaptive max_pooling, and what's pretty standard 01:17:16.880 |
now for state-of-the-art algorithms is that in the very last layer we do a max_pool, but 01:17:24.720 |
rather than doing a 2x2 max_pool, we say it doesn't have to be 2x2, it could have been 01:17:31.320 |
3x3, which is like replace every 3x3 pixels with its maximum, it could have been 4x4. 01:17:37.720 |
Adaptive max_pool is where you say, I'm not going to tell you how big an area to pool, 01:17:43.920 |
but instead I'm going to tell you how big a resolution to create. 01:17:49.760 |
So if I said, for example, I think my input here is 28x28, if I said do a 14x14 adaptive 01:17:59.080 |
max_pool, that would be the same as a 2x2 max_pool, because in other words it's saying create a 14x14 output. 01:18:07.080 |
If I said do a 2x2 adaptive max_pool, then that would be the same as saying do a 14x14 max_pool. 01:18:17.280 |
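You can check that equivalence directly in plain PyTorch; this little snippet is just an illustration of the point, not something from the notebook.

```python
# Adaptive pooling specifies the output size; plain pooling specifies the window.
# On a 28x28 input the two pairs below pick exactly the same maxima.
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 28, 28)    # batch x channels x height x width
assert torch.equal(F.adaptive_max_pool2d(x, 14), F.max_pool2d(x, 2))
assert torch.equal(F.adaptive_max_pool2d(x, 2),  F.max_pool2d(x, 14))
print(F.adaptive_max_pool2d(x, 1).shape)   # torch.Size([1, 8, 1, 1])
```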
And so what we pretty much always do in modern CNNs is we make our penultimate layer a 1x1 adaptive max_pool. 01:18:28.480 |
So in other words, find the single largest cell and use that as our new activation. 01:18:39.520 |
And so once we've got that, we've now got a 1x1 tensor, or actually 1x1 by the number of filters. 01:18:49.920 |
So we can then on top of that go x.view(x.size(0), -1), and there are no other dimensions left after that. 01:19:02.960 |
So this is going to return a matrix of mini-batch by number of features. 01:19:09.520 |
And so then we can feed that into a linear layer with however many classes we need. 01:19:17.680 |
So you can see here the last thing I pass in is how many classes am I trying to predict, 01:19:22.560 |
and that's what's going to be used to create that last layer. 01:19:25.200 |
So it goes through every convolutional layer, does a convolution, does a ReLU, and does an adaptive max_pool at the end. 01:19:34.640 |
This dot view just gets rid of those trailing unit axes, the 1x1 axes, which are not necessary. 01:19:43.840 |
That allows us to feed that into our final linear layer, which spits out something of the size of the number of classes. 01:19:59.960 |
The adaptive max_pool makes it 80 by 1 by 1, and then our dot view makes it just mini-batch 01:20:08.160 |
size by 80, and then finally a linear layer which takes it from 80 to 10, which is what we need. 01:20:16.480 |
So that's our most basic -- you'd call this a fully convolutional network, so a fully convolutional 01:20:23.160 |
network is something where every layer is convolutional except for the very last. 01:20:32.280 |
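Putting those pieces together, a sketch of that fully convolutional network might look like this; the channel sizes are illustrative and the details are an approximation rather than the exact notebook code.

```python
# Stride-2 3x3 convs, adaptive max pool down to 1x1, flatten, one linear layer.
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
    def __init__(self, layers, num_classes):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(layers[i], layers[i + 1], kernel_size=3, stride=2)
            for i in range(len(layers) - 1)])
        self.pool = nn.AdaptiveMaxPool2d(1)          # whatever the grid, output 1x1
        self.out = nn.Linear(layers[-1], num_classes)

    def forward(self, x):
        for layer in self.layers:
            x = F.relu(layer(x))                     # each stride-2 conv halves the grid
        x = self.pool(x)                             # batch x channels x 1 x 1
        x = x.view(x.size(0), -1)                    # drop the trailing 1x1 axes
        return F.log_softmax(self.out(x), dim=-1)

# e.g. ConvNet([3, 20, 40, 80], 10) for CIFAR-10
```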
So again, we can now go lr_find, and now in this case when I did lr_find, it went through 01:20:39.840 |
the entire data set and was still getting better. 01:20:43.440 |
And so in other words, the default final learning rate it tries is 10, and even at that point the loss was still getting better. 01:20:51.600 |
So you can always override the final learning rate by saying end_lr=, and that'll just get it to keep going to a higher learning rate. 01:20:59.320 |
So here is the learning rate finder, and so I picked 10^-1, trained that for a while, and 01:21:09.640 |
that's looking pretty good, so then I tried it with a cycle length of 1, and it's starting to improve further. 01:21:16.360 |
So you can see here the number of parameters I have are about 500, 7,000, and 28,000 per layer, so roughly 30,000 in total. 01:21:28.100 |
So I have about 1/4 of the number of parameters, but my accuracy has gone up from 47% to 60%. 01:21:36.640 |
And the time per epoch here is under 30 seconds, and here also. 01:21:44.800 |
And that's not surprising because when you use small simple architectures, most of the 01:21:49.060 |
time is the memory transfer, the actual time during the compute is trivial. 01:21:57.820 |
So I'm going to refactor this slightly because I want to try and put less stuff inside my 01:22:04.400 |
forward, and so calling relu every time doesn't seem ideal. 01:22:09.160 |
So I'm going to create a new class called conv_layer, and the conv_layer class is going 01:22:15.740 |
to contain a convolution with a kernel size of 3 and a stride of 2. 01:22:20.960 |
One thing I'm going to do now is add padding. 01:22:23.800 |
Did you notice here the first layer went from 32x32 to 15x15, not 16x16? 01:22:31.620 |
And the reason for that is that at the very edge of your convolution, here, see how this 01:22:43.320 |
first convolution, there isn't a convolution whose middle is the top left point, because there's nothing outside it. 01:22:52.840 |
Whereas if we had put a row of 0's at the top and a column of 0's along each edge, then there would be. 01:23:02.600 |
So pad=1 adds that little layer of 0's around the edge for us. 01:23:11.320 |
And so this way we're going to make sure that we go 32x32 to 16x16 to 8x8. 01:23:17.080 |
It doesn't matter too much when you've got these bigger layers, but by the time you get 01:23:20.400 |
down to 4x4, you really don't want to throw away a whole piece. 01:23:28.560 |
So by refactoring it to put this conv with its defaults here, and then in the forward I'll 01:23:34.440 |
put the ReLU in here as well, it makes my ConvNet a little bit smaller, and more to the 01:23:41.080 |
point, it's going to be easier for me to make sure that everything's correct in the future. 01:23:47.740 |
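A sketch of that refactored layer, with the padding and the ReLU folded in; again this is an approximation of the notebook's conv_layer rather than a copy of it.

```python
# One stride-2, 3x3 convolution with a ring of zero padding, plus the ReLU.
import torch.nn as nn
import torch.nn.functional as F

class ConvLayer(nn.Module):
    def __init__(self, ni, nf):
        super().__init__()
        # padding=1 adds the border of zeros, so 32x32 -> 16x16 instead of 15x15
        self.conv = nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x))
```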
So now you know not only how to create your own neural network model, but how to create your own neural network layer. 01:23:58.880 |
This is such a cool thing about PyTorch: a layer definition and a neural network definition are literally the same thing. 01:24:09.520 |
And so anytime you've got a layer, you can use it as a neural net, and anytime you have a neural net, you can use it as a layer. 01:24:16.800 |
So this is now the exact same thing as we had before. 01:24:23.380 |
And another thing just to show you, you can do things differently. 01:24:26.060 |
Back here, my max_pool I did as an object: I used the class nn.AdaptiveMaxPool2d, and 01:24:35.440 |
I stuck it in this attribute and then I called it. 01:24:40.760 |
There's no weights inside max_pooling, so I can actually do it with a little bit less code by calling it as a function. 01:24:49.000 |
So everything that you can do as a class, you can also do as a function inside this capital F, the functional namespace. 01:24:58.200 |
So this should be a tiny bit better because this time I've got the padding. 01:25:05.080 |
I didn't train it for as long to actually check, so let's skip over that. 01:25:13.640 |
So one issue here is that in the end, when I tried to add more layers, I had trouble training it. 01:25:25.960 |
The reason I was having trouble training it was if I used larger learning rates, it would 01:25:30.920 |
go off to NaN, and if I used smaller learning rates, it kind of takes forever and doesn't get anywhere. 01:25:40.420 |
So to make my model more resilient, I'm going to use something called batch normalization, or batchnorm for short. 01:25:48.360 |
And batchnorm is a couple of years old now, and it's been pretty transformative since 01:25:55.320 |
it came along because it suddenly makes it really easy to train deeper networks. 01:26:01.880 |
So the network I'm going to create is going to have more layers. 01:26:05.120 |
I've got 1, 2, 3, 4, 5 convolutional layers plus a fully connected layer. 01:26:10.840 |
So back in the old days, that would be considered a pretty deep network, and it would be considered really hard to train. 01:26:16.840 |
Nowadays it's super simple thanks to batchnorm. 01:26:20.700 |
Now to use batchnorm, you can just write in nn.BatchNorm, but to learn about it, we're going to write it from scratch. 01:26:28.240 |
So the basic idea of batchnorm is that we've got some vector of activations. 01:26:36.440 |
Any time I draw a vector of activations, obviously I mean you can repeat it for the minibatch, 01:26:41.840 |
So we've got some vector of activations, and it's coming into some layer, so probably some 01:26:49.840 |
convolutional matrix multiplication, and then something comes out the other side. 01:26:56.600 |
So imagine this is just a matrix multiply, say it was an identity matrix. 01:27:11.160 |
Then every time I multiply it by that across lots and lots of layers, my activations are 01:27:15.360 |
not getting bigger, they're not getting smaller, they're not changing at all. 01:27:19.960 |
That's all fine, but imagine if it was actually like 2, 2, 2. 01:27:27.160 |
And so if every one of my weight matrices or filters was like that, then my activations are going to double at every layer. 01:27:34.760 |
And so suddenly I've got this exponential growth, and in deep models that's going to 01:27:41.040 |
be a disaster because my gradients are exploding at an exponential rate. 01:27:46.840 |
And so the challenge you have is that, unless you try carefully to 01:27:54.600 |
deal with it, it's very likely that your weight matrices on average are going to cause your activations 01:28:03.040 |
to keep getting smaller and smaller, or keep getting bigger and bigger. 01:28:06.280 |
You have to carefully control things to make sure that they stay at a reasonable size, 01:28:16.760 |
So we start things off with 0 mean standard deviation 1 by normalizing the inputs, but 01:28:23.920 |
what we'd really like to do is to normalize every layer, not just the inputs. 01:28:36.620 |
So here I've created a bn layer which is exactly like my conv layer. 01:28:40.800 |
It's got my conv2d with my stride, my padding. 01:28:45.200 |
I do my conv and my relu, and then I calculate the mean of each channel or of each filter, 01:28:55.560 |
and the standard deviation of each channel or each filter, and then I subtract the means and divide by the standard deviations. 01:29:04.540 |
So now I don't actually need to normalize my input at all, because it's actually going to do it for me automatically. 01:29:12.240 |
It's normalizing it per channel, and for later layers it's normalizing it per filter. 01:29:21.120 |
So it turns out that's not enough because SGD is bloody-minded. 01:29:29.960 |
And so if SGD decided that it wants the weight matrix to be like so, where that matrix is 01:29:37.480 |
something which is going to increase the values overall repeatedly, then subtracting the means 01:29:46.720 |
and dividing by the standard deviations just means the next mini-batch is going to try to do the same thing again. 01:29:53.880 |
So it turns out that this actually doesn't help, it literally does nothing because SGD 01:30:00.680 |
is just going to go ahead and undo it again in the next mini-batch. 01:30:06.280 |
So what we do is we create a new multiplier for each channel and a new added value for 01:30:18.600 |
each channel, and we just start them out so that the addition is just a bunch of zeros, so for 01:30:25.440 |
the first layer, 3 zeros, and the multiplier for the first layer is just 3 ones. 01:30:31.360 |
So the number of filters for the first layer is just 3. 01:30:35.060 |
And so we then basically undo exactly what we just did, or potentially we undo them. 01:30:42.160 |
So by saying this is an nn.Parameter, that tells PyTorch you're allowed to learn these as weights. 01:30:50.400 |
So initially it says subtract the means, divide by the standard deviations, multiply by 1, 01:30:57.820 |
add on 0, okay that's fine, nothing much happened there. 01:31:07.280 |
But later on, if SGD wants to kind of scale the layer up, it doesn't have to scale up every single 01:31:13.000 |
value in the matrix, it can just scale up this single trio of numbers, self.m. 01:31:21.240 |
If it wants to shift it all up or down a bit, it doesn't have to shift the entire weight 01:31:25.480 |
matrix, it can just shift this trio of numbers, self.a. 01:31:31.980 |
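Here is a from-scratch sketch of the BnLayer being described: conv, ReLU, normalize per filter, then the learnable multiplier m and offset a. The real nn.BatchNorm2d also keeps running statistics and adds an epsilon to the denominator, which this teaching version leaves out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BnLayer(nn.Module):
    def __init__(self, ni, nf, stride=2, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf, kernel_size, stride=stride,
                              padding=kernel_size // 2)
        self.a = nn.Parameter(torch.zeros(nf, 1, 1))   # learnable per-filter offset
        self.m = nn.Parameter(torch.ones(nf, 1, 1))    # learnable per-filter scale

    def forward(self, x):
        x = F.relu(self.conv(x))
        # flatten each filter's activations so we can take per-filter statistics
        x_chan = x.transpose(0, 1).contiguous().view(x.size(1), -1)
        means = x_chan.mean(1)[:, None, None]
        stds = x_chan.std(1)[:, None, None]   # no epsilon here; see the question below
        return (x - means) / stds * self.m + self.a
```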
So I will say this: in the talk I mentioned at NIPS, Ali Rahimi's talk about rigor, he 01:31:38.080 |
actually pointed to this batch norm paper as being a particularly useful, particularly 01:31:46.160 |
interesting paper where a lot of people don't necessarily know why it works. 01:31:56.080 |
And so if you're thinking subtracting out the means and then adding some learned weights 01:32:02.000 |
of exactly the same rank and size sounds like a weird thing to do, there are a lot of people who feel the same way. 01:32:13.480 |
So at the moment I think the best intuition for what's going on here is that we're normalizing 01:32:22.480 |
the data and then we're saying you can then shift it and scale it using far fewer parameters 01:32:31.400 |
than would have been necessary if I was asking you to actually shift and scale the entire set of weights. 01:32:42.040 |
More importantly, in practice, what this does is it basically allows us to increase our learning 01:32:51.660 |
rates and it increases the resilience of training and allows us to add more layers. 01:32:56.960 |
So once I added a bn layer rather than a conv layer, I found I was able to add more layers 01:33:08.620 |
to my model and it still trained effectively. 01:33:12.240 |
Question: Are we worried about anything here, like maybe we're dividing by something very small? 01:33:27.880 |
I think in the PyTorch version it would probably be divided by self.stds plus epsilon or something. 01:33:38.480 |
This worked fine for me, but that is definitely something to think about if you were trying to use this for real. 01:33:46.840 |
Question: So the self.m and self.a, I'm guessing it's 01:33:55.600 |
getting updated through backpropagation as well? 01:33:58.400 |
Yeah, so by saying it's an nn.Parameter, that's 01:34:02.360 |
how we flag to PyTorch to learn it through backprop. 01:34:09.880 |
The other interesting thing it turns out that BatchNorm does is it regularizes. 01:34:16.000 |
In other words, you can often decrease or remove dropout, or decrease or remove weight decay, when you use it. 01:34:23.720 |
And the reason why is, if you think about it, each mini-batch is going to have a different 01:34:30.280 |
mean and a different standard deviation to the previous mini-batch. 01:34:37.080 |
Because they keep changing, it's kind of changing the meaning of the filters in this subtle way. 01:34:42.840 |
And so it's adding a regularization effect because it's noise. 01:34:46.240 |
When you add noise of any kind, it regularizes your model. 01:34:55.560 |
In the real version of BatchNorm, you don't just use this batch's mean and standard deviation, 01:35:02.560 |
but instead you take an exponentially weighted moving average standard deviation and mean. 01:35:09.020 |
And so if you wanted an exercise to try during the week, that would be a good thing to try. 01:35:14.240 |
But I will point out something very important here, which is if self.training. 01:35:19.560 |
When we are doing our training loop, this will be true when it's being applied to the 01:35:27.580 |
training set, and it will be false when it's being applied to the validation set. 01:35:33.200 |
And this is really important because when you're going through the validation set, you 01:35:36.560 |
do not want to be changing the meaning of the model. 01:35:41.280 |
So this really important idea is that there are some types of layer that are actually 01:35:48.240 |
sensitive to what the mode of the network is, whether it's in training mode or, as PyTorch 01:35:56.560 |
calls it, evaluation mode, or we might say test mode. 01:36:02.160 |
We actually had a bug a couple of weeks ago when we did our mini net for MovieLens, the 01:36:08.080 |
collaborative filtering: we actually had F.dropout in our forward pass without protecting it with 01:36:14.800 |
an if self.training check, as a result of which we were actually doing dropout in 01:36:21.560 |
the validation piece as well as the training piece, which obviously isn't what you want. 01:36:27.240 |
So I've actually gone back and fixed this by changing it to use nn.Dropout. 01:36:34.360 |
And nn.Dropout has already been written for us to check whether it's being used in training or evaluation mode. 01:36:41.520 |
Or alternatively, I could have added an if self.training before I use the dropout here. 01:36:49.640 |
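A tiny sketch of that point, just to show the two equivalent ways of writing it:

```python
# nn.Dropout checks self.training for you; a raw F.dropout call needs the
# training flag (or an if self.training guard) passed explicitly.
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    def __init__(self, ni, nf, p=0.5):
        super().__init__()
        self.drop = nn.Dropout(p)      # becomes a no-op under model.eval()
        self.lin = nn.Linear(ni, nf)

    def forward(self, x):
        x = self.drop(x)
        # equivalent by hand: x = F.dropout(x, p=0.5, training=self.training)
        return self.lin(x)
```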
So it's important to think about that, and the main two, or pretty much the only two 01:36:55.280 |
built into PyTorch where this happens are dropout and batchnorm. 01:37:02.240 |
And so interestingly, this is also a key difference in fast.ai, which no other library does, which is 01:37:09.480 |
that in every other library these means and standard deviations get updated in training mode 01:37:20.400 |
as soon as you basically say I'm training, regardless of whether that layer is set to trainable or frozen. 01:37:27.100 |
And it turns out that with a pre-trained network, that's a terrible idea. 01:37:31.400 |
If you have a pre-trained network, the specific values of those means and standard deviations 01:37:36.040 |
in batch norm, if you change them, it changes the meaning of those pre-trained layers. 01:37:41.640 |
And so in fast.ai, always by default, it won't touch those means and standard deviations while the layer is frozen. 01:37:49.760 |
As soon as you unfreeze it, it'll start updating them. 01:38:01.040 |
If you set learn.bn_freeze to true, it says never touch these means and standard deviations. 01:38:06.920 |
And I've found in practice that that often seems to work a lot better for pre-trained 01:38:14.800 |
models, particularly if you're working with data that's quite similar to what the pre-trained model was trained on. 01:38:29.160 |
It looks like you're doing a lot more work calculating the aggregates, you know, as you go. 01:38:38.040 |
Well, you're doing more work than you would normally do, essentially you're calculating 01:38:42.800 |
all these aggregates as you go through each layer. 01:38:46.680 |
Wouldn't this mean your training, like your epoch time, is slower? 01:38:53.680 |
If you think about what a conv has to do, a conv has to go through every 3x3 with a stride 01:39:01.960 |
and do this multiplication and then addition. 01:39:05.240 |
That is a lot more work than simply calculating the per-channel mean. 01:39:11.600 |
So it adds a little bit of time, but it's less time-intensive than the convolution. 01:39:18.000 |
So how would you basically position the batch norm? 01:39:21.240 |
Would it be right after the convolutional layer, or would it be after the relu? 01:39:28.680 |
So at the moment, we have it after the relu, and in the original batch norm paper, I believe they put it before the relu. 01:39:41.120 |
So there's this idea of something called an ablation study, and an ablation study is something 01:39:48.720 |
where you basically try kind of turning on and off different pieces of your model to see what difference each one makes. 01:39:58.840 |
And one of the things that wasn't done in the original batch norm paper was any kind 01:40:02.980 |
of really effective ablation study, and one of the things therefore that was missing was 01:40:08.040 |
this question which you just asked, which is where do you put the batch norm, before or after the relu. 01:40:14.400 |
And so since that time, that oversight has caused a lot of problems because it turned 01:40:20.200 |
out the original paper didn't actually put it in the best spot. 01:40:25.000 |
And so then other people since then have now figured that out, and now every time I show 01:40:29.060 |
people code where it's actually in the spot that turns out to be better, people always 01:40:33.380 |
say your batch norm is in the wrong spot, and I have to go back and say no, I know that's 01:40:37.720 |
what the paper said, but it turned out that's not actually the right spot, and so it's kind of a pain. 01:40:42.680 |
So there's been a lot of questions about that. 01:40:46.000 |
So, a little bit of a higher level question, so we started out with CIFAR data, so is the 01:40:58.440 |
basic reasoning that you use a smaller data set to quickly train a new model, and then 01:41:06.480 |
you take the same model and use it on a much bigger data set to get a higher accuracy? 01:41:19.580 |
So if you had a large data set, or if you were interested in the question of how good 01:41:28.000 |
is this technique on a large data set, then yes, what you just said would be what I would do. 01:41:33.240 |
I would do lots of testing on a small data set which I had already discovered had the 01:41:38.640 |
same kinds of properties as my larger data set, and therefore my conclusions would likely 01:41:43.260 |
carry forward and then I would test them at the end. 01:41:46.200 |
Having said that, personally, I'm actually more interested in actually studying small 01:41:53.640 |
data sets for their own sake because I find most people I speak to in the real world don't 01:42:00.960 |
have a million images; they have somewhere between about 2,000 and 20,000 images, which seems to be much more common. 01:42:09.640 |
So I'm very interested in having fewer rows because I think it's more valuable in practice. 01:42:17.800 |
I'm also pretty interested in small images, not just for the reason you mentioned which 01:42:22.120 |
is it allows me to test things out more quickly, but also as I mentioned before, often a small 01:42:28.640 |
part of an image actually turns out to be what you're interested in, and that's certainly something that comes up a lot. 01:42:39.040 |
The first is on what you mentioned in terms of small data sets, particularly medical imaging: 01:42:44.440 |
have you heard of, I guess, Vicarious, the startup specializing in that area, and their papers? 01:42:49.480 |
So your opinions on that, and then the second being, this is related to I guess Ali's talk 01:42:56.600 |
at NIPS, so I don't want to say it's controversial, but like Yann LeCun, there was like a really, 01:43:02.000 |
I guess, controversial thread attacking it in terms of what you're talking about as a 01:43:05.920 |
baseline of theory just not keeping up with practice. 01:43:11.080 |
And so I guess I was siding with Yann, whereas Ali actually, he tweeted at me quite a bit 01:43:15.760 |
trying to explain that he wasn't attacking Yann at all, but in fact, he was trying to 01:43:23.760 |
support him, but I just kind of feel like a lot of the theory is just sort of lagging behind. 01:43:29.920 |
It's hard to keep up, other than with Arxiv Sanity from Andrej Karpathy, but if the 01:43:35.040 |
theory isn't keeping up but the industry is the one that's actually setting the standard, 01:43:38.520 |
then doesn't that mean that people who are actual practitioners, the ones like Yann 01:43:43.920 |
LeCun, are the ones publishing the theory that's keeping up to date, rather than academia? 01:43:49.420 |
So I don't have any comments on the vicarious papers because I haven't read them. 01:43:52.960 |
I'm not aware of any of them as actually showing better results than other papers, but I think 01:44:00.960 |
they've come a long way in the last 12 months, so that might be wrong. 01:44:05.720 |
I think the discussion between Yann LeCun and Ali Rahimi is very interesting because 01:44:09.040 |
they're both smart people who have interesting things to say. 01:44:12.680 |
Unfortunately, a lot of people took Ali's talk as meaning something which he says it 01:44:20.040 |
didn't mean, and when I listen to his talk I'm not sure he didn't actually mean it at 01:44:24.880 |
the time, but he clearly doesn't mean it now, which is, he's now said many times he was 01:44:30.160 |
not talking about theory, he was not saying we need more theory at all. 01:44:37.320 |
And so specifically he's also now saying he wished he hadn't used the word rigor, which 01:44:42.880 |
I also wish, because rigor is kind of meaningless, and everybody can kind of claim that when he says rigor he means whatever they want. 01:44:55.680 |
So lots of people have kind of taken his talk as being like "Oh yes, this proves that nobody 01:45:00.080 |
else should work in neural networks unless they are experts at the one thing I'm an expert in." 01:45:06.200 |
So I'm going to catch up with him and talk more about this in January, and hopefully we'll figure out where we actually agree and disagree. 01:45:13.120 |
But basically what we can clearly agree on, and I think Yann LeCun also agrees on, is careful 01:45:22.000 |
experiments are important, just doing things on massive amounts of data using massive amounts 01:45:28.080 |
of TPUs or GPUs is not interesting of itself, and we should instead try to design experiments 01:45:35.120 |
that give us the maximum amount of insight into what's going on. 01:45:38.400 |
So Jeremy, is it a good statement to say something like, dropout and batchnorm are very different things? 01:45:53.160 |
Dropout is a regularization technique, and batchnorm has maybe some regularization effect 01:45:59.400 |
but it's actually just about convergence of the optimization method. 01:46:04.040 |
And I would further say I can't see any reason not to use batchnorm. 01:46:13.920 |
There are versions of batchnorm that in certain situations turned out not to work so well, 01:46:20.560 |
but people have figured out ways around that for nearly every one of those situations now. 01:46:26.240 |
So I would always seek to find a way to use batchnorm. 01:46:30.800 |
It may be a little harder in RNNs at least, but even there, there are ways of doing batchnorm that people have figured out. 01:46:41.080 |
Try and always use batchnorm on every layer if you can. 01:46:44.440 |
The question that somebody asked is, does it mean I can stop normalizing my data? 01:47:03.400 |
It does, although do it anyway because it's not at all hard to do it, and at least that 01:47:06.240 |
way the people using your data, I don't know, they kind of know how you've normalized it. 01:47:12.920 |
And particularly given these issues around how a lot of libraries, in my opinion and in my experiments, 01:47:21.800 |
don't deal with batchnorm correctly for pre-trained models. 01:47:26.680 |
Just remember that when somebody starts retraining, those averages and stuff are going to change 01:47:32.320 |
for your dataset, and so if your new dataset has very different input averages, it could really struggle. 01:47:40.440 |
So yeah, I went through a period where I actually stopped normalizing my data, and things kind of still worked okay. 01:47:57.280 |
All I've done is I've changed conv_layer to bn_layer, but I've done one more thing, which 01:48:03.720 |
is, because I'm trying to get closer and closer to modern approaches, I've added a single 01:48:08.480 |
convolutional layer at the start, with a bigger kernel size and a stride of 1. 01:48:18.640 |
So the basic idea is that I want my first layer to have a richer input. 01:48:25.960 |
So before my first layer had an input of just 3, because it was just 3 channels. 01:48:31.360 |
But if I start with my image, and I kind of take a bigger area, and I do a convolution 01:48:50.320 |
using that bigger area, in this case I'm doing 5x5, then that kind of allows me to try and 01:49:00.800 |
find more interesting, richer features in that 5x5 area. 01:49:06.640 |
And so then I spit out a bigger output, in this case I spit out 10 5x5 filters. 01:49:15.620 |
And so the idea is pretty much every state-of-the-art convolutional architecture now starts out 01:49:21.840 |
with a single conv layer with like a 5x5 or 7x7 or sometimes even like 11x11 convolution 01:49:32.000 |
with quite a few filters, something like 32 filters coming out. 01:49:41.200 |
And it's just a way of trying to -- because I used a stride of 1 and a padding of (kernel 01:49:49.220 |
size - 1) / 2, that means that my output is going to be exactly the same size as my input, but with more filters. 01:49:56.440 |
So this is just a good way of trying to create a richer starting point for my sequence of convolutional layers. 01:50:05.640 |
So that's the basic theory of why I've added this single convolution, which I just do once 01:50:11.140 |
at the start, and then I just go through all my layers, and then I do my adaptive max pooling and my linear layer as before. 01:50:22.960 |
And so you'll see now I can go past the 60% from before: where previously after a couple 01:50:34.360 |
of epochs it was 45%, now after a couple it's 57%, and after a few more I'm up to 68%. 01:50:38.760 |
So you can see the batch norm, and a tiny bit the conv layer at the start, is helping. 01:50:45.160 |
And what's more, you can see this is still increasing. 01:50:52.160 |
So given that this is looking pretty good, an obvious thing to try is to try increasing the depth of the model. 01:51:03.680 |
And now I can't just add more of my stride 2 layers, because remember how each of them halves the grid size? 01:51:12.640 |
I'm basically down to 2x2 at the end, so I can't add much more. 01:51:18.080 |
So what I did instead was I said, okay, here's my original layers, these are my stride 2 layers, and then I created a bunch of stride 1 layers as well. 01:51:30.940 |
And so now I'm saying zip my stride 2 layers and my stride 1 layers together, and so first 01:51:38.920 |
of all do the stride 2 and then do the stride 1. 01:51:42.300 |
So this is now actually twice as deep, but I end up with the exact same 2x2 that I had before. 01:51:58.760 |
And so if I try this, here after a few epochs it's at 65%, which is about where we were before. 01:52:10.960 |
And so the reason it hasn't helped is I'm now too deep even for batch norm to handle 01:52:20.720 |
So my depth is now 5 times 2, which is 10, plus the conv at the start makes 11, plus the linear at the end makes 12. 01:52:31.100 |
So 12 layers deep; it's possible to train a standard ConvNet 12 layers deep, but it starts to get hard. 01:52:39.480 |
And it certainly doesn't seem to be really helping much, if at all. 01:52:43.100 |
So that's where I'm instead going to replace this with a ResNet. 01:52:49.120 |
So ResNet is our final stage, and what a ResNet does is I'm going to replace our BN layer, 01:52:57.880 |
I'm going to inherit from BN layer, and replace our forward with that. 01:53:04.080 |
And that's it, everything else is going to be identical. 01:53:07.840 |
But now I'm going to do way more layers, I'm going to make it 4 times deeper, and it's 01:53:13.160 |
going to train beautifully, just because of that. 01:53:22.120 |
So this is called a ResNet block, and as you can see I'm saying my predictions equals my 01:53:36.240 |
input plus some function, in this case a convolution of my input. 01:53:47.460 |
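In code, that block really is that small. A sketch, reusing the BnLayer above and assuming the stride is 1 and the channel counts match so the addition is shape-compatible:

```python
# y = x + f(x): the layer learns the residual on top of the identity.
class ResnetLayer(BnLayer):
    def forward(self, x):
        return x + super().forward(x)

# e.g. ResnetLayer(80, 80, stride=1)
```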
And so I'm now going to shuffle that around a little bit: f(x) = y - x, where x is 01:54:09.440 |
my prediction from the previous layer. 01:54:14.680 |
And so what this is then doing is it's trying to fit a function to the difference between those two. 01:54:22.580 |
And so the difference is actually the residual. 01:54:36.860 |
So if this is what I'm trying to calculate, my actual y value, and this is the thing that 01:54:45.560 |
I've most recently calculated, then the difference between the two is basically the error in what I've calculated so far. 01:54:53.920 |
And so this is therefore saying: try to find a set of convolutional weights that attempts to fill in the amount we were off by. 01:55:06.920 |
So in other words, if we have some inputs coming in, and then we have this function 01:55:17.600 |
which is basically trying to predict the error, it's like how much are we off by, right? 01:55:25.960 |
So we basically add on this additional prediction of how much were we wrong by. 01:55:31.020 |
And then we add on another prediction of how much were we wrong by that time. 01:55:35.240 |
And add on another prediction of how much were we wrong by that time. 01:55:38.520 |
Then each time we're kind of zooming in, getting closer and closer to our correct answer. 01:55:45.720 |
And each time we're saying we've got to a certain point, but we've still got an error. 01:55:53.320 |
So let's try and create a model that just predicts that residual, and add that onto 01:55:59.120 |
And then let's build another model that predicts the residual, and add that onto our previous 01:56:04.360 |
And if we keep doing that again and again, we should get closer and closer to our answer. 01:56:10.760 |
And this is based on a theory called boosting, which people that have done some machine learning will have come across. 01:56:18.960 |
And so basically the trick here is that by specifying that as being the thing that we're 01:56:30.440 |
trying to calculate, then we kind of get boosting for free. 01:56:40.680 |
It's because we can just juggle that around to show that actually it's just calculating a model of the residual. 01:56:55.920 |
As you can see here, I've now got my standard batch norm layer, which is something which 01:57:02.200 |
is going to reduce my size by 2 because it's got the stride 2. 01:57:06.960 |
And then I've got a ResNet layer of stride 1, and another ResNet layer of stride 1. 01:57:11.560 |
I think I said that was 4 of these, it's actually 3 of these. 01:57:15.580 |
So this is now 3 times deeper, I've zipped through all of those. 01:57:19.120 |
And so I've now got a function of a function of a function. 01:57:23.440 |
So 3 layers per group, and then my conv at the start, and my linear at the end. 01:57:30.680 |
So this is now 3 times bigger than my original. 01:57:35.440 |
And if I fit it, you can see it just keeps going up, and up, and up, and up. 01:57:40.560 |
I keep fitting it more, it keeps going up, and up, and up, and up, and up. 01:57:44.800 |
And it's still going up when I kind of got bored. 01:57:48.940 |
So the ResNet has been a really important development, and it's allowed us to create these much deeper networks. 01:58:04.000 |
The full ResNet does not quite look the way I've described it here. 01:58:09.480 |
The full ResNet doesn't just have one convolution, but it actually has two convolutions. 01:58:16.440 |
So the way people normally draw ResNet blocks is they normally say you've got some input 01:58:21.400 |
coming into the layer, it goes through one convolution, two convolutions, and then gets added back onto the original input. 01:58:42.920 |
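A sketch of that standard two-convolution residual block, for the stride-1, matching-channels case; torchvision's real blocks add downsampling paths and a few other details.

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, nf):
        super().__init__()
        self.conv1 = nn.Conv2d(nf, nf, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(nf)
        self.conv2 = nn.Conv2d(nf, nf, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(nf)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)      # add the input back: the skip connection
```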
And then you'll see also, in every block, one of them is not a ResNet block but a standard convolution. 01:59:06.000 |
This is called a bottleneck layer, and the idea is this is not a ResNet block. 01:59:11.040 |
So from time to time, we actually change the geometry, we're doing the stride 2. 01:59:16.600 |
In ResNet, we don't actually use just a standard convolutional layer, there's actually a different 01:59:22.080 |
form of bottleneck block that I'm not going to teach you in this course; I'm going to leave that for Part 2. 01:59:26.880 |
But as you can see, even this somewhat simplified version of a ResNet still works pretty well. 01:59:38.360 |
And so here I've just increased all of my sizes, I've still got my 3 layers per group, and also I've added dropout. 01:59:46.920 |
So at this point, I'm going to say this is, other than the minor simplification of ResNet, 01:59:52.680 |
a reasonable approximation of a good starting point for a modern architecture. 01:59:58.560 |
And so now I've added in my point 2 dropout, I've increased the size here, and if I train 02:00:03.720 |
this, I can train it for a while, it's going pretty well, I can then add in TTA at the end, and we end up with a pretty good result. 02:00:12.800 |
And this is at a point now where literally I wrote this whole notebook in like 3 hours. 02:00:18.560 |
We can create this thing in 3 hours, and this is like an accuracy that in 2012, 2013 was 02:00:26.320 |
considered pretty much state-of-the-art for CIFAR-10. 02:00:30.560 |
Nowadays, the most recent results are like 97%, so there's plenty of room where we can still improve. 02:00:45.600 |
There isn't really anything -- when we start looking in Part 2 at how to get this right 02:00:52.640 |
up to state-of-the-art, you'll see it's basically better approaches to data augmentation, better 02:00:57.360 |
approaches to regularization, some tweaks on ResNet, but it's all basically this idea. 02:01:04.400 |
"So is the training on the residual method, is that only, looks like it's a generic thing 02:01:19.400 |
Yeah, yes it is, but it's been ignored everywhere else. 02:01:24.360 |
In NLP, something called the transformer architecture recently appeared, and it was shown to be the 02:01:31.400 |
state-of-the-art for translation, and it's got a simple ResNet structure in it. 02:01:40.000 |
I haven't really seen anybody else take advantage of it. 02:01:44.600 |
This general approach, we call these skip connections, this idea of skipping over a 02:01:48.400 |
layer and doing an identity, it's been appearing a lot in computer vision and nobody else much 02:01:55.400 |
seems to be using it, even though there's nothing computer vision specific about it. 02:02:03.780 |
So final stage I want to show you is how to use an extra feature of PyTorch to do something 02:02:12.600 |
cool, and it's going to be a segue into Part 2. 02:02:16.720 |
It's going to be our first little hint as to what else we can build on these neural nets. 02:02:22.280 |
It's also going to take us all the way back to lesson 1, which is we're going to do dogs and cats. 02:02:29.080 |
So going all the way back to dogs and cats, we're going to create a ResNet-34. 02:02:34.700 |
So these different ResNet-34, 50, 101, they're basically just different numbers, different 02:02:43.640 |
size blocks, it's like how many of these pieces do you have before each bottleneck block, and 02:02:49.960 |
then how many of these sets of super blocks do you have. 02:02:55.880 |
So if you look at the TorchVision source code, you can actually see the definition of these 02:03:01.720 |
different ResNets, you'll see they're all just different parameters. 02:03:09.440 |
So we're going to use ResNet-34, and so we're going to do this a little bit more by hand. 02:03:16.240 |
So if this is my architecture, this is just the name of a function, then I can call it, passing in True. 02:03:23.160 |
And then true, if we look at the definition, is do I want the pre-trained, so in other 02:03:28.600 |
words, is it going to load in the pre-trained image net weights. 02:03:32.560 |
So m now contains a model, and so I can take a look at it like so. 02:03:39.600 |
And so you can see here what's going on is that inside here I've got my initial 2D convolution, 02:03:54.120 |
And interestingly in this case, it actually starts out with a 7x7 stride 2 convolution. 02:03:58.680 |
There's the padding that we talked about to make sure that we don't lose the edges. 02:04:02.520 |
There's our batchnorm, there's our ReLU, and you get the idea, right? 02:04:08.520 |
And then so here you can now see there's a layer that contains a bunch of blocks. 02:04:15.120 |
So here's a block which contains a conv, batchnorm, ReLU, conv, batchnorm. 02:04:20.600 |
You can't see it printed, but after this is where it does the addition. 02:04:25.680 |
So there's like a whole ResNet block, and then another ResNet block, and then another 02:04:33.560 |
And then you can see also sometimes you see one where there's a stride 2. 02:04:40.760 |
So here's actually one of these bottleneck layers. 02:04:47.240 |
So you can kind of see how this is structured. 02:04:51.440 |
So in our case, sorry I skipped over this a little bit, but the approach that we ended 02:05:00.280 |
up using was to put our ReLU before our batchnorm, whereas here they've got batchnorm, then ReLU. 02:05:26.160 |
So you can see the order that they're using it here. 02:05:29.600 |
And you'll find there's 3 different versions of ResNet floating around. 02:05:36.520 |
The one which actually turns out to be the best is called the Preact ResNet, which has 02:05:41.840 |
a different ordering again, but you can look it up. 02:05:48.160 |
It's basically a different order of where the ReLU and where the batchnorm sit. 02:05:53.240 |
So we're going to start with a standard ResNet 34, and normally what we do is we need to 02:06:01.000 |
now turn this into something that can predict dogs versus cats. 02:06:06.480 |
So currently the final layer has 1000 features because ImageNet has 1000 classes. 02:06:16.500 |
So when you use ConvLearner.pretrained in fast.ai, it actually deletes this layer for you and replaces it. 02:06:29.440 |
And something that as far as I know is unique to fast.ai is we see this average pooling 02:06:35.840 |
layer of size 7x7, so this is basically the adaptive pooling layer. 02:06:41.320 |
But whoever wrote this didn't know about adaptive pooling, so they manually said, I know it's going to be 7x7. 02:06:47.520 |
So in fast.ai, we replace this with adaptive pooling, but we actually do both adaptive average 02:06:52.120 |
pooling and adaptive max pooling, and we then concatenate the two together, which is something 02:07:00.080 |
we invented, but at the same time we invented it, somebody wrote a paper about it, so we can't really claim it. 02:07:06.560 |
But I think we're the only library that provides it, and certainly the only one that does it by default. 02:07:13.440 |
We're going to, for the purpose of this exercise though, do a simple version where we delete 02:07:18.120 |
the last two layers, so we'll grab all the children of the model, we'll delete the last 02:07:22.160 |
two layers, and then instead we're going to add a convolution which just has two outputs. 02:07:35.080 |
Then we're going to do our average pooling, and then we're going to do our softmax. 02:07:41.680 |
So that's a model which you'll see that this one has a fully connected layer at the end, 02:07:49.820 |
this one does not have a fully connected layer at the end. 02:07:53.000 |
But if you think about it, this convolutional layer is going to be 2 filters only, and it's going to be 2 by 7 by 7. 02:08:05.200 |
And so once we then do the average pooling, it's going to end up being just two numbers 02:08:11.240 |
So this is a different way of producing just two numbers. 02:08:13.560 |
I'm not going to say it's better, I'm just going to say it's different, but there's a reason we're doing it this way. 02:08:20.400 |
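A sketch of that replacement head in plain PyTorch; the layer indices and the use of torchvision's resnet34 are assumptions to keep the example self-contained, and fast.ai's own version differs in detail.

```python
# Chop off the original average pool and fully connected layer, then add a
# 2-filter conv, average pool it to one number per filter, and log-softmax.
import torch.nn as nn
from torchvision.models import resnet34

m = resnet34(pretrained=True)              # newer torchvision uses weights= instead
layers = list(m.children())[:-2]           # drop the final pooling and fc layers
layers += [
    nn.Conv2d(512, 2, kernel_size=3, padding=1),  # 2 filters: one per class
    nn.AdaptiveAvgPool2d(1),               # average each 7x7 map to a single number
    nn.Flatten(),
    nn.LogSoftmax(dim=-1),
]
m = nn.Sequential(*layers)
```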
We can now train this model in the usual way. 02:08:23.480 |
So we can say tfms_from_model, ImageClassifierData.from_paths, and then we can use ConvLearner.from_model_data again. 02:08:32.640 |
I'm now going to freeze every single layer except for that new one, which is the 4th last layer. 02:08:48.320 |
So we get 99.1% accuracy, so this approach is working fine. 02:08:55.200 |
We can now do something called class_activation_maps. 02:09:05.800 |
What we're going to do is we're going to try to look at this particular cat, and we're 02:09:11.360 |
going to use a technique called class_activation_maps where we take our model and we ask it which 02:09:16.920 |
parts of this image turned out to be important. 02:09:21.040 |
And when we do this, it's going to feed out, this is the picture it's going to create. 02:09:27.040 |
And so as you can see here, it's found the cat. 02:09:31.920 |
Well the way it did that, we'll kind of work backwards, is to produce this matrix. 02:09:38.320 |
You'll see in this matrix, there's some pretty big numbers around about here, which correspond to where the cat is. 02:09:49.980 |
This matrix is simply equal to the value of this feature matrix times this py vector. 02:10:04.600 |
The py vector is simply equal to the predictions, which in this case said I'm 100% confident it's a cat. 02:10:13.360 |
So this is just equal to the value of the features we get if I just call the model passing in our cat. 02:10:39.960 |
In other words, the value that comes out of the final convolutional layer. 02:10:52.120 |
And so you can see here, the shape of features is 2 filters by 7x7. 02:11:01.880 |
So the idea is, if we multiply that vector by that tensor, then it's going to end up 02:11:12.080 |
grabbing all of the first channel, because that's a 1, and none of the second channel, because that's a 0. 02:11:20.480 |
And so therefore it's going to return the value of the last convolutional layer for the cat channel. 02:11:31.560 |
But if you think about it, the first section lines up with being a cat, and the second section with being a dog. 02:11:38.440 |
So if we multiply that tensor by that tensor, we end up with this matrix. 02:11:45.840 |
And this matrix is which parts are most like a cat. 02:11:51.640 |
Or to put it another way, in our model, the only thing that happened after the convolutional layer was the average pooling. 02:12:00.840 |
So the average pooling layer took that 7x7 grid and said, average out how much each part is cat-like. 02:12:08.360 |
And so my final prediction was the average cattiness of the whole thing. 02:12:17.040 |
And so because it had to be able to average out these things to get the average cattiness, 02:12:22.640 |
that means I could then just take this matrix and resize it to be the same size as my original 02:12:29.740 |
cat and just overlay it on top to get this heatmap. 02:12:34.440 |
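A sketch of that calculation, assuming sf.features holds the hooked 2x7x7 feature map and py holds the model's log-softmax output for this image; both of those names are assumptions here, standing in for whatever the notebook produced.

```python
# Weight the per-class feature maps by the predicted probabilities, sum them,
# and stretch the 7x7 result up to the image size to overlay as a heatmap.
import numpy as np
import cv2   # only used for the resize; any image library would do

feat = sf.features[0].detach().cpu().numpy()        # (2, 7, 7)
probs = np.exp(py[0].detach().cpu().numpy())        # predictions were log-probabilities
cam = (feat * probs[:, None, None]).sum(0)          # weighted sum over the 2 channels
cam = (cam - cam.min()) / (cam.max() - cam.min())   # scale to [0, 1] for display
heatmap = cv2.resize(cam, (224, 224))               # resize to the input image size
```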
So the way you can use this technique at home is to basically calculate this matrix on some large image. 02:12:44.960 |
You can calculate this matrix on a quick small little ConvNet and then zoom into the bit 02:12:50.280 |
that has the highest value, and then rerun it just on that part. 02:12:55.720 |
So this is the area that seems to be the most like a cat or the most like a dog, and then you zoom into that part. 02:13:04.120 |
So I skipped over that pretty quickly because we ran out of time. 02:13:09.680 |
And so we'll be learning more about these kind of approaches in Part 2 and we can talk 02:13:13.040 |
about it more on the forum, but hopefully you get the idea. 02:13:15.400 |
The one thing I totally skipped over was how do we actually ask for that particular layer. 02:13:22.960 |
I'll let you read about this during the week, but basically there's a thing called a hook. 02:13:29.120 |
So we called save_features, which is this little class that we wrote that goes register_forward_hook. 02:13:39.520 |
And basically a forward hook is a special PyTorch thing that gets called every time a layer is calculated. 02:13:49.960 |
It's like a callback that happens every time it calculates a layer. 02:13:52.840 |
And so in this case, it just saved the value of the particular layer that I was interested in. 02:14:01.260 |
And so that way I was able to go inside here and grab those features out after I was done. 02:14:11.000 |
So I called save_features, that gives me my hook, and then later on I can just grab the value it saved. 02:14:18.380 |
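A sketch of that save_features class, which is just a thin wrapper around register_forward_hook:

```python
class SaveFeatures:
    """Stash a copy of a layer's output every time the model runs forward."""
    def __init__(self, module):
        self.hook = module.register_forward_hook(self.hook_fn)

    def hook_fn(self, module, inp, outp):
        self.features = outp          # keep the most recent output around

    def remove(self):
        self.hook.remove()            # detach the hook when finished

# usage: sf = SaveFeatures(m[-4]); preds = m(x); feat = sf.features; sf.remove()
```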
So I skipped over that pretty quickly, but if you look in the PyTorch docs, they have more information about hooks. 02:14:26.040 |
Can you spend five minutes talking about your journey into deep learning and finally how 02:14:37.440 |
can we keep up with research that is important to practitioners? 02:14:46.320 |
I think I'll close more on the latter bit, which is like what now? 02:14:50.480 |
So for those of you who are interested, you should aim to come back for part 2. 02:14:57.540 |
If you're aiming to come back for part 2, how many people would like to come back for part 2? 02:15:05.000 |
So if you want to come back for part 2, be aware of this. 02:15:08.700 |
By that time, you're expected to have mastered all of the techniques we've learned in part 1. 02:15:14.000 |
There's plenty of time between now and then, even if you haven't done much or any ML before, 02:15:19.440 |
but it does assume that you're going to be working at the same level of intensity from 02:15:24.840 |
now until then as you have been, practicing. 02:15:27.920 |
So generally speaking, the people who did well in part 2 last year had watched each 02:15:33.560 |
of the videos about three times, and some of the people I knew had actually discovered 02:15:39.800 |
they learned some of them off by heart by mistake. 02:15:44.520 |
And make sure you get to the point that you can recreate the notebooks without watching the videos. 02:15:50.640 |
And so to make it more interesting, obviously try and recreate the notebooks using different datasets. 02:15:58.520 |
And definitely then just keep up with the forum and you'll see people keep on posting 02:16:03.000 |
more stuff about recent papers and recent advances, and over the next couple of months 02:16:08.060 |
you'll find increasingly less and less of it seems weird and mysterious, and more and more of it makes intuitive sense. 02:16:16.480 |
And so it's a bit of a case of staying tenacious, there's always going to be stuff that you 02:16:21.840 |
don't understand yet, but you'll be surprised. 02:16:25.280 |
If you go back to lesson 1 and 2 now, you'll be like, oh that's all trivial. 02:16:32.720 |
So that's kind of hopefully a bit of your learning journey, and I think the main thing 02:16:40.160 |
I've noticed is the people who succeed are the ones who just keep working at it. 02:16:44.680 |
Now that you're not coming back here every Monday, you're not going to have that forcing function. 02:16:49.120 |
I've noticed the forum suddenly gets busy at 5pm on a Monday. 02:16:53.560 |
It's like, oh, the course is about to start and suddenly these questions start coming 02:16:58.240 |
So now that you don't have that forcing function, try and use some other technique to give yourself 02:17:05.120 |
Maybe you can tell your partner at home, I'm going to try and produce something every Saturday 02:17:09.320 |
for the next 4 weeks, or I'm going to try and finish reading this paper or something. 02:17:16.280 |
So I hope to see you all back in March, and regardless whether I do or don't, it's been 02:17:22.240 |
a really great pleasure to get to know you all, and I hope to keep seeing you on the forums.