Lesson 12 (2019) - Advanced training techniques; ULMFiT from scratch
Chapters
0:00 Introduction
1:05 Learner refactor
3:43 Mixup
7:57 Data augmentation
18:40 Label smoothing
21:50 Half precision floating point
23:40 Nvidia Apex
24:15 Loss scale
26:05 Mixup questions
28:00 ResNet
31:50 ConvLayer
36:15 Res Blocks
46:45 Results
48:35 Transfer learning
50:05 Training from scratch
And this is an exciting lesson because it's where we're going to wrap up all the pieces we've been building. 00:00:14.000 |
And you might be surprised to hear that we're going to wrap up all the pieces for NLP, because we haven't really looked at NLP yet. 00:00:19.600 |
But actually everything we've done is equally applicable to NLP. 00:00:24.560 |
So there's very little to do to get a state-of-the-art result on IMDB sentiment analysis from scratch. 00:00:34.200 |
Before we do, let's finally finish off this slide we've been going through for three lessons now. 00:00:39.960 |
I promised, or at least not quite promised, that we would get something state-of-the-art on ImageNet. 00:00:48.000 |
So we're going to finish off with mixup, label smoothing, and ResNets. 00:00:58.600 |
Before we look at the new stuff, let's go back to notebook 09b_learner. 00:01:06.840 |
I've made a couple of minor changes that I thought you might be interested in. 00:01:12.680 |
So remember last week we refactored the learner to get rid of that awful separate runner. 00:01:17.280 |
So now there's just one thing, which made a lot of our code a lot easier. 00:01:20.280 |
There's still this concept left behind that when you started fitting, you had to tell 00:01:25.320 |
each callback what its learner or runner was. 00:01:28.400 |
Because they're all permanently attached now, I've moved that to the init. 00:01:36.240 |
And so now you can call add_cbs to add a whole bunch of callbacks, or add_cb to add one callback. 00:01:42.600 |
And that happens automatically at the start of training. 00:01:46.960 |
More interesting was when I did this little reformatting exercise where I took all these 00:01:53.920 |
callbacks that used to be on the line underneath the thing before them and lined them up over 00:01:58.360 |
here and suddenly realized that now I can answer all the questions I have in my head 00:02:03.120 |
about our callback system, which is what exactly are the steps in the training loop? 00:02:09.440 |
What exactly are the callbacks that you can use in the training loop? 00:02:18.760 |
Are there any callbacks that don't have a step? 00:02:22.200 |
So it's one of these interesting things where I really don't like the idea of automating 00:02:30.620 |
your formatting and creating rules for formatting, when something like this can, as soon as you line it up, show you exactly what's going on. 00:02:38.560 |
And for me, understanding my code is the only way to make it work. 00:02:41.840 |
Because debugging machine learning code is awful. 00:02:45.920 |
So you've got to make sure that the thing you write makes sense. 00:02:56.440 |
Then more interestingly, we used to create the optimizer in init. 00:03:02.680 |
And you could actually pass in an already created optimizer. 00:03:06.960 |
And the only thing now you can pass in is an optimization function. 00:03:10.040 |
So something that will create an optimizer, which is what we've always been doing anyway. 00:03:14.060 |
And by doing that, we can now create our optimizer when we start fitting. 00:03:21.560 |
Because when we do things like discriminative learning rates and gradual unfreezing and 00:03:25.080 |
layer groups and stuff, we can change things. 00:03:30.440 |
So that's a more significant change -- it's like one line of code, but it's conceptually a big one. 00:03:42.760 |
And now let's move on to mixup and label smoothing. 00:03:46.720 |
So I'm really excited about the stuff we saw at the end of the last lesson where we saw 00:03:52.120 |
how we can use the GPU to do data augmentation. 00:03:55.440 |
Fully randomized, fully GPU accelerated data augmentation using just plain PyTorch operations. 00:04:04.560 |
But it's quite possible we don't need that kind of data augmentation anymore. 00:04:10.680 |
Because in our experimentation with this data augmentation called mixup, we found we can 00:04:15.600 |
remove most other data augmentation and get amazingly good results. 00:04:23.760 |
And also when you use mixup, you can train for a really long time and get really good results. 00:04:31.040 |
And in terms of the results, what happened in the bag of tricks paper was that 00:04:38.920 |
when they turned mixup on, they also started training for 200 epochs instead of 120. 00:04:48.480 |
So be a bit careful when you interpret their paper's table when it goes from label smoothing to mixup. 00:05:04.200 |
They're also nearly doubling the number of epochs they do. 00:05:07.120 |
But you can kind of get a sense that you can get a big decrease in error. 00:05:12.640 |
The other thing they mention in the paper is distillation. 00:05:15.120 |
I'm not going to talk about that because it's a thing where you pre-train some much bigger 00:05:19.880 |
model like a ResNet-152, and then you try and train something smaller that predicts the output of that bigger model. 00:05:24.960 |
The idea of training a really big model just so you can train a smaller model is interesting, but 00:05:29.560 |
it's not exactly training in the way I normally think about it. 00:05:33.800 |
It would be an interesting assignment if somebody wanted to try adding it to the notebooks though. 00:05:39.680 |
You have all the information and I think all the skills you need to do that now. 00:05:45.740 |
So mixup, we start by grabbing our Imagenette dataset, and we grab the make_rgb and resize transforms. 00:05:58.360 |
This is just our quick and dirty resize, we're already doing this for testing purposes. 00:06:02.000 |
Split it up, create a data bunch, all the normal stuff. 00:06:07.060 |
But what we're going to do is we're going to take an image like this and another image like this. 00:06:19.560 |
We're going to take 0.3 times the first image plus 0.7 times the second image, and this is what it looks like. 00:06:29.620 |
Unfortunately, Sylvain and I have different orderings of file names on our machines, so I wrote that it's 00:06:35.200 |
a French horn and a tench, but actually Sylvain's version clearly doesn't have a French horn or a tench here. 00:06:42.880 |
So we're going to create a data augmentation where every time we predict something we're 00:06:49.200 |
going to be predicting a mix of two things like this. 00:06:52.160 |
So we're going to both take the linear combination, 0.3 and 0.7, of the two images but then we're 00:07:01.080 |
going to have to do that for the labels as well, right? 00:07:03.520 |
There's no point predicting the one hot encoded output of this breed of doggy when there's also 30% of a gas pump in there. 00:07:14.000 |
So we're not going to have a one hot encoded output; we're going 00:07:17.560 |
to have a 0.7 encoded doggy and a 0.3 encoded gas pump. 00:07:34.060 |
Wow, there are people talking about things that aren't deep learning. 00:07:46.640 |
So the paper's a pretty nice, easy read by paper standards, and I would definitely suggest you check it out. 00:07:57.560 |
So I've told you what we're going to do, implementation-wise, we have to decide what number to use here. 00:08:06.600 |
And this is a data augmentation method, so the answer is we'll randomize it. 00:08:10.560 |
But we're not going to randomize it from 0 to 1 uniform or 0 to 0.5 uniform, but instead 00:08:16.960 |
we're going to randomize it using shapes like this. 00:08:22.240 |
In other words, when we grab a random number, most of the time it'll be really close to 00:08:27.760 |
0 or really close to 1, and just occasionally it'll be close to 0.5. 00:08:31.680 |
So that way most of the time it'll be pretty easy for our model because it'll be predicting 00:08:35.960 |
one and only one thing, and just occasionally it'll be predicting something that's a pretty even mix of two things. 00:08:43.360 |
So the ability to grab random numbers where this is basically the smoothed 00:08:50.120 |
histogram of how often we're going to see those numbers is called sampling from a probability distribution. 00:08:58.880 |
And basically in nearly all these cases you can start with a uniform random number or 00:09:03.320 |
a normal random number and put it through some kind of function or process to turn it into the shape you want. 00:09:13.240 |
But the paper points out that this particular shape is nicely characterized by something 00:09:17.240 |
called the beta distribution, so that's what we're going to use. 00:09:21.360 |
So it was interesting drawing these because it requires a few interesting bits of math, 00:09:29.800 |
which some of you may be less comfortable with or entirely uncomfortable with. 00:09:35.480 |
For me, every time I see this function, which is called the gamma function, I kind of break 00:09:42.480 |
out in sweats, not just because I've got a cold, but because of the idea of functions 00:09:47.240 |
that I don't know how to describe -- like, how do you even describe this thing? 00:09:50.400 |
But actually, it turns out that, like most things, once you look at it, it's actually pretty straightforward. 00:09:57.080 |
And we're going to be using this function, so I'll just quickly explain what's going on. 00:10:00.520 |
We're going to start with a factorial function, so 1 times 2 times 3 times 4, whatever, right? 00:10:06.960 |
And here these red dots is just the value of the factorial function for a few different 00:10:14.720 |
But don't think of the factorial function as being 1 times 2 times 3 times 4, up to 00:10:20.360 |
n, whatever; instead divide both sides by n, 00:10:28.720 |
and now you've got factorial of n divided by n equals 1 times 2 times ... times n minus 1, so it equals the factorial of n minus 1. 00:10:39.960 |
And so when you define it like that, you suddenly realize there's no reason you can't 00:10:43.440 |
have a function that's defined not just on the integers, but everywhere. 00:10:49.120 |
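(As a quick reference, this is just the standard factorial recurrence and the usual integral that extends it beyond the integers -- standard math, not anything specific to the notebook:)

```latex
\frac{n!}{n} = (n-1)! \qquad\text{and, more generally,}\qquad
\Gamma(z) = \int_0^\infty t^{\,z-1} e^{-t}\,dt, \quad \Gamma(n) = (n-1)! \ \text{for integers } n \ge 1.
```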
This is the point where I stop with the math, right? 00:10:51.000 |
Because to me, if I need a sine function, or a log function, or an exponential function, or whatever, 00:10:55.040 |
I type it into my computer and I get it, right? 00:10:57.140 |
So the actual how you get it is not at all important. 00:11:00.480 |
But the fact of knowing what these functions are and how they're defined is useful. 00:11:07.640 |
Weirdly enough, PyTorch doesn't have a gamma function, but it does have a log-gamma function. 00:11:09.760 |
So we can take log gamma and go e to the power of that to get a gamma function. 00:11:13.480 |
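Here's a minimal sketch of that trick. torch.lgamma is a real PyTorch function; the little gamma wrapper is just an illustration, not necessarily the notebook's exact code:

```python
import torch

def gamma(x):
    # gamma(x) = exp(log(gamma(x))); PyTorch ships lgamma, so we just exponentiate it
    return torch.lgamma(x).exp()

# sanity check: gamma(n) == (n-1)! for positive integers
print(gamma(torch.tensor([2., 3., 4., 5.])))  # tensor([ 1.,  2.,  6., 24.])
```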
And you'll see here, I am breaking my no Greek letters rule. 00:11:17.880 |
And the reason I'm breaking that rule is because a function like this doesn't have a kind of 00:11:24.200 |
domain-specific meaning, or a pure physical analogy, which is how we normally come up with names. 00:11:34.480 |
And so if you're going to call it gamma, you may as well write it like that. 00:11:38.280 |
And why this matters is when you start using it. 00:11:42.400 |
Like look at the difference between writing it out with the actual Unicode and operators 00:11:49.280 |
versus what would happen if you wrote it out long form in Python. 00:11:54.400 |
Like when you're comparing something to a paper, you want something that you can look 00:11:57.920 |
at and straight away say like, oh, that looks very familiar. 00:12:02.080 |
And as long as it's not familiar, you might want to think about how to make it more familiar. 00:12:06.160 |
So I just briefly mentioned that writing these math symbols nowadays is actually pretty easy. 00:12:12.480 |
On Linux, there's a thing called a compose key which is probably already set up for you. 00:12:16.880 |
And if you Google it, you can learn how to turn it on. 00:12:18.800 |
And it's basically like you'll press the right alt button or the caps lock button, followed by a sequence of keys. 00:12:26.080 |
So for example, all the Greek letters are compose, then star, and then the English letter. 00:12:31.560 |
So for example, if I want to type lambda, I would go compose, star, L. So it's just as quick as typing normally. 00:12:40.920 |
Most of the Greek letters are available on a Mac keyboard just with option. 00:12:43.760 |
Unfortunately, nobody's created a decent compose key for Mac yet. 00:12:46.960 |
There's a great compose key for Windows called win compose. 00:12:49.920 |
Anybody who's working with, you know, Greek letters should definitely install and learn to use one of those. 00:12:58.360 |
So there's our gamma function nice and concise. 00:13:03.320 |
And so it turns out that this is how you calculate the value of the beta function, which is just a combination of gamma functions. 00:13:09.160 |
So as I said, the details aren't important, but they're the tools that you can use. 00:13:13.120 |
The basic idea is that we now have something where we can pick some parameter, which is 00:13:17.720 |
called alpha, where if it's high, then it's much more likely that we get an equal mix. 00:13:27.200 |
And this is really important because for data augmentation, we need to be able to tune a 00:13:30.240 |
lever that says how much regularization am I doing? 00:13:37.920 |
And the reason it's important to be able to print these plots out is that when you change 00:13:41.560 |
your alpha, you want to plot it out and see what it looks like, right? 00:13:49.600 |
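A minimal sketch of that kind of check, assuming NumPy and matplotlib rather than whatever plotting code the notebook actually uses:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample mixing weights lambda ~ Beta(alpha, alpha) and look at their histogram
for alpha in (0.2, 0.4, 1.0):
    lam = np.random.beta(alpha, alpha, size=100_000)
    plt.hist(lam, bins=100, density=True, histtype='step', label=f'alpha={alpha}')
plt.legend()
plt.title('Distribution of mixup lambda for different alphas')
plt.show()
```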
So it turns out that all we need to do then is -- we don't actually have to 0.7-hot encode the targets. 00:14:02.360 |
It's actually identical to simply take lambda times the first loss plus 1 minus lambda times the second loss. 00:14:23.160 |
And again, as you can see, we're using the same letters that we'd expect to see in the 00:14:28.720 |
And mixup, remember, is something which is going to change our loss function. 00:14:33.400 |
So we need to know what loss function to change. 00:14:35.680 |
So when you begin fitting, you find out what the old loss function on the learner was, and store it. 00:14:43.720 |
And then when we calculate the loss, we can just go ahead and say, oh, if we're in validation, just use the normal loss. 00:14:51.300 |
And if we're training, then we'll calculate the loss on two different sets of images. 00:14:57.400 |
One is just the regular mini-batch, and for the second we'll randomly 00:15:04.120 |
permute the mini-batch so each image gets paired with another one to mix with. 00:15:08.720 |
So we do that for the image, and we do that for the loss. 00:15:19.160 |
In the last lesson, I created an EWMA function, Exponentially Weighted Moving Average Function, 00:15:25.880 |
which is a really dumb name for it, because actually it was just a linear combination 00:15:31.000 |
It was just v1 times alpha plus v2 times 1 minus alpha. 00:15:38.280 |
You create exponentially weighted moving averages with it by applying it multiple times, but 00:15:42.600 |
the actual function is a linear combination, so I've renamed that to lin_comb, linear combination. 00:15:48.240 |
So this mixup is a linear combination of our actual images and some randomly permuted images 00:15:57.640 |
And our loss is a linear combination of the loss of our two different parts, our normal 00:16:04.320 |
mini-batch and our randomly permuted mini-batch. 00:16:06.800 |
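Here's a rough sketch of those pieces in plain PyTorch. The names (lin_comb, mixup_batch, mixup_loss) and the details are illustrative -- the notebook's MixUp callback is structured differently -- but the idea is the same; note the loss function here is assumed to be unreduced (reduction='none'):

```python
import torch

def lin_comb(v1, v2, beta):
    # linear combination: beta*v1 + (1-beta)*v2
    return beta * v1 + (1 - beta) * v2

def mixup_batch(xb, yb, alpha=0.2):
    # draw one lambda per image from Beta(alpha, alpha)
    lam = torch.distributions.Beta(alpha, alpha).sample((xb.size(0),)).to(xb.device)
    shuffle = torch.randperm(xb.size(0), device=xb.device)   # random pairing of images
    xb1, yb1 = xb[shuffle], yb[shuffle]
    x_mixed = lin_comb(xb, xb1, lam.view(-1, 1, 1, 1))        # broadcast over C,H,W
    return x_mixed, yb, yb1, lam

def mixup_loss(loss_func, pred, yb, yb1, lam):
    # per-image losses against both sets of labels, combined with the same lambdas
    return lin_comb(loss_func(pred, yb), loss_func(pred, yb1), lam).mean()
```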
One of the nice things about this is, if you think about it, this is all being applied on the GPU. 00:16:14.160 |
So super powerful augmentation system, which isn't going to add any overhead to our code. 00:16:22.480 |
One thing to be careful of is that we're actually replacing the loss function, and loss functions in PyTorch have a reduction option. 00:16:34.280 |
And most PyTorch loss functions, you can say, after calculating the loss function for everything 00:16:39.400 |
in the mini-batch, either return a rank 1 tensor of all of the loss functions for the 00:16:44.880 |
mini-batch, or add them all up, or take the average. 00:16:50.200 |
But we just have to make sure that we do the right thing. 00:16:53.800 |
So I've just got a little function here that does the mean or sum, or nothing at all, as requested. 00:16:59.960 |
And so then we need to make sure that we create our new loss function, that at the end, it's 00:17:07.240 |
going to reduce it in the way that they actually asked for. 00:17:11.400 |
But then we have to turn off the reduction when we actually do mixup, because we actually 00:17:16.640 |
need to calculate the loss on every image for both halves of our mixup. 00:17:23.280 |
So this is a good place to use a context manager, which we've seen before. 00:17:27.640 |
So we just created a tiny little context manager, which will just find out what the previous 00:17:32.020 |
reduction was, save it away, get rid of it, and then put it back when it's finished. 00:17:41.280 |
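A sketch of what those two helpers could look like; the names reduce_loss and NoneReduce follow the course's conventions, but treat the details as an approximation:

```python
from contextlib import contextmanager
from functools import partial

def reduce_loss(loss, reduction='mean'):
    # apply the requested reduction to a per-item loss tensor
    return loss.mean() if reduction == 'mean' else loss.sum() if reduction == 'sum' else loss

@contextmanager
def NoneReduce(loss_func):
    # temporarily turn off reduction so we get one loss value per item
    old_red = getattr(loss_func, 'reduction', None)
    if old_red is not None:
        loss_func.reduction = 'none'
        try:
            yield loss_func
        finally:
            loss_func.reduction = old_red
    else:
        # plain function: assume it accepts a reduction keyword, like F.cross_entropy
        yield partial(loss_func, reduction='none')
```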
But with that in place, the actual mixup itself is very little code. 00:17:55.560 |
And I've been mainly playing with alpha at 0.2, so this is a bit more than I'm used to. 00:17:59.380 |
But somewhere around that vicinity is pretty normal. 00:18:05.660 |
And that's like -- it's really interesting, because you could use this for layers other than the input. 00:18:15.020 |
You could use it on the first layer, maybe with the embeddings. 00:18:18.040 |
So you could do mixup augmentation in NLP, for instance. 00:18:23.920 |
That's something which people haven't really dug into deeply yet. 00:18:28.000 |
But it seems to be an opportunity to add augmentation in many places where we don't really see it being done. 00:18:35.880 |
Which means we can train better models with less data, which is why we're here. 00:18:45.440 |
So now we've drawn some random number lambda. 00:18:50.120 |
So I've got 0.7 of a dog and 0.3 of a gas station. 00:18:53.640 |
And the correct answer would be a rank one tensor which has 0.7 in one spot and 0.3 in another. 00:19:04.840 |
Softmax isn't going to want to do that for me, because Softmax really wants just one 00:19:08.400 |
of my values to be high, because it's got an e to the top, as we've talked about. 00:19:14.860 |
So to really use mixup well -- and not just mixup, but any time the labels 00:19:21.560 |
on your data are ones you're not 100% sure are correct -- 00:19:25.720 |
you don't want to be asking your model to predict 1. 00:19:30.320 |
You don't want it to say "I'm 100% sure it's this label", because you've got label noise. 00:19:36.040 |
You've got incorrect labels, or you've got mixup mixing things, or whatever. 00:19:38.960 |
So instead, we say, oh, don't use one hot encoding for the dependent variable, but use a little less than one, like 0.9. 00:19:50.000 |
So then the correct answer is to say, I'm 90% sure this is the answer. 00:19:54.720 |
And then all of your probabilities have to add to one. 00:19:57.360 |
So then all of the negatives just get 0.1 divided by n minus 1 spread across them. 00:20:05.640 |
And it's a really simple but astonishingly effective way to handle noisy labels. 00:20:13.600 |
I keep on hearing people saying, oh, we can't use deep learning in this medical problem, 00:20:20.560 |
because the diagnostic labels in the reports are not perfect, and we don't have a gold standard. 00:20:26.980 |
It actually turns out that, particularly if you use label smoothing, noisy labels are generally not a big problem. 00:20:32.400 |
Like, there's plenty of examples of people using this where they literally randomly permute 00:20:38.440 |
half the labels to make them like 50% wrong, and they still get good results, really good 00:20:44.160 |
So don't listen to people in your organization saying, we can't start modeling until we do all this label cleaning; try modeling anyway and see if the results are good. 00:20:56.200 |
And if they are, then maybe you can skip all the cleanup work or do them simultaneously. 00:21:01.360 |
So label smoothing ends up just being the cross entropy loss as before times 1 minus epsilon 00:21:09.960 |
-- so times 0.9 if epsilon is 0.1 -- plus epsilon times the cross entropy summed over all the classes, divided by the number of classes. 00:21:17.800 |
And the nice thing is that's another linear combination. 00:21:21.040 |
So once you create one of these little mathematical refactorings, they tend to pop 00:21:24.520 |
up everywhere and make your code a little bit easier to read and a little bit harder 00:21:28.760 |
to stuff up. Every time I have to write a piece of code, there's a very high probability I'll get it wrong. 00:21:34.860 |
So the less I have to write, the less debugging I'm going to have to do later. 00:21:39.540 |
So we can just pop that in as a loss function and away we go. 00:21:46.400 |
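For reference, here's a sketch of a label smoothing loss along those lines (the class name follows fastai's convention; treat the details as an approximation):

```python
import torch.nn as nn
import torch.nn.functional as F

def lin_comb(a, b, eps):
    return eps * a + (1 - eps) * b

class LabelSmoothingCrossEntropy(nn.Module):
    # target becomes (1-eps) on the true class and eps spread uniformly over all classes
    def __init__(self, eps=0.1, reduction='mean'):
        super().__init__()
        self.eps, self.reduction = eps, reduction

    def forward(self, output, target):
        c = output.size(-1)
        log_preds = F.log_softmax(output, dim=-1)
        # "cross entropy for everything": minus the sum of log probs over all classes
        loss = -log_preds.sum(dim=-1)
        loss = loss.mean() if self.reduction == 'mean' else loss.sum()
        nll = F.nll_loss(log_preds, target, reduction=self.reduction)  # usual cross entropy part
        return lin_comb(loss / c, nll, self.eps)
```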
So that's a super powerful technique which has been around for a couple of years, those 00:21:53.400 |
two techniques, but not nearly as widely used as they should be. 00:21:58.840 |
Then if you're using a Volta, a Tensor Core card, a 2080 -- pretty much any current 00:22:06.120 |
generation Nvidia graphics card -- you can train using half precision floating point, in theory much faster. 00:22:16.040 |
In practice it doesn't quite work out that way because there are other things going on, but you still get a big speedup. 00:22:22.880 |
So the other thing we've got is some work here to allow you to train in half precision 00:22:30.560 |
Now the reason it's not as simple as saying model.half, which would convert all of your 00:22:35.080 |
weights and biases and everything to half precision floating point, is because of this. 00:22:41.200 |
This is from Nvidia's materials and what they point out is that you can't just use half 00:22:48.840 |
precision everywhere because it's not accurate, it's bumpy. 00:22:53.800 |
So it's hard to get good useful gradients if you do everything in half precision, particularly for very small numbers. 00:23:02.300 |
So instead what we do is we do the forward pass in FP16, we do the backward pass in FP16, 00:23:09.980 |
so all the hard work is done in half precision floating point, and pretty much everywhere 00:23:14.800 |
else we convert things to full precision floating point and do everything else in full precision. 00:23:20.680 |
So for example, when we actually apply the gradients by multiplying them by the learning rate, we do that in single precision. 00:23:28.780 |
And that means that if your learning rate's really small, in FP16 the update might basically round down to zero. 00:23:40.280 |
In FastAI version one, we wrote all this by hand. 00:23:44.720 |
For the lessons, we're experimenting with using a library from Nvidia called Apex. 00:23:50.120 |
Apex basically have some of the functions to do this there for you. 00:23:55.520 |
So we're using it here, and basically you can see there's a thing called model-to-half, 00:24:01.380 |
where we just convert the model to half precision, with the batch norm layers kept in float, and so forth. 00:24:05.700 |
So these are not particularly interesting, but they're just going through each one and 00:24:09.480 |
making sure that the right layers have the right types. 00:24:13.160 |
So once we've got those kind of utility functions in place, the actual callback's really quite 00:24:20.120 |
small and you'll be able to map every stage to that picture I showed you before. 00:24:26.520 |
So you'll be able to see when we start fitting, we convert the network to half-precision floating point. 00:24:33.240 |
One of the things that's kind of interesting is there's something here called loss scale. 00:24:38.920 |
After the backward pass, well probably more interestingly, after the loss is calculated, 00:24:48.080 |
we multiply it by this number called loss scale, which is generally something around 00:24:52.640 |
The reason we do that is that losses tend to be pretty small, in a region where half-precision isn't very accurate. 00:24:59.720 |
So we just multiply it by 512, put it in a region that is accurate. 00:25:03.360 |
And then later on, in the backward step, we just divide by that again. 00:25:06.600 |
So that's a little tweak, but it's the difference we find generally between things working and not. 00:25:12.840 |
So the nice thing is now, we have something which you can just add mixed precision and 00:25:20.360 |
train, and you will often get a 2x or 3x speed up, certainly on vision models, and also on transformers and so on. 00:25:34.440 |
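As a rough illustration of the loss-scaling idea only -- a simplified sketch that skips the FP32 master copy of the weights and everything else Apex handles for you; it assumes the model has already been converted to half precision with the batch norm layers kept in float:

```python
def training_step(model, xb, yb, loss_func, opt, loss_scale=512):
    pred = model(xb.half())                 # forward pass in FP16
    loss = loss_func(pred.float(), yb)      # loss computed in FP32
    (loss * loss_scale).backward()          # scale up so small gradients don't underflow
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(loss_scale)         # undo the scaling before the update
    opt.step()
    opt.zero_grad()
    return loss.item()
```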
One obvious question is, is 512 the right number? 00:25:39.480 |
And it turns out getting this number right actually does make quite a difference to your training. 00:25:43.720 |
And so something slightly more recent is called dynamic loss scaling, which literally 00:25:48.200 |
tries a few different values of loss scale to find out at what point does it become infinity. 00:25:54.040 |
And so it dynamically figures out the highest loss scale we can go to. 00:25:59.160 |
And so this version just has the dynamic loss scaling added. 00:26:03.680 |
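Roughly, dynamic loss scaling just watches the gradients for overflow and adjusts the scale as it goes -- something like this sketch (not Apex's actual implementation):

```python
import torch

def update_loss_scale(model, loss_scale, good_steps, scale_wait=500):
    # if any gradient overflowed to inf/nan, the scale was too high: halve it and skip the step
    overflow = any(p.grad is not None and not torch.isfinite(p.grad).all()
                   for p in model.parameters())
    if overflow:
        return loss_scale / 2, 0, True       # new scale, reset counter, skip this update
    good_steps += 1
    if good_steps >= scale_wait:             # after enough clean steps, try a bigger scale
        return loss_scale * 2, 0, False
    return loss_scale, good_steps, False
```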
It's interesting that sometimes training with half-precision gives you better results than 00:26:08.800 |
training with FP32 because there's just, I don't know, a bit more randomness. 00:26:13.120 |
Maybe it regularizes a little bit, but generally it's super, super similar, just faster. 00:26:21.560 |
Is there an intuitive way to understand why mixup is better than other data augmentation techniques? 00:26:30.600 |
I think one of the things that's really nice about mixup is that it doesn't require any domain-specific knowledge about the data. 00:26:43.440 |
It doesn't create any kind of lossiness -- like in the corners, there's no reflection padding or black borders. 00:26:52.800 |
It's also almost infinite in terms of the number of different images it can create. 00:26:59.240 |
So you've kind of got this permutation of every image with every other image, which 00:27:04.040 |
is already giant, and then in different mixes. 00:27:06.800 |
So it's just a lot of augmentation that you can do with it. 00:27:16.600 |
So there's another thing -- there's something called cutout where you just delete a square and replace it with black. 00:27:23.680 |
There's another one where you delete a square and replace it with random pixels. 00:27:27.200 |
Something I haven't seen, but I'd really like to see people do, is to delete a square and replace it with a patch from a different image. 00:27:32.120 |
So I'd love somebody to try doing mix-up, but instead of taking the linear combination, 00:27:38.080 |
instead pick an alpha-sized -- sorry, a lambda percent of the pixels, like in a square, and paste them in from the other image. 00:27:47.320 |
There's another one which basically takes four different images and puts them in the four quadrants. 00:27:55.280 |
And they really get great results, and I'm surprised how few people are using them. 00:28:06.720 |
So let's use our random resize crop, a minimum scale of 0.35 we find works pretty well. 00:28:15.240 |
Other than flip, we're not going to do any other augmentation. 00:28:25.000 |
So far, all of our models have been boring convolutional models. 00:28:30.760 |
But obviously what we really want to be using is a resnet model. 00:28:36.000 |
We have the xresnet, which -- there's some debate about whether this is the mutant version of ResNet. 00:28:47.840 |
Basically the xresnet is the "bag of tricks" ResNet. 00:28:58.780 |
So they have a few suggested tweaks to resnet. 00:29:13.200 |
So the first tweak is something that we've kind of talked about, and they call it resnet 00:29:18.360 |
And it's basically, hey, let's not do a big seven by seven convolution as our first layer, 00:29:26.920 |
because that's just a single linear operation over a big patch, which doesn't have much richness to it. 00:29:33.180 |
So instead, let's do three convs in a row, three by three, right? 00:29:38.940 |
And so with three three-by-three convs in a row, if you think about it, the receptive field 00:29:43.040 |
of that final one is still going to be about seven by seven, right? 00:29:48.160 |
But it's got there through a much richer set of things that it can learn, because it's three layers with nonlinearities in between. 00:29:54.920 |
So that's the first thing that we do in our xresnet. 00:30:03.840 |
And when we create it, we set up how many filters there are going to be for each of the layers. 00:30:10.240 |
So the first three layers will start with channels in, inputs. 00:30:14.280 |
So that'll default to three, because normally we have three channel images, right? 00:30:18.280 |
And the number of outputs that we'll use for the first layer will be that plus one, times eight. 00:30:27.700 |
One reason is that that gives you 32 at the second layer, which is the same as what the bag of tricks paper does. 00:30:42.840 |
The second reason is that I've kind of played around with this quite a lot to try to figure 00:30:50.120 |
out what makes sense in terms of the receptive field, and I think this gives you the right balance. 00:30:56.360 |
Sometimes eight is here because Nvidia graphics cards like everything to be a multiple of eight. 00:31:05.560 |
So if this is not eight, it's probably going to be slower. 00:31:07.300 |
But one of the things here is now if you have like a one channel input, like black and white, 00:31:12.900 |
or a five channel input, like some kind of hyperspectral imaging or microscopy, then 00:31:17.600 |
you're actually changing your model dynamically to say, oh, if I've got more inputs, then I probably want more activations early on. 00:31:24.640 |
Which is not something I've seen anybody do before, but it's a kind of really simple, 00:31:29.000 |
nice way to improve your ResNet for different kinds of domains. 00:31:33.880 |
So that's the number of filters we have for each layer. 00:31:37.360 |
So our stem -- the stem is the very start of a CNN -- is just these three conv layers. 00:31:50.840 |
A conv_layer is a sequential containing a bunch of layers, which starts with a conv of some 00:31:58.520 |
stride, followed by a batch norm, and then optionally followed by an activation function. 00:32:05.920 |
And our activation function, we're just going to use ReLU for now, because that's what they're using in the paper. 00:32:13.980 |
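A sketch of that conv_layer helper and the three-conv stem; the names and default filter sizes follow the course's notebook as I understand it, but treat the details as approximate:

```python
import torch.nn as nn

def conv_layer(ni, nf, ks=3, stride=1, zero_bn=False, act=True):
    # conv -> batch norm -> (optional) ReLU; zero_bn lets us zero-init the last BN in a block
    bn = nn.BatchNorm2d(nf)
    nn.init.constant_(bn.weight, 0. if zero_bn else 1.)
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=False), bn]
    if act:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

def make_stem(c_in=3):
    # three 3x3 convs instead of one 7x7: channel sizes c_in -> (c_in+1)*8 -> 64 -> 64
    nfs = [c_in, (c_in + 1) * 8, 64, 64]
    return nn.Sequential(*[conv_layer(nfs[i], nfs[i + 1], stride=2 if i == 0 else 1)
                           for i in range(3)])
```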
This is another tweak from the bag of tricks, although it goes back a couple more years 00:32:18.420 |
We initialize the batch norm, sometimes to have weights of 1, and sometimes to have weights of 0. 00:32:34.960 |
Have a look here at ResNet D. This is a standard ResNet block. 00:32:41.480 |
This path here normally doesn't have the conv and the average pool. 00:32:46.120 |
We'll talk about why they're there sometimes in a moment. 00:32:50.320 |
And the other goes 1 by 1 conv, 3 by 3 conv, 1 by 1 conv. 00:32:55.920 |
And remember, in each case, it's conv batch norm ReLU, conv batch norm ReLU. 00:32:59.720 |
And then what actually happens is it then goes conv batch norm, and then the ReLU happens after the add. 00:33:05.600 |
There's another variant where the ReLU happens before the plus, which is called preact or 00:33:10.920 |
Turns out it doesn't work quite as well for smaller models, so we're using the non-preact version. 00:33:20.920 |
What if we set the batch norm layer weights there to 0? 00:33:29.040 |
This path does some conv, some conv, some conv, and then a batch norm where the weights are 0, so it outputs 0. 00:33:40.480 |
Because now we're adding 0 to the identity block. 00:33:43.800 |
So in other words, the whole block does nothing at all. 00:33:47.840 |
That's a great way to initialize a model, right? 00:33:50.560 |
Because we really don't want to be in a position, as we've seen, where if you've got a thousand 00:33:54.200 |
layers deep model, any layer even slightly changing the variance can 00:33:59.160 |
cause the gradients to spiral off to 0 or to infinity. 00:34:03.320 |
This way, literally, the entire activations are the same all the way through. 00:34:10.640 |
We set the third conv layer in each block to have 0 weights in that batch norm layer. 00:34:21.760 |
And this lets us train very deep models at very high learning rates. 00:34:26.240 |
You'll see nearly all of the academic literature about this talks about large batch sizes because, 00:34:30.800 |
of course, academics, particularly at big companies like Google and OpenAI and Nvidia 00:34:35.360 |
and Facebook, love to show off their giant data centers. 00:34:39.760 |
And so they like to say, oh, if we do 1,000 TPUs, how big a batch size can we create? 00:34:45.040 |
But for us normal people, these are also interesting because the exact same things tell us how to train at higher learning rates. 00:34:52.220 |
So the exact same things that let you create really big batch sizes, so you do a giant 00:34:55.320 |
batch and then you take a giant step -- well, we can just take a normal sized batch, but use a higher learning rate. 00:35:02.600 |
And by using higher learning rates, we train faster and we generalize better. 00:35:20.040 |
And then we're going to create a bunch of res blocks. 00:35:24.160 |
So a res block is one of these, except this is an identity path, right? 00:35:30.640 |
Unless we're doing a resnet 34 or a resnet 18, in which case one of these convs goes away. 00:35:39.840 |
So resnet 34 and resnet 18 only have two convs here, and resnet 50 onwards have three. 00:35:47.120 |
So and then in res net 50 and above, the second conv, they actually squish the number of channels 00:35:54.200 |
down by four and then they expand it back up again. 00:35:57.760 |
So it could go like 64 channels to 16 channels to 64 channels. 00:36:04.560 |
So a bottleneck block is the normal block for larger res nets. 00:36:08.420 |
And then just two three-by-three convs is the normal block for smaller resnets. 00:36:14.200 |
So you can see in our res block that we pass in this thing called expansion. 00:36:20.860 |
It's one if it's res net 18 or 34, and it's four if it's bigger, right? 00:36:26.240 |
And so if expansion equals one, then after the first three-by-three conv we just add one more three-by-three conv, 00:36:33.960 |
or if expansion is four, we do a one-by-one conv, then a three-by-three conv, then another one-by-one conv. 00:36:46.440 |
Now I mentioned that there's two other things here. 00:36:52.960 |
Well, we can't use standard res blocks all the way through our model, can we? 00:36:58.120 |
Because a res block can't change the grid size. 00:37:01.120 |
We can't have a stride two anywhere here, because if we had a stride two somewhere here, 00:37:07.260 |
we can't add it back to the identity because they're now different sizes. 00:37:11.120 |
Also we can't change the number of channels, right? 00:37:14.260 |
Because if we change the number of channels, we can't add it to the identity. 00:37:18.680 |
Well, as you know, from time to time, we do like to throw in a stride two, and generally 00:37:23.840 |
when we throw in a stride two, we like to double the number of channels. 00:37:27.520 |
And so when we do that, we're going to add to the identity path two extra layers. 00:37:32.520 |
We'll add an average pooling layer, so that's going to cause the grid size to shift down 00:37:37.200 |
by two in each dimension, and we'll add a one by one conv to change the number of filters. 00:37:45.060 |
And this particular way of doing it is specific to the x res net, and it gives you a nice 00:37:51.360 |
little boost over the standard approach, and so you can see that here. 00:37:58.600 |
If the number of inputs is different to the number of filters, then we add an extra conv 00:38:02.800 |
layer, otherwise we just do no op, no operation, which is defined here. 00:38:10.480 |
And if the stride is something other than one, we add an average pooling, otherwise it's 00:38:15.160 |
a no op, and so here is our final res net block calculation. 00:38:23.680 |
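Putting that together, here's a sketch of the res block (using the conv_layer helper sketched above; again, an approximation of the notebook rather than its exact code):

```python
import torch.nn as nn

def noop(x):
    return x

class ResBlock(nn.Module):
    def __init__(self, expansion, ni, nh, stride=1):
        super().__init__()
        nf, ni = nh * expansion, ni * expansion
        if expansion == 1:   # ResNet-18/34 style: two 3x3 convs
            layers = [conv_layer(ni, nh, 3, stride=stride),
                      conv_layer(nh, nf, 3, zero_bn=True, act=False)]
        else:                # ResNet-50+ bottleneck: 1x1, 3x3, 1x1
            layers = [conv_layer(ni, nh, 1),
                      conv_layer(nh, nh, 3, stride=stride),
                      conv_layer(nh, nf, 1, zero_bn=True, act=False)]
        self.convs = nn.Sequential(*layers)
        # identity path: add avg pool / 1x1 conv only when the shape or channel count changes
        self.idconv = noop if ni == nf else conv_layer(ni, nf, 1, act=False)
        self.pool = noop if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.convs(x) + self.idconv(self.pool(x)))
```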
So the ResNet-D tweak is this way of doing what they call the downsampling path. 00:38:32.840 |
And then the final tweak is the actual ordering here of where the stride two is. 00:38:37.400 |
Usually the stride two in a normal resnet is at the start, on the one by one conv, and then there's a three by three after it. 00:38:44.200 |
Doing a stride two on a one by one conv is a terrible idea, because you're literally 00:38:48.400 |
throwing away three quarters of the data, and it's interesting, it took people years 00:38:53.240 |
to realize they're literally throwing away three quarters of the data, so the bag of 00:38:57.080 |
tricks folks said, let's just move the stride two to the three by three conv, and that makes a lot more sense. 00:39:02.160 |
Because a stride two, three by three, you're actually hitting every pixel. 00:39:07.480 |
So the reason I'm mentioning these details is so that you can read that paper and spend 00:39:12.760 |
time thinking about each of those res net tweaks, do you understand why they did that? 00:39:19.280 |
It wasn't some neural architecture search, try everything, brainless, use all our computers approach. 00:39:26.800 |
So let's sit back and think about how do we actually use all the inputs we have, and how 00:39:33.040 |
do we actually take advantage of all the computation that we're doing, right? 00:39:37.600 |
So most of the tweaks are stuff that existed before, and they've cited 00:39:43.220 |
all those, but if you put them all together, it's just a nice, here's how to think about architectures. 00:39:53.640 |
So we create a res net block for every res layer, and so here it is, creating the res 00:40:02.080 |
net block, and so now we can create all of our resnets by simply saying how many blocks to put in each stage. 00:40:12.560 |
So resnet 18 is just two, two, two, two, and 34 is three, four, six, three; and then the second thing 00:40:17.560 |
is the expansion factor, which as I said for 18 and 34 is one, and for the bigger ones is four. 00:40:26.200 |
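So the variants end up as one-liners, something like this (XResNet.create here stands in for the constructor defined in the notebook; the layer counts are the standard ones):

```python
def xresnet18(**kwargs):  return XResNet.create(1, [2, 2, 2, 2],  **kwargs)
def xresnet34(**kwargs):  return XResNet.create(1, [3, 4, 6, 3],  **kwargs)
def xresnet50(**kwargs):  return XResNet.create(4, [3, 4, 6, 3],  **kwargs)
def xresnet101(**kwargs): return XResNet.create(4, [3, 4, 23, 3], **kwargs)
```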
So that's a lot of information there, and if you haven't spent time thinking about architecture 00:40:30.440 |
before, it might take you a few reads and listens to let it sink in, but I think it's 00:40:35.120 |
a really good idea to try to spend time thinking about that, and also to, like, experiment with it. 00:40:44.580 |
The other thing to point out here is that, the way I've written this, 00:40:53.000 |
this is the whole resnet, right, other than the definition of conv_layer. 00:41:00.320 |
It fits on the screen, and this is really unusual. 00:41:03.720 |
Most res nets you see, even without the bag of tricks, 500, 600, 700 lines of code, right? 00:41:10.720 |
And if every single line of code has a different arbitrary number -- 16 here and 32 there and 00:41:17.480 |
average pool here and something else there, like, how are you going to get it right? 00:41:20.800 |
And how are you going to be able to look at it and say, what if I did this a little bit differently? 00:41:26.240 |
So for research and for production, you want to get your code refactored like this for 00:41:32.880 |
your architecture so that you can look at it and say, what exactly is going on, is it 00:41:37.960 |
written correctly, okay, I want to change this to be in a different layer, how do I do that? 00:41:45.000 |
It's really important for effective practitioners to be able to write nice, concise architectures 00:41:52.440 |
so that you can change them and understand them. 00:41:58.960 |
We can train it with or without mixup, it's up to us. 00:42:03.520 |
Label smoothing cross entropy is probably always a good idea, unless you know that your labels are basically perfect. 00:42:13.320 |
And let's check out to see what our model is doing. 00:42:16.800 |
So we've already got a model summary, but we're just going to rewrite it to use 00:42:20.680 |
the new version of learner that doesn't have runner anymore. 00:42:23.940 |
And so we can print out and see what happens to our shapes as they go through the model. 00:42:30.200 |
And you can change this print_mod here to true, and it'll print out the entire blocks as well. 00:42:36.440 |
So that would be a really useful thing to help you understand what's going on in the model. 00:42:45.600 |
We can tell you how many channels are coming in, how many channels are coming out, and 00:42:49.840 |
it'll adapt automatically to our data that way. 00:42:54.020 |
So we can create our learner, we can do our LR find. 00:42:58.920 |
And now that we've done that, let's create a one cycle learning rate annealing. 00:43:05.000 |
So one cycle learning rate annealing, we've seen all this before. 00:43:09.420 |
We keep on creating these things like 0.3, 0.7 for the two phases, or 0.3, 0.2, 0.5 for three phases. 00:43:16.160 |
So I add a little create phases that will build those for us automatically. 00:43:23.200 |
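That helper might look something like this sketch (the notebook's version may differ in detail):

```python
def create_phases(phases):
    # e.g. create_phases(0.3) -> [0.3, 0.7]; create_phases([0.3, 0.2]) -> [0.3, 0.2, 0.5]
    phases = list(phases) if isinstance(phases, (list, tuple)) else [phases]
    return phases + [1 - sum(phases)]
```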
So here's our standard one cycle annealing, and here's our parameter scheduler. 00:43:30.640 |
And so one other thing I did last week was I made it so that callbacks don't have to be attached to the learner. 00:43:38.440 |
You can also pass them to the fit function, and it'll just use those callbacks for the duration of that fit. 00:43:43.120 |
This is a great way to do parameter scheduling. 00:43:51.760 |
So I would love to see people beat my benchmarks here. 00:43:57.880 |
And so far, the best I've got for 128 pixels, 5 epochs, is 84.6%. 00:44:07.800 |
So maybe with some fiddling around, you can find something that's even better. 00:44:11.560 |
And with these kinds of leaderboards, where a lot of these things can train in -- this is 00:44:15.720 |
two and a half minutes on a standard, I think it was a GTX 1080 Ti -- you can quickly try things out. 00:44:23.240 |
And what I've noticed is that the results I get in 5 epochs on 128 pixel Imagenette models 00:44:30.060 |
carry over a lot to image net training or bigger models. 00:44:34.920 |
So you can learn a lot by not trying to train giant models. 00:44:40.320 |
So compete on this leaderboard to become a better practitioner to try out things, right? 00:44:45.680 |
And if you do have some more time, you can go all the way to 400 epochs, which might take a while. 00:44:49.440 |
And then of course, we've also got Imagewoof, which is just doggy photos, and is much harder. 00:44:56.200 |
And actually, this one I find an even better test case, because it's a more difficult dataset. 00:45:10.400 |
So we can refactor all that stuff of adding all these different callbacks and so on into a single cnn_learner function. 00:45:21.680 |
And we can just pass in an architecture and our data and our loss function and our optimization 00:45:25.440 |
function and what kind of callbacks do we want, just yes or no. 00:45:32.640 |
And if you don't pass in c_in and c_out, we'll grab them from your data for you. 00:45:37.140 |
And then we'll just pass that off to the learner. 00:45:42.160 |
So now if you want to create a CNN, it's just one line of code, adding in whatever we want. 00:45:50.320 |
And so we get the same result when we fit it. 00:45:53.080 |
So we can see this all put together in this ImageNet training script, which is in the fastai repo. 00:46:02.760 |
And this entire thing will look entirely familiar to you. 00:46:07.320 |
It's all stuff that we've now built from scratch, with one exception, which is this bit for distributed training. 00:46:21.680 |
And you can easily use multiple GPUs by simply using DataParallel or distributed training. 00:46:28.900 |
Other than that, yeah, this is all stuff that you've seen. 00:46:46.720 |
So what happens if we try to train this for just 60 epochs? 00:46:54.680 |
So benchmark results on ImageNet, these are all the Keras and PyTorch models. 00:46:58.500 |
It's very hard to compare them because they have different input sizes. 00:47:01.900 |
So we really should compare the ones with our input size, which is 224. 00:47:05.760 |
So a standard ResNet -- oh, it scrolled off the screen. 00:47:12.560 |
So ResNet 50 is so bad, it's actually scrolled off the screen. 00:47:15.280 |
So let's take ResNet 101 as a 93.3% accuracy. 00:47:21.520 |
And it was also trained for 90 epochs, so trained for 50% longer, 93.3. 00:47:30.960 |
So this, like, extremely simple architecture that fits on a single screen and was built 00:47:37.360 |
entirely using common sense, trained for just 60 epochs, actually gets us even above ResNet 101. 00:47:46.960 |
So the only things above it were trained on much, much larger images. 00:47:53.240 |
And also, like, NASNet large is so big, I can't train it. 00:47:57.560 |
I just keep on running out of memory and time. 00:48:00.040 |
And Inception ResNet version 2 is really, really fiddly and also really, really slow. 00:48:04.280 |
So we've now got, you know, this beautiful nice ResNet, XResNet 50 model, which, you 00:48:12.720 |
know, is built in this very first principles common sense way and gets astonishingly great results. 00:48:20.240 |
So I really don't think we all need to be running to neural architecture search and 00:48:27.680 |
hyperparameter optimization and blah, blah, blah. 00:48:29.960 |
We just need to use, you know, good common sense thinking. 00:48:34.260 |
So I'm super excited to see how well that worked out. 00:48:41.420 |
So now that we have a nice model, we want to be able to do transfer learning. 00:48:48.120 |
I mean, you all know how to do transfer learning, but let's do it from scratch. 00:48:52.540 |
So what I'm going to do is I'm going to transfer learn from Imagewoof to the Pets dataset. 00:49:10.840 |
Notice how we're using all the stuff we've built. 00:49:17.440 |
So we're still not using anything from fast AI v1. 00:49:22.320 |
This is all stuff that if you want to know what's going on, you can go back to that previous 00:49:26.240 |
lesson and see what did we build and how did we build it and step through the code. 00:49:31.160 |
There's a CNN learner that we just built in the last notebook. 00:49:36.260 |
These five lines of code I got sick of typing, so let's dump them into a single function 00:49:44.780 |
It's going to create our momentum annealing and our learning rate annealing schedules for us. 00:49:50.500 |
So now with that we can just say schedule1cycle with a learning rate, what percentage of 00:49:54.700 |
the epochs are at the start -- batches, I should say -- and we can go ahead and fit. 00:50:00.820 |
For transfer learning we should try and fit a decent model. 00:50:03.660 |
So I did 40 epochs at 11 seconds per epoch on a 1080ti. 00:50:09.480 |
So a few minutes later we've got 79.6% accuracy, which is pretty good, you know, training from 00:50:18.380 |
scratch for 10 different dog breeds with a ResNet 18. 00:50:23.060 |
So let's try and use this to create a good pets model that's going to be a little bit 00:50:29.620 |
tricky because the pets dataset has cats as well, and this model's never seen cats. 00:50:34.700 |
And also this model has only been trained on, I think, less than 10,000 images, so it's 00:50:39.860 |
a kind of unusually small thing that we're trying to do here, so it's an interesting experiment to see how it goes. 00:50:46.220 |
So the first thing we have to do is save the model so that we can load it into our new Pets learner. 00:50:51.480 |
So when we save a model, what we do is we grab its state dict. 00:50:57.740 |
Now we actually haven't written this, but it would be like three lines of code if you 00:51:00.700 |
want to write it yourself, because all it does is it literally creates a dictionary, 00:51:04.800 |
an OrderedDict is just a Python standard library dictionary that has an order, where the keys 00:51:09.980 |
are just the names of all the layers, and for sequential the index of each one, and 00:51:15.260 |
then you can look up, say, 10.bias, and it just returns the weights. 00:51:21.100 |
So you can easily turn a module into a dictionary, and so then we can create somewhere to save 00:51:26.840 |
our model, and torch.save will save that dictionary. 00:51:30.860 |
You can actually just use pickle here, works fine, and actually behind the scenes, torch.save 00:51:35.540 |
is using pickle, but they kind of like add some header to it to say like it's basically 00:51:41.700 |
a magic number that when they read it back, they make sure it is a PyTorch model file 00:51:45.820 |
and that it's the right version and stuff like that, but you can totally use pickle. 00:51:51.540 |
And so the nice thing is now that we know that the thing we've saved is just a dictionary. 00:51:56.660 |
So you can fiddle with it, but if you have trouble loading something in the future, just 00:52:01.980 |
open it up -- just go torch.load, put it into a dictionary, and look at the keys and look at the values. 00:52:10.780 |
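In other words, something like this (the file name is just an example, and learn stands in for the learner from the notebook):

```python
import torch

st = learn.model.state_dict()            # ordered dict of parameter name -> tensor
torch.save(st, 'imagewoof_model.pth')    # example path

# later: inspect the keys, then load into a model with the same architecture
st = torch.load('imagewoof_model.pth')
print(list(st.keys())[:5])
learn.model.load_state_dict(st)
```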
So we've seen Pets before, but the nice thing is that we've never used Pets in part two, so we get to process it from scratch. 00:52:19.180 |
And in this case, there's one images directory that contains all the images, and there isn't 00:52:25.260 |
a separate validation set directory, so we can't use that 00:52:32.060 |
split-by-grandparent thing, so we're going to have to split it randomly. 00:52:38.460 |
But remember how we've already created split by func? 00:52:41.340 |
So let's just write a function that returns true or false, depending on whether some random number is below our validation fraction. 00:52:51.060 |
And so now, we can just pass that to our split by func, and we're done. 00:52:57.940 |
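That splitter function can be as small as this sketch (SplitData.split_by_func follows the API built earlier in the course; the 0.1 validation fraction is just an example):

```python
import random
from functools import partial

def random_splitter(fn, p_valid):
    # True means this file goes to the validation set
    return random.random() < p_valid

# e.g. sd = SplitData.split_by_func(il, partial(random_splitter, p_valid=0.1))
```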
So the nice thing is, when you kind of understand what's going on behind the scenes, it's super easy to adapt. 00:53:06.940 |
Fastai v1 is basically identical -- there's a split_by_func that you do the same thing with. 00:53:12.920 |
So now that's split into training and validation, and you can see how nice it is that we created 00:53:19.460 |
that dunder repr so that we can print things out so easily to see what's going on. 00:53:23.900 |
So if something doesn't have a nice representation, you should monkey-patch in a dunder repr 00:53:33.540 |
So we can't label it by folder, because they're not put into folders. 00:53:41.480 |
So I need to build all this stuff in a Jupyter notebook just interactively to see what's going on. 00:53:48.980 |
So in this case, we'll grab one name, and then let's try to construct a regular expression that extracts the label from it. 00:53:58.500 |
And once we've got it, we can now turn that into a function. 00:54:01.900 |
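For example, something along these lines, assuming the usual Oxford-IIIT Pets file names like 'great_pyrenees_173.jpg' and pathlib Path objects (a sketch, not necessarily the notebook's exact regex):

```python
import re

def pet_labeler(fn):
    # the label is everything before the trailing _<number>.jpg
    return re.findall(r'^(.*)_\d+\.jpg$', fn.name)[0]
```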
And we can now go ahead and use that category processor we built last week to label it. 00:54:08.740 |
We're not just doggies now; it's doggies and kitties. 00:54:15.020 |
So now we can train Pets from scratch: 37%, not great. 00:54:21.860 |
So maybe with transfer learning, we can do better. 00:54:25.860 |
So for transfer learning, we can read in that Imagewoof model, and then we will customize it. 00:54:45.580 |
But let's tell it to create a model with ten filters out, ten activations at the end. 00:54:53.100 |
Because remember, imagewoof has ten types of dog, ten breeds. 00:54:56.980 |
So to load in the pre-trained model, we're going to need to ask for a learner with ten activations as well. 00:55:03.980 |
So now we can grab that state dictionary we saved earlier, and we can load it in. 00:55:18.220 |
But the learner for it is pointing at the pet's data bunch. 00:55:24.120 |
So what we now have to do is remove the final linear layer and replace it with one that 00:55:30.060 |
has the right number of activations to handle all these, which I think is 37 pet breeds. 00:55:39.300 |
So what we do is we look through all the children of the model, and we try to find the adaptive 00:55:44.300 |
average pooling layer, because that's that kind of penultimate bit, and we grab the index 00:55:48.640 |
of that, and then let's create a new model that has everything up to but not including 00:55:55.360 |
So this is everything before the adaptive average pooling. 00:56:01.820 |
So now we need to attach a new head to this body, which is going to have 37 activations 00:56:08.180 |
in the linear layer instead of 10, which is a bit tricky because we need to know how many 00:56:14.140 |
inputs are going to be required in this new linear layer. 00:56:18.180 |
And the number of inputs will be however many outputs come out of this. 00:56:23.820 |
So in other words, just before the average pooling happens in the xresnet, how many activations are there? 00:56:39.660 |
Grab a batch of data, put it through a cut down model, and look at the shape. 00:56:48.860 |
So we've got a 128 mini batch of 512 4x4 activations. 00:56:55.620 |
So that pred.shape[1] is the number of inputs to our head. 00:57:06.020 |
But remember, we tend to not just use a max pool or just an average pool. 00:57:11.980 |
We tend to do both and concatenate them together, which is something we've been doing in fast.ai for a few years. 00:57:19.460 |
But a couple of years ago, somebody finally did actually write a paper about it. 00:57:22.500 |
So I think this is actually an official thing now. 00:57:28.740 |
So our linear layer needs twice as many inputs because we've got two sets of pooling we did. 00:57:35.420 |
So our new model contains the whole body, plus an adaptive concat pooling, a flatten, and a new linear layer. 00:57:44.420 |
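A sketch of that surgery, assuming the model is an nn.Sequential with the adaptive average pool as a direct child (the name AdaptiveConcatPool2d follows the course; the 512 comes from the shape we just printed and 37 is the number of pet breeds):

```python
import torch
import torch.nn as nn

class AdaptiveConcatPool2d(nn.Module):
    # concatenate adaptive max pooling and adaptive average pooling
    def __init__(self, sz=1):
        super().__init__()
        self.ap, self.mp = nn.AdaptiveAvgPool2d(sz), nn.AdaptiveMaxPool2d(sz)
    def forward(self, x):
        return torch.cat([self.mp(x), self.ap(x)], dim=1)

# cut just before the adaptive average pooling layer, then attach a new head
cut = next(i for i, o in enumerate(learn.model.children())
           if isinstance(o, nn.AdaptiveAvgPool2d))
body = learn.model[:cut]
nf = 512                                   # channels coming out of the body (pred.shape[1])
head = nn.Sequential(AdaptiveConcatPool2d(), nn.Flatten(), nn.Linear(2 * nf, 37))
learn.model = nn.Sequential(body, head)
```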
And so let's replace the model with that new model we created and fit. 00:57:49.380 |
And look at that, 71% by fine tuning versus 37% training from scratch. 00:57:59.380 |
So we have a simple transfer learning working. 00:58:06.100 |
So what I did then, I do this in Jupyter all the time, I basically grabbed all the cells. 00:58:16.780 |
And then I grabbed them all, and I hit shift M to merge, and chucked a function header on top. 00:58:22.620 |
So now I've got a function that does all the-- so these are all the lines you saw just before. 00:58:26.460 |
And I've just stuck them all together into a function. 00:58:30.580 |
It's going to take a learner and adapt it for the new data. 00:58:34.940 |
So these are all the lines of code you've already seen. 00:58:37.140 |
And so now we can just go cnn_learner, load the state dict, adapt the model, and then fit. 00:58:46.020 |
But of course, what we really like to do is to first of all train only the head. 00:58:52.420 |
So let's grab all the parameters in the body. 00:58:56.060 |
And remember, when we did that nn.sequential, the body is just the first thing. 00:59:05.100 |
So let's grab all the parameters in the body and set them to requires grad equals false. 00:59:15.420 |
And so now we can train just the head, and we get 54%, which is great. 00:59:20.520 |
So now we, as you know, unfreeze and train some more. 00:59:28.420 |
So it's better than not fine-tuning, but interestingly -- 71 versus 56 -- it's worse than 00:59:41.700 |
the kind of naive fine tuning, where we didn't do any freezing. 00:59:49.220 |
Anytime something weird happens in your neural net, it's almost certainly because of batch 00:59:52.420 |
norm, because batch norm makes everything weird. 00:59:57.580 |
What happened was the frozen part of our model, which was trained for Imagewoof -- those layers 01:00:06.380 |
were tuned for some particular set of mean and standard deviations, because remember, 01:00:13.740 |
the batch norm is going to subtract the mean and divide by the standard deviation. 01:00:20.580 |
But the Pets dataset has different means and standard deviations, not at the input, but at the intermediate layers. 01:00:29.080 |
So the final layer was getting trained 01:00:37.100 |
with everything else frozen, but against a different set of batch norm statistics. 01:00:42.780 |
So then when we unfroze it, everything tried to catch up, and it would be very interesting 01:00:50.180 |
to look at the histograms and stuff that we did earlier in the course and see what's really 01:00:55.020 |
going on, because I haven't really seen anybody-- I haven't really seen a paper about this. 01:01:00.720 |
Something we've been doing in FastAI for a few years now, but I think this is the first 01:01:05.020 |
course where we've actually drawn attention to it. 01:01:10.180 |
That's something that's been hidden away in the library before. 01:01:12.580 |
But as you can see, it's a huge difference, the difference between 56 versus 71. 01:01:22.580 |
And the trick is to not freeze all of the body parameters, but freeze all of the body 01:01:29.860 |
parameters that aren't in the batch norm layers. 01:01:33.940 |
And that way, when we fine-tune the final layer, we're also fine-tuning all of the batch norm layers. 01:01:41.140 |
So we can, just like before, adapt the model, and then let's create something called 01:01:47.080 |
set_grad, which says, oh, if it's the linear layer at the end or a batch norm layer in the middle, leave it alone. 01:01:55.020 |
Otherwise, if it's got weights, set requires_grad to whatever you asked for, which here is False. 01:02:04.060 |
Here's a little convenience function that will apply any function you pass to it recursively to all the children of a module. 01:02:11.820 |
So now that we have apply to a model, or apply to a module, I guess, we can just pass in 01:02:17.860 |
a module, and that will be applied throughout. 01:02:22.220 |
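A sketch of those two pieces (set_grad and apply_mod as named in the course; learn.model stands in for our learner's model):

```python
import torch.nn as nn
from functools import partial

def set_grad(m, requires_grad):
    # leave the final linear layer and all batch norm layers trainable
    if isinstance(m, (nn.Linear, nn.BatchNorm2d)):
        return
    if hasattr(m, 'weight'):
        for p in m.parameters():
            p.requires_grad_(requires_grad)

def apply_mod(m, f):
    # apply f to a module and, recursively, to all of its children
    f(m)
    for c in m.children():
        apply_mod(c, f)

# freeze everything except the batch norm layers and the head's linear layer
apply_mod(learn.model, partial(set_grad, requires_grad=False))
```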
So this way, we freeze just the non-batch norm layers, and of course, not the last layer. 01:02:30.000 |
And so actually, fine-tuning immediately is a bit better, goes from 54 to 58. 01:02:35.320 |
But more importantly, then when we unfreeze, we're back into the 70s again. 01:02:42.500 |
So this is just a super important thing to remember, if you're doing fine-tuning. 01:02:46.460 |
And I don't think there's any library other than fast.ai that does this, weirdly enough. 01:02:51.260 |
So if you're using TensorFlow or something, you'll have to write this yourself to make 01:02:56.360 |
sure that you don't ever freeze the weights in the batch norm layers 01:03:02.300 |
any time you're doing partial layer training. 01:03:05.940 |
Oh, by the way, that apply_mod -- I only wrote it because we're not allowed to use stuff 01:03:10.620 |
in PyTorch that we haven't built ourselves, but actually PyTorch has its own; it's called model.apply. 01:03:14.580 |
So you can use that now, it's the same thing. 01:03:18.940 |
Okay, so finally, for this half of the course, we're going to look at discriminative learning rates. 01:03:27.880 |
So for discriminative learning rates, there's a few things we can do with them. 01:03:31.660 |
One is it's a simple way to do layer freezing without actually worrying about setting requires_grad. 01:03:38.100 |
We could just set the learning rate to zero for some layers. 01:03:43.460 |
So what we're going to do is we're going to split our parameters into two or more groups with a splitter function. 01:03:52.100 |
Here's our function, it's called bnsplitter, it's going to create two groups of parameters 01:04:01.400 |
and it's going to pass the body to underscore bnsplitter, which will recursively look for 01:04:07.580 |
batch norm layers and put them in the second group or anything else with a weight goes 01:04:12.500 |
in the first group and then do it recursively. 01:04:16.240 |
And then the second group also gets everything from the head. 01:04:20.060 |
So this is basically doing something where we're putting all our parameters into one of those two groups. 01:04:28.060 |
So we can check, for example, that when we do bnsplitter on a model that the number of 01:04:32.860 |
parameters in the two halves is equal to the total number of parameters in the model. 01:04:38.580 |
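Here is a rough sketch of that splitter. It assumes the model is an nn.Sequential whose first element is the body and whose remaining elements are the head, as in the course's CNN models:

```python
import torch.nn as nn

def bn_splitter(model):
    "Group 1: non-batchnorm body parameters. Group 2: batchnorm parameters plus the head."
    def _bn_splitter(module, g1, g2):
        if isinstance(module, nn.BatchNorm2d): g2 += list(module.parameters())
        elif hasattr(module, 'weight'):        g1 += list(module.parameters())
        for child in module.children(): _bn_splitter(child, g1, g2)

    g1, g2 = [], []
    _bn_splitter(model[0], g1, g2)       # recurse through the body
    g2 += list(model[1:].parameters())   # everything after the body goes in the second group
    return g1, g2

# Sanity check that nothing was lost or duplicated:
# a, b = bn_splitter(learn.model)
# assert len(a) + len(b) == len(list(learn.model.parameters()))
```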
And so now I want to check this works, right? 01:04:41.180 |
I want to make sure that if I pass this, because we now have a splitter function in the learner, 01:04:46.340 |
and that's another thing I added this week, that when you start training, it's literally 01:04:53.180 |
When we create an optimizer, it passes the model to self.splitter, which by default does 01:05:01.660 |
And so we're going to be using our bnsplitter to split it into multiple parameter groups. 01:05:08.700 |
Because this is one of these things that if I screw it up, I probably won't get an error, 01:05:13.500 |
but instead it probably won't train my last layer, or it'll train all the layers at the 01:05:17.580 |
same learning rate, or it would be hard to know if the model was bad because I screwed something else up. 01:05:26.960
We can't just look inside and make sure it's working, because what we're going to be doing 01:05:31.740 |
is we're going to be passing it, let's see this one, we're going to be passing it to 01:05:42.700 |
the splitter parameter when we create the learner, right? 01:05:46.220 |
So after this, it set the splitter parameter, and then when we start training, we're hoping 01:05:51.420 |
that it's going to create these two layer groups. 01:05:53.300 |
So we need some way to look inside the model. 01:05:59.620 |
Do you remember how I told you that you can actually override dundercall itself? 01:06:05.220 |
You don't just have to override a specific callback? 01:06:07.900 |
And by overriding dundercall itself, we can actually say, which callback do we want to 01:06:15.620 |
And when we hit that callback, please run this function. 01:06:19.540 |
And if you don't pass in a function, it just jumps into the debugger as soon as that callback 01:06:27.380 |
Because now I can create a function called print details that just prints out how many 01:06:32.040 |
parameter groups there are and what the hyperparameters are, and then immediately raises the CancelTrainException. 01:06:39.140
And so then I can fit with my discriminative LR scheduler and my debug callback, and my 01:06:45.780 |
discriminative LR scheduler is something that now doesn't just take a learning rate, but 01:06:50.380 |
an array of learning rates and creates a scheduler for every learning rate. 01:06:55.220 |
And so I can pass that in, so I'm going to use 0 and 0.03. 01:07:00.620
So in other words, no training for the body and 0.03 for the head and the batch norm. 01:07:11.480 |
And so as soon as I fit, it immediately stops because the cancel train exception was raised, 01:07:17.100 |
and it prints out and says there's two parameter groups, which is what we want, and the first 01:07:21.460 |
parameter group has a learning rate of 0, which is what we want, and the second is 0.003, 01:07:27.420 |
which is right because it's 0.03 and we're using the learning rate scheduler, so it starts lower and ramps up. 01:07:36.180
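Here's roughly what that debug callback looks like. It leans on the Callback base class and CancelTrainException from the earlier notebooks, and attribute names like run.opt.param_groups and run.opt.hypers follow this course's optimizer, so treat it as a sketch rather than copy-paste code:

```python
import pdb

class DebugCallback(Callback):
    _order = 999                          # make sure it runs after every other callback
    def __init__(self, cb_name, f=None):
        self.cb_name, self.f = cb_name, f
    def __call__(self, cb_name):
        # Overriding __call__ itself means we see every callback event by name,
        # so we can intercept exactly the one we were asked about.
        if cb_name == self.cb_name:
            if self.f: self.f(self.run)   # run the supplied inspection function...
            else:      pdb.set_trace()    # ...or drop straight into the debugger

def print_details(run):
    # Peek at the optimizer -- how many parameter groups, what hyperparameters --
    # then stop training immediately: we only wanted to check the setup.
    print(len(run.opt.param_groups), run.opt.hypers)
    raise CancelTrainException()
```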
So this is just a way of saying if you're anything like me, every time you write code, 01:07:41.940 |
it will always be wrong, and for this kind of code, you won't know it's wrong, and you 01:07:47.380 |
could be writing a paper or doing a project at work or whatever in which you're not using 01:07:52.940 |
discriminative learning rates at all because of some bug because you didn't know how to 01:07:57.580 |
So make sure you can check and always assume that you screw up everything. 01:08:03.420 |
Okay, so now we can train with zero learning rate on the first layer group, and then we 01:08:13.300 |
can use discriminative learning rates with 1e-3 and 1e-2 and train a little bit more. 01:08:30.300
>> The first question is: we heard that you're against cross-validation for deep learning 01:08:35.360
and wanted to know why that is. 01:08:45.780
So cross-validation is a very useful technique for getting a reasonably sized validation 01:08:50.900 |
set if you don't have enough data to otherwise create a reasonably sized validation set. 01:08:56.380 |
So it was particularly popular in the days when most studies were like 50 or 60 rows. 01:09:02.740 |
If you've got a few thousand rows, it's just pointless, right? 01:09:06.580 |
Like the kind of statistical significance is going to be there regardless. 01:09:10.220 |
So I wouldn't say I'm against it, just most of the time you don't need it because if you've 01:09:15.100 |
got a thousand things in the validation set and you only care whether it's like plus or minus one percent, that's fine. 01:09:22.780
So yeah, have a look and see how much your validation set accuracy is varying from run to run. 01:09:28.100
And if it's too much that you can't make the decisions you need to make, then you can add cross-validation. 01:09:37.420
And what are your best tips for debugging deep learning? 01:09:45.080 |
So Chris Lattner asked me this today as well, actually. 01:09:48.780
So I'll give him the same answer, which is: don't make mistakes in the first place. 01:09:57.180
And the only way to do that is to make your code so simple that it can't possibly have 01:10:02.100 |
a mistake and to check every single intermediate result along the way to make sure it doesn't have a problem. 01:10:09.260
Otherwise, your last month might have been like my last month. 01:10:15.060 |
Well, a month ago, I got 94.1% accuracy on ImageNet, and I was very happy. 01:10:23.460 |
And then I started a couple of weeks ago trying various tweaks. 01:10:29.860 |
And after a while, I got so frustrated, I thought I'd just repeat the previous training 01:10:33.020 |
to see if it was a fluke -- and I couldn't replicate the result. 01:10:43.900
And every time I trained it, it was costing me $150 of AWS credits. 01:10:56.300 |
So that was quite a process to even realize like it's broken. 01:11:00.020 |
Like when something, when you've written that kind of code wrong, it gets broken in ways 01:11:10.900 |
So eventually, I realized, yeah, I mean, so the first thing I'll say is, you've got to 01:11:16.940 |
be a great scientist, which means you need a journal notebook, right? 01:11:20.300 |
You need to keep a journal of your results. 01:11:23.540
So I had a good journal, I pasted everything that was going on, all my models into a file. 01:11:28.980 |
So I went back, I confirmed it really was 94.1. 01:11:35.580 |
And so then I could revert to the exact commit that was in fast AI at that time. 01:11:44.540 |
So I now had to figure out which change in the previous month of the entire fast AI code 01:11:54.260 |
So the first thing I tried to do was try to find a way to quickly figure out whether something was broken. 01:11:58.900
But after doing a few runs and plotting them in Excel, it was very clear from the training curves, quite early on, whether a run was affected. 01:12:13.140
And so I did a bisection search one module at a time, looking through the 15 modules 01:12:18.740 |
that had changed in that diff until eventually I find it was in the mixed precision module. 01:12:22.980 |
And then I went through each change that happened in the mixed precision module. 01:12:25.460
So like $5,000 later, I finally found the one line of code where we had forgotten to write .opt. 01:12:33.520
And so by failing to write dot opt, it meant that we were wrapping an Optim wrapper in 01:12:37.940 |
an Optim wrapper, rather than wrapping an Optim wrapper with an optimizer. 01:12:41.600 |
And that meant that weight decay was being applied twice. 01:12:45.420 |
So that tiny difference, like, was so insignificant that no one using the library even noticed 01:12:54.260 |
I didn't notice it wasn't working until I started trying to, you know, get state-of-the-art 01:12:58.680 |
results on ImageNet in 60 epochs with ResNet 50. 01:13:05.540 |
So yeah, I mean, debugging is hard, and worse still, most of the time you don't even know there's a problem. 01:13:12.060
So I mean, honestly, training models sucks, and deep learning is a miserable experience 01:13:18.380 |
and you shouldn't do it, but on the other hand, it gives you much better results than 01:13:23.740 |
anything else, and it's taking over the world. 01:13:25.460 |
So it's either that or get eaten by everybody else, I guess. 01:13:29.860 |
So yeah, I mean, it's so much easier to write normal code where, like, oh, you have to implement 01:13:35.020 |
OAuth authentication in your web service, and so you go in and you say, oh, here's the 01:13:39.500
API, and we have to take these five steps, and after each one I check that this has happened, 01:13:43.580 |
and you check off each one, and at the end you're done, and you push it, and you have something that works. 01:13:49.500
Even testing, it requires a totally different mindset. 01:13:57.860 |
You want to be able to see if something's changing just occasionally, because if it tests correctly 01:14:02.700 |
all the time with a random seed of 42, you can't be sure it's going to work with a random seed of 41. 01:14:07.660
So you want non-reproducible tests, you want randomness, you want tests that aren't guaranteed 01:14:13.060 |
to always pass, but the accuracy of this integration test should be better than 0.9 nearly all the 01:14:18.380 |
time. You want to be warned if something looks off, you know? 01:14:24.980 |
And this means it's a very different software development process, because if you push something 01:14:27.980 |
to the fast AI repo and a test fails, it might not be your fault, right? 01:14:32.980 |
It might be that Jeremy screwed something up a month ago, and one test fails one out 01:14:38.580 |
So as soon as that happens, then we try to write a test that fails every time, you know? 01:14:45.220 |
So once you realize there's a problem with this thing, you try to find a way to make 01:14:48.740 |
it fail every time, but it's -- yeah, debugging is difficult, and in the end, you just have 01:14:55.300 |
to go through each step, look at your data, make sure it looks sensible, plot it, and 01:15:05.460 |
Great. Well, let's have a break and see you back here at 7:55. 01:15:12.500
So we've all done ULM fit in part one, and there's been a lot of stuff happening in the NLP world since then. 01:15:36.940
>> [Question from the audience about what Jeremy means by keeping a scientific journal.] Ah. Yeah, that's a good one. This is something I'm quite passionate about. 01:15:45.020
When you look at the great scientists in history, they all, that I can tell, had careful scientific 01:15:57.020 |
In my case, my scientific journal is a file in a piece of software called Windows Notepad, 01:16:05.140 |
and I paste things into it at the bottom, and when I want to find something, I press 01:16:09.260 |
control F. It just needs to be something that has a record 01:16:15.540 |
of what you're doing and what the results of that are, because scientists -- scientists 01:16:24.980 |
who make breakthroughs generally make the breakthrough because they look at something 01:16:28.980 |
that shouldn't be, and they go, oh, that's odd. I wonder what's going on. 01:16:35.420 |
So the discovery of the noble gases was because the scientists saw, like, one little bubble 01:16:40.340 |
left in a beaker, which they were pretty sure there shouldn't have been a little bubble 01:16:43.580 |
there anymore. Most people would just be like, oops, there's 01:16:46.900 |
a bubble, or we wouldn't even notice, but they studied the bubble, and they found noble 01:16:50.540 |
gases, or penicillin was discovered because of a, oh, that's odd. 01:16:59.620 |
And I find in deep learning, this is true as well. Like, I spent a lot of time studying 01:17:05.980 |
batch normalization in transfer learning, because a few years ago in Keras, I was getting 01:17:14.700 |
terrible transfer learning results for something I thought should be much more accurate, and 01:17:18.780 |
I thought, oh, that's odd. And I spent weeks changing everything I could, and then almost 01:17:27.140 |
randomly tried changing batch norm. So the problem is that all this fiddling around, 01:17:36.220 |
you know, 90% of it doesn't really go anywhere, but it's the other 10% that you won't be able 01:17:40.020 |
to pick it out unless you can go back and say, like, okay, that really did happen. I 01:17:45.900 |
copied and pasted the log here. So that's all I mean. 01:17:51.660 |
>> Are you also linking to your GitHub commits and datasets, sir? 01:17:55.380 |
>> No, because I've got the date there and the time. So I know the GitHub commit. So I 01:18:01.980 |
do make sure I'm pushing all the time. So, yeah. Okay. Yeah, so there's been a lot happening 01:18:14.140 |
in NLP transfer learning recently, the famous GPT2 from OpenAI and BERT and stuff like that, 01:18:20.460 |
lots of interest in transformers, which we will cover in a future lesson. One could think 01:18:28.660 |
that LSTMs are out of favor and not interesting anymore. But when you look at actually recent 01:18:36.580 |
competitive machine learning results, you see ULMFIT beating BERT. Now, I should say this 01:18:44.140 |
is not just ULMFIT beating BERT. The folks at n-waves are super smart, amazing people. 01:18:44.140
So it's like two super smart, amazing people using ULMFIT beats other people doing BERT. 01:18:50.500
It's definitely not true that RNNs are in the past. I think what's happened is, in fact, 01:19:06.700 |
as you'll see, transformers and CNNs for text have a lot of problems. They basically don't 01:19:13.060 |
have state. So if you're doing speech recognition, every sample you look at, you have to do an 01:19:20.220 |
entire analysis of all the samples around it again and again and again. It's ridiculously 01:19:25.860 |
wasteful or else RNNs have state. But they're fiddly and they're hard to deal with, as you'll 01:19:36.020 |
see, when you want to actually do research and change things. But partly, RNNs have state, 01:19:45.900 |
but also partly, RNNs are the only thing which has had the level of carefulness around regularization 01:19:54.780 |
that AWD LSTM did. So Stephen Merity looked at what are all the ways I can regularize 01:20:01.780
this model and came up with a great set of hyperparameters for that. And there's nothing 01:20:07.220 |
like that outside of the RNN world. So, at the moment, my go-to choice definitely is 01:20:17.300 |
still ULM fit for most real-world NLP tasks. And if people find BERT or GPT2 or whatever 01:20:32.460 |
better for some real-world tasks, that would be fascinating. I would love that to happen, 01:20:36.060 |
but I haven't been hearing that from people that are actually working in industry yet. 01:20:40.500 |
I'm not seeing them win competitive machine learning stuff and so forth. So I still think 01:20:47.220 |
RNNs should be our focus, but we will also learn about transformers later. And so ULM 01:20:54.060 |
fit is just the normal transfer learning path applied to an RNN, which could be on text. 01:21:03.260 |
Interestingly, there's also been a lot of state of the art results recently on genomics 01:21:07.540 |
applications and on chemical bonding analysis and drug discovery. There's lots of things 01:21:15.860 |
that are sequences and it turns out, and we're still just at the tip of the iceberg, right? 01:21:22.020 |
Because most people that are studying like drug discovery or chemical bonding or genomics 01:21:27.660 |
have never heard of ULM fit, right? So it's still the tip of the iceberg. But those who 01:21:30.940 |
are trying it are consistently getting breakthrough results. So I think it's really interesting, 01:21:36.100 |
not just for NLP, but for all kinds of sequence classification tasks. So the basic process 01:21:41.420 |
is going to be create a language model on some large data set. And notice a language 01:21:48.580 |
model is a very general term. It means predict the next item in the sequence. So it could 01:21:53.900 |
be an audio language model that predicts the next sample in a piece of music or speech. 01:21:59.900 |
It could be predicting the next genome in a sequence or whatever, right? So that's what 01:22:07.100 |
I mean by language model. And then we fine-tune it, that language model using our in-domain 01:22:14.580 |
corpus, which in this case is going to be IMDB. And then in each case, we first have 01:22:19.260 |
to pre-process our data sets to get them ready for using an RNN on them. Language models 01:22:29.900 |
require one kind of pre-processing. Classification models require another one. And then finally 01:22:34.700 |
we can fine-tune our IMDB language model for classification. So this is the process we're 01:22:43.920 |
So Sylvain has done an amazing thing in the last week, which is basically to recreate 01:22:50.060 |
the entire AWD LSTM and ULM fit process from scratch in the next four notebooks. And there's 01:22:59.300 |
quite a lot in here, but a lot of it's kind of specific to text processing. And so some 01:23:06.660 |
of it I might skip over a little bit quickly, but we'll talk about which bits are interesting. 01:23:11.340 |
So we're going to start with the IMDB data set as we have before. And to remind you it 01:23:17.140 |
contains a training folder, an unsupervised folder, and a testing folder. So the first 01:23:23.980 |
thing we need to do is we need to create a data blocks item list subclass for text. Believe 01:23:29.780 |
it or not, that's the entire code. Because we already have a get files, so here's a get 01:23:36.080 |
files with dot text. And all you have to do is override get to open a text file like so. 01:23:44.420 |
And we're now ready to create an item list. So this is like the data blocks API is just 01:23:50.100 |
so super easy to create, you know, to handle your domain. So if you've got genomic sequences 01:23:56.740 |
or audio or whatever, this is basically what you need to do. 01:24:02.980 |
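Something along these lines, reusing the ItemList and get_files we built in the earlier data block notebooks (so those names are assumed here):

```python
from pathlib import Path

def read_file(fn):
    with open(fn, 'r', encoding='utf8') as f: return f.read()

class TextList(ItemList):
    @classmethod
    def from_files(cls, path, extensions='.txt', recurse=True, include=None, **kwargs):
        # The same get_files as before, just defaulting to .txt.
        return cls(get_files(path, extensions, recurse=recurse, include=include), path, **kwargs)

    def get(self, i):
        # The only real override: if we were handed a path, open the file.
        if isinstance(i, Path): return read_file(i)
        return i
```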
So now we've got an item list with 100,000 things in it. We've got the train, the test, 01:24:07.340 |
and the unsupervised. And we can index into it and see a text. So here's a movie review. 01:24:14.540 |
And we can use all the same stuff that we've used before. So for the previous notebook, 01:24:18.500 |
we just built a random splitter. So now we can use it on texts. So the nice thing about 01:24:22.820 |
this decoupled API is that we can mix and match things and things just work, right? And we 01:24:27.620 |
can see the representation of them. They just work. Okay, so we can't throw this movie review 01:24:34.580 |
into a model. It needs to be numbers. And so as you know, we need to tokenize and numericalize 01:24:41.140 |
this. So let's look at the details. We use spaCy for tokenizing. And we do a few things 01:24:49.900
as we tokenize. One thing we do is we have a few pre rules. These are these are bits 01:24:56.580 |
of code that get run before tokenization. So for example, if we find br slash, we replace 01:25:03.940 |
it with a new line. Or if we find a slash or a hash, we put spaces around it. If we 01:25:09.660 |
find more than two spaces in a row, we just make it one space. Then we have these special 01:25:17.020 |
tokens. And this is what they look like as strings that we use symbolic names for them, 01:25:24.140 |
essentially. And these different tokens have various special meanings. For example, if we 01:25:29.820 |
see some non-whitespace character more than three times in a row, we replace it with a special repetition token. And this 01:25:42.300
is really cool, right? In Python's re.sub, you can pass in a function, right? So re.sub 01:25:49.740
here is going to look for this and then it's going to replace it with the result of calling 01:25:53.580 |
this function, which is really nice. And so what we're going to do is we're going to stick 01:25:57.880 |
in the TK rep special token. So this means that there was a repeating token where they're 01:26:05.940 |
going to put a number, which is how many times it repeated. And then the thing that was actually 01:26:10.140 |
there. We'll do the same thing with words. There's a lot of bits of little crappy things 01:26:15.580 |
that we see in texts that we replace mainly HTML entities. We call those our default pre-rules. 01:26:24.540 |
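To make the re.sub-with-a-function trick concrete, here is a small self-contained sketch of the character-repetition rule (the exact fastai rules differ a little in detail, and there's a matching replace_wrep for repeated words):

```python
import re

TK_REP = 'xxrep'   # the special token, written out as a plain string

def replace_rep(t):
    "Replace character repetitions: 'amazing!!!!!!' -> 'amazing xxrep 6 ! '"
    def _replace_rep(m):
        c, cc = m.groups()
        # The replacement is built by a function, so we can include the repeat count.
        return f' {TK_REP} {len(cc)+1} {c} '
    # A non-whitespace character followed by three or more copies of itself.
    return re.sub(r'(\S)(\1{3,})', _replace_rep, t)

print(replace_rep('This was amazing!!!!!!'))   # -> 'This was amazing xxrep 6 ! '
```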
And then this is our default list of special tokens. So for example, replace_rep on 'cccc' 01:26:29.020
would be 'xxrep 4 c'. Or replace_wrep on 'would would would would would' 01:26:33.940
would be 'xxwrep 5 would'. Why? Well, think about the alternatives, right? So what if 01:26:44.420
you read a tweet that said this was amazing 28 exclamation marks. So you can either treat 01:26:53.820 |
those 28 exclamation marks as one token. And so now you have a vocab item that is specifically 01:27:00.220 |
28 exclamation marks. You probably never see that again, so probably won't even end up 01:27:04.420 |
in your vocab. And if it did, you know, it's, it's going to be so rare that you won't be 01:27:09.540 |
able to learn anything interesting about it. But if instead we replaced it with XX rep 01:27:15.420 |
28 exclamation mark, then this is just three tokens where it can learn that lots of repeating 01:27:24.380 |
exclamation marks is a general concept that has certain semantics to it, right? So that's 01:27:30.260 |
what we're trying to do in NLP is we're trying to make it so that the things in our vocab 01:27:35.860 |
are as meaningful as possible. And the nice thing is that because we're using an LSTM, 01:27:40.860 |
we can have multi-word sequences and be confident that the LSTM will create some stateful computation 01:27:47.520 |
that can handle that sequence. Another alternative is we could have turned the 28 exclamation 01:27:53.660 |
marks into 28 tokens in a row, each one of the single exclamation mark. But now we're 01:27:58.980 |
asking our LSTM to hang on to that state for 28 time steps, which is just a lot more work 01:28:05.540 |
for it to do. And it's not going to do as good a job, right? So we want to make things 01:28:08.820 |
easy for our models. That's what pre-processing is all about. So same with all caps, right? 01:28:14.960 |
If you've got, I am shouting, then it's pretty likely that there's going to be exclamation 01:28:19.980 |
marks after that. There might be swearing after that. Like the fact that there's lots 01:28:23.820 |
of capitalized words is semantic of itself. So we replace capitalized words with a token 01:28:29.980 |
saying this is a capitalized word. And then we replace it with the lowercase word. So 01:28:34.300 |
we don't have a separate vocab item for capital am, capital shouting, capital, every damn 01:28:39.140 |
word in the dictionary. Okay. Same thing for mixed case. So I don't know, I haven't come 01:28:50.340 |
across other libraries that do this kind of pre-processing. There's little bits and pieces 01:28:54.580 |
in various papers, but I think this is a pretty good default set of rules. Notice that these 01:29:00.940 |
rules have to happen after tokenization because they're happening at a word level. So we have 01:29:04.980 |
default post rules. And then this one here adds a beginning of stream and an end of stream 01:29:12.460 |
on either side of a list of tokens. Why do we do that? These tokens turn out to be very 01:29:21.340 |
important because when your language model sees like an end of stream character token, 01:29:29.340 |
meaning like that's the end of a document, that it knows the next document is something 01:29:34.540 |
new. So it's going to have to learn the kind of reset its state to say like, oh, we're 01:29:39.660 |
not talking about the old thing anymore. So we're doing Wikipedia. We were talking about 01:29:43.660 |
Melbourne, Australia. Oh, and now there's a new token. Then we're talking about the 01:29:48.420 |
Emmys, right? So when it sees EOS, it has to learn to kind of reset its state somehow. 01:29:57.340 |
So you need to make sure that you have the tokens in place to allow your model to know 01:30:02.660 |
that these things are happening. Tokenization is kind of slow because spaCy does it so 01:30:10.180
carefully. I thought it couldn't possibly be necessary to do it so carefully because 01:30:14.700
it just doesn't seem that important. So last year I tried removing spaCy and replacing 01:30:19.420
it with something much simpler. My IMDB accuracy went down a lot. So actually it seems like 01:30:27.020
spaCy's sophisticated parser-based tokenization actually does better. So at least we can try 01:30:34.460
and make it fast. So Python comes with something called a process pool executor, which runs 01:30:40.060 |
things in parallel. And I wrap it around with this little thing called parallel. And so 01:30:45.180 |
here's my thing that runs, look, compose, appears everywhere. Compose the pre-rules 01:30:49.660 |
on every chunk, run the tokenizer, compose the post rules on every dock. That's processing 01:30:55.980 |
one chunk. So run them all in parallel for all the chunks. So that's that. So this is 01:31:03.020 |
a processor, which we saw last week, and this is a processor which tokenizes. And so we 01:31:11.220 |
can try it out. So we can create one and try, here's a bit of text, and let's try tokenizing. 01:31:18.180 |
And so you can see we've got beginning of stream, then did and n't, so n't is a token, comma is a token, 01:31:27.420
xxmaj and then the lowercase word, so that was a capital D, and so forth. All right, so now we need to turn 01:31:33.700
those into numbers, not just to have a list of words. We can turn them into numbers by 01:31:39.940 |
numericalizing, which is another processor, which basically when you call it, we find 01:31:44.820 |
out, do we have a vocab yet? Because numericalizing is just saying, what are all the unique words? 01:31:50.300 |
And the list of unique words is the vocab. So if we don't have a vocab, we'll create 01:31:55.460 |
it, okay? And then after we create it, it's just a case of calling object to int on each 01:32:05.940 |
one. So o2i is just a dictionary, right? And deprocessing is just grabbing each 01:32:11.940
thing from the vocab. So that's just an array. Okay, so we can tokenize, numericalize, run 01:32:18.580 |
it for two and a half minutes. And so we've got the xobj is the thing which returns the 01:32:27.980 |
object version, so as opposed to the numericalized version, and so we can put it back together 01:32:32.100 |
and this is what we have after it's been turned into numbers and back again. So since that 01:32:36.460 |
takes a couple of minutes, good idea to dump the labeled list so that we can then load 01:32:43.180 |
it again later without having to rerun that. All right, this is the bit which a lot of 01:32:50.740 |
people get confused about, which is how do we batch up language model data? So here's 01:32:58.620 |
this bit of text. It's very meta, it's a bit of text which is from this notebook. So the 01:33:07.700 |
first thing we're going to do is we're going to say, let's create some batch sizes, create 01:33:11.380 |
a small one for showing you what's going on, six. So let's go through and create six batches, 01:33:16.740 |
which is just all the tokens for each of those six batches. So here, 'in this notebook, 01:33:21.540
we will go back over the example of' is the first element, so this is the first row, 01:33:26.740
and then 'classifying movie reviews we studied in part one' is the second. So we just 01:33:31.220
put it into six groups, right? And then let's say we have a BPTT of five, so it's kind of 01:33:39.660 |
like our backprop through time sequence length of five, then we can split these up into groups 01:33:45.520 |
of five. And so that'll create three of them. In this notebook, we will go back over the 01:33:51.700 |
example of classifying movie reviews we studied in part one. These three things then are three 01:33:58.060 |
mini batches. And this is where people get confused because it's not that each one has 01:34:03.340 |
a different bunch of documents. Each one has the same documents over consecutive time steps. 01:34:10.880 |
This is really important. Why is it important? Because this row here in the RNN is going 01:34:18.700 |
to be getting some state about this document. So when it goes to the next batch, it needs 01:34:23.580 |
to use that state. And then it goes to the next batch, needs to use that state. So from 01:34:27.700 |
batch to batch, the state that it's building up needs to be consistent. That's why we do 01:34:33.220 |
the batches this way. >> I wanted to ask if you did any other preprocessing, 01:34:41.540 |
such as removing stop words, stemming, or lemmatization? >> Yeah, great question. So 01:34:53.580
in traditional NLP, those are important things to do. Removing stop words is removing words 01:34:58.380 |
like "ah" and "on." Stemming is like getting rid of the "ing" suffix or stuff like that. 01:35:07.800 |
It's kind of like universal in traditional NLP. It's an absolutely terrible idea. Never 01:35:14.060 |
ever do this. Because -- well, the first question is like why would you do it? Why would you 01:35:20.860 |
remove information from your neural net which might be useful? And the fact is it is useful. 01:35:27.340 |
Like stop words, your use of stop words tells you a lot about what style of language, right? 01:35:34.140 |
So you'll often have a lot less kind of articles and stuff if you're like really angry and 01:35:38.860 |
speaking really quickly. You know, the tense you're talking about is obviously very important. 01:35:45.180 |
So stemming gets rid of it. So yeah, all that kind of stuff is in the past. You basically 01:35:52.700 |
never want to do it. And in general, preprocessing data for neural nets, leave it as raw as you 01:36:00.220 |
can is the kind of rule of thumb. So for a language model, each mini batch is basically 01:36:08.780 |
going to look something like this for the independent variable. And then the dependent 01:36:13.220 |
variable will be exactly the same thing but shifted over by one word. So let's create 01:36:18.260 |
that. This thing is called LM_PreLoader. It would actually be better off being called 01:36:23.540
LM_Dataset. Why don't we rename it right now? LM_PreLoader becomes LM_Dataset. That's really 01:36:39.020
what it is. Okay. So an LM data set is a data set for a language model. Remember that a 01:36:48.340 |
data set is defined as something with a length and a get item. So this is a data set which 01:36:55.360 |
you can index into it. And it will grab an independent variable and a dependent variable. 01:37:02.020 |
And the independent variable is just the text from wherever you asked for, for BPTT. And 01:37:09.780 |
the dependent variable is the same thing offset by one. So you can see it here. We can create 01:37:16.060 |
a data loader using that data set. Remember that's how data loaders work. You pass them 01:37:20.780 |
a data set. And now we have something that we can iterate through, grabbing a mini batch 01:37:25.300 |
at a time. And you can see here X is XXBOS well worth watching. And Y is just well worth 01:37:33.460 |
watching. Okay. And then you can see the second batch, best performance to date. So make sure 01:37:41.340 |
you print out things that all make sense. So that's stuff that we can all dump into a single 01:37:47.140 |
function and use it again later and chuck it into a data bunch. So that's all we need for 01:37:55.100 |
a data bunch for language models. We're also going to need a data bunch for classification. 01:38:01.780 |
And that one's going to be super easy because we already know how to create data bunches 01:38:05.060 |
for classification because we've already done it for lots of image models. And for NLP it's 01:38:10.500 |
going to be exactly the same. So we create an item list. We split. We label. That's it. 01:38:21.740 |
So the stuff we did for image is not different. Only thing we've added is two preprocesses. 01:38:30.140 |
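Putting the language model batching idea above into code, here's a simplified sketch of that dataset: one long stream of token ids laid out as bs rows, so that batch k+1 picks up each row exactly where batch k left off, and y is just x shifted one token to the right (the notebook's version additionally shuffles the documents for training):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LM_Dataset(Dataset):
    def __init__(self, docs, bs=64, bptt=70):
        self.bs, self.bptt = bs, bptt
        # One long stream of token ids, laid out as `bs` rows of consecutive text.
        stream = torch.cat([torch.tensor(d) for d in docs])
        n_batch = len(stream) // bs
        self.data = stream[:n_batch * bs].view(bs, n_batch)

    def __len__(self):
        return ((self.data.size(1) - 1) // self.bptt) * self.bs

    def __getitem__(self, idx):
        row = self.data[idx % self.bs]            # same row from batch to batch...
        seq = (idx // self.bs) * self.bptt        # ...moving forward bptt tokens each time
        return row[seq:seq + self.bptt], row[seq + 1:seq + self.bptt + 1]

# With a non-shuffling DataLoader, batch k+1 continues batch k for every row:
# dl = DataLoader(LM_Dataset(numericalized_docs, bs=64, bptt=70), batch_size=64)
```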
>> Question. What are the tradeoffs to consider between batch size and back propagation through 01:38:40.340 |
time? For example, BPTT10 with BS100 versus BPTT100 with BS10. Both would be passing a 01:38:48.380 |
thousand tokens at a time to the model. What should you consider when tuning the ratio? 01:38:54.340 |
>> It's a great question. I don't know the answer. I would love to know. So try it. Because 01:38:59.900 |
I haven't had time to fiddle with it. I haven't seen anybody else experiment with it. So that 01:39:03.780 |
would make a super great experiment. I think the batch size is the thing that lets it parallelize. 01:39:15.740 |
So if you don't have a large enough batch size it's just going to be really slow. But 01:39:20.740 |
on the other hand, the large batch size with a short BPTT, depending on how you use it, 01:39:26.060 |
you may end up kind of ending up with less state that's being back propagated. So the 01:39:30.120 |
question of how much that matters, I'm not sure. And when we get to our ULM classification 01:39:35.500 |
model I'll actually show you this, kind of where this comes in. Okay. So here's a couple 01:39:43.100 |
of examples of a document and a dependent variable. And what we're going to be doing 01:39:52.340 |
is we're going to be creating data loaders for them. But we do have one trick here. Which 01:40:00.300 |
is that with images, our images were always, by the time we got to modeling they were all 01:40:07.020 |
the same size. Now this is probably not how things should be. And we have started doing 01:40:13.060 |
some experiments with training with rectangular images of different sizes. But we're not quite 01:40:16.700 |
ready to show you that work because it's still a little bit fiddly. But for text we can't 01:40:24.620 |
avoid it. You know, we've got different sized texts coming in. So we have to deal with it. 01:40:29.900 |
And the way we deal with it is almost identical to how actually we're going to end up dealing 01:40:34.700 |
with when we do do rectangular images. So if you are interested in rectangular images, 01:40:38.700 |
try and basically copy this approach. Here's the approach. We are going to pad each document 01:40:49.100 |
by adding a bunch of padding tokens. So we just pick some arbitrary token which we're 01:40:53.420 |
going to tell PyTorch this token isn't text. It's just thrown in there because we have 01:40:59.460 |
to put in something to make a rectangular tensor. If we have a mini batch with a 1,000 01:41:07.100 |
word document and then a 2,000 word document and then a 20 word document, the 20 word document 01:41:12.380 |
is going to end up with 1,980 padding tokens on the end. And as we go through the RNN, 01:41:18.020 |
we're going to be totally pointlessly calculating on all these padding tokens. We don't want 01:41:24.820 |
to do that. So the trick is to sort the data first by length. So that way your first mini 01:41:34.260 |
batch will contain your really long documents and your last mini batch will contain your 01:41:39.300
really short documents and each mini batch will not contain a very wide variety of lengths 01:41:43.580 |
of documents. So there won't be much padding and so there won't be much wasted computation. 01:41:50.020 |
So we've already looked at samplers. If you've forgotten, go back to when we created our 01:41:55.100 |
data loader from scratch and we actually created a sampler. And so here we're going to create 01:42:01.420 |
a different type of sampler and it is simply one that goes through our data, looks at how 01:42:09.100 |
many documents is in it, creates the range from zero to the number of documents, sorts 01:42:15.900 |
them by some key and returns that iterator, sorts them in reverse order. So we're going 01:42:22.340 |
to use sort sampler passing in the key, which is a lambda function that grabs the length 01:42:29.300 |
of the document. So that way our sampler is going to cause each mini batch to be documents 01:42:42.900 |
of similar lengths. The problem is we can only do this for validation, not for training because 01:42:48.700 |
for training we want to shuffle and sorting would undo any shuffling because sorting is 01:42:55.180 |
deterministic. So that's why we create something called sort ish sampler. And the sort ish 01:43:01.860 |
sampler approximately orders things by length. So every mini batch has things of similar 01:43:10.340 |
lengths but with some randomness. And the way we do this, the details don't particularly 01:43:15.580 |
matter but basically I've created this idea of a mega batch, which is something that's 01:43:20.300 |
50 times bigger than a batch and basically I sort those, okay? And so you end up with 01:43:25.860 |
these kind of like sorted mega batches and then I have random permutations within that. 01:43:32.500 |
So you can see random permutations there and there. So you can look at the code if you 01:43:36.020 |
care, the details don't matter. In the end, it's a random sort in which things of similar 01:43:43.020 |
lengths tend to be next to each other and the biggest ones tend to be at the start. 01:43:47.960 |
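Here is roughly what those two samplers look like. The SortishSampler below is a simplified version of the notebook's (the real one also takes care to put the biggest mini-batch first, so any out-of-memory error shows up straight away):

```python
import torch
from torch.utils.data import Sampler

class SortSampler(Sampler):
    "Deterministic: iterate from the longest document to the shortest (used for validation)."
    def __init__(self, data_source, key):
        self.data_source, self.key = data_source, key
    def __len__(self): return len(self.data_source)
    def __iter__(self):
        return iter(sorted(range(len(self.data_source)), key=self.key, reverse=True))

class SortishSampler(Sampler):
    "Roughly sorted: shuffle, then sort within 'megabatches' of 50*bs items (used for training)."
    def __init__(self, data_source, key, bs):
        self.data_source, self.key, self.bs = data_source, key, bs
    def __len__(self): return len(self.data_source)
    def __iter__(self):
        idxs = torch.randperm(len(self.data_source)).tolist()
        sz = self.bs * 50
        megabatches = [idxs[i:i + sz] for i in range(0, len(idxs), sz)]
        return iter([i for mb in megabatches for i in sorted(mb, key=self.key, reverse=True)])

# train_sampler = SortishSampler(train_ds.x, key=lambda i: len(train_ds.x[i]), bs=64)
# valid_sampler = SortSampler(valid_ds.x, key=lambda i: len(valid_ds.x[i]))
```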
So now we've got a mini batch of numericalized, tokenized documents of similar lengths but 01:43:54.060 |
they're not identical lengths, right? And so you might remember the other thing when 01:43:59.340 |
we first created a data loader, we gave it two things, a sampler and a collate function. 01:44:05.860 |
And the collate function that we wrote simply said torch.stack because all our images were 01:44:11.180 |
the same size because all our images were the same size so we could just literally just 01:44:17.340 |
stick them together. We can't do that for documents because they're different sizes. 01:44:22.020 |
So we've written something called pad collate. And what Sylvan did here was he basically 01:44:28.100 |
said let's create something that's big enough to handle the longest document in the mini 01:44:31.980 |
batch and then go through every document and dump it into that big tensor either at the 01:44:39.180 |
start or at the end depending on whether you said pad first. So now we can pass the sampler 01:44:48.660 |
and the collate function to our data loader and that allows us to grab some mini batches 01:44:55.940 |
which as you can see contain padding at the end. And so here's our normal convenience 01:45:02.420 |
functions that do all those things for us and that's that, okay. So that's quite a bit 01:45:10.020 |
of preprocessing and I guess the main tricky bit is this dealing with different lengths. 01:45:17.100 |
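And the collate function is only a few lines. This is a sketch which assumes the padding token sits at index 1 of the vocab, as in these notebooks:

```python
import torch

def pad_collate(samples, pad_idx=1, pad_first=False):
    "Collate (tokens, label) pairs into one rectangular batch, filling the gaps with pad_idx."
    max_len = max(len(s[0]) for s in samples)
    res = torch.zeros(len(samples), max_len).long() + pad_idx
    for i, (toks, _) in enumerate(samples):
        t = torch.as_tensor(toks, dtype=torch.long)   # toks: a list or 1-D tensor of token ids
        if pad_first: res[i, -len(toks):] = t
        else:         res[i, :len(toks)]  = t
    return res, torch.tensor([s[1] for s in samples])

# train_dl = DataLoader(train_ds, batch_size=bs, sampler=train_sampler, collate_fn=pad_collate)
```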
And at that point we can create our AWD LSTM. So these are just the steps we just did to 01:45:27.300 |
create our data loader. And now we're going to create an RNN. So an RNN remember is just 01:45:37.620 |
a multi-layer network. But it's a multi-layer network that could be very, very, very many 01:45:43.740 |
layers. There could be like if it's a 2000 word document this is going to be 2000 layers. 01:45:49.140 |
So to avoid us having to write 2000 layers we used a for loop. And between every pair 01:45:56.500 |
of hidden layers we use the same weight matrix. That's why they're the same color. And that's 01:46:01.060 |
why we can use a for loop. Problem is as we've seen trying to handle 2000 layers of neural 01:46:08.860 |
net we get vanishing gradients or exploding gradients it's really, really difficult to 01:46:13.180 |
get it to work. So what are we going to do? Because it's even worse than that because 01:46:17.460 |
often we have layers going into, RNNs going into other RNNs so we actually have stacked 01:46:22.580 |
RNNs which when we unstack them it's going to be even more thousands of layers effectively. 01:46:29.820 |
So the trick is we create something called an LSTM cell. Rather than just doing a matrix 01:46:34.980 |
multiply as our layer we instead do this thing called an LSTM cell as our layer. This is 01:46:42.620 |
it here. So this is a sigmoid function and this is a tanh function. So the sigmoid function 01:46:49.700 |
remember goes from 0 to 1 and kind of nice and smooth between the two. And the tanh function 01:46:55.120 |
is identical to a sigmoid except it goes from minus 1 to 1 rather than 0 to 1. So sigmoid 01:47:00.660 |
is 0 to 1, tanh is minus 1 to 1. So here's what we're going to do. We're going to take 01:47:05.580 |
our input and we're going to have some hidden state as we've already always had in our RNNs. 01:47:11.580 |
This is just our usual hidden state. And we're going to multiply our input by some weight 01:47:18.300 |
matrix in the usual way. Then we're going to multiply our hidden state by some weight 01:47:21.820 |
matrix in the usual way and then we add the two together in the way we've done before 01:47:25.660 |
for RNNs. And then we're going to do something interesting. We're going to split the result 01:47:31.500 |
into four equal sized tensors. So the first one quarter of the activations will go through 01:47:38.060 |
this path, the next will go through this path, the next will go through this path, the next 01:47:41.780 |
will go through this path. So what this means is we kind of have like four little neural 01:47:48.460 |
nets effectively, right? And so this path goes through a sigmoid and it hits this thing called 01:47:57.200 |
the cell. Now this is the new thing. So the cell, just like hidden state, is just a rank 01:48:02.900 |
one tensor or for a mini batch, a rank two tensor. It's just some activations. And what 01:48:09.660 |
happens is we multiply it by the output of this sigmoid. So the sigmoid can go between 01:48:15.760 |
zero and one. So this, this gate has the ability to basically zero out bits of the cell state. 01:48:25.100 |
So we have the ability to basically take this state and say like delete some of it. So we 01:48:28.900 |
could look at some of these words or whatever in this LSTM and say based on looking at that, 01:48:33.660 |
we think we should zero out some of our cell state. And so now the cell state has been 01:48:39.420 |
selectively forgotten. So that's the forget gate. We then add it to the second chunk, 01:48:46.640 |
the second little mini neural net, which goes through sigmoid. So this is just our input 01:48:53.660 |
and we multiply it by the third one, which goes through a tanh. So this basically allows 01:49:01.100
us to say, which bits of input do we care about? And then this gives us the numbers from 01:49:06.740 |
minus one to one, multiply them together. And this adds, so this is, how do we update our 01:49:11.380 |
cell state? So we add on some new state. And so now we take that cell state and we put 01:49:18.340 |
it through another, well, one thing that happens is it goes through to the next time step. 01:49:23.960 |
And the other thing that happens is it goes through one more tanh to get multiplied by 01:49:28.800
the fourth little mini neural net, which is the output. So this is the actual, this actually 01:49:34.580 |
creates the output hidden state. So it looks like there's a lot going on, but actually 01:49:43.100 |
it's just this, right? So you've got one neural net that goes from input to hidden. It's a 01:49:49.660 |
linear layer. One that goes from hidden to hidden. Each one is going to be four times 01:49:53.960 |
the number of hidden because after we compute it and add them together, chunk splits it 01:49:59.580 |
up into four equal sized groups. Three of them go through a sigmoid. One of them goes through 01:50:06.780 |
a tanh and then this is just the multiply and add that you saw. So there's kind of like 01:50:12.940
conceptually a lot going on in LSTM and it's certainly worth doing some more reading about 01:50:18.220 |
why this particular architecture. But one thing I will say is there's lots of other ways 01:50:23.500 |
you can set up a layer which has the ability to selectively update and selectively forget 01:50:29.620 |
things. For example, there's something called a GRU, which has one less gate. The key thing 01:50:35.060
seems to be giving it some way to make a decision to forget things. Cuz if you do that, then 01:50:42.620 |
it has the ability to not push state through all thousand time steps or whatever. So that's 01:50:48.820 |
our LSTM cell and so an LSTM layer, assuming we only have one layer, is just that for loop 01:50:57.900 |
that we've seen before and we're just gonna call whatever cell we asked for. So we're 01:51:02.300 |
gonna ask for an LSTM cell and it just loops through -- and see how we take 01:51:10.140
the state and we update the state. So you can see this is the classic deep learning, 01:51:15.300 |
it's like an NN.sequential, right? It's looping through a bunch of functions that are updating 01:51:19.700 |
itself. That's what makes it a deep learning network. So that's an LSTM. 01:51:30.780
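In code, the cell and the layer loop are only a few lines. This sketch mirrors what was just described (one big matrix multiply whose output is chunked into the four gates), written as standalone modules:

```python
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, ni, nh):
        super().__init__()
        # One big linear layer per input: 4*nh outputs, chunked into the four "mini nets".
        self.ih = nn.Linear(ni, 4 * nh)
        self.hh = nn.Linear(nh, 4 * nh)

    def forward(self, inp, state):
        h, c = state
        gates = (self.ih(inp) + self.hh(h)).chunk(4, 1)
        ingate, forgetgate, outgate = map(torch.sigmoid, gates[:3])
        cellgate = gates[3].tanh()
        c = (forgetgate * c) + (ingate * cellgate)   # selectively forget, then add new state
        h = outgate * c.tanh()                       # the new hidden state (the output)
        return h, (h, c)

class LSTMLayer(nn.Module):
    def __init__(self, cell, *cell_args):
        super().__init__()
        self.cell = cell(*cell_args)

    def forward(self, inp, state):
        # inp is (batch, seq_len, ni): loop over the time steps, carrying the state along.
        outputs = []
        for i in range(inp.shape[1]):
            out, state = self.cell(inp[:, i], state)
            outputs.append(out)
        return torch.stack(outputs, dim=1), state

# lstm = LSTMLayer(LSTMCell, 300, 300)
# h = c = torch.zeros(64, 300)
# y, (h, c) = lstm(torch.randn(64, 70, 300), (h, c))
```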
So that takes 105 milliseconds for a small net on the CPU. We could pop it onto CUDA, then it's 24 milliseconds 01:51:44.060
on GPU. It's not that much faster, because this loop, every time step, it's having to 01:51:51.700 |
push off another kernel launch off to the GPU and that's just slow, right? So that's 01:52:00.380 |
why we use the built in version. And the built in version behind the scenes calls a library 01:52:05.220 |
from Nvidia called cuDNN, which has created a C++ version of this. It's about the same 01:52:12.300 |
on the CPU, right? Not surprisingly, it's really not doing anything different, but on 01:52:15.780 |
the GPU goes from 24 milliseconds to 8 milliseconds. So it's dramatically faster. The good news 01:52:24.920 |
is we can create a faster version by taking advantage of something in PyTorch called JIT. 01:52:31.380 |
And what JIT does is it reads our Python and it converts it into C++ that does the same 01:52:38.140 |
thing. It compiles it the first time you use it and then it uses that compiled code. And 01:52:45.860 |
so that way it can create an on GPU loop. And so the result of that is, again, pretty similar 01:52:53.880 |
on the CPU, but on the GPU, 12 milliseconds. So, you know, not as fast as the cuDNN version, 01:53:03.300 |
but certainly a lot better than our non-JIT version. So this seems like some kind of magic 01:53:09.620 |
thing that's going to save our lives and not require us to have to come to the Swift for 01:53:13.860 |
TensorFlow lectures. But I've got bad news for you. Trying to get JIT working has been 01:53:27.460 |
honestly a bit of a nightmare. This is the third time we've tried to introduce it in 01:53:33.100 |
this course. And the other two times we've just not gotten it working or we've gotten 01:53:43.580 |
worse results. It doesn't work very well that often. And it's got a lot of weird things 01:53:50.160 |
going on. Like, for example, if you decide to comment out a line, right, and then run 01:53:57.940 |
it, you'll get this error saying unexpected indent. Like, literally, 01:54:07.180
it's not Python, right? So it doesn't even know how to comment out lines. It's this kind 01:54:13.480 |
of weird thing where they try to -- it's heroic. It's amazing that it works at all. But the 01:54:18.440 |
idea that you could try and turn Python, which is so not C++, into C++ is really pushing 01:54:24.580 |
at what's possible. So it's astonishing this works at all. And occasionally it might be 01:54:30.220 |
useful, but it's very, very hard to use. And when something isn't as fast as you want, 01:54:36.500 |
it's very, very hard to -- you can't profile it, you can't debug it, not in the normal 01:54:40.060 |
ways. But, you know, obviously, it will improve. It's pretty early days. It will improve. But 01:54:49.540 |
the idea of trying to parse Python and turn it into C++, literally, they're doing like 01:54:54.020 |
string interpolation behind the scenes, is kind of trying to reinvent all the stuff that 01:55:00.420 |
compilers already do, converting a language that was very explicitly not designed to do 01:55:06.540 |
this kind of thing into one that does. And I just -- I don't think this is the future. 01:55:12.540 |
So I say for now, be aware that JIT exists. Be very careful in the short term. I found 01:55:18.820 |
places where it literally gives the wrong gradients. So it goes down a totally different 01:55:22.900 |
auto grad path. And I've had models that trained incorrectly without any warnings. Because it 01:55:29.380 |
was just wrong. So be very careful. But sometimes, like, for a researcher, if you want to play 01:55:35.860 |
with different types of RNNs, this is your only option. Unless you, you know, write your 01:55:43.820 |
own C++. Or unless you try out Julia or Swift, I guess. Is there a question? 01:55:53.780 |
>> Yeah. Why do we need -- why do we need torch.cuda.synchronize? Is it kind of a lock to synchronize the CPU and the GPU? 01:56:04.220
>> Yeah, this is something that, thanks to Tom on the forum for pointing this out, it's 01:56:08.660 |
just when we're timing, without the synchronize, it's -- let's find it. So I just created a 01:56:20.900 |
little timing function here. Without the synchronize, the CUDA thing will just keep on running things 01:56:26.700 |
in the background, but will return -- it will let your CPU thread keep going. So it could 01:56:33.540 |
end up looking much faster than it actually is. So synchronize says, don't keep going in 01:56:39.140 |
my Python world until my CUDA world is finished. Okay. So now we need dropout. And this is the 01:56:47.380 |
bit that really is fantastic about AWDLSTM, is that Stephen Merity thought about all the 01:56:54.900
ways in which we can regularize a model. So basically, dropout is just Bernoulli random 01:57:01.420 |
noise. So Bernoulli random noise simply means create 1s and 0s, and it's 1 with this probability. 01:57:08.340 |
Right? So create a bunch of random 1s and 0s. And then divide by 1 minus P. So that makes 01:57:14.620 |
them, in this case, to 0.5, it's randomly 0s and 2s. And the reason they're 0s and 2s is 01:57:20.180 |
because that way the standard deviation doesn't change. So we can remove dropout for inference 01:57:24.540 |
time and the activations will be still scaled correctly. And we talked about that a little 01:57:30.020 |
bit in part 1. And so now we can create our RNN dropout. And one of the nifty things here 01:57:38.220 |
is the way that Sylvain wrote this is you don't just pass in the thing to dropout, but 01:57:44.140 |
you also pass in a size. Now, normally, you would just pass in the size of the thing to 01:57:49.060 |
dropout like this. But what he did here was he passed in, for the size, (size(0), 1, size(2)). 01:57:59.500
And so if you remember back to broadcasting, this means that this is going to create something 01:58:04.980
with a unit axis in the middle. And so when we multiply that, so here's our matrix, when 01:58:10.500 |
we multiply the dropout by that, our 0s get broadcast. This is really important, right? 01:58:18.340 |
Because this is the sequence dimension. So every time step, if you drop out time step 01:58:23.060 |
number 3, but not time step 2 or 4, you've basically broken that whole sequence's ability 01:58:30.500 |
to calculate anything because you just killed it, right? So this is called RNN dropout or 01:58:36.060 |
also called variational dropout. There's a couple of different papers that introduce 01:58:39.380 |
the same idea. And it's simply this that you do dropout on the entire sequence at a time. 01:58:46.460 |
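A sketch of that variational dropout: build one Bernoulli mask of shape (batch, 1, n_hidden), scaled by 1/(1-p), and let broadcasting repeat it across every time step:

```python
import torch
import torch.nn as nn

def dropout_mask(x, sz, p):
    # Zeros with probability p, otherwise 1/(1-p), so the standard deviation is unchanged.
    return x.new_empty(*sz).bernoulli_(1 - p).div_(1 - p)

class RNNDropout(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):                 # x: (batch, seq_len, n_hidden)
        if not self.training or self.p == 0.: return x
        mask = dropout_mask(x.data, (x.size(0), 1, x.size(2)), self.p)
        return x * mask                   # the size-1 middle axis broadcasts over the time steps
```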
So there's RNN dropout. The second one that Stephen Merity showed was something he called 01:58:54.780
weightdrop. It actually turns out that this already existed in the computer vision world 01:59:00.380 |
where it was called dropconnect. So there's now two things with different names but are 01:59:06.780 |
the same, weightdrop and dropconnect. And this is dropout not on the activations but 01:59:13.580 |
on the weights themselves. So you can see here when we do the forward pass, we go 01:59:17.900
set weights that applies dropout to the actual weights. So that's our second type of dropout. 01:59:28.380 |
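The actual WeightDropout in the notebook wraps an nn.LSTM and re-applies dropout to its hidden-to-hidden weight matrix on every forward pass; the core idea, stripped down to a plain linear layer purely for illustration, is simply this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropConnectLinear(nn.Module):
    "Illustration only: dropout applied to the weights rather than the activations."
    def __init__(self, ni, nf, weight_p=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(nf, ni) * 0.01)
        self.bias = nn.Parameter(torch.zeros(nf))
        self.weight_p = weight_p

    def forward(self, x):
        # A fresh dropped-out copy of the weight matrix for every forward pass.
        w = F.dropout(self.weight, p=self.weight_p, training=self.training)
        return F.linear(x, w, self.bias)
```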
The next one is embedding dropout. And this one, as you can see, it drops out an entire 01:59:37.660 |
row. This is actually a coincidence that all these rows are in order, but it drops out 01:59:41.540 |
an entire row. So by dropping it-- so what it does is it says, OK, you've got an embedding. 01:59:47.340 |
And what I'm going to do is I'm going to drop out all of the embedding-- the entire embedding 01:59:52.900 |
vector for whatever word this is. So it's dropping out entire words at a time. So that's 02:00:02.060 |
embedding dropout. So with all that in place, we can create an LSTM model. It can be a number 02:00:10.520 |
of layers. So we can create lots of LSTMs for however many layers you want. And we can loop 02:00:15.340 |
through them. And we can basically call each layer. And we've got all our different dropouts. 02:00:21.020 |
And so basically this code is just calling all the different dropouts. So that is an 02:00:27.700 |
AWDLSTM. So then we can put on top of that a simple linear model with dropout. And so 02:00:41.660 |
this simple linear model-- so it's literally just a linear model where we go dropout and 02:00:47.820 |
then call our linear model is-- we're going to create a sequential model which takes the 02:00:53.900 |
RNN, so the AWDLSTM, and passes the result to a single linear layer with dropout. And 02:01:02.340 |
that is our language model. Because that final linear layer is a thing which will figure 02:01:07.060 |
out what is the next word. So the size of that is the size of the vocab. It's good to 02:01:15.860 |
look at these little tests that we do along the way. These are the things we use to help 02:01:19.260 |
pass check that everything looks sensible. And we found, yep, everything does look sensible. 02:01:26.020 |
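Sketching that head (this assumes the RNNDropout above and an AWD-LSTM that returns both its raw and its dropped-out per-layer outputs, as in the notebook; the names are illustrative):

```python
import torch.nn as nn

class LinearDecoder(nn.Module):
    "Dropout plus one linear layer mapping the final hidden state to a score per vocab item."
    def __init__(self, n_hid, vocab_sz, output_p):
        super().__init__()
        self.output_dp = RNNDropout(output_p)
        self.decoder = nn.Linear(n_hid, vocab_sz)

    def forward(self, inp):
        raw_outputs, outputs = inp              # what the AWD-LSTM hands us
        out = self.output_dp(outputs[-1])       # dropped-out output of the last LSTM layer
        return self.decoder(out), raw_outputs, outputs

# language_model = nn.Sequential(awd_lstm, LinearDecoder(emb_sz, len(vocab), output_p=0.4))
```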
And then we added something that AWDLSTM did which is called gradient clipping, which is 02:01:30.620 |
a callback that just checks after the backward pass what are the gradients. And if the total 02:01:37.540 |
norm of the gradients-- so the root sum of squares of gradients is bigger than some number, 02:01:43.380 |
then we'll divide them all so that they're not bigger than that number anymore. So it's 02:01:47.060 |
just clipping those gradients. So that's how easy it is to add gradient clipping. This is 02:01:52.300 |
a super good idea, not as used as much as it should be because it really lets you train 02:01:57.220 |
things at higher learning rates and avoid kind of gradients blowing out. Then there's 02:02:06.660 |
two other kinds of regularization. This one here is called activation regularization. 02:02:14.740 |
And it's actually just an L2 loss, an L2 penalty, just like weight decay. Except the L2 penalty 02:02:24.540 |
is not on the weights, it's on the activations. So this is going to make sure that our activations 02:02:31.820 |
are never too high. And then this one's really interesting. This is called temporal activation 02:02:36.540 |
regularization. This checks how much does each activation change by from sequence step 02:02:46.020 |
to sequence step, and then take the square of that. So this is regularizing the RNN to 02:02:56.300 |
say try not to have things that massively change from time step to time step. Because 02:03:02.900 |
if it's doing that, that's probably not a good sign. So that's our RNN trainer callback. 02:03:10.740 |
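Sketches of both, leaning on the Callback base class from earlier in the course (attribute names like run.loss and run.pred follow that framework, so adjust to taste); the clipping itself is just the built-in nn.utils.clip_grad_norm_:

```python
import torch.nn as nn

class GradientClipping(Callback):
    def __init__(self, clip=None): self.clip = clip
    def after_backward(self):
        # If the total norm of all gradients exceeds `clip`, scale them all down proportionally.
        if self.clip: nn.utils.clip_grad_norm_(self.run.model.parameters(), self.clip)

class RNNRegularizer(Callback):            # the notebook calls this RNNTrainer
    def __init__(self, alpha=0., beta=0.): self.alpha, self.beta = alpha, beta

    def after_pred(self):
        # The language model returns (decoded, raw_outputs, outputs); keep the extras around.
        self.raw_out, self.out = self.pred[1], self.pred[2]
        self.run.pred = self.pred[0]

    def after_loss(self):
        # AR: an L2 penalty on the activations themselves.
        if self.alpha != 0.: self.run.loss += self.alpha * self.out[-1].float().pow(2).mean()
        # TAR: penalize activations that change a lot from one time step to the next.
        if self.beta != 0.:
            h = self.raw_out[-1]
            if h.size(1) > 1:
                self.run.loss += self.beta * (h[:, 1:] - h[:, :-1]).float().pow(2).mean()
```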
We set up our loss functions, which are just normal cross-entropy loss, and also a metric 02:03:14.340 |
which is normal accuracy. But we just make sure that our batch and sequence length is 02:03:19.140 |
all flattened. So we can create our language model, add our callbacks, and fit. So once 02:03:27.620 |
we've got all that, we can use it to train that language model on WikiText-103. So I'm 02:03:33.260
not going to go through this, because it literally just uses what's in the previous notebook. 02:03:37.260 |
But this shows you here's how you can download WikiText-103, split it into articles, create 02:03:43.900
the text lists, split into train and valid, tokenize, numericalize, data bunchify, create 02:03:52.980 |
the model that we just saw and train it for, in this case, about five hours. Because it's 02:04:02.900 |
quite a big model. So because we don't want you to have to train for five hours this RNN, 02:04:12.260 |
you will find that you can download that small pre-trained model from this link. So you can 02:04:20.180 |
now use that on IMDB. So you can, again, grab your IMDB data set, download that pre-trained 02:04:27.300 |
model, load it in. And then we need to do one more step, which is that the embedding 02:04:37.940 |
matrix for the pre-trained WikiText-103 model is for a different bunch of words to the IMDB 02:04:46.460
version. So they've got different vocabs with some overlap. So I won't go through the code, 02:04:52.180 |
but what we just do is we just go through each vocab item in the IMDB vocab, and we 02:05:00.180 |
find out if it's in the WikiText-103 vocab, and if it is, we copy WikiText-103's 02:05:06.860
embedding over. So that way we'll end up with an embedding matrix for IMDB that 02:05:14.220
is the same as the WikiText-103 embedding matrix. Any time there's a word that's the same, and 02:05:21.260
any time there's a word that's missing, we're just going to use the mean bias and the mean 02:05:26.140 |
weights. So that's all that is. 02:05:35.780 |
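The idea in rough code is below; a hedged sketch where `match_embeds`, `old_vocab`, `new_vocab`, and `old_weights` are hypothetical names for the remapping function, the two token lists, and the pretrained embedding matrix:

```python
import torch

# A hedged sketch of remapping a pretrained embedding matrix onto a new vocab:
# copy the row for any word that exists in the old vocab, and fall back to the
# mean embedding for words that don't (the decoder bias is handled the same way).
def match_embeds(old_weights, old_vocab, new_vocab):
    old_idx = {w: i for i, w in enumerate(old_vocab)}
    mean = old_weights.mean(0)
    new_weights = old_weights.new_zeros(len(new_vocab), old_weights.size(1))
    for i, w in enumerate(new_vocab):
        j = old_idx.get(w)
        new_weights[i] = old_weights[j] if j is not None else mean
    return new_weights
```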
Okay, so once we've done that, we can then define a splitter just like before to create our layer groups. We can set up our callbacks, our learner, 02:05:41.500 |
we can fit, and so then we'll train that for an hour or so, and at the end of that we have 02:05:49.780 |
a fine-tuned IMDB language model. So now we can load up our classifier data bunch, which 02:06:01.300 |
we created earlier. That's exactly the same lines of code we had before. I'm going to 02:06:08.420 |
ignore this pack_padded_sequence stuff, but basically there's a neat little trick in PyTorch 02:06:14.780 |
where you can take data that's of different lengths and call pack_padded_sequence, pass 02:06:21.420 |
that to an RNN, and then call pad_packed_sequence, and it basically takes things of different 02:06:28.700 |
lengths and kind of optimally handles them in an RNN. 02:06:35.460 |
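Here's a tiny stand-alone example of that packing trick, with made-up shapes, separate from the notebook's AWD-LSTM code:

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Three padded sequences of different real lengths go through an LSTM
# without wasting work on the padding positions.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

batch = torch.randn(3, 5, 8)           # 3 sequences padded to length 5
lengths = torch.tensor([5, 3, 2])      # their real (unpadded) lengths

packed = pack_padded_sequence(batch, lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, _ = lstm(packed)           # the LSTM skips the padded positions
out, out_lens = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)                       # torch.Size([3, 5, 16])
```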
So we basically update our AWD LSTM to use that. You might remember that for ULMFiT, 02:06:52.460 |
we kind of create our hidden state in the LSTM for lots of time steps, and we want to 02:06:57.420 |
say, "Oh, which bit of state do we actually want to use for classification?" People used 02:07:02.780 |
to basically use the final state. Something that I tried, and it turned out to work really 02:07:07.900 |
well, so it ended up in the paper, was that we actually do an average pool and a max pool 02:07:16.380 |
and use the final state, and we concatenate them all together. So this is like the concat 02:07:22.940 |
pooling we do for images. We do the same kind of thing for text. 02:07:30.140 |
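A simplified version of that pooling is sketched below; it's hedged and ignores the padding masking that the real code has to handle, and `concat_pool` is a hypothetical name:

```python
import torch

# A hedged sketch of concat pooling for text: given the encoder outputs for
# every time step, shape (batch, seq_len, hidden), concatenate the final
# hidden state with an average pool and a max pool over the sequence.
def concat_pool(outputs):
    last = outputs[:, -1]                     # final time step
    avg = outputs.mean(dim=1)                 # average over the sequence
    mx = outputs.max(dim=1).values            # max over the sequence
    return torch.cat([last, avg, mx], dim=1)  # (batch, 3 * hidden)

x = torch.randn(4, 70, 300)
print(concat_pool(x).shape)                   # torch.Size([4, 900])
```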
So we put all that together. This is just checking that everything looks sensible, and 02:07:36.540 |
that gives us something that we call the pooling linear classifier, which is just a list of 02:07:43.100 |
batch norm dropout linear layers and our concat pooling, and that's about it. So we just go 02:07:53.660 |
through our sentence, one BPTT at a time, and keep calling that thing and keep appending the outputs. 02:08:02.020 |
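The head itself is roughly a stack of BatchNorm, Dropout, and Linear blocks on top of those pooled features. This is a hedged sketch: `bn_drop_lin` is written out here for illustration, and the layer sizes and dropout probabilities are made up, not the notebook's exact hyperparameters.

```python
from torch import nn

# A hedged sketch of the classification head: BatchNorm -> Dropout -> Linear
# blocks applied to the concat-pooled features.
def bn_drop_lin(n_in, n_out, p=0.0, act=None):
    layers = [nn.BatchNorm1d(n_in), nn.Dropout(p), nn.Linear(n_in, n_out)]
    if act is not None:
        layers.append(act)
    return layers

head = nn.Sequential(
    *bn_drop_lin(3 * 300, 50, p=0.2, act=nn.ReLU(inplace=True)),
    *bn_drop_lin(50, 2, p=0.1),               # 2 classes: negative / positive
)
```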
So once we've done all that, we can train it. So here's our normal set of callbacks. 02:08:07.300 |
We can load our fine-tuned encoder, and we can train, and 92% accuracy, which is pretty 02:08:19.660 |
close to where the state-of-the-art was a very small number of years ago, and this is 02:08:25.100 |
not quite the 94.5% or 95% or so that we got for the paper, because that 02:08:30.460 |
used a bigger model that we trained for longer. 02:08:36.740 |
So that was a super-fast zip through ULMFiT, and there's plenty of stuff that is probably worth 02:08:49.700 |
reading in more detail, and we can answer questions on the forum as well. So let's spend 02:08:54.380 |
the last 10 minutes talking about Swift, because the next two classes are going to be about 02:08:57.860 |
Swift. So I think anybody who's got to lesson 12 in this course should be learning Swift 02:09:07.660 |
for TensorFlow. The reason why is I think, basically, that Python's days are numbered. That stuff I 02:09:13.500 |
showed you about JIT. The more I use JIT, the more I think about it, the more it looks like 02:09:18.580 |
failed approaches to software development I've seen over the last 25 years. Whenever people 02:09:23.220 |
try to convert one language into a different language, you end up kind of writing in a 02:09:26.740 |
language that you're not really using, and it takes brilliant, brilliant people like the PyTorch 02:09:33.420 |
team years to make it almost kind of work. So I think Julia or Swift will eventually 02:09:43.300 |
in the coming years take over. I just don't think Python can survive, because we can't 02:09:48.420 |
write CUDA kernels in Python. We can't write RNN cells in Python and have them work reliably 02:09:54.420 |
and fast. DL libraries change all the time anyway, so if you're spending all your time 02:09:59.220 |
just studying one library and one language, then you're not going to be ready for that 02:10:03.060 |
change. So you'll need to learn something new anyway. It'll probably be Swift or Julia, 02:10:11.940 |
and I think they're both perfectly good things to look at. Regardless, I've spent time using, 02:10:18.780 |
in real-world scenarios, at least a couple of dozen languages, and every time I learn 02:10:23.980 |
a new language, I become a better developer. So it's just a good idea to learn a new language. 02:10:31.340 |
And the "for TensorFlow" bit might put you off a bit, because I've complained a lot 02:10:31.340 |
about TensorFlow, but TensorFlow in the future is going to look almost totally 02:10:34.780 |
different to TensorFlow in the past. The things that are happening with Swift for TensorFlow 02:10:44.740 |
are so exciting. So there's basically almost no data science ecosystem for Swift, which 02:10:53.260 |
means the whole thing is open for you to contribute to. So you can make serious contributions, 02:10:58.740 |
look at any Python little library or just one function that doesn't exist in Swift and 02:11:04.980 |
write it. The Swift community doesn't have people like us. There are people who understand 02:11:14.220 |
deep learning, they're just generally not people who are in the Swift community right 02:11:17.300 |
now with some exceptions. So we are valued. And you'll be working on stuff that will look 02:11:26.500 |
pretty familiar, because we're building something a lot like fast AI, but hopefully much better. 02:11:32.660 |
So with that, I have here Chris Lattner (come on over), who started the Swift project 02:11:40.540 |
and is now running the Swift for TensorFlow team at Google. And we have time for, I think, three 02:11:46.200 |
questions from the community for Chris and me. 02:11:49.020 |
>> Sure. Assuming someone has zero knowledge of Swift, what would be the most efficient 02:11:53.260 |
way to learn it and get up to speed with using Swift for TensorFlow? 02:11:57.220 |
>> Sure. So the courses we're teaching will assume they don't have prior Swift experience, 02:12:03.220 |
but if you're interested, you can go to Swift.org. In the documentation tab, there's a whole book 02:12:07.200 |
online. The thing I recommend is there's a thing called A Swift Tour. You can just Google 02:12:13.060 |
for that. It gives you a really quick sense of what it looks like. And it explains the 02:12:18.180 |
basic concepts. It's super accessible. That's where I want to start. 02:12:21.420 |
>> The best version of the Swift book is on the iPad. It uses something called Swift Playgrounds, 02:12:26.260 |
which is one of these amazing things that Chris built, which basically lets you go through 02:12:29.380 |
the book in a very interactive way. It will feel a lot like the experience of using a 02:12:33.900 |
Jupyter notebook, but it's even more fancy in some ways. So you can read the book as an interactive experience. 02:12:39.660 |
>> As Swift for TensorFlow evolves, what do you think will be the first kind of machine 02:12:45.100 |
learning work accessible to people who don't have access to big corporate data centers where 02:12:50.300 |
Swift for TensorFlow's particular strengths will make it a better choice than the more established Python-based options? 02:12:56.180 |
>> Sure. I don't know what that first thing will be. But I think you have to look at the 02:13:02.100 |
goals of the project. And I think there's two goals for this project overall. One is to 02:13:06.220 |
be very subtractive, subtractive of complexity. And I think that one of the things that Jeremy's 02:13:12.220 |
highlighting is that in practice, being effective in the machine learning field means you end 02:13:17.080 |
up doing a lot of weird things at different levels. And so you may be dropping down to 02:13:21.020 |
C++ or writing CUDA code, depending on what you're doing. Or playing with these other 02:13:26.900 |
systems or these other C libraries that get wrapped up with Python. But these become leaky 02:13:32.420 |
abstractions you have to deal with. So we're trying to make it so you don't have to deal 02:13:35.300 |
with a lot of that complexity. So you can stay in one language. It works top to bottom. 02:13:40.020 |
It's fast, has lots of other good things to go with it. So that's one aspect of it. The 02:13:44.020 |
other pieces, we're thinking about it from the bottom up, including the compiler bits, 02:13:48.580 |
all the systems integration pieces, the application integration pieces. And I have a theory that 02:13:53.900 |
once we get past the world of Python here, that people are going to start doing a lot 02:13:57.820 |
of really interesting things where you integrate deep learning into applications. And right 02:14:03.300 |
now the application world and the ML world are different. I mean, people literally export 02:14:09.860 |
their model into like an ONNX or TF serving or whatever, and dump it into some C++ thing 02:14:16.060 |
where it's a whole new world. It's a completely different world. And so now you have this 02:14:20.060 |
barrier between the training, the learning and the ML pieces. And you have the application 02:14:26.700 |
pieces. And often these are different teams or different people thinking about things 02:14:30.060 |
in different ways. And breaking down those kinds of barriers, I think is a really big 02:14:34.180 |
opportunity that enables new kinds of work to be done. 02:14:38.060 |
And that leads well into the next pair of questions. Does it make sense to spend effort 02:14:43.720 |
learning and writing in Swift only, or is it worth having some understanding of C++ 02:14:48.180 |
as well to be good at numerical computation? And then secondly, after going through some 02:14:53.180 |
of the Swift documentations, it seems like it's a very versatile language. If I understand 02:14:57.820 |
correctly, deep learning, robotics, web development, and systems programming all seem well under 02:15:02.260 |
its purview. Do you foresee Swift's influence flourishing in all these separate areas and 02:15:07.080 |
allowing for a tighter and more fluid development between disciplines? 02:15:10.700 |
Sure. So I think these are two sides of the same coin. I totally agree with Jeremy. Learning 02:15:16.500 |
new programming languages is good just because often you learn to think about things in a 02:15:23.060 |
new way, or they open up new kinds of approaches, and having more different kinds of mental 02:15:28.900 |
frameworks gives you the ability to solve problems that otherwise you might not be able 02:15:32.140 |
to do. And so learning C++ in the abstract is a good thing. Having to use C++ is a little 02:15:41.620 |
bit of a different thing, in my opinion. And so C++ has lots of drawbacks. 02:15:45.220 |
This is coming from somebody who's written a C++ compiler. 02:15:47.780 |
I've written way too much C++ myself, and maybe I'm a little bit damaged here, but C++ 02:15:53.380 |
is a super complicated language. It's also full of memory safety problems and security 02:15:58.460 |
vulnerabilities and a lot of other things that are pretty well known. It's a great language. 02:16:03.100 |
It supports tons of really important work, but one of the goals with Swift is to be a 02:16:07.580 |
full stack language and really span from the scripting all the way down to the things C++ 02:16:12.460 |
is good at, and getting C++-level performance in the same language that you can do high-level 02:16:18.380 |
machine learning frameworks in is pretty cool. I think that that's one of the really unique 02:16:25.380 |
aspects of Swift, is it was designed for compilation, for usability, for accessibility, and I'm 02:16:32.300 |
not aware of a system that's similar in that way. 02:16:37.500 |
I think we're out of time. Sorry. Yeah. One more question next time. Thanks, everybody. 02:16:45.020 |
Thank you, Chris Lattner, and we'll see you next week.