
Lesson 12 (2019) - Advanced training techniques; ULMFiT from scratch


Chapters

0:00 Introduction
1:05 Learner refactor
3:43 Mixup
7:57 Data augmentation
18:40 Label smoothing
21:50 Half precision floating point
23:40 Nvidia Apex
24:15 Loss scale
26:05 Mixup questions
28:00 ResNet
31:50 Conv layer
36:15 Res blocks
46:45 Results
48:35 Transfer learning
50:05 Training from scratch

Transcript

Welcome to lesson 12. Wow, we're moving along. And this is an exciting lesson because it's where we're going to wrap up all the pieces both for computer vision and for NLP. And you might be surprised to hear that we're going to wrap up all the pieces for NLP because we haven't really done any NLP yet.

But actually everything we've done is equally applicable to NLP. So there's very little to do to get a state-of-the-art result on IMDB sentiment analysis from scratch. So that's what we're going to do. Before we do, let's finally finish off this slide we've been going through for three lessons now.

I promised, or almost promised, that we would get something state-of-the-art on ImageNet. Turns out we did. So you're going to see that today. So we're going to finish off mixup, label smoothing, and ResNets. Okay, so let's do it. Before we look at the new stuff, let's look at notebook 09b, the learner, where I've made a couple of minor changes that I thought you might be interested in.

It's the kind of thing that happens as you refactor. So remember last week we refactored the learner to get rid of that awful separate runner, so there's now just one thing, which made a lot of our code a lot easier. There was still this concept left behind that when you started fitting, you had to tell each callback what its learner or runner was.

Because they're all permanently attached now, I've moved that to the init. And so now you can call add_cbs to add a whole bunch of callbacks, or add_cb to add one callback, and that all happens automatically. That's a very minor thing.

More interesting was when I did this little reformatting exercise where I took all these callbacks that used to be on the line underneath the thing before them and lined them up over here and suddenly realized that now I can answer all the questions I have in my head about our callback system, which is what exactly are the steps in the training loop?

What exactly are the callbacks that you can use in the training loop? Which step goes with which callback? Which steps don't have a callback? Are there any callbacks that don't have a step? So it's one of these interesting things where I really don't like the idea of automating your formatting and creating rules for formatting when something like this can just, as soon as I did this, I understood my code better.

And for me, understanding my code is the only way to make it work. Because debugging machine learning code is awful. So you've got to make sure that the thing you write makes sense. It's got to be simple. It's got to be really simple. So this is really simple. Then more interestingly, we used to create the optimizer in init.

And you could actually pass in an already created optimizer. I removed that. And the only thing now you can pass in is an optimization function. So something that will create an optimizer, which is what we've always been doing anyway. And by doing that, we can now create our optimizer when we start fitting.

And that turns out to be really important. Because when we do things like discriminative learning rates and gradual unfreezing and layer groups and stuff, we can change things. And then when we fit, it will all just work. So that's a more significant -- it's like one line of code, but it's conceptually a very significant change.

Okay. So that's some minor changes to 9B. And now let's move on to mixup and label smoothing. So I'm really excited about the stuff we saw at the end of the last lesson where we saw how we can use the GPU to do data augmentation. Fully randomized, fully GPU accelerated data augmentation using just plain PyTorch operations.

I think that's a big win. But it's quite possible we don't need that kind of data augmentation anymore. Because in our experimentation with this data augmentation called mixup, we found we can remove most other data augmentation and get amazingly good results. So it's just a kind of a simplicity result.

And also when you use mixup, you can train for a really long time and get really good results. So let me show you mixup. And in terms of the results, you can get -- what happened in the bag of tricks paper was they -- when they turned mixup on, they also started training for 200 epochs instead of 120.

So be a bit careful when you interpret their paper's table when it goes from label smoothing at 94.1 to mixup without distillation at 94.6: they're also nearly doubling the number of epochs they train for. But you can kind of get a sense that you can get a big decrease in error. The other thing they mention in the paper is distillation.

I'm not going to talk about that because it's a thing where you pre-train some much bigger model like a ResNet-152, and then you try and train something that predicts the output of that. The idea of training a really big model, to train a smaller model, it's interesting, but it's not exactly training in the way I normally think about it.

So we're not looking at distillation. It would be an interesting assignment if somebody wanted to try adding it to the notebooks though. You have all the information and I think all the skills you need to do that now. All right. So mixup: we start by grabbing our Imagenette data set, and we grab the make-RGB and resize transforms and turn it into a float tensor.

This is just our quick and dirty resize, we're already doing this for testing purposes. Split it up, create a data bunch, all the normal stuff. But what we're going to do is we're going to take an image like this and an image like this and we're going to combine them.

We're going to take 0.3 times this image plus 0.7 times this image, and this is what it's going to look like. Unfortunately, Sylvain and I have different orderings of file names on our machines, so I wrote that it's a French horn and a tench, but Sylvain's clearly aren't a French horn or a tench; you get the idea.

It's a mixup of two different images. So we're going to create a data augmentation where every time we predict something we're going to be predicting a mix of two things like this. So we're going to take the linear combination, 0.3 and 0.7, of the two images, but then we're going to have to do that for the labels as well, right?

There's no point predicting the one-hot encoded output of this breed of doggy when there's also a bit of a gas pump in there. So we're not going to have a one-hot encoded output; we're going to have a 0.7-encoded doggy and a 0.3-encoded gas pump.

So that's the basic idea. So the mixup paper was super cool. Wow, there are people talking about things that aren't deep learning. I guess that's their priorities. So the paper's a pretty nice, easy read by paper standards and I would definitely suggest you check it out. So I've told you what we're going to do, implementation-wise, we have to decide what number to use here.

Is it 0.3 or 0.1 or 0.5 or what? And this is a data augmentation method, so the answer is we'll randomize it. But we're not going to randomize it from 0 to 1 uniform or 0 to 0.5 uniform, but instead we're going to randomize it using shapes like this.

In other words, when we grab a random number, most of the time it'll be really close to 0 or really close to 1, and just occasionally it'll be close to 0.5. So that way most of the time it'll be pretty easy for our model because it'll be predicting one and only one thing, and just occasionally it'll be predicting something that's a pretty evenly mixed combination.

So the ability to grab random numbers where this is basically the smoothed histogram of how often we're going to see those numbers is called sampling from a probability distribution. And basically in nearly all these cases you can start with a uniform random number or a normal random number and put it through some kind of function or process to turn it into something like this.

So the details don't matter at all. But the paper points out that this particular shape is nicely characterized by something called the beta distribution, so that's what we're going to use. So it was interesting drawing these because it requires a few interesting bits of math, which some of you may be less comfortable with or entirely uncomfortable with.

For me, every time I see this function, which is called the gamma function, I kind of break out in sweats, not just because I've got a cold, but it's like the idea of functions that I don't-- like how do you describe this thing? But actually, it turns out that like most things, once you look at it, it's actually pretty straightforward.

And we're going to be using this function, so I'll just quickly explain what's going on. We're going to start with a factorial function, so 1 times 2 times 3 times 4, whatever, right? And here these red dots is just the value of the factorial function for a few different places.

But don't think of the factorial function as being 1 times 2 times 3 times 4 up to n, whatever; instead divide both sides by n, and now you've got factorial of n divided by n equals 1 times 2 times 3 and so on, so it equals the factorial of n minus 1.

And so when you define it like that, you suddenly realize there's no reason you can't have a function that's defined not just on the integers, but everywhere. This is the point where I stop with the math, right? Because to me, if I need a sine function, or a log function, or an exponential function, or whatever, I type it into my computer and I get it, right?

So the actual how you get it is not at all important. But the fact of knowing what these functions are and how they're defined is useful. PyTorch doesn't have this function. Weirdly enough, they have a log gamma function. So we can take log gamma and go e to the power of that to get a gamma function.
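As a rough sketch, here's that trick in code, written with the ASCII name (the notebook itself uses the Greek letter, which is what comes up next); the exact helper in the notebook may be spelled a little differently:

```python
import torch

def gamma(x):
    # PyTorch has lgamma (log-gamma) but no gamma, so exponentiate the log-gamma
    return torch.lgamma(x).exp()

# sanity check: gamma(n) == (n-1)!, e.g. gamma(5) == 24
print(gamma(torch.tensor(5.)))  # ≈ 24
```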

And you'll see here, I am breaking my no Greek letters rule. And the reason I'm breaking that rule is because a function like this doesn't have a kind of domain-specific meaning, or a pure physical analogy, which is how we always think about it. It's just a math function. And so we call it gamma, right?

And so if you're going to call it gamma, you may as well write it like that. And why this matters is when you start using it. Like look at the difference between writing it out with the actual Unicode and operators versus what would happen if you wrote it out long form in Python.

Like when you're comparing something to a paper, you want something that you can look at and straight away say like, oh, that looks very familiar. And as long as it's not familiar, you might want to think about how to make it more familiar. So I just briefly mentioned that writing these math symbols nowadays is actually pretty easy.

On Linux, there's a thing called a compose key which is probably already set up for you. And if you Google it, you can learn how to turn it on. And it's basically like you'll press like the right alt button or the caps lock button. You can choose what your compose key is.

And then a few more letters. So for example, all the Greek letters are compose, then star, then the English letter that corresponds with it. So for example, if I want to do lambda, I would go compose, star, L. So it's just as quick as typing non-Unicode characters.

Most of the Greek letters are available on a Mac keyboard just with option. Unfortunately, nobody's created a decent compose key for Mac yet. There's a great compose key for Windows called WinCompose. Anybody who's working with, you know, Greek letters should definitely install and learn to use these things.

So there's our gamma function nice and concise. It looks exactly like the paper. And so it turns out that this is how you calculate the value of the beta function, which is the beta distribution. And so now here it is. So as I said, the details aren't important, but they're the tools that you can use.

The basic idea is that we now have something where we can pick some parameter, which is called alpha, where if it's high, then it's much more likely that we get a equal mix. And if it's low, it's very unlikely. And this is really important because for data augmentation, we need to be able to tune a lever that says how much regularization am I doing?

How much augmentation am I doing? So you can move your alpha up and down. And the reason it's important to be able to print these plots out is that when you change your alpha, you want to plot it out and see what it looks like, right? Make sure it looks sensible, okay?

So it turns out that all we need to do then is we don't actually have to 0.7 hot encode one thing and 0.3 hot encode another thing. It's actually identical to simply go, I guess it is lambda times the first loss plus 1 minus lambda times the second loss.

I guess we're using t here. So that's actually all we need to do. So this is our mixup. And again, as you can see, we're using the same letters that we'd expect to see in the paper. So everything should look very familiar. And mixup, remember, is something which is going to change our loss function.

So we need to know what loss function to change. So when you begin fitting, you find out what the old loss function on the learner was and you store it away. And then when we calculate the loss, we can just go ahead and say, oh, if we're in validation, there's no mixup involved.

And if we're training, then we'll calculate the loss on two different sets of images. One is just the regular mini-batch, and for the second we'll take the same mini-batch, randomly permute it, and mix each image with the randomly chosen one it lines up with. So we do that for the images, and we do that for the losses.

And that's basically it. A couple of minor things to mention. In the last lesson, I created an EWMA function, an exponentially weighted moving average function, which is a really dumb name for it, because actually it was just a linear combination of two things. It was v1 times alpha plus v2 times 1 minus alpha.

You create exponentially weighted moving averages with it by applying it multiple times, but the actual function is a linear combination, so I've renamed it to linear combination, and you'll see that in so many places. So this mixup is a linear combination of our actual images and some randomly permuted images in that mini-batch.

And our loss is a linear combination of the loss of our two different parts, our normal mini-batch and our randomly permuted mini-batch. One of the nice things about this is if you think about it, this is all being applied on the GPU. So this is pretty much instant. So super powerful augmentation system, which isn't going to add any overhead to our code.

One thing to be careful of is that we're actually replacing the loss function, and loss functions have something called a reduction. For most PyTorch loss functions you can say, after calculating the loss for everything in the mini-batch, either return a rank-1 tensor of all of the losses for the mini-batch, or add them all up, or take the average.

We pretty much always take the average. But we just have to make sure that we do the right thing. So I've just got a little function here that does the mean or sum, or nothing at all, as requested. And so then we need to make sure that we create our new loss function, that at the end, it's going to reduce it in the way that they actually asked for.

But then we have to turn off the reduction when we actually do mixup, because we actually need to calculate the loss on every image for both halves of our mixup. So this is a good place to use a context manager, which we've seen before. So we just created a tiny little context manager, which will just find out what the previous reduction was, save it away, get rid of it, and then put it back when it's finished.

So there's a lot of minor details there. But with that in place, the actual mixup itself is very little code. It's a single callback. And we can then run it in the usual way. Just add mixup. Our default alpha here is 0.4. And I've been mainly playing with alpha at 0.2, so this is a bit more than I'm used to.

But somewhere around that vicinity is pretty normal. So that's mixup. And that's like-- it's really interesting, because you could use this for layers other than the input layer. You could use it on the first layer, maybe with the embeddings. So you could do mixup augmentation in NLP, for instance.

That's something which people haven't really dug into deeply yet. But it seems to be an opportunity to add augmentation in many places where we don't really see it at the moment. Which means we can train better models with less data, which is why we're here. So here's a problem.

How does Softmax interact with this? So now we've drawn some random number lambda. It's 0.7. So I've got 0.7 of a dog and 0.3 of a gas station. And the correct answer would be a rank one tensor which has 0.7 in one spot and 0.3 in the other spot and 0 everywhere else.

Softmax isn't going to want to do that for me, because softmax really wants just one of my values to be high, because it's got an e in the numerator, as we've talked about. So to really use mixup well, and not just mixup, but any time you're not 100% sure the labels on your data are correct, you don't want to be asking your model to predict 1.

You don't want it to say, I'm 100% sure it's this label, because you've got label noise, you've got incorrect labels, or you've got mixup mixing things together, or whatever. So instead, we say, oh, don't use one-hot encoding for the dependent variable, but use a little bit less than one-hot encoding.

So say 0.9 hot encoding. So then the correct answer is to say, I'm 90% sure this is the answer. And then all of your probabilities have to add to one. So then all of the negatives, you just put 0.1 divided by n minus one, and all the rest. And that's called label smoothing.

And it's a really simple but astonishingly effective way to handle noisy labels. I keep on hearing people saying, oh, we can't use deep learning in this medical problem, because the diagnostic labels in the reports are not perfect, and we don't have a gold standard and whatever. It actually turns out that, particularly if you use label smoothing, noisy labels are generally not a problem.

Like, there's plenty of examples of people using this where they literally randomly permute half the labels to make them like 50% wrong, and they still get good results, really good results. So don't listen to people in your organization saying, we can't start modeling until we do all this cleanup work.

Start modeling right now. See if the results are OK. And if they are, then maybe you can skip all the cleanup work, or do it simultaneously. So label smoothing ends up just being the cross-entropy loss as before times 1 minus epsilon (so 0.9 if epsilon is 0.1), plus epsilon (0.1) times the cross-entropy summed over every class divided by n.

And the nice thing is that's another linear combination. So once you kind of create one of these little mathematical refactorings that tend to pop up everywhere and make your code a little bit easier to read and a little bit harder to stuff up, every time I have to write a piece of code, there's a very high probability that I'm going to screw it up.

So the less I have to write, the less debugging I'm going to have to do later. So we can just pop that in as a loss function and away we go. So that's a super powerful technique which has been around for a couple of years, those two techniques, but not nearly as widely used as they should be.

Then if you're using a Volta, Tensor Core, 2080, any kind of pretty much any current generation Nvidia graphics card, you can train using half precision floating point in theory like 10 times faster. In practice it doesn't quite work out that way because there's other things going on, but we certainly often see 3x speedups.

So the other thing we've got is some work here to allow you to train in half precision floating point. Now the reason it's not as simple as saying model.half, which would convert all of your weights and biases and everything to half precision floating point, is because of this. This is from Nvidia's materials and what they point out is that you can't just use half precision everywhere because it's not accurate, it's bumpy.

So it's hard to get good useful gradients if you do everything in half precision, particularly often things will round off to zero. So instead what we do is we do the forward pass in FP16, we do the backward pass in FP16, so all the hard work is done in half precision floating point, and pretty much everywhere else we convert things to full precision floating point and do everything else in full precision.

So for example, when we actually apply the gradients by multiplying the value of the learning rate, we do that in FP32, single precision. And that means that if your learning rate's really small, in FP16 it might basically round down to zero, so we do it in FP32. In FastAI version one, we wrote all this by hand.

For the lessons, we're experimenting with using a library from Nvidia called Apex. Apex basically has some of the functions to do this for you. So we're using it here, and basically you can see there's a thing that converts the model to half precision: we just go model to half, but batch norm goes to float, and so forth.

So these are not particularly interesting, but they're just going through each one and making sure that the right layers have the right types. So once we've got those kind of utility functions in place, the actual callback's really quite small and you'll be able to map every stage to that picture I showed you before.

So you'll be able to see when we start fitting, we convert the network to half-precision floating point, for example. One of the things that's kind of interesting is there's something here called loss scale. After the backward pass, well probably more interestingly, after the loss is calculated, we multiply it by this number called loss scale, which is generally something around 512.

The reason we do that is that losses tend to be pretty small in a region where half-precision floating point's not very accurate. So we just multiply it by 512, put it in a region that is accurate. And then later on, in the backward step, we just divide by that again.

So that's a little tweak, but it's the difference we find generally between things working and not working. So the nice thing is now, we have something which you can just add mixed precision and train and you will get often 2x, 3x speed up, certainly on vision models, also on transformers, quite a few places.
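The core idea, stripped of the Apex and callback machinery, is something like the rough sketch below: forward and backward in FP16 with the loss scaled up, and the actual update on FP32 master weights. This is only an illustration under those assumptions; the real callback also keeps batch norm in FP32 and handles a lot more detail:

```python
import torch

loss_scale = 512

def mixed_precision_step(model_fp16, master_params_fp32, opt, loss_func, xb, yb):
    # forward and backward in half precision, with the loss scaled up so that
    # small gradients don't round down to zero in FP16
    loss = loss_func(model_fp16(xb.half()), yb)
    (loss * loss_scale).backward()

    # copy the FP16 gradients onto the FP32 master parameters, undoing the scaling
    for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
        if p16.grad is not None:
            p32.grad = p16.grad.float() / loss_scale

    opt.step()                      # the weight update itself happens in FP32
    opt.zero_grad()
    model_fp16.zero_grad()

    # copy the updated FP32 weights back into the FP16 model for the next batch
    with torch.no_grad():
        for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
            p16.copy_(p32)
```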

One obvious question is, is 512 the right number? And it turns out getting this number right actually does make quite a difference to your training. And so something slightly more recently is called dynamic loss scaling, which literally tries a few different values of loss scale to find out at what point does it become infinity.

And so it dynamically figures out the highest loss scale we can go to. And so this version just has the dynamic loss scaling added. It's interesting that sometimes training with half-precision gives you better results than training with FP32 because there's just, I don't know, a bit more randomness. Maybe it regularizes a little bit, but generally it's super, super similar, just faster.

We have a question about mixup. Great. Is there an intuitive way to understand why mixup is better than other data augmentation techniques? I think one of the things that's really nice about mixup is that it doesn't require any domain-specific thinking. Do we flip horizontally or also vertically? How much can we rotate?

It doesn't create any kind of lossiness, like in the corners, there's no reflection padding or black padding. So it's kind of quite nice and clean. It's also almost infinite in terms of the number of different images it can create. So you've kind of got this permutation of every image with every other image, which is already giant, and then in different mixes.

So it's just a lot of augmentation that you can do with it. And there are other similar things. So there's another thing which, there's something called cutout where you just delete a square and replace it with black. There's another one where you delete a square and replace it with random pixels.

Something I haven't seen, but I'd really like to see people do, is to delete a square and replace it with a different image. So I'd love somebody to try doing mix-up, but instead of taking the linear combination, instead pick an alpha-sized, sorry, a lambda percent of the pixels, like in a square, and paste them on top.

There's another one which basically takes four different images and puts them in the four corners. So there are a few different variations. And they really get great results, and I'm surprised how few people are using them. So let's put it all together. So here's Imagenette. So let's use our random resize crop; a minimum scale of 0.35 we find works pretty well.

And we're not going to do any other, other than flip, we're not going to do any other augmentation. And now we need to create a model. So far, all of our models have been boring convolutional models. But obviously what we really want to be using is a resnet model.

We have the xresnet, and there's some debate about whether this is the mutant version of resnet or the extended version of resnet, so you can choose what the x stands for. Basically, the xresnet is the bag-of-tricks resnet. So they have a few suggested tweaks to resnet.

And here they are. So these are their little tweaks. So the first tweak is something that we've kind of talked about, and they call it resnet c. And it's basically, hey, let's not do a big seven by seven convolution as our first layer, because that's super inefficient. And it's just a single linear model, which doesn't have much kind of richness to it.

So instead, let's do three convs in a row, three by three, right? And with three three-by-three convs in a row, if you think about it, the receptive field of that final one is still going to be about seven by seven, right? But it's got there through a much richer set of things that it can learn, because it's a three-layer neural net.

So that's the first thing that we do in our xresnet. So here is xresnet. And when we create it, we set up how many filters are they going to be for each of the first three layers? So the first three layers will start with channels in, inputs. So that'll default to three, because normally we have three channel images, right?

And the number of outputs that we'll use for the first layer will be (that plus one) times eight. Why is that? It's a bit of a long story. One reason is that that gives you 32 at the second layer, which is the same as what the bag of tricks paper recommends.

As you can see. The second reason is that I've kind of played around with this quite a lot to try to figure out what makes sense in terms of the receptive field, and I think this gives you the right amount. The times eight is there because graphics cards like everything to be a multiple of eight.

So if this is not eight, it's probably going to be slower. But one of the things here is now if you have like a one channel input, like black and white, or a five channel input, like some kind of hyperspectral imaging or microscopy, then you're actually changing your model dynamically to say, oh, if I've got more inputs, then my first layer should have more activations.

Which is not something I've seen anybody do before, but it's a kind of really simple, nice way to improve your ResNet for different kinds of domains. So that's the number of filters we have for each layer. So our stem, so the stem is the very start of a CNN.

So our stem is just those three conv layers. So that's all the paper says. What's a conv layer? A conv layer is a sequential containing a bunch of layers, which starts with a conv of some stride, followed by a batch norm, and then optionally followed by an activation function. And for our activation function, we're just going to use ReLU for now, because that's what they're using in the paper.

The batch norm, we do something interesting. This is another tweak from the bag of tricks, although it goes back a couple more years than that. We initialize the batch norm, sometimes to have weights of 1, and sometimes to have weights of 0. Why do we do that? Well, all right.

Have a look here at ResNet D. This is a standard ResNet block. This path here normally doesn't have the conv and the average pool. So pretend they're not there. We'll talk about why they're there sometimes in a moment. But then this is just the identity. And the other goes 1 by 1 conv, 3 by 3 conv, 1 by 1 conv.

And remember, in each case, it's conv batch norm ReLU, conv batch norm ReLU. And then what actually happens is it then goes conv batch norm, and then the ReLU happens after the plus. There's another variant where the ReLU happens before the plus, which is called preact or preactivation ResNet.

Turns out it doesn't work quite as well for smaller models, so we're using the non-preact version. Now, see this conv here? What if we set the batch norm layer weights there to 0? What's going to happen? Well, we've got an input. This is identity. This does some conv, some conv, some conv, and then batch norm where the weights are 0, so everything gets multiplied by 0.

And so out of here comes 0. So why is that interesting? Because now we're adding 0 to the identity block. So in other words, the whole block does nothing at all. That's a great way to initialize a model, right? Because we really don't want to be in a position, as we've seen, where if you've got a thousand layers deep model, that any layer is even slightly changing the variance because they kind of cause the gradients to spiral off to 0 or to infinity.

This way, literally, the entire activations are the same all the way through. So that's what we do. We set the 1, 2, 3 third conv layer to have 0 in that batch norm layer. And this lets us train very deep models at very high learning rates. You'll see nearly all of the academic literature about this talks about large batch sizes because, of course, academics, particularly at big companies like Google and OpenAI and Nvidia and Facebook, love to show off their giant data centers.

And so they like to say, oh, if we do 1,000 TPUs, how big a batch size can we create? But for us normal people, these are also interesting because the exact same things tell us how high a learning rate can we go, right? So the exact same things that let you create really big batch sizes, so you do a giant batch and then you take a giant step, well, we can just take a normal sized batch, but a much bigger than usual step.

And by using higher learning rates, we train faster and we generalize better. And so that's all good. So this is a really good little trick. Okay. So that's conv layer. So there's our stem. And then we're going to create a bunch of res blocks. So a res block is one of these, except this is an identity path, right?

Unless we're doing a resnet 34 or a resnet 18, in which case one of these convs goes away. So resnet 34 and resnet 18 only have two convs here, and resnet 50 onwards have three convs here. And then in resnet 50 and above, in the second conv they actually squish the number of channels down by four and then they expand it back up again.

So it could go like 64 channels to 16 channels to 64 channels. We call it a bottleneck layer. So a bottleneck block is the normal block for larger resnets, and just two three-by-three convs is the normal block for smaller resnets. So you can see in our res block that we pass in this thing called expansion.

It's either one or four. It's one if it's resnet 18 or 34, and it's four if it's bigger, right? And so if it's four -- well, if expansion equals one, then we just add one extra conv, right? Oh, sorry. The first conv is always a one by one, and then we add a three by three conv, or if expansion equals four, we add two extra convs.

So that's what the res blocks are. Now I mentioned that there's two other things here. Why are there two other things here? Well, we can't use standard res blocks all the way through our model, can we? Because a res block can't change the grid size. We can't have a stride two anywhere here, because if we had a stride two somewhere here, we can't add it back to the identity because they're now different sizes.

Also we can't change the number of channels, right? Because if we change the number of channels, we can't add it to the identity. So what do we do? Well, as you know, from time to time, we do like to throw in a stride two, and generally when we throw in a stride two, we like to double the number of channels.

And so when we do that, we're going to add to the identity path two extra layers. We'll add an average pooling layer, so that's going to cause the grid size to shift down by two in each dimension, and we'll add a one by one conv to change the number of filters.

So that's what this is. And this particular way of doing it is specific to the x res net, and it gives you a nice little boost over the standard approach, and so you can see that here. If the number of inputs is different to the number of filters, then we add an extra conv layer, otherwise we just do no op, no operation, which is defined here.

And if the stride is something other than one, we add an average pooling, otherwise it's a no op, and so here is our final res net block calculation. So that's the res block. So tweak for res net d is this way of doing the, they call it a downsampling path.

And then the final tweak is the actual ordering here of where the stride two is. Usually the stride two in normal res net is at the start, and then there's a three by three after that. Doing a stride two on a one by one conv is a terrible idea, because you're literally throwing away three quarters of the data, and it's interesting, it took people years to realize they're literally throwing away three quarters of the data, so the bag of tricks folks said, let's just move the stride two to the three by three, and that makes a lot more sense, right?

Because a stride two, three by three, you're actually hitting every pixel. So the reason I'm mentioning these details is so that you can read that paper and spend time thinking about each of those res net tweaks, do you understand why they did that? Right? It wasn't some neural architecture search, try everything, brainless, use all our computers approach.

So let's sit back and think about how do we actually use all the inputs we have, and how do we actually take advantage of all the computation that we're doing, right? Most of the tweaks are stuff that existed before, and they've cited all of those, but if you put them all together, it's just a nice example of how to think through architecture design.

And that's about it, right? So we create a res net block for every res layer, and so here it is, creating the res net block, and so now we can create all of our res nets by simply saying, this is how many blocks we have in each layer, right?

So resnet 18 is just two, two, two, two; 34 is three, four, six, three; and then the second thing that changes is the expansion factor, which as I said is one for 18 and 34, and four for the bigger ones. So that's a lot of information there, and if you haven't spent time thinking about architecture before, it might take you a few reads and listens to let it sink in, but I think it's a really good idea to spend time thinking about that, and also to, like, experiment, right?

And try to think about what's going on. The other thing to point out here is that this -- the way I've written this, it's like this is the whole -- this is the whole res net, right, other than the definition of conflayer, this is the whole res net. It fits on the screen, and this is really unusual.

Most resnets you see, even without the bag of tricks, are 500, 600, 700 lines of code, right? And if every single line of code has a different arbitrary number, a 16 here and a 32 there and an average pool here and something else there, like, how are you going to get it right?

And how are you going to be able to look at it and say, what if I did this a little bit differently? So for research and for production, you want to get your code refactored like this for your architecture so that you can look at it and say, what exactly is going on, is it written correctly, okay, I want to change this to be in a different layer, how do I do it?

It's really important for effective practitioners to be able to write nice, concise architectures so that you can change them and understand them. Okay. So that's our X res net. We can train it with or without mixup, it's up to us. Label smoothing cross entropy is probably always a good idea, unless you know that your labels are basically perfect.

Let's just create a little resnet 18. And let's check out to see what our model is doing. So we've already got a model summary, but we're just going to rewrite it to use the new version of the learner that doesn't have a runner anymore. And so we can print out and see what happens to our shapes as they go through the model.

And you can change this print mod here to true, and it'll print out the entire blocks and then show you what's going on. So that would be a really useful thing to help you understand what's going on in the model. All right. So here's our architecture. It's nice and easy.

We can tell you how many channels are coming in, how many channels are coming out, and it'll adapt automatically to our data that way. So we can create our learner, we can do our LR find. And now that we've done that, let's create a one cycle learning rate annealing.

So one cycle learning rate annealing, we've seen all this before. We keep on creating these things like 0.3, 0.7 for the two phases or 0.3, 0.2, 0.5 for three phases. So I add a little create phases that will build those for us automatically. This one we've built before. So here's our standard one cycle annealing, and here's our parameter scheduler.

And so one other thing I did last week was I made it that callbacks, you don't have to pass to the initializer. You can also pass them to the fit function, and it'll just run those callbacks to the fit functions. This is a great way to do parameter scheduling.

And there we go. And so 83.2. So I would love to see people beat my benchmarks here. So here's the Imagenette leaderboard. And so far, the best I've got for 128 pixels, 5 epochs, is 84.6. So yeah, we're super close. So maybe with some fiddling around, you can find something that's even better.

And with these kinds of leaderboards, where a lot of these things can train quickly, this is two and a half minutes on a standard, I think it was a GTX 1080 Ti, you can quickly try things out. And what I've noticed is that the results I get in 5 epochs on 128-pixel Imagenette models carry over a lot to full ImageNet training or bigger models.

So you can learn a lot by not trying to train giant models. So compete on this leaderboard to become a better practitioner to try out things, right? And if you do have some more time, you can go all the way to 400 epochs, that might take a couple of hours.

And then of course, we've also got Imagewoof, which is just doggy photos, and is much harder. And actually, this one I find an even better test case, because it's a more difficult data set. 90% is my best for this. So I hope somebody can beat me.

I really do. So we can refactor all that stuff of adding all these different callbacks and stuff into a single function called CNN learner. And we can just pass in an architecture and our data and our loss function and our optimization function and what kind of callbacks do we want, just yes or no.

And we'll just set everything up. And if you don't pass in C in and C out, we'll grab it from your data for you. And then we'll just pass that off to the learner. So that makes things easier. So now if you want to create a CNN, it's just one line of code, adding in whatever we want, except label smoothing, blah, blah, blah.

And so we get the same result when we fit it. So we can see this all put together in this ImageNet training script, which is in fastai, in examples/train_imagenet. And this entire thing will look entirely familiar to you. It's all stuff that we've now built from scratch, with one exception, which is this bit, which is using multiple GPUs.

So we're not covering that. But that's just an acceleration tweak. And you can easily use multiple GPUs by simply using data parallel or distributed training. Other than that, yeah, this is all stuff that you've seen. And there's label smoothing cross-entropy. There's mixup. Here's something we haven't written: save the model after every epoch.

Maybe you want to write that one. That would be a good exercise. So what happens if we try to train this for just 60 epochs? This is what happens. So benchmark results on ImageNet, these are all the Keras and PyTorch models. It's very hard to compare them because they have different input sizes.

So we really should compare the ones with our input size, which is 224. So a standard ResNet -- oh, it scrolled off the screen. So ResNet 50 is so bad, it's actually scrolled off the screen. So let's take ResNet 101 as a 93.3% accuracy. So that's twice as many layers as we used.

And it was also trained for 90 epochs, so trained for 50% longer, 93.3. When I trained this on ImageNet, I got 94.1. So this, like, extremely simple architecture that fits on a single screen and was built entirely using common sense, trained for just 60 epochs, actually gets us even above ResNet 152.

Because that's 93.8. We've got 94.1. So the only things above it were trained on much, much larger images. And also, like, NASNet large is so big, I can't train it. I just keep on running out of memory and time. And Inception ResNet version 2 is really, really fiddly and also really, really slow.

So we've now got, you know, this beautiful nice ResNet, XResNet 50 model, which, you know, is built in this very first principles common sense way and gets astonishingly great results. So I really don't think we all need to be running to neural architecture search and hyperparameter optimization and blah, blah, blah.

We just need to use, you know, good common sense thinking. So I'm super excited to see how well that worked out. So now that we have a nice model, we want to be able to do transfer learning. So how do we do transfer learning? I mean, you all know how to do transfer learning, but let's do it from scratch.

So what I'm going to do is transfer learn from Imagewoof to the pets data set that we used in lesson one. That's our goal. So we start by grabbing Imagewoof. We do the standard data block stuff. Let's use label-smoothing cross-entropy. Notice how we're using all the stuff we've built.

This is our Adam optimizer. This is our label-smoothing cross-entropy. This is the data block API we wrote. So we're still not using anything from fastai v1. This is all stuff where, if you want to know what's going on, you can go back to the previous lessons and see what we built and how we built it, and step through the code.

There's a CNN learner that we just built in the last notebook. These five lines of code I got sick of typing, so let's dump them into a single function called schedule1cycle. It's going to create our phases. It's going to create our momentum annealing and our learning rate annealing and create our schedulers.

So now with that we can just say schedule1cycle with a learning rate, what percentage of the epochs are at the start, batches I should say at the start, and we could go ahead and fit. Okay. For transfer learning we should try and fit a decent model. So I did 40 epochs at 11 seconds per epoch on a 1080ti.

So a few minutes later we've got 79.6% accuracy, which is pretty good, you know, training from scratch for 10 different dog breeds with a ResNet 18. So let's try and use this to create a good pets model that's going to be a little bit tricky because the pets dataset has cats as well, and this model's never seen cats.

And also this model has only been trained on I think less than 10,000 images, so it's kind of unusually small thing that we're trying to do here, so it's an interesting experiment to see if this works. So the first thing we have to do is we have to save the model so that we can load it into a pets model.

So when we save a model, what we do is we grab its state dict. Now we actually haven't written this, but it would be like three lines of code if you wanted to write it yourself, because all it does is literally create a dictionary; an OrderedDict is just a Python standard library dictionary that keeps its order, where the keys are just the names of all the layers, and for a sequential the index of each one, and then you can look up, say, 10.bias, and it just returns the weights.

Okay. So you can easily turn a module into a dictionary, and so then we can create somewhere to save our model, and torch.save will save that dictionary. You can actually just use pickle here, works fine, and actually behind the scenes, torch.save is using pickle, but they kind of like add some header to it to say like it's basically a magic number that when they read it back, they make sure it is a PyTorch model file and that it's the right version and stuff like that, but you can totally use pickle.

And so the nice thing is now that we know that the thing we've saved is just a dictionary. So you can fiddle with it, but if you have trouble loading something in the future, just open up, just go torch.load, put it into a dictionary, and look at the keys and look at the values and see what's going on.
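In code, the save-and-inspect round trip looks something like this; the file name is just an example, and `learn` is assumed to be the learner from the cells above:

```python
import torch

st = learn.model.state_dict()                 # plain dict of layer name -> weight tensor
torch.save(st, 'imagewoof_resnet18.pth')      # example path

st = torch.load('imagewoof_resnet18.pth')     # ...and later, load it back
print(list(st.keys())[:5])                    # peek at a few layer names
learn.model.load_state_dict(st)
```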

So let's try and use this for pets. We've seen pets before; the nice thing is that although we've never used pets in part two, our data blocks API totally works. And in this case, there's one images directory that contains all the images, and there isn't a separate validation set directory, so we can't use that split-by-grandparent thing, so we're going to have to split it randomly.

But remember how we've already created split by func? So let's just write a function that returns true or false, depending on whether some random number is large or small. And so now, we can just pass that to our split by func, and we're done. So the nice thing is, when you kind of understand what's going on behind the scenes, it's super easy for you to customize things.

And fastai v1 is basically identical; there's a split_by_func that you do the same thing with. So now that's split into training and validation, and you can see how nice it is that we created that dunder repr so that we can print things out so easily to see what's going on.

So if something doesn't have a nice representation, you should monkey-patch in a dunder repr so you can print out what's going on. Now we have to label it. We can't label it by folder, because they're not put into folders. Instead, we have to look at the file name.

So let's grab one file name. So I need to build all this stuff in a Jupyter notebook just interactively to see what's going on. So in this case, we'll grab one name, and then let's try to construct a regular expression that grabs just the doggy's name from that. And once we've got it, we can now turn that into a function.
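For example, something along these lines; the exact regex in the notebook may differ, but the pets file names look like `great_pyrenees_173.jpg`:

```python
import re
from pathlib import Path

def pet_labeler(fname):
    # 'great_pyrenees_173.jpg' -> 'great_pyrenees'
    return re.findall(r'^(.*)_\d+\.jpg$', Path(fname).name)[0]

print(pet_labeler('great_pyrenees_173.jpg'))  # great_pyrenees
```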

And we can now go ahead and use that category processor we built last week to label it. And there we go: there are all the kinds of doggy we have; well, not just doggies now, doggies and kitties. Okay. So now we can train pets from scratch: 37%, not great. So maybe with transfer learning, we can do better.

So transfer learning, we can read in that imagewoof model, and then we will customize it for pets. So let's create a CNN for pets. This is now the pet's data bunch. But let's tell it to create a model with ten filters out, ten activations at the end. Because remember, imagewoof has ten types of dog, ten breeds.

So to load in the pre-trained model, we're going to need to ask for a learner with ten activations. Having done that, we can now grab the state dictionary that we saved earlier and load it into our model. So this is now an Imagewoof model. But the learner for it is pointing at the pets data bunch.

So what we now have to do is remove the final linear layer and replace it with one that has the right number of activations to handle all these, which I think is 37 pet breeds. So what we do is we look through all the children of the model, and we try to find the adaptive average pooling layer, because that's that kind of penultimate bit, and we grab the index of that, and then let's create a new model that has everything up to but not including that bit.

So this is everything before the adaptive average pooling. So this is the body. So now we need to attach a new head to this body, which is going to have 37 activations in the linear layer instead of 10, which is a bit tricky because we need to know how many inputs are going to be required in this new linear layer.

And the number of inputs will be however many outputs come out of this. So in other words, just before the average pooling happens in the x res net, how many activations are there? How many channels? Well, there's an easy way to find out. Grab a batch of data, put it through a cut down model, and look at the shape.

And the answer is, there's 512. Okay? So we've got a 128 mini batch of 512 4x4 activations. So that pred dot shape one is the number of inputs to our head. And so we can now create our head. This is basically it here, our linear layer. But remember, we tend to not just use a max pool or just an average pool.

We tend to do both and concatenate them together, which is something we've been doing in this course forever. But a couple of years ago, somebody finally did actually write a paper about it. So I think this is actually an official thing now. And it generally gives a nice little boost.

So our linear layer needs twice as many inputs because we've got two sets of pooling. So our new model contains the whole body, plus an adaptive concat pooling, a flatten, and our linear layer. And so let's replace the model with that new model we created and fit. And look at that: 71% by fine-tuning versus 37% training from scratch.

So that looks good. So we have a simple transfer learning working. So what I did then, I do this in Jupyter all the time, I basically grabbed all the cells. I hit C to copy, and then I hit V to paste. And then I grabbed them all, and I hit shift M to merge, and chucked a function header on top.

So now I've got a function that does all the-- so these are all the lines you saw just before. And I've just stuck them all together into a function. I call it adapt model. It's going to take a learner and adapt it for the new data. So these are all the lines of code you've already seen.

And so now we can just go CNN learner, load the state dict, adapt the model, and then we can start training. But of course, what we really like to do is to first of all train only the head. So let's grab all the parameters in the body. And remember, when we did that nn.sequential, the body is just the first thing.

That's the whole ResNet body. So let's grab all the parameters in the body and set them to requires grad equals false. So it's frozen. And so now we can train just the head, and we get 54%, which is great. So now we, as you know, unfreeze and train some more.

Uh-oh. So it's better than not fine tuning, but interestingly, it's worse-- 71 versus 56-- it's worse than the kind of naive fine tuning, where we didn't do any freezing. So what's going on there? Anytime something weird happens in your neural net, it's almost certainly because of batch norm, because batch norm makes everything weird.

And that's true here, too. What happened was the frozen part of our model, which was trained for Imagewoof, those layers were tuned for some particular set of means and standard deviations, because remember, batch norm is going to subtract the mean and divide by the standard deviation. But the pets data set has different means and standard deviations, not for the input, but inside the model.

So then when we unfroze this, it basically said this final layer was getting trained for everything being frozen, but that was for a different set of batch norm statistics. So then when we unfroze it, everything tried to catch up, and it would be very interesting to look at the histograms and stuff that we did earlier in the course and see what's really going on, because I haven't really seen anybody-- I haven't really seen a paper about this.

Something we've been doing in FastAI for a few years now, but I think this is the first course where we've actually drawn attention to it. That's something that's been hidden away in the library before. But as you can see, it's a huge difference, the difference between 56 versus 71.

So the good news is it's easily fixed. And the trick is to not freeze all of the body parameters, but freeze all of the body parameters that aren't in the batch norm layers. And that way, when we fine-tune the final layer, we're also fine-tuning all of the batch norm layers' weights and biases.

So we can create, just like before, adapt the model, and let's create something called set_grad, which says, oh, if it's a linear layer at the end or a batch norm layer in the middle, return; don't change anything. Otherwise, if it's got weights, set requires_grad to whatever you asked for, which we're going to start as false.

Here's a little convenient function that will apply any function you pass to it recursively to all of the children of a model. So now that we have apply to a model, or apply to a module, I guess, we can just pass in a module, and that will be applied throughout.

So this way, we freeze just the non-batch norm layers, and of course, not the last layer. And so actually, fine-tuning immediately is a bit better, goes from 54 to 58. But more importantly, then when we unfreeze, we're back into the 70s again. So this is just a super important thing to remember, if you're doing fine-tuning.

And I don't think there's any library other than fastai that does this, weirdly enough. So if you're using TensorFlow or something, you'll have to write this yourself to make sure that you don't ever freeze the weights in the batch norm layers any time you're doing partial layer training.

Oh, by the way, that apply_mod, I only wrote it because we're not allowed to use stuff in PyTorch that we haven't built ourselves, but actually PyTorch has its own; it's called model.apply. So you can use that now, it's the same thing. Okay, so finally, for this half of the course, we're going to look at discriminative learning rates.

So for discriminative learning rates, there's a few things we can do with them. One is it's a simple way to do layer freezing without actually worrying about setting requires grad. We could just set the learning rate to zero for some layers. So let's start by doing that. So what we're going to do is we're going to split our parameters into two or more groups with a function.

Here's our function, it's called bnsplitter, it's going to create two groups of parameters and it's going to pass the body to underscore bnsplitter, which will recursively look for batch norm layers and put them in the second group or anything else with a weight goes in the first group and then do it recursively.

And then also the second group will add everything after the head. So this is basically doing something where we're putting all our parameters into the two groups we want to treat differently. So we can check, for example, that when we do bnsplitter on a model that the number of parameters in the two halves is equal to the total number of parameters in the model.

And so now I want to check this works, right? I want to make sure that if I pass this, because we now have a splitter function in the learner, and that's another thing I added this week, that when you start training, it's literally just this. When we create an optimizer, it passes the model to self.splitter, which by default does nothing at all.

And so we're going to be using our bnsplitter to split it into multiple parameter groups. And so how do we debug that? How do we make sure it's working? Because this is one of these things that if I screw it up, I probably won't get an error, but instead it probably won't train my last layer, or it'll train all the layers at the same learning rate, or it would be hard to know if the model was bad because I screwed up my code or not.

So we need a way to debug it. We can't just look inside and make sure it's working, because what we're going to be doing is we're going to be passing it, let's see this one, we're going to be passing it to the splitter parameter when we create the learner, right?

So after this, it set the splitter parameter, and then when we start training, we're hoping that it's going to create these two layer groups. So we need some way to look inside the model. So of course, we're going to use a callback. And this is something that's super cool.

Do you remember how I told you that you can actually override __call__ itself? You don't just have to override a specific callback. And by overriding __call__, we can say which callback stage we want to debug, and when we hit that stage, please run this function. If you don't pass in a function, it just jumps into the debugger as soon as that callback is hit; otherwise it calls the function.

So this is super handy, right? Because now I can create a function called print_details that just prints out how many parameter groups there are and what their hyperparameters are, and then immediately raises the CancelTrainException to stop. And so then I can fit with my discriminative LR scheduler and my debug callback. My discriminative LR scheduler is something that now doesn't just take a learning rate, but an array of learning rates, and creates a scheduler for every learning rate.
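Here's a rough sketch of that debug callback and print_details. It assumes the Callback base class, the self.run learner attribute, the opt.hypers list, and CancelTrainException from the course notebooks; those names are from memory, so treat them as assumptions rather than the exact implementation:

```python
from pdb import set_trace

class DebugCallback(Callback):          # Callback base class from the course notebooks
    _order = 999                        # run after every other callback
    def __init__(self, cb_name, f=None): self.cb_name, self.f = cb_name, f
    def __call__(self, cb_name):
        # Intercept every callback stage; only act on the one we asked to debug.
        if cb_name == self.cb_name:
            if self.f: self.f(self.run)
            else: set_trace()           # no function given: drop into the debugger

def print_details(o):
    # o is the learner: show how many parameter groups the optimizer has and
    # what their hyperparameters are, then stop training immediately.
    print(len(o.opt.param_groups), o.opt.hypers)
    raise CancelTrainException()

# learn.fit(1, cbs=DebugCallback('after_batch', print_details))
```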

And so I can pass that in; I'm going to use 0 and 0.03. So in other words, no training for the body, and 0.03 for the head and the batch norm layers. And as soon as I fit, it immediately stops because the CancelTrainException was raised, and it prints out that there are two parameter groups, which is what we want; the first parameter group has a learning rate of 0, which is what we want, and the second is 0.003, which is right, because it's 0.03 and we're using the learning rate scheduler, so it starts out 10 times smaller.

So this is just a way of saying if you're anything like me, every time you write code, it will always be wrong, and for this kind of code, you won't know it's wrong, and you could be writing a paper or doing a project at work or whatever in which you're not using discriminative learning rates at all because of some bug because you didn't know how to check.

So make sure you can check, and always assume that you screw up everything. Okay, so now we can train with zero learning rate on the first layer group, and then we can use discriminative learning rates of 1e-3 and 1e-2 and train a little bit more, and that all works.

Okay, so that's all the tweaks we have. Any questions, Rachel? A couple of tangential questions have come up. They're my favorite. The first is: we heard that you're against cross-validation for deep learning, and wanted to know why that is. And the second question...

Let's do it one at a time. Okay. Okay. So cross-validation is a very useful technique for getting a reasonably sized validation set if you don't have enough data to otherwise create a reasonably sized validation set. So it was particularly popular in the days when most studies were like 50 or 60 rows.

If you've got a few thousand rows, it's just pointless, right? Like the kind of statistical significance is going to be there regardless. So I wouldn't say I'm against it, just most of the time you don't need it because if you've got a thousand things in the validation set and you only care whether it's like plus or minus 1%, it's totally pointless.

So yeah, have a look and see how much your validation set accuracy is varying from run to run. And if it's too much that you can't make the decisions you need to make, then you can add cross-validation. And what are your best tips for debugging deep learning? So Chris Lattner asked me this today as well, actually.

So I'll answer the same answer to him, which is don't make mistakes in the first place. And the only way to do that is to make your code so simple that it can't possibly have a mistake and to check every single intermediate result along the way to make sure it doesn't have a mistake.

Otherwise, your last month might have been like my last month. What happened in my last month? Well, a month ago, I got 94.1% accuracy on ImageNet, and I was very happy. And then I started a couple of weeks ago trying various tweaks. And none of the tweaks seemed to help.

And after a while, I got so frustrated, I thought I'd just repeat the previous training to see whether it had been a fluke. And I couldn't repeat it. I was now getting 93.5 instead of 94.1. And I trained it a bunch of times.

And every time I trained it, it was costing me $150 of AWS credits. So I wasn't thrilled about this. And it was six hours of waiting. So that was quite a process to even realize it was broken. That's the thing: when you've written that kind of code wrong, it gets broken in ways you don't even notice.

It was broken for weeks in fast.ai. And nobody noticed. So eventually, I realized, yeah, I mean, the first thing I'll say is, you've got to be a great scientist, which means you need a journal, right? You need to keep a record of your results. So I had a good journal; I pasted everything that was going on, all my models, into a file.

So I went back, I confirmed it really was 94.1. I could see exactly when it was. And so then I could revert to the exact commit that was in fast AI at that time. And I reran it, and I got 94.1. So I now had to figure out which change in the previous month of the entire fast AI code base caused this to break.

So the first thing I tried to do was try to find a way to quickly figure out whether something was broken. But after doing a few runs and plotting them in Excel, it was very clear that the training was identical until epoch 50. So until epoch 50 out of 60.

So there was no shortcut. And so I did a bisection search one module at a time, looking through the 15 modules that had changed in that diff, until eventually I found it was in the mixed precision module. And then I went through each change that had happened in the mixed precision module.

So, like, $5,000 later, I finally found the one line of code where we had forgotten to write the four characters .opt. And by failing to write .opt, we were wrapping an OptimWrapper around an OptimWrapper, rather than around the optimizer.

And that meant that weight decay was being applied twice. So that tiny difference, like, was so insignificant that no one using the library even noticed it wasn't working. I didn't notice it wasn't working until I started trying to, you know, get state-of-the-art results on ImageNet in 60 epochs with ResNet 50.

So yeah, I mean, debugging is hard, and worse still, most of the time you don't even know. So I mean, honestly, training models sucks, and deep learning is a miserable experience and you shouldn't do it, but on the other hand, it gives you much better results than anything else, and it's taking over the world.

So it's either that or get eaten by everybody else, I guess. So yeah, I mean, it's so much easier to write normal code where, like, oh, you have to implement OAuth authentication in your web service, and so you go in and you say, oh, here's the API, and we have to take these five steps, and after each one I check that this has happened, and you check off each one, and at the end you're done, and you push it, and you have integration tests, and that's it, right?

Even testing requires a totally different mindset. So you don't want reproducible tests. You want tests with randomness. You want to be able to see if something's changing just occasionally, because if it passes every time with a random seed of 42, you can't be sure it's going to work with a random seed of 41.

So you want non-reproducible tests, you want randomness, you want tests that aren't guaranteed to always pass, but the accuracy of this integration test should be better than 0.9 nearly all the time. You want to be warned if something looks off, you know? And this means it's a very different software development process, because if you push something to the fast AI repo and a test fails, it might not be your fault, right?

It might be that Jeremy screwed something up a month ago, and one test fails one out of every thousand times. So as soon as that happens, then we try to write a test that fails every time, you know? So once you realize there's a problem with this thing, you try to find a way to make it fail every time, but it's -- yeah, debugging is difficult, and in the end, you just have to go through each step, look at your data, make sure it looks sensible, plot it, and try not to make mistakes in the first place.

Great. Well, let's have a break and see you back here at 7.55. So we've all done ULM fit in part one, and there's been a lot of stuff happening in the -- oh, okay. Let's do the question. >> What do you mean by a scientific journal? >> Ah. Yeah, that's a good one.

This is something I'm quite passionate about. When you look at the great scientists in history, they all, that I can tell, had careful scientific journal practices. In my case, my scientific journal is a file in a piece of software called Windows Notepad, and I paste things into it at the bottom, and when I want to find something, I press control F.

It just needs to be something that has a record of what you're doing and what the results of that are, because scientists -- scientists who make breakthroughs generally make the breakthrough because they look at something that shouldn't be, and they go, oh, that's odd. I wonder what's going on.

So the discovery of the noble gases was because the scientists saw, like, one little bubble left in a beaker, which they were pretty sure there shouldn't have been a little bubble there anymore. Most people would just be like, oops, there's a bubble, or we wouldn't even notice, but they studied the bubble, and they found noble gases, or penicillin was discovered because of a, oh, that's odd.

And I find in deep learning, this is true as well. Like, I spent a lot of time studying batch normalization in transfer learning, because a few years ago in Keras, I was getting terrible transfer learning results for something I thought should be much more accurate, and I thought, oh, that's odd.

And I spent weeks changing everything I could, and then almost randomly tried changing batch norm. So the problem is that all this fiddling around, you know, 90% of it doesn't really go anywhere, but it's the other 10% that you won't be able to pick it out unless you can go back and say, like, okay, that really did happen.

I copied and pasted the log here. So that's all I mean. >> Are you also linking to your GitHub commits and datasets, sir? >> No, because I've got the date there and the time. So I know the GitHub commit. So I do make sure I'm pushing all the time.

So, yeah. Okay. Yeah, so there's been a lot happening in NLP transfer learning recently, the famous GPT2 from OpenAI and BERT and stuff like that, lots of interest in transformers, which we will cover in a future lesson. One could think that LSTMs are out of favor and not interesting anymore.

But when you actually look at recent competitive machine learning results, you see ULMFiT beating BERT. Now, I should say this is not just ULMFiT beating BERT. The folks at n-waves are super smart, amazing people. So it's a couple of super smart, amazing people using ULMFiT up against other people using BERT.

It's definitely not true that RNNs are in the past. I think what's happened is, in fact, as you'll see, transformers and CNNs for text have a lot of problems. They basically don't have state. So if you're doing speech recognition, every sample you look at, you have to do an entire analysis of all the samples around it again and again and again.

It's ridiculously wasteful, whereas RNNs have state. But they're fiddly and they're hard to deal with, as you'll see, when you want to actually do research and change things. So partly it's that RNNs have state, but partly it's also that RNNs are the only thing which has had the level of care around regularization that AWD-LSTM did.

So Stephen Merity looked at what are all the ways I can regularize this model, and came up with a great set of hyperparameters for that. And there's nothing like that outside of the RNN world. So, at the moment, my go-to choice is definitely still ULMFiT for most real-world NLP tasks.

And if people find BERT or GPT2 or whatever better for some real-world tasks, that would be fascinating. I would love that to happen, but I haven't been hearing that from people that are actually working in industry yet. I'm not seeing them win competitive machine learning stuff and so forth.

So I still think RNNs should be our focus, but we will also learn about transformers later. And so ULM fit is just the normal transfer learning path applied to an RNN, which could be on text. Interestingly, there's also been a lot of state of the art results recently on genomics applications and on chemical bonding analysis and drug discovery.

There's lots of things that are sequences and it turns out, and we're still just at the tip of the iceberg, right? Because most people that are studying like drug discovery or chemical bonding or genomics have never heard of ULM fit, right? So it's still the tip of the iceberg.

But those who are trying it are consistently getting breakthrough results. So I think it's really interesting, not just for NLP, but for all kinds of sequence classification tasks. So the basic process is going to be create a language model on some large data set. And notice a language model is a very general term.

It means: predict the next item in the sequence. So it could be an audio language model that predicts the next sample in a piece of music or speech. It could be predicting the next token in a genomic sequence, or whatever. So that's what I mean by language model. And then we fine-tune that language model using our in-domain corpus, which in this case is going to be IMDB.

And then in each case, we first have to pre-process our data sets to get them ready for using an RNN on them. Language models require one kind of pre-processing. Classification models require another one. And then finally we can fine-tune our IMDB language model for classification. So this is the process we're going to go through from scratch.

So Sylvain has done an amazing thing in the last week, which is basically to recreate the entire AWD LSTM and ULM fit process from scratch in the next four notebooks. And there's quite a lot in here, but a lot of it's kind of specific to text processing. And so some of it I might skip over a little bit quickly, but we'll talk about which bits are interesting.

So we're going to start with the IMDB data set as we have before. And to remind you it contains a training folder, an unsupervised folder, and a testing folder. So the first thing we need to do is we need to create a data blocks item list subclass for text.

Believe it or not, that's the entire code. Because we already have get_files, so here's get_files with the .txt extension. And all you have to do is override get to open a text file, like so. And we're now ready to create an item list. So the data blocks API is just so super easy to extend to handle your domain.

So if you've got genomic sequences or audio or whatever, this is basically what you need to do. So now we've got an item list with 100,000 things in it. We've got the train, the test, and the unsupervised. And we can index into it and see a text. So here's a movie review.

And we can use all the same stuff that we've used before. So for the previous notebook, we just built a random splitter. So now we can use it on texts. So the nice thing about this decoupled API is that we can mix and match things and things just work, right?

And we can see the representation of them. They just work. Okay, so we can't throw this movie review into a model. It needs to be numbers. And so, as you know, we need to tokenize and numericalize this. So let's look at the details. We use spaCy for tokenizing. And we do a few things as we tokenize.

One thing we do is we have a few pre-rules. These are bits of code that get run before tokenization. So for example, if we find an HTML line break like <br />, we replace it with a newline. Or if we find a slash or a hash, we put spaces around it.

If we find more than two spaces in a row, we just make it one space. Then we have these special tokens, and this is what they look like as strings; we use symbolic names for them, essentially. And these different tokens have various special meanings. For example, if we see some non-whitespace character more than three times in a row, we replace it with a special token. This is really cool, right?

In Python substitution, you can pass in a function. So re.sub here is going to look for this pattern, and then it's going to replace it with the result of calling this function, which is really nice. And so what we're going to do is we're going to stick in the TK_REP special token.

So this means that there was a repeating token where they're going to put a number, which is how many times it repeated. And then the thing that was actually there. We'll do the same thing with words. There's a lot of bits of little crappy things that we see in texts that we replace mainly HTML entities.

We call those our default pre-rules. And then this is our default list of special tokens. So for example, replace_rep on "cccc" would give "xxrep 4 c". Or replace_wrep on "would would would would would" would give "xxwrep 5 would". Why? Well, think about the alternatives, right?

So what if you read a tweet that said this was amazing 28 exclamation marks. So you can either treat those 28 exclamation marks as one token. And so now you have a vocab item that is specifically 28 exclamation marks. You probably never see that again, so probably won't even end up in your vocab.

And if it did, you know, it's, it's going to be so rare that you won't be able to learn anything interesting about it. But if instead we replaced it with XX rep 28 exclamation mark, then this is just three tokens where it can learn that lots of repeating exclamation marks is a general concept that has certain semantics to it, right?
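Here's roughly what the character-repetition rule looks like; the TK_REP token name matches the notebook, but treat the exact regex and spacing as a sketch from memory:

```python
import re

TK_REP = 'xxrep'  # special token marking a repeated character

def replace_rep(t):
    "Replace character repetitions: 'cccc' -> ' xxrep 4 c '."
    def _replace_rep(m):
        c, cc = m.groups()
        return f' {TK_REP} {len(cc)+1} {c} '
    # (\S) is any non-whitespace character; (\1{3,}) is that same character
    # repeated at least three more times.
    re_rep = re.compile(r'(\S)(\1{3,})')
    # re.sub accepts a function: it gets called with each match object and
    # its return value is used as the replacement text.
    return re_rep.sub(_replace_rep, t)

print(replace_rep('This was amazing!!!!!!!!'))
# -> 'This was amazing xxrep 8 ! '
```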

So that's what we're trying to do in NLP is we're trying to make it so that the things in our vocab are as meaningful as possible. And the nice thing is that because we're using an LSTM, we can have multi-word sequences and be confident that the LSTM will create some stateful computation that can handle that sequence.

Another alternative is we could have turned the 28 exclamation marks into 28 tokens in a row, each one of the single exclamation mark. But now we're asking our LSTM to hang on to that state for 28 time steps, which is just a lot more work for it to do.

And it's not going to do as good a job, right? So we want to make things easy for our models. That's what pre-processing is all about. So same with all caps, right? If you've got, I am shouting, then it's pretty likely that there's going to be exclamation marks after that.

There might be swearing after that. Like the fact that there's lots of capitalized words is semantic of itself. So we replace capitalized words with a token saying this is a capitalized word. And then we replace it with the lowercase word. So we don't have a separate vocab item for capital am, capital shouting, capital, every damn word in the dictionary.

Okay. Same thing for mixed case. So I don't know, I haven't come across other libraries that do this kind of pre-processing. There's little bits and pieces in various papers, but I think this is a pretty good default set of rules. Notice that these rules have to happen after tokenization because they're happening at a word level.

So we have default post rules. And then this one here adds a beginning of stream and an end of stream on either side of a list of tokens. Why do we do that? These tokens turn out to be very important because when your language model sees like an end of stream character token, meaning like that's the end of a document, that it knows the next document is something new.

So it's going to have to learn the kind of reset its state to say like, oh, we're not talking about the old thing anymore. So we're doing Wikipedia. We were talking about Melbourne, Australia. Oh, and now there's a new token. Then we're talking about the Emmys, right? So when it sees EOS, it has to learn to kind of reset its state somehow.

So you need to make sure that you have the tokens in place to allow your model to know that these things are happening. Tokenization is kind of slow because spaCy does it so carefully. I thought it couldn't possibly be necessary to do it so carefully, because it just doesn't seem that important.

So last year I tried removing spaCy and replacing it with something much simpler. My IMDB accuracy went down a lot. So actually it seems like spaCy's sophisticated parser-based tokenization does do better. So at least we can try and make it fast. So Python comes with something called a process pool executor, which runs things in parallel.

And I wrap it with this little thing called parallel. And so here's my thing that runs (look, compose appears everywhere): compose the pre-rules on every chunk, run the tokenizer, compose the post-rules on every doc. That's processing one chunk. So run them all in parallel for all the chunks.
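A minimal sketch of that kind of parallel helper. The process_chunk in the comments is a hypothetical stand-in for the compose-pre-rules / tokenize / compose-post-rules pipeline just described, not the notebook's actual function:

```python
from concurrent.futures import ProcessPoolExecutor

def parallel(func, items, max_workers=4):
    # Run func over items in separate processes; ex.map preserves the order of results.
    if max_workers < 2:
        return list(map(func, items))
    with ProcessPoolExecutor(max_workers=max_workers) as ex:
        return list(ex.map(func, items))

# def process_chunk(texts):
#     docs = [compose(t, pre_rules) for t in texts]     # pre-rules on raw text
#     docs = tokenize(docs)                             # spaCy tokenizer
#     return [compose(d, post_rules) for d in docs]     # post-rules on tokens
#
# tokenized = parallel(process_chunk, chunks)
```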

So that's that. So this is a processor, which we saw last week, and this is a processor which tokenizes. And so we can try it out: we can create one, grab a bit of text, and try tokenizing it. And so you can see we've got the beginning-of-stream token, contractions split into their own tokens, the comma as its own token, an xxmaj token marking a word that was capitalized, and so forth.

All right, so now we need to turn those into numbers, not just to have a list of words. We can turn them into numbers by numericalizing, which is another processor, which basically when you call it, we find out, do we have a vocab yet? Because numericalizing is just saying, what are all the unique words?

And the list of unique words is the vocab. So if we don't have a vocab, we'll create it, okay? And then after we create it, it's just a case of calling object-to-int on each one. So otoi is just a dictionary, right? And deprocessing is just grabbing each thing from the vocab.

So that's just an array. Okay, so we can tokenize, numericalize, run it for two and a half minutes. And so we've got the xobj is the thing which returns the object version, so as opposed to the numericalized version, and so we can put it back together and this is what we have after it's been turned into numbers and back again.
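A stripped-down sketch of such a numericalizing processor. The real version also puts the special tokens at the front of the vocab and caches otoi; this is just the shape of the idea:

```python
from collections import Counter, defaultdict

class NumericalizeProcessor():
    def __init__(self, vocab=None, max_vocab=60000, min_freq=2):
        self.vocab, self.max_vocab, self.min_freq = vocab, max_vocab, min_freq

    def __call__(self, items):                 # items: list of token lists
        if self.vocab is None:
            # The vocab is just the unique tokens, most frequent first.
            freq = Counter(tok for doc in items for tok in doc)
            self.vocab = [tok for tok, c in freq.most_common(self.max_vocab)
                          if c >= self.min_freq]
        # otoi: token -> int. Unknown tokens fall back to index 0 here (the
        # course version reserves the first indices for special tokens like xxunk).
        self.otoi = defaultdict(int, {tok: i for i, tok in enumerate(self.vocab)})
        return [[self.otoi[tok] for tok in doc] for doc in items]

    def deprocess(self, idx_docs):
        # Going back the other way is just indexing into the vocab list.
        return [[self.vocab[i] for i in doc] for doc in idx_docs]
```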

So since that takes a couple of minutes, good idea to dump the labeled list so that we can then load it again later without having to rerun that. All right, this is the bit which a lot of people get confused about, which is how do we batch up language model data?

So here's this bit of text. It's very meta: it's a bit of text which is from this notebook. So the first thing we're going to do is pick a batch size, a small one for showing you what's going on: six. So let's go through and split all the tokens into six groups, one per row of the batch.

So here's, in this notebook, we will go back over the example of is the first element of, so this is the first row and then of classifying movie reviews we studied in part one, this is the second. So we just put it into six groups, right? And then let's say we have a BPTT of five, so it's kind of like our backprop through time sequence length of five, then we can split these up into groups of five.

And so that'll create three of them. In this notebook, we will go back over the example of classifying movie reviews we studied in part one. These three things then are three mini batches. And this is where people get confused because it's not that each one has a different bunch of documents.

Each one has the same documents over consecutive time steps. This is really important. Why is it important? Because this row here in the RNN is going to be getting some state about this document. So when it goes to the next batch, it needs to use that state. And then it goes to the next batch, needs to use that state.

So from batch to batch, the state that it's building up needs to be consistent. That's why we do the batches this way. >> I wanted to ask if you did any other preprocessing, such as removing stop words, stemming, or lemmatization? >> Yeah, great question. So in traditional NLP, those are important things to do.

Removing stop words is removing words like "a" and "on." Stemming is getting rid of the "ing" suffix and stuff like that. It's kind of universal in traditional NLP. It's an absolutely terrible idea. Never ever do this. Because -- well, the first question is, why would you do it?

Why would you remove information from your neural net which might be useful? And the fact is it is useful. Like stop words, your use of stop words tells you a lot about what style of language, right? So you'll often have a lot less kind of articles and stuff if you're like really angry and speaking really quickly.

You know, the tense you're talking about is obviously very important. So stemming gets rid of it. So yeah, all that kind of stuff is in the past. You basically never want to do it. And in general, preprocessing data for neural nets, leave it as raw as you can is the kind of rule of thumb.

So for a language model, each mini batch is basically going to look something like this for the independent variable. And then the dependent variable will be exactly the same thing but shifted over by one word. So let's create that. This thing is called LM_PreLoader, but it would really be better off being called an LM dataset, so why don't we rename it right now: LM_Dataset. That's really what it is. Okay.

Why don't we do it right now? LM pre freeloader. LM data set. That's really what it is. Okay. So an LM data set is a data set for a language model. Remember that a data set is defined as something with a length and a get item. So this is a data set which you can index into it.

And it will grab an independent variable and a dependent variable. And the independent variable is just the text from wherever you asked for, for BPTT. And the dependent variable is the same thing offset by one. So you can see it here. We can create a data loader using that data set.

Remember that's how data loaders work. You pass them a data set. And now we have something that we can iterate through, grabbing a mini batch at a time. And you can see here X is XXBOS well worth watching. And Y is just well worth watching. Okay. And then you can see the second batch, best performance to date.
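Here's a minimal sketch of that kind of dataset over one long stream of token ids (the real LM_PreLoader also arranges the stream into batch-size rows and handles shuffling, so that state stays consistent from batch to batch as described above):

```python
import torch

class LMDataset():
    # x is bptt tokens starting at some offset; y is the same thing shifted by one.
    def __init__(self, tokens, bptt=70):
        self.tokens, self.bptt = tokens, bptt          # tokens: 1-D LongTensor

    def __len__(self):
        return (len(self.tokens) - 1) // self.bptt

    def __getitem__(self, i):
        start = i * self.bptt
        x = self.tokens[start     : start + self.bptt]
        y = self.tokens[start + 1 : start + self.bptt + 1]
        return x, y

# stream = torch.arange(100)              # stand-in for a numericalized corpus
# x, y = LMDataset(stream, bptt=5)[0]     # x = [0..4], y = [1..5]
```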

So make sure you print out things that all make sense. So that's stuff that we can all dump into a single function and use it again later and chuck it into a data bunch. So that's all we need for a data bunch for language models. We're also going to need a data bunch for classification.

And that one's going to be super easy because we already know how to create data bunches for classification because we've already done it for lots of image models. And for NLP it's going to be exactly the same. So we create an item list. We split. We label. That's it.

So the stuff we did for image is not different. Only thing we've added is two preprocesses. >> Question. What are the tradeoffs to consider between batch size and back propagation through time? For example, BPTT10 with BS100 versus BPTT100 with BS10. Both would be passing a thousand tokens at a time to the model.

What should you consider when tuning the ratio? >> It's a great question. I don't know the answer. I would love to know. So try it. Because I haven't had time to fiddle with it. I haven't seen anybody else experiment with it. So that would make a super great experiment.

I think the batch size is the thing that lets it parallelize. So if you don't have a large enough batch size it's just going to be really slow. But on the other hand, the large batch size with a short BPTT, depending on how you use it, you may end up kind of ending up with less state that's being back propagated.

So the question of how much that matters, I'm not sure. And when we get to our ULM classification model I'll actually show you this, kind of where this comes in. Okay. So here's a couple of examples of a document and a dependent variable. And what we're going to be doing is we're going to be creating data loaders for them.

But we do have one trick here. Which is that with images, our images were always, by the time we got to modeling they were all the same size. Now this is probably not how things should be. And we have started doing some experiments with training with rectangular images of different sizes.

But we're not quite ready to show you that work because it's still a little bit fiddly. But for text we can't avoid it. You know, we've got different sized texts coming in. So we have to deal with it. And the way we deal with it is almost identical to how actually we're going to end up dealing with when we do do rectangular images.

So if you are interested in rectangular images, try and basically copy this approach. Here's the approach. We are going to pad each document by adding a bunch of padding tokens. So we just pick some arbitrary token which we're going to tell PyTorch this token isn't text. It's just thrown in there because we have to put in something to make a rectangular tensor.

If we have a mini batch with a 1,000 word document and then a 2,000 word document and then a 20 word document, the 20 word document is going to end up with 1,980 padding tokens on the end. And as we go through the RNN, we're going to be totally pointlessly calculating on all these padding tokens.

We don't want to do that. So the trick is to sort the data first by length. That way your first mini batch will contain your really long documents and your last mini batch will contain your really short documents, and each mini batch will not contain a very wide variety of lengths of documents.

So there won't be much padding and so there won't be much wasted computation. So we've already looked at samplers. If you've forgotten, go back to when we created our data loader from scratch and we actually created a sampler. And so here we're going to create a different type of sampler and it is simply one that goes through our data, looks at how many documents is in it, creates the range from zero to the number of documents, sorts them by some key and returns that iterator, sorts them in reverse order.

So we're going to use sort sampler passing in the key, which is a lambda function that grabs the length of the document. So that way our sampler is going to cause each mini batch to be documents of similar lengths. The problem is we can only do this for validation, not for training because for training we want to shuffle and sorting would undo any shuffling because sorting is deterministic.

So that's why we create something called sort ish sampler. And the sort ish sampler approximately orders things by length. So every mini batch has things of similar lengths but with some randomness. And the way we do this, the details don't particularly matter but basically I've created this idea of a mega batch, which is something that's 50 times bigger than a batch and basically I sort those, okay?

And so you end up with these kind of like sorted mega batches and then I have random permutations within that. So you can see random permutations there and there. So you can look at the code if you care, the details don't matter. In the end, it's a random sort in which things of similar lengths tend to be next to each other and the biggest ones tend to be at the start.
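Here's a minimal sketch of the deterministic sort sampler used for validation, assuming the key function returns each document's length (the sortish version adds the mega-batch shuffling just described; its details are in the notebook). The valid_ds in the usage comment is a hypothetical dataset name:

```python
from torch.utils.data import Sampler

class SortSampler(Sampler):
    # Yield indices sorted so the longest documents come first; documents of
    # similar length then end up in the same mini batch.
    def __init__(self, data_source, key):
        self.data_source, self.key = data_source, key   # key(i) -> length of doc i
    def __len__(self):
        return len(self.data_source)
    def __iter__(self):
        return iter(sorted(range(len(self.data_source)), key=self.key, reverse=True))

# valid_sampler = SortSampler(valid_ds.x, key=lambda i: len(valid_ds.x[i]))
```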

So now we've got a mini batch of numericalized, tokenized documents of similar lengths, but they're not identical lengths, right? And you might remember that when we first created a data loader, we gave it two things: a sampler and a collate function. And the collate function that we wrote simply said torch.stack, because all our images were the same size, so we could literally just stick them together.

We can't do that for documents because they're different sizes. So we've written something called pad collate. And what Sylvan did here was he basically said let's create something that's big enough to handle the longest document in the mini batch and then go through every document and dump it into that big tensor either at the start or at the end depending on whether you said pad first.

So now we can pass the sampler and the collate function to our data loader and that allows us to grab some mini batches which as you can see contain padding at the end. And so here's our normal convenience functions that do all those things for us and that's that, okay.
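A minimal sketch of that collate function, assuming each sample is a (token_ids, label) pair and that index 1 is the padding token; the names in the usage comment are hypothetical:

```python
import torch

def pad_collate(samples, pad_idx=1, pad_first=False):
    max_len = max(len(s[0]) for s in samples)
    # Start with a rectangle full of padding tokens...
    res = torch.zeros(len(samples), max_len, dtype=torch.long) + pad_idx
    for i, (toks, _) in enumerate(samples):
        toks = torch.as_tensor(toks, dtype=torch.long)
        # ...then copy each document in at the start or the end of its row.
        if pad_first: res[i, -len(toks):] = toks
        else:         res[i, :len(toks)]  = toks
    return res, torch.tensor([s[1] for s in samples])

# DataLoader(valid_ds, batch_size=64, sampler=valid_sampler, collate_fn=pad_collate)
```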

So that's quite a bit of preprocessing and I guess the main tricky bit is this dealing with different lengths. And at that point we can create our AWD LSTM. So these are just the steps we just did to create our data loader. And now we're going to create an RNN.

So an RNN remember is just a multi-layer network. But it's a multi-layer network that could be very, very, very many layers. There could be like if it's a 2000 word document this is going to be 2000 layers. So to avoid us having to write 2000 layers we used a for loop.

And between every pair of hidden layers we use the same weight matrix. That's why they're the same color. And that's why we can use a for loop. Problem is as we've seen trying to handle 2000 layers of neural net we get vanishing gradients or exploding gradients it's really, really difficult to get it to work.

So what are we going to do? Because it's even worse than that because often we have layers going into, RNNs going into other RNNs so we actually have stacked RNNs which when we unstack them it's going to be even more thousands of layers effectively. So the trick is we create something called an LSTM cell.

Rather than just doing a matrix multiply as our layer we instead do this thing called an LSTM cell as our layer. This is it here. So this is a sigmoid function and this is a tanh function. So the sigmoid function remember goes from 0 to 1 and kind of nice and smooth between the two.

And the tanh function is identical to a sigmoid except it goes from minus 1 to 1 rather than 0 to 1. So sigmoid is 0 to 1, tanh is minus 1 to 1. So here's what we're going to do. We're going to take our input and we're going to have some hidden state as we've already always had in our RNNs.

This is just our usual hidden state. And we're going to multiply our input by some weight matrix in the usual way. Then we're going to multiply our hidden state by some weight matrix in the usual way and then we add the two together in the way we've done before for RNNs.

And then we're going to do something interesting. We're going to split the result into four equal sized tensors. So the first one quarter of the activations will go through this path, the next will go through this path, the next will go through this path, the next will go through this path.

So what this means is we kind of have like four little neural nets effectively, right? And so this path goes through a sigmoid and it hits this thing called the cell. Now this is the new thing. So the cell, just like hidden state, is just a rank one tensor or for a mini batch, a rank two tensor.

It's just some activations. And what happens is we multiply it by the output of this sigmoid. So the sigmoid can go between zero and one. So this, this gate has the ability to basically zero out bits of the cell state. So we have the ability to basically take this state and say like delete some of it.

So we could look at some of these words or whatever in this LSTM and say based on looking at that, we think we should zero out some of our cell state. And so now the cell state has been selectively forgotten. So that's the forget gate. We then add it to the second chunk, the second little mini neural net, which goes through sigmoid.

So this is just our input, and we multiply it by the third one, which goes through a tanh. So this basically allows us to say which bits of input do we care about, and this gives us numbers from minus one to one; multiply them together and add that on. So this is how we update our cell state.

So we add on some new state. And so now we take that cell state and, well, one thing that happens is it goes through to the next time step. And the other thing that happens is it goes through one more tanh to get multiplied by the fourth little mini neural net, which is the output gate.

So this actually creates the output hidden state. So it looks like there's a lot going on, but actually it's just this, right? You've got one neural net that goes from input to hidden; it's a linear layer. One that goes from hidden to hidden. Each one is going to be four times the number of hidden activations, because after we compute them and add them together, chunk splits the result up into four equal sized groups.

Three of them go through a sigmoid. One of them goes through a tanh, and then this is just the multiply and add that you saw. So there's conceptually a lot going on in an LSTM, and it's certainly worth doing some more reading about why this particular architecture.
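Here's the cell as just described, written as a small module; ni is the input size and nh the hidden size. This mirrors the notebook's approach, but treat it as a sketch:

```python
import torch
import torch.nn as nn

class LSTMCell(nn.Module):
    def __init__(self, ni, nh):
        super().__init__()
        self.ih = nn.Linear(ni, 4 * nh)   # input  -> hidden, x4 for the four gates
        self.hh = nn.Linear(nh, 4 * nh)   # hidden -> hidden, x4 for the four gates

    def forward(self, inp, state):
        h, c = state
        # One big matrix multiply each, add, then split into the four "mini nets".
        gates = (self.ih(inp) + self.hh(h)).chunk(4, dim=1)
        ingate, forgetgate, outgate = map(torch.sigmoid, gates[:3])
        cellgate = gates[3].tanh()
        # Selectively forget old cell state, then add the gated new candidate state.
        c = forgetgate * c + ingate * cellgate
        # The new hidden state is a gated view of the (squashed) cell state.
        h = outgate * c.tanh()
        return h, (h, c)
```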

But one thing I will say is there are lots of other ways you can set up a layer which has the ability to selectively update and selectively forget things. For example, there's something called a GRU, which has one less gate. The key thing seems to be giving it some way to make a decision to forget things.

Cuz if you do that, then it has the ability to not push state through all thousand time steps or whatever. So that's our LSTM cell and so an LSTM layer, assuming we only have one layer, is just that for loop that we've seen before and we're just gonna call whatever cell we asked for.

So we're gonna ask for an LSTM cell, and it just loops through the time steps; you can see how we take the state and update it each time. So this is classic deep learning; it's like an nn.Sequential, right? It's looping through a bunch of functions that are updating some state.

That's what makes it a deep learning network. So that's an LSTM. So that takes 105 milliseconds for a small net on the CPU. We could pop it onto CUDA, then it's 24 milliseconds on GPU. It's not that much faster, because this loop, every time step, it's having to push off another kernel launch off to the GPU and that's just slow, right?

So that's why we use the built in version. And the built in version behind the scenes calls a library from Nvidia called cuDNN, which has created a C++ version of this. It's about the same on the CPU, right? Not surprisingly, it's really not doing anything different, but on the GPU goes from 24 milliseconds to 8 milliseconds.

So it's dramatically faster. The good news is we can create a faster version by taking advantage of something in PyTorch called JIT. And what JIT does is it reads our Python and it converts it into C++ that does the same thing. It compiles it the first time you use it and then it uses that compiled code.

And so that way it can create an on GPU loop. And so the result of that is, again, pretty similar on the CPU, but on the GPU, 12 milliseconds. So, you know, not as fast as the cuDNN version, but certainly a lot better than our non-JIT version. So this seems like some kind of magic thing that's going to save our lives and not require us to have to come to the Swift for TensorFlow lectures.
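As a toy illustration of the idea (this is a stand-in loop, not the actual LSTM layer), you decorate a function with torch.jit.script and TorchScript compiles the whole loop, so the per-time-step work isn't driven by the Python interpreter:

```python
import torch
from torch.jit import script

@script
def scripted_loop(x, h):
    # x: (seq_len, bs, nh); h: (bs, nh). A toy stand-in for an RNN time-step loop.
    outputs = []
    for t in range(x.size(0)):
        h = torch.tanh(x[t] + h)
        outputs.append(h)
    return torch.stack(outputs), h

# out, h = scripted_loop(torch.randn(70, 64, 300), torch.zeros(64, 300))
```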

But I've got bad news for you. Trying to get JIT working has been honestly a bit of a nightmare. This is the third time we've tried to introduce it in this course. And the other two times we've just not gotten it working or we've gotten worse results. It doesn't work very well that often.

And it's got a lot of weird things going on. Like, for example, if you decide to comment out a line and then run it, you'll get this error saying "unexpected indent". Like, literally, it's not Python, right? So it doesn't even know how to handle a commented-out line.

It's this kind of weird thing where they try to -- it's heroic. It's amazing that it works at all. But the idea that you could try and turn Python, which is so not C++, into C++ is really pushing at what's possible. So it's astonishing this works at all. And occasionally it might be useful, but it's very, very hard to use.

And when something isn't as fast as you want, it's very, very hard to -- you can't profile it, you can't debug it, not in the normal ways. But, you know, obviously, it will improve. It's pretty early days. It will improve. But the idea of trying to parse Python and turn it into C++, literally, they're doing like string interpolation behind the scenes, is kind of trying to reinvent all the stuff that compilers already do, converting a language that was very explicitly not designed to do this kind of thing into one that does.

And I just -- I don't think this is the future. So I say for now, be aware that JIT exists. Be very careful in the short term. I found places where it literally gives the wrong gradients. So it goes down a totally different auto grad path. And I've had models that trained incorrectly without any warnings.

Because it was just wrong. So be very careful. But sometimes, like, for a researcher, if you want to play with different types of RNNs, this is your only option. Unless you, you know, write your own C++. Or unless you try out Julia or Swift, I guess. Is there a question?

>> Yeah. Why do we need -- why do we need torch.cuda.synchronize? Is it kind of a lock to synchronize CUDA threads or something? >> Yeah, this is something that, thanks to Tom on the forum for pointing this out, it's just when we're timing, without the synchronize, it's -- let's find it.

So I just created a little timing function here. Without the synchronize, the CUDA thing will just keep on running things in the background, but will return -- it will let your CPU thread keep going. So it could end up looking much faster than it actually is. So synchronize says, don't keep going in my Python world until my CUDA world is finished.
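For example, a little timing helper might look like this (a sketch, not the exact function from the notebook):

```python
import time
import torch

def timed(f, *args):
    # CUDA calls are asynchronous: without synchronize() we would only be timing
    # how long it takes to queue the work, not how long it takes to finish.
    if torch.cuda.is_available(): torch.cuda.synchronize()
    start = time.time()
    res = f(*args)
    if torch.cuda.is_available(): torch.cuda.synchronize()
    print(f'{time.time() - start:.4f}s')
    return res
```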

Okay. So now we need dropout. And this is the bit that really is fantastic about AWD-LSTM: Stephen Merity thought about all the ways in which we can regularize a model. So basically, dropout is just Bernoulli random noise. So Bernoulli random noise simply means: create 1s and 0s, and it's a 1 with this probability.

Right? So create a bunch of random 1s and 0s. And then divide by 1 minus p. So that means, in this case with p equal to 0.5, it's randomly 0s and 2s. And the reason they're 0s and 2s is so that the standard deviation doesn't change. So we can remove dropout at inference time and the activations will still be scaled correctly.

And we talked about that a little bit in part 1. And so now we can create our RNN dropout. And one of the nifty things here is the way that Sylvain wrote this is you don't just pass in the thing to dropout, but you also pass in a size.

Now, normally, you would just pass in the size of the thing to dropout like this. But what he did here was he passed in, for the size, size 0, 1, size 2. And so if you remember back to broadcasting, this means that this is going to create something with a unit axis in the middle.

And so when we multiply that, so here's our matrix, when we multiply the dropout by that, our 0s get broadcast. This is really important, right? Because this is the sequence dimension. So every time step, if you drop out time step number 3, but not time step 2 or 4, you've basically broken that whole sequence's ability to calculate anything because you just killed it, right?
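A sketch of that mask-and-broadcast trick, following the description above and assuming x has shape (batch size, sequence length, hidden size):

```python
import torch
import torch.nn as nn

def dropout_mask(x, sz, p):
    # 1s with probability (1-p), 0s otherwise, rescaled by 1/(1-p) so the
    # expected activation is unchanged (with p=0.5 the mask is 0s and 2s).
    return x.new_empty(*sz).bernoulli_(1 - p).div_(1 - p)

class RNNDropout(nn.Module):
    # Variational / RNN dropout: one mask per sequence, with a unit axis on the
    # sequence dimension so a dropped unit is dropped at every time step.
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):                               # x: (bs, seq_len, nh)
        if not self.training or self.p == 0.: return x
        mask = dropout_mask(x, (x.size(0), 1, x.size(2)), self.p)
        return x * mask                                 # mask broadcasts over seq_len
```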

So this is called RNN dropout, or also variational dropout; there are a couple of different papers that introduce the same idea. It's simply that you do dropout on the entire sequence at a time. So that's RNN dropout. The second one that Stephen Merity showed was something he called weight drop.

It actually turns out that this already existed in the computer vision world, where it was called DropConnect. So there are now two things with different names that are the same: weight drop and DropConnect. And this is dropout not on the activations but on the weights themselves. So you can see here, when we do the forward pass, we call set weights, which applies dropout to the actual weights.

So that's our second type of dropout. The next one is embedding dropout. And this one, as you can see, it drops out an entire row. This is actually a coincidence that all these rows are in order, but it drops out an entire row. So by dropping it-- so what it does is it says, OK, you've got an embedding.

And what I'm going to do is I'm going to drop out all of the embedding-- the entire embedding vector for whatever word this is. So it's dropping out entire words at a time. So that's embedding dropout. So with all that in place, we can create an LSTM model. It can be a number of layers.

So we can create lots of LSTMs for however many layers you want. And we can loop through them. And we can basically call each layer. And we've got all our different dropouts. And so basically this code is just calling all the different dropouts. So that is an AWDLSTM. So then we can put on top of that a simple linear model with dropout.

And so this simple linear model-- so it's literally just a linear model where we go dropout and then call our linear model is-- we're going to create a sequential model which takes the RNN, so the AWDLSTM, and passes the result to a single linear layer with dropout. And that is our language model.

Because that final linear layer is the thing which will figure out what the next word is. So the size of it is the size of the vocab. It's good to look at these little tests that we do along the way. These are the things we use to sanity check that everything looks sensible.

And we found, yep, everything does look sensible. And then we added something that AWDLSTM did which is called gradient clipping, which is a callback that just checks after the backward pass what are the gradients. And if the total norm of the gradients-- so the root sum of squares of gradients is bigger than some number, then we'll divide them all so that they're not bigger than that number anymore.

So it's just clipping those gradients; that's how easy it is to add gradient clipping. This is a super good idea, not used as much as it should be, because it really lets you train things at higher learning rates and avoid gradients blowing up.
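Here's a sketch of that callback, assuming the Callback base class and the self.run learner attribute from the course notebooks; the clipping itself is just PyTorch's clip_grad_norm_:

```python
import torch.nn as nn

class GradientClipping(Callback):       # Callback base class from the course notebooks
    def __init__(self, clip=0.1): self.clip = clip
    def after_backward(self):
        # If the total norm of the gradients exceeds self.clip, scale them all
        # down so the norm equals self.clip.
        if self.clip:
            nn.utils.clip_grad_norm_(self.run.model.parameters(), self.clip)
```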

Then there are two other kinds of regularization. This one here is called activation regularization. It's actually just an L2 penalty, just like weight decay, except the penalty is not on the weights, it's on the activations. So this is going to make sure that our activations are never too high. And then this one's really interesting.

This is called temporal activation regularization. This checks how much does each activation change by from sequence step to sequence step, and then take the square of that. So this is regularizing the RNN to say try not to have things that massively change from time step to time step. Because if it's doing that, that's probably not a good sign.
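A minimal sketch of those two penalties as they might be added to the loss. It assumes out is the dropped-out output of the last LSTM layer and raw_out is the same output before dropout, both of shape (batch size, sequence length, hidden size); in the course these live inside the RNNTrainer callback, and the default weights here are the ones I recall from the AWD-LSTM paper:

```python
import torch

def ar_tar(loss, out, raw_out, alpha=2., beta=1.):
    # AR (activation regularization): an L2 penalty, like weight decay,
    # but on the activations rather than the weights.
    loss = loss + alpha * out.float().pow(2).mean()
    # TAR (temporal activation regularization): penalize big changes in the
    # activations from one time step to the next.
    if raw_out.size(1) > 1:
        loss = loss + beta * (raw_out[:, 1:] - raw_out[:, :-1]).float().pow(2).mean()
    return loss
```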

So that's our RNN trainer callback. We set up our loss functions, which are just normal cross-entropy loss, and also a metric which is normal accuracy. But we just make sure that our batch and sequence length is all flattened. So we can create our language model, add our callbacks, and fit.

So once we've got all that, we can use it to train that language model on WikiText-103. So I'm not going to go through this, because it literally just uses what's in the previous notebook. But this shows you how you can download WikiText-103, split it into articles, create the text lists, split into train and valid, tokenize, numericalize, data-bunchify, create the model that we just saw, and train it, in this case for about five hours.

Because it's quite a big model. So because we don't want you to have to train for five hours this RNN, you will find that you can download that small pre-trained model from this link. So you can now use that on IMDB. So you can, again, grab your IMDB data set, download that pre-trained model, load it in.

And then we need to do one more step, which is that the embedding matrix for the pre-trained WikiText-103 model is for a different bunch of words than the IMDB version. So they've got different vocabs with some overlap. So I won't go through the code, but what we do is go through each vocab item in the IMDB vocab, find out if it's in the WikiText-103 vocab, and if it is, we copy WikiText-103's embedding over.

So that way we'll end up with an embedding matrix for IMDB that matches the WikiText-103 embedding matrix any time there's a word that's the same. And any time there's a word that's missing, we just use the mean bias and the mean weights.
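Here's a minimal sketch of that matching for the embedding weights only (the course version does the same thing for the decoder bias as well); the function name and signature are illustrative:

```python
import torch

def match_embeds(old_wgts, old_vocab, new_vocab):
    # old_wgts: pretrained embedding matrix, shape (len(old_vocab), emb_sz).
    # Copy the pretrained row when the word exists in the old vocab,
    # otherwise fall back to the mean of all pretrained rows.
    mean_wgt = old_wgts.mean(dim=0)
    new_wgts = old_wgts.new_zeros(len(new_vocab), old_wgts.size(1))
    otoi = {w: i for i, w in enumerate(old_vocab)}
    for i, w in enumerate(new_vocab):
        new_wgts[i] = old_wgts[otoi[w]] if w in otoi else mean_wgt
    return new_wgts
```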

So that's all that is. Okay, so once we've done that, we can then define a splitter just like before to create our layer groups. We can set up our callbacks, our learner, we can fit, and so then we'll train that for an hour or so, and at the end of that we have a fine-tuned IMDB language model.

So now we can load up our classifier data bunch, which we created earlier; that's exactly the same lines of code we had before. I'm going to ignore the pack_padded_sequence stuff, but basically there's a neat little trick in PyTorch where you can take data that's of different lengths, call pack_padded_sequence, pass that to an RNN, and then call pad_packed_sequence, and it basically takes things of different lengths and handles them optimally in an RNN.

So we basically update our AWD LSTM to use that. You might remember that for ULM fit, we kind of create our hidden state in the LSTM for lots of time steps, and we want to say, "Oh, which bit of state do we actually want to use for classification?" People used to basically use the final state.

Something that I tried, and it turned out to work really well, so it ended up in the paper, was that we actually do an average pool and a max pool and use the final state, and we concatenate them all together. So this is like the concat pooling we do for images.

We do the same kind of thing for text. So we put all that together. This is just checking that everything looks sensible, and that gives us something that we call the pooling linear classifier, which is just a list of batch norm dropout linear layers and our concat pooling, and that's about it.

So we just go through our sentence, one BPTT at a time, keep calling that thing, and keep appending the results. So once we've done all that, we can train it. So here's our normal set of callbacks. We can load our fine-tuned encoder, and we can train, and we get 92% accuracy, which is pretty close to where the state-of-the-art was a very small number of years ago. It's not quite the roughly 94.5-95% we got for the paper, because that used a bigger model that we trained for longer.

So that was a super-fast zip-through ULM fit, and plenty of stuff which is probably worth reading in more detail, and we can answer questions on the forum as well. So let's spend the last 10 minutes talking about Swift, because the next two classes are going to be about Swift.

So I think anybody who's got to lesson 12 in this course should be learning Swift for TensorFlow. The reason why is that I think Python's days are numbered. That stuff I showed you about JIT: the more I use JIT, the more I think about it, the more it looks like failed examples of software development processes I've seen in the last 25 years.

Whenever people try to convert one language into a different language, and then you're kind of using the language that you're not really using, it requires brilliant, brilliant people like the PyTorch team years to make it almost kind of work. So I think Julia or Swift will eventually in the coming years take over.

I just don't think Python can survive, because we can't write CUDA kernels in Python. We can't write RNN cells in Python and have them work reliably and fast. DL libraries change all the time anyway, so if you're spending all your time just studying one library and one language, then you're not going to be ready for that change.

So you'll need to learn something new anyway. It'll probably be Swift or Julia, and I think they're both perfectly good things to look at. Regardless, I've spent time using in real-world scenarios at least a couple of dozen languages, and every time I learn a new language, I become a better developer.

So it's just a good idea to learn a new language. And the "for TensorFlow" bit might put you off a bit, because I've complained a lot about TensorFlow, but TensorFlow in the future is going to look almost totally different to TensorFlow in the past. The things that are happening with Swift for TensorFlow are so exciting.

So there's basically almost no data science ecosystem for Swift, which means the whole thing is open for you to contribute to. So you can make serious contributions, look at any Python little library or just one function that doesn't exist in Swift and write it. The Swift community doesn't have people like us.

They have people that understand deep learning. They're just not people who are generally in the Swift community right now with some exceptions. So we are valued. And you'll be working on stuff that will look pretty familiar, because we're building something a lot like fast AI, but hopefully much better.

So with that, I have here Chris Lattner (come on over), who started the Swift project and is now running the Swift for TensorFlow team at Google. And we have time for, I think, three questions from the community for Chris and me. >> Sure. Assuming someone has zero knowledge of Swift, what would be the most efficient way to learn it and get up to speed with using Swift for TensorFlow?

>> Sure. So the courses we're teaching will assume they don't have prior Swift experience, but if you're interested, you can go to Swift.org. In the documentation tab, there's a whole book online. The thing I recommend is there's a thing called "A Swift Tour". You can just Google for that. It gives you a really quick sense of what the language looks like.

And it explains the basic concepts. It's super accessible. That's where I want to start. >> The best version of the Swift book is on the iPad. It uses something called Swift Playgrounds, which is one of these amazing things that Chris built, which basically lets you go through the book in a very interactive way.

It will feel a lot like the experience of using a Jupyter notebook, but it's even more fancy in some ways. So you can read the book as you experiment. >> As Swift for TensorFlow evolves, what do you think will be the first kind of machine learning work accessible to people who don't have access to big corporate data centers where Swift for TensorFlow's particular strengths will make it a better choice than the more traditional Python frameworks?

>> Sure. I don't know what that first thing will be. But I think you have to look at the goals of the project. And I think there's two goals for this project overall. One is to be very subtractive. And subtractive of complexity. And I think that one of the things that Jeremy's highlighting is that in practice, being effective in the machine learning field means you end up doing a lot of weird things at different levels.

And so you may be dropping down to C++ or writing CUDA code, depending on what you're doing. Or playing with these other systems or these other C libraries that get wrapped up with Python. But these become leaky abstractions you have to deal with. So we're trying to make it so you don't have to deal with a lot of that complexity.

So you can stay in one language. It works top to bottom. It's fast, has lots of other good things to go with it. So that's one aspect of it. The other pieces, we're thinking about it from the bottom up, including the compiler bits, all the systems integration pieces, the application integration pieces.

And I have a theory that once we get past the world of Python here, that people are going to start doing a lot of really interesting things where you integrate deep learning into applications. And right now the application world and the ML world are different. I mean, people literally export their model into like an ONNX or TF serving or whatever, and dump it into some C++ thing where it's a whole new world.

It's a completely different world. And so now you have this barrier between the training, the learning and the ML pieces. And you have the application pieces. And often these are different teams or different people thinking about things in different ways. And breaking down those kinds of barriers, I think is a really big opportunity that enables new kinds of work to be done.

Very powerful. And that leads well into the next pair of questions. Does it make sense to spend efforts learning and writing in Swift only, or is it worth to have some understanding of C++ as well to be good in numerical computations? And then secondly, after going through some of the Swift documentations, it seems like it's a very versatile language.

If I understand correctly, deep learning, robotics, web development, and systems programming all seem well under its purview. Do you foresee Swift's influence flourishing in all these separate areas and allowing for a tighter and more fluid development between disciplines? Sure. So I think these are two sides of the same coin.

I totally agree with Jeremy. Learning new programming languages is good just because often you learn to think about things in a new way, or they open up new kinds of approaches, and having more different kinds of mental frameworks gives you the ability to solve problems that otherwise you might not be able to do.

And so learning C++ in the abstract is a good thing. Having to use C++ is a little bit of a different thing, in my opinion. And so C++ has lots of drawbacks. This is coming from somebody who's written a C++ compiler. I've written way too much C++ myself, and maybe I'm a little bit damaged here, but C++ is a super complicated language.

It's also full of memory safety problems and security vulnerabilities and a lot of other things that are pretty well known. It's a great language. It supports tons of really important work, but one of the goals with Swift is to be a full stack language and really span from the scripting all the way down to the things C++ is good at, and getting C++-level performance in the same language that you can do high-level machine learning frameworks in is pretty cool.

I think that that's one of the really unique aspects of Swift, is it was designed for compilation, for usability, for accessibility, and I'm not aware of a system that's similar in that way. Great. I'm really looking forward to it. Can we ask one more question? I think we're out of time.

Sorry. Yeah. One more question next time. Thanks, everybody. Thank you, Chris Lattner, and we'll see you next week.