Lesson 4: Deep Learning 2018
Chapters
0:0
10:11 Dropout and Generalization
17:40 Default Dropout Values
28:3 Treating Continuous Variables as Categorical
45:7 Create the Learner
45:44 Embeddings
46:4 Continuous Variables
57:5 Embedding Matrices
61:56 Distributed Representation
71:50 Custom Metrics
79:14 Data Augmentation
79:32 What Is Dropout Doing
84:48 NLP
85:10 Language Modeling
91:14 Language Model
97:32 Create a Language Model
99:50 Tokenization in NLP
100:58 Example of Tokenization
102:4 Create the Model Data Object
102:25 Model Data Object
102:55 Minimum Frequency
107:11 Bag of Words
115:55 Create a Model
116:0 Embedding Matrix
118:19 Create an Embedding Matrix
124:24 Fit the Model
00:00:00.000 |
Okay, hi everybody welcome back good to see you all here 00:00:11.360 |
Lots of cool things going on and like last week 00:00:16.400 |
I wanted to highlight a few really interesting articles that some of you folks have written 00:00:24.640 |
Fatali wrote one of the best articles I've seen for a while. I think actually talking about 00:00:32.880 |
differential learning rates and stochastic gradient descent with restarts 00:00:36.720 |
Be sure to check it out if you can because what he's done. I feel like he's done a great job of 00:00:43.040 |
kind of positioning it in a place where you can get a lot out of it 00:00:48.360 |
You know regardless of your background, but for those who want to go further 00:00:52.160 |
He's also got links to, like, the academic papers it came from, and graphs showing examples of all the things 00:01:02.440 |
Nicely done article so a good kind of role model for technical communication 00:01:08.600 |
One of the things I've liked about you know seeing people post these 00:01:12.560 |
post these articles during the week is that the discussions on the forums have also been really great. There's been a lot of 00:01:22.320 |
explaining things — you know, maybe there are parts of a post where people have said actually that's not quite how it works 00:01:28.640 |
And people have learned new things that way people have come up with new ideas as a result as well 00:01:33.600 |
These discussions of stochastic gradient descent with restarts and cyclical learning rates — there have been a few of them actually 00:01:48.720 |
Similar topic and why it works so well and again lots of great pictures and references to 00:01:53.780 |
Papers and most importantly perhaps code are showing how it actually works 00:01:59.600 |
Mark Hoffman covered the same topic at kind of a nice introductory level. I think really really kind of clear intuition 00:02:10.180 |
Many can't talk specifically about differential learning rates 00:02:15.920 |
and why it's interesting, and again providing some nice context for people not familiar with transfer learning 00:02:22.120 |
you know, going back to saying, well, what is transfer learning? 00:02:24.520 |
Why is that interesting and given that why could differential learning rates be helpful? 00:02:35.440 |
One thing I liked about another article was that it talked not just about the technology that we're looking at but also about some of the 00:02:42.840 |
implications particularly from a commercial point of view 00:02:45.280 |
So thinking about like based on some of the things we've learned about so far 00:02:49.800 |
What are some of the implications that that has you know in real life? 00:02:56.180 |
And then discussing some of the yeah some of the implications 00:03:00.400 |
So there's been lots of great stuff online and thanks to everybody for all the great work that you've been doing 00:03:08.640 |
As we talked about last week if you're kind of vaguely wondering about writing something 00:03:13.800 |
But you're feeling a bit intimidated about it because you've never really written a technical post before just jump in you know 00:03:21.980 |
it's a welcoming and encouraging group, I think, to work with 00:03:26.360 |
So we're going to have a kind of an interesting lesson today, which is we're going to cover a 00:03:37.120 |
Whole lot of different applications, so we've we've spent quite a lot of time on computer vision 00:03:42.600 |
And today we're going to try if we can to get through three totally different areas 00:03:48.120 |
Structured learning so looking at kind of how you look at 00:03:52.120 |
So we're going to start out looking at structured learning or structured data learning by which I mean 00:03:59.340 |
Building models on top of things look more like database tables 00:04:04.680 |
So kind of columns of different types of data. They might be financial or geographical or whatever 00:04:10.240 |
We're going to look at using deep learning for language natural language processing 00:04:16.600 |
And we're going to look at using deep learning for recommendation systems, and so we're going to cover these 00:04:22.440 |
at a very high level, and the focus will be on how you actually use these applications 00:04:31.480 |
more than on what's going on behind the scenes, and then in the next three lessons 00:04:36.160 |
We'll be digging into the details of what's been going on behind the scenes and also coming back to 00:04:41.920 |
Looking at a lot of the details of computer vision that we kind of skipped over so far 00:04:47.740 |
So the focus today is really on like how do you actually do these applications? 00:04:53.880 |
And we'll kind of talk briefly about some of the concepts involved 00:04:59.720 |
Before we do I did want to talk about one key 00:05:06.480 |
Which is dropout, and you might have seen dropout mentioned a bunch of times already and got the impression that this is something important 00:05:14.840 |
So to look at dropout. I'm going to look at the the dog breeds 00:05:18.740 |
current Kaggle competition that's going on, and what I've done is I've gone ahead and created a 00:05:30.240 |
learner in the usual way, and I've passed in precompute=True, and so that's going to 00:05:34.520 |
Pre-compute the activations that come out of the last convolutional layer. Remember an activation is just a number 00:05:46.480 |
Like here is one activation. It's a number and 00:05:49.600 |
Specifically the activations are calculated based on some 00:05:58.160 |
kernels or filters and they get applied to the previous layers activations 00:06:03.160 |
Which could well be the inputs or they could themselves be the results of other calculations 00:06:09.000 |
Okay, so when we say activation keep remembering we're talking about a number that's being calculated 00:06:17.580 |
And then what we do is we put on top of that a bunch of additional 00:06:24.880 |
Fully connected layers, so we're just going to do some matrix multiplications on top of these, just like in our Excel worksheet 00:06:34.000 |
We had this matrix that we just did a matrix multiplication 00:06:39.520 |
So what you can actually do is if you just type 00:06:45.360 |
The name of your learner object you can actually see 00:06:49.200 |
What's in it? You can see the layers in it. So when I was previously been skipping over a little bit about oh 00:06:54.800 |
We add a few layers to the end. These are actually the layers that we add 00:06:58.120 |
We're going to do batch norm in the last lesson. So don't worry about that for now a 00:07:03.240 |
Linear layer simply means a matrix multiply. Okay, so this is a matrix which has 1,024 rows and 00:07:10.360 |
512 columns, and so in other words, it's going to take in 1,024 activations and spit out 512 activations 00:07:20.400 |
Then we have a relu which remember is just replace the negatives with zero 00:07:27.200 |
We'll come back to drop out and then we have a second linear layer that takes those 00:07:30.800 |
512 activations from the previous linear layer and puts them through a new matrix multiply 00:07:35.880 |
512 by 120 spits out a new 120 activations and then finally put that through 00:07:43.440 |
Softmax, and for those of you that don't remember softmax, we looked at that at the end of last week 00:07:51.960 |
Take the previous activation — let's say for dog — 00:07:55.680 |
go e to the power of that and then divide that by the sum of e to the power of all the activations 00:08:02.960 |
So that was the thing that adds up to one all of them add up to one and each one individually is between zero and one 00:08:10.960 |
That's what we added on top, and when we have precompute=True 00:08:15.920 |
that's the thing we train. So I wanted to talk about what this dropout is and what this p is, because it's a really important parameter 00:08:25.000 |
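Before getting into dropout, here is a minimal sketch of what that added head looks like in plain PyTorch, using the sizes mentioned above for the dog breeds example (this is an illustration, not the exact fastai internals — for instance the real library also inserts the batch norm layers mentioned a moment ago, which are skipped here):

```python
import torch.nn as nn

# A rough sketch of the layers added on top of the precomputed
# convolutional activations (sizes follow the dog-breeds example above).
head = nn.Sequential(
    nn.Dropout(p=0.25),        # first dropout layer
    nn.Linear(1024, 512),      # matrix multiply: 1,024 activations in, 512 out
    nn.ReLU(),                 # replace negatives with zero
    nn.Dropout(p=0.5),         # second dropout layer
    nn.Linear(512, 120),       # 120 dog breeds
    nn.LogSoftmax(dim=1),      # log of the softmax, for numerical stability
)
```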
So a dropout layer with P equals zero point five 00:08:28.340 |
Literally does this we go over to our spreadsheet and let's pick any layer with some activations and let's say okay 00:08:34.800 |
I'm going to apply dropout with a p of 0.5 to this. What that means is I go through and 00:08:44.720 |
pick a cell — pick an activation — so I pick half of them at random and I delete them 00:08:54.880 |
That's what dropout is, right? So the p=0.5 means: what's the probability of deleting each activation 00:09:11.560 |
It doesn't actually change by very much at all just a little bit particularly because remember it's going through a max pooling layer 00:09:17.760 |
Right, so it's only going to change it at all if it was actually the maximum in that group of four 00:09:22.660 |
and furthermore, it's just one piece of you know, if it's going into a convolution rather than into a max pool 00:09:35.720 |
The idea of like randomly throwing away half of the activations in a layer 00:09:41.840 |
Has a really interesting result and one important thing to mention is each mini batch we throw away a different 00:09:50.960 |
random half of activations in that layer and so what it means is 00:09:56.160 |
It forces it to not overfit. Right, in other words, if there's some particular activation 00:10:05.680 |
that has learned that exact dog or that exact cat, right, then when that gets dropped out 00:10:12.160 |
The whole thing now isn't going to work as well. It's not going to recognize that image, right? 00:10:17.160 |
so it has to, in order for this to work, it has to try and find a 00:10:24.280 |
representation that actually continues to work even as a random half of the activations gets thrown away every time 00:10:31.720 |
Right, so it's, I guess, about three or four years old now, and it's been absolutely critical in 00:10:43.120 |
making modern deep learning work, and the reason why is it really just about solved 00:10:49.120 |
The problem of generalization for us before dropout came along 00:10:53.440 |
if you try to train a model with lots of parameters and you were overfitting and 00:11:01.160 |
You already tried all the data augmentation you could, and you already had as much data as you could. 00:11:07.240 |
There were some other things you could try, but to a large degree you were kind of stuck 00:11:13.800 |
And so then Geoffrey Hinton and his colleagues came up with this dropout idea that was loosely inspired by the way the brain works 00:11:22.160 |
And also loosely inspired by Geoffrey Hinton's experience with bank teller queues, apparently 00:11:30.160 |
yeah, somehow they came up with this amazing idea of like hey, let's let's try throwing things away at random and 00:11:36.080 |
So as you can imagine if your P was like 0.01 00:11:42.420 |
Then you're throwing away 1% of your activations for that layer at random. It's not going to reduce 00:11:56.040 |
overfitting much at all. On the other hand, if your p was 0.99 00:12:00.040 |
then that would be like going through the whole thing and throwing away nearly everything right and 00:12:06.360 |
That would be very hard for it to overfit, so that would be great for generalization, but it's also going to kill your 00:12:19.040 |
accuracy. So there's this trade-off: high p values generalize well 00:12:22.860 |
But will decrease your training accuracy and low P values will generalize less well, but will give you a less good training accuracy 00:12:30.760 |
So for those of you that have been wondering why is it that particularly early in training are my validation losses better? 00:12:39.040 |
Than my training losses right which seems otherwise really surprising. Hopefully some of you have been wondering why that is 00:12:46.400 |
because on a data set that it never gets to see you wouldn't expect the losses to ever be 00:12:51.880 |
better. And the reason why is because when we look at the validation set, we turn off dropout 00:12:58.200 |
Right so in other words when you're doing inference when you're trying to say is this a cat or is this a dog? 00:13:05.800 |
we're not going to apply random dropout there, right — we want to be using the best model we can 00:13:10.300 |
Okay, so that's why, early in training in particular, the validation loss can look better than the training loss 00:13:24.000 |
Do you have to do anything to accommodate for the fact that you are throwing away some of the activations? 00:13:34.280 |
We don't, but PyTorch does. So PyTorch behind the scenes does two things: if you say p=0.5, it throws away half the activations and it 00:13:48.120 |
doubles all the activations that remain, so on average the average activation doesn't change 00:13:57.640 |
So yeah, you don't have to worry about it basically it's done for you 00:14:08.560 |
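If you want to see that behaviour for yourself, here is a small hedged illustration (plain PyTorch, not fastai code): in training mode dropout zeroes activations at random and scales the survivors by 1/(1-p), and in evaluation mode it does nothing.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)       # eight activations, all equal to 1.0

drop.train()            # training mode: dropout is active
print(drop(x))          # roughly half are 0, the rest are 2.0 (scaled by 1/(1-p))

drop.eval()             # evaluation / inference mode: dropout is turned off
print(drop(x))          # unchanged: all ones
```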
This is the p value for all of the added layers — it's how you say, 00:14:13.820 |
with fastai, what dropout you want on each of the layers in these added layers 00:14:19.360 |
It won't change the dropout in the pre-trained network — the hope is that that's already been 00:14:25.440 |
pre-trained with some appropriate level of dropout 00:14:28.440 |
We don't change it but on these layers that we add you can say how much and so you can see here 00:14:33.080 |
I said ps=0.5, so my first dropout has 0.5 and my second dropout has 0.5 00:14:39.460 |
All right, and remember coming to the input of this 00:14:42.240 |
Was the output of the last convolutional layer of pre-trained network? 00:14:47.460 |
And we actually throw away half of that, then go through our linear layer 00:14:56.120 |
Throw away half of the result of that go through another linear layer and then pass that to our softmax 00:15:04.080 |
For minor numerical precision reasons, it turns out to be better to take the log of the softmax rather than the softmax directly 00:15:12.080 |
And that's why you'll have noticed that when you actually get predictions out of our models, you always have to exponentiate them to get probabilities back 00:15:20.040 |
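As a small hedged sketch of that: if the model's last layer is a log-softmax, its outputs are log-probabilities, and exponentiating recovers the actual probabilities.

```python
import numpy as np
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])
log_probs = F.log_softmax(logits, dim=1)   # what a LogSoftmax output layer gives you
probs = np.exp(log_probs.numpy())          # exponentiate to recover probabilities
print(probs, probs.sum())                  # each between 0 and 1, summing to 1
```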
Again, the details as to why aren't important. So if we want to 00:15:25.240 |
try removing dropout, we could go ps=0 00:15:30.200 |
Right, and you'll see that whereas before we started with 0.76 accuracy in the first epoch, now 00:15:35.680 |
You've got a point eight accuracy in the first epoch 00:15:37.680 |
So by not doing dropout our first epoch worked better not surprisingly because we're not throwing anything away 00:15:44.240 |
but by the third epoch here, we had eighty four point eight and 00:15:48.160 |
Here we have eighty four point one. So it started out better and ended up worse 00:15:53.080 |
So even after three epochs, you can already see we're massively overfitting, right? 00:15:57.520 |
We've got point three loss on the train and point five loss on the validation 00:16:03.560 |
And so if you look now you can see in the resulting model there's no dropout at all 00:16:11.760 |
So if the P is zero, we don't even add it to the model 00:16:14.840 |
Another thing to mention is you might have noticed that what we've been doing is we've been adding two linear layers 00:16:26.000 |
in our additional layers. You don't have to do that. By the way, there's actually a parameter for the extra fully connected 00:16:33.520 |
layers where you can basically pass a list of how big you want each of the additional fully connected layers to be 00:16:45.840 |
Right because you need something that takes the output of the convolutional layer 00:16:50.120 |
which in this case is a size 1024 and turns it into the number of 00:16:54.800 |
Classes you have cats versus dogs would be two dog breeds would be 120 00:17:00.960 |
the planet satellite competition, 17 — whatever. You always need at least one linear layer, and you can't pick how big that is 00:17:00.960 |
But you can choose what the other size is or if it happens at all 00:17:15.600 |
So if we were to pass in an empty list, then now we're saying don't add any additional linear layers 00:17:23.080 |
Right. So here, if we've got ps=0 and the extra fully connected layers list is empty, this is like the minimum 00:17:32.240 |
Kind of top model we can put on top and again like if we do that 00:17:37.800 |
You can see above we actually end up with in this case a 00:17:44.960 |
reasonably good result, because we're not training it for very long and this particular pre-trained network is really well suited to this particular problem 00:17:54.040 |
So Jeremy, what kind of p should we be using? 00:18:14.760 |
The ones that are there by default — 0.25 for the first layer and 0.5 for the second — seem to work pretty well for most things, right? So you don't necessarily need to change it at all. Basically, if you find it's overfitting 00:18:23.200 |
Just start bumping it up. So try first of all setting it to 0.5 00:18:28.240 |
That'll set them both to 0.5; if it's still overfitting a lot, try 0.7 — like, you can narrow down how much to change the 00:18:37.040 |
numbers, right. And if you're underfitting 00:18:44.160 |
It's unlikely you would need to make it much lower because like even in these dogs versus cats situations 00:18:51.600 |
You know, we don't seem to have to make it lower so it's more likely you'd be increasing it to like 0.6 or 0.7 00:18:58.800 |
But you can fiddle around. I find the ones that are there by default seem to work pretty well most of the time. The place I changed it 00:19:10.440 |
was in the dog breeds one — I did set them both to 0.5, because I was using a 00:19:16.760 |
bigger model. ResNet-34 has fewer parameters 00:19:21.120 |
so it doesn't overfit as much, but then when I started bumping it up to like a ResNet-50 00:19:26.420 |
Which has a lot more parameters. I noticed it started overfitting. So then I also increased my dropout. So as you use like 00:19:32.920 |
Bigger models you'll often need to add more dropout. Can you pass that over there, please? You know 00:19:39.360 |
If you set p to 0.5, roughly what percentage gets thrown away — is it 50%? 50%, yeah 00:19:51.640 |
Thanks. Is there a particular way in which you can determine if the data is being overfitted? 00:20:01.280 |
You can see that — like here, you can see that the training loss is a lot lower than the validation loss. But 00:20:15.080 |
zero overfitting is not generally optimal — the only way to find that out is 00:20:19.920 |
Remember the only thing you're trying to do is to get this number low right the validation loss number low 00:20:24.440 |
So in the end you kind of have to play around with a few different things and see which thing ends up getting the validation 00:20:31.080 |
Loss low, but you're kind of going to feel over time for your particular problem 00:20:36.720 |
What does overfitting? What does too much overfitting look like? 00:20:44.840 |
So that's dropout and we're going to be using that a lot and remember it's there by default service here another question 00:21:00.280 |
Does it, you know, delete each cell with a probability of 00:21:06.120 |
0.5, or does it just pick exactly 50% randomly? I mean, I know both are effectively the same 00:21:11.280 |
It's the former, yeah. Okay — second question is, why does the average activation matter? 00:21:17.920 |
Well, it matters because, remember, the next layer takes these activations 00:21:40.520 |
right, and adds them up. So if we deleted half of these 00:21:44.000 |
then that would also cause this number to halve, which would cause everything else after that to change. And so if you change 00:21:51.600 |
what it means, you know, then you're changing something that used to say, like, oh 00:21:57.080 |
ears are fluffy if this is greater than 0.6 — now 00:22:00.720 |
It's only fluffy if it's greater than point three like we're changing the meaning of everything 00:22:04.000 |
So the goal here is to delete things without changing the meaning 00:22:08.800 |
We're using a linear activation for one of the earlier activations 00:22:17.560 |
Why are we using linear? Yeah? Why that particular activation? 00:22:22.040 |
Because that's what this set of layers is — the pre-trained network is the convolutional network 00:22:28.960 |
and that's precomputed, so we don't see it. So what that spits out is a vector 00:22:35.320 |
So the only choice we have is to use linear layers at this point 00:22:41.760 |
Can we have different levels of dropout by layer? Yes, absolutely. How to do that? Great, so 00:22:49.880 |
you can absolutely have different dropout by layer, and that's why this is actually called ps 00:22:54.720 |
So you can pass in an array here, so if I went zero 00:22:58.400 |
comma 0.2, for example, and then for the extra fully connected layers I might add 512 00:23:05.120 |
Right then that's going to be zero dropout before the first of them and point two dropout before the second of them 00:23:12.800 |
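As a hedged sketch of what that call might look like with the fastai library used in this course (`arch` and `data` are assumed to be the architecture and model data objects set up earlier in the notebook; exact argument behaviour may differ in other library versions):

```python
# Hypothetical sketch of the call just described: zero dropout before the
# first added linear layer, 0.2 before the second, and one extra 512-wide
# fully connected layer before the final classifier layer.
learn = ConvLearner.pretrained(arch, data, precompute=True,
                               ps=[0., 0.2], xtra_fc=[512])
```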
Yes. And I must admit, I don't have a great 00:23:17.760 |
Intuition even after doing this for a few years for like 00:23:20.640 |
When should earlier or later layers have different amounts of dropout? 00:23:26.640 |
It's still something I kind of play with and I can't quite find rules of thumb 00:23:32.000 |
So if some of you come up with some good rules of thumb, I'd love to hear them 00:23:37.840 |
You can use the same dropout in every fully connected layer 00:23:42.040 |
The other thing you can try is often people only put dropout on the very last 00:23:48.040 |
Linear layer, so that'd be the two things to try 00:23:50.440 |
So Jeremy, why do you monitor the log loss the loss instead of the accuracy going up? 00:24:00.800 |
Well because the loss is the only thing that we can see 00:24:05.080 |
For both the validation set and the training set so it's nice to be able to compare them 00:24:16.720 |
Also, as we'll learn more about later, the loss is the thing that we're actually optimizing 00:24:22.240 |
So it's it's kind of a little more. It's a little easier to monitor that and understand what that means 00:24:32.120 |
So with dropout we are kind of adding some random noise every iteration right so 00:24:39.240 |
so that means that we don't do as much learning each iteration, right? Yeah, that's right. 00:24:45.800 |
So do we have to play around with the learning rate? It doesn't seem to impact the learning rate 00:24:50.860 |
Enough that I've ever noticed it. I I would say you're probably right in theory it might but not enough that it's ever affected me 00:25:07.360 |
Okay, so let's now look at our structured data problem, and to remind you, we were looking at Kaggle's Rossmann competition 00:25:17.160 |
a chain of supermarkets, I believe, and you can find this in the lesson 3 Rossmann notebook 00:25:26.280 |
The main data set is the one where we were looking to say, at a particular store on a particular date, how many sales were there 00:25:36.040 |
Okay, and there's a few big key pieces of information one is what was the date another was were they open? 00:25:47.400 |
and was it a holiday — was there a state holiday there 00:25:51.360 |
or was it a school holiday there — and then we had some more information about stores, like, for this store 00:25:57.200 |
What kind of stuff did they tend to sell what kind of store are they how far away the competition and so forth so? 00:26:03.240 |
With the data set like this there's really two main kinds of column. There's columns that we think of as 00:26:10.600 |
Categorical they have a number of levels so the assortment 00:26:13.760 |
column is categorical, and it has levels such as a, b, and c 00:26:19.200 |
Whereas something like competition distance we would call continuous 00:26:25.380 |
It has a number attached to it where differences or ratios of that number have some kind of meaning 00:26:31.480 |
And so we need to deal with these two things quite differently, okay, so anybody who's done any 00:26:39.240 |
Machine learning of any kind will be familiar with using continuous columns if you've done any linear regression for example 00:26:45.400 |
You can just like multiply them by parameters for instance 00:26:48.680 |
Categorical columns we're going to have to think about a little bit more 00:26:52.440 |
We're not going to go through the data cleaning or the feature engineering; we're going to assume all that's been done 00:27:04.280 |
at the end of that we have a list of columns and the in this case I 00:27:09.960 |
Didn't do any of the thinking around the feature engineering or data cleaning myself 00:27:16.920 |
This is all directly from the third place winners of this competition 00:27:20.680 |
And so they came up with all of these different columns 00:27:30.640 |
You'll notice the list here is a list of the things that we're going to treat as categorical variables 00:27:42.480 |
although we could treat them as continuous — like, the difference between 2000 and 2003 is meaningful 00:27:51.200 |
We don't have to right and you'll see shortly how 00:28:00.880 |
But basically if we decide to make something a categorical variable what we're telling our neural net down the track is 00:28:07.480 |
That for every different level of say year, you know, 2000 2001 2002 you can treat it totally differently 00:28:14.920 |
Whereas if we say it's continuous, it's going to have to come up with some kind of function — some kind of smooth-ish 00:28:22.120 |
function right and so often even for things like year that actually are continuous 00:28:29.280 |
but don't actually have many distinct levels, it often works better to treat them as categorical 00:28:36.200 |
So another good example day of week, right? So like day of week between naught and six 00:28:42.080 |
It's a number and it means something like the difference between three and five is two days and has meaning but if you think about 00:28:54.520 |
It could well be that like, you know, Saturdays and Sundays are over here and Fridays are over here and Wednesdays over here 00:29:03.000 |
behaving kind of qualitatively differently, right? So by saying this is a categorical variable, as you'll see, we're going to let the neural net 00:29:03.000 |
figure that out, right? So this thing where we say 00:29:11.920 |
Which are continuous and which are categorical to some extent? This is a modeling decision you get to make 00:29:23.440 |
now if something is coded in your data is like a B and C or 00:29:29.560 |
You know, 'Jeremy' and 'Yannet' or whatever, you're going to have to call that categorical, right? 00:29:36.400 |
There's no way to treat that directly as a continuous variable 00:29:40.000 |
On the other hand if it starts out as a continuous variable like age or day of week 00:29:48.280 |
you get to pick whether you want to treat it as continuous or categorical. Okay, so to summarize: if it's categorical in the data 00:29:54.060 |
It's going to have to be categorical in the model if it's continuous in the data 00:29:58.240 |
You get to pick whether to make it continuous or categorical in the model 00:30:02.680 |
So in this case again, I just did whatever the third place winners of this competition did 00:30:09.440 |
These are the ones that they decided to use as categorical. These were the ones they decided to use as continuous and you can see 00:30:18.000 |
The continuous ones are all of the ones which are actual 00:30:22.120 |
Floating point numbers like competition distance actually has a decimal place to it, right and temperature actually has a decimal place to it 00:30:32.080 |
I wouldn't make those categorical, because they have many, many levels, right — like if it's five digits of floating point, then potentially there will be as many levels as 00:30:32.080 |
there are rows. And by the way, the word we use to say how many levels are in a category is 'cardinality' 00:30:43.160 |
So if you hear me say cardinality for example the cardinality of the day of week 00:30:55.200 |
Variable is seven because there are seven different days of the week 00:30:58.480 |
Do you have a heuristic for when to bin continuous variables — do you ever bin variables? I don't ever bin continuous variables 00:31:11.800 |
So yeah, so one thing we could do with like max temperature is group it into 00:31:16.520 |
0 to 10 10 to 20 20 to 30 and then call that categorical 00:31:21.000 |
Interestingly, a paper just came out last week in which a group of researchers found that binning can sometimes be helpful 00:31:30.440 |
But that literally came out in the last week and until that time I haven't seen anything in deep learning saying that so I haven't 00:31:36.440 |
I haven't looked at it myself until this week. I would have said it's a bad idea 00:31:41.360 |
Now I have to think differently. I guess maybe it is sometimes 00:31:51.480 |
If you use year as a category, what happens when you run the model on a year it's never seen? So you trained it on 00:31:58.080 |
Well, we'll get there. Yeah, the short answer is it'll be treated as an unknown category 00:32:04.300 |
And so pandas, which is the underlying data frame library 00:32:08.840 |
we're using, has with its categoricals a special category called unknown, and if it sees a category it hasn't seen before, it gets treated as unknown 00:32:16.800 |
So for our deep learning model unknown would just be another category 00:32:22.600 |
If our training data set doesn't have a particular category and 00:32:32.080 |
the test set has an unknown one, how will it be handled? It'll just be part of this unknown category. Will it still predict? 00:32:39.480 |
It'll predict something right like it will just have the value 00:32:42.940 |
0 behind the scenes and if there's been any unknowns of any kind in the training set then it'll have learned a 00:32:49.680 |
Way to predict unknown if it hasn't it's going to have some random vector. And so that's a 00:32:56.480 |
Interesting detail around training that we probably won't talk about in this part of the course 00:33:03.720 |
Okay, so we've got our categorical and continuous variable lists defined; in this case there were about 800,000 rows. So we loop through our categorical variables and 00:33:28.880 |
replace each one in the data frame with a version where you take it and change its type to category 00:33:34.800 |
Okay, and so that just that's just a pandas thing. So I'm not going to teach you pandas 00:33:41.120 |
There's plenty of books particularly Wes McKinney's books book on Python for data analysis is great 00:33:47.080 |
But hopefully it's intuitive as to what's going on even if you haven't seen the specific syntax before 00:33:52.320 |
So we're going to turn that column into a categorical column 00:33:56.840 |
And then for the continuous variables, we're going to make them all 00:33:59.920 |
32-bit floating-point and for the reason for that is that PyTorch 00:34:05.400 |
expects everything to be 32-bit floating point. Okay, so some of these are booleans — I 00:34:16.720 |
can't see them straight away, but anyway, some of them — yeah, like was there a promo, was it a holiday 00:34:23.760 |
and so those will become the floating-point values one and zero, for instance. Okay, so 00:34:29.640 |
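A hedged sketch of those two conversion loops in pandas, assuming the variable lists are called `cat_vars` and `contin_vars` as in the notebook being described:

```python
# Turn each categorical column into a pandas category type, and each
# continuous column into 32-bit floats (what PyTorch expects).
for c in cat_vars:
    df[c] = df[c].astype('category').cat.as_ordered()
for c in contin_vars:
    df[c] = df[c].astype('float32')
```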
I try to do as much of my work as possible on a small sample of the data 00:34:38.000 |
When I'm working with images, that generally means resizing the images to like 64 by 64 or 128 by 128 00:34:45.320 |
We can't do that with structured data. So instead I tend to take a sample. So I randomly pick a few rows 00:34:53.080 |
So I start running with a sample and I can use exactly the same thing that we've seen before 00:34:57.920 |
For getting a validation set we can use the same way to get some random 00:35:02.460 |
Random row numbers to use in a random sample. Okay, so this is just a bunch of random numbers 00:35:09.280 |
And then okay, so that's going to be a size 150,000 rather than 840,000 00:35:20.840 |
And so my data that before I go any further it basically looks like this. You can see I've got some booleans here 00:35:29.880 |
Integers here of various different scales. There's my year 2014 00:35:35.240 |
And I've got some letters here. So even though I said to treat these as categories 00:35:42.880 |
pandas still displays them in the notebook as strings, right? 00:35:51.440 |
So then the fastai library has a special little function called proc_df (process data frame), and 00:35:57.440 |
proc_df takes a data frame and you tell it what's my dependent variable 00:36:05.720 |
The first thing is it pulls out that dependent variable and puts it into a separate variable 00:36:10.620 |
Okay, and deletes it from the original data frame 00:36:13.800 |
So df now does not have the sales column in it, whereas y just contains the sales column 00:36:19.880 |
Something else that it does is scaling. Neural nets 00:36:27.040 |
really like to have the input data all be somewhere around zero with a standard deviation of somewhere around one 00:36:34.920 |
all right, so we can always take our data and 00:36:37.200 |
Subtract the mean and divide by the standard deviation to make that happen 00:36:43.080 |
So that's what do scale equals true does and it actually returns a special object 00:36:47.480 |
Which keeps track of what mean and standard deviation did it use for that normalizing? 00:36:52.560 |
So you can then do the same thing to the test set later 00:37:01.040 |
It also handles missing values: in categorical variables, missing just becomes ID 0 and then all the other categories become 1, 2, 3, 4, 5, and for continuous variables it adds an extra column 00:37:20.400 |
that's a boolean and just says is this missing or not. And I'm going to skip over this pretty quickly, because we talk about this 00:37:26.120 |
In detail in the machine learning course, okay, so if you've got any questions about this part 00:37:30.800 |
That would be a good place to go. It's nothing deep learning specific there 00:37:39.800 |
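For reference, a hedged sketch of the proc_df call being described — the exact return values shown here follow the fastai version used in the course, and `joined_samp` is assumed to be the name of the sampled data frame:

```python
# Pulls out the dependent variable ('Sales'), numericalises the categories,
# handles missing values, and (with do_scale=True) normalises the continuous
# columns, returning the scaling mapper so the test set can be treated the
# same way later.
df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
```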
The year 2014, for example, has become 2, okay, because these categorical variables have all been replaced with contiguous integer codes 00:37:48.440 |
Right and the reason for that is later on we're going to be putting them into a matrix 00:37:53.680 |
Right and so we wouldn't want the matrix to be 2014 rows long when it could just be 2 rows long 00:37:59.440 |
so that's the basic idea there, and you'll see that the 00:38:05.120 |
'a' and 'c' values, for example, have been replaced in the same way with 1 and 3. So we end up with a data frame 00:38:14.040 |
which does not contain the dependent variable and where everything is a number, okay? 00:38:18.880 |
And so that's the that's where we need to get to to do deep learning and all of the stage above that 00:38:24.160 |
As I said we talk about in detail in the machine learning course nothing deep learning specific about any of it 00:38:29.860 |
This is exactly what we throw into our random forests as well, so 00:38:36.280 |
Thing we talk about a lot in the machine learning course of course is validation sets 00:38:40.200 |
In this case we need to predict the next two weeks of sales 00:38:45.800 |
Right, it's not like we pick a random set of sales — we have to predict the next two weeks of sales. That was what the Kaggle competition required 00:38:55.960 |
And therefore I'm going to create a validation set which is the last two weeks of 00:39:02.320 |
my training set right to try and make it as similar to the test set as possible and 00:39:06.640 |
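A hedged sketch of one way to build that kind of time-based split (here `dates` is assumed to be a pandas Series holding the original Date column, kept aside before processing; the two-week cut-off is just illustrative):

```python
import numpy as np
import pandas as pd

# Use the final stretch of the training period as the validation set,
# so it resembles the test period as closely as possible.
cutoff = dates.max() - pd.Timedelta(weeks=2)   # hypothetical two-week cut-off
val_idx = np.flatnonzero(dates > cutoff)       # row indices to use for validation
```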
We just posted actually Rachel wrote this thing last week about 00:39:11.480 |
Creating validation sets, so if you go to fast.ai you can check it out 00:39:18.840 |
But it's basically a summary of a recent machine learning lesson that we did 00:39:25.180 |
The videos are available for that as well, and this is kind of a written a written summary of it 00:39:37.160 |
So Rachel has been a lot of time thinking about kind of you know 00:39:39.760 |
How do you need to think about validation sets and training sets and test sets and so forth and that's all there? 00:39:45.480 |
But again, nothing deep learning specific, so let's get straight to the the deep learning action, okay? 00:39:51.400 |
so in this particular competition as always with any competition or any kind of 00:39:59.920 |
Machine learning project you really need to make sure you have a strong understanding of your metric 00:40:05.280 |
How are you going to be judged here and in this case? 00:40:08.400 |
You know, Kaggle makes it easy — they tell us how we're going to be judged, and so we're going to be judged on the root mean squared percent error 00:40:15.920 |
Right so we're going to say like oh you predicted three 00:40:19.180 |
It was actually three point three so you were ten percent out 00:40:24.520 |
And then we're going to average all those percents right and remember. I warned you that 00:40:30.480 |
you are going to need to make sure you know logarithms really well, right. And so in this case, the formula 00:40:37.920 |
is basically saying: take the difference between your prediction and the actual, divided by the actual; square that, take the mean, and take the square root. Now there isn't a 00:40:52.160 |
metric in PyTorch called root mean squared percent error 00:40:54.880 |
We could actually easily create it by the way if you look at the source code 00:41:00.480 |
you'll see it's, you know, a line of code. But easier still would be to realize that if you took the log of (a divided by b) 00:41:09.240 |
right, then you could replace that division — 00:41:17.960 |
you can replace that whole thing with a subtraction: log(a/b) = log(a) - log(b) 00:41:22.040 |
That's just the rule of logs right and so if you don't know that rule 00:41:28.520 |
Then you know make sure you go look it up because it's super helpful 00:41:31.200 |
But it means in this case all we need to do is to take the log of our sales at the start of the 00:41:41.200 |
notebook, and when you take the log of the data, getting the root mean squared error 00:41:46.120 |
will actually get you the root mean squared percent error for free, okay? 00:41:50.720 |
But then when we want to print out our root mean squared percent error 00:41:55.440 |
we actually have to go e to the power of it 00:41:58.640 |
Again, right and then we can actually return the percent difference, so that's all that's going on here 00:42:04.760 |
Again, it's not really deep learning specific at all 00:42:06.960 |
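As a hedged sketch, a metric along those lines — exponentiate the log-space values, then compute the percentage error — might look like this (treat the function name as illustrative):

```python
import numpy as np

def exp_rmspe(y_pred, targ):
    """Root mean squared percentage error, for predictions made in log space."""
    targ = np.exp(targ)                          # back from log sales to actual sales
    pct_var = (targ - np.exp(y_pred)) / targ     # percentage error per row
    return np.sqrt((pct_var ** 2).mean())
```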
So here we finally get to the deep learning alright, so as per usual like you'll see everything 00:42:15.840 |
we look at today looks exactly the same as everything we've looked at so far, which is: first we create a model data object 00:42:25.400 |
that has a validation set, a training set, and an optional test set built into it. From that we will get a learner; we will then 00:42:32.000 |
optionally call learner.lr_find, and we'll then call learner.fit 00:42:37.600 |
It'll be all the same parameters and everything that you've seen many times before okay 00:42:42.320 |
So the difference though is obviously we're not going to go 00:42:45.560 |
ImageClassifierData.from_csv or .from_paths; we need to get some different kind of model data 00:42:58.960 |
Okay, but this will return an object with basically the same API that you're familiar with and rather than from paths 00:43:06.720 |
or from CSV, this is from_data_frame, okay. So this gets passed a few things 00:43:12.520 |
The path here is just used for it to know where should it store? 00:43:17.600 |
Like model files or stuff like that right this is just basically saying where do you want to store anything that you save later? 00:43:24.160 |
This is the list of the indexes of the rows that we want to put in the validation set we created earlier 00:43:38.040 |
Let's have a look here's this is where we did the log right so I talked the 00:43:42.680 |
The y that came out of proc_df — our dependent variable — I logged it and I call that yl 00:43:50.680 |
When we create our model data we need to tell it that's our dependent variable 00:43:54.200 |
So so far we've got list of the stuff to go in the validation set which is what's our independent variables? 00:44:00.800 |
What's our dependent variables and then we have to tell it which things do we want treated as categorical right? 00:44:14.600 |
because, remember, everything is a number now — otherwise it could treat the whole thing as if it's continuous, which would just be totally meaningless 00:44:14.600 |
Right, so we need to tell it which things we want to treat as categories, and so here we just pass in the list of categorical variable names 00:44:20.260 |
okay, and then a bunch of the parameters are the same as the ones you're used to for example you can set the batch size 00:44:48.400 |
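A hedged sketch of that call, under the assumption that the argument names match the fastai version used in the course (`df` is the processed data frame, `yl` the log of sales, `val_idx` the validation row indices from earlier, and the batch size here is just illustrative):

```python
# Build the model data object from a data frame rather than from image paths.
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl,
                                       cat_flds=cat_vars, bs=128)
```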
So it has a trn_dl attribute, there's a val_dl attribute, a trn_ds attribute, a val_ds attribute; it's got a length 00:44:48.400 |
Okay, so now we need to create the the model or create the learner and so to skip ahead a little bit 00:45:10.280 |
We're basically going to pass in something that looks pretty familiar 00:45:15.120 |
We're going to be passing saying from our model from our model data 00:45:21.560 |
And we'll basically be passing in a few other bits of information which will include 00:45:29.920 |
how many activations to have in each layer, and how much dropout to use at the later layers 00:45:29.920 |
But then there are a couple of extra things that we need to learn about, and specifically it's this thing called embeddings 00:45:38.120 |
So this is really the key new concept we have to learn about all right, so 00:45:55.960 |
All we're doing basically is we're going to take our 00:45:59.520 |
Let's forget about categorical variables for a moment and just think about the continuous variables 00:46:05.920 |
For our continuous variables all we're going to do 00:46:12.760 |
Okay, so for our continuous variables, we're basically going to say like okay, here's a 00:46:22.520 |
big list of all of our continuous variables like the minimum temperature and 00:46:26.600 |
maximum temperature and the distance to the nearest competitor and so forth right and so here's just a bunch of 00:46:33.480 |
floating point numbers, and so basically what the neural net's going to do is take that 1D array — or rank 1 tensor, or vector; they 00:46:33.480 |
all mean the same thing. Okay, so we're going to take our rank 1 tensor 00:46:48.200 |
And let's put it through a matrix multiplication, so let's say this has got like I don't know 20 00:46:57.880 |
continuous variables, and then we can put it through a matrix which 00:47:03.160 |
must have 20 rows — that's how matrix multiplication works — and then we can decide how many columns we want, right 00:47:03.160 |
So maybe we decide on 100, right, and so that matrix multiplication is going to spit out a new vector of 100 activations 00:47:10.400 |
Okay, that's what a matrix product does, and that's the definition of a linear layer 00:47:20.800 |
Okay, and so then the next thing we do is we can put that through a relu right which means we throw away the negatives 00:47:37.200 |
Okay, and now we can put that through another matrix product. Okay, so this is going to have to have a hundred rows by definition 00:47:45.100 |
And we can have as many columns as we like and so let's say maybe this was 00:47:50.840 |
the last layer. So the next thing we're trying to do is to predict sales — a single 00:47:57.720 |
value — so we could put it through a 00:48:00.400 |
Matrix product that just had one column and that's going to spit out a single number 00:48:11.440 |
and that's a simple fully connected neural net, if you like. Now in practice, you know, we wouldn't make it one layer 00:48:20.440 |
You know, maybe we'd have 50 here and so then that gives us a 50 long vector and 00:48:38.760 |
then one more matrix multiply spits out a single number. And one reason I wanted to mention that is to point out that, you know, a ReLU 00:48:48.240 |
there would be a bad idea — like, you'd never want to throw away the negatives right before the output, because the softmax 00:48:57.320 |
needs negatives in it, because it's the negatives that are the things that allow it to create low probabilities 00:49:02.820 |
That's a minor detail, but it's useful to remember. Okay, so basically a 00:49:18.240 |
fully connected neural net is something that takes in as an input a rank one tensor and spits out a rank one tensor 00:49:41.960 |
Okay, and so we could obviously decide to add more 00:49:46.920 |
Linear layers we could decide maybe to add dropout 00:49:51.000 |
Right. So these are some of the decisions that we we get to make right but we there's not that much we can do 00:49:58.800 |
Right. There's not much really crazy architecture stuff to do. So when we come back to computer vision 00:50:06.520 |
We're going to learn about all the weird things that go on and like res nets and inception networks and blah blah blah 00:50:12.100 |
But these fully connected networks are really pretty simple — they're just linear layers interspersed with 00:50:19.580 |
activation functions like ReLU, and a softmax at the end 00:50:24.680 |
And if it's not classification which actually ours is not classification in this case. We're trying to predict sales 00:50:34.420 |
Right, we don't want it to be between 0 and 1 00:50:37.780 |
Okay, so we can just throw away that last activation function altogether 00:50:41.580 |
If we have time we can talk about a slight trick we can do there but for now we can think of it that way 00:50:48.940 |
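To make that concrete, here is a minimal hedged sketch of the fully connected net just described, for 20 continuous inputs:

```python
import torch.nn as nn

# 20 continuous inputs -> 100 activations -> ReLU -> 50 activations -> ReLU
# -> 1 output (the predicted sales). No softmax at the end, because this is
# regression rather than classification.
net = nn.Sequential(
    nn.Linear(20, 100),
    nn.ReLU(),
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 1),
)
```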
So that was all assuming that everything was continuous, right? But what about categorical, right? So we've got, like, day of week, which 00:51:04.500 |
we're going to treat as categorical, right? So it's like Saturday, Sunday, Monday 00:51:16.760 |
Okay, how do we feed that in? Because I want to find a way of getting that in so that we still end up with a vector of floating point numbers 00:51:24.940 |
So the trick is this: we create a new little matrix with a row for each level of the category — eight rows for day of week, say — 00:51:33.700 |
and as many columns as we choose, right — so let's pick four. Right, so here's our little matrix 00:51:46.900 |
Right and basically what we do is let's add our categorical variables to the end. So let's say the first row was Sunday 00:51:55.380 |
Right, then what we do is we do a lookup into this matrix and we say oh, here's Sunday — it's 00:52:04.340 |
this row, and so this matrix we basically fill with floating point numbers. So we're going to end up grabbing a 00:52:14.020 |
subset of four floating point numbers — Sunday's particular four floating point numbers. So Sunday gets converted 00:52:25.740 |
into a rank one tensor of four floating point numbers, and initially those four numbers are random 00:52:33.080 |
Right, and in fact this whole matrix we initially start out random 00:52:40.020 |
But then we're going to put that through our neural net, right? 00:52:44.260 |
So we basically then take those four numbers and we remove Sunday instead we add 00:52:49.660 |
Our four numbers on here, right? So we've turned our categorical thing into a floating point vector 00:52:56.360 |
Right and so now we can just put that through our neural net 00:53:00.100 |
just like before and at the very end we find out the loss and 00:53:04.300 |
then we can figure out which direction is down and 00:53:08.180 |
Do gradient descent in that direction and eventually that will find its way back 00:53:12.940 |
To this little list of four numbers and it'll say okay those random numbers weren't very good 00:53:18.620 |
This one needs to go up a bit that one needs to go up a bit that one needs to go down a bit 00:53:22.660 |
That one needs to go up a bit and so we'll actually update 00:53:25.260 |
our original those four numbers in that matrix and 00:53:31.780 |
And so this this matrix will stop looking random and it will start looking more and more like like 00:53:37.660 |
The exact four numbers that happen to work best for Sunday the exact four numbers that happen to work best for Friday and so forth 00:53:45.700 |
And so in other words this matrix is just another bunch of weights 00:53:53.780 |
All right, and so matrices of this type are called embedding matrices 00:54:00.540 |
So an embedding matrix is something where we start out with an 00:54:10.100 |
integer between zero and the maximum number of levels of that category 00:54:15.420 |
We literally index into a matrix to find our particular row 00:54:20.460 |
So if the level was one, we take the first row 00:54:27.340 |
we append it to all of our continuous variables, and from then on it's just one long 00:54:35.020 |
vector of numbers. And we can do the same thing for, let's say, zip code 00:54:39.540 |
Right, so we could like have an embedding matrix. Let's say there are 5,000 zip codes 00:54:45.260 |
It would be 5,000 rows long and as wide as we decide — maybe it's 50 wide — and so we'd say okay, here's 00:54:54.860 |
That zip code is index number four in our matrix 00:54:58.560 |
so we go down and we find the fourth row, grab those 50 numbers, and append those 00:55:03.900 |
onto our big vector, and then everything after that is just the same — we just put it through a linear layer, ReLU, linear layer, whatever 00:55:15.180 |
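A hedged sketch of that lookup-and-append step in PyTorch, using the day-of-week sizes from the example above:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(8, 4)            # 8 levels of day of week (incl. unknown), 4 columns
day_of_week = torch.tensor([0])     # e.g. Sunday encoded as index 0
cont = torch.randn(1, 20)           # the 20 continuous variables for this row

day_vec = emb(day_of_week)          # look up Sunday's 4 (initially random) numbers
x = torch.cat([day_vec, cont], dim=1)   # one long vector: 4 + 20 = 24 numbers
# x then goes through the usual linear / ReLU layers, and the embedding
# weights get updated by gradient descent just like any other weights.
```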
So what do those four numbers represent? That's a great question, and we'll learn more about that when we look at collaborative filtering. For now 00:55:21.860 |
They represent no more or no less than any other parameter in our neural net, you know, they're just 00:55:28.900 |
They're just parameters that we're learning that happen to end up giving us good predictions 00:55:35.780 |
We will discover later that these particular parameters, however, often 00:55:39.260 |
are human interpretable and can be quite interesting, but that's a side effect — it's not 00:55:45.660 |
Fundamental they're just four random numbers for now that we're that we're learning or sets of four random numbers 00:55:52.940 |
Do you have a good heuristic for the dimensionality of the embedding matrix — so why four here? 00:56:10.940 |
What I first of all did was I made a little list of every categorical variable and its cardinality 00:56:17.460 |
Okay, so there are a thousand one hundred-odd different stores, and day of week is eight 00:56:28.740 |
that's because there are seven days of the week plus one left over for unknown 00:56:32.700 |
Even if there were no missing values in the original data 00:56:36.060 |
I always still set aside one just in case there's a missing or an unknown or something different in the test set 00:56:41.900 |
Again, four for year — there are actually three years plus room for an unknown — and so forth. All right, so what I do is 00:57:04.700 |
define my embedding matrices. So my store matrix — that has to have a 00:57:10.140 |
thousand one hundred and sixteen rows, because I need to look up a row to find, say, store number three, and then it's going to return back a row of that matrix 00:57:21.940 |
For day of week, it's going to look up which one of the eight rows and return the thing of length four 00:57:28.400 |
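A hedged sketch of how those widths can be computed from the cardinalities. The min(50, half-the-cardinality) rule shown here is the heuristic I believe the course notebook uses — treat it as one reasonable default rather than the only choice; `joined_samp` and `cat_vars` are assumed from the earlier steps:

```python
# cat_sz: list of (column name, cardinality incl. unknown), e.g. ('DayOfWeek', 8)
cat_sz = [(c, len(joined_samp[c].cat.categories) + 1) for c in cat_vars]

# Embedding width: half the cardinality (rounded up), capped at 50.
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]
# e.g. Store -> (1116, 50), DayOfWeek -> (8, 4)
```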
So would you typically build an embedding matrix for each categorical feature? Yes. Yeah, so that's what I've done here 00:57:57.380 |
And then you may have noticed that that's actually the first thing that we pass to get_learner 00:58:03.260 |
And so that tells it for every categorical variable. That's the embedding matrix to use for that variable 00:58:17.060 |
Besides random initialization, are there other ways to actually initialize embeddings? 00:58:21.000 |
Yes — there are two ways: one is random, the other is pre-trained, and 00:58:28.460 |
We'll probably talk about pre-trained more later in the course 00:58:32.060 |
But the basic idea though is if somebody else at Rossman had already trained a neural net 00:58:36.280 |
just like you you would use a pre-trained net from image net to look at pictures of cats and dogs if 00:58:42.300 |
Somebody else has pre-trained a network to predict cheese sales in Rossman 00:58:47.200 |
You may as well start with their embedding matrix of stores to predict liquor sales in Rossman 00:58:55.280 |
At Pinterest and Instacart they both use this technique Instacart uses it for routing their shoppers 00:59:03.200 |
Pinterest uses it for deciding what to display on a web page when you go there, and they have embedding matrices — 00:59:12.000 |
in Instacart's case, of stores — that get shared in the organization, so people don't have to train new ones 00:59:25.760 |
Why wouldn't you just use, like, the one-hot encoding scheme 00:59:34.280 |
instead? That's been a lot of the questions. So we could easily, as you point out, have, 00:59:41.600 |
Instead of passing in these four numbers. We could instead have passed in seven numbers 00:59:47.640 |
all zeros but one of them a one, and that also is a list of floats — 00:59:58.960 |
and generally speaking, that's how categorical variables have been used in statistics for many years; it's called dummy variable coding 01:00:12.520 |
The problem is that then Sunday could only ever be associated with a single floating-point number 01:00:16.840 |
Right, and so it basically gets this kind of linear behavior. It says like Sunday is more or less of a single thing 01:00:25.960 |
Yeah, well, it's not just interactions. It's saying like now Sunday is a concept in four-dimensional space 01:00:32.200 |
Right. And so what we tend to find happen is that these 01:00:37.440 |
Embedding vectors tend to get these kind of rich semantic concepts. So for example 01:00:51.080 |
You'll tend to see that Saturday and Sunday will have like some particular number higher or more likely 01:00:57.320 |
it turns out that certain days of the week are associated with higher sales of 01:01:07.000 |
Certain kinds of goods that you kind of can't go without I don't know like gas or milk say 01:01:19.240 |
and goods like wine that tend to be associated with the days before weekends or holidays, right? So there might be kind of a column capturing that 01:01:19.240 |
You know, so basically, yeah, by having this higher-dimensionality vector rather than just a single number 01:01:37.800 |
it gives the neural net a chance to learn these rich representations. And so this idea of an embedding is actually what's called a distributed representation 01:01:50.480 |
It's kind of the most fundamental concept of neural networks 01:02:02.440 |
It's this idea that a concept in a neural network has a kind of a high dimensional 01:02:08.960 |
Representation and often it can be hard to interpret because the idea is like each of these 01:02:14.720 |
Numbers in this vector doesn't even have to have just one meaning 01:02:19.200 |
It could mean one thing if this is low and that one's high and something else if that one's high and that one's low 01:02:23.640 |
Because it's going through this kind of rich nonlinear function 01:02:30.920 |
It's this rich representation that allows it to learn such interesting relationships 01:02:40.520 |
Oh, another question — sure, I'll speak louder. So — 01:02:46.200 |
I get the fundamentals, like the word2vec vector algebra 01:02:55.040 |
you can run on these, but are embeddings only suitable for certain types of variables? 01:03:03.800 |
Are there particular kinds of categories that embeddings are suitable for? An embedding is suitable for any categorical variable 01:03:11.120 |
Okay, so the only thing it can't really work 01:03:16.120 |
well at all for would be something that has too high a cardinality 01:03:19.880 |
So, in other words, we had — whatever it was — 600,000 rows; if you had a variable with 600,000 levels, that's not going to be a useful 01:03:30.360 |
categorical variable — you could bucketize it, I guess 01:03:33.660 |
But yeah in general like you can see here that the third place getters in this competition 01:03:39.560 |
Really decided that everything that was not too high cardinality 01:03:45.880 |
They put them all as categorical variables and I think that's a good rule of thumb 01:03:49.320 |
You know if you can make it a categorical variable you may as well because that way it can learn this rich distributed representation 01:03:57.080 |
Or else if you leave it as continuous, you know, the most it can do is to kind of try and find a 01:04:02.520 |
You know a single functional form that fits it well 01:04:09.080 |
You were saying that you are kind of increasing the dimension 01:04:12.960 |
But actually in most cases we would use a one-hot encoding, which has an even bigger dimension 01:04:23.240 |
So really we're reducing it, but to the most rich representation. I think that's fair. Yeah, yeah, it's like 01:04:28.240 |
Yes, you know you can think of it as one hot encoding which actually is high dimensional, but it's not 01:04:34.800 |
Meaningfully high dimensional because everything except one is zero 01:04:38.200 |
I'm saying that also because even this will reduce the amount of memory and things like this that you have to write 01:04:43.680 |
This is better. You're absolutely right. Absolutely, right? 01:04:46.760 |
And and so we may as well go ahead and actually describe like what's going on with the matrix algebra behind the scenes 01:04:52.920 |
It this if this doesn't quite make sense you can kind of skip over it 01:04:56.600 |
But for some people I know this really helps if we started out with something saying this is Sunday 01:05:05.320 |
we could represent this as a one hot encoded vector right and so 01:05:09.640 |
Sunday, you know, maybe was positioned here. So that would be a one and then the rest of zeros 01:05:22.360 |
Embedding matrix right with eight rows and in this case four columns 01:05:28.540 |
One way to think of this actually is a matrix product 01:05:35.840 |
Right, so I said you could think of this as like looking up the number one, you know and finding like its index in the array 01:05:48.000 |
identical to doing a matrix product between a one hot encoded vector and 01:05:53.080 |
The embedding matrix like you're going to go zero times this row one times this row zero times this row 01:06:02.040 |
And so it's like a one hot embedding matrix product is identical 01:06:09.680 |
Some people in the bad old days actually implemented embedding 01:06:16.200 |
Matrices by doing a one hot encoding and then a matrix product — and in fact a lot of machine learning software did 01:06:24.560 |
But as Yannet was kind of alluding to, that's terribly inefficient. So all of the modern 01:06:31.660 |
Libraries implement this as: take an integer and do a lookup into an array 01:06:37.040 |
But the nice thing about realizing that it's actually a matrix product 01:06:43.320 |
How the gradients are going to flow so when we do stochastic gradient descent, it's we can think of it as just another 01:06:50.060 |
Linear layer. Okay, so that's, I'd say, a somewhat minor detail, but hopefully for some of you it helps 01:06:56.680 |
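To make the equivalence concrete, here is a minimal PyTorch sketch (the names and sizes are illustrative, not from the lesson notebook): multiplying a one-hot vector by the embedding matrix gives exactly the same row as indexing into it, which is why the lookup can be treated as just another linear layer when the gradients flow back.

```python
import torch

vocab_size, emb_dim = 8, 4               # e.g. 8 categories, 4-dimensional embeddings
emb = torch.randn(vocab_size, emb_dim)   # the embedding matrix: 8 rows by 4 columns

idx = 6                                  # say index 6 stands for Sunday
one_hot = torch.zeros(vocab_size)
one_hot[idx] = 1.0

via_matmul = one_hot @ emb               # the slow way: one-hot vector times the matrix
via_lookup = emb[idx]                    # the fast way: just read out row 6

assert torch.allclose(via_matmul, via_lookup)   # identical results, identical gradients
```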
Could you touch on using dates and times as categoricals how that affects seasonality? Yeah, absolutely. That's a great question 01:07:13.680 |
So I covered dates in a lot of detail in the machine learning course, but it's worth briefly mentioning here 01:07:19.120 |
There's a fastai function called add_datepart 01:07:33.920 |
Unless you've got drop equals false 01:07:37.800 |
It removes the column from the data frame and replaces it with lots of columns 01:07:43.560 |
representing all of the useful information about that date like 01:07:47.680 |
Day of week, day of month, month of year, year, is it the start of a quarter 01:07:52.600 |
Is it the end of a quarter — basically everything that pandas can pull out of a date 01:08:00.200 |
When we look at our list of features where you can see them here, right? 01:08:05.840 |
Year, month, week, day, day of week, etc. So these all get created for us by add_datepart 01:08:05.840 |
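As a rough illustration of the kind of columns this produces, here is a pandas-only sketch (hedged: the real fastai helper adds more fields than this and handles the drop behaviour for you):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-08-01'])})
for attr in ['Year', 'Month', 'Day', 'Dayofweek', 'Dayofyear',
             'Is_quarter_start', 'Is_quarter_end']:
    df['Date' + attr] = getattr(df['Date'].dt, attr.lower())
df = df.drop('Date', axis=1)   # mirrors the default drop behaviour described above
print(df.T)
```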
Day of week, for example, then gets an embedding matrix — so I guess an eight-row by four-column embedding matrix for day of week — and 01:08:20.720 |
Conceptually that allows our model to create some pretty interesting time series models 01:08:34.920 |
Right like it can if there's something that has a 01:08:40.840 |
That kind of goes up on Mondays and down on Wednesdays, but only for dairy and only in Berlin 01:08:47.040 |
It can totally do that, because it has all the information it needs 01:08:53.320 |
So this turns out to be a really fantastic way to deal with time series 01:08:57.960 |
So I'm really glad you asked the question you just need to make sure that 01:09:02.560 |
That the the cycle indicator in your time series exists as a column 01:09:07.800 |
So if you didn't have a column there called day of week 01:09:11.280 |
it would be very very difficult for the neural network to somehow learn to do like a 01:09:17.060 |
Divide mod 7 and then somehow look that up in an embedding matrix 01:09:20.960 |
I get not impossible, but really hard would use lots of computation wouldn't do it very well 01:09:26.720 |
So an example of the kind of thing that you need to think about might be 01:09:32.360 |
Holidays, for example. Or if you were doing something with, you know, sales in San Francisco 01:09:41.840 |
You'd probably want a list of, like, when is the ball game on at AT&T Park 01:09:47.120 |
All right, because that's going to impact how many people are drinking beer in SoMa 01:09:51.660 |
all right, so you need to make sure that the kind of the basic indicators or 01:09:57.120 |
Periodicities or whatever are there in your data and as long as they are the neural nets going to learn to use them 01:10:03.200 |
So I'm kind of trying to skip over some of the non-deep learning parts 01:10:13.320 |
The key thing here is that we've got our model data that came from the data frame 01:10:17.560 |
We tell it how big to make the embedding matrices 01:10:21.260 |
We also have to tell it of the columns in that data frame 01:10:29.360 |
Which of them are categorical variables versus how many are continuous variables. So the actual parameter is the number of continuous variables 01:10:36.560 |
So you can hear you can see we just pass in how many columns are there minus how many categorical variables are there? 01:10:45.120 |
The neural net knows how to create something that puts the continuous variables over here and the categorical variables over there 01:10:57.680 |
All right. So this is the dropout applied to the embedding matrix 01:11:01.280 |
This is the number of activations in the first linear layer the number of activations in the second linear layer 01:11:07.280 |
The dropout in the first linear layer the dropout for the second linear layer 01:11:11.840 |
This bit we won't worry about for now and then finally is how many outputs do we want to create? 01:11:16.880 |
Okay, so this is the output of the last linear layer and obviously it's one because we want to predict a single number 01:11:26.680 |
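As a hedged sketch of the call being described, here is roughly what the structured-data learner looks like in the 2018-era fastai Rossmann notebook as I recall it; the argument names and order, and the placeholder values (md, emb_szs, df, cat_vars, y_range come from earlier in the notebook), should all be treated as assumptions to check against your version's ColumnarModelData source.

```python
# Hedged sketch -- argument order and names as I recall them, not guaranteed:
m = md.get_learner(emb_szs,                          # [(cardinality, emb_size), ...] per categorical
                   len(df.columns) - len(cat_vars),  # how many continuous columns there are
                   0.04,                             # dropout applied to the embedding outputs
                   1,                                # size of the final output: a single number (sales)
                   [1000, 500],                      # activations in the first and second linear layers
                   [0.001, 0.01],                    # dropout after the first and second linear layers
                   y_range=y_range)                  # the bit we're not worrying about for now
```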
So after that we now have a learner where we can call lr_find, and we get the standard looking shape, and we can say 01:11:26.680 |
Start training using exactly the same API. We've seen before 01:11:46.920 |
You can pass in I'm not sure if you've seen this before 01:11:51.000 |
Custom metrics what this does is it just says please print out a number at the end of every epoch by calling 01:11:57.560 |
this function and this is a function we defined a little bit earlier, which was the 01:12:02.120 |
Root mean squared percentage error — first of all taking e to the power of our 01:12:02.120 |
Sales, because our sales were originally logged. So this doesn't change the training at all 01:12:07.320 |
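A hedged reconstruction of what that metric function looks like (the name exp_rmspe is from memory of the notebook, so treat it as illustrative): predictions and targets are log(sales), so both are exponentiated before computing the root mean squared percentage error, and the function is simply passed in the metrics list at fit time.

```python
import math
import numpy as np

def exp_rmspe(y_pred, targ):
    targ, y_pred = np.exp(targ), np.exp(y_pred)   # undo the log transform on both
    pct_var = (targ - y_pred) / targ              # percentage error per row
    return math.sqrt((pct_var ** 2).mean())       # root mean squared percentage error

# printed at the end of every epoch, without affecting training:
# m.fit(lr, 3, metrics=[exp_rmspe])
```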
And you know, we've got some benefits that the original people that built this don't have specifically we've got things like 01:12:29.280 |
Cyclical learning rates, stochastic gradient descent with restarts. And so it's actually interesting to have a look and compare 01:12:29.280 |
Although our validation set isn't identical to the test set it's very similar 01:12:45.720 |
It's a two-week period that is at the end of the training data 01:12:49.880 |
so our numbers should be similar and if we look at what we get 0.097 and compare that to the 01:13:07.520 |
Let's have a look in the top actually that's interesting 01:13:13.960 |
There's a big difference between the public and private leaderboard it would have 01:13:19.960 |
Would have been right at the top of the private leaderboard 01:13:22.280 |
But only in the top 30 or 40 on the public leaderboard. So not quite sure but you can see like we're certainly in 01:13:33.200 |
actually tried running the third place getters code and 01:13:38.120 |
Their final result was over 0.1. So I actually think that we should be compared to the private leaderboard 01:13:48.840 |
So anyway, so you can see there basically there's a technique for dealing with time series and 01:13:55.600 |
Structured data and you know, interestingly the group that that used this technique. They actually wrote a paper about it. That's linked in this notebook 01:14:04.640 |
When you compare it to the folks that won this competition and came second 01:14:11.560 |
They did the other folks did way more feature engineering like the winners of this competition were actually 01:14:19.000 |
subject matter experts in logistics sales forecasting and so they had their own like code to create lots and lots of features and 01:14:27.400 |
Talking to the folks at Pinterest who built their very similar model for recommendations of Pinterest 01:14:33.400 |
They said the same thing which is that when they switched from gradient boosting machines to deep learning 01:14:41.760 |
Feature engineering it was a much much simpler model and requires much less maintenance 01:14:48.440 |
And so this is like one of the big benefits of using this approach to deep learning. We can get state-of-the-art results 01:15:00.080 |
Are we using any time series in any of these fits 01:15:10.280 |
Absolutely using what we just saw we have day of week month of year all that stuff columns 01:15:17.200 |
And most of them are being treated as categories. So we're building a distributed representation of January 01:15:23.040 |
We're building a distributed representation of Sunday. We're building a distributed representation of Christmas. So we're not using any 01:15:30.720 |
Classic time series techniques all we're doing is 01:15:42.960 |
Exactly. Exactly. Yes. So the embedding matrix is able to deal with this stuff like 01:15:48.400 |
Day of week periodicity and so forth in a way 01:15:55.800 |
Standard time series technique I've ever come across 01:16:01.280 |
The metrics — in the earlier models, when we did the CNN, we did not pass them during the fit 01:16:15.120 |
Anything to fit just the learning rate and the number of cycles 01:16:19.000 |
In this case we're passing in metrics because we want to print out some extra stuff 01:16:22.800 |
There is a difference in that we're calling data dot get_learner. So with 01:16:34.600 |
The convnets we just went learner dot pretrained and passed it the data 01:16:40.680 |
In for these kinds of models in fact for a lot of the models the model that we build 01:16:46.680 |
Depends on the data in this case. We actually need to know like 01:16:53.400 |
And stuff like that. So in this case, it's actually the data object that creates the learner 01:16:59.200 |
So yeah, it is it is a bit upside down to what we've seen before 01:17:09.920 |
So in this case what we are doing is that we have some kind of a structured data 01:17:18.400 |
We've got some columns in a database or some things in a pandas data frame 01:17:25.320 |
Yeah, a data frame — and then we are mapping it to deep learning by using this 01:17:33.060 |
Embedding matrix for the categorical variables. So the continuous we just put them straight in 01:17:38.580 |
So all I need to do is like if I have a if I have already have a feature engineering model 01:17:46.100 |
Yeah, then to map it to deep learning. I just have to figure out which one I can move in to categorical and then 01:17:52.560 |
Yeah, great question. So yes, exactly if you want to use this on your own data set 01:17:59.900 |
Step one is: list the categorical variable names, list the continuous variable names 01:18:12.580 |
Step two is to create a list of which row indexes you want in your validation set 01:18:21.980 |
Step three is to call this line of code using these exact — like, you can just copy and paste it 01:18:29.600 |
step four is to create your list of how big you want each embedding matrix to be and 01:18:39.560 |
You can use these exact parameters to start with 01:18:42.880 |
And if it over fits or under fits you can fiddle with them and then the final step is to call 01:18:49.320 |
Fit so yeah, almost all of this code will be nearly identical 01:19:02.540 |
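A hedged sketch of those steps as code — the column names below are made up for illustration, and the fastai calls mirror the Rossmann notebook as I recall it, so treat the names and argument order as assumptions to verify:

```python
cat_vars  = ['Store', 'DayOfWeek', 'StateHoliday']                  # step 1: categorical column names
cont_vars = ['CompetitionDistance', 'Temperature']                  # step 1: continuous column names

val_idx = list(range(len(df) - 2000, len(df)))                      # step 2: row indexes to hold out

md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y,        # step 3: build the model data object
                                       cat_flds=cat_vars, bs=128)

cat_sz  = [(c, df[c].nunique() + 1) for c in cat_vars]              # step 4: embedding size per categorical,
emb_szs = [(c, min(50, (c + 1) // 2)) for _, c in cat_sz]           #   using the lesson's rule of thumb

# steps 5 and 6 are then the md.get_learner(...) and m.fit(...) calls shown earlier
```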
How can data augmentation be used in this case? And the second one is 01:19:09.340 |
What are the dropouts doing in here? Okay, so data augmentation — I have no idea. I mean, that's a really interesting question. I 01:19:21.300 |
Think it's got to be domain specific. I've never seen any paper or anybody in industry doing data augmentation with structured data and deep learning 01:19:28.220 |
So I don't I think it can be done. I just haven't seen it done 01:19:42.380 |
The output of each of these linear layers is just a 01:19:49.380 |
Rank one tensor and so dropout is going to go ahead and say let's throw away half of the activations 01:19:56.420 |
And the very first dropout, embedding dropout, literally goes through the embedding outputs and throws some of those activations away at random as well 01:20:11.860 |
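As a small illustrative sketch (plain PyTorch, not the lesson's code), this is all a dropout layer does at training time: each activation is zeroed with probability p and the survivors are rescaled, and embedding dropout applies the same idea to the embedding outputs.

```python
import torch
import torch.nn as nn

acts = torch.ones(10)          # pretend these are the activations of one of those linear layers
drop = nn.Dropout(p=0.5)

drop.train()
print(drop(acts))              # roughly half the entries become 0, the rest are rescaled to 2.0

drop.eval()
print(drop(acts))              # at inference time dropout is a no-op
```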
Okay, let's take a break and let's come back at a 5 past 8 01:20:32.660 |
Equally exciting actually before I do I just mention that I had a good question during the break which was 01:20:50.580 |
And and basically I think the answer is like as we discussed before 01:20:54.660 |
No one in academia almost is working on this because it's not something that people really publish on 01:21:00.260 |
And as a result there haven't been really great examples where people could look at and say oh, here's a technique that works 01:21:14.020 |
Until now with this fast AI library. There hasn't been any 01:21:18.500 |
Way to to do it conveniently if you wanted to implement one of these models 01:21:27.100 |
Yourself or else now as we discussed. It's you know six 01:21:31.900 |
It's basically a six step process, you know involving about you know, not much more than six lines of code 01:21:41.340 |
So the reason I mentioned this is to say like I think there are a lot of big 01:21:49.100 |
opportunities to use this to solve problems that previously haven't been solved very well before 01:21:55.860 |
So like I'll be really interested to hear if some of you 01:22:03.780 |
Old Kaggle competitions you might find like oh I would have won this if I'd use this technique 01:22:09.420 |
That would be interesting or if you've got some data set you work with at work 01:22:13.900 |
You know some kind of predictive model that you've been doing with a GBM or a random forest. Does this help? 01:22:18.700 |
You know the thing I I'm still somewhat new to this I've been doing this for 01:22:26.220 |
Basically since the start of the year was when I started working on these structured deep learning models 01:22:35.540 |
Where might it fail? It's worked for nearly everything. I've tried it with so far 01:22:39.480 |
But yeah, I think this class is the first time that 01:22:44.700 |
There's going to be like more than half a dozen people in the world who actually are working on this 01:22:50.220 |
So I think you know as a group we're going to hopefully learn a lot and build some interesting things 01:22:55.120 |
and this would be a great thing if you're thinking of writing a post about something or here's an area that 01:23:01.420 |
There's a couple of that. There's a post from Instacart about what they did 01:23:08.340 |
O'Reilly AI video about what they did — that's about it. And there's two academic papers 01:23:13.860 |
Both about Kaggle competition victories: one from Yoshua Bengio and his group — they won a taxi 01:23:23.300 |
Destination forecasting competition and then also the one linked 01:23:32.540 |
Yeah, there's some background on that all right 01:23:42.900 |
Is kind of like the most up-and-coming area of deep learning. It's kind of like two or three years behind 01:23:49.820 |
Computer vision in deep learning it was kind of like the the second area that deep learning started getting really popular in and 01:24:01.340 |
Got to the point where it was like the clear state-of-the-art 01:24:04.700 |
For most computer vision things maybe in like 2014, you know and in some things in like 2012 01:24:14.740 |
For a lot of things deep learning is now the state of the art, but not quite everything 01:24:23.820 |
The software and some of the concepts is much less mature than it is for computer vision 01:24:30.340 |
So in general none of the stuff we talk about after computer vision is going to be as like 01:24:36.620 |
Settled as the computer vision stuff was so NLP 01:24:40.980 |
One of the interesting things is in the last few months 01:24:43.980 |
Some of the good ideas from computer vision have started to spread into NLP for the first time and we've seen some really big 01:24:51.180 |
Advances so a lot of the stuff you'll see in NLP is is pretty new 01:24:58.900 |
Kind of NLP problem and one of the things you'll find in NLP 01:25:03.780 |
It's like there are particular problems you can solve and they have particular names 01:25:07.580 |
and so there's a particular kind of problem in NLP called language modeling and 01:25:12.020 |
Language modeling has a very specific definition. It means build a model where given a 01:25:18.740 |
Few words of a sentence. Can you predict what the next word is going to be? 01:25:23.140 |
So if you're using your mobile phone and you're typing away and you press space and then it says like this is what the next 01:25:30.700 |
Word might be like SwiftKey does this like really well and SwiftKey actually uses deep learning for this 01:25:36.620 |
That's that's a language model. Okay, so it has a very specific meaning when we say language modeling 01:25:42.980 |
We mean a model that can predict the next word of a sentence 01:25:53.820 |
Papers from arXiv. So for those of you that don't know it, arXiv is 01:25:59.100 |
The most popular pre-print server in this community and various others 01:26:12.820 |
Abstracts and the topics for each. And so here's an example: the category of this particular paper was computer science, networking 01:26:22.140 |
Then the summary — that is, the abstract of the paper 01:26:25.180 |
It says 'the exploitation of mm-wave bands is one of the key enablers for 5G mobile' blah blah blah. Okay, so here's like an 01:26:39.420 |
So I trained a language model on this arXiv data set that I downloaded, and then I built a simple little test 01:26:52.140 |
So you'd say like oh imagine you started reading a document that said 01:26:55.460 |
Category is computer science networking and the summary is algorithms that and then I said, please write 01:27:03.100 |
An arXiv abstract. So it said — given that it's networking 01:27:10.220 |
Use the same network as a single node are not able to achieve the same performance as a traditional network based routing algorithms in this 01:27:16.860 |
Paper we propose a novel routing scheme, but okay 01:27:19.700 |
So it's learned, by reading arXiv papers, that somebody who was saying 'algorithms that' 01:27:26.500 |
Where the words 'cat: cs.NI' came before it, is going to talk like this — and remember, it started out not knowing English at all 01:27:35.740 |
Right, it actually started out with an embedding matrix for every word in English that was random 01:27:42.180 |
Okay, and by reading lots of arXiv papers, it learnt what kind of words followed others 01:27:47.700 |
So then I tried what if we said cat computer science computer vision? 01:27:55.820 |
Use the same data to perform image classification are increasingly being used to improve the performance of image classification 01:28:03.100 |
Algorithms and this paper we propose a novel method for image classification using a deep convolutional neural network parentheses CNN 01:28:10.020 |
So you can see like it's kind of like almost the same sentence as back here 01:28:15.940 |
But things have just changed into this world of computer vision rather than networking 01:28:21.060 |
So I tried something else which is like, okay 01:28:23.700 |
Category computer vision and I created the world's shortest ever abstract algorithms 01:28:29.260 |
And then I said title on and the title of this is going to be on the performance of deep learning for image classification 01:28:36.980 |
EOS is end of string. So that's like end of title 01:28:40.740 |
What if it was networking summary algorithms title on the performance of wireless networks as opposed to? 01:28:48.420 |
Towards computer vision towards a new approach to image classification 01:28:52.900 |
Networking towards a new approach to the analysis of wireless networks 01:28:58.340 |
So like I find this mind-blowing right? I started out with some random matrices 01:29:07.260 |
No pre-trained anything. I fed it 18 months' worth of arXiv articles and it learnt not only how to write English pretty well 01:29:17.420 |
But also that after you say something's a convolutional neural network, you should then use parentheses to say what it's called, and 01:29:24.900 |
furthermore that the kinds of things people talk and say create algorithms for in computer vision are 01:29:30.940 |
performing image classification and in networking are 01:29:34.220 |
Achieving the same performance as traditional network-based routing algorithms. So like a language model is 01:29:47.420 |
Right, and so we're going to try and build that 01:29:50.480 |
But actually not because we care about this at all 01:29:54.540 |
We're going to build it because we're going to try and create a pre-trained model 01:29:58.340 |
what we're actually going to try and do is take IMDB movie reviews and 01:30:02.960 |
Figure out whether they're positive or negative 01:30:06.060 |
So if you think about it, this is a lot like cats versus dogs. It's a classification algorithm, but rather than an image 01:30:15.620 |
So I'd really like to use a pre-trained network 01:30:19.860 |
like I would at least like a net to start with a network that knows how to read English, right and so 01:30:27.380 |
My view was like okay that to know how to read English means you should be able to like predict the next word of a sentence 01:30:38.700 |
Then use that pre-trained language model and then just like in computer vision 01:30:43.580 |
Stick some new layers on the end and ask it instead of to predicting the next word in the sentence 01:30:49.340 |
Instead predict whether something is positive or negative 01:30:52.520 |
So when I started working on this, this was actually a new idea 01:30:57.860 |
Unfortunately in the last couple of months I've been doing it 01:31:01.300 |
You know a few people have actually couple people have started publishing this and so this has moved from being a totally new idea to being 01:31:14.780 |
Creating a language model making that the pre-trained model for a classification model is what we're going to learn to do now 01:31:22.380 |
And so the idea is we're really kind of trying to leverage exactly what we learned in our computer vision work 01:31:28.420 |
Which is how do we do fine-tuning to create powerful classification models? Yes, you know 01:31:33.820 |
So why do you think that just directly doing what you want to do doesn't work? 01:31:43.660 |
Well, A, because it doesn't — it just turns out, empirically, it doesn't 01:31:48.300 |
And the reason it doesn't is a number of things 01:31:56.780 |
Fine-tuning a pre-trained network is really powerful 01:31:59.500 |
Right. So if we can get it to learn some related tasks first, then we can use all that information 01:32:19.300 |
They're pretty big, and so after reading a thousand words, knowing nothing about 01:32:24.220 |
How English is structured, or even what the concept of a word is 01:32:33.340 |
— it's all just integers, you know, they end up as integers — all you get is a one or a zero 01:32:38.340 |
Positive or negative and so trying to like learn the entire structure of English and then how it expresses positive and negative 01:32:44.540 |
Sentiments from a single number is just too much to expect 01:32:48.260 |
So by building a language model first we can try to build a neural network that kind of understands 01:32:54.900 |
The English of movie reviews and then we hope that some of the things it's learnt about 01:33:01.100 |
Are going to be useful in deciding whether something's a positive or a negative 01:33:08.020 |
Thanks. Is this similar to the char-RNN by Karpathy? 01:33:15.780 |
Yeah, this is somewhat similar to char-RNN by Karpathy. So the famous char, as in C-H-A-R, RNN 01:33:23.660 |
Tried to predict the next letter given a number of previous letters 01:33:29.100 |
Language models generally work at a word level. They don't have to 01:33:33.460 |
and doing things at a word level turns out to be a 01:33:37.940 |
Can be quite a bit more powerful and we're going to focus on word level modeling in this course 01:33:47.380 |
Actual copies of what it's found in the in the training data set or are these completely 01:33:54.100 |
Random things that it actually learned and how do we know how to distinguish between those two? Yeah, I mean these are all good questions 01:34:02.340 |
The words are definitely words it's seen before, because it's not working at a character level 01:34:06.660 |
So it can only give us words it's seen before. The sentences — 01:34:10.060 |
There's a number of kind of rigorous ways of doing it 01:34:14.380 |
But I think the easiest is to get a sense of like well here are two like different categories 01:34:19.780 |
Where it's kind of created very similar concepts, but mixing them up in just the right way like it would be very hard 01:34:27.660 |
To to do what we've seen here just by like spitting back things. It's seen before 01:34:34.220 |
But you could of course actually go back and check. You know have you seen that sentence before or like a string distance 01:34:44.820 |
And of course another way to do it is the length most importantly when we train the language model as we'll see 01:34:51.080 |
We'll have a validation set and so we're trying to predict the next word 01:34:54.540 |
Of something that's never seen before and so if it's good at doing that. It should be good at generating text in this case the purpose 01:35:05.380 |
That was just a fun example and so I'm not really going to study that too much 01:35:09.340 |
But you know you during the week totally can like you can totally build 01:35:14.620 |
The, you know, great American novel generator or whatever 01:35:14.620 |
There are some tricks to using language models to generate text that I'm not using here — they're pretty simple 01:35:27.940 |
We can talk about them on the forum if you like, but my focus is actually on classification 01:35:35.500 |
Incredibly powerful like text classification I 01:35:43.340 |
You want to like read every article as soon as it comes out through Reuters or Twitter or whatever and immediately 01:35:50.220 |
Identify things which in the past have caused you know massive market drops. That's a classification model or you want to 01:36:02.740 |
queries which tend to be associated with people who 01:36:06.940 |
Who leave you — you know, who cancel their contracts in the next month 01:36:12.500 |
That's a classification problem, so like it's a really powerful kind of thing for 01:36:27.500 |
I'm trying to classify documents into whether they're part of legal discovery or not part of legal discovery 01:36:38.260 |
In terms of stuff. We're importing we're importing a few new things here 01:36:41.820 |
One of the bunch of things we're importing is torchtext, which is PyTorch's NLP 01:36:52.420 |
Library, and so fastai is designed to work hand-in-hand with torchtext, as you'll see. And then there's a few 01:36:59.180 |
Text specific sub bits of faster fast AI that we'll be using 01:37:04.200 |
So we're going to be working with the IMDB large movie review data set. It's very very well studied in academia 01:37:21.180 |
50,000 reviews highly polarized reviews either positive or negative each one has been 01:37:29.860 |
Okay, so we're going to try our first of all however to create a language model 01:37:33.540 |
So we're going to ignore the sentiment entirely right so just like the dogs and cats 01:37:37.580 |
Pre-train the model to do one thing and then fine-tune it to do something else 01:37:41.300 |
Because this kind of idea in NLP is is so so so new 01:37:47.980 |
There's basically no models you can download for this so we're going to have to create our own 01:37:55.620 |
Having downloaded the data you can use the link here. We do the usual stuff saying the path to it training and validation path 01:38:03.220 |
And as you can see it looks pretty pretty traditional compared to vision. There's a directory of training 01:38:10.120 |
There's a directory of test we don't actually have separate test and validation in this case 01:38:15.940 |
And just like in in vision the training directory has a bunch of files in it 01:38:22.440 |
In this case not representing images, but representing movie reviews 01:38:26.940 |
So we could cat one of those files, and here we learn about the classic Zombiegeddon movie 01:38:36.460 |
I have to say, with a name like Zombiegeddon and an atom bomb on the front cover 01:38:45.040 |
Rent it if you want to get stoned on a Friday night and laugh with your buddies 01:38:51.780 |
Don't rent it if you're an uptight weenie or want a zombie movie with lots of fresh eating 01:38:55.560 |
I think I'm going to enjoy Zombiegeddon. So all right, so we've learned something today 01:39:00.360 |
All right, so we can just use standard unique stuff to see like how many words are in the data set so the training set we've got 01:39:13.400 |
Test set we've got five point six million words 01:39:20.260 |
This is IMDB, so these are, yeah, random people — this is not a New York Times listed reviewer as far as I know 01:39:35.580 |
Before we can do anything with text we have to turn it into a list of tokens 01:39:41.580 |
A token is basically like a word right so we're going to try and turn this eventually into a list of numbers 01:39:47.180 |
So the first step is to turn it into a list of words 01:39:49.580 |
That's called tokenization in NLP NLP has a huge lot of jargon that we'll we'll learn over time 01:39:56.180 |
One thing that's a bit tricky though when we're doing tokenization is here 01:40:02.740 |
I've tokenized that review and then joined it back up with spaces and you'll see here that wasn't 01:40:09.220 |
Has become two tokens which makes perfect sense right wasn't is two things, right? 01:40:20.500 |
Right, whereas lots of exclamation marks have become lots of tokens. So like a good tokenizer 01:40:30.260 |
Pieces of an English sentence each separate piece of punctuation will be separated 01:40:36.740 |
And each part of a multi-part word will be separated as appropriate. So 01:40:42.500 |
spaCy is a — I think it's an Australian-developed piece of software, actually — that does lots of NLP stuff 01:40:52.220 |
fastai is designed to work well with the spaCy tokenizer, as is torchtext. So here's an example of 01:40:59.100 |
Tokenization, right so what we do with torch text is we basically have to start out by creating 01:41:06.700 |
Something called a field and a field is a definition of how to pre-process some text 01:41:12.620 |
And so here's an example of the definition of a field. It says I want to lowercase 01:41:17.360 |
The text, and I want to tokenize it with the function called spacy_tok 01:41:23.160 |
Okay, so it hasn't done anything yet. We're just telling it: when we do do something 01:41:28.100 |
This is what to do. And so we're going to store that 01:41:35.580 |
And so this — none of this is fastai-specific at all 01:41:39.900 |
This is part of torchtext. You can go to the torchtext website and read the docs — there's not lots of docs yet 01:41:48.300 |
Probably the best information you'll find about it is in this lesson, but there's some more information on this site 01:41:54.260 |
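A minimal sketch of what that Field definition looks like; the lesson uses a fastai helper called spacy_tok, so the wrapper below is an assumed stand-in, and note that in newer torchtext versions this Field API has been moved or removed:

```python
import spacy
from torchtext import data   # the legacy torchtext Field API used in 2018

spacy_en = spacy.load('en_core_web_sm')   # the lesson-era call was spacy.load('en')

def spacy_tok(x):
    # our own stand-in for the fastai helper of the same spirit
    return [tok.text for tok in spacy_en.tokenizer(x)]

# A Field just records *how* to preprocess text; nothing is tokenized yet.
TEXT = data.Field(lower=True, tokenize=spacy_tok)
```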
Alright, so what we can now do is go ahead and create the usual fast AI model data object 01:42:03.060 |
Okay, and so to create the model data object. We have to provide a few bits of information 01:42:10.260 |
So the path to the text files the validation set and the test set in this case just to keep things simple 01:42:17.660 |
I don't have a separate validation in test set so I'm going to pass in the validation set for both of those two things 01:42:23.620 |
Right. So now we can create our model data object as per usual. The first thing we give it is the path 01:42:31.060 |
The second thing we give it is the torch text field definition of how to pre-process that text 01:42:36.940 |
The third thing we give it is the dictionary or the list of all of the files we have train validation test 01:42:44.540 |
As per usual we can pass in a batch size and then we've got a special special couple of extra things here 01:42:51.900 |
One is a very commonly used NLP parameter: minimum frequency. What this says is 01:42:51.900 |
In a moment, we're going to be replacing every one of these words with an integer 01:43:04.980 |
Which basically will be a unique index for every word and this basically says if there are any words that occur less than 10 times 01:43:16.220 |
Just call them unknown — don't think of it as a word. But we'll see that in more detail in a moment 01:43:16.220 |
And then we're going to see this in more detail as well BP TT stands for back prop through time 01:43:27.580 |
And this is where we define how long a sentence will we? 01:43:32.060 |
Stick on the GPU at once. So we're going to break them up in this case. We're going to break them up into sentences of 01:43:38.820 |
70 tokens or less on the whole so we're going to see all this in a moment 01:43:44.860 |
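A hedged sketch of the model data call being described, from memory of the 2018-era fastai API (the method name, argument names, and the PATH/TRN_PATH/VAL_PATH variables are assumptions carried over from the notebook):

```python
bs, bptt = 64, 70                                                 # batch size and back-prop-through-time length
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)  # reuse validation as test, as described above

md = LanguageModelData.from_text_files(PATH, TEXT, **FILES,
                                       bs=bs,         # how many text streams run side by side
                                       bptt=bptt,     # roughly how many tokens per chunk on the GPU
                                       min_freq=10)   # words seen fewer than 10 times become "unknown"
```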
All right. So after building our model data object, right what it actually does is it's going to fill this text field 01:43:54.700 |
With an additional attribute called vocab and this is a really important NLP concept 01:44:01.020 |
I'm sorry. There's so many NLP concepts. We just have to throw at you kind of quickly, but we'll see them a few times 01:44:08.100 |
Vocab is the vocabulary and the vocabulary in NLP has a very specific meaning it is 01:44:14.100 |
What is the list of unique words that appeared in this text? 01:44:17.160 |
So every one of them is going to get a unique index. So let's take a look right here is text 01:44:24.540 |
Vocab dot itos — that stands for int-to-string; this is all torchtext, not fastai 01:44:32.300 |
Maps the integer zero to unknown, the integer one to padding, two to 'the', then comma, dot, 'and' 01:44:41.500 |
'Of', 'to' and so forth. All right, so this is the first 12 01:44:50.220 |
Of the vocab from the IMDB movie review and it's been sorted by frequency 01:44:55.820 |
Except for the first two special ones. So, for example, we can then go backwards: stoi, string-to-int 01:45:02.900 |
Here is 'the' — it's in position 0, 1, 2 — so string-to-int of 'the' is 2 01:45:09.460 |
So the vocab lets us take a word and map it to an integer or take an integer and map it to a word 01:45:19.060 |
Right. And so that means that we can then take 01:45:22.060 |
the first 12 tokens for example of our text and turn them into 01:45:28.380 |
12 ints. So, for example, here is 'of the' — you can see 7, 2, and so on 01:45:38.500 |
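A hedged sketch of poking at those mappings (itos and stoi are torchtext vocab attributes; the exact indices shown in the comments are just what the lesson output suggests):

```python
print(TEXT.vocab.itos[:12])              # int -> string, e.g. ['<unk>', '<pad>', 'the', ',', '.', 'and', ...]
print(TEXT.vocab.stoi['the'])            # string -> int, e.g. 2

tokens = 'of the movie'.split()
print([TEXT.vocab.stoi[t] for t in tokens])   # the numericalised form the model actually sees
```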
Right. So we're going to be working in this form. Did you have a question? Yeah, could you pass that back there? 01:45:47.940 |
Is it common to do any stemming or lemmatizing? 01:45:53.900 |
Generally, tokenization is what we want. Like, with a language model 01:45:57.800 |
We you know to keep it as general as possible we want to know what's coming next and so like whether it's 01:46:04.700 |
Future tense or past tense or plural or singular like we don't really know which things are going to be interesting in which aren't 01:46:15.420 |
It seems that it's generally best to kind of leave it alone as much as possible 01:46:23.340 |
You know having said that as I say, this is all pretty new 01:46:26.660 |
So if there are some particular areas that some researcher maybe has already discovered that some other kinds of pre-processing are helpful 01:46:33.620 |
You know, I wouldn't be surprised not to know about it 01:46:40.420 |
You know, in natural language context is important — context is very important. So if you're using 01:46:40.420 |
This — look, I just don't get one of the big premises of this, like, that they're in order 01:46:51.780 |
Yeah, so just because we replaced 'I' with the number 12, they're still in that order 01:46:57.740 |
There is a different way of dealing with natural language called a bag of words and bag of words 01:47:12.380 |
You do throw away the order in the context and in the machine learning course 01:47:16.220 |
We'll be learning about working with bag of words representations 01:47:21.740 |
No longer useful, or on the verge of becoming no longer useful 01:47:21.740 |
We're starting to learn how to use deep learning to use context properly now 01:47:26.540 |
But it's kind of for the first time it's really like only in the last few months 01:47:37.140 |
All right, so I mentioned that we've got two numbers: batch size and bptt, back prop through time 01:47:58.940 |
Okay, so we've got some big long piece of text, you know, here's our sentence. It's a bunch of words, right and 01:48:03.540 |
Actually what happens in a language model is even though we have lots of movie reviews 01:48:10.460 |
They actually all get concatenated together into one big block of text, right? So it's basically predict the next word 01:48:18.580 |
In this huge long thing, which is all of the IMDb movie reviews concatenate together. So this thing is, you know 01:48:26.340 |
What do we say? It was like tens of millions of words long and so what we do 01:48:36.020 |
First, right, so these get split into batches, right? And so if we said batch size 01:48:45.020 |
64, we actually break the — whatever it was — 60 million words into 64 sections 01:48:53.700 |
Right, and then we take each one of the 64 sections and put it 01:49:02.340 |
Like, underneath the previous one — I didn't do a great job of drawing that 01:49:24.100 |
Actually, I think we move them across-wise, so it's actually, I think, just transposed — we end up with a matrix that's 64 01:49:39.340 |
Wide, and the length — let's say the original was 64 million, right — then the length is like a million 01:50:03.660 |
We then grab a little chunk of this at a time and those chunk lengths are approximately equal to 01:50:11.500 |
BP TT which I think we had equal to 70. So we basically grab a little 01:50:20.980 |
That's the first thing we chuck into our GPU. That's a batch, right? So a batch is always of length of width 01:50:28.020 |
64 or batch size and each bit is a sequence of length up to 70 01:50:37.260 |
Right. So here if I go take my train data loader 01:50:42.060 |
I don't know if you folks have tried playing with this yet 01:50:44.980 |
But you can take any data loader wrap it with it up to turn it into an iterator and then call next on it to grab 01:50:51.660 |
a batch of data just as if you were a neural net you get exactly what the neural net gets and you can see here we 01:51:03.060 |
Tensor right so it's 64 wide right and I said it's approximately 01:51:15.140 |
And that's actually kind of interesting a really neat trick that torch text does is they randomly change 01:51:22.060 |
The back prop through time number every time so each epoch it's getting slightly different 01:51:32.220 |
This is kind of like in computer vision. We randomly shuffle the images 01:51:37.080 |
We can't randomly shuffle the words right because we need to be in the right order 01:51:42.100 |
So instead we randomly move their break points a little bit. Okay, so this is the equivalent 01:51:50.340 |
This here is of length 75 right there's a there's an ellipsis in the middle 01:52:00.420 |
And that represents the first 75 words of the first review 01:52:09.780 |
Represents the first 75 words of this of the second of the 64 segments 01:52:15.060 |
That's it have to go in like 10 million words to find that one right and so here's the first 01:52:20.780 |
75 words of the last of those 64 segments okay, and so then what we have 01:52:38.540 |
6 1 5 there's 6 1 5 25 there's 25 right and in this case 01:52:47.820 |
It's also 75 by 64 but for minor technical reasons being flattened out 01:52:53.060 |
Into a single vector that basically it's exactly the same as this matrix, but it's just moved down 01:53:01.980 |
By one because we're trying to predict the next word 01:53:05.740 |
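A hedged sketch of peeking at one of those batches, assuming the md object from the model data call above (the exact sizes in the comments will vary, since the bptt length is jittered each time):

```python
x, y = next(iter(md.trn_dl))   # wrap the training data loader in an iterator and grab one batch

print(x.size())   # roughly (bptt, batch_size), e.g. torch.Size([75, 64])
print(y.size())   # the same tokens shifted along by one word, flattened: e.g. torch.Size([4800])
```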
Right so that all happens for us right if we ask for and this is the fast AI now if you ask for a language model 01:53:15.420 |
object then it's going to create these batches of 01:53:23.980 |
Bits of our language corpus along with the same thing shuffled along by one word 01:53:32.220 |
Right and so we're always going to try and predict the next word 01:53:36.100 |
So why don't you instead of just arbitrarily choosing 64? 01:53:47.100 |
Why don't you choose like like 64 is a large number 01:53:52.900 |
Maybe, like, do it by sentences and make it a large number, and then pad it with zeros or something 01:54:02.340 |
You know, so that you actually have one full sentence per line 01:54:05.460 |
Basically wouldn't that make more sense not really because remember we're using columns right so each of our columns is of length about 10 million 01:54:13.140 |
Right, so although it's true that those columns aren't always exactly finishing on a full stop, they're so damn long we don't care 01:54:25.340 |
Right — and also, each column contains multiple sentences? It contains many 01:54:32.120 |
Yeah, it's of length about 10 million 01:54:35.500 |
And it contains many many many many many sentences 01:54:38.880 |
Because remember the first thing we did was take the whole thing and split it into 64 groups 01:54:50.620 |
So um I found this you know pertaining to this question this thing about like 01:54:55.960 |
What's in this language model matrix a little mind-bending for quite a while? 01:55:01.780 |
So don't worry if it takes a while and you have to ask a thousand questions on the forum. That's fine, right? 01:55:09.540 |
Go back and listen to what I just said in this lecture again 01:55:12.420 |
Go back to that bit where I showed you splitting it up into 64 and moving them around, and try it with some sentences in 01:55:17.600 |
Excel or something and see if you can do a better job of explaining it than I did 01:55:26.260 |
And then what fast AI adds on is this idea of like kind of how to build a language model out of it 01:55:33.460 |
Although actually a lot of that is stolen from torchtext as well — like, sometimes where torchtext ends and fastai starts is 01:55:41.700 |
Subtle; they really work closely together, okay? 01:55:53.540 |
Batches we can go ahead and create a model right and so in this case 01:55:59.300 |
We're going to create an embedding matrix and our vocab 01:56:06.780 |
Let's have a look back here, so we can see here in the model data object there are 01:56:18.020 |
Kind of pieces that we're going to go through — the number of batches — that's basically equal to 01:56:22.540 |
The total length of everything divided by batch size times bptt 01:56:31.020 |
I've got the definition up here number of unique tokens NT is the number of tokens 01:56:36.080 |
That's the size of our vocab, so we've got 34,945 unique words 01:56:43.700 |
And notice the unique words it had to appear at least ten times 01:56:46.900 |
Okay, because otherwise they'd have been replaced with unknown 01:56:50.300 |
The length of the data set is one because as far as a language model is concerned there's only one 01:57:00.860 |
Thing which is the whole corpus all right, and then that thing has 01:57:12.820 |
So those 34,945 things are used to create an embedding matrix 01:57:12.820 |
Right and so the first one represents onk the second one represents pad 01:57:35.180 |
The third one was dot the fourth one was comma this one. I'm just guessing was there and so forth 01:57:47.660 |
So this is literally identical to what we did 01:57:50.500 |
Before the break right this is a categorical variable. It's just a very high cardinality categorical variable and furthermore 01:57:59.300 |
It's the only variable right. This is pretty standard in NLP. You have a variable which is a word 01:58:10.260 |
Single column basically, and it's a 34,945- 01:58:10.260 |
Cardinality categorical variable, and so we're going to create an embedding matrix for it 01:58:16.860 |
So em_sz is the size of the embedding vector: 200, okay? 01:58:28.020 |
So that's going to be length 200 a lot bigger than our previous embedding vectors not surprising because a word 01:58:34.740 |
Has a lot more nuance to it than the concept of Sunday 01:58:40.780 |
Or Rossman's Berlin store or whatever right so it's generally an embedding size for a word 01:58:47.640 |
Will be somewhere between about 50 and about 600? 01:58:50.400 |
Okay, so I've kind of gone some in the middle 01:58:52.980 |
We then have to say as per usual how many activations 01:58:58.100 |
Do you want in your layers so we're going to use 500 and then how many layers? 01:59:02.140 |
Do you want in your neural net we're going to use three okay? 01:59:08.140 |
This is a minor technical detail it turns out that 01:59:11.180 |
We're going to learn later about the Adam optimizer 01:59:14.460 |
That basically the defaults for it don't work very well with these kinds of models 01:59:18.720 |
So we just have to change some of these — you know, basically any time you're doing NLP you should probably use this particular line 01:59:30.060 |
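A hedged sketch of the settings just described, as I recall them from the lesson notebook (the exact betas value is from memory — the point is simply that Adam's default momentum is too high for these models):

```python
from functools import partial
import torch.optim as optim

em_sz, nh, nl = 200, 500, 3                      # embedding size, hidden activations, number of layers
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))  # lower first-moment momentum than Adam's default 0.9
```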
So having done that we can now again take our model data object and grab a model out of it 01:59:38.580 |
What optimization function do we want, how big an embedding do we want, how many activations, how many hidden layers 01:59:52.500 |
So this language model we're going to use is a very recent development called AWD-LSTM, by Stephen Merity 02:00:01.020 |
Who's an NLP researcher based in San Francisco, and his main contribution really was to show 02:00:07.680 |
How to put dropout all over the place in in these NLP models? 02:00:15.740 |
We'll do this in the last lecture — worrying about, like, what all that is 02:00:18.780 |
What the architecture is and what all these dropouts are. For now 02:00:22.460 |
Just know it's the same as per usual: if you try to build an NLP model and you're under-fitting 02:00:28.540 |
Then decrease all of these dropouts if you're over fitting then increase all of these dropouts in roughly this ratio 02:00:35.960 |
Okay, that's that's my rule of thumb and it again. This is such a recent paper 02:00:42.260 |
Nobody else is working on this model anyway, so there's not a lot of guidance, but I've found this these ratios work 02:00:49.260 |
Well, that's what Stephen's been using as well 02:00:51.500 |
There's another kind of way we can avoid overfitting that we'll talk about in the last class 02:00:58.540 |
Again for now this one actually works totally reliably so all of your NLP models probably want this particular line of code 02:01:05.600 |
And then this one we're going to talk about at the end last lecture as well you can always include this basically what it says is 02:01:19.220 |
When you look at your gradients, and you multiply them by the learning rate, and you decide how much to update your weights by 02:01:31.220 |
Like don't let them be more than zero point three 02:01:34.740 |
and this is quite a cool little trick right because like 02:01:39.540 |
If you're learning rates pretty high, and you kind of don't want to get in that situation 02:01:46.140 |
We talked about where you're kind of got this kind of thing where you go 02:01:54.100 |
You know rather than little step little step little step instead you go like oh too big oh too big right with gradient 02:02:01.340 |
Clipping it kind of goes this far, and it's like oh my goodness. I'm going too far. I'll stop 02:02:05.900 |
Right, that's basically what gradient clipping does 02:02:11.980 |
Anyway, so these are a bunch of parameters the details don't matter too much right now. You can just steal these 02:02:36.420 |
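Pulling those parameters together, here is a hedged sketch of the learner call from memory of the lesson notebook — the dropout names come from Merity's AWD-LSTM setup, and the specific numbers (along with the seq2seq_reg regularizer, the clip value, and the fit schedule) should be treated as a starting point to verify against your copy of the notebook:

```python
learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=0.05, dropout=0.05, wdrop=0.1,
                       dropoute=0.02, dropouth=0.05)      # scale these up/down together to fight over/under-fitting
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)    # the extra regularisation mentioned above
learner.clip = 0.3                                        # gradient clipping: never step further than 0.3
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2) # train with SGDR-style restarts, as per usual
```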
Word2vec and GloVe — so I have two questions about that. One is 02:02:41.840 |
How are those different from these, and the second question: why don't you initialize them with one of those? Yeah, so 02:02:51.900 |
So basically that's a great question, so basically 02:02:57.820 |
People have pre-trained these embedding matrices before to do various other tasks. They're not whole pre-trained models 02:03:03.860 |
They're just a pre-trained embedding matrix, and you can download them, and as Yannet says they have names like word2vec and GloVe 02:03:12.780 |
There's no reason we couldn't download them really it's just like 02:03:25.620 |
Building a whole pre-trained model in this way didn't seem to benefit much if at all from using pre-trained word vectors 02:03:32.700 |
We're also using a whole pre-trained language model 02:03:37.460 |
So like, you remember what a big splash word2vec made when it came out — those of you who saw it 02:03:42.820 |
I'm finding this technique of pre-trained language models seems much more powerful 02:03:49.740 |
Basically, but I think we combine both to make them a little better still 02:03:53.620 |
What is what is the model that you have used like how can I know the architecture of the model? 02:04:00.020 |
So we'll be learning about the model architecture in the last lesson for now. It's a recurrent neural network 02:04:07.980 |
Using something called LSTM long short-term memory 02:04:17.740 |
So there are lots of details that we're skipping over, but you know, you can do all this without any of those details 02:04:17.740 |
I found that this language model took quite a while to fit so I kind of like ran it for a while 02:04:31.260 |
Noticed it was still under fitting save where it was up to 02:04:34.860 |
Ran it a bit more with longer cycle length saved it again. It still 02:04:44.220 |
And kind of finally got to the point where it's like kind of honestly I kind of ran out of patience 02:04:53.700 |
I did the same kind of test that we looked at before so I was like oh it wasn't quite what I was expecting 02:04:58.620 |
But I really liked it anyway the best and then I was like okay 02:05:01.080 |
Let's see how that goes the best performance was one in the movie was a little bit. I say okay 02:05:05.020 |
It looks like the language models working pretty well 02:05:14.980 |
Fine-tune it to do classification — sentiment classification. Now, obviously, if I'm going to use a pre-trained model 02:05:21.760 |
I need to use exactly the same vocab, right — the word 'the' 02:05:25.860 |
Still needs to map to the number two, so that I can look up the vector for it. Right, so that's why I first of all 02:05:33.820 |
Load back up my field object — the thing with the vocab in it. Now, in this case 02:05:41.060 |
If I run it straight afterwards, this is unnecessary 02:05:43.880 |
It's already in memory, but this means I can come back to this later right and a new session basically 02:05:50.860 |
I can then go ahead and say, okay, I've now got one more field, right — in addition to my field 02:05:59.780 |
Which represents the reviews I've also got a field which represents the label 02:06:09.060 |
Now this time I need to not treat the whole thing as one big 02:06:14.180 |
Piece of text, but every review is separate because each one has a different sentiment attached to it 02:06:20.420 |
And it so happens that torch text already has a data set that does that for IMDB, so I just used IMDB 02:06:30.180 |
So basically once we've done all that we end up with something where we can like grab for a particular example 02:06:40.020 |
Here's some of the text: this is another great Tom Berenger movie, blah blah blah blah. All right, so 02:06:45.220 |
This is all not nothing fast AI specific here 02:06:51.660 |
But torch text docs can help understand what's going on all you need to know is that 02:06:56.660 |
Once you've used this special torchtext thing called splits to grab a splits object 02:07:02.860 |
You can pass it straight into fastai's TextData.from_splits, and that basically converts a torchtext 02:07:10.140 |
Object into a fast AI object we can train on so as soon as you've done that you can just go ahead and say 02:07:20.700 |
And then we can load into it the pre-trained model the language model 02:07:26.860 |
right, and so we can now take that pre-trained language model and 02:07:31.900 |
Use the stuff that we're kind of familiar with right so we can 02:07:35.300 |
Make sure that you know all it's at the last layer is frozen train it a bit 02:07:40.140 |
Unfreeze it train it a bit and the nice thing is once you've got a pre-trained 02:07:45.300 |
Language model, it actually trains super fast — you can see here it's like a couple of minutes 02:07:50.380 |
Per epoch, and to get my best one here 02:07:56.060 |
It only took me like 10 epochs, so it's like 20 minutes to train this bit. It's really fast 02:08:03.900 |
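A hedged sketch of the fine-tuning recipe just described, from memory of the 2018-era fastai/torchtext APIs — the splits call, the get_model arguments, the learning rates, and the saved encoder name 'adam3_enc' are all assumptions to check against the lesson notebook:

```python
splits = torchtext.datasets.IMDB.splits(TEXT, IMDB_LABEL, 'data/')   # one labelled example per review
md2 = TextData.from_splits(PATH, splits, bs)                         # torchtext splits -> fastai model data

m3 = md2.get_model(opt_fn, 1500, bptt, emb_sz=em_sz, n_hid=nh, n_layers=nl,
                   dropout=0.1, dropouti=0.4, wdrop=0.5, dropoute=0.05, dropouth=0.3)
m3.load_encoder('adam3_enc')   # reuse the pre-trained language model's weights

m3.freeze_to(-1)               # train just the new classifier head first
m3.fit(1e-3, 1)
m3.unfreeze()                  # then fine-tune the whole network
m3.fit(1e-3, 1, cycle_len=1)
```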
94.5% so how good is 94.5% well it so happens that 02:08:11.540 |
Actually one of Stephen Merity's colleagues, James Bradbury, recently created a paper 02:08:17.220 |
Looking at, like, where they tried to create a new state-of-the-art for a bunch of NLP things, and one of the things was 02:08:27.940 |
IMDB and they actually have here a list of the current world's best for 02:08:35.780 |
Even with stuff that is highly specialized for sentiment analysis, the best anybody had previously come up with was 94.1 02:08:51.100 |
Anybody has created in the world before as far as we know or as far as James Bradbury knows 02:08:58.820 |
so when I say like there are big opportunities to use this I mean like 02:09:03.180 |
This is a technique that nobody else currently has access to — which, you know, whatever 02:09:10.300 |
IBM has in Watson or whatever any big company has you know that they're 02:09:16.180 |
Advertising — unless they have some secret sauce that they're not publishing, which they don't, right, because people, you know 02:09:25.380 |
Then you now have access to a better text classification method than as ever existed before 02:09:30.340 |
So I really hope that you know, you can try this out and see how you go 02:09:35.140 |
There may be some things that works really well on and others that it doesn't work as well on I don't know 02:09:41.860 |
I think this kind of sweet spot here that we had about 25,000 02:09:48.420 |
You know short to medium sized documents if you don't have at least that much text 02:09:54.060 |
It may be hard to train a different language model 02:09:56.540 |
But having said that, there's a lot more we could do here, right — and we won't be able to do it in part one of this course 02:10:02.660 |
We'll do it in part two. But for example, we could start, like, training language models that look at, like 02:10:08.860 |
You know lots and lots of medical journals and then we could like make a downloadable 02:10:13.620 |
medical language model that then anybody could use to like fine-tune on like a 02:10:20.300 |
Prostate cancer subset of medical literature for instance, like there's so much we could do 02:10:26.300 |
It's kind of exciting and then you know to your nets point we could also combine this with like pre-trained word vectors 02:10:34.260 |
Trying that hard like, you know, we even without news like 02:10:37.780 |
we could have pre-trained a Wikipedia say corpus language model and then fine-tuned it into a 02:10:45.820 |
IMDb language model, and then fine-tuned that into an IMDb sentiment analysis model, and we would have got something better than this 02:10:53.100 |
So like this and I really think this is the tip of the iceberg 02:10:56.780 |
And I was talking — there's a really fantastic researcher called Sebastian Ruder, who is 02:11:04.500 |
Basically the only NLP researcher I know who's been really, really writing a lot about 02:11:11.380 |
Training and fine-tuning and transfer learning and NLP and I was asking him like why isn't this happening more? 02:11:17.740 |
And his view was it's because there isn't the software to make it easy, you know 02:11:23.500 |
So I'm actually going to share this lecture with with him tomorrow 02:11:27.780 |
Because you know it feels like there's you know 02:11:32.540 |
Hopefully going to be a lot of stuff coming out now that we're making it really easy to do this 02:11:41.380 |
We're kind of out of time so what I'll do is I'll quickly look at 02:11:45.580 |
Collaborative filtering — an introduction — and then we'll finish it next time. With collaborative filtering there's very, very little new to learn 02:11:53.360 |
We basically learned everything we're going to need 02:11:56.300 |
So collaborative filtering will will cover this quite quickly next week 02:12:02.980 |
And then we're going to do a really deep dive into collaborative filtering next week 02:12:07.980 |
Where we're going to learn about like we're actually going to from scratch learn how to do stochastic gradient descent 02:12:13.820 |
How to create loss functions how they work exactly and then we'll go from there and we'll gradually build back up to really deeply understand 02:12:22.820 |
What's going on in the structured models and then what's going on in confidence and then finally what's going on in recurrent neural networks 02:12:30.500 |
And hopefully we'll be able to build them all 02:12:32.940 |
From scratch okay, so this is kind of going to be really important this movie lens data set because we're going to use it to 02:12:40.860 |
Really foundational theory and kind of math behind it so the movie lens data set 02:12:47.380 |
This is basically what it looks like it contains a bunch of ratings. It says user number one 02:12:54.140 |
Watched movie number 31 and they gave it a rating of two and a half 02:13:02.020 |
Then they watched movie 1029 and they gave it a rating of three, and they watched movie 1172 02:13:08.340 |
And they gave it a rating of four. Okay, and so forth 02:13:11.020 |
So this is the ratings table. This is really the only one that matters and our goal will be for some user 02:13:18.740 |
We haven't seen before sorry for some user movie combination. We haven't seen before we have to predict if they'll like it 02:13:25.580 |
Right and so this is how recommendation systems are built 02:13:29.220 |
This is how, like, Amazon decides what books to recommend, how Netflix decides what movies to recommend, and so forth 02:13:34.880 |
To make it more interesting we'll also actually download a list of movies so each movie 02:13:42.020 |
We're actually going to have the title and so for that question earlier about like what's actually going to be in these embedding matrices 02:13:47.420 |
How do we interpret them? We're actually going to be able to look and see 02:13:52.660 |
So basically this is kind of like what we're creating this is kind of crosstab of users 02:14:01.400 |
Alright, and so feel free to look ahead during the week. You'll see, basically, as per usual: CollabFilterDataset from CSV 02:14:10.800 |
learn dot fit, and we're done. And you won't be surprised to hear that when we then take that and we check the benchmarks 02:14:16.680 |
It seems to be better than the benchmarks that we looked at. So that'll basically be it, and then next week 02:14:22.040 |
We'll have a deep dive and we'll see how to actually build this from scratch. All right. See you next week
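For anyone looking ahead to the collaborative filtering notebook mentioned just above, the calls look roughly like this — a hedged sketch from memory of the 2018-era fastai API, with the class and argument names, the helper get_cv_idxs, and the n_factors value all being assumptions to check against the lesson 5 notebook:

```python
import pandas as pd
from torch import optim

ratings_csv = 'ratings.csv'                                  # columns: userId, movieId, rating, timestamp
val_idxs = get_cv_idxs(len(pd.read_csv(path + ratings_csv))) # hold out some rows for validation

cf = CollabFilterDataset.from_csv(path, ratings_csv, 'userId', 'movieId', 'rating')
learn = cf.get_learner(50, val_idxs, 64, opt_fn=optim.Adam)  # 50 factors, batch size 64
learn.fit(1e-2, 2, wds=1e-4, cycle_len=1, cycle_mult=2)
```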