Machine Learning 1: Lesson 10
Chapters
0:00 Fast AI
1:22 Feature Engineering
4:25 Structured Data
8:34 Recap
11:53 AutoGrad
13:12 Variables
15:03 Iterators
24:08 Gradients
37:08 Data Loader
40:18 Parameters
47:30 Weight Decay
55:08 Discussion
00:00:00.000 |
Well, welcome back to Machine Learning. One of the most exciting things this week — 00:00:05.480 |
almost certainly the most exciting thing this week — is that fastai is now on pip, so you can pip install fastai. 00:00:14.760 |
And so thank you to Prince and to Karem for making that happen, 00:00:21.360 |
two USF students who had never published a pip package before — and this is one of the harder ones to publish because it's got a lot to it. 00:00:30.160 |
So it's, you know, probably still easiest just to do the conda env update thing, 00:00:36.720 |
But a couple of places that it would be handy instead to pip install fastai would be well obviously if you're working 00:00:42.880 |
outside of the repo and the notebooks — then this gives you access to fastai everywhere. 00:00:49.680 |
Also, I believe they submitted a pull request to Kaggle to try and get it added to the Kaggle kernels 00:00:55.960 |
So hopefully you'll be able to use it on Kaggle kernels 00:01:00.080 |
Yeah, you can use it at your work or whatever else 00:01:04.160 |
So that's exciting. I mean, I'm not going to say it's like officially released yet. You know, it's still — 00:01:13.880 |
You're helping add documentation and all that kind of stuff, but it's great that that's now there 00:01:21.560 |
a couple of cool kernels from USF students this week thought I'd highlight two that were both from the 00:01:37.840 |
written out, you know, written as standard English text — they also had one for Russian. 00:01:45.920 |
And you're trying to kind of identify things that could be like a first second third and say like that's a cardinal number 00:01:52.760 |
Or if this is a phone number or whatever and I did a quick little bit of searching and I saw that 00:01:57.820 |
There had been some attempts in academia to use 00:02:01.840 |
deep learning for this, but they hadn't managed to make much progress and 00:02:09.200 |
kernel here, which gets 0.992 on the leaderboard — which I think is like top 20 — 00:02:14.520 |
is, yeah, kind of entirely heuristic, and it's a great example of 00:02:18.640 |
kind of feature engineering — in this case the whole thing is basically entirely feature engineering. 00:02:23.780 |
So it's basically looking through and using lots of regular expressions to figure out for each token 00:02:29.600 |
What is it you know and I think she's done a great job here kind of laying it all out 00:02:34.240 |
clearly as to what all the different pieces are and how they all fit together and 00:02:38.560 |
She mentioned that she's maybe hoping to turn this into a library which I think would be great 00:02:45.480 |
grab a piece of text and pull out what all the pieces in it are. 00:02:53.400 |
The natural language processing community hopes to be able to do this 00:02:58.640 |
without, like, lots of handwritten code like this, but for now — 00:03:03.260 |
it'll be interesting to see what the winners turn out to have done, but I haven't seen 00:03:09.200 |
machine learning being used really to do this particularly well. 00:03:13.520 |
Perhaps the best approaches are the ones which combine this kind of feature engineering along with some machine learning. 00:03:19.600 |
But I think this is a great example of effective feature engineering, and this is a another USF student 00:03:27.460 |
who has done much the same thing and got a similar kind of score. 00:03:36.160 |
Again, this would get you a good leaderboard position as well. 00:03:40.480 |
so I thought that was interesting to see examples of some of our students entering a 00:03:45.800 |
competition and getting kind of top 20 ish results by you know basically just handwritten heuristics, and this is where 00:03:59.640 |
Six years ago still basically all the best approaches were a whole lot of, like, carefully handwritten heuristics, 00:04:06.640 |
often combined with some simple machine learning and 00:04:18.360 |
Automating much more of this and actually interestingly 00:04:25.600 |
the Safe Driver Prediction competition just finished — 00:04:29.200 |
One of the Netflix prize winners won this competition and he 00:04:35.320 |
Invented a new algorithm for dealing with structured data which basically doesn't require any feature engineering at all 00:04:49.760 |
deep learning models and one gradient boosting machine 00:04:54.200 |
And his his basic approach was very similar to what we've been learning in this class so far 00:05:02.180 |
which is using fully connected neural networks and one-hot encoding, 00:05:07.720 |
and specifically embeddings, which we'll learn about — but he had a very clever technique, 00:05:13.280 |
Which was there was a lot of data in this competition which was unlabeled so in other words 00:05:24.200 |
Or whatever so unlabeled data so when you've got some labeled and some unlabeled data 00:05:29.080 |
We call that semi supervised learning and in real life 00:05:32.960 |
Most learning is semi supervised learning like in real life normally you have some things that are labeled and some things that are unlabeled 00:05:40.100 |
so this is kind of the most practically useful kind of learning and 00:05:44.160 |
Then structured data is it's the most common kind of data that companies deal with day to day 00:05:53.620 |
Structured data competition made it incredibly practically useful 00:05:57.460 |
And so what his technique for winning this was, was to 00:06:01.780 |
Do data augmentation which those of you doing the deep learning course have learned about which is basically the idea like if you had 00:06:09.760 |
Pictures you would like flip them horizontally or rotate them a bit data augmentation means creating new data examples 00:06:19.240 |
different versions of ones you already have. And the way he did it was, for each row in the data, he would — 00:06:32.880 |
so each row now would represent like a mix of, like, 85 percent of the original row, 00:06:38.320 |
but 15 percent randomly selected from a different row — 00:06:44.760 |
randomly changing the data a little bit and then he used something called an autoencoder which we will 00:06:51.080 |
Probably won't study until part two of the deep learning course 00:06:54.960 |
But the basic idea of an autoencoder is your dependent variable is the same as your independent variable 00:07:01.020 |
so in other words you try to predict your input, which obviously is 00:07:07.520 |
trivial if you're allowed to — like, you know, the identity transform, for example, trivially predicts the input. 00:07:14.640 |
But the trick with an autoencoder is to have less activations in 00:07:18.640 |
At least one of your layers than your input right so if your input was like a hundred-dimensional vector, and you put it through a 00:07:27.560 |
100 by 10 matrix to create 10 activations, and then have to recreate the original hundred-long vector from that, 00:07:36.600 |
then you basically — you have to have compressed it, effectively. And so it turns out that 00:07:47.400 |
Correlations and features and interesting relationships in the data even when it's not labeled so he used that 00:07:55.280 |
Rather than doing any — he didn't do any hand engineering, he just used an autoencoder. 00:08:00.120 |
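
A minimal sketch of the denoising-autoencoder idea just described, under my own assumptions (the shapes, names, and noise level are illustrative, not the winner's actual code): swap roughly 15% of each row's values with values from other rows, then train a network to reconstruct the original row through a narrower middle layer.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def swap_noise(x, p=0.15):
    """Replace a random ~15% of each cell with the value from the same column of another row."""
    mask = np.random.rand(*x.shape) < p
    shuffled = x[np.random.permutation(len(x))]
    return np.where(mask, shuffled, x)

# Autoencoder: the target is the input itself, and the middle layer is narrower than
# the input, so the network is forced to compress -- i.e. to find structure in the data.
autoencoder = nn.Sequential(
    nn.Linear(100, 10),   # 100 inputs squeezed down to 10 activations
    nn.ReLU(),
    nn.Linear(10, 100),   # ...then reconstructed back to 100
)

x = torch.randn(256, 100)                                   # a fake batch of rows
noisy = torch.tensor(swap_noise(x.numpy()), dtype=torch.float32)
loss = F.mse_loss(autoencoder(noisy), x)                    # predict the *original* row
```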
So you know these are some interesting kind of directions that if you keep going with your machine learning studies 00:08:08.920 |
do part two of the deep learning course next year. 00:08:18.840 |
Feature engineering is going away, and this was just 00:08:21.440 |
Yeah, an hour ago — so this is very recent news indeed — but this is one of the most important 00:08:36.960 |
Simple logistic regression trained with SGD for MNIST 00:08:44.880 |
And here's the summary of where we got to: we had nearly built a module — 00:08:57.920 |
a model module — and a training loop from scratch, and we were going to kind of try and finish that. And after we finish that, 00:09:04.640 |
I'm then going to go through this entire notebook 00:09:06.640 |
Backwards right so having gone like top to bottom, but I'm going to go back through 00:09:17.600 |
handwritten nn.Module class we created. 00:09:22.280 |
We defined our loss we defined our learning rate, and we defined our optimizer 00:09:26.840 |
And this is the thing that we're going to try and write by hand in a moment 00:09:32.440 |
That and that we're still using from PyTorch, but that we've written ourselves, and this we've written ourselves. 00:09:38.760 |
So the basic idea was we're going to go through some number of epochs, so let's go through one epoch 00:09:43.460 |
Right, and we're going to keep track of, for each mini-batch, what the loss was, so that we can report it at the end. 00:09:51.840 |
We're going to turn our training data loader into an iterator 00:09:55.140 |
so that we can loop through it — loop through every mini-batch — and so now we can go ahead and say: for t in range of 00:10:01.680 |
the length of the data loader, and then we can call next to grab the next independent variables and dependent variables 00:10:11.300 |
From our data loader from that iterator, okay? 00:10:15.960 |
So then remember we can then pass the X tensor into our model by calling the model as if it was a function 00:10:23.560 |
But first of all we have to turn it into a variable 00:10:28.880 |
blah dot cuda to turn it into a variable — a shorthand for that is just the capital V. Now, 00:10:34.360 |
it's a capital T for a tensor, capital V for a variable — that's just a shortcut in fastai. 00:10:43.240 |
And so the next thing we needed was to calculate our loss 00:10:45.560 |
because we can't calculate the derivatives of the loss if we haven't calculated the loss. 00:10:50.940 |
So the loss takes the predictions and the actuals 00:10:54.240 |
Okay, so the actuals, again, are the Y tensor, and again we have to turn that into a variable. 00:10:59.760 |
Now can anybody remind me what a variable is and why we would want to use a variable here? 00:11:11.320 |
"I think once you turn it into a variable, then it tracks it, so then you can do .backward on that, so you can get it — 00:11:18.400 |
it can track, like, its process — like, you know, as the functions are chained within each other, 00:11:23.440 |
you can track it, and then when we do backward on it, it backpropagates and does the —" Yeah, right. So — 00:11:40.720 |
So there's actually a fantastic tutorial on the Pytorch website 00:11:45.480 |
So on the Pytorch website there's a tutorial section 00:11:56.680 |
And there's a tutorial there about autograd autograd is the name of the automatic 00:12:01.680 |
differentiation package that comes with PyTorch, and it's an implementation of automatic differentiation. And so the Variable class is 00:12:12.400 |
the key class here, because that's the thing that turns a tensor into something where we can keep track of its gradients. 00:12:19.240 |
So basically here they show how to create a variable do an operation to a variable 00:12:25.780 |
And then you can go back and actually look at the grad function 00:12:30.360 |
which is the function that it's keeping track of, basically, to calculate the gradient. Right, so as we do 00:12:38.160 |
more and more operations to this variable, and to the variables calculated from that variable, it keeps track of it, 00:12:45.320 |
So later on we can go dot backward and then print dot grad and find out the gradient 00:12:52.560 |
Right and so you notice we never defined the gradient. We just defined it as being x plus 2 00:12:58.520 |
Squared times 3 whatever and it can calculate the gradient 00:13:04.640 |
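
A tiny version of that autograd demo (a sketch; in older PyTorch you would wrap the tensor in a Variable, in current versions requires_grad=True on a plain tensor does the same job):

```python
import torch

x = torch.ones(2, 2, requires_grad=True)   # a tensor that tracks gradients
y = x + 2                                  # y remembers how it was made (y.grad_fn)
z = (y * y * 3).mean()                     # more operations; still being tracked

z.backward()    # apply the chain rule backwards through the recorded history
print(x.grad)   # dz/dx -- all 4.5 here -- even though we never wrote the derivative
```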
Okay, so that's why we need to turn that into a variable so L is now a 00:13:16.800 |
Variable containing the loss so it contains a single number for this mini batch 00:13:22.600 |
which is the loss for this mini-batch — but it's not just a number, it's a number as a variable, 00:13:29.160 |
So it's a number that knows how it was calculated all right 00:13:32.920 |
so we're going to append that loss to our array just so we can 00:13:39.920 |
And now we're going to calculate the gradient so L dot backward is the thing that says 00:13:46.840 |
calculate the gradient. So remember, when we call the network, it's actually calling our forward function — 00:13:55.560 |
so that's like: go through it forward — and then backward is like using the chain rule to calculate the gradients 00:14:02.400 |
Backwards okay, and then this is the thing we're about to write which is update the weights based on the gradients and the learning rate 00:14:11.840 |
Zero grad will explain when we write this out by hand 00:14:16.640 |
and so then at the end we can turn our validation data loader into an iterator and 00:14:35.840 |
Which thing did you predict which thing was actual and so check whether they're equal right and then the 00:14:44.480 |
mean of that is going to be our accuracy, okay? 00:14:51.760 |
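
Putting those pieces together, here is a minimal sketch of the loop just described. The names — net, loss, opt, md with its trn_dl / val_dl data loaders, and a score helper for the accuracy check — are assumed from the notebook, not defined by this snippet:

```python
import numpy as np
from torch.autograd import Variable

for epoch in range(1):
    losses = []
    dl = iter(md.trn_dl)                         # training data loader -> iterator
    for t in range(len(md.trn_dl)):
        xt, yt = next(dl)                        # next mini-batch
        y_pred = net(Variable(xt).cuda())        # forward pass
        l = loss(y_pred, Variable(yt).cuda())    # loss for this mini-batch
        losses.append(l)

        opt.zero_grad()                          # reset accumulated gradients
        l.backward()                             # backprop: fill in the gradients
        opt.step()                               # update the weights

    # accuracy on the validation set
    val_dl = iter(md.val_dl)
    val_scores = [score(*next(val_dl)) for i in range(len(md.val_dl))]
    print(np.mean(val_scores))
```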
"What's the advantage that you found converting it into an iterator, rather than, like, using a normal —?" 00:15:04.120 |
So it's still and this is a normal Python loop so the question really is like 00:15:12.560 |
The alternative perhaps you're thinking it would be like we could choose like a something like a list with an indexer 00:15:19.040 |
Okay, so you know the problem there is that we want 00:15:23.560 |
There were a few things. I mean, one key one is, each time we grab a new mini-batch, we want it to be random — 00:15:29.600 |
we want a different shuffled thing. So this 00:15:35.960 |
Forever you know you can loop through it as many times as you like so 00:15:40.120 |
There's this kind of idea. It's called different things in different languages 00:15:44.160 |
But a lot of languages are called like stream processing 00:15:47.480 |
And it's this basic idea that rather than saying I want the third thing or the ninth thing 00:15:51.720 |
It's just like I want the next thing right it's great for like network programming. It's like grab the next thing from the network 00:16:00.320 |
UI programming it's like grab the next event where somebody clicked a button it also turns out to be great for 00:16:06.360 |
This kind of numeric programming. It's like I just want the next batch of data 00:16:10.340 |
It means that the data like can be kind of arbitrarily long as we're describing one piece at a time 00:16:18.540 |
Yeah, so, you know — I mean, I guess the short answer is also because it's how PyTorch works: 00:16:27.440 |
PyTorch's data loaders are designed to be 00:16:30.460 |
Called in this way, and then so Python has this concept of a generator 00:16:38.480 |
Different type of generator. I wonder if this is gonna be a snake generator or a computer generator, okay? 00:16:44.480 |
A generator is a way that you can create a function that as it says behaves like an iterator 00:16:50.920 |
So like Python has recognized that this stream processing approach to programming is like super handy and helpful and 00:16:57.840 |
Supports it everywhere so basically anywhere that you use a for in loop anywhere you use a list comprehension 00:17:05.760 |
those things can always be generators or iterators, so by programming this way we just get a lot of 00:17:11.880 |
flexibility, I guess. Does that sound about right, Terrence? You're the programming language expert. 00:17:21.680 |
So Terrence actually does programming languages for a living so we should ask him 00:17:26.440 |
Yeah, I mean the short answer is what you said 00:17:32.400 |
"But in this case all that data has to be in memory anyway, because we've got —" No, it doesn't have to be in memory. 00:17:39.000 |
So in fact most of the time we could pull a mini batch from something in fact most of the time with pytorch 00:17:44.160 |
The mini batch will be read from like separate images spread over your disk on demand 00:17:51.800 |
But in general you want to keep as little in memory as possible at a time 00:17:56.440 |
And so the idea of stream processing also is great because you can do compositions you can 00:18:00.640 |
Pipe the data to a different machine you can yeah 00:18:05.000 |
You can grab the next thing from here and then send it off to the next stream which can then grab it and do something 00:18:09.400 |
Else which you guys all recognize of course in the command-line pipes and redirection 00:18:17.200 |
The benefit of working with people that actually know what they're talking about 00:18:21.840 |
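
As a toy illustration of the generator/iterator idea (plain Python, not fastai's DataLoader — just a sketch of "give me the next thing"):

```python
def batches(xs, bs):
    """Yield successive mini-batches of size bs -- the consumer just asks for the next one."""
    for i in range(0, len(xs), bs):
        yield xs[i:i + bs]

it = iter(batches(list(range(10)), 4))
print(next(it))    # [0, 1, 2, 3]
print(next(it))    # [4, 5, 6, 7]

# Anywhere a for-in loop or list comprehension works, a generator works too:
for b in batches(list(range(10)), 4):
    print(b)
```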
All right, so let's now take that and get rid of the optimizer 00:18:28.320 |
Okay, so the only thing that we're going to be left with is the negative log likelihood loss function 00:18:33.920 |
which we could also replace, actually — we have an 00:18:38.560 |
implementation of that from scratch, that Yannet wrote, in the 00:18:41.160 |
notebooks. So it's only one line of code; as we learned earlier, you can do it with a single if statement, okay? 00:18:48.340 |
So I don't know why I was so lazy as to not include this. 00:18:51.840 |
So what we're going to do is we're going to again grab this module that we've written ourselves the logistic regression module 00:18:58.880 |
We're going to have one epoch again. We're going to loop through each thing in our iterator again 00:19:05.020 |
we're going to grab our independent and dependent variables for the mini-batch again, 00:19:11.760 |
Calculate the loss, so this is all the same as before 00:19:14.360 |
But now we're going to get rid of this optimizer dot step 00:19:23.480 |
As I mentioned we're not going to do the calculus by hand so we'll call L dot backward to calculate the gradients automatically 00:19:30.940 |
and that's going to fill in the gradients on our weight matrix. So, do you remember when we created our — 00:19:40.760 |
here's that module we built — so the weight matrix for the 00:19:48.420 |
linear layer weights we called l1w, and for the bias we called l1b, right? So they were the attributes we created. 00:20:00.360 |
I've just put them into things called W and B just to save some typing basically so W is our weights 00:20:10.400 |
So the weights remember the weights are a variable and to get the tensor out of the variable 00:20:16.800 |
We have to use dot data right so we want to update the actual tensor that's in this variable, so we say weights dot data 00:20:22.920 |
Minus equals so we want to go in the opposite direction to the gradient the gradient tells us which way is up 00:20:36.400 |
times the learning rate so that is the formula for 00:20:43.200 |
All right, so as you can see, it's like as easy a thing as you can possibly imagine — 00:20:48.680 |
it's literally: update the weights to be equal to whatever they are now, minus the gradients 00:21:01.760 |
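
A sketch of that hand-written step, with w and b standing for the module's l1w / l1b parameter attributes, lr for the learning rate, and net, loss, Variable, xt, yt as in the loop above (names illustrative):

```python
l = loss(net(Variable(xt).cuda()), Variable(yt).cuda())
l.backward()                     # fills in w.grad and b.grad

w.data -= w.grad.data * lr       # step in the opposite direction to the gradient
b.data -= b.grad.data * lr

w.grad.data.zero_()              # reset the accumulated gradients before the next mini-batch
b.grad.data.zero_()
```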
So anybody have any questions about that step in terms of like why we do it or how did you have a question? 00:21:07.960 |
"So that step — but when we do the next on the DL —" 00:21:14.720 |
So when it is the end of the loop. How do you grab the next element? 00:21:23.880 |
Each index in range of length, so this is going 0 1 2 3 at the end of this loop 00:21:29.960 |
It's going to print out the mean of the validation set go back to the start of the epoch at which point 00:21:39.680 |
Okay, so basically, behind the scenes in Python, when you call iter 00:21:44.440 |
on this, it basically tells it to, like, reset its state — to create a new iterator. 00:21:58.680 |
The code is all, you know, available for you to look at. So we could look at, like — md dot trn 00:22:07.680 |
dl is a fastai.dataset ModelDataLoader, so we could, like, take a look at the code of that, 00:22:18.560 |
And see exactly how it's being built right and so you can see here that here's the next function 00:22:27.440 |
Keeping track of how many times it's been through in the self dot I 00:22:31.240 |
And here's the __iter__ function, which is the thing that gets called when you create a new iterator, 00:22:37.200 |
And you can see it's basically passing it off to something else 00:22:39.760 |
Which is a type data loader and then you can check out data loader if you're interested to see how that's implemented 00:22:49.240 |
Basically uses multi-threading to allow it to have multiple of these going on at the same time 00:22:55.120 |
It's actually great — it's really simple; it's only about a screen full of code. 00:23:00.340 |
So if you're interested in simple multi-threaded programming. It's a good thing to look at 00:23:10.160 |
Why have you wrapped this in a for epoch in range one since that'll only run once? 00:23:16.240 |
Because in real life we would normally be running multiple epochs 00:23:20.520 |
So like in this case because it's a linear model it actually basically trains to 00:23:26.880 |
As good as it's going to get in one epoch so if I type three here 00:23:34.200 |
It actually won't really improve after the first epoch much at all as you can see right 00:23:41.280 |
But when we go back up to the top we're going to look at some slightly deeper and more interesting 00:23:46.480 |
Versions which will take more epochs, so you know if I was turning this into a into a function 00:23:52.380 |
you know, I'd be going, like, you know, def train_model, 00:23:56.800 |
And one of the things you would pass in is like number of epochs 00:24:13.720 |
When you're you know creating these neural network layers 00:24:20.880 |
As far as PyTorch is concerned, this is just an nn.Module. 00:24:25.000 |
We could be using it as a layer, we could be using it as a function, 00:24:28.600 |
we could be using it as a neural net — PyTorch doesn't think of those as different things, right? 00:24:33.760 |
So this could be a layer inside some other network, right? 00:24:38.080 |
So how do gradients work? So if you've got a layer — which, remember, is just a bunch of — we can think of it basically 00:24:44.160 |
as its activations, right, or some activations that get computed through some 00:24:48.880 |
non-linear activation function, or through some linear function, and 00:24:57.840 |
it's very likely that we're then, let's say, putting it through a matrix product, right? 00:25:08.560 |
So each one of these so if we were to grab like 00:25:11.160 |
One of these activations right is actually going to be 00:25:27.000 |
The derivative you have to know how this weight matrix 00:25:34.880 |
Impacts that output and that output and that output and that output 00:25:38.960 |
Right and then you have to add all of those together to find like the total impact of this 00:25:51.800 |
You have to tell it when to set the gradients to zero 00:25:56.680 |
Right because the idea is that you know you could be like having lots of different loss functions or lots of different outputs in your next 00:26:02.560 |
Activation set of activations or whatever all adding up 00:26:06.600 |
Increasing or decreasing your gradients right so you basically have to say okay. This is a new 00:26:15.720 |
Reset okay, so here is where we do that right so before we do L dot backward we say 00:26:25.320 |
Let's take the gradients. Let's take the tensor that they point to and 00:26:31.000 |
then zero underscore — does anybody remember from last week what underscore does as a suffix in PyTorch? 00:26:42.680 |
"I forgot the word, but basically it changes it in place." Right — the word is "in place", yeah. 00:26:50.760 |
Exactly so it sounds like a minor technicality 00:26:55.000 |
But it's super useful to remember every function pretty much has an underscore version suffix 00:27:05.840 |
Tensor of zeros of a particular size so zero underscore means replace the contents of this with a bunch of zeros, okay? 00:27:18.340 |
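
A quick sketch of the in-place suffix convention:

```python
import torch

t = torch.ones(3)
t.add(1)       # returns a new tensor; t itself is unchanged
t.add_(1)      # trailing underscore = in place; t is now tensor([2., 2., 2.])
t.zero_()      # in place again; t is now tensor([0., 0., 0.])
# ...which is what w.grad.data.zero_() does to the gradient tensor between mini-batches.
```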
That's it right, so that's like SGD from scratch 00:27:24.240 |
And if I get rid of my menu bar we can officially say it fits within a screen, okay? 00:27:31.360 |
Of course, we haven't got our definition of logistic regression here — that's another half a screen — but basically there's not much to it. 00:27:39.160 |
"So later on, if we have to do this more — the gradient — is it because you might find, like, a wrong 00:27:44.920 |
minimum, a local minimum, and that way you have to kick it out? 00:27:47.560 |
And is that why you have to do it multiple times, when the surface gets more —" Why do you need multiple epochs? 00:27:51.680 |
Is that your question? Well, I mean, a simple way to answer that would be: let's say our learning rate was tiny. 00:28:04.920 |
Right there's nothing that says going through one epoch is enough to get you all the way there 00:28:09.800 |
So then you'd be like okay. Well, let's increase our learning rate, and it's like yeah, sure 00:28:13.960 |
we'll increase our learning rate — but who's to say that the highest learning rate that learns stably is enough to 00:28:21.120 |
Learn this as well as it can be learned and for most data sets for most architectures one epoch is 00:28:36.680 |
they're very nicely behaved, you know, so you can often use higher learning rates and learn more quickly. Also they — 00:28:43.520 |
you can't, like, generally get as good an accuracy, 00:28:48.380 |
so there's not as far to take them either. So, yeah, doing one epoch is going to be the rarity. All right, 00:28:56.680 |
so going backwards, we're basically going to say: all right, let's not write 00:29:01.200 |
those two lines again and again and again; let's not write those three lines again and again and again. 00:29:09.960 |
So the only difference between that version and this version is, rather than saying .zero_ ourselves, 00:29:16.800 |
rather than saying minus gradient times lr ourselves 00:29:32.280 |
the weights, is actually pretty inefficient. It doesn't take advantage of 00:29:43.520 |
In the DL course we learn about how to do momentum from scratch as well, okay. So 00:30:01.680 |
learns much slower. So now that I've typed just plain old SGD here, this is now literally doing exactly the same thing 00:30:07.680 |
As our slow version so I have to increase the learning rate 00:30:11.880 |
Okay there we go so this this is now the same as the the one we wrote by hand 00:30:23.800 |
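
The same step with the hand-written pieces handed off to torch.optim (a sketch; net, loss, md, Variable as before):

```python
from torch import optim

opt = optim.SGD(net.parameters(), lr=1e-1)   # plain SGD, no momentum

for xt, yt in iter(md.trn_dl):
    l = loss(net(Variable(xt).cuda()), Variable(yt).cuda())
    opt.zero_grad()                          # replaces zeroing each .grad by hand
    l.backward()
    opt.step()                               # replaces w.data -= w.grad.data * lr for every parameter
```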
Let's do a little bit more stuff automatically 00:30:29.020 |
let's not, you know — given that every time we train something, we have to loop through epoch, 00:30:37.640 |
loop through batch, do forward, get the loss, zero the gradient, do backward, do a step of the optimizer — 00:30:51.000 |
All right there it is okay, so let's take a look at fit 00:31:01.980 |
Fit go through each epoch go through each batch 00:31:14.440 |
Keep track of the loss and at the end calculate the validation all right and so then 00:31:23.200 |
So if you're interested in looking at this this stuff's all inside fastai.model 00:31:46.040 |
Zero the gradients calculate the loss remember PyTorch tends to call it criterion rather than loss 00:31:55.720 |
And then there's something else we haven't learned here, but we do learn the deep learning course 00:31:59.520 |
which is gradient clipping, so you can ignore that. 00:32:01.720 |
All right, so you can see now like all the stuff that we've learnt when you look inside the actual frameworks 00:32:15.000 |
So then the next step would be like okay. Well this idea of like having some 00:32:19.640 |
weights and a bias and doing a matrix product and addition — 00:32:30.520 |
let's put that in a function. And then the very idea of, like, first doing this and then doing that — 00:32:36.800 |
this idea of, like, chaining functions together — let's put that into a function. And 00:32:46.720 |
okay, so Sequential simply means: do this function, take the result, send it to this function, etc., right? 00:32:55.080 |
And linear means create the weight matrix create the biases 00:33:05.400 |
So we can then you know as we started to talk about like turn this into a deep neural network 00:33:13.800 |
by saying you know rather than sending this straight off into 00:33:19.100 |
10 activations, let's put it into, say, 100 activations — we could pick whatever number we like — 00:33:30.060 |
put it through another linear layer, another ReLU, and then our final output with our final activation function. Right, and so this is now 00:33:54.940 |
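
A sketch of that deeper version: 28×28 inputs into 100 activations, another 100, then 10 outputs, with ReLUs in between and log-softmax as the final activation:

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(28 * 28, 100),   # weight matrix + bias
    nn.ReLU(),
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 10),
    nn.LogSoftmax(dim=-1),     # final activation for the 10 digit classes
)
```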
I'm actually going to run a few more epochs right and you can see the accuracy 00:34:00.740 |
Increasing right so if you try and increase the learning rate here, it's like zero point one 00:34:14.740 |
This is called learning rate annealing and the trick is this 00:34:20.860 |
Trying to fit to a function right you've been taking a few steps 00:34:25.740 |
Step step step as you get close to the middle like get close to the bottom 00:34:32.900 |
Your steps probably want to become smaller right otherwise what tends to happen is you start finding you're doing this 00:34:40.100 |
All right, and so you can actually see it here, right? They've got 93, 94 and a bit, 94.6, 00:34:48.100 |
94.8 — like, it's kind of starting to flatten out. 00:34:50.820 |
Right now that could be because it's kind of done as well as it can 00:34:55.420 |
Or it could be that it's going to going backwards and forwards 00:34:58.620 |
So what is a good idea is, later on in training, to decrease your learning rate and take smaller steps. 00:35:07.100 |
Okay, that's called learning rate annealing. So there's a function in fastai called set learning rates — 00:35:12.780 |
you can pass in your optimizer and your new learning rate and 00:35:16.540 |
You know see if that helps right and very often it does 00:35:27.780 |
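
The set-learning-rates helper mentioned here boils down to editing the optimizer's parameter groups; a minimal sketch in plain PyTorch (not the fastai source):

```python
def set_lr(opt, lr):
    # every parameter group carries its own learning rate; overwrite them all
    for pg in opt.param_groups:
        pg['lr'] = lr

set_lr(opt, 1e-2)   # e.g. drop from 0.1 to 0.01 later in training
```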
In the deep learning course we learn a much, much better technique than this, to do this all automatically and at a more granular 00:35:34.460 |
level. But if you're doing it by hand, you know, like an order of magnitude at a time is what — 00:35:42.060 |
So you'll see people in papers talk about learning rate schedules 00:35:46.780 |
this is like a learning rate schedule. So this schedule — just a moment, Erica, 00:35:51.140 |
I'll just come to Ernest first — has got us to 97, right? And I tried 00:35:55.720 |
Kind of going further and we don't seem to be able to get much better than that 00:35:59.860 |
So yeah, so here we've got something where we can get 97 percent 00:36:04.380 |
accuracy. Yes, Erica? "So it seems like you changed the learning rate —" 00:36:11.820 |
ten times smaller than we started with — so we had 0.1, now it's 0.01. Yeah. 00:36:15.780 |
But that makes the whole model train really slow 00:36:19.540 |
So I was wondering if you can make it so that it changes dynamically as it approaches 00:36:24.180 |
closer to the minima?" Yeah, pretty much. Yeah, so that's some of the stuff we learn in the deep learning course. 00:36:34.140 |
"So how is it different from using the Adam optimizer or something?" That's the kind of stuff we can do — 00:36:39.780 |
I mean you still need annealing as I say we do this kind of stuff in the deep learning course 00:36:43.780 |
So for now, we're just going to stick to standard SGD. I 00:36:46.540 |
Had a question about the data loading. Yeah, I know it's a fast AI function 00:36:53.580 |
But could you go into a little bit detail of how it's creating batches how it's learning data and how it's making those decisions 00:37:03.460 |
It would be good to ask that on Monday night, so we can talk about it in detail in the deep learning class. 00:37:16.300 |
where they basically say: let's create a thing called a Dataset. 00:37:21.140 |
Right and a data set is basically something that looks like a list. It has a length 00:37:28.740 |
right and so that's like how many images are in the data set and it has the ability to 00:37:35.780 |
Index into it like a list right so if you had like D equals data set 00:37:41.820 |
You can do length D, and you can do D of some index right that's basically all the data set 00:37:47.860 |
Is as far as pytorch is concerned and so you start with a data set, so it's like okay? 00:37:53.220 |
D 3 gives you the third image. You know or whatever 00:37:58.140 |
And so then the idea is that you can take a data set and you can pass that into a constructor for a data loader 00:38:12.020 |
That gives you something which is now iterable, right? So you can now say iter on the 00:38:17.220 |
DL, and that's something that you can call next on, and 00:38:23.660 |
What that now is going to do is if when you do this you can choose to have shuffle on or shuffle off shuffle on 00:38:31.060 |
Means give me random mini-batch shuffle off means go through it sequentially 00:38:38.980 |
What the data loader does now, when you say next — basically, assuming you said shuffle equals true — is it's going to grab, 00:38:45.220 |
You know if you've got a batch size of 64 64 random integers between 0 and length and call this 00:38:53.220 |
64 times to get 64 different items and jam them together 00:39:06.540 |
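
A minimal sketch of the two concepts in plain PyTorch (illustrative data): a Dataset is anything with a length and integer indexing; a DataLoader wraps it into an iterable of (optionally shuffled) mini-batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __len__(self):              # "how many items are in the data set"
        return len(self.x)
    def __getitem__(self, i):       # "give me the i-th item"
        return self.x[i], self.y[i]

d = ArrayDataset(torch.randn(1000, 28 * 28), torch.randint(0, 10, (1000,)))
dl = DataLoader(d, batch_size=64, shuffle=True)   # shuffle=True -> random mini-batches

xb, yb = next(iter(dl))
print(xb.shape, yb.shape)   # torch.Size([64, 784]) torch.Size([64])
```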
We just do some of the details differently so specifically particularly with computer vision 00:39:15.940 |
there's so much pre-processing — data augmentation like flipping, changing the colors a little bit, rotating — those turn out to be really 00:39:23.340 |
computationally expensive; even just reading the JPEGs turns out to be computationally expensive. 00:39:27.820 |
So PyTorch uses an approach where it fires off multiple processes to do that in parallel, 00:39:34.020 |
whereas the fastai library instead does something called multi-threading, which can be a much faster way of doing it. 00:39:46.140 |
"So an epoch — is it really an epoch in the sense that all of the elements — so it shuffles at the beginning of the 00:39:53.900 |
epoch, something like that?" Yeah, yeah — I mean, not all libraries work the same way; some do sampling with replacement. 00:40:01.660 |
We actually — the fastai library hands the shuffling off to the actual PyTorch version, 00:40:09.260 |
and I believe the PyTorch version, yeah, actually shuffles, and an epoch covers everything once, I believe. 00:40:15.220 |
Okay, now the thing is when you start to get these bigger networks 00:40:25.100 |
Potentially you're getting quite a few parameters 00:40:32.860 |
I want to ask you to calculate how many parameters there are — but let's remember, here we've got 00:40:37.880 |
28 by 28 input into 100 output and then 100 into 100 and then 100 into 10 00:40:44.740 |
all right, and then for each of those we've got weights and biases. 00:40:56.180 |
returns a list where each element of the list is a matrix — actually a tensor — of 00:41:02.100 |
the parameters for that — not just for that layer: 00:41:05.860 |
if it's a layer with both weights and biases, that would be two parameter tensors, right? 00:41:09.820 |
So it basically returns us a list of all of the tensors containing the parameters. 00:41:14.980 |
numel in PyTorch tells you how big that is, right? So if I run this — 00:41:27.900 |
So I've got seven hundred and eighty four inputs and the first layer has a hundred outputs 00:41:32.900 |
So therefore the first weight matrix is of size seventy eight thousand four hundred 00:41:37.300 |
Okay, and the first bias vector is of size a hundred and then the next one is a hundred by a hundred 00:41:42.900 |
Okay, and there's a hundred and then the next one is a hundred by ten, and then there's my bias, okay? 00:41:48.820 |
So there's the number of elements in each layer, and if I add them all up. It's nearly a hundred thousand 00:41:54.420 |
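
The count just described, as a one-liner over the 784→100→100→10 net defined above: net.parameters() yields one tensor per weight matrix or bias vector, and numel() gives the number of elements in each.

```python
sizes = [p.numel() for p in net.parameters()]
print(sizes)        # [78400, 100, 10000, 100, 1000, 10]
print(sum(sizes))   # 89,610 -- "nearly a hundred thousand" parameters
```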
Okay, and so I'm possibly at risk of overfitting. Yeah, all right, so 00:42:01.620 |
We might want to think about using regularization 00:42:05.020 |
So a really simple common approach to regularization in all of machine learning 00:42:19.980 |
It's super important super handy. You can use it with just about anything right and the basic idea 00:42:31.540 |
L2 regularization — the basic idea is this. Normally we'd say our loss is 00:42:35.700 |
equal to — let's just do RMSE to keep things kind of simple — 00:42:39.660 |
it's equal to our predictions minus our actuals, 00:42:43.180 |
you know, squared, and then we sum them up and take the average. 00:42:52.620 |
What if we then want to say you know what like if I've got lots and lots of parameters? 00:42:58.660 |
Don't use them unless they're really helping enough right like if you've got a million parameters, and you only really needed 10 00:43:05.940 |
Parameters to be useful just use 10 right so how could we like tell the loss function to do that? 00:43:12.820 |
And so basically what we want to say is hey if a parameter is zero 00:43:17.220 |
That's no problem. It's like it doesn't exist at all so let's penalize a parameter 00:43:26.740 |
Right so what would be a way we could measure that? 00:43:29.940 |
How can we like calculate how unzero our parameters are 00:43:42.940 |
"You calculate the average of all the parameters." That's my first — it can't quite be the average... 00:43:53.780 |
Close yes, Taylor. Yeah. Yes, you figured it out. Okay? 00:43:59.900 |
Assuming all of our data has been normalized standardized however you want to call it 00:44:03.900 |
we want to check that they're, like, significantly different from zero, right?" Would that be — not the data, the parameters, 00:44:09.460 |
rather, would be significantly — and the parameters don't have to be normalized or anything; they're just calculated, right? 00:44:14.780 |
"Yeah, so significantly different from zero, right, as well — 00:44:17.340 |
I just meant, assuming that the data has been normalized, so that we can compare them." Oh, yeah, got it. Yeah, right. 00:44:23.820 |
And then those that are not significantly different from zero we can probably just drop 00:44:28.460 |
And I think Chen she's going to tell us how to do that. You just figured it out, right? 00:44:31.380 |
"The mean of the absolute —" We could do that; that would be called L1, which is great. So L1 00:44:43.180 |
is the absolute value of the weights, averaged; L2 is actually the sum 00:44:51.060 |
Yeah, yeah exactly so we just take this we can just we don't even have to square root 00:44:55.340 |
So we just take the squares of the weights themselves, and then like we want to be able to say like okay 00:45:06.580 |
Not being zero right because if we actually don't have that many parameters 00:45:10.740 |
We don't want to regularize much at all if we've got heaps. We do want to regularize a lot right so then we put a 00:45:18.580 |
Parameter yeah, right except I have a rule in my classes. Which is never to use Greek letters, so normally people use alpha 00:45:27.540 |
So this is some number, which you often see something around kind of 1e-6 to 1e-4. 00:45:48.020 |
When you think about it, we don't actually care about the loss other than like maybe to print it out 00:45:51.800 |
All we actually care about is the gradient of the loss 00:46:07.220 |
Right so there are two ways to do this we can actually modify our loss function to add in this square 00:46:18.340 |
We could modify that thing where we said weights equals weights minus 00:46:23.760 |
Gradient times learning rate to subtract that 00:46:34.780 |
These are roughly — these are kind of basically equivalent, but they have different names. This is called L2 regularization, 00:46:34.780 |
was how it was first posed in the neural network literature, whereas this other version is kind of 00:46:56.140 |
how it was posed in the statistics literature, and, yeah, you know, they're equivalent. 00:47:03.060 |
As we talked about in the deep learning class it turns out 00:47:06.380 |
they're not exactly equivalent, because when you have things like momentum and Adam it can behave differently. And two weeks ago a researcher 00:47:16.820 |
showed how to do proper weight decay in modern optimizers, and one of our fastai students just implemented that in the fastai library. 00:47:39.380 |
But actually, it turns out, based on this paper from two weeks ago, it's actually L2 regularization — 00:47:43.980 |
it's not quite correct, but it's close enough. So here we can say weight decay is 1e-3, 00:47:48.780 |
so it's going to set our penalty multiplier a to 1e-3, and it's going to add that to the loss function. 00:47:57.020 |
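
In symbols, the penalized loss is roughly loss = MSE(preds, actuals) + a * sum(w^2), and differentiating the penalty gives the weight-decay form of the update, w <- w - lr * (grad + 2*a*w). A sketch of both routes, with net, loss, x, y, optim as in the snippets above (PyTorch's optimizers fold the penalty into the update via their weight_decay argument, equivalent up to constant factors):

```python
# Route 1 -- weight decay: let the optimizer apply the penalty inside each update step
opt = optim.SGD(net.parameters(), lr=1e-1, weight_decay=1e-3)

# Route 2 -- L2 regularization: add the penalty term to the loss yourself
a = 1e-3
l = loss(net(x), y) + a * sum((p ** 2).sum() for p in net.parameters())
```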
Okay, and so let's make a copy of these cells 00:48:01.020 |
Just so we can compare hope this actually works 00:48:06.180 |
Okay, and we'll set this running okay, so this is now optimizing 00:48:13.460 |
If you're actually — so, I've made a mistake here, which is I didn't rerun 00:48:17.820 |
this cell. This is an important thing to kind of remember: since I didn't rerun this cell 00:48:23.340 |
here, when it created the optimizer and said net.parameters, 00:48:27.700 |
It started with the parameters that I had already trained right so I actually hadn't recreated my network 00:48:33.820 |
Okay, so I actually need to go back and rerun this cell first to recreate the network 00:48:49.500 |
So you might notice some notice something kind of kind of counterintuitive here 00:49:02.580 |
that's our training error right now. You would expect our training error with regularization to be worse — 00:49:10.860 |
that makes sense, right? Because we're, like, we're penalizing — 00:49:26.980 |
So the reason that can happen is that if you have a function 00:49:35.540 |
Right it takes potentially a really long time to train 00:49:38.900 |
or else if you have a function that kind of looks more like 00:49:45.740 |
And there are certain things that you can do which sometimes just like can take a function 00:49:50.820 |
That's kind of horrible and make it less horrible, and it's sometimes weight decay can actually 00:49:56.500 |
Make your functions a little more nicely behaved, and that's actually happened here 00:50:01.060 |
So like I just mentioned that to say like don't let that confuse you right like weight decay really does 00:50:07.060 |
penalize the training set. And so, strictly speaking, 00:50:10.140 |
the final number we get to for the training set shouldn't end up being better. 00:50:26.260 |
"I don't get it. Okay — why is it making it faster? Like, does the time matter — like the training time?" 00:50:32.500 |
No — this is after one epoch. Yeah, right, so after one epoch. 00:50:38.020 |
Now congratulations for saying I don't get it. That's like the best thing anybody can say you know so helpful 00:50:58.420 |
Okay, and this here is our training with weight decay, okay, so this is not related to time 00:51:08.420 |
Right — after one epoch, my claim was that you would expect the training set, all other things being equal, 00:51:23.260 |
to be worse, because we're penalizing it — you know, this has no penalty, this has a penalty, so the thing with a penalty should be worse. And 00:51:37.980 |
Because in a single epoch it matters a lot as to whether you're trying to optimize something 00:51:44.380 |
That's very bumpy or whether you're trying to optimize something. That's kind of nice and smooth 00:51:50.340 |
If you're trying to optimize something that's really bumpy like imagine in some high-dimensional space, right? 00:51:56.220 |
You end up kind of rolling around through all these different tubes and tunnels and stuff 00:52:01.940 |
you know — or else, if it's just smooth, you just go boom. 00:52:04.800 |
It's like, imagine a marble rolling down a hill, where one of them, you've got, like — 00:52:09.980 |
there's a road called Lombard Street in San Francisco — it's like backwards, forwards, backwards, forwards; 00:52:15.260 |
it takes a long time to drive down the road, right? 00:52:17.980 |
Whereas, you know, if you kind of took a motorbike and just went straight over the top, you're just going boom, right. So 00:52:23.500 |
so the kind of shape of the loss function surface, 00:52:28.580 |
you know, impacts — or kind of defines — how easy it is to optimize, and therefore how 00:52:34.500 |
far it can get in a single epoch. And based on these results, 00:52:39.100 |
it would appear that weight decay here has made this function easier to optimize. 00:52:48.180 |
"The penalizing is making the optimizer more likely to reach the global minimum?" 00:52:54.120 |
No, I wouldn't say that my claim actually is that at the end 00:52:58.180 |
it's probably going to be less good on the training set — and indeed, that does look to be the case at the end: 00:53:07.900 |
the training set is now worse with weight decay. Now, that's what I would expect, right? 00:53:12.820 |
I would expect like if you actually find like I never use the term global optimum because 00:53:17.300 |
It's just not something we have any guarantees about we don't really care about we just care like where do we get to after? 00:53:25.620 |
We hope that we found somewhere. That's like a good solution 00:53:28.660 |
And so by the time we get to like a good solution the training set with weight decay the loss is worse 00:53:41.860 |
Right because we penalized the training set in order to kind of try and create something that generalizes better 00:53:49.900 |
You know that the parameters that are kind of pointless are now zero and it generalizes better 00:53:54.060 |
Right — so all I'm saying is that it just got to a good point. 00:54:09.700 |
"But if — by 'it' you mean just weight decay — do you always make the function surface smoother?" 00:54:14.020 |
No, it's not always true, but it's like it's worth remembering that 00:54:21.620 |
if you're having trouble training a function adding a little bit of weight decay may 00:54:29.780 |
"So by regularizing the parameters, what it does is it smooths out the loss —" 00:54:39.740 |
you know the reason why we do it is because we want to penalize things that aren't zero to say like 00:54:44.780 |
Don't make this parameter a high number unless it's really helping the loss a lot right set it to zero if you can 00:54:51.660 |
Because setting as many parameters to zero as possible means it's going to generalize better, right? 00:54:59.060 |
network, right? So that's why we do it. 00:55:07.800 |
So let's — okay, just one moment. Okay, so I just wanted to check how we actually went here. 00:55:13.180 |
So after the second epoch — yeah, so you can see here, it really has helped, right? After the second epoch, 00:55:17.780 |
Before we got to 97% accuracy now. We're nearly up to about 98% accuracy 00:55:23.660 |
Right and you can see that the loss was 0.08 versus 0.13 right so adding regularization 00:55:38.340 |
Solution yes Erica, so there are two pieces to this right one is L2 regularization and the weight decay 00:55:47.500 |
No, there's so my claim was they're the same thing, right? 00:55:50.940 |
So weight decay is the version if you just take the derivative of L2 regularization you get weight decay 00:55:58.020 |
So you can implement it either by changing the loss function with a squared loss penalty, 00:56:06.540 |
or by adding the weights themselves as part of the gradient, okay? 00:56:11.820 |
Yeah, I was just going to finish the questions. Yes. Okay pass it to division 00:56:16.820 |
"Can we use regularization on a convolutional layer as well?" Absolutely — a convolutional layer is just weights, so yep. 00:56:28.140 |
And Jeremy can you explain why you thought you needed weight decay in this particular problem? 00:56:34.580 |
Not easily — I mean, other than to say it's something that I would always try. "You were overfitting, found —" Well, yeah, I mean, okay, so — 00:56:45.220 |
even if I — yeah, okay, that's a good point, Yannet. So if my training loss 00:56:56.660 |
was higher than my validation loss, then I'm underfitting, 00:57:00.340 |
Right, so there's definitely no point regularizing right if like that would always be a bad thing 00:57:06.920 |
That would always mean you need like more parameters in your model 00:57:09.900 |
In this case, I'm overfitting — that doesn't necessarily mean regularization will help, but it's certainly worth trying. 00:57:18.620 |
Thank you, and that's a great point. There's one more question. Yeah 00:57:24.620 |
"So how do you choose the optimal number of epochs?" 00:57:37.140 |
It's — that's a long story, and lots and lots of — 00:57:41.820 |
it's a bit of both. We just don't — as I say, we don't have time to cover 00:57:50.900 |
best practices in this class; we're going to learn the kind of fundamentals. Yeah, okay, so let's take a — 00:58:12.020 |
So something that we cover in great detail in the deep learning course 00:58:18.060 |
but it's, like, really important to mention here: the secret, in my opinion, to kind of modern machine learning techniques is to massively over-parameterize 00:58:28.900 |
the solution to your problem, right? Like, as we've done here — you know, we've got like a hundred thousand weights 00:58:34.740 |
when we only had a small number of 28 by 28 images, 00:58:49.500 |
statistics and learning was done for decades before 00:58:55.580 |
senior lecturers at most universities, in most areas, have this background where they've learned the correct way to build a model is 00:59:05.460 |
Right and so hopefully we've learned two things so far. You know one is we can build 00:59:11.780 |
Very accurate models even when they have lots and lots of parameters 00:59:17.420 |
Like a random forest has a lot of parameters and you know this here deep network has a lot of parameters 00:59:25.860 |
And we can do that by either using bagging or by using 00:59:34.180 |
Okay, and regularization in neural nets means either weight decay or 00:59:42.660 |
dropout, which we won't worry too much about here. 00:59:55.260 |
Building useful models and like I just wanted to kind of warn you that once you leave this classroom 01:00:02.020 |
like, even possibly when you go to the next faculty member's talk — like, there'll be people at USF as well who — 01:00:13.340 |
models with small numbers of parameters — you know, your next boss is very likely to have been trained in the world of, like, models 01:00:23.780 |
being more pure or easier or better or more interpretable or whatever. I 01:00:29.140 |
am convinced that that is not true — probably not ever true, certainly very rarely true. 01:00:43.260 |
Models with lots of parameters can be extremely interpretable as we learn from our whole lesson of random forest interpretation 01:00:50.980 |
You can use most of the same techniques with neural nets, but with neural nets are even easier right remember how we did feature importance 01:01:01.020 |
randomizing a column to see how changes in that column would impact the output? 01:01:04.860 |
Well, that's just like a kind of dumb way of calculating its gradient — 01:01:10.100 |
how much does varying this input change the output? With a neural net we can actually calculate its gradient. 01:01:15.340 |
Right so with PI torch you could actually say what's the gradient of the output with respect to this column? 01:01:27.900 |
And you know I'll mention for those of you interested in making a real impact 01:01:35.020 |
Basically any of these things the neural nets all right so that that that whole area 01:01:40.180 |
Needs like libraries to be written blog posts to be written 01:01:46.620 |
But only in very narrow domains like computer vision as far as I know nobody's written the paper saying 01:01:54.420 |
Neural networks you know interpretation methods 01:02:03.620 |
So what we're going to do though is we're going to start with applying this 01:02:13.460 |
And this is mildly terrifying for me because we're going to do NLP and our NLP 01:02:17.980 |
Faculty expert is in the room so David just yell at me if I screw this up too badly 01:02:22.500 |
And so NLP refers to you know any any kind of modeling where we're working with with natural language text 01:02:40.860 |
a linear model is pretty close to the state-of-the-art for solving a particular problem. It's actually something where I 01:02:50.020 |
actually surpassed the state-of-the-art in this, using a — 01:02:56.980 |
but this is actually going to show you pretty close to the state-of-the-art with a linear model. 01:03:03.200 |
We're going to be working with the IMDB data set — so this is a data set of movie reviews. 01:03:16.000 |
Once you download it you'll see that you've got a train and a test 01:03:26.000 |
In your train directory you'll see there's a negative and a positive directory and in your positive directory 01:03:39.920 |
So somehow we've managed to pick out a story of a man who has unnatural feelings for a pig as our first choice 01:03:48.680 |
So we're going to look at these movie reviews 01:03:56.280 |
And for each one, we're going to look to see whether they were positive or negative 01:04:00.000 |
So they've been put into one of these folders; they were downloaded from IMDB, the movie database and review site. 01:04:06.840 |
The ones that were strongly positive went in 'positive', strongly negative went in 'negative', and the rest they didn't label at all. 01:04:14.600 |
So these are only highly polarized reviews so in this case. You know 01:04:18.080 |
We have an insane violent mob which unfortunately just too absurd 01:04:23.920 |
Too off-putting those in the area we turned off so the label for this was a zero which is 01:04:39.280 |
In the fastai library there's lots of little things for 01:04:44.600 |
most kinds of domains that you do machine learning on; for NLP, one of the simple things we have is texts from folders. 01:04:51.600 |
That's just going to go ahead and go through and find all of the folders in here 01:04:56.200 |
With these names and create a labeled data set and you know don't let these things 01:05:04.120 |
Ever stop you from understanding. What's going on behind the scenes? 01:05:07.680 |
Right, we can grab its source code, and as you can see it's tiny — you know, it's like five lines. 01:05:12.860 |
Okay, so I don't like to write these things out in full, 01:05:16.460 |
you know, but hide them behind little functions so you can reuse them. 01:05:19.880 |
But basically it's just going to go through each directory, and then within that, go through each file, and append its text to 01:05:34.240 |
this array of texts, and figure out what folder it's in and stick that into the array of labels, okay. So 01:05:41.280 |
that's how we basically end up with something where we have an array of 01:05:47.520 |
the reviews and an array of the labels, okay — so that's our data, so our job will be to take the reviews and predict the labels. 01:05:59.120 |
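
A sketch of what a "texts from folders" helper like that does (illustrative, not the actual fastai source; the path is hypothetical): walk the neg and pos sub-folders and build parallel arrays of review texts and 0/1 labels.

```python
import os
import numpy as np

def texts_from_folders(path, folders=('neg', 'pos')):
    texts, labels = [], []
    for label, folder in enumerate(folders):        # 0 = negative, 1 = positive
        for fname in sorted(os.listdir(os.path.join(path, folder))):
            with open(os.path.join(path, folder, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(label)
    return np.array(texts), np.array(labels)

trn_texts, trn_labels = texts_from_folders('data/aclImdb/train')   # hypothetical path
val_texts, val_labels = texts_from_folders('data/aclImdb/test')
```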
Okay, and the way we're going to do it is we're going to throw away 01:06:04.920 |
Like all of the interesting stuff about language 01:06:08.640 |
Which is the order in which the words are in right now? This is very often not a good idea 01:06:15.480 |
But in this particular case it's going to turn out to work like not too badly 01:06:19.040 |
So let me show what I mean by, like, throwing away the order of the words. Like, normally the order of the words matters a lot — 01:06:26.280 |
if "not" comes before something, then that "not" refers to that thing, right? But the thing is, in this case, 01:06:32.840 |
We're trying to predict whether something's positive or negative if you see the word absurd appear a lot 01:06:38.040 |
Right then maybe that's a sign that this isn't very good 01:06:44.600 |
so, you know, "cryptic" — maybe that's a sign that it's not very good. So the idea is that we're going to turn it into something called a term-document matrix, 01:06:53.960 |
where for each document — i.e. each review — we're going to create a list of what words are in it, 01:06:58.880 |
Rather than what order they're in so let me give an example 01:07:12.640 |
This movie is good. The movie is good. They're both positive this movie is bad. The movie is bad 01:07:18.360 |
They're both negative right so I'm going to turn this into a term document matrix 01:07:23.400 |
So the first thing I need to do is create something called a vocabulary a vocabulary is a list of all the unique words 01:07:28.960 |
That appear okay, so here's my vocabulary this movie is good the bad. That's all the words 01:07:34.900 |
Okay, and so now I'm going to take each one of my movie reviews and turn it into a 01:07:41.280 |
Vector of which words appear and how often do they appear right and in this case none of my words appear twice 01:07:47.440 |
So this movie is good has those four words in it 01:08:03.440 |
Right and this representation we call a bag of words 01:08:08.680 |
representation, right? So this here is a bag-of-words representation of the review — 01:08:13.860 |
It doesn't contain the order of the text anymore. It's just a bag of the words 01:08:21.820 |
"movie", "this". Okay, so that's the first thing we're going to do — we're going to turn it into a bag-of-words 01:08:21.820 |
representation. And the reason that this is convenient 01:08:36.280 |
is that it gives us a nice rectangular matrix that we can, like, do math on. 01:08:39.200 |
Okay, and specifically we can do a logistic regression, and that's what we're going to do is we're going to get to a point 01:08:46.880 |
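
The tiny worked example above, written out as an actual term-document matrix (a sketch):

```python
import numpy as np

vocab = ['bad', 'good', 'is', 'movie', 'the', 'this']
docs = ['this movie is good', 'the movie is good',
        'this movie is bad',  'the movie is bad']

# one row per document, one column per vocabulary word, word counts in the cells
term_doc = np.array([[doc.split().count(w) for w in vocab] for doc in docs])
print(term_doc)
# [[0 1 1 1 0 1]
#  [0 1 1 1 1 0]
#  [1 0 1 1 0 1]
#  [1 0 1 1 1 0]]
labels = np.array([1, 1, 0, 0])   # positive, positive, negative, negative
```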
Before we get there, though, we're going to do something else, which is called naive Bayes, okay? 01:08:54.640 |
Scikit-learn has something which will create a term-document matrix for us — it's called CountVectorizer. Okay, so we'll just use it now. 01:09:04.680 |
You have to turn your text into a list of words first — and that's not as simple as it sounds, 01:09:13.880 |
because, like, what if this was actually 'This movie is good.' with 01:09:27.440 |
punctuation? Or, perhaps more interestingly, what if it was 'This movie isn't good'? 01:09:35.800 |
How you turn a piece of text into a list of tokens is called tokenization, right? 01:09:41.400 |
And so a good tokenizer would turn 'this movie isn't good.' into something like 01:09:49.560 |
'this' space 'movie' space 'is' space 'n't' space 'good' space '.' — so you can see in this version here, 01:09:57.320 |
if I now split this on spaces, every token is either a word, a single piece of punctuation, or something like this suffix n't that is 01:10:04.840 |
considered like a word. Right, and that's kind of how we would probably want to tokenize that piece of text, because you wouldn't want 01:10:14.200 |
'good.' to be, like, an object — right, because there's no concept of 'good-full-stop' — or 'isn't' treated as one word. 01:10:25.720 |
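Just to illustrate the idea (this is not fastai's tokenizer, which handles far more cases), a crude regex tokenizer along these lines would split out punctuation and the n't suffix:

```python
import re

def simple_tokenize(text):
    # peel "n't" off contractions, then grab word characters or single
    # punctuation marks as separate tokens -- a rough approximation only
    text = re.sub(r"n't\b", " n't", text.lower())
    return re.findall(r"n't|\w+|[^\w\s]", text)

print(simple_tokenize("This movie isn't good."))
# ['this', 'movie', 'is', "n't", 'good', '.']
```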
Tokenization is something we hand off to a tokenizer 01:10:27.720 |
Fast AI has a tokenizer in it that we can use 01:10:31.680 |
So this is how we create our term document matrix with a tokenizer 01:10:37.560 |
sklearn has a pretty standard API, which is nice — 01:10:45.840 |
I'm sure you've seen it a few times before now. Once we've built some kind of model — 01:10:55.800 |
this is just defining what it's going to do — we can call fit_transform 01:11:00.780 |
to do that. So in this case fit_transform is going to create the vocabulary and create the term document matrix from the training set. 01:11:11.920 |
transform is a little bit different: it says use the previously fitted model, which in this case means use the previously created vocabulary. 01:11:21.600 |
We wouldn't want the validation set and the training set to have 01:11:24.400 |
the words in different orders in the matrices, right, because then the columns would have different meanings. 01:11:29.480 |
So this here is saying: use the same vocabulary 01:11:32.200 |
to create a bag of words for the validation set. Could you pass that back, please? 01:11:38.280 |
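In code, that pattern looks roughly like this — the variable names trn_texts and val_texts are placeholders for the arrays of review strings, and a plain string-split tokenizer is used here as a stand-in for fastai's:

```python
from sklearn.feature_extraction.text import CountVectorizer

veczr = CountVectorizer(tokenizer=str.split)      # stand-in tokenizer

trn_term_doc = veczr.fit_transform(trn_texts)     # builds the vocabulary and transforms the training set
val_term_doc = veczr.transform(val_texts)         # reuses that same vocabulary for the validation set
```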
What if the validation set has a different set of words than the training set? Yeah, that's a great question. So generally most 01:11:47.960 |
of these kinds of vocab-creating approaches will have a special token for unknown. 01:11:52.940 |
Sometimes you'll also say, like, hey, if a word appears less than three times, call it unknown; 01:12:00.640 |
but otherwise it's like, if you see something you haven't seen before, call it unknown. 01:12:05.000 |
So that would just become a column in the bag of words — 'unknown'. 01:12:09.080 |
Good question. All right, so when we create this 01:12:16.160 |
term document matrix of the training set, we have 25,000 rows, because there are 25,000 movie reviews, 01:12:25.620 |
and 75,132 columns. What does that represent? What does that mean, that there are 75,132 of them? 01:12:33.560 |
It's all the vocabulary? Yeah, go on — what do you mean? 01:12:38.880 |
Like, the number of unique words. Yeah, exactly, good — 01:12:54.040 |
the number of unique words. All right, so we don't want to actually store that as 01:12:59.040 |
a normal array in memory, because it's going to be very wasteful. Instead we store it as a sparse 01:13:06.520 |
matrix. All right, and what a sparse matrix does is it just stores the 01:13:16.560 |
whereabouts of the non-zeros. So it says, like, okay: document number one, term number such-and-such 01:13:25.560 |
appears, and it has four of them; you know, document one, term number something-else, 01:13:35.120 |
that appears and it's a one; and so forth. That's basically how it's stored. 01:13:41.000 |
There's actually a number of different ways of storing sparse matrices, 01:13:43.520 |
and if you do Rachel's computational linear algebra course, you'll learn about the different types and why you'd choose them and how to convert between them 01:13:50.560 |
and so forth. But they're all kind of something like this, and on the whole you don't really have to worry about the details. 01:13:57.640 |
The important thing to know is that it's efficient. Okay, and so we could grab the first review, and that gives us back a 01:14:11.400 |
one-row-long sparse matrix with 93 stored elements. So in other words, 01:14:16.980 |
93 of those words are actually used in the first document, okay? 01:14:22.820 |
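If you want to see what that storage looks like, here's a tiny sketch with scipy (not taken from the notebook): only the coordinates and values of the non-zero cells are kept.

```python
import numpy as np
from scipy import sparse

# a dense row that is mostly zeros...
dense = np.array([[0, 2, 0, 0, 1, 0, 0, 3]])

# ...stored sparsely: just the (row, column) positions and their values
row = sparse.csr_matrix(dense)
print(row.nnz)   # 3 stored elements
print(row)       # lists (0, 1) 2, (0, 4) 1, (0, 7) 3
```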
We can have a look at the vocabulary by saying vectorizer.get_feature_names() — that gives us the vocab, 01:14:29.440 |
and so here's an example of a few of the elements of get_feature_names. 01:14:36.880 |
I didn't intentionally pick the bit that had 'aussie', but, you know, those are the important words, obviously. 01:14:44.280 |
I haven't used the tokenizer here — I'm just splitting on space — so this isn't quite the same as what the vectorizer does. 01:14:55.360 |
By making it a set we make them unique, so this is 01:14:58.920 |
roughly the list of words that would appear, and that length is 91, 01:15:04.720 |
which is pretty similar to 93, and the difference will just be that I didn't use a real tokenizer. Yeah. 01:15:13.080 |
So that's basically all that's been done there: it's created this unique list of words and mapped them to integers. 01:15:19.600 |
We could check by calling vectorizer.vocabulary_ to find the ID of a particular word. 01:15:27.240 |
So this is like the reverse map of the previous one: that was integer to word, 01:15:31.760 |
here is word to integer. And so we saw 'absurd' appeared twice in the first document, 01:15:38.040 |
so let's check train_term_doc[0, 1297] — there 01:15:42.120 |
it is, it's 2. Whereas, unfortunately, 'aussie' didn't appear in the unnatural-relationship-with-a-pig movie, 01:15:49.720 |
so [0, 5000] is 0. Okay, so that's our term document matrix. 01:15:59.340 |
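In code, that check is roughly the following — assuming the veczr and trn_term_doc names from the sketch above, and assuming 'absurd' and 'aussie' are in the vocabulary:

```python
# word -> column index (the reverse of get_feature_names)
idx = veczr.vocabulary_['absurd']

# how many times 'absurd' appears in the first review
print(trn_term_doc[0, idx])                            # e.g. 2

# a word that doesn't appear in that review gives 0
print(trn_term_doc[0, veczr.vocabulary_['aussie']])    # e.g. 0
```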
Yes — so does it care about the relative relationship between the words, 01:16:08.480 |
as in the ordering of the words? No, we've thrown away the orderings — that's why it's a bag of words — and, as I said, that's not 01:16:16.520 |
necessarily a good idea. What I will say is that the vast majority of NLP work 01:16:23.880 |
that's been done over the last few decades generally uses this representation, because we didn't really know much better. 01:16:29.800 |
Nowadays, increasingly, we're using recurrent neural networks instead, which we'll learn about in our deep learning course. 01:16:40.080 |
But sometimes this representation works pretty well, and it's actually going to work pretty well in this case. 01:16:46.920 |
Okay, so in fact, you know, back when I was at FastMail, my email company, a 01:16:55.400 |
lot of the spam filtering we did used this next technique, naive Bayes, 01:17:00.440 |
which is a bag of words approach. It's just kind of like, you know, if you're getting a lot of 01:17:05.960 |
email containing the word Viagra, and it's always been spam, 01:17:09.760 |
and you never get email from your friends talking about Viagra, 01:17:13.480 |
then it's very likely that something that says Viagra, regardless of the detail of the language, is probably from a spammer. 01:17:19.880 |
All right, so that's the basic theory about classification using a term document matrix. Okay, so let's talk about naive Bayes, 01:17:28.280 |
and here's the basic idea. We're going to start with our term document matrix: 01:17:41.680 |
the first two rows are our corpus of positive reviews, these next two are our corpus of negative reviews, and so here's our whole corpus of all reviews. 01:17:55.720 |
We tend to call these, more generically, features rather than words, right? 01:18:00.480 |
'this' is a feature, 'movie' is a feature, 'is' is a feature, right? 01:18:04.800 |
So it's kind of more like machine learning language now: a column is a feature. 01:18:08.660 |
We'll often call those f. So we can basically say: the probability that you see the word 'this', 01:18:17.880 |
given that the class is one — given that it's a positive review — 01:18:23.620 |
is just the average of how often you see 'this' in the positive reviews. 01:18:39.880 |
Now, we need to be a bit careful, because a word might just never have appeared in a particular class. Right, so if I've never received an email from a friend that said Viagra, 01:18:47.040 |
all right, that doesn't actually mean the probability of a friend sending me an email about Viagra is zero. 01:18:56.260 |
I hope I don't get an email, you know, from Terrence tomorrow saying, like, 01:19:02.160 |
'Jeremy, you probably could use this advertisement for Viagra' — but, you know, it could happen, and, you know, 01:19:08.660 |
I'm sure it'd be in my best interest. 01:19:12.400 |
So what we do is we say: actually, what we've seen so far is not the full sample of everything that could happen; 01:19:20.120 |
it's just a sample of what's happened so far. So let's assume that the next email you get 01:19:26.480 |
actually does mention Viagra and every other possible word — so basically we're going to add a row of ones. 01:19:35.480 |
Okay, so that's like the email that contains every possible word, so that way nothing's ever 01:19:40.160 |
infinitely unlikely. Okay, so I take the average of the number of 01:19:48.360 |
times that 'this' appears in my positive corpus, plus the ones, and that gives me the probability that 01:19:59.320 |
feature 'this' appears in a document, given that class equals one. 01:20:06.840 |
And so, not surprisingly, here's the same thing 01:20:10.320 |
for the probability that the feature 'this' appears given class equals zero — the same calculation, except over the class-zero 01:20:18.080 |
rows. And obviously these two are the same, because 'this' appears 01:20:24.000 |
once in the positives and once in the negatives, okay. 01:21:04.880 |
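Written as a formula, with the extra row of ones showing up as the +1s, that per-feature probability is:

$$ p(f \mid c) \;=\; \frac{(\text{number of class-}c\ \text{documents containing } f) + 1}{(\text{number of class-}c\ \text{documents}) + 1} $$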
Now, what we want to know is: given that I've got this particular document — so somebody sent me this particular email, or I have this particular IMDB review — 01:21:21.080 |
what's the probability that it's, I don't know, positive? Right, so for this particular movie review, what's the probability that its class is 01:21:26.240 |
positive? And so we can say, well, that's equal to the probability of getting this particular movie review given that its class is positive, 01:21:42.400 |
multiplied by the probability that any movie review's class is positive, 01:21:46.960 |
divided by the probability of getting this particular movie review. 01:21:51.960 |
All right, that's just Bayes' rule. Okay, and so we can calculate that. 01:22:00.280 |
But actually what we really want to know is: is it more likely that this is class zero or class one? 01:22:12.000 |
So what if we took the probability that it's class one and divided it by the probability that it's class zero? 01:22:16.120 |
What if we did that? Right, and so then we could say, like, okay: 01:22:21.560 |
if this number is bigger than one, then it's more likely to be class one; if it's smaller than one, 01:22:26.760 |
it's more likely to be class zero. Right, so in that case we could just divide the whole expression 01:22:35.280 |
by the same version for class zero, which is the same as multiplying it by the reciprocal. 01:22:41.680 |
And the nice thing is, that's now going to put a probability of getting the data, P(d), on top here, which we can get rid of, 01:22:46.960 |
and a probability of getting the data given class zero down here, and the probability of class 01:22:55.560 |
zero here. Right, and so basically what that means is: we want to calculate 01:23:02.560 |
the probability that we would get this particular document given that the class is one, 01:23:08.760 |
times the probability that the class is one, divided by the probability of getting this particular document given that the class is zero, times the probability that the class is zero. 01:23:26.360 |
Right, and the probability that the class is zero is just one minus the probability that it's one. 01:23:36.640 |
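In symbols, for a document $d$, that ratio is:

$$ \frac{P(c=1 \mid d)}{P(c=0 \mid d)} \;=\; \frac{P(d \mid c=1)\,P(c=1)\,/\,P(d)}{P(d \mid c=0)\,P(c=0)\,/\,P(d)} \;=\; \frac{P(d \mid c=1)\,P(c=1)}{P(d \mid c=0)\,P(c=0)} $$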
So there are those two numbers — I've got an equal amount of both, so they're both 0.5. 01:23:40.960 |
What is the probability of getting this document given that the class is one? Can anybody tell me how I would calculate that? 01:24:02.640 |
'Look at all the documents which have class equal to one... uh-huh... and one divided by that will give you...' 01:24:08.500 |
So remember, though, it's going to be for a particular document. So, for example, we'd be saying, like, what's the probability that 01:24:14.960 |
this review is positive? Right, so you're on the right track, 01:24:20.200 |
but what we're going to have to do is say: let's just look at the per-word probabilities 01:24:31.320 |
for class equals one. Right, so the probability that a class-one review has 'this' is 01:24:40.120 |
two-thirds; the probability it has 'movie' is one, 'is' is one, and 'good' is one. 01:24:47.960 |
So the probability it has all of them is all of those multiplied together. 01:24:51.880 |
Well, kind of. Tyler, why is it only kind of? Can you pass it to Tyler? 01:24:58.240 |
I'm so glad you look horrified and skeptical. 'Word choice is not independent.' 01:25:16.160 |
Exactly — because this is what happens if you take Bayes' theorem in a naive way, and Tyler is not naive: he knows better. Right, so 01:25:23.880 |
naive Bayes says: let's assume that if you have 'this movie is bloody stupid', 01:25:30.400 |
then the probability of seeing each of those words is independent — the probability of 'bloody' is independent of the probability of 'stupid', right? 01:25:36.880 |
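That independence assumption is exactly the 'naive' part; written out, it says the document probability factorizes into a product over its features:

$$ P(d \mid c) \;\approx\; \prod_{f \in d} P(f \mid c) $$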
That assumption is definitely not true, and so naive Bayes isn't actually very good, 01:25:41.920 |
but I'm kind of teaching it to you because it's going to turn out to be a convenient 01:25:46.560 |
piece for something we're about to learn later. 01:25:51.120 |
It's okay, right? I mean, I would never choose it — 01:25:55.760 |
like, I don't think it's better than any other technique that's equally fast and equally easy — 01:25:59.520 |
but, you know, it's a thing you can do, and it's certainly going to be a useful foundation. 01:26:08.080 |
So here is our calculation of the probability 01:26:15.360 |
that we get this particular document assuming it's a positive review; here's the probability given 01:26:21.160 |
it's a negative one; and here's the ratio. And this ratio is above one, 01:26:25.280 |
so we're going to say: I think that this is probably a positive review. Okay, so that's the Excel version. 01:26:34.080 |
You can tell that I let Yannet touch this, because it's got LaTeX in it — we've got actual math. So 01:26:51.960 |
here it is written out as Python. Okay, so our independent variable is our term document matrix, 01:26:58.520 |
and our dependent variable is just the labels, the y. 01:27:09.680 |
Okay, and so then, for the positive reviews, we can sum over the rows to get the total count 01:27:16.720 |
for each feature across those documents, 01:27:21.040 |
plus one — because that's the email Terrence is totally going to send me, something about Viagra, today. 01:27:26.240 |
I can tell. Yeah, okay. So I'll do the same thing for the negative reviews. 01:27:31.700 |
Right, and then of course it's nicer to take the log, 01:27:36.560 |
because if we take the log then we can add things together rather than multiply them together, 01:27:41.680 |
and once you multiply enough of these things together, 01:27:44.280 |
it's going to get so close to zero that you'll probably run out of floating point. Right, so we take the log of the ratio. 01:27:54.560 |
Then, as I say, since we're in log space, instead of multiplying things we add them; so to 01:28:11.800 |
multiply the Bayes probabilities by the counts, we can just use a matrix multiply, and to add in 01:28:21.880 |
the log of the class ratios, we can just use + b. And so we end up with something that looks a lot like our 01:28:31.520 |
logistic regression, right? But we're not learning anything — not in an SGD point of view; 01:28:38.280 |
we're just calculating it using this theoretical model. 01:28:41.560 |
Okay, and so, as I said, we can then check whether that's bigger or smaller than zero — 01:28:46.400 |
not one anymore, because we're now in log space — 01:28:48.720 |
and then we can compare the predictions to the labels, take the mean, and say, okay, that's about 80%, 81% accurate. 01:28:56.000 |
Right, so naive Bayes, you know, is not nothing — it gave us something. Okay? 01:29:05.320 |
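Here's a minimal sketch of that whole calculation, assuming x is the training term document matrix as a scipy sparse matrix, y is the array of 0/1 labels, and x_val / y_val are the validation equivalents — it mirrors the idea described above rather than being the notebook's exact code:

```python
import numpy as np

def pr(x, y, y_i):
    # smoothed count of each feature in class y_i; the +1s are the added row of ones
    p = np.asarray(x[y == y_i].sum(0)).ravel() + 1
    return p / ((y == y_i).sum() + 1)

r = np.log(pr(x, y, 1) / pr(x, y, 0))          # log ratio of per-class feature probabilities
b = np.log((y == 1).mean() / (y == 0).mean())  # log ratio of the class priors

pre_preds = x_val @ r + b                      # looks just like a linear model: x @ w + b
preds = pre_preds > 0                          # compare to 0 because we're in log space
print((preds == y_val).mean())                 # accuracy
```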
Now, this is the version where we're actually looking at how often a word appears — like, 'absurd' appeared twice. 01:29:13.040 |
It turns out, at least for this problem and quite often, that it doesn't matter whether 'absurd' appeared twice or once; all that matters is that it appeared at all. 01:29:21.200 |
So what people tend to try doing is to take the sign of the term document matrix, which 01:29:31.060 |
replaces anything positive with one and anything negative with negative one (we don't have any negative counts, obviously). So this 01:29:38.200 |
binarizes it — it says: I don't care that you saw 'absurd' twice, 01:29:42.720 |
I just care that you saw it. Right, so we can do exactly the same thing with the binarized version. 01:29:55.060 |
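Continuing the sketch above, the binarized version just runs the same calculation on the sign of the matrices:

```python
# .sign() keeps only presence/absence: every positive count becomes 1
r_b = np.log(pr(x.sign(), y, 1) / pr(x.sign(), y, 0))
preds_b = (x_val.sign() @ r_b + b) > 0
print((preds_b == y_val).mean())
```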
Okay, now, this is the difference between theory and practice. In theory, 01:30:05.680 |
naive Bayes sounds okay, but it's naive — unlike Tyler, it's naive. Right, so what Tyler would probably do instead is say: rather than assuming 01:30:17.120 |
that I should use these coefficients r, why don't we learn them? Does that sound reasonable, Tyler? 01:30:24.480 |
Yeah, okay, so let's learn them — we can totally learn them. So let's create a logistic regression and fit 01:30:33.800 |
some coefficients, and that's going to literally give us something with exactly the same functional form that we had before, 01:30:43.160 |
but rather than a theoretical r and a theoretical b, we're going to calculate the two things based on logistic regression — and that's better. 01:30:57.640 |
Why do something based on some theoretical model? Because theoretical models are pretty much never 01:31:03.160 |
going to be as accurate as a data-driven model — 01:31:11.080 |
unless it's, I don't know, some physics thing or something where you're like, okay, 01:31:13.840 |
this is actually how the world works, 01:31:17.360 |
we're working in a vacuum, and this is the exact gravity, and blah blah blah. But for most of the real world, 01:31:23.080 |
this is how things are: it's better to learn your coefficients than to calculate them. Yes? 01:31:41.120 |
Our term document matrix is much wider than it is tall, and there's 01:31:47.780 |
basically an almost mathematically equivalent reformulation of logistic regression that happens to be a lot faster when it's wider than it is tall. 01:31:55.680 |
So the short answer is: any time 01:31:58.760 |
it's wider than it is tall, put dual=True and it will run fast — this runs in like two seconds; 01:32:03.200 |
if you don't have it here, it'll take a few minutes. 01:32:06.880 |
So, like, in math there's this concept of dual versions of problems, which are kind of 01:32:12.480 |
equivalent versions that sometimes work better for certain situations. 01:32:17.360 |
Okay, here is the binarized version, 01:32:26.600 |
and it's about the same — you can see I've fitted it with the sign of the term document matrix. 01:32:37.840 |
Now, the thing is that this is going to learn a coefficient for every term — 01:32:45.320 |
there were about 75,000 terms in our vocabulary — 01:32:49.480 |
and that seems like a lot of coefficients, given that we've only got 01:32:53.400 |
25,000 reviews. So maybe we should try regularizing this. 01:33:00.400 |
We can use the regularization built into sklearn's LogisticRegression class: C is the parameter that they use, and — 01:33:07.640 |
this is slightly weird — a smaller parameter is more regularization, right? 01:33:12.120 |
That's why I used a really big number to basically turn off regularization before. So if I turn on regularization, setting it to 0.1, 01:33:18.440 |
then now it's 88 percent. Okay, which makes sense: you would think, like, 01:33:25.400 |
25,000 parameters for 25,000 documents, you know, it's likely to overfit. Indeed, it did overfit. 01:33:30.880 |
So this is adding L2 regularization to avoid overfitting 01:33:37.880 |
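In sklearn terms, the fits described above look roughly like this — same placeholder names as before; the huge C is just a stand-in for 'effectively no regularization', and newer sklearn versions need solver='liblinear' for dual=True:

```python
from sklearn.linear_model import LogisticRegression

# essentially unregularized: a huge C, dual formulation because the matrix is wide
m = LogisticRegression(C=1e8, dual=True, solver='liblinear')
m.fit(trn_term_doc.sign(), y)
print((m.predict(val_term_doc.sign()) == y_val).mean())

# regularized version: smaller C means more regularization
m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(trn_term_doc.sign(), y)
print((m.predict(val_term_doc.sign()) == y_val).mean())
```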
I mentioned earlier that as well as L2, which is looking at the square of the weights, there's also L1, 01:33:44.440 |
which is looking at just the absolute value of the weights, right? I was 01:33:55.760 |
kind of sloppy in my wording before — I said that L2 tries to make things zero. 01:34:00.920 |
That's kind of true, but if you've got two things that are highly correlated, 01:34:04.960 |
then L2 regularization will move them both down together; 01:34:10.040 |
it won't make one of them zero and one of them non-zero, right? 01:34:13.440 |
So L1 regularization actually has the property that it'll try to make as many things zero as possible, 01:34:20.680 |
whereas L2 regularization has the property that it tends to try to make everything smaller. 01:34:25.560 |
We actually don't care about that difference much in 01:34:28.640 |
really any modern machine learning, because we very rarely try to directly interpret the coefficients; we try to understand our models through 01:34:37.660 |
interrogation, using the kinds of techniques that we've learned. 01:34:40.800 |
The reason that we would care about L1 versus L2 is simply which one ends up with a better error on the validation set. 01:34:52.320 |
Also, with sklearn's logistic regression, L2 turns out to be a lot faster here, because you can't use dual=True unless you have L2 — 01:34:58.680 |
and L2 is the default, so I didn't really worry too much about that difference here. 01:35:21.260 |
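Concretely, the two penalties added to the loss are (with $\lambda$ controlling the strength; sklearn's C is roughly the inverse of $\lambda$, which is why smaller C means more regularization):

$$ \text{L2:}\;\; \lambda \sum_j w_j^2 \qquad\qquad \text{L1:}\;\; \lambda \sum_j \lvert w_j \rvert $$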
Earlier we learned about elastic net, right — like, combining L1 and L2? Yeah, yeah, you can do that, but, I mean, 01:35:34.080 |
yeah, I've never seen anybody find that useful. 01:35:42.880 |
Now, one trick is that when you do your count vectorizer — 01:35:48.200 |
wherever that was — when you do your CountVectorizer, you can also ask for n-grams. By default we get 01:35:54.960 |
unigrams, that is, single words, but if we say 01:35:59.560 |
ngram_range=(1, 3), that's also going to give us 01:36:04.920 |
bigrams and trigrams. By which I mean, if I now say, okay, let's go ahead and 01:36:10.920 |
do the CountVectorizer get_feature_names again, now my vocabulary includes bigrams — 01:36:18.280 |
right, 'by fast', 'by vengeance' — and trigrams — 'by vengeance .', 01:36:23.480 |
'by vera miles'. Right, so this is now doing the same thing, but after tokenizing 01:36:28.960 |
it's not just grabbing each word and saying that's part of our vocabulary, 01:36:32.600 |
but also each two words next to each other and each three words next to each other. And this turns out to be, like, 01:36:38.560 |
super helpful in taking advantage of bag-of-words 01:36:44.280 |
approaches, because we now can see the difference between, say, 'good' and 'not good', 01:36:54.840 |
or even like double-quote 'good' double-quote, which is probably going to be sarcastic. Right, so using trigram features 01:37:04.120 |
actually is going to turn out to make both naive Bayes 01:37:08.840 |
and logistic regression quite a lot better — it really takes us quite a lot further and makes them quite useful. 01:37:21.320 |
You're setting max_features — so how does that work? I just 01:37:36.440 |
didn't want to create too many features. I mean, it actually worked fine even without max_features — I think I had something like, I 01:37:42.080 |
can't remember, 70 million coefficients; it still worked, but there's just no need to have 70 million coefficients. 01:37:51.760 |
The CountVectorizer will sort the vocabulary by how often everything appears, whether it be unigram, bigram, or trigram, and cut it off 01:38:03.440 |
after the first 800,000 most common n-grams — n-gram is just the generic word for unigrams, bigrams, and trigrams. 01:38:12.800 |
So that's why train_term_doc.shape is now 25,000 by 01:38:18.320 |
800,000. And, like, if you're not sure what number this should be, I 01:38:22.240 |
just picked something that was really big and, you know, didn't worry about it too much, and it seemed to be fine. 01:38:32.840 |
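Putting that together, the vectorizer call looks roughly like this — same placeholder names as before, with a plain string-split tokenizer standing in for a real one:

```python
from sklearn.feature_extraction.text import CountVectorizer

veczr = CountVectorizer(ngram_range=(1, 3),    # unigrams, bigrams and trigrams
                        tokenizer=str.split,   # stand-in tokenizer
                        max_features=800000)   # keep only the most common n-grams

trn_term_doc = veczr.fit_transform(trn_texts)
val_term_doc = veczr.transform(val_texts)
print(trn_term_doc.shape)                      # e.g. (25000, 800000)
```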
All right, okay — well, we're out of time, so here's what we're going to see 01:38:38.200 |
next week. By the way, you know, we could have 01:38:41.680 |
replaced this logistic regression with our PyTorch version, and next week 01:38:47.880 |
we'll actually see something in the fastai library that does exactly that. 01:38:51.480 |
But also, what we'll see next week — well, tomorrow — is 01:38:57.480 |
how to combine logistic regression and naive Bayes together to get something that's better than either, 01:39:03.080 |
and then we'll learn how to move from there to create a 01:39:06.800 |
deeper neural network, to get a pretty much state-of-the-art result for structured learning. All right, so we'll see you then.