Lesson 5: Deep Learning 2018
Chapters
0:00 Intro
6:15 Kaggle Seedlings Competition
7:48 Learning Objectives
9:43 MovieLens Data
11:22 Excel (collab_filter.xls)
12:24 Matrix Factorization
37:03 PyTorch
40:13 Forward
41:31 Embedding matrices
43:48 Creating a Python module
47:47 Basic initialization
50:13 Minibatching
52:13 Creating an optimizer
54:40 Writing a training loop
55:56 Fast AI
I was really thrilled to see that one of our master's students here at USF actually 00:00:22.720 |
took what we learned last week about structured deep learning and turned it into a blog post which, as I suspected, has been 00:00:29.200 |
Incredibly popular because it's just something 00:00:31.200 |
People didn't know about and so it actually ended up getting picked up by the towards data science publication 00:00:38.960 |
Which I quite like actually if you're interested in keeping up with what's going on in data science. It's quite good medium publication 00:00:49.440 |
structured deep learning and basically introduced 00:00:51.800 |
you know that the basic ideas that we learned about last week and 00:00:57.920 |
It got picked up quite quite widely one of the one of the things I was pleased to see actually 00:01:02.920 |
Sebastian Ruder who actually mentioned in last week's class as being one of my favorite researchers 00:01:07.240 |
Tweeted it and then somebody from Stitch Fix said oh, yeah, we've actually been doing that for ages, which is kind of cute 00:01:16.640 |
Kind of know that this is happening in industry a lot and I've been telling people this is happening in industry a lot 00:01:23.040 |
And now Kerem's kind of published a blog saying hey, check out this cool thing, and now Stitch Fix is like, yeah 00:01:33.240 |
It's great to see, and I think there's still a lot more that can be dug into with this structured deep learning stuff 00:01:40.720 |
One thing you could do to build on top of Kerem's post would be to maybe experiment with some different data sets 00:01:46.320 |
Maybe find some old Kaggle competitions and see whether there are some competitions you could now win with this, or some where it doesn't work 00:01:58.400 |
Experimenting a bit with different amounts of dropout different layer sizes, you know 00:02:02.480 |
Because nobody much has written about this. I don't think there's been any blog posts about this before that. I've seen anywhere 00:02:11.160 |
There's a lot of unexplored territory. So I think there's a lot we could we could build on top of here 00:02:17.360 |
And there's definitely a lot of interest. I saw one person on Twitter saying this is what I've been looking for ages 00:02:32.280 |
Predictor as well as his currency predictor after lesson one 00:02:39.680 |
Download something a bit bigger which was to download a couple of hundred of images of actors and he manually 00:02:49.080 |
I think first of all he like used Google to try and find ones with glasses and ones without then he manually went through and 00:02:54.040 |
Check that they had been put in the right spot 00:02:58.240 |
A vanilla ResNet with just the last layer trained didn't do so well 00:03:03.280 |
And so what Nicole did was he went through and tried unfreezing the layers and using differential learning rates and got up to 00:03:10.240 |
100% accuracy and the thing I like about these things that Nicole is doing is the way he's 00:03:15.840 |
He's not downloading a Kaggle data set. He's like deciding on a problem that he's going to try and solve 00:03:23.560 |
And he's actually got a link here even to a suggested way to help you download images from Google 00:03:29.100 |
So I think this is great and I actually gave a talk 00:03:33.000 |
just this afternoon at Singularity University to a 00:03:36.260 |
Executive team of one of the world's largest telecommunications companies and actually showed them this post 00:03:44.480 |
Folks there were telling me that all the vendors that come to them tell them they need like 00:03:48.860 |
Millions of images and huge data centers full of hardware, and you know they have to buy special 00:03:54.140 |
Software that only these vendors can provide and I said like actually this person's been doing a course for three weeks now 00:04:01.040 |
And look at what he's just done with a computer that cost him 60 cents an hour 00:04:04.980 |
And they were so happy to hear that, like, okay, this actually is within the reach of normal people 00:04:12.200 |
I'm assuming Nicole's a normal person. I haven't actually checked 00:04:15.680 |
If you're proudly abnormal, Nicole, I apologize 00:04:20.500 |
I actually went and had a look at his cricket 00:04:24.760 |
classifier, and I was really pleased to see that his code is the exact same code 00:04:30.280 |
that we used in lesson one. I was hoping that would be the case; the only thing he changed was 00:04:37.120 |
So this idea that we can take those four lines of code and reuse it to do other things 00:04:41.400 |
It's definitely turned out to be true, and so these are good things to show like a your organization 00:04:47.880 |
If you're anything like the executives at this big company I spoke to today, there'll be a certain amount of, like, 00:04:54.120 |
not quite surprise but almost pushback, of like, if this were true, 00:05:00.920 |
somebody would have told us, so why isn't everybody doing this already? So I think you might have to actually show them. 00:05:07.080 |
You know, maybe you can build your own with some internal data 00:05:10.240 |
you've got at work or something, and say here it is, you know, it didn't cost me anything, and it's all finished 00:05:19.920 |
Vitaly, or Vitali, I don't know how to pronounce his name correctly, has done another very nice post on 00:05:23.840 |
just an introductory post on how we train neural networks, and I wanted to point this one out as being like I think 00:05:31.120 |
This is one of the participants in this course 00:05:34.000 |
Who's just got a particular knack for technical communication, and I think we can all learn from you know from his posts about about good technical writing 00:05:45.800 |
He assumes almost nothing like he has a kind of a very chatty tone and describes everything 00:05:50.760 |
But he also assumes that the reader is intelligent 00:05:53.140 |
But you know, so he's not afraid to kind of say here's a paper or here's an equation or whatever 00:05:58.320 |
But then he's going to go through and tell you exactly what that equation means 00:06:01.600 |
So it's kind of like this nice mix of like writing for 00:06:05.920 |
Respectfully for an intelligent audience, but also not assuming any particular background knowledge 00:06:15.640 |
Then I made the mistake earlier this week of posting a picture of my first placing on the Kaggle seedlings competition 00:06:15.640 |
At which point five other fast AI students posted their pictures of them passing me over the next few days 00:06:29.820 |
So this is the current leaderboard for the Kaggle plant seedlings competition 00:06:34.500 |
I believe the top six are all fast.ai students or, at worst, one of their teachers 00:06:58.920 |
And most of the images were less than a hundred pixels by a hundred pixels 00:07:07.960 |
And yet, you know, my approach was basically to say let's just run through the notebook 00:07:12.880 |
we have with pretty much the defaults; it took me, I don't know, an hour 00:07:15.720 |
And I think the other students are doing a little bit more than that 00:07:21.600 |
But not a lot more. And basically what this is saying is, yeah, these techniques 00:07:26.800 |
work pretty reliably, to the point where, to find people that aren't using the fastai library, 00:07:37.560 |
you might have to go down quite a way; I suspect all of these are fast.ai students 00:07:41.720 |
So I thought that was very interesting and really really cool 00:07:49.960 |
Start what I would kind of call like the second half of this course 00:07:56.320 |
so the first half of this course has been like 00:08:01.800 |
Like these are the applications that we can use this for 00:08:06.200 |
The here's kind of the code you have to write 00:08:08.440 |
Here's a fairly high level ish description of what it's doing 00:08:15.320 |
We're kind of done with that bit, and what we're now going to do is go in reverse 00:08:20.440 |
We're going to go back over all of those exact same things again 00:08:23.900 |
But this time we're going to dig into the detail of every one of them, and we're going to look inside the source code of the fastai 00:08:29.480 |
library to see what it's doing and try to replicate 00:08:32.920 |
that so in a sense like there's not going to be a lot more 00:08:38.680 |
best practices to show you; I've kind of shown you the best practices 00:08:46.560 |
But I feel like for us to now build on top of those to debug those models to come back to part 2 00:08:52.520 |
Where we're going to kind of try out some new things, you know, it really helps to understand what's going on 00:08:58.240 |
Behind the scenes. Okay, so the goal here today is we're going to try and create a 00:09:04.680 |
pretty effective collaborative filtering model 00:09:07.920 |
almost entirely from scratch. So we'll use PyTorch as an 00:09:15.120 |
automatic differentiation tool and as a GPU programming tool and not very much else. We'll try not to use its neural net features 00:09:23.720 |
or the fastai library any more than necessary. So that's the goal 00:09:29.320 |
Let's go back; you know, we only very quickly looked at collaborative filtering last time 00:09:33.400 |
So let's go back and have a look at collaborative filtering. And so we're going to look at this MovieLens data set: 00:09:48.560 |
it's got a bunch of different users that are represented by some ID, and a bunch of movies that are represented by some ID, and a 00:09:55.840 |
rating. It also has a timestamp. I haven't actually ever tried to use this 00:10:00.520 |
I guess this is just like what what time did that person rate that movie? 00:10:04.640 |
So that's all we're going to use for modeling is 00:10:09.560 |
three columns user ID movie ID and rating and so thinking of that in kind of 00:10:17.000 |
Structured data terms user ID and movie ID would be categorical variables 00:10:27.040 |
We're not going to use this for modeling but we can use it for looking at stuff later 00:10:32.960 |
We can grab a list of the names of the movies as well 00:10:35.920 |
And you could use this genre information. I haven't tried it; I'd be interested if during the week anybody tries it and finds it helpful 00:10:43.200 |
My guess is you might not find it helpful, but we'll see 00:10:49.040 |
In order to kind of look at this better. I just grabbed 00:10:56.600 |
Users that have watched the most movies and the movies that have been the most watched 00:11:00.320 |
And made a cross tab of it right so this is exactly the same data 00:11:05.560 |
But it's a subset, and now rather than being user, movie, rating, we've got users down the rows and movies across the columns 00:11:14.480 |
And so some users haven't watched some of these movies. That's why some of these are not a number, okay? 00:11:24.400 |
And you'll see there's a thing called collab filter dot XLS if you don't see it there now 00:11:40.360 |
That table okay, so as I go through this like 00:11:44.280 |
Set up of the problem and kind of how it's described and stuff if you're ever feeling 00:11:52.240 |
Ask either directly or through the forum if you ask through the forum and somebody answers there 00:12:00.720 |
but if somebody else asks a question you would like answered of course just like it and 00:12:06.880 |
Yannet will keep an eye out for that, because as we're digging in 00:12:10.960 |
To the details of what's going on behind the scenes. It's kind of important that at each stage you feel like okay. I can see 00:12:19.040 |
Okay, so we're actually not going to build a neural net to start with 00:12:30.000 |
Instead we're going to do something called a matrix factorization 00:12:36.040 |
The reason we're not going to build a neural net to start with is that it so happens there's a really, really simple 00:12:41.120 |
kind of way of solving these kinds of problems, which I'm going to show you. And so if I scroll down, 00:12:47.440 |
basically what I've got here is the same thing, but this time these are my predictions 00:12:54.640 |
rather than my actuals, and I'm going to show you how I created these predictions. Okay, so here are my actuals, 00:13:13.200 |
and I take the squared errors, average them, and take the square root, so this is the RMSE down here. Okay, so on average we're off by about that much 00:13:24.280 |
So let me show you what this model is and I'm going to show you by saying how do we guess? 00:13:36.520 |
The prediction here. This is just at this stage is still random is 00:13:43.760 |
So how are we calculating 0.91? The answer is we're taking this vector's 00:13:52.000 |
dot product with this vector here. So dot product means 0.71 times 0.19 00:13:59.860 |
plus 0.81 times 0.63, plus 0.74 times 0.31, and so forth. And in 00:14:05.540 |
You know linear algebra speak because one of them is a column and one of them is a row 00:14:09.900 |
This is the same as a matrix product so you can see here. I've used the excel function matrix multiply 00:14:28.540 |
Then I'm just going to set this to zero right because like there's no error in predicting something that hasn't happened 00:14:34.620 |
Okay, so what I'm going to do is I'm basically going to say, all right, every one of my 00:14:39.460 |
ratings, my prediction, is not going to come from a neural net; it's going to be a single 00:14:46.980 |
now the matrix multiplication that it's doing is basically in practice is between like this matrix and 00:14:59.460 |
Matrix right so each one of these is a single part of that 00:15:04.560 |
So I randomly initialize these; these are just random numbers, 00:15:17.900 |
random matrices, and I've said let's assume for the time being that every rating can be represented as the dot product of two of these vectors 00:15:29.380 |
So then in excel you can actually do gradient descent 00:15:34.180 |
You have to go to your options to the add-in section and check the box to say turn it on and once you do you'll 00:15:45.140 |
And if I go solver it says okay, what's your? 00:15:49.180 |
objective function, and you just choose the cell. So in this case we chose the cell that contains our root mean squared error, and 00:16:03.140 |
we've selected this matrix and this matrix, and so it's going to do a gradient descent 00:16:07.940 |
for us by changing these matrices to try and, in this case, minimize this cell 00:16:15.860 |
All right, GRG Nonlinear is a gradient descent method, so we say solve, and you'll see it starts at 2.8 and 00:16:24.420 |
Then down here you'll see that number is going down. It's not actually showing us what it's doing 00:16:30.740 |
but we can see that the numbers going down, so 00:16:35.980 |
It has a neural-netty feel to it, in that we're doing a matrix product and we're doing a gradient descent, but we don't have a 00:16:47.300 |
non-linear layer on top of that, so we don't get to call this deep learning 00:16:51.220 |
So for things where people have kind of 00:16:55.500 |
matrix products and gradient descents, but it's not deep, people tend to just call that shallow learning. Okay, so we're doing shallow learning here 00:17:03.460 |
Alright, so I'm just going to go ahead and press escape to stop it because I'm sick of waiting 00:17:12.060 |
We've now got down to about 0.39, all right. So, for example, 00:17:17.660 |
it guessed the rating that movie 72, sorry, movie 27, would get for user 72, 00:17:28.380 |
and it actually got a 4 rating, so you can see it's doing something quite useful 00:17:37.140 |
So why is it doing something quite useful? I mean something to note here is 00:17:42.660 |
The number of things we're trying to predict here is there's 225 of them 00:17:47.940 |
Right and the number of things we're using to predict is that times two so 150 of them 00:17:55.100 |
So it's not like we can just exactly fit we actually have to do some kind of 00:18:01.340 |
So basically what this is saying is that there does seem to be some 00:18:10.260 |
So for those of you that have done some linear algebra 00:18:13.060 |
And this is actually a matrix decomposition normally in linear algebra you would do this using a 00:18:18.980 |
Analytical technique or using some techniques that are specifically designed for this purpose, but the nice thing is that we can use 00:18:26.580 |
Gradient descent to solve pretty much everything including this 00:18:30.220 |
I don't like to so much think of it from a linear algebra point of view though 00:18:34.340 |
I like to think of it from an intuitive point of view, which is this: let's say movie ID 27 is The Lord of the Rings part one, 00:18:48.740 |
and so let's say we're trying to make that prediction for user 00:18:53.500 |
272: are they going to like Lord of the Rings part one? And so conceptually, 00:18:59.820 |
for that particular movie, maybe there are, like, four, sorry, five numbers describing it 00:19:07.980 |
What if the first one was like how much is it sci-fi and fantasy and the second one is like? 00:19:13.860 |
How recent a movie and how much special effects is there you know and the one at the top might be like how dialogue driven? 00:19:20.980 |
Is it right like let's say those kind of five these five numbers represented particular things about the movie and so if that was the case 00:19:29.260 |
Then we could have the same five numbers for the user saying like okay 00:19:33.620 |
How much does the user like sci-fi and fantasy how much does the user like? 00:19:37.980 |
Modern this is a modern CGI driven movies. How much does this does this user like? 00:19:45.820 |
Dialogue driven movies and so if you then took that cross product 00:19:49.140 |
You would expect to have a good model right would expect to have a reasonable rating now the problem is 00:19:56.540 |
We don't have this information for each user. We don't have the information for each movie, so we're just going to like assume 00:20:06.060 |
kind of way of thinking about this system, and let's let stochastic gradient descent try and find these numbers 00:20:11.500 |
Right so so in other words these these factors 00:20:15.860 |
We call these things factors, and we call them factors because you can multiply them together to create this; 00:20:23.100 |
they're factors in a linear algebra sense. And we call them latent factors because they're not actually a 00:20:32.340 |
vector that we've named and understood and entered in manually; we've kind of assumed 00:20:42.680 |
we've assumed that we can think of them as a dot product of 00:20:47.180 |
Some particular features about a movie and some particular features of what users like those kinds of movies, right? 00:20:56.220 |
To just say okay try and find some numbers that that work 00:20:59.980 |
So that's that's basically the technique right and it's kind of 00:21:05.660 |
And the entirety of it is in this spreadsheet, right. So that is collaborative filtering, using what we call probabilistic matrix factorization 00:21:15.020 |
And as you can see the whole thing is easy to do in an Excel spreadsheet and the entirety of it really is this single 00:21:21.500 |
Thing which is a single matrix multiplication 00:21:27.140 |
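(For reference, here is a minimal PyTorch sketch of what the spreadsheet is doing: two small random factor matrices, predictions as their product, and gradient descent on the squared error over only the cells that actually have a rating. The sizes and the couple of hard-coded ratings are just illustrative, and this uses current PyTorch style rather than the lesson-era Variable API.)

```python
import torch

n_users, n_movies, n_factors = 15, 15, 5

ratings = torch.zeros(n_users, n_movies)      # 0 means "not rated" in this toy example
ratings[0, 0], ratings[3, 7] = 4., 5.         # a couple of made-up ratings
mask = ratings != 0

# two small random factor matrices, like the two blocks in the spreadsheet
u = (torch.randn(n_users,  n_factors) * 0.05).requires_grad_()
m = (torch.randn(n_movies, n_factors) * 0.05).requires_grad_()

opt = torch.optim.SGD([u, m], lr=0.1)
for _ in range(100):
    preds = u @ m.t()                              # every user/movie dot product at once
    loss = ((preds - ratings)[mask] ** 2).mean()   # squared error over rated cells only
    opt.zero_grad(); loss.backward(); opt.step()
```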
We like to know if it would be better to cap this to zero and five maybe yeah 00:21:36.300 |
Yeah, we're going to do that later right. There's a whole lot of stuff. We can do improvements. This is like our 00:21:43.580 |
Simple as possible starting point right so so what we're going to do now is we're going to try and implement this 00:21:52.340 |
And run it on the whole data set another question is how do you figure out how many? 00:21:57.700 |
You know how it's clear. How long are the metrics? Yeah, why is it five? Yeah, yeah 00:22:13.700 |
Think about this. This is actually an embedding matrix 00:22:20.540 |
So this length is actually the size of the embedding matrix. I'm not saying this is an analogy 00:22:27.180 |
I'm saying it literally this is literally an embedding matrix 00:22:34.900 |
Where a one is in the 72nd position and so we'd like to look it up, and it would return this list of five numbers 00:22:42.300 |
So the question is actually how do we decide on the dimensionality of our embedding vectors? 00:22:47.660 |
And the answer to that question is we have no idea 00:22:51.780 |
We have to try a few things and see what works 00:22:55.340 |
The underlying concept is you need to pick an embedding dimensionality which is big 00:23:03.740 |
enough to reflect the kind of true complexity of this causal system, but not so big that you 00:23:11.740 |
have too many parameters: it could take forever to run or, even with regularization, it might overfit 00:23:17.820 |
So what does it mean when the factor is negative then 00:23:24.780 |
The factor being negative in the movie case would mean like this is not dialogue driven in fact 00:23:31.900 |
It's like the opposite dialogue here is terrible a negative for the user would be like I actually 00:23:38.060 |
dislike modern CGI movies. So it's not from zero to whatever; the range of the 00:23:44.620 |
score, could it be negative? Is the range of the score even bounded? No, no maximum, no minimum; there are no constraints at all here 00:24:00.420 |
Two questions: the first question is, why can we trust these embeddings? Because if you take a number like six, 00:24:07.700 |
it can be expressed as 1 times 6, or 6 times 1, or 2 times 3, or 3 times 2 00:24:11.840 |
So are you saying we could reorder these five numbers in some different order, or the values themselves might be different, 00:24:19.980 |
as long as the product is the same? Well, you see, we're using gradient descent to find the best numbers 00:24:32.620 |
Yeah, there are other numbers, but they don't give you as good an objective value 00:24:36.920 |
And of course we should be checking that on a validation set really which we'll be doing in the Python version 00:24:43.460 |
Okay, and the second question is when we have a new movie or a new user do we have to retrain the model? 00:24:49.180 |
That is a really good question, and there isn't a straightforward answer to that 00:24:56.220 |
But basically you would need to have like a kind of a new user 00:25:00.100 |
Model or a new movie model that you would use initially 00:25:04.660 |
And then over time yes, you would then have to retrain the model 00:25:11.800 |
But Netflix used to have this thing that when you were first on board it onto Netflix 00:25:17.340 |
And you'd have to go through and like say a bunch of movies you like and it would then like train its model 00:25:28.140 |
Could you could you just find the nearest movie to the movie that you're trying to the new movie that you're trying to add? 00:25:33.420 |
Yeah, you could use nearest neighbors for sure 00:25:35.700 |
But the thing is initially at least in this case we have no 00:25:43.660 |
Columns to describe a movie so if you had something about like the movies 00:25:48.720 |
Genre release date who was in it or something you could have some kind of non collaborative filtering model 00:25:55.060 |
And that was kind of what I meant by a new movie model: you'd have to have some kind of predictors 00:26:04.660 |
Lot of this is going to look familiar and and the way I'm going to do this is again 00:26:11.940 |
It's kind of this top-down approach. We're going to start using a 00:26:15.020 |
Few features of pytorch and fast AI and gradually we're going to redo it a few times in a few different ways 00:26:25.580 |
Regardless we do need a validation set so we can use our standard cross validation indexes approach to grab a random set of IDs 00:26:40.780 |
Which we'll talk about later in the course for those of you that have done some machine learning 00:26:48.060 |
And this is where we choose how big a embedding matrix do we want okay? 00:26:53.380 |
So again, you know, here's where we get our model data object from CSV 00:27:05.380 |
Looks like that okay, so you'll see like stuff tends to look pretty familiar after a while 00:27:21.820 |
What are your rows effectively, what are your columns effectively, and what are your values effectively? Right, so in any collaborative filtering 00:27:29.580 |
or recommendation system approach, there's basically a concept of users and items; 00:27:36.140 |
now, they might not be users and items. Like, if you're doing the Ecuadorian groceries competition, 00:27:42.680 |
there are stores and items, and you're trying to predict how many things you are going to sell at each store 00:27:50.900 |
But generally speaking just this idea of like you've got a couple of kind of high cardinality 00:27:57.660 |
Categorical variables and something that you're measuring and you're kind of conceptualizing and saying okay, we could predict 00:28:04.540 |
The rating we can predict the value by doing this this dot product 00:28:09.140 |
Interestingly this is kind of relevant to that that last question or suggestion an 00:28:16.660 |
Identical way to think about this or to express this is to say 00:28:23.140 |
Whether user 72 will like movie 27 is basically saying 00:28:36.580 |
Which other movies were liked by people like? 00:28:43.140 |
User 72 it turns out that these are basically two ways of saying the exact same thing 00:28:50.160 |
So basically what collaborative filtering is doing? 00:28:52.300 |
You know kind of conceptually is to say okay this movie and this user 00:28:58.420 |
Which other movies are similar to it in terms of like? 00:29:02.160 |
Similar people enjoyed them and which people are similar to this person based on people that like the same kind of movies 00:29:12.900 |
Structure and anytime there's an underlying structure like this that kind of collaborative filtering approach is likely to be useful 00:29:21.860 |
So you yeah, so there's basically two parts the two bits of your thing that you're factoring and then the value the dependent variable 00:29:29.120 |
So as per usual we can take our model data and ask for a learner from it 00:29:35.420 |
And we need to tell it what size embedding matrix to use 00:29:38.940 |
How many sorry what validation set indexes to use what batch size to use and what optimizer? 00:29:45.740 |
To use and we're going to be talking more about optimizers 00:29:58.060 |
Right, and it all looks pretty similar to usual. Interestingly, 00:30:04.020 |
I only had to do three epochs; these kinds of models tend to train super quickly 00:30:09.620 |
You can use the learning rate finder as per usual all the stuff you're familiar with will work fine 00:30:14.780 |
And that was it. So this took, you know, about two seconds to train. There's no pre-trained anything here 00:30:24.500 |
So this is our validation set and we can compare it we have this is a mean squared error 00:30:31.140 |
Not a root mean squared error, so we can take the square root 00:30:33.860 |
So the last time I ran it, it was 0.776, and the square root of that is 0.88, and there are some benchmarks available for this data set 00:30:43.980 |
And when I scrolled through and found the bench the best benchmark I could find here from this 00:30:48.560 |
Recommendation system specific library they had 0.91. So we've got a better loss in two seconds 00:31:01.420 |
So that's basically how you can do collaborative filtering 00:31:09.740 |
Thinking too much, but so now we're going to dig in and try and rebuild that we'll try and get to the point that we're getting 00:31:21.300 |
But if you want to do this yourself at home, you know without worrying about the detail 00:31:28.340 |
That's you know, those three lines of code is all you need 00:31:31.460 |
Okay, so we can get the predictions in the usual way and you know, we could for example plot 00:31:37.540 |
sns is seaborn; seaborn is a really great plotting library. It sits on top of matplotlib, 00:31:45.940 |
so anything you learn about matplotlib will help you with seaborn. It's got a few nice little plots, like this jointplot 00:31:59.680 |
These are my predictions, and you can kind of see the shape here is that as we predict higher numbers, 00:32:05.180 |
they actually are higher numbers, and you can also see the histogram of the 00:32:09.360 |
predictions and a histogram of the actuals. So I'm just kind of plotting that to show you another interesting visualization 00:32:20.780 |
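(Something like the following reproduces that kind of plot; the arrays here are made-up stand-ins for the validation predictions and actual ratings.)

```python
import numpy as np
import seaborn as sns

# stand-ins for the model's validation predictions and the actual ratings
preds   = np.random.uniform(1, 5, 1000)
actuals = np.clip(preds + np.random.randn(1000) * 0.8, 1, 5)

# hexbin joint plot with marginal histograms, like the figure in the lesson
sns.jointplot(x=preds, y=actuals, kind='hex')
```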
Why is it set to 50? It's set to 50 because I tried a few things and that's what worked; 00:32:26.460 |
that's all. What does it mean? It's the dimensionality of the embedding matrix, 00:32:31.040 |
or to think of it another way: rather than being five, it's 50 00:32:37.140 |
Jeremy I have a question about suppose that your 00:32:46.300 |
Recommendation system is more implicit. So you have zeros or ones instead of just 00:32:55.140 |
Actual numbers, right? So basically we would then 00:32:57.700 |
Need to use a classifier instead of a regressor 00:33:01.420 |
Have to sample the negative for something like that 00:33:06.140 |
So if you don't have it, we just have once let's say like just kind of implicit feedback. Oh 00:33:11.380 |
I'm not sure we'll get to that one in this class 00:33:14.260 |
But what I will say is like in the case that you're just doing classification rather than regression 00:33:18.740 |
We haven't actually built that in the library yet 00:33:22.100 |
Maybe somebody this week wants to try adding it. It would only be a small number of lines of code. You basically have to change the 00:33:27.780 |
activation function to be a sigmoid and you would have to change the 00:33:32.380 |
Criterion or the loss function to be cross entropy 00:33:41.880 |
Classifier rather than a regressor. Those are the only things you'd have to change 00:33:46.180 |
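(As a hedged sketch of that change, not the library code itself: the two substitutions described above are a sigmoid activation on the raw score and binary cross-entropy as the criterion.)

```python
import torch
import torch.nn.functional as F

def classification_loss(raw_score, target):
    """raw_score: the dot-product output; target: 0/1 implicit-feedback labels."""
    prob = torch.sigmoid(raw_score)                       # sigmoid activation
    return F.binary_cross_entropy(prob, target.float())   # cross-entropy criterion
```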
So hopefully somebody this week will take up that challenge and by the time we come back next week. We will have that working 00:33:54.140 |
So I said that we're basically doing a dot product right or you know a dot product is kind of the vector version 00:34:05.620 |
So we're basically doing each of these things times each of these things and then add it together 00:34:11.700 |
That's a dot product. So let's just have a look at how we do that in PyTorch 00:34:17.300 |
So we can create a tensor in PyTorch just using this little capital T thing 00:34:22.380 |
You can just say that's the fast AI version the full version is torch dot from NumPy or something 00:34:28.780 |
But I've got it set up so you can pass it through pass in even a list of lists 00:34:33.260 |
So this is going to create a torch tensor with 1 2 3 4 and then here's a torch tensor with 2 2 10 10 00:34:44.260 |
Torch tensors, I didn't say dot CUDA. So they're not on the GPU. They're sitting on the CPU 00:34:54.860 |
Right, and so anytime you have a mathematical operator between tensors in NumPy or PyTorch, it does it element-wise, 00:35:05.100 |
assuming that they're the same dimensionality, which they are: they're both 2 by 2, 00:35:14.140 |
so 3 times 10 is 30, and so forth. Okay, so there's our a times b 00:35:17.940 |
So if you think about basically what we want to do here is we want to take 00:35:37.100 |
2 plus 4 is 6, and so that is actually the dot product between (1, 2) and (2, 2), and 00:35:52.220 |
So in other words a times B dot sum along the first dimension 00:35:57.860 |
So that's summing up the columns. In other words across a row 00:36:01.260 |
Okay, this thing here is doing the dot product of 00:36:09.740 |
That makes sense and obviously we could do that with 00:36:13.540 |
you know, some kind of matrix multiplication approach, but I'm trying to really do things with as little as possible 00:36:22.820 |
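(In plain PyTorch; the fastai T() helper is just a convenience wrapper around creating tensors like this.)

```python
import torch

a = torch.tensor([[1.,  2.], [3.,  4.]])
b = torch.tensor([[2.,  2.], [10., 10.]])

a * b           # element-wise product: [[2, 4], [30, 40]]
(a * b).sum(1)  # sum across each row -> [6, 70], i.e. a row-wise dot product
```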
Okay, so that's what we're going to use for our dot products from now on so basically all we need to do now is 00:36:29.660 |
Remember we have the data we have is not in that crosstab format 00:36:35.420 |
So in Excel we've got it in this crosstab format, but we've got it here in this 00:36:43.340 |
So conceptually we want to be like looking up this user 00:36:47.300 |
Into our embedding matrix to find their 50 factors looking up that movie to find their 50 factors and then take the dot product 00:37:04.220 |
To do it, we're going to build a layer, our own custom layer 00:37:11.860 |
The more generic vocabulary for this is: we're going to build a PyTorch module 00:37:18.980 |
Okay, so a pytorch module is a very specific thing 00:37:23.500 |
It's something that you can use as a layer and a neural net once you've created your own pytorch module 00:37:31.700 |
And a module works by assuming we've already got one say called model 00:37:36.940 |
You can pass in some things in parentheses, and it will calculate it right so assuming that we already have a module called dot product 00:37:48.940 |
To create our dot product object, and we can basically now treat that like a function 00:37:56.100 |
All right, but the thing is it's not just a function because we'll be able to do things like take derivatives of it 00:38:05.100 |
Stack of neural network layers blah blah blah, right, so it's basically a function that we can kind of compose very conveniently 00:38:13.460 |
So here how do we define a module which as you can see here returns a dot product well 00:38:20.700 |
We have to create a Python class and so if you haven't done Python OO before 00:38:26.500 |
You're going to have to learn because all pytorch modules are written in Python OO 00:38:32.260 |
And it's one of the things I really like about PyTorch: it doesn't 00:38:36.020 |
reinvent totally new ways of doing things like TensorFlow does all the time; in PyTorch you really tend to use 00:38:44.780 |
Pythonic ways to do things. So in this case, how do you create some kind of new behavior? You create a Python class 00:38:52.220 |
So, Jeremy, suppose that you have a lot of data, 00:38:58.600 |
not just a little bit of data, more than you can have in memory. Will you be able to use fastai to solve collaborative filtering? 00:39:12.780 |
mini batch stochastic gradient descent which does it a batch at a time the 00:39:27.660 |
Pandas data frame and a pandas data frame has to live in memory 00:39:38.460 |
You know instances on Amazon so like if you had a CSV that was bigger than 512 gig 00:39:43.660 |
You know that would be impressive if that did happen 00:39:48.140 |
I guess you would have to instead save that as a bcolz array and 00:39:51.780 |
create a slightly different version that reads from a bcolz array, streaming it in, or maybe use dask 00:40:01.100 |
It would be easy to do I don't think I've seen 00:40:05.380 |
Real world situations where you have 512 gigabyte collaborative filtering matrices, but yeah, we can do it 00:40:16.540 |
This is PyTorch specific this next bit is that when you define like the actual work to be done which is here return 00:40:27.300 |
You have to put it in a special method called forward 00:40:31.860 |
Okay, and this is the idea that very likely you're putting this in a neural net, right, and in a neural net the thing where you calculate the next 00:40:39.980 |
set of activations is called the forward pass, and so that's doing a forward calculation. The thing that calculates 00:40:45.900 |
the gradients is called the backward calculation 00:40:49.620 |
We don't have to do that because PyTorch calculates that automatically so we just have to define 00:40:54.500 |
Forward so we create a new class we define forward and here we write in our definition of dot product 00:41:01.800 |
Okay, so that's it. So now that we've created this class definition. We can instantiate our 00:41:09.100 |
Model right and we can call our model and get back the numbers we expected. Okay, so that's it 00:41:18.120 |
a custom PyTorch layer, and if you compare that to pretty much any other 00:41:23.080 |
library around, this is way easier 00:41:31.720 |
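(The whole custom layer is roughly this: a class inheriting from nn.Module whose forward does the element-wise multiply and sum.)

```python
import torch
import torch.nn as nn

class DotProduct(nn.Module):
    # forward defines the computation; PyTorch takes care of the backward pass
    def forward(self, u, m):
        return (u * m).sum(1)

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[2., 2.], [10., 10.]])

model = DotProduct()
model(a, b)      # calling the module runs forward() -> tensor([ 6., 70.])
```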
So let's go ahead and now create a more complex 00:41:35.080 |
Module and we're going to basically do the same thing. We're going to have a forward again 00:41:41.720 |
We're going to have our users times movies dot sum 00:41:44.960 |
But we're going to do one more thing beforehand, which is we're going to create two 00:41:49.520 |
Embedding matrices and then we're going to look up our users and our movies in those embedding matrices 00:42:02.840 |
The users the user IDs and the movie IDs may not be contiguous 00:42:09.680 |
You know like they may be they start at a million and go to a million one thousand say, right? So if we just used 00:42:18.240 |
Those IDs directly to look up into an embedding matrix 00:42:23.080 |
We would have to create an embedding matrix of size one million one thousand right which we don't want to do 00:42:28.080 |
so the first thing I do is to get a list of the 00:42:34.520 |
then I create a mapping from every user ID to a 00:42:39.360 |
Contiguous integer this thing I've done here where I've created a 00:42:44.900 |
dictionary which maps from every unique thing to a unique index is 00:42:50.960 |
Well worth studying during the week because like it's super super handy 00:42:55.440 |
It's something you very very often have to do in all kinds of machine learning 00:43:01.680 |
It's easy enough to figure out if you can't figure it out just ask on the forum 00:43:04.920 |
Anyway, so once we've got the mapping from user to a contiguous index 00:43:11.480 |
We then can say let's now replace the user ID column 00:43:17.480 |
With that contiguous index right so pandas dot apply applies an arbitrary function 00:43:24.680 |
In Python Lambda is how you create an anonymous function on the fly and this anonymous function simply returns the index 00:43:32.560 |
We do the same thing for movies, and so after that we now have the same ratings table we had before, but our 00:43:42.600 |
user and movie IDs are now contiguous integers, and therefore they're things that we can look up in an embedding matrix 00:43:47.480 |
So let's get the count of our users in our movies 00:43:52.760 |
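(Roughly what that mapping code looks like; the CSV name and the userId/movieId column names are the MovieLens ones assumed here.)

```python
import pandas as pd

ratings = pd.read_csv('ratings.csv')     # columns: userId, movieId, rating, timestamp

u_uniq = ratings.userId.unique()
user2idx = {o: i for i, o in enumerate(u_uniq)}      # original id -> 0 .. n_users-1
ratings.userId = ratings.userId.apply(lambda x: user2idx[x])

m_uniq = ratings.movieId.unique()
movie2idx = {o: i for i, o in enumerate(m_uniq)}
ratings.movieId = ratings.movieId.apply(lambda x: movie2idx[x])

n_users, n_movies = len(u_uniq), len(m_uniq)
```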
And let's now go ahead and try and create our 00:44:01.920 |
Earlier on, when we created our simplest possible module, we didn't need a constructor, 00:44:15.040 |
Because we weren't like saying how many users are there or how many movies are there or how many factors? 00:44:20.320 |
Do we want or whatever right anytime we want to do something like? 00:44:24.440 |
This where we're passing in and saying we want to construct our 00:44:29.800 |
Module with this number of users and this number of movies then we need a constructor 00:44:37.000 |
for our class and you create a constructor in Python by defining a 00:44:42.600 |
dunder init: underscore underscore init underscore underscore. That's the 00:44:49.840 |
constructor. Then, if you haven't done OO before, 00:44:51.920 |
you'll want to do some study during the week, but it's a pretty simple idea 00:44:57.160 |
This is just the thing that when we create this object. This is what gets run, okay? 00:45:01.680 |
Again special Python thing when you create your own constructor 00:45:06.920 |
You have to call the parent class constructor 00:45:08.800 |
And if you want to have all of the cool behavior of a Pytorch module you get that by inheriting 00:45:15.600 |
from nn.Module, the neural net module. Okay, so basically by inheriting here and calling the superclass constructor, 00:45:23.560 |
We now have a fully functioning Pytorch layer, okay, so now we have to give it some behavior 00:45:29.760 |
And so we give it some behavior by storing some things in it all right, so here. We're going to create something called 00:45:37.400 |
self.u, users, and that is going to be an embedding whose 00:45:44.160 |
number of rows is n_users and number of columns is n_factors 00:45:48.440 |
So that is exactly this, right: the number of rows is n_users, the number of columns is n_factors 00:45:56.920 |
And then we'll have to do the same thing for movies 00:46:00.120 |
All right, so that's going to go ahead and create these two 00:46:10.400 |
However, when you randomly initialize an array, it's important to randomly initialize it to a 00:46:16.040 |
reasonable set of numbers, a reasonable scale. If we randomly initialized them from, like, nought to a million, 00:46:23.040 |
then we would start out and, you know, these things would start out being, like, 00:46:27.720 |
billions and billions in size, and that's going to be very hard to do gradient descent on 00:46:33.600 |
So I just kind of manually figured here like okay about what size 00:46:39.800 |
numbers are going to give me about the right ratings, and we know we want ratings between about nought and five, 00:46:45.920 |
So if we start out with stuff between about naught and 0.05, then we're going to get ratings of about the right level 00:46:54.160 |
You can easily enough like that calculate that in neural nets. There are standard algorithms for 00:47:01.880 |
basically doing that calculation, and the key algorithm is 00:47:09.360 |
something called He initialization, from Kaiming He, and the basic idea is 00:47:20.540 |
Here you basically set the weights equal to a normal distribution 00:47:27.800 |
with a standard deviation which is basically inversely proportional to (the square root of) the number of things coming into it 00:47:44.100 |
So in this case, if you basically take that 00:47:52.020 |
nought to 0.05 and combine it with the fact that you've got 00:47:55.720 |
40 things, or was it 40 or 50 things, coming out of it, 00:47:59.820 |
50, 50 things coming out of it, then you're going to get something of about the right size 00:48:06.820 |
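(A sketch of that idea: weights drawn from a normal distribution whose scale shrinks with the fan-in. The sizes below are illustrative.)

```python
import math
import torch

fan_in, fan_out = 50, 10                                     # illustrative layer sizes
w = torch.randn(fan_out, fan_in) * math.sqrt(2. / fan_in)    # He-style initialization

# PyTorch's built-in version of the same idea: torch.nn.init.kaiming_normal_(w)
```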
PyTorch already has He initialization 00:48:11.100 |
built in; normally, in real life, we don't have to think about this, we can just call the existing initialization 00:48:17.780 |
functions, but we're trying to do this all from scratch here, okay, without any 00:48:26.380 |
So there's quite a bit of PyTorch notation here. self.u we've already set to an instance of the Embedding class, and it has a weight 00:48:39.620 |
attribute which contains the actual embedding matrix 00:48:52.980 |
It's a Variable. A Variable is exactly the same as a tensor, in other words it supports the exact same operations, 00:49:09.220 |
but it also keeps track of gradients. To pull the tensor out of a Variable, you get its data attribute 00:49:15.740 |
Okay, so this is so this is now the tensor of the weight matrix of the self.you embedding 00:49:22.980 |
And then something that's really handy to know is that all of the tensor functions in Pytorch 00:49:30.340 |
You can stick an underscore at the end, and that means do it in place 00:49:34.500 |
Right so this is say create a random uniform random number of an appropriate size 00:49:40.860 |
For this tensor and don't return it, but actually fill in that matrix 00:49:46.860 |
In place okay, so that's a super handy thing to know about I mean it wouldn't be rocket science otherwise. We would have to have gone 00:49:55.220 |
Okay, there's the non in place version. That's what saves us some typing saves us some screen noise. That's all 00:50:14.900 |
So now we've got our randomly initialized embedding weight matrices 00:50:22.980 |
I'm actually going to use the same columnar model data that we used for our structured data models, 00:50:29.540 |
and so it's actually going to be passed both categorical variables and continuous variables. We don't have any 00:50:36.620 |
continuous variables, so I'm just going to grab the 00:50:40.180 |
zeroth column out of the categorical variables and call it users, and the first column and call it movies. Okay, so I'm just kind of 00:50:48.660 |
too lazy to create my own. It's not so much that I'm too lazy, we do have a special class for this, 00:50:53.340 |
but I'm trying to avoid creating a special class, so I'm just going to leverage this columnar model data class 00:50:58.920 |
Okay, so we can basically grab our user and movies 00:51:03.020 |
Mini batches right and remember this is not a single user in a single movie. This is going to be a whole mini batch of them 00:51:11.340 |
We can now look up that mini batch of users in our embedding matrix U and the movies in 00:51:20.380 |
All right, so this is like exactly the same as just doing an array look up to grab the the user ID numbered 00:51:26.820 |
Value, but we're doing it a whole mini batch at a time 00:51:32.340 |
Can do a whole mini batch at a time with pretty much everything that we can get really easy speed up 00:51:37.580 |
We don't have to write any loops on the whole to do everything through our mini batch 00:51:42.460 |
And in fact if you do have a loop through your mini batch manually you don't get GPU acceleration 00:51:48.460 |
That's really important to know right so you never want to loop have a for loop going through your mini batch 00:51:53.860 |
You always want to do things in this kind of like whole mini batch at a time 00:51:58.120 |
But pretty much everything in pytorch does things a whole mini batch at a time, so you shouldn't have to worry about it 00:52:04.060 |
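(Putting the constructor, the in-place uniform initialization and the minibatch lookup together, the module looks roughly like this; written from memory of the lesson notebook, so treat the details as approximate.)

```python
import torch.nn as nn

class EmbeddingDot(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50):
        super().__init__()
        self.u = nn.Embedding(n_users,  n_factors)   # rows = users,  cols = factors
        self.m = nn.Embedding(n_movies, n_factors)   # rows = movies, cols = factors
        self.u.weight.data.uniform_(0, 0.05)         # in-place init to a sensible scale
        self.m.weight.data.uniform_(0, 0.05)

    def forward(self, cats, conts):
        # cats is a whole minibatch: column 0 = user indexes, column 1 = movie indexes
        users, movies = cats[:, 0], cats[:, 1]
        u, m = self.u(users), self.m(movies)         # minibatch of embedding lookups
        return (u * m).sum(1)                        # dot product per row
```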
And then here's our product just like before all right so having defined 00:52:19.540 |
Everything except the rating and the timestamp 00:52:22.820 |
in my ratings table is my x; my y is my rating, and then I can just say, okay, let's 00:52:27.920 |
grab a model data from a data frame using that x and that y, and here is our list of categorical variables 00:52:40.180 |
And then so let's now instantiate that pytorch object 00:52:46.940 |
All right, so we've now created that from scratch 00:52:49.260 |
And then the next thing we need to do is to create an optimizer, so this is part of pytorch 00:52:56.420 |
The only fast AI thing here is this line right because it's like I don't think showing you 00:53:03.840 |
How to build data sets and data loaders is interesting enough really we might do that in part two of the course 00:53:10.060 |
And it's actually so straightforward like a lot of you are already doing it on the forums 00:53:15.740 |
So I'm not going to show you that in this part 00:53:17.860 |
But if you're interested feel free to to talk on the forums about it 00:53:21.940 |
But I'm just going to basically take the the thing that feeds this data as a given particularly because these things are so flexible 00:53:28.420 |
Right you you know if you've got stuff in a data frame. You can just use this you don't have to rewrite it 00:53:32.900 |
So that's the only fast AI thing we're using so this is a pytorch thing and so 00:53:39.060 |
Optim is the thing in pytorch that gives us an optimizer. We'll be learning about that 00:53:47.420 |
So it's actually the thing that's going to update our weights, and we pass it the thing PyTorch 00:53:53.740 |
calls the parameters of the model. So earlier on we said model equals EmbeddingDot blah blah blah, and because it 00:54:03.580 |
derives from nn.Module, we get all of the PyTorch module behavior, and one of the things we got for free 00:54:14.260 |
is a parameters method. So that's pretty handy, right: that's the thing that basically is going to automatically 00:54:20.860 |
give us a list of all of the weights in our model that have to be updated, and so that's what gets passed to the optimizer 00:54:28.900 |
We also passed the optimizer the learning rate 00:54:31.820 |
The weight decay which we'll talk about later and momentum that we'll talk about later 00:54:41.300 |
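(That step looks something like this; the particular learning rate, weight decay and momentum values are just placeholders.)

```python
import torch.optim as optim

model = EmbeddingDot(n_users, n_movies).cuda()   # assuming a GPU is available
# model.parameters() comes for free from nn.Module: every weight that needs updating
opt = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-5, momentum=0.9)
```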
Okay, one other thing that I'm not going to do right now 00:54:44.060 |
But we will do later is to write a training loop so the training loop is a thing that loops through each mini batch 00:54:51.620 |
updates the weights by subtracting the gradient times the learning rate 00:54:56.300 |
There's a function in fastai which is the training loop, and it's called fit 00:55:13.180 |
This is just the thing that shows a progress bar so ignore this for X comma Y in my training data loader 00:55:24.260 |
Print out the loss in our in a progress bar call any callbacks you have and at the end 00:55:35.860 |
Call the call the metrics on the validation right so this there's just for each epoch go through each mini batch 00:55:48.680 |
Basically going to take advantage of this optimizer, but we're rewriting that from scratch shortly 00:55:59.900 |
Okay, we're just using a PyTorch module, so this fit thing, although it's part of fastai, 00:56:06.420 |
is lower down the layers of abstraction. This is the thing that takes a 00:56:11.560 |
regular PyTorch model. So if you ever want to use as little 00:56:19.220 |
fastai stuff as possible, like, you've got some PyTorch model, you've got some code from the internet, 00:56:26.060 |
but you don't want to write your own training loop, then this is what you want to do: 00:56:30.340 |
you want to call fastai's fit function. And so what you'll find is that 00:56:34.300 |
The library is designed so that you can kind of dig in at any layer of abstraction 00:56:39.140 |
You like right and so at this layer of abstraction. You're not going to get things like 00:56:45.340 |
Stochastic gradient descent with restarts you're not going to get like differential learning rates like all that stuff 00:56:52.180 |
That's in the learner like you could do it, but you'd have to write it all by by hand yourself 00:56:56.620 |
Right and that's the downside of kind of going down to this level of abstraction 00:57:00.900 |
The upside is that as you saw the code for this is very simple. It's just a simple training loop 00:57:09.060 |
So this is like this is a good thing for us to use here 00:57:12.460 |
We can we just call it and it looks exactly like what we're we're used to see right we get our 00:57:19.540 |
validation and training loss for the three plus 00:57:31.260 |
So we're not there so in other words the the default fast AI collaborative filtering algorithm is doing something 00:57:39.620 |
Smarter than this so we're going to try and do that 00:57:45.020 |
One thing that we can do, since we're calling this lower-level fit function: 00:57:49.780 |
there's no learning rate annealing, so we could do our own learning rate annealing. You can 00:57:54.140 |
see here there's a fastai function called set learning rates; 00:57:57.220 |
you can pass in a standard pytorch optimizer and pass in your new learning rate and 00:58:02.540 |
Then call fit again. And so this is how we can let manually do a learning rate schedule 00:58:09.100 |
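(If you're not using the fastai helper, the plain-PyTorch equivalent is to edit the optimizer's parameter groups directly before calling fit again.)

```python
# drop the learning rate on an existing optimizer, then call fit again
for pg in opt.param_groups:
    pg['lr'] = 1e-3        # illustrative smaller learning rate
```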
And so you can see we've got a little bit better 00:58:15.580 |
Okay, so I think what we might do is we might have a 00:58:20.740 |
Seven-minute break and then we're going to come back and try and improve this score of it 00:58:28.240 |
For those who are interested, somebody was asking me at the break for a kind of quick 00:58:41.740 |
code walkthrough. So this is totally optional, but if you go into the fastai library, there's a model.py file 00:58:53.740 |
That's where fit is which we're just looking at which goes through 00:58:57.420 |
Each epoch in epochs and then goes through each X and Y in the mini batch and then it calls this 00:59:12.000 |
here, and you can see the key thing is it calculates the output from the model, the model's forward. Right, and so if you remember, 00:59:24.320 |
we didn't actually call model dot forward, we just called model parentheses, and that's because the 00:59:35.520 |
PyTorch module, when you call it as if it's a function, passes it along to forward 00:59:39.640 |
Okay, so that's what that's doing there, and then the rest of this we'll learn about shortly; it's just basically doing the 00:59:54.200 |
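(A stripped-down sketch of what a fit-style loop does for one epoch, not the library code itself; trn_dl is assumed to yield minibatches of (inputs..., target).)

```python
import torch.nn.functional as F

def train_one_epoch(model, trn_dl, opt, crit=F.mse_loss):
    model.train()
    for *x, y in trn_dl:
        preds = model(*x)        # calling the module runs forward()
        loss = crit(preds, y)    # e.g. mean squared error for ratings
        opt.zero_grad()
        loss.backward()          # PyTorch fills in the gradients
        opt.step()               # the optimizer updates the weights
```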
That's that's kind of gives you a bit of a sense of how the code is structured if you want to look at it 00:59:59.160 |
and as I say like the the fast AI code is designed to 01:00:08.680 |
Pretty easy to read so like feel free like take a look at it 01:00:13.480 |
And if you want to know what's going on just ask on the forums 01:00:16.400 |
And if you know if you think there's anything that could be 01:00:23.840 |
Because yeah, the code is definitely know we're going to be digging into the code more and more 01:00:28.880 |
Okay, so let's try and improve this a little bit and let's start off by improving it in Excel 01:00:38.040 |
So you might have noticed here that we've kind of got the idea that 01:00:44.880 |
You know like sci-fi modern movies with special effects, you know 01:00:49.760 |
Whatever and movie number 27 is sci-fi and has special effects and not much dialogue 01:00:55.520 |
but we're missing an important case, which is like 01:01:01.400 |
user 72 is pretty enthusiastic on the whole and on average rates things highly, you know, and movie 01:01:20.040 |
number whatever is just a great movie overall. So we'd like to add a constant for the user and a constant for the movie, and, 01:01:24.880 |
remember, in neural network terms we call that a bias 01:01:28.880 |
That's we want to add a bias so we could easily do that and if we go into the bias tab here 01:01:35.040 |
We've got the same data as before and we've got the same 01:01:38.560 |
Latent factors as before and I've just got one extra 01:01:44.000 |
Row here and one extra column here and you won't be surprised here that we now 01:01:50.800 |
take the same matrix multiplication as before and we add in the user bias and the movie bias 01:02:02.640 |
So other than that we've got exactly the same loss function over here 01:02:07.720 |
And so just like before we can now go ahead and solve that and now our changing variables include the 01:02:16.600 |
bias and we can say solve and if we leave that for a little while it will come to a 01:02:26.600 |
Okay, so that's the first thing we're going to do to improve our model, and there's really very little to show: 01:02:40.600 |
I've defined a function called get_emb which takes a number of inputs and a number of factors, 01:02:47.720 |
so the number of rows and the number of columns in the embedding matrix, creates the embedding and 01:02:55.040 |
randomly initializes it. I don't know why I'm doing negative-to-positive here when it was zero-to-positive last time; 01:03:00.440 |
honestly, it doesn't matter much, as long as it's in the right ballpark 01:03:03.280 |
And then we return that initialized embedding 01:03:06.780 |
So now we need not just our users by factors, which I'll chuck into u, our movies by factors 01:03:14.680 |
Which I've chuck into m, but we also need users by 1 01:03:18.440 |
Which we'll put into ub, user bias, and movies by 1 which we'll put into movie bias 01:03:24.280 |
Okay, so this is just doing a list comprehension 01:03:27.360 |
Going through each of the tuples creating embedding for each of them and putting them into these things 01:03:32.800 |
Okay, so now our forward is exactly the same as before 01:03:38.200 |
U times m dot sum and this is actually a little confusing because we're doing it in two two steps 01:03:47.840 |
Maybe to make it a bit easier. Let's pull this out 01:03:56.960 |
Okay, so maybe that looks a little bit more familiar 01:04:00.880 |
All right, u times m, dot sum, that's the same dot product, and then here we're just going to add in our user bias and 01:04:10.040 |
movie bias. Dot squeeze is the PyTorch thing that removes the extra unit axis so the addition broadcasts correctly; 01:04:19.480 |
That's not going to make any sense if you haven't done broadcasting before 01:04:23.040 |
I'm not going to do broadcasting in this course because we've already done it and we're doing it in the machine learning course 01:04:32.200 |
Broadcasting is what happens when you do something like this, where um is a matrix and the bias is a vector: 01:04:42.000 |
how do you add a vector to a matrix? And basically what it does is effectively copy the vector 01:04:50.400 |
so that it makes it the same size as the matrix; and the particular way, whether it duplicates it across columns or down rows, 01:04:57.240 |
or however it does it, is called broadcasting. The broadcasting rules are the same as NumPy's 01:05:02.700 |
PyTorch didn't actually used to support broadcasting 01:05:06.100 |
So I was actually the guy who first added broadcasting to PyTorch using an ugly hack and then the PyTorch authors did an awesome job 01:05:12.880 |
Of supporting it actually inside the language 01:05:16.400 |
So now you can use the same broadcasting operations in PyTorch is numpy 01:05:21.000 |
If you haven't dealt with this before it's really important to learn it 01:05:26.540 |
Because like it's it's kind of the most important fundamental way to do computations quickly in numpy and PyTorch 01:05:34.760 |
It's the thing that lets you not have to do loops 01:05:37.100 |
Could you imagine here if I had to loop through every row of this matrix and add this, you know, 01:05:43.120 |
vector back to every row; it would be slow and it would be, you know, a lot more code 01:05:47.640 |
And the idea of broadcasting it actually goes all the way back to 01:05:52.840 |
APL which is a language designed in the 50s by an extraordinary guy called Ken Iverson 01:05:58.120 |
APL was originally a designed or written out as a new type of mathematical notation 01:06:07.480 |
Notation as a tool for thought and the idea was that like really good notation could actually make you think of better things 01:06:14.320 |
And part of that notation is this idea of broadcasting. I'm incredibly enthusiastic about it, and we're going to use it plenty 01:06:35.560 |
Anyway, so basically it works reasonably intuitively we can add on we can add the vectors to the matrix 01:06:47.400 |
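(A quick runnable illustration of that broadcasting behaviour in PyTorch:)

```python
import torch

um = torch.randn(4, 3)     # a small "matrix" of activations
bias = torch.randn(3)      # a vector

# broadcasting: the vector is (virtually) copied down the rows so it can be
# added to every row of the matrix, with no explicit loop
out = um + bias
print(out.shape)           # torch.Size([4, 3])
```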
Having done that, we're now going to do one more trick — which is, I think it was Yannet who asked earlier about whether we could 01:07:03.360 |
keep the predictions inside the valid rating range. We could, right, and specifically what we could do is 01:07:12.600 |
All right, so to remind you the sigmoid function 01:07:24.200 |
All right, we could put it through a sigmoid function 01:07:28.040 |
So we could take like 4.96 and put it through a sigmoid function and like that. You know that's kind of high 01:07:34.360 |
So it kind of be over here somewhere right and then we could multiply that 01:07:42.600 |
For example right and in this case we want it to be between 1 and 5 right so maybe we might multiply it by 4 and 01:07:58.240 |
The result so the result is basically the thing that comes straight out of the dot product plus the addition of the biases 01:08:05.720 |
And put it through a sigmoid function now in pytorch 01:08:10.440 |
Basically all of the functions you can do the tensors are available 01:08:16.520 |
Inside this thing called capital F, and this is like totally standard in pytorch 01:08:23.040 |
It's actually called torch dot nn dot functional 01:08:26.000 |
But everybody including all of the pytorch docs import torch dot nn dot functional as capital F 01:08:32.220 |
Right so capital F dot sigmoid means a function called sigmoid that is coming from 01:08:40.560 |
Functional module right and so that's going to apply a sigmoid function to the result 01:08:45.520 |
So I've squished them all between 0 and 1 using that nice little shape, and then I can multiply that by 01:08:54.040 |
Right and then add on 1 and that's going to give me something between 1 and 5 okay, so 01:08:59.560 |
Like there's no need to do this. I could comment it out, and it will still work right 01:09:06.100 |
But now it has to come up with a set of calculations that are always between 01:09:10.560 |
1 and 5, right? Whereas if I leave this in, then it makes it really easy: 01:09:15.960 |
It's basically like oh if you think this is a really good movie just calculate a really high number 01:09:20.520 |
If it's a really crappy movie, calculate a really low number, and I'll make sure it's in the right region 01:09:27.160 |
It's a good example of a general principle: if you're doing any kind of parameter fitting, 01:09:32.100 |
try and make it so that the thing that you want your function to return 01:09:36.240 |
is easy for it to return. Okay, so that's why we do that function squishing 01:09:47.280 |
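(A hedged sketch of that squashing step — the constants follow the 1-to-5 rating example in the lecture:)

```python
import torch

# in the forward method sketched earlier, instead of returning the raw score
# we could squash it onto the rating scale:
def scale_to_range(res, min_rating=1.0, max_rating=5.0):
    # sigmoid squashes the raw score into (0, 1); we then stretch and shift it
    # so the output always lands between min_rating and max_rating
    return torch.sigmoid(res) * (max_rating - min_rating) + min_rating
```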
So we can create that in the same way as before you'll see here 01:09:50.960 |
I'm calling dot CUDA to put it on the GPU because we're not using any learner stuff normally that'll happen for you 01:09:57.400 |
But we have to manually say put it on the GPU 01:09:59.780 |
This is the same as before create our optimizer 01:10:02.600 |
Fit exactly the same as before and these numbers are looking good and again. We'll do a little 01:10:10.760 |
Change to our learning rate and learning rate schedule, and we're down to 0.8. So we're actually pretty close 01:10:35.520 |
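(Putting it together, the training setup described here might look roughly like this. fit and data refer to the course's old fast.ai training-loop helpers, and the learning rate / weight decay values are illustrative assumptions rather than the notebook's exact numbers.)

```python
import torch.nn.functional as F
from torch import optim

# no Learner here, so we move the model to the GPU ourselves
model = EmbeddingDotBias(n_users, n_movies).cuda()
opt = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-5, momentum=0.9)

# fit() is the course's fast.ai helper: fit(model, data, n_epochs, optimizer, loss_fn)
fit(model, data, 3, opt, F.mse_loss)
```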
And Yannet reminded me of an important point, which is that this is not 01:10:40.840 |
strictly speaking a matrix factorization, because strictly speaking a matrix factorization would take that whole matrix — 01:11:10.080 |
right, where the original was empty we put in a zero — 01:11:16.880 |
and you can't skip those with normal matrix factorization; normal matrix factorization recreates the whole matrix 01:11:25.600 |
That was the problem when people used to try and use traditional linear algebra for this, because in practice these are sparse matrices. 01:11:33.720 |
This particular matrix doesn't have many gaps, because we picked the users that watch the most movies and the movies that are the most 01:11:40.980 |
watched, but if you look at the whole matrix, it's mainly empty. And so traditional 01:11:46.200 |
techniques treated empty as zero, and so you basically had to predict a zero, 01:11:52.440 |
As if the fact that I haven't watched a movie means I don't like the movie that gives terrible answers 01:11:57.740 |
So this probabilistic matrix factor ization approach 01:12:02.880 |
takes advantage of the fact that our data structure 01:12:09.720 |
Rather than that cross tab right and so it's only calculating the loss for the user ID movie ID 01:12:16.000 |
combinations that actually appear — so, like, for this user ID and movie ID combination the prediction should be three; 01:12:21.880 |
it's actually three and a half, so our loss is point five. There's nothing here that's ever going to calculate a 01:12:29.280 |
Prediction or a loss for a user movie combination that doesn't appear in this table 01:12:33.680 |
By definition, the only stuff that can appear in a mini-batch is what's in this table. 01:12:39.720 |
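(In code terms, a sketch of the point being made — the loss only ever touches observed (user, movie, rating) rows; the helper name here is hypothetical:)

```python
import torch.nn.functional as F

def batch_loss(model, users, movies, ratings):
    # users, movies and ratings are parallel tensors taken straight from the
    # ratings table, so every row is an observed combination; the empty cells
    # of the full user x movie cross-tab never show up here at all
    preds = model(users, movies)
    return F.mse_loss(preds, ratings)
```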
And like a lot of this happened interestingly enough actually in the Netflix prize 01:12:52.320 |
This probabilistic matrix factorization had actually already been invented, but nobody had noticed 01:12:59.440 |
Alright, and then in the first year of the Netflix prize 01:13:01.960 |
Someone wrote this like really really famous blog post where they basically said like hey check this out 01:13:08.040 |
Incredibly simple technique works incredibly well and suddenly all the Netflix leaderboard entries work much much better 01:13:15.560 |
And so you know that's quite a few years ago now, and this is like now 01:13:19.960 |
Every collaborative filtering approach does this not every collaborative filtering approach adds this sigmoid thing by the way. It's not like 01:13:29.000 |
Rocket science this is this is not like the NLP thing we saw last week 01:13:32.760 |
Which is like hey, this is a new state-of-the-art like this is you know not particularly uncommon 01:13:37.100 |
But there are still people that don't do this and it definitely helps a lot right to have this and so 01:13:42.600 |
Actually you know what we could do is maybe now's a good time to have a look at the definition of this right so 01:13:58.040 |
and we can now compare this to the thing we originally used which was 01:14:04.560 |
whatever came out of CollabFilterDataset. All right, so let's go to 01:14:17.640 |
Get learner right so we can go down to get learner and that created a collab filter learner 01:14:25.800 |
passing in the model from get_model. So it created an EmbeddingDotBias, and 01:14:37.080 |
you can see here — here it is — it's the same thing: there's the embedding for each of the things, 01:14:43.040 |
here's our forward that does the u times i dot sum 01:14:56.840 |
It's a little shorter and easier because we're taking advantage of the fact that there's a special 01:15:08.360 |
so we're actually getting passed in the users and the items, and we don't have to pull them out of cats and conts 01:15:17.440 |
So hopefully you can see like the fast AI library is not some inscrutable code containing concepts 01:15:23.120 |
You can never understand. We've actually just built up this entire thing from scratch ourselves 01:15:35.720 |
You know, I think it's simply because we used stochastic gradient descent with restarts and a cycle multiplier and an Adam optimizer 01:15:50.440 |
So I'm looking at this and thinking that we could totally improve this model, maybe by 01:15:57.800 |
looking at the date and doing some tricks with the date — since this is kind of just a regular 01:16:04.400 |
kind of model in a way, could you add more features? Yeah, exactly — so now that you've seen this, 01:16:11.760 |
you could now, you know — even if you hadn't written 01:16:16.200 |
EmbeddingDotBias in a notebook yourself — take some other model that's in fast AI, look at it in fast AI, and 01:16:22.440 |
Be like oh that does most of the things that I'd want to do, but it doesn't deal with time and so you could just go 01:16:28.680 |
Oh, okay. Let's grab it. Copy it. You know pop it into my notebook and 01:16:36.920 |
Right, and then you can start playing that and you can now create your own 01:16:48.080 |
Yeah — that mentions a couple of things we could do. We could try incorporating timestamps, so we could assume that maybe 01:16:57.000 |
For a particular user over time users tend to get more or less positive about movies 01:17:02.640 |
Also remember there was the list of genres for each movie. Maybe we could incorporate that 01:17:09.600 |
So one problem is it's a little bit difficult to incorporate that stuff 01:17:15.600 |
into this EmbeddingDotBias model, because it's pretty custom. Right, so what we're going to do next is 01:17:35.440 |
Take exactly the same thing as we had before here's our list of users 01:17:49.200 |
Embeddings right and so as you can see I've just kind of transposed 01:17:52.600 |
the movie ones so that they're all in the same orientation, 01:18:00.680 |
but not cross-tabbed — okay, so in the original format, where each row is a user, movie, rating. 01:18:18.040 |
I need to map each ID to a contiguous index, right, and I can do that in Excel using this MATCH function that basically says, 01:18:25.400 |
you know, how far down this list do you have to go — and it said, 01:18:32.880 |
okay, user 29 was the second thing in that list, and so forth, okay? 01:18:37.920 |
So this is the same as that thing that we did 01:18:42.040 |
In our Python code where we basically created a dictionary to map this stuff 01:18:46.640 |
So now we can for this particular user movie rating 01:18:56.960 |
Right and so you can see here what it's doing is it saying all right. Let's basically offset 01:19:07.720 |
And the number of rows we're going to go down is equal to the user index and the number of columns 01:19:13.960 |
One two three four five okay, and so you can see what it does is it creates point one nine point six three point three one 01:19:19.960 |
Here it is point one nine, okay, so so this is literally 01:19:32.200 |
one hot encoding right because if instead this was a 01:19:37.280 |
Vector containing one zero zero zero zero zero right, and we multiplied that by this matrix 01:19:44.440 |
Then the only row it's going to return would be the first one okay, so 01:19:49.960 |
So it's really useful to remember that embedding 01:19:56.680 |
The only reason it exists the only reason it exists is because this is an optimization 01:20:03.200 |
You know this lets PyTorch know like okay. This is just a matrix multiply 01:20:08.480 |
But I guarantee you that you know this thing is one hot encoded 01:20:13.360 |
Therefore you don't have to actually do the matrix multiply you can just do a direct look up 01:20:17.540 |
Okay, so that's literally all an embedding is: it is a computational 01:20:22.960 |
performance trick for a particular kind of matrix multiply. All right, so that looks up that user's user embedding, 01:20:31.320 |
and then we can look up that user's movie embedding. All right, so here is movie ID 01:20:39.360 |
417 which apparently is index number 14 here. It is here, so it should have been point seven five point four seven 01:20:46.200 |
Yes, it is point seven five point four seven, okay, so we've now got the user embedding and the movie embedding 01:21:03.920 |
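(Here's a small runnable check of that claim — an embedding lookup gives exactly the same answer as multiplying a one-hot vector by the embedding's weight matrix:)

```python
import torch
import torch.nn as nn

emb = nn.Embedding(5, 3)            # 5 "users", 3 factors
idx = torch.tensor([2])

# direct lookup
lookup = emb(idx)

# the same thing expressed as a one-hot vector times the weight matrix
one_hot = torch.zeros(1, 5)
one_hot[0, idx] = 1.0
matmul = one_hot @ emb.weight

print(torch.allclose(lookup, matmul))   # True
```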
Instead, what if we concatenate the two together into a single, longer vector? 01:21:23.280 |
tensor of input activations or in this case a tensor of 01:21:29.320 |
Actually, this is a tensor of output activations. This is coming out of an embedding layer 01:21:32.840 |
We can chuck it in a neural net because neural nets we now know can calculate 01:21:39.040 |
Anything okay including hopefully collaborative filtering, so let's try that 01:21:51.880 |
This time I have not bothered to create a separate bias term, because a 01:22:01.760 |
linear layer in PyTorch already has a bias in it. Right, so when we go 01:22:20.520 |
U matrix right and this is the number of users and this is the 01:22:32.480 |
All right, so here's our number of movies and here's our again number of factors all right, and so remember we look up a 01:22:48.440 |
We look up a single movie and let's grab them and concatenate them together 01:22:55.720 |
Right so here's like the user part. Here's the movie part and then let's put that 01:23:04.600 |
Right, so the number of rows here is going to have to be the number of user factors plus the number of movie factors 01:23:21.600 |
Because we're going to take that so in this case. We're going to pick 10 apparently so it's picked 10 and then we're going to 01:23:35.000 |
Matrix, which obviously needs to be of size 10 here 01:23:38.440 |
And then the number of columns is a size 1 because we want to predict a single rating 01:23:49.760 |
Okay, and so that's our kind of flow chart of what's going on right it is a standard 01:23:56.920 |
It's what I'd call a one-hidden-layer neural net — it depends how you think of it; there's kind of an embedding layer, 01:24:03.820 |
But because this is linear and this is linear the two together is really one linear layer, right? It's just a computational convenience 01:24:11.600 |
So it's really got one hidden layer because it's just got one layer before this nonlinear activation 01:24:22.460 |
Linear layer with some number of rows and some number of columns you just go and end up in here 01:24:33.560 |
We learned how to create a linear layer from scratch by creating our own weight matrix and our own biases, 01:24:40.680 |
So if you want to check that out you can do so there right, but it's the same basic technique. We've already seen 01:24:49.240 |
We create our embeddings we create our two linear layers 01:24:53.240 |
That's all the stuff that we need to start with you know really if I wanted to make this more general 01:24:59.120 |
I would have had another parameter here called like 01:25:07.640 |
equals 10 and then this would be a parameter and 01:25:13.080 |
Then you could like more easily play around with different numbers of activations 01:25:17.400 |
So when we say like okay in this layer. I'm going to create a layer with this many activations all I mean 01:25:26.360 |
My linear layer has how many columns in its weight matrix. That's how many activations it creates 01:25:33.040 |
All right, so we grab our users and movies we put them through our embedding matrix, and then we concatenate them together 01:25:43.560 |
concatenates them together on dimension 1 — so in other words we concatenate the columns together to create longer rows. 01:25:50.840 |
Okay, so that's concatenating on dimension one 01:25:53.920 |
Dropout we'll come back to in a moment — we've looked at that briefly. 01:25:59.880 |
So then having done that we'll put it through that linear layer we had 01:26:07.440 |
We'll do our relu, and you'll notice that relu is again inside our capital F, torch.nn.functional — 01:26:15.120 |
It's just a function so remember activation functions are basically things that take one activation in and spit one activation out 01:26:23.320 |
in this case take in something that can have negatives or positives and 01:26:27.440 |
Truncate the negatives to zero. That's all relu does 01:26:39.980 |
Neural network. I don't know if we get to call it deep. It's only got one hidden layer 01:26:44.880 |
But it's definitely a neural network right and so we can now construct it we can put it on the GPU 01:26:50.540 |
We can create an optimizer for it, and we can fit it 01:26:54.360 |
Now you'll notice there's one other thing. I've been passing to fit which is 01:26:59.440 |
What loss function are we trying to minimize? 01:27:02.480 |
Okay, and this is the mean squared error loss and again. It's inside F 01:27:06.260 |
Okay, pretty much all the functions are inside it, okay? 01:27:11.720 |
One of the things that you have to pass fit is something saying how to score it — what counts as good or bad? 01:27:22.360 |
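(A hedged sketch of the whole EmbeddingNet just described — embeddings, concatenation, dropout, one hidden linear layer with relu, one output — plus the training call with F.mse_loss passed as the loss. Defaults such as nh=10 and the dropout probabilities are illustrative; fit and data again refer to the course's helpers, and the sigmoid rescaling discussed just below can be applied on top.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

class EmbeddingNet(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50, nh=10, p1=0.05, p2=0.5):
        super().__init__()
        (self.u, self.m) = [get_emb(*o) for o in
                            [(n_users, n_factors), (n_movies, n_factors)]]
        # input size = user factors + movie factors (concatenated); nh activations out
        self.lin1 = nn.Linear(n_factors * 2, nh)
        self.lin2 = nn.Linear(nh, 1)        # one output: the predicted rating
        self.drop1 = nn.Dropout(p1)
        self.drop2 = nn.Dropout(p2)

    def forward(self, users, movies):
        # look up both embeddings and glue the columns together (dim=1)
        x = torch.cat([self.u(users), self.m(movies)], dim=1)
        x = self.drop1(x)
        x = F.relu(self.lin1(x))            # relu: truncate negatives to zero
        x = self.drop2(x)
        return self.lin2(x).squeeze()

model = EmbeddingNet(n_users, n_movies).cuda()
opt = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
# the thing to minimise is passed explicitly: mean squared error from F
fit(model, data, 3, opt, F.mse_loss)
```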
In a real neural net, do we have to use the same embedding size for users and movies? That's a great question — you don't, no. 01:27:30.440 |
That's absolutely right, you don't, and so we've got a lot of benefits here, right? Because if 01:27:39.560 |
You know we're grabbing a user embedding we're concatenating it with a movie embedding which maybe is like I don't know some different size 01:27:51.960 |
but then also perhaps we looked up the genre of the movie — and, you know, there are actually 01:28:05.480 |
three or something — and so we could then concatenate a genre embedding, and then maybe the timestamp is in here as a continuous 01:28:12.780 |
Number right and so then that whole thing we can then feed into 01:28:27.120 |
Remember our final non-linearity was a sigmoid right so we can now 01:28:31.080 |
recognize that this thing we did — where we did sigmoid times (max rating minus min rating) plus min rating — is just another 01:28:39.360 |
Nonlinear activation function right and remember in our last layer 01:28:44.520 |
We use generally different kinds of activation functions 01:28:48.200 |
So as we said we don't need any activation function at all right we could just do 01:28:57.040 |
But by not having any nonlinear activation function, we're just making it harder, so that's why we put the sigmoid in there as well, okay 01:29:10.940 |
There we go you know interestingly we actually got a better score than we did with our 01:29:22.000 |
So it'll be interesting to try training this with stochastic gradient descent with restarts and see if it's actually better 01:29:27.480 |
You know maybe you can play around with the number of hidden layers and the dropout and whatever else and see if you can 01:29:34.200 |
Come up with you know get a better answer than 01:29:47.840 |
Okay, so in general — this is like, if you were going deep into collaborative filtering at your workplace 01:29:55.560 |
Or whatever this wouldn't be a bad way to go like it's like I'd start out with like oh, okay 01:29:59.840 |
here's like a collaborative filtering data set in fast AI, 01:30:02.800 |
get_learner — there's, you know, not much I can pass it; basically the number of factors is about the only thing that I pass in. 01:30:09.320 |
I can learn for a while maybe try a few different approaches, and then you're like okay. There's like 01:30:19.200 |
Okay, how do I make it better, and then I'd be like digging into the code and saying like okay? 01:30:24.600 |
What did Jeremy actually do here? This is actually what I want, you know — and then poke around in it. 01:30:29.880 |
So one of the nice things about the neural net approach 01:30:41.640 |
we can choose how many hidden activations, and we can also choose how much dropout. 01:30:48.400 |
So what we're actually doing is, we haven't just got a relu there — we're also going, like, okay, let's 01:31:03.820 |
add dropout. All right, that's dropout — in this case we were deleting 01:31:14.240 |
75% of them, all right, and then after the second linear layer 75% of them — so we can add a whole lot of regularization. 01:31:19.960 |
Yeah, so you know this it kind of feels like the this this embedding net 01:31:25.080 |
You know you could you could change this again. We could like have it so that we can pass into the constructor 01:31:32.360 |
Well, if we wanted to make it look as much as possible like what we had before, we could pass in p's — 01:31:47.160 |
I'm not sure this is the best API, but it's not terrible 01:31:50.700 |
Probably, since we've only got exactly two dropout layers, we could say p1 equals 0.75 and 01:32:14.440 |
p2 equals something — you know, there we go. And if you wanted to go further, like the 01:32:25.240 |
structured data learner, you could actually have a thing — this number of hidden layers — 01:32:31.800 |
you know, maybe you could make it a list, and so then rather than creating exactly one 01:32:38.800 |
hidden layer and one output layer, this could be a little loop that creates n 01:32:43.440 |
hidden layers, each one of the size you want — that's all stuff you can play with during the week if you want to; there's a sketch of that idea below. 01:32:49.120 |
And I feel like if you've got a much smaller collaborative filtering data set, 01:32:55.620 |
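(Here's a minimal sketch of that generalisation — a hypothetical module that builds its hidden layers from a list of sizes and dropout probabilities; the class and argument names are made up for illustration.)

```python
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingMLP(nn.Module):
    """Hypothetical variant: pass a list of hidden sizes and matching dropouts."""
    def __init__(self, n_in, hidden=[10], drops=[0.75]):
        super().__init__()
        sizes = [n_in] + hidden
        # one linear layer per consecutive pair of sizes; drops must match hidden in length
        self.lins = nn.ModuleList(
            [nn.Linear(a, b) for a, b in zip(sizes, sizes[1:])])
        self.drops = nn.ModuleList([nn.Dropout(p) for p in drops])
        self.out = nn.Linear(sizes[-1], 1)

    def forward(self, x):
        for lin, drop in zip(self.lins, self.drops):
            x = drop(F.relu(lin(x)))
        return self.out(x)
```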
You know maybe you'd need like more regularization or whatever. It's a much bigger one 01:33:00.560 |
Maybe more layers would help. I don't know you know I haven't seen 01:33:07.200 |
Much discussion of this kind of neural network approach to collaborative filtering 01:33:10.880 |
But I'm not a collaborative filtering expert, so maybe it's maybe it's around, but that'd be interesting thing to try 01:33:16.640 |
So the next thing I wanted to do was to talk about 01:33:26.120 |
The training loop, so what's actually happening inside the training loop? 01:33:37.600 |
the actual updating of the weights to PyTorch's optimizer — 01:33:47.640 |
what that optimizer is actually doing — and I also want to understand what this momentum term is doing. 01:34:01.400 |
So I've got a spreadsheet called graddesc — gradient descent — 01:34:04.560 |
and it's kind of designed to be read left to right — sorry, right to left — worksheet-wise. 01:34:12.560 |
Is some data right and we're going to implement gradient descent in Excel because obviously everybody wants to do deep learning in Excel and we've done 01:34:24.080 |
Convolutions in Excel so now we need SGD in Excel so we can replace Python once and for all okay, so 01:34:31.360 |
Let's start by creating some data right and so here's 01:34:38.560 |
independent variable — you know, I've got one column of x's — and one column 01:34:44.720 |
of y's, and these are actually directly linearly related; this is just randomly generated data. 01:35:17.440 |
So let's start with the most basic version of SGD 01:35:21.800 |
And so the first thing I'm going to do is I'm going to run a macro so you can see what this looks like 01:35:26.520 |
So I hit run, and it does five epochs. I do another five epochs to another five epochs 01:35:36.280 |
The first one was pretty terrible. It's hard to see so I just delete that first one get better scaling 01:35:44.360 |
All right, so you can see actually it's pretty constantly improving the loss right. This is the loss per epoch 01:35:52.360 |
All right, so how do we do that? So let's reset it 01:36:05.000 |
Some intercept and some slope right so this is my randomly initialized weights 01:36:13.320 |
So I have randomly initialized them both to one 01:36:16.120 |
You could pick a different random number if you like, but I promise that I randomly picked the number one 01:36:30.560 |
So here is my intercept and slope. I'm just going to copy them over here right so you can literally see this is just equal 01:36:39.800 |
Here is equals C2. Okay, so I'm going to start with my very first row of data x equals 40 and y equals 58 01:36:50.440 |
After I look at this piece of data. I want to come up with a slightly better intercept and a slightly better slope 01:36:58.840 |
So to do that I need to first of all basically figure out 01:37:04.920 |
Which direction is is down in other words if I make my intercept a little bit higher 01:37:11.360 |
Or a little bit lower would it make my error a little bit better or a little bit worse? 01:37:16.200 |
So let's start out by calculating the error so to calculate the error the first thing we need is a prediction 01:37:35.560 |
And so here is our error: it's equal to our prediction minus our actual, squared. 01:37:41.840 |
So we could like play around with this. I don't want my error to be 1849. I'd like it to be lower 01:37:53.040 |
1849 goes to 1840 okay, so a higher intercept would be better 01:37:57.360 |
Okay, what about the slope if I increase that? 01:38:03.720 |
To 1730 okay a higher slope would be better as well 01:38:18.480 |
So one thing we could encode in the spreadsheet is to do literally what I just did: 01:38:22.200 |
It's to add a little bit to the intercept and the slope and see what happens 01:38:25.520 |
And that's called finding the derivative through finite differencing right and so let's go ahead and do that 01:38:31.920 |
So here is the value of my error if I add 0.01 to 01:38:40.480 |
my intercept, right — so it's C4 plus 0.01 — and then I just put that into my linear function, 01:38:46.680 |
and then I subtract my actual, all squared, right. And so that causes my error to go down a bit. That's from increasing — 01:38:58.000 |
which one is that? — increasing C4: increasing the intercept a little bit has caused my error to go down. 01:39:03.900 |
So what's the derivative? Well, the derivative is equal to how much the dependent variable changed by, 01:39:10.100 |
Divided by how much the independent variable changed by right and so there it is right our 01:39:16.040 |
Dependent variable changed by that minus that 01:39:18.600 |
Right and our independent variable we changed by 0.01 01:39:28.080 |
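(The same finite-difference estimate in a few lines of Python, using the transcript's x = 40, y = 58 and a = b = 1 — the spreadsheet's own numbers may differ:)

```python
def loss(a, b, x, y):
    # squared error of a single prediction a + b*x against the actual y
    return (a + b * x - y) ** 2

a, b, x, y, eps = 1.0, 1.0, 40.0, 58.0, 0.01

base = loss(a, b, x, y)
# estimated derivative = (change in the loss) / (change in the parameter)
d_da = (loss(a + eps, b, x, y) - base) / eps
d_db = (loss(a, b + eps, x, y) - base) / eps
print(d_da, d_db)   # both negative; the slope's is ~40x bigger since x multiplies the slope
```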
All right, so remember when people talking about derivatives 01:39:31.200 |
This is this is all they're doing is they're saying what's this value? 01:39:35.120 |
But as we make this number smaller and smaller and smaller and smaller as it as it limits to zero 01:39:42.720 |
I'm not smart enough to think in terms of like derivatives and integrals and stuff like that so whatever I think about this 01:39:49.360 |
I always think about you know an actual like plus point oh one divided by point oh one because like I just find that 01:39:55.960 |
Easier just like I never think about probability density functions. I always think about actual probabilities about toss a coin 01:40:05.480 |
So I always think like remember. It's it's totally fair to do this because a computer is 01:40:11.240 |
Discrete it's not continuous like a computer can't do anything infinitely small anyway, right? 01:40:17.880 |
So it's actually got to be calculating things at some level of precision right and our brains kind of need that as well 01:40:25.920 |
So this is like my version of Geoffrey Hinton's trick — to visualize things in more than two dimensions, 01:40:32.000 |
You just like say twelve dimensions really quickly while visualizing it in two dimensions 01:40:35.860 |
This is my equivalent you know to think about derivatives. Just think about division 01:40:41.920 |
And like although all the mathematicians say no you can't do that 01:40:46.120 |
You actually can like if you think of dx dy is being literally you know changing x over changing y like 01:40:54.200 |
The division actually like the calculations still work like all the time, so 01:40:59.080 |
Okay, so let's do the same thing now with changing 01:41:06.160 |
And so here's the same thing right and so you can see both of these are negative 01:41:10.560 |
Okay, so that's saying if I increase my intercept my loss goes down if I increase my slope my loss goes down 01:41:24.180 |
with respect to my slope is is actually pretty high and that's not surprising because 01:41:35.400 |
You know the constant term is just being added whereas the slope is being multiplied by 40 01:41:46.800 |
Finite differencing is all very well and good, but there's a big problem with finite differencing in 01:41:51.920 |
high dimensional spaces, and the problem is this, right. And this is like — you don't so much need to know 01:42:00.760 |
how to calculate derivatives or integrals, but you do need to learn how to think about them spatially, right. And so remember, 01:42:10.160 |
we've got a vector — a very high dimensional vector, it's got like a million items in it, right — going into 01:42:18.560 |
some weight matrix, right, of size like 1 million by a hundred thousand or whatever, and it's spitting out something of size 100,000. 01:42:30.560 |
So you need to realize there isn't like a single gradient here — for every one of these things in this vector, 01:42:49.040 |
there's not a single gradient number, not even a gradient vector, but a whole matrix of them. 01:43:03.880 |
I would literally have to like add a little bit to this and see what happens to all of these 01:43:08.920 |
Add a little bit to this see what happens to all of these right to fill in 01:43:13.800 |
one column of this at a time, so that's going to be 01:43:17.660 |
Horrendously slow like that that so that's why like if you're ever thinking like oh we can just do this with finite differencing 01:43:24.720 |
Just remember like okay. We're dealing in the with these very high dimensional vectors where 01:43:33.880 |
Matrix calculus like all the concepts are identical 01:43:39.760 |
But when you actually draw it out like this you suddenly realize like okay for each number I could change 01:43:45.760 |
There's a whole bunch of numbers that impacts and I have this whole matrix of things to compute right and so 01:43:52.040 |
Your gradient calculations can take up a lot of memory, and they can take up a lot of time 01:44:03.640 |
And it's definitely well worth like spending time 01:44:11.360 |
you know, getting comfortable with the idea of these gradients — look up things like Jacobian and Hessian; 01:44:22.800 |
they're the things that you want to search for to start. 01:44:26.480 |
unfortunately people normally write about them with you know lots of Greek letters and 01:44:34.000 |
Blah blah blahs right, but there are some there are some nice 01:44:38.200 |
You know intuitive explanations out there, and hopefully you can share them on the forum if you find them because this is stuff 01:44:52.400 |
that matters when you're trying to train something and it's not working properly — and later on we'll learn how to look inside 01:44:52.400 |
Pytorch to like actually get the values of the gradients, and you need to know like okay 01:45:02.560 |
Well, how would I like plot the gradients you know? 01:45:05.640 |
What would I consider unusual like you know these are the things that turn you into a really awesome? 01:45:11.040 |
deep learning practitioner is when you can like debug your problems by like 01:45:15.760 |
grabbing the gradients and doing histograms of them, and knowing, you know, that you could plot — for each layer — whether my 01:45:22.340 |
average gradients are getting smaller or, you know, bigger, or whatever. 01:45:26.160 |
Okay, so the trick to doing this more quickly is to do it analytically 01:45:33.200 |
rather than through finite differencing. And analytically basically means there is a list — you probably all learned it at high school — 01:45:41.440 |
there is literally a list of rules saying, for every 01:45:44.360 |
mathematical function, this is the derivative of that function. Right, so the derivative of x squared is 01:45:59.480 |
2x, right — and we actually have here an x squared. 01:46:11.920 |
I don't need you to know any of the individual rules, but I do want you to know the chain rule. 01:46:19.520 |
You've got some function of some function of something 01:46:26.920 |
I don't know — that's a linear layer, that's a relu, right, and 01:46:31.320 |
Then we can kind of keep going backwards, right? 01:46:40.080 |
Just a function of a function of a function of a function where the innermost is you know it's basically linear 01:46:58.360 |
All right, and so it's a function of a function of a function and so therefore to calculate the derivative of 01:47:08.520 |
The loss of your model with respect to the weights of your model 01:47:12.440 |
You're going to need to use the chain rule and 01:47:14.440 |
Specifically whatever layer it is that you're up to like I want to calculate the derivative here 01:47:21.760 |
All of these ones because that's all that's that's the function that's being applied 01:47:25.680 |
right and that's why they call this back propagation because the value of the derivative of 01:47:37.800 |
Now basically you can do it like this: you can say, let's call u 01:47:41.520 |
this inner bit, right — let's call that u — then the derivative is simply equal to the derivative with respect to u, times the 01:47:54.320 |
derivative of u itself, right — you just multiply them together. And 01:47:58.040 |
So that's what back propagation is like it's not that back propagation is a new thing for you to learn 01:48:09.240 |
Take the derivative of every one of your layers and 01:48:13.480 |
multiply them all together so like it doesn't deserve a new name right apply the chain rule to my layers 01:48:21.920 |
does not deserve a new name, but it gets one because us neural networks folk really need to seem as clever as possible 01:48:29.520 |
It's really important that everybody else thinks that we are way outside of their capabilities 01:48:34.920 |
Right so the fact that you're here means that we failed because you guys somehow think that you're capable 01:48:40.920 |
Right so remember. It's really important when you talk to other people that you say back propagation and 01:48:46.640 |
Rectified linear unit rather than like multiply the layers 01:48:51.360 |
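(To see that backpropagation really is just the chain rule, here's a tiny check: multiply the per-layer derivatives of a linear → relu → squared-error chain by hand and compare with PyTorch's autograd.)

```python
import torch

x = torch.tensor(3.0)
w = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(10.0)

# f(g(x)): inner linear function, then relu, then squared error
lin = w * x                       # u = w*x
act = torch.relu(lin)             # relu(u)
loss = (act - y) ** 2

loss.backward()                   # autograd's version of the chain rule

# chain rule by hand: dloss/dact * dact/dlin * dlin/dw
dloss_dact = 2 * (act.item() - y.item())      # derivative of the square
dact_dlin = 1.0 if lin.item() > 0 else 0.0    # derivative of relu
dlin_dw = x.item()                            # derivative of w*x w.r.t. w
print(w.grad.item(), dloss_dact * dact_dlin * dlin_dw)   # both -24.0
```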
gradients, or replace negatives with zeros. Okay, so here we go — I've just gone ahead and 01:48:59.480 |
grabbed the derivative. Unfortunately there is no automatic differentiation in Excel yet, 01:49:05.920 |
So I did the alternative which is to paste the formula into Wolfram Alpha and got back the derivatives 01:49:12.120 |
So there's the first derivative, and there's the second derivative 01:49:18.240 |
It's a tiny, tiny neural network, so we don't have to worry about the chain rule, 01:49:22.880 |
and we should see that this analytical derivative is pretty close to our estimated derivative from the finite differencing and 01:49:29.920 |
Indeed it is right and we should see that these ones are pretty similar as well, and indeed they are right 01:49:38.680 |
When I implemented my own neural nets 20 years ago, I 01:49:42.560 |
You know had to actually calculate the derivatives 01:49:45.600 |
And so I always would write like had something that would check the derivatives using finite differencing 01:49:50.800 |
And so for those poor people that do have to write these things by hand 01:49:54.080 |
You'll still see that they have like a finite differencing checker 01:49:58.280 |
So if you ever do have to implement a derivative by hand, please make sure that you 01:50:03.800 |
Have a finite differencing checker so that you can test it 01:50:14.080 |
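(A sketch of such a finite-differencing checker — here it compares autograd's gradient against a central-difference estimate; the function name is just for illustration.)

```python
import torch

def check_grad(f, w, eps=1e-4):
    """Compare autograd's gradient of a scalar f(w) with a finite-difference estimate."""
    w = w.clone().requires_grad_(True)
    f(w).backward()
    fd = torch.zeros_like(w)
    with torch.no_grad():
        for i in range(w.numel()):
            w_hi = w.detach().clone(); w_hi.view(-1)[i] += eps
            w_lo = w.detach().clone(); w_lo.view(-1)[i] -= eps
            # central difference: (f(w + eps) - f(w - eps)) / (2 * eps)
            fd.view(-1)[i] = (f(w_hi) - f(w_lo)) / (2 * eps)
    return torch.allclose(w.grad, fd, atol=1e-3)

# example: quadratic loss on a small weight vector (gradient should be 2*w)
print(check_grad(lambda w: (w ** 2).sum(), torch.randn(5)))   # True
```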
So we now know that if we increase b, then we're going to get a slightly better loss — so let's increase b by a bit. 01:50:22.760 |
Well we'll increase it by some multiple of this and the multiple 01:50:25.880 |
we're going to choose is called a learning rate, and so here's our learning rate — here's 1e-4. 01:50:42.040 |
So the new value is the old one minus the derivative times our learning rate — okay, so we've gone from 1 to 1.01, and 01:50:51.280 |
We've done the same thing so it's gone from 1 to 01:50:57.640 |
So this is a special kind of mini batch. It's a mini batch of size 1. Okay, so we call this online gradient descent 01:51:05.080 |
Just means mini batch of size 1 so then we can go on to the next one x is 86 01:51:11.440 |
Y is 202 right. This is my intercept and slope copied across from the last row 01:51:19.880 |
Okay, so here's my new y prediction. Here's my new error 01:51:29.480 |
All right, so we keep doing that for every mini batch of one 01:51:39.280 |
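(One of those size-1 mini-batch steps written out in Python, with the analytic derivatives of the squared error and a learning rate of 1e-4 as in the sheet:)

```python
lr = 1e-4

def sgd_step(a, b, x, y, lr):
    # analytic derivatives of (a + b*x - y)**2 w.r.t. a and b
    err = a + b * x - y
    da, db = 2 * err, 2 * err * x
    # step each parameter a little bit in the downhill direction
    return a - lr * da, b - lr * db

a, b = 1.0, 1.0                          # randomly initialized to 1, as promised
a, b = sgd_step(a, b, 40.0, 58.0, lr)    # one "mini-batch" of size 1
a, b = sgd_step(a, b, 86.0, 202.0, lr)   # the next row of data
```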
Okay, and so then at the end of an epoch we would grab 01:51:48.320 |
Paste them back over here. That's our new values 01:51:52.220 |
There we are and we can now continue again, right so we're now starting with 01:51:59.400 |
Whoops — I pasted that in the wrong spot; it should be 01:52:09.680 |
okay, so there's our new intercept, there's our new slope — possibly I've got those the wrong way around, 01:52:13.920 |
But anyway you get the idea and then we continue okay, so I recorded the world's tiniest macro 01:52:23.720 |
Copies the final slope and puts it into the new slope copies the final intercept puts it into the new intercept 01:52:37.720 |
It does that five times, and after each time it grabs the root mean squared error and pastes it into the next 01:52:43.840 |
Spare area and that is attached to this run button and so that's going to go ahead and do that five times 01:52:50.280 |
Okay, so that's stochastic gradient descent in Excel 01:52:55.240 |
So to turn this into a CNN, right, you would just replace 01:53:02.040 |
this error function — and therefore this prediction — with the output of that CNN. 01:53:11.000 |
Okay, and that then would be in CNN being trained with with SGD, okay 01:53:18.320 |
Now the problem is that you'll see when I run this 01:53:29.320 |
It's kind of going very slowly right we know that we need to get to a slope of 2 and an intercept of 30 01:53:43.680 |
It's like it keeps going in the same direction, so it's like — come on, take a hint, that's a good direction. 01:53:43.680 |
So that idea of "come on, take a hint, that's a good direction, please keep doing that but more" is called momentum. 01:53:55.080 |
It's the same thing, and to simplify this spreadsheet I've removed the finite differencing columns — okay, other than that 01:54:12.800 |
this is just the same, right, so we've still got our x's, our y's, 01:54:20.520 |
Our new calculation here for our new a term, just like before, is equal to whatever a was before, minus something times the learning rate. 01:54:40.880 |
Okay, now this time I'm not subtracting the derivative itself — I'm subtracting some other number times the learning rate. So what's this other number? 01:54:54.880 |
Okay, so this other number is equal to a linear interpolation 01:55:04.240 |
between this row's derivative — this mini-batch's derivative — and whatever this cell was for the previous row, 01:55:29.480 |
Right so in other words keep going the same direction as you were before 01:55:42.120 |
right then update it a little bit right and so in our 01:55:48.000 |
In our Python code just before, we had a momentum of 0.9. 01:55:52.120 |
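(A sketch of that momentum update — a linear interpolation between the previous step direction and the new gradient, as in the spreadsheet. The little loop fixes the slope at 1 just to keep the example tiny.)

```python
def momentum_step(param, grad, prev_step, lr=1e-4, beta=0.9):
    # keep going mostly in the direction we were already going,
    # nudged a little by the current mini-batch's gradient
    step = beta * prev_step + (1 - beta) * grad
    return param - lr * step, step

a, step_a = 1.0, 0.0
for x, y in [(40.0, 58.0), (86.0, 202.0)]:
    grad = 2 * (a + 1.0 * x - y)          # d/da of the squared error (slope held at 1 here)
    a, step_a = momentum_step(a, grad, step_a)
```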
Okay, so you can see what tends to happen is that our momentum term 01:55:58.160 |
kind of gets more and more negative, right, all the way up to like 2,000, 01:56:11.320 |
whereas our derivatives are kind of all over the place, right — sometimes negative, sometimes positive, a few hundred either way. 01:56:17.640 |
You know so this is basically saying like yeah, if you've been going 01:56:21.360 |
Down for quite a while keep doing that until finally here. It's like okay. That's that seems to be far enough 01:56:28.360 |
So that's getting less and less and less negative 01:56:30.360 |
Right and still we start going positive again 01:56:32.760 |
So you can kind of see why it's called momentum 01:56:35.120 |
It's like once you start traveling in a particular direction for a particular weight 01:56:39.680 |
You kind of the wheel start spinning and then once the gradient turns around the other way 01:56:45.040 |
It's like oh slow down. We've got this kind of momentum, and then finally turn back around 01:56:52.520 |
All right, we can do exactly the same thing, right, and after five iterations we're at 89, 01:56:52.520 |
whereas before, after five iterations, we were at 104. Right, and after a few more — let's do maybe 15 — 01:57:03.640 |
It's going right so it's it's it's a bit better. It's not heaps better. You can still see like 01:57:35.000 |
Zipping along right, but it's definitely an improvement and it also gives us something else to tune 01:57:40.920 |
Which is nice like so if this is kind of a well-behaved error surface right in other words like 01:57:46.480 |
Although it might be bumpy along the way. There's kind of some overall 01:57:51.200 |
direction — like, imagine you're going down a hill, right, and there are bumps on it — so the more momentum 01:57:57.980 |
you've got, the more you're going to skip over the tops of the bumps. Right, so we could say, okay, 01:58:01.220 |
Let's increase our beta up to point nine eight 01:58:03.220 |
Right and see if that like allows us to train a little faster and whoa look at that suddenly with 22 01:58:09.640 |
All right so one nice thing about things like momentum is it's like another parameter that you can tune to try and make your model 01:58:18.720 |
Basically everybody does this every like you look at any like image net winner or whatever they all use momentum 01:58:36.200 |
Back over here when we said use SGD that basically means use the the basic tab of our Excel spreadsheet 01:59:15.640 |
I actually was not right earlier on in this course when I said we've been using Adam by default. 01:59:22.500 |
We actually haven't — I've noticed that we've actually been using SGD with momentum by default — and the reason is that although Adam 01:59:33.420 |
is much faster — as you'll see, it's much, much faster to learn with — there have been some problems, 01:59:39.100 |
which are that people haven't been getting quite as good final answers with Adam as they have with SGD with momentum. 01:59:45.660 |
And that's why you'll see like all the you know image net winning 01:59:48.660 |
solutions and so forth, and all the academic papers, always use SGD with momentum and not Adam. 01:59:56.220 |
It seems to be a particular problem in NLP — people really haven't got Adam working well at all. 02:00:00.500 |
The good news is, it looks like this was solved two weeks ago. 02:00:08.760 |
Basically it turned out that the way people were dealing with the combination of weight decay and Adam was incorrect, 02:00:19.540 |
and that's kind of carried through to every single library, and 02:00:24.060 |
one of our students has actually just 02:00:27.580 |
completed a prototype of adding this new version of Adam, called AdamW, into fast AI, 02:00:34.420 |
and he's confirmed that he's getting both the faster 02:00:39.780 |
performance and also the better accuracy. So hopefully we'll have this AdamW in fast AI 02:00:47.780 |
Ideally before next week. We'll see how we go very very soon 02:00:51.020 |
So so it is worth telling you about about Adam 02:00:54.980 |
So let's talk about it, it's actually incredibly simple 02:01:00.760 |
But again, you know make sure you make it sound really complicated when you tell people so that you can look clever 02:01:07.180 |
So here's the same spreadsheet again, right and here's our 02:01:12.620 |
Randomly selected A and B again somehow it's still one. Here's our prediction. Here's our derivatives 02:01:20.020 |
Okay, so now how we calculating our new A and our new B 02:01:23.800 |
You can immediately see it's looking pretty hopeful because even by like row 10 02:01:30.460 |
We're like we're seeing the numbers move a lot more. Alright, so this is looking pretty encouraging 02:01:45.900 |
minus the learning rate times J8 — okay, so we're going to have to find out what that is — 02:01:55.580 |
divided by the square root of L8. Okay, so we're going to have to dig in and see what's going on. 02:02:00.340 |
One thing to notice here is that my learning rate is way higher than it used to be 02:02:08.540 |
Big number. Okay, so let's start out by looking and seeing what this J8 thing is 02:02:16.940 |
J8 is identical to what we had before: J8 is equal to the linear interpolation of the derivative and whatever J was last time. 02:02:32.300 |
So one part of Adam is to use momentum in the way we just defined. 02:02:37.300 |
Okay, the second piece was to divide by square root L8, what is that? 02:02:43.620 |
square root L8, okay is another linear interpolation of something and 02:02:50.260 |
Something else and specifically it's a linear interpolation of 02:02:55.340 |
F8 squared, okay. It's a linear interpolation of the derivative squared 02:03:02.660 |
Along with the derivative squared last time. Okay, so in other words, we've got two pieces of 02:03:18.740 |
momentum going on: one is calculating the momentum version of the gradient, the other is calculating the momentum version of the gradient squared. And 02:03:29.420 |
each of these is an exponentially weighted moving average — in other words, 02:03:33.540 |
It's basically equal to the average of this one and the last one and the last one and the last one, but we're like multiplicatively 02:03:40.380 |
decreasing the previous ones because we're multiplying it by 02:03:43.220 |
0.9 times 0.9 times 0.9 times 0.9. And so you actually see that for instance in the fast AI code 02:04:02.740 |
We don't just calculate the average loss, right? 02:04:10.860 |
We certainly don't just report the loss for every mini-batch, because that just bounces around so much. 02:04:16.700 |
So instead I say the average loss is equal to whatever the average loss was last time, times a number close to one, plus this mini-batch's loss times a little bit. 02:04:29.900 |
Right. So in other words the faster AI library 02:04:33.700 |
the thing that it does when you run the learning rate finder or plot the loss — 02:04:38.340 |
It's actually showing you the exponentially weighted moving average of the loss 02:04:43.260 |
Okay, so it's like a really handy concept. It appears quite a lot 02:04:48.180 |
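(A sketch of that exponentially weighted moving average of the loss — the 0.98 / 0.02 split is illustrative:)

```python
def smooth_losses(losses, beta=0.98):
    # exponentially weighted moving average: mostly the previous average,
    # plus a little bit of the newest value
    avg, out = 0.0, []
    for loss in losses:
        avg = beta * avg + (1 - beta) * loss
        out.append(avg)
    return out
```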
Right, the other handy concept to know about is this idea that you've got two numbers: 02:04:54.500 |
One of them is multiplied by some value. The other is multiplied by 1 minus that value 02:04:59.980 |
So this is a linear interpolation with two values. You'll see it all the time and 02:05:08.100 |
Deep learning people nearly always use the value alpha when they do this 02:05:12.380 |
So keep an eye out if you're reading a paper or something and you see alpha times blah blah blah, plus 1 minus alpha 02:05:21.380 |
times some other blah blah blah. Right — immediately, when people read papers, 02:05:27.100 |
None of us like read everything in the equation. We look at it. We go Oh linear interpolation 02:05:33.900 |
Right and I say something I was just talking to Rachel about yesterday is like 02:05:37.940 |
Whether we could start trying to find like a new way of writing papers where we literally refactor them 02:05:43.540 |
Right — like, it'd be so much better to have written it as something like "linear interpolation of 02:05:49.060 |
blah blah blah, comma, blah blah blah", right, because then you don't have to do that pattern recognition. But until we 02:05:55.940 |
Convinced the world to change how they write papers 02:05:58.540 |
This is what you have to do is you have to look you know 02:06:01.340 |
Know what to look for right and once you do suddenly the huge page with formulas 02:06:07.420 |
Aren't aren't bad at all like you often notice like for example the two things in here like they might be totally identical 02:06:14.900 |
But this might be at time t and this might be at like time t minus 1 or something 02:06:19.060 |
Right like it's very often these big ugly formulas turn out to be 02:06:23.100 |
really, really simple — if only they had refactored them. 02:06:28.580 |
So what are we doing with this gradient squared? 02:06:31.100 |
So what we were doing with the gradient squared is 02:06:35.860 |
We were taking the square root, and then we were adjusting the learning rate by dividing the learning rate by that 02:06:51.740 |
We're taking the exponentially weighted moving average of a bunch of things that are always positive, 02:06:57.160 |
And then we're taking the square root of that 02:06:59.160 |
All right, so when is this number going to be high? 02:07:01.780 |
It's going to be particularly high if there's like one big 02:07:05.840 |
You know if the gradients got a lot of variation right so there's a high variance of gradient 02:07:11.760 |
then this g-squared thing is going to be a really high number, whereas if it's a constant 02:07:18.060 |
amount, right, it's going to be smaller. That's because when you square things, the big ones 02:07:25.020 |
jump out as much bigger, whereas if there wasn't much change then it's not going to be as big. So basically, 02:07:37.700 |
If our gradient is changing a lot now, what do you want to do if? 02:07:42.060 |
You've got something which is like first negative and then positive and then small and then high 02:07:49.660 |
Well you probably want to be more careful right you probably don't want to take a big step 02:07:55.100 |
Because you can't really trust it right so when the when the variance of the gradient is high 02:08:00.340 |
We're going to divide our learning rate by a big number 02:08:06.060 |
Whereas if it's a very similar kind of size all the time, then we probably feel pretty good about this step. 02:08:13.460 |
And so this is called an adaptive learning rate and like a lot of people will have this confusion about Adam 02:08:20.300 |
I've seen it on the forum actually where people are like, isn't there some kind of adaptive learning rate where somehow you're like setting different 02:08:27.220 |
Learning rates for different layers or something. It's like no not really 02:08:32.500 |
Right all we're doing is we're just saying like just keep track of the average of the squares of the gradients and use that 02:08:40.620 |
To adjust the learning rate, so there's still one learning rate 02:08:47.300 |
right, but effectively every parameter at every epoch is 02:08:52.340 |
kind of getting a bigger jump if the gradient's been pretty constant for that weight, and a smaller jump 02:09:00.760 |
Otherwise okay, and that's Adam. That's the entirety of Adam in 02:09:05.260 |
Excel right so there's now no reason at all why you can't 02:09:09.100 |
Train image net in Excel because we've got you've got access to all of the pieces you need 02:09:16.660 |
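(Here's the whole Adam step from the spreadsheet written as a few lines of Python — a momentum of the gradient, a momentum of the squared gradient, and a step divided by the square root of the latter. The real optimizer also applies bias correction, which is omitted here to match the Excel version; the epsilon just avoids dividing by zero.)

```python
import math

def adam_step(param, grad, m, v, lr=1.0, beta1=0.9, beta2=0.99, eps=1e-8):
    # exponentially weighted moving average of the gradient (momentum)
    m = beta1 * m + (1 - beta1) * grad
    # ...and of the squared gradient (how "jumpy" this parameter's gradient is)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # big, noisy gradients -> big sqrt(v) -> effectively a smaller step
    param = param - lr * m / (math.sqrt(v) + eps)
    return param, m, v
```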
Okay, that's not bad right five and we're straight up to 29 and 2 right so the difference between like 02:09:26.780 |
you know, standard SGD and this is huge — basically the key difference was that it figured out that it could take bigger steps. 02:09:46.020 |
There are two different momentum parameters — one is kind of the momentum for the gradient piece, the other is the momentum for the gradient squared piece — and 02:09:53.540 |
I think they're just called the betas — 02:09:59.300 |
I think when you want to change them in PyTorch there's a thing called betas, 02:10:02.840 |
which is just a tuple of two numbers you can change. 02:10:13.700 |
yeah, I think I understand this concept of you know when they 02:10:17.700 |
When a gradient is it goes up and down then you're not really sure 02:10:23.020 |
Which direction should should go so you should kind of slow things down therefore you subtract that gradient from the learning rate 02:10:30.300 |
So but how do you implement that how far do you go? 02:10:34.620 |
I guess maybe I missed something earlier on — do you set a number somewhere? We divide — 02:10:43.460 |
divide by the square root of the moving average of the gradient squared — so that's where we use it. Oh, 02:10:50.940 |
I'm sorry. Can you be a little more sure so d2 is the learning rate? Which is 1? Yeah m27 is 02:11:17.340 |
The new method that you just mentioned, which is in the process of getting implemented — AdamW? Yeah, AdamW — 02:11:25.060 |
how different is it from this? Okay, let's do that. So, 02:11:31.100 |
To understand Adam W. We have to understand weight decay 02:11:35.260 |
And maybe we'll learn more about that later. Let's see how we go now with weight decay 02:11:45.060 |
When you have lots and lots of parameters, like we do with, you know, most of the neural nets we train, 02:11:45.060 |
you very often have more parameters than data points, so, you know, regularization becomes important. 02:11:50.540 |
And we've learned how to avoid overfitting by using dropout right which randomly deletes some activations 02:12:06.460 |
In the hope that it's going to learn some kind of more resilient set of weights 02:12:13.740 |
There's another technique we can use, called weight decay or L2 regularization. 02:12:17.740 |
It's actually a kind of classic statistical technique, and the idea is that we take our loss function — 02:12:26.660 |
our squared error loss function — and we add an additional piece to it 02:12:36.740 |
to basically add the square of the weights, so we'd say plus the sum of the weights squared. 02:12:52.680 |
That's weight decay, or L2 regularization, and so the idea is that now 02:13:02.100 |
The the loss function wants to keep the weights small right because increasing the weights makes the loss worse and 02:13:09.380 |
So it's only going to increase the weights if the loss improves by more 02:13:15.180 |
than the amount of that penalty. And in fact, to make this proper weight decay, we then need some 02:13:21.140 |
multiplier here, right — so if you remember, back here we said weight decay wd equals 5e-4. 02:13:31.780 |
Okay, so to actually use the same weight decay, I would have to multiply by 5e-4 here. 02:13:37.380 |
So that's actually now the same weight decay. So if 02:13:46.500 |
you have a really high weight decay, it's going to set all the parameters to zero, 02:13:50.500 |
So it'll never over fit right because it can't set any parameter to anything 02:13:55.700 |
And so as you gradually decrease the weight decay a few more weights 02:14:01.500 |
Can actually be used right, but the ones that don't help much. It's still going to leave at zero or close to zero, right? 02:14:09.340 |
So that's what weight decay is: literally to change the loss function to add in this sum of the squared weights, times 02:14:23.420 |
some hyperparameter — the weight decay multiplier. 02:14:30.060 |
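(As a sketch, weight decay as just described — the ordinary loss plus wd times the sum of squared weights, with wd = 5e-4 as mentioned above; the helper name is hypothetical.)

```python
import torch.nn.functional as F

def loss_with_wd(model, preds, targets, wd=5e-4):
    # ordinary loss plus wd times the sum of squared weights: big weights
    # now make the loss worse unless they earn their keep
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return F.mse_loss(preds, targets) + wd * l2
```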
The problem is that if you put that penalty into the loss function, as I have here, 02:14:33.700 |
then it ends up in the moving average of gradients and the moving average of squares of gradients, 02:14:50.780 |
so if the gradient varies a lot we end up effectively decreasing the amount of weight decay, and if there's very little variation we end up increasing it. So we end up 02:15:00.140 |
penalizing parameters — you know, weights that are really high — 02:15:03.200 |
unless their gradient varies a lot, which is never what we intended, right? That's just not the plan at all. 02:15:12.060 |
So the trick with AdamW is we basically remove the weight decay from the gradient, 02:15:19.020 |
so it's not in the loss function, it's not in the g, not in the g squared, 02:15:23.100 |
and we move it so that instead it's added directly to the update: 02:15:30.260 |
when we update with the learning rate, it's added there instead. So in other words, 02:15:34.800 |
we would put the weight decay term in here, when we calculate the new a and the new b. 02:15:48.620 |
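(And a sketch of the AdamW variant of the step from earlier — the weight decay never enters the gradient moving averages; it's applied directly when the parameter is updated.)

```python
import math

def adamw_step(param, grad, m, v, lr=1.0, beta1=0.9, beta2=0.99,
               eps=1e-8, wd=1e-4):
    # note: grad here is the *plain* loss gradient, with no weight-decay term,
    # so wd never leaks into the moving averages m and v
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # weight decay is applied directly to the parameter in the update step
    param = param - lr * (m / (math.sqrt(v) + eps) + wd * param)
    return param, m, v
```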
That's a quick description, which will probably only make sense if you listen to it three or four times on the video and then talk about it — 02:15:56.820 |
yeah, but if you're interested, let me know and we can also look at the code that implements this. 02:16:03.620 |
And you know, the idea of using weight decay is that it's a really helpful technique, 02:16:15.020 |
Because it's basically this way that we can kind of say like 02:16:17.940 |
you know, please don't increase any of the weight values unless the improvement in the loss is worth it. 02:16:31.540 |
And so generally speaking pretty much all state-of-the-art models have both dropout and weight decay 02:16:38.220 |
And I don't claim to know how to set each one, or how much of each to use, 02:16:44.320 |
Other than to say like you it's worth trying both 02:16:52.220 |
Is there any way to interpret the final user and movie embeddings? Absolutely — we're going to look at that next week; it's super fun. 02:16:59.340 |
It turns out that, you know, we'll learn what some of the worst movies of all time are — 02:17:03.540 |
it's, um, one of the John Travolta Scientology ones, like Battlefield Earth or something. 02:17:11.820 |
I think that was like the worst movie of all time according to our embeddings. 02:17:19.420 |
Do you have any recommendations for scaling the L2 penalty or is that kind of based on how how wide the nodes or how many 02:17:29.100 |
I don't have a great suggestion at all — I kind of look for papers or Kaggle competitions or whatever that are similar and try to set it up frankly the 02:17:37.780 |
same. It seems like in a particular area — like computer vision object recognition — 02:17:44.620 |
somewhere between 1e-4 and 1e-5 seems to work, you know. 02:17:54.260 |
The authors point out that with this new approach it actually becomes like it seems to be much more stable 02:17:59.220 |
As to what the right weight decay amounts are so hopefully now when we start playing with it, we'll be able to have some 02:18:05.040 |
definitive recommendations by the time we get to part 2 02:18:14.420 |
You know practice the thing that you're least familiar with so if it's like Jacobians and Hessians read about those if it's broadcasting 02:18:21.300 |
Read about those if it's understanding Python. Oh read about that, you know, try and implement your own custom layers 02:18:27.560 |
Read the fast AI layers, you know and and talk on the forum about anything that you find 02:18:33.860 |
Weird or confusing? All right. See you next week