
Lesson 5: Deep Learning 2018


Chapters

0:00 Intro
6:15 Kaggle Seedlings Competition
7:48 Learning Objectives
9:43 MovieLens Data
11:22 XLS
12:24 Matrix Factorization
37:03 PyTorch
40:13 Forward
41:31 Embedding matrices
43:48 Creating a Python module
47:47 Basic initialization
50:13 Minibatching
52:13 Creating an optimizer
54:40 Writing a training loop
55:56 Fast AI

Whisper Transcript

00:00:00.000 | Welcome back
00:00:02.000 | So we had a busy lesson last week and
00:00:06.480 | I was really thrilled to see that one of our master's students here at USF
00:00:15.400 | actually took what we learned
00:00:22.720 | with structured deep learning and turned it into a blog post which, as I suspected, has been
00:00:29.200 | Incredibly popular because it's just something
00:00:31.200 | people didn't know about, and so it actually ended up getting picked up by the Towards Data Science publication
00:00:38.960 | which I quite like actually; if you're interested in keeping up with what's going on in data science, it's quite a good Medium publication
00:00:45.400 | and so Karen
00:00:47.680 | talked about
00:00:49.440 | structured deep learning and basically introduced
00:00:51.800 | you know that the basic ideas that we learned about last week and
00:00:57.920 | It got picked up quite quite widely one of the one of the things I was pleased to see actually
00:01:02.920 | Sebastian Ruder, who I actually mentioned in last week's class as being one of my favorite researchers,
00:01:07.240 | Tweeted it and then somebody from Stitch Fix said oh, yeah, we've actually been doing that for ages, which is kind of cute
00:01:16.640 | I kind of knew that this is happening in industry a lot, and I've been telling people this is happening in industry a lot
00:01:21.600 | But nobody's been talking about it
00:01:23.040 | And now Karen's kind of published a blog saying hey, check out this cool thing, and now Stitch Fix is like, yeah
00:01:28.640 | We're doing that already so
00:01:30.640 | So that's been great
00:01:33.240 | Great to see and I think there's still a lot more that can be dug into with this structured deep learning stuff
00:01:40.720 | You know, to build on top of Karen's post, you could maybe experiment with some different data sets
00:01:46.320 | Maybe find some old Kaggle competitions and see if there are some competitions that you could now win with this, or some which it doesn't work
00:01:54.240 | for, which would be equally interesting
00:01:56.240 | and also like just
00:01:58.400 | Experimenting a bit with different amounts of dropout different layer sizes, you know
00:02:02.480 | Because nobody much has written about this. I don't think there's been any blog posts about this before that. I've seen anywhere
00:02:11.160 | There's a lot of unexplored territory. So I think there's a lot we could we could build on top of here
00:02:17.360 | And there's definitely a lot of interest. I saw one person on Twitter saying this is what I've been looking for ages
00:02:23.380 | another thing which I was pleased to see is
00:02:27.200 | Nikhil, who we saw with his
00:02:30.280 | cricket versus baseball
00:02:32.280 | predictor as well as his currency predictor after lesson one,
00:02:35.880 | Went on to
00:02:39.680 | download something a bit bigger, which was to download a couple of hundred images of actors, and he manually
00:02:46.840 | went through and checked them. Well,
00:02:49.080 | I think first of all he used Google to try and find ones with glasses and ones without, then he manually went through and
00:02:54.040 | checked that they had been put in the right spot
00:02:56.040 | and this is a good example of one where
00:02:58.240 | vanilla ResNet didn't do so well with just the last layer
00:03:03.280 | And so what Nikhil did was he went through and tried unfreezing the layers and using differential learning rates and got up to
00:03:10.240 | 100% accuracy, and the thing I like about these things that Nikhil is doing is the way he's
00:03:15.840 | He's not downloading a Kaggle data set. He's like deciding on a problem that he's going to try and solve
00:03:21.560 | He's going from scratch from Google
00:03:23.560 | And he's actually got a link here even to a suggested way to help you download images from Google
00:03:29.100 | So I think this is great and I actually gave a talk
00:03:33.000 | just this afternoon at Singularity University to a
00:03:36.260 | Executive team of one of the world's largest telecommunications companies and actually showed them this post
00:03:42.040 | because the
00:03:44.480 | Folks there were telling me that that all the vendors that come to them and tell them they need like
00:03:48.860 | Millions of images and huge data centers full of hardware, and you know they have to buy special
00:03:54.140 | Software that only these vendors can provide and I said like actually this person's been doing a course for three weeks now
00:04:01.040 | And look at what he's just done with a computer that cost him 60 cents an hour
00:04:04.980 | And they were so happy to hear that, like, okay, you know, this actually is within the reach of normal people
00:04:12.200 | I'm assuming Nikhil's a normal person. I haven't actually checked
00:04:15.680 | If you're proudly abnormal, Nikhil, I apologize
00:04:20.500 | I actually went and actually had a look at his cricket
00:04:24.760 | Classifier and I was really pleased to see that his code actually is the exact same code
00:04:30.280 | that we used in lesson one. I was hoping that would be the case. You know, the only thing he changed was
00:04:34.420 | the number of epochs I guess
00:04:37.120 | So this idea that we can take those four lines of code and reuse it to do other things
00:04:41.400 | It's definitely turned out to be true, and so these are good things to show like a your organization
00:04:47.880 | If you're anything like the executives at this big company. I spoke to today. There'll be a certain amount of like
00:04:54.120 | not so much surprise but almost like pushback, of like, if this was true... you know, they basically said if this was true
00:05:00.920 | somebody would have told us, so why isn't everybody doing this already? So I think you might have to actually show them
00:05:07.080 | You know maybe you can build your own with some internal data
00:05:10.240 | You've got at work or something like here. It is you know didn't cost me anything. It's all finished
00:05:19.920 | Vitaly, or Vitali, I don't know how to pronounce his name correctly, has done another very nice post,
00:05:23.840 | just an introductory post on how we train neural networks, and I wanted to point this one out as being like I think
00:05:31.120 | This is one of the participants in this course
00:05:34.000 | Who's just got a particular knack for technical communication, and I think we can all learn from you know from his posts about about good technical writing
00:05:41.640 | What I particularly like is that
00:05:45.800 | he assumes almost nothing; he has a kind of very chatty tone and describes everything
00:05:50.760 | But he also assumes that the reader is intelligent
00:05:53.140 | But, you know, he's not afraid to kind of say here's a paper or here's an equation or whatever
00:05:58.320 | But then he's going to go through and tell you exactly what that equation means
00:06:01.600 | So it's kind of like this nice mix of like writing for
00:06:05.920 | Respectfully for an intelligent audience, but also not assuming any particular background knowledge
00:06:15.640 | Then I made the mistake earlier this week of posting a picture of my first-place ranking on the Kaggle seedlings competition
00:06:23.400 | At which point five other fast AI students posted their pictures of them passing me over the next few days
00:06:29.820 | So this is the current leaderboard for the Kaggle plant seedlings competition
00:06:34.500 | I believe the top six are all fast.ai students or, at worst, one of its teachers
00:06:40.840 | And so I think this is like a really
00:06:45.000 | Look, James has just passed me; he's first
00:06:47.080 | This is a really good example of like
00:06:50.040 | what you can do this is
00:06:53.480 | I'm trying to think it was like a
00:06:56.160 | small number of thousands of images
00:06:58.920 | And most of the images were less than a hundred pixels by a hundred pixels
00:07:07.960 | And yet, you know, my approach was basically to say let's just run through the notebook
00:07:12.880 | we have with pretty much the defaults; it took me, I don't know, an hour
00:07:15.720 | And I think the other students did a little bit more than that
00:07:21.600 | But not a lot more and basically what this is saying is yeah, these these techniques
00:07:26.800 | work pretty reliably, to the point where people that aren't using the fast.ai library
00:07:32.400 | are, you know, really struggling
00:07:37.560 | I suspect all these are fast AI students. You might have to go down quite a way
00:07:41.720 | So I thought that was very interesting and really really cool
00:07:45.720 | So today we're going to
00:07:49.960 | Start what I would kind of call like the second half of this course
00:07:56.320 | so the first half of this course has been like
00:07:59.480 | getting through
00:08:01.800 | Like these are the applications that we can use this for
00:08:06.200 | The here's kind of the code you have to write
00:08:08.440 | Here's a fairly high level ish description of what it's doing
00:08:15.320 | We're kind of done with that bit, and what we're now going to do is go in reverse
00:08:20.440 | We're going to go back over all of those exact same things again
00:08:23.900 | But this time we're going to dig into the detail of every one of them, and we're going to look inside the source code of the fast.ai
00:08:29.480 | library to see what it's doing and try to replicate
00:08:32.920 | that so in a sense like there's not going to be a lot more
00:08:38.680 | best practices to show you; I've kind of shown you the best practices
00:08:45.880 | I know
00:08:46.560 | But I feel like for us to now build on top of those to debug those models to come back to part 2
00:08:52.520 | Where we're going to kind of try out some new things, you know, it really helps to understand what's going on
00:08:58.240 | Behind the scenes. Okay, so the goal here today is we're going to try and create a
00:09:04.680 | pretty effective collaborative filtering model
00:09:07.920 | Almost entirely from scratch so we'll use the kind of we'll use pytorch as a
00:09:15.120 | Automatic differentiation tool and as a GPU programming tool and not very much else. We'll try not to use its neural net features
00:09:22.200 | We'll try not to use
00:09:23.720 | Fast AI library any more than necessary. So that's the goal
00:09:29.320 | Let's go back and you know, we only very quickly look at collaborative filtering last time
00:09:33.400 | So let's let's go back and have a look at collaborative filtering. And so we're going to look at this
00:09:38.100 | movie lens data set
00:09:43.520 | the movie lens data set
00:09:45.520 | Basically is a list of ratings
00:09:48.560 | it's got a bunch of different users that are represented by some ID and a bunch of movies that are represented by some ID and
00:09:55.840 | a rating. It also has a timestamp; I haven't actually ever tried to use it
00:10:00.520 | I guess this is just like what what time did that person rate that movie?
00:10:04.640 | So that's all we're going to use for modeling is
00:10:09.560 | three columns user ID movie ID and rating and so thinking of that in kind of
00:10:17.000 | Structured data terms user ID and movie ID would be categorical variables
00:10:20.760 | we have two of them and rating would be a
00:10:24.160 | Would be a dependent variable
00:10:27.040 | We're not going to use this for modeling but we can use it for looking at stuff later
00:10:32.960 | We can grab a list of the names of the movies as well
00:10:35.920 | And you could use this genre information. I haven't tried it; I'd be interested if during the week anybody tries it and finds it helpful
00:10:43.200 | I guess you might not find it helpful. We'll see
00:10:49.040 | In order to kind of look at this better. I just grabbed
00:10:56.600 | Users that have watched the most movies and the movies that have been the most watched
00:11:00.320 | And made a cross tab of it right so this is exactly the same data
00:11:05.560 | But it's a subset, and now rather than being rows of user, movie, rating, we've got users down the side and
00:11:10.880 | movies across the top, with the ratings in the cells
00:11:14.480 | And so some users haven't watched some of these movies. That's why some of these are not a number, okay?
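A rough sketch (not the exact lesson code) of building that cross-tab with pandas, assuming the ratings CSV has been loaded into a DataFrame called `ratings` with userId, movieId and rating columns:

```python
import numpy as np
import pandas as pd

# Keep only the most active users and most-rated movies, then pivot into a cross-tab.
top_users  = ratings.groupby('userId')['rating'].count().nlargest(15).index
top_movies = ratings.groupby('movieId')['rating'].count().nlargest(15).index
subset = ratings[ratings.userId.isin(top_users) & ratings.movieId.isin(top_movies)]

# Users down the rows, movies across the columns, ratings in the cells;
# NaN appears wherever a user hasn't rated a movie.
crosstab = pd.crosstab(subset.userId, subset.movieId, subset.rating, aggfunc=np.sum)
print(crosstab)
```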
00:11:21.380 | Then I copied that into Excel
00:11:24.400 | And you'll see there's a thing called collab filter dot XLS if you don't see it there now
00:11:32.200 | I'll make sure I've got it there by tomorrow
00:11:37.160 | Here is where I've copied
00:11:40.360 | That table okay, so as I go through this like
00:11:44.280 | Set up of the problem and kind of how it's described and stuff if you're ever feeling
00:11:49.600 | lost feel free to
00:11:52.240 | Ask either directly or through the forum if you ask through the forum and somebody answers there
00:11:58.720 | I want you to answer it here
00:12:00.720 | but if somebody else asks a question you would like answered of course just like it and
00:12:06.880 | Yannet will keep an eye out for that, because since we're digging in
00:12:10.960 | To the details of what's going on behind the scenes. It's kind of important that at each stage you feel like okay. I can see
00:12:16.560 | What's going on?
00:12:19.040 | Okay, so we're actually not going to build a neural net to start with
00:12:30.000 | Instead we're going to do something called a matrix factorization
00:12:36.040 | The reason we're not going to build a neural net to start with is that it so happens. There's a really really simple
00:12:41.120 | Kind of way of solving these kinds of problems, which I'm going to show you and so if I scroll down
00:12:47.440 | I've basically what I've got here is the same the same thing, but this time these are my predictions
00:12:54.640 | Rather than my actuals, and I'm going to show you how I created these predictions, okay, so here are my actuals
00:13:00.240 | Right here are my predictions and
00:13:03.920 | then down here
00:13:06.000 | we have our
00:13:08.000 | score which is the
00:13:10.960 | square root of the mean of the squared differences,
00:13:13.200 | okay, so this is RMSE down here. Okay, so on average our
00:13:19.880 | randomly initialized model is out by 2.8
00:13:24.280 | So let me show you what this model is and I'm going to show you by saying how do we guess?
00:13:29.960 | how much user ID number 14
00:13:33.320 | likes movie ID number 27. And
00:13:36.520 | the prediction here, which at this stage is still just random, is 0.91
00:13:43.760 | So how are we calculating 0.91? And the answer is we're taking
00:13:48.680 | this vector here
00:13:52.000 | Dot product with this vector here so dot product means point seven one times point one nine
00:13:59.860 | plus point eight one times point six three, plus point seven four times point three one, and so forth. And in,
00:14:05.540 | You know linear algebra speak because one of them is a column and one of them is a row
00:14:09.900 | this is the same as a matrix product, so you can see here I've used the Excel function MMULT (matrix multiply)
00:14:15.260 | And that's my prediction
00:14:18.500 | Having said that if the original
00:14:23.340 | Rating doesn't exist at all
00:14:28.540 | Then I'm just going to set this to zero right because like there's no error in predicting something that hasn't happened
00:14:34.620 | Okay, so what I'm going to do is I'm basically going to say all right, every one of my
00:14:39.460 | predictions is not going to be a neural net. It's going to be a single
00:14:44.340 | matrix multiplication
00:14:46.980 | Now the matrix multiplication that it's doing in practice is between this matrix and this
00:14:59.460 | matrix, right, so each one of these is a single part of that
00:15:04.560 | So I randomly initialize these these are just random numbers
00:15:12.340 | That I've just pasted in here
00:15:15.380 | So I've basically started off with two
00:15:17.900 | Random matrices, and I've said let's assume for the time being that every rating can be represented as
00:15:26.780 | the matrix product of those two
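For anyone following along in Python rather than Excel, here is a rough sketch of the same setup: two randomly initialised factor matrices, a prediction for every cell via a dot product, and the RMSE computed only over the ratings that actually exist. The sizes and example ratings are made up for illustration.

```python
import numpy as np

np.random.seed(42)
n_users, n_movies, n_factors = 15, 15, 5               # sizes assumed from the cross-tab

user_factors  = np.random.rand(n_users, n_factors)     # randomly initialised user factors
movie_factors = np.random.rand(n_factors, n_movies)    # randomly initialised movie factors

preds = user_factors @ movie_factors                   # every prediction is one dot product

actuals = np.full((n_users, n_movies), np.nan)         # NaN where a user hasn't rated a movie
actuals[0, 0], actuals[0, 1], actuals[1, 0] = 4.0, 5.0, 3.0   # made-up example ratings

mask = ~np.isnan(actuals)                              # score only the ratings that exist
rmse = np.sqrt(np.mean((preds[mask] - actuals[mask]) ** 2))
print(preds[0, 0], rmse)
```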
00:15:29.380 | So then in excel you can actually do gradient descent
00:15:34.180 | You have to go to your options to the add-in section and check the box to say turn it on and once you do you'll
00:15:43.140 | See there's something there called solver
00:15:45.140 | And if I go solver it says okay, what's your?
00:15:49.180 | objective function, and you just choose the cell; so in this case we chose the cell that contains our root mean squared error, and
00:15:56.900 | Then it says okay. What do you want to?
00:16:00.580 | Change and you can see here
00:16:03.140 | We've selected this matrix and this matrix and so it's going to do a gradient descent
00:16:07.940 | for us by changing these matrices to try and, in this case, minimize this cell
00:16:15.860 | So GRG Nonlinear is a gradient descent method, so say solve, and you'll see it starts at 2.8 and
00:16:24.420 | Then down here you'll see that number is going down. It's not actually showing us what it's doing
00:16:30.740 | but we can see that the numbers going down, so
00:16:32.980 | this kind of has a
00:16:35.980 | neural-netty feel to it, in that we're doing a matrix product and we're doing gradient descent, but we don't have a
00:16:43.820 | Non-linear layer, and we don't have a second
00:16:47.300 | Linear layer on top of that so we don't get to call this deep learning
00:16:51.220 | So things where people do deep-learning-ish things, where they have kind of
00:16:55.500 | matrix products and gradient descents, but it's not deep, people tend to just call that shallow learning. Okay, so we're doing shallow learning here
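And here is a minimal sketch of what Solver's gradient descent is doing, written directly in PyTorch: repeatedly nudging the two factor matrices to reduce the RMSE on the observed cells. All the numbers below are illustrative.

```python
import torch

# Two randomly initialised factor matrices, optimised by plain gradient descent to
# minimise RMSE over the observed ratings only.
n_users, n_movies, n_factors = 15, 15, 5
u = torch.rand(n_users, n_factors, requires_grad=True)
m = torch.rand(n_factors, n_movies, requires_grad=True)

actuals = torch.full((n_users, n_movies), float('nan'))
actuals[0, 0], actuals[0, 1], actuals[1, 0] = 4.0, 5.0, 3.0    # made-up example ratings
mask = ~torch.isnan(actuals)

lr = 0.05
for step in range(1000):
    preds = u @ m
    loss = ((preds[mask] - actuals[mask]) ** 2).mean().sqrt()  # RMSE over observed cells
    loss.backward()
    with torch.no_grad():
        u -= lr * u.grad; m -= lr * m.grad                     # gradient descent step
        u.grad.zero_(); m.grad.zero_()
print(loss.item())
```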
00:17:03.460 | Alright, so I'm just going to go ahead and press escape to stop it because I'm sick of waiting
00:17:08.260 | and so you can see
00:17:12.060 | We've now got down to 0.39, all right, so for example
00:17:17.660 | it guessed that movie 27 for user 72 would get a
00:17:25.180 | 4.44 rating
00:17:28.380 | and it actually got a 4 rating, so you can see it's doing something quite useful
00:17:37.140 | So why is it doing something quite useful? I mean something to note here is
00:17:42.660 | The number of things we're trying to predict here is there's 225 of them
00:17:47.940 | Right and the number of things we're using to predict is that times two so 150 of them
00:17:55.100 | So it's not like we can just exactly fit we actually have to do some kind of
00:17:58.780 | machine learning here
00:18:01.340 | So basically what this is saying is that there does seem to be some
00:18:05.660 | way of
00:18:07.580 | making predictions in this way and
00:18:10.260 | So for those of you that have done some linear algebra
00:18:13.060 | And this is actually a matrix decomposition normally in linear algebra you would do this using a
00:18:18.980 | Analytical technique or using some techniques that are specifically designed for this purpose, but the nice thing is that we can use
00:18:26.580 | Gradient descent to solve pretty much everything including this
00:18:30.220 | I don't like to so much think of it from a linear algebra point of view though
00:18:34.340 | I like to think of it from an intuitive point of view which is this let's say movie. Sorry. Let's say movie ID 27 is
00:18:41.300 | Lord of the Rings
00:18:44.100 | part one and
00:18:46.100 | let's say
00:18:48.740 | And so let's say we're trying to make that prediction for user
00:18:53.500 | 272: are they going to like Lord of the Rings part one? And so conceptually
00:18:59.820 | for that particular movie, maybe there's five
00:19:04.780 | Numbers here and we could say like well
00:19:07.980 | What if the first one was like how much is it sci-fi and fantasy and the second one is like?
00:19:13.860 | How recent a movie and how much special effects is there you know and the one at the top might be like how dialogue driven?
00:19:20.980 | Is it right like let's say those kind of five these five numbers represented particular things about the movie and so if that was the case
00:19:29.260 | Then we could have the same five numbers for the user saying like okay
00:19:33.620 | How much does the user like sci-fi and fantasy how much does the user like?
00:19:37.980 | modern CGI-driven movies? How much does this user like
00:19:45.820 | dialogue-driven movies? And so if you then took that dot product
00:19:49.140 | you would expect to have a reasonable rating. Now the problem is
00:19:56.540 | We don't have this information for each user. We don't have the information for each movie, so we're just going to like assume
00:20:03.460 | That this is a reasonable
00:20:06.060 | Kind of way of thinking about this system, and let's and let's stochastic gradient descent try and find these numbers
00:20:11.500 | Right so so in other words these these factors
00:20:15.860 | We call these things factors these factors and we call them factors because you can multiply them together to create this
00:20:23.100 | They're factors in a linear algebra sense these factors. We call them latent factors because they're not actually
00:20:29.900 | This is not actually a
00:20:32.340 | Vector that we've like named and understood and like entered in manually we've kind of assumed
00:20:39.180 | That we can think of movie ratings this way
00:20:42.680 | we've assumed that we can think of them as a dot product of
00:20:47.180 | Some particular features about a movie and some particular features of what users like those kinds of movies, right?
00:20:54.220 | And then we've used gradient descent
00:20:56.220 | To just say okay try and find some numbers that that work
00:20:59.980 | So that's that's basically the technique right and it's kind of
00:21:05.660 | The and the entirety is in this spreadsheet right so that is collaborative filtering using what we call probabilistic matrix factorization
00:21:15.020 | And as you can see the whole thing is easy to do in an Excel spreadsheet and the entirety of it really is this single
00:21:21.500 | Thing which is a single matrix multiplication
00:21:24.240 | plus randomly initializing
00:21:27.140 | We'd like to know if it would be better to cap this to between zero and five, maybe? Yeah
00:21:36.300 | Yeah, we're going to do that later right. There's a whole lot of stuff. We can do improvements. This is like our
00:21:43.580 | Simple as possible starting point right so so what we're going to do now is we're going to try and implement this
00:21:49.180 | in Python
00:21:52.340 | And run it on the whole data set. Another question is how do you figure out how many...
00:21:57.700 | you know, how long should the vectors be? Why is it five? Yeah, yeah
00:22:03.860 | So something to think about
00:22:06.860 | Given that this is like movie 49, right?
00:22:10.820 | And we're looking at a rating for movie 49
00:22:13.700 | Think about this. This is actually an embedding matrix
00:22:20.540 | So this length is actually the size of the embedding matrix. I'm not saying this is an analogy
00:22:27.180 | I'm saying it literally this is literally an embedding matrix
00:22:30.600 | We could have a one hot encoding where 72
00:22:34.900 | Where a one is in the 72nd position and so we'd like to look it up, and it would return this list of five numbers
00:22:42.300 | So the question is actually how do we decide on the dimensionality of our embedding vectors?
00:22:47.660 | And the answer to that question is we have no idea
00:22:51.780 | We have to try a few things and see what works
00:22:55.340 | the underlying concept is you need to pick an embedding dimensionality, which is
00:23:03.740 | Enough to reflect the kind of true complexity of this causal system
00:23:08.700 | but not so big that you
00:23:11.740 | Have too many parameters that it could take forever to run or even with vectorization. It might overfit
00:23:17.820 | So what does it mean when the factor is negative then
00:23:24.780 | The factor being negative in the movie case would mean like this is not dialogue driven in fact
00:23:31.900 | It's like the opposite dialogue here is terrible a negative for the user would be like I actually
00:23:38.060 | dislike modern CGI movies. So it's not from zero to whatever, the range of the
00:23:44.620 | score, it can be negative? Is there even a maximum for the range of the score? No, there are no constraints at all here
00:23:51.740 | These are just standard embedding matrices
00:23:54.660 | Thanks
00:24:00.420 | Questions: the first question is why can we trust these embeddings, because if you take the number six
00:24:07.700 | it can be expressed as 1 times 6, or 6 times 1, or 2 times 3, or 3 times 2
00:24:11.840 | Also, are you saying like we could like reorder these five numbers in some other different order or like the value itself might be different
00:24:19.980 | As long as the product is something well, but you see we're using gradient descent to find the best numbers
00:24:26.660 | So like once we found a good minimum
00:24:30.620 | the idea is like
00:24:32.620 | Yeah, there are other numbers, but they don't give you as good an objective value
00:24:36.920 | And of course we should be checking that on a validation set really which we'll be doing in the Python version
00:24:43.460 | Okay, and the second question is when we have a new movie or a new user do we have to retrain the model?
00:24:49.180 | That is a really good question, and there isn't a straightforward answer to that
00:24:53.540 | Time permitting will come back to it
00:24:56.220 | But basically you would need to have like a kind of a new user
00:25:00.100 | Model or a new movie model that you would use initially
00:25:04.660 | And then over time yes, you would then have to retrain the model
00:25:09.620 | So like I don't know if they still do it
00:25:11.800 | But Netflix used to have this thing where, when you were first onboarded onto Netflix,
00:25:15.340 | They would say like what movies do you like?
00:25:17.340 | And you'd have to go through and like say a bunch of movies you like and it would then like train its model
00:25:28.140 | Could you just find the nearest movie to the new movie that you're trying to add?
00:25:33.420 | Yeah, you could use nearest neighbors for sure
00:25:35.700 | But the thing is initially at least in this case we have no
00:25:43.660 | Columns to describe a movie so if you had something about like the movies
00:25:48.720 | Genre release date who was in it or something you could have some kind of non collaborative filtering model
00:25:55.060 | And that was kind of what I meant. I like a new movie model. You'd have to have some some kind of predictors
00:26:00.540 | Okay, so a
00:26:04.660 | Lot of this is going to look familiar and and the way I'm going to do this is again
00:26:11.940 | It's kind of this top-down approach. We're going to start using a
00:26:15.020 | Few features of pytorch and fast AI and gradually we're going to redo it a few times in a few different ways
00:26:23.580 | Kind of doing a little bit deeper each time
00:26:25.580 | Regardless we do need a validation set so we can use our standard cross validation indexes approach to grab a random set of IDs
00:26:36.520 | This is something called weight decay
00:26:40.780 | Which we'll talk about later in the course for those of you that have done some machine learning
00:26:45.260 | It's L2 regularization basically
00:26:48.060 | And this is where we choose how big a embedding matrix do we want okay?
00:26:53.380 | So again, you know, here's where we get our model data object from CSV,
00:27:00.580 | Passing in that ratings file which remember
00:27:05.380 | Looks like that okay, so you'll see like stuff tends to look pretty familiar after a while
00:27:13.620 | And then you just have to pass in
00:27:21.820 | What are your rows effectively? What are your columns effectively, and what are your values effectively right so any any collaborative filtering?
00:27:29.580 | Recommendation system approach. There's basically a concept of like
00:27:33.440 | You know a user and an item
00:27:36.140 | Now they might not be users and items like if you're doing the Ecuadorian groceries competition
00:27:42.680 | There are stores and items and you're trying to predict. How many things are you going to sell at?
00:27:48.540 | This store of this type
00:27:50.900 | But generally speaking just this idea of like you've got a couple of kind of high cardinality
00:27:57.660 | Categorical variables and something that you're measuring and you're kind of conceptualizing and saying okay, we could predict
00:28:04.540 | The rating we can predict the value by doing this this dot product
00:28:09.140 | Interestingly this is kind of relevant to that that last question or suggestion an
00:28:16.660 | Identical way to think about this or to express this is to say
00:28:20.460 | when we're deciding
00:28:23.140 | Whether user 72 will like movie 27 is basically saying
00:28:29.500 | which other
00:28:31.940 | users liked movies that 72 liked and
00:28:36.580 | Which other movies were liked by people like?
00:28:43.140 | User 72 it turns out that these are basically two ways of saying the exact same thing
00:28:50.160 | So basically what collaborative filtering is doing?
00:28:52.300 | You know kind of conceptually is to say okay this movie and this user
00:28:58.420 | Which other movies are similar to it in terms of like?
00:29:02.160 | Similar people enjoyed them and which people are similar to this person based on people that like the same kind of movies
00:29:09.340 | so that's kind of the
00:29:11.580 | underlying
00:29:12.900 | Structure and anytime there's an underlying structure like this that kind of collaborative filtering approach is likely to be useful
00:29:18.840 | okay, so
00:29:21.860 | So you yeah, so there's basically two parts the two bits of your thing that you're factoring and then the value the dependent variable
00:29:29.120 | So as per usual we can take our model data and ask for a learner from it
00:29:35.420 | And we need to tell it what size embedding matrix to use
00:29:38.940 | How many sorry what validation set indexes to use what batch size to use and what optimizer?
00:29:45.740 | To use and we're going to be talking more about optimizers
00:29:49.100 | shortly
00:29:50.820 | We won't do Adam today, but we'll do Adam
00:29:53.120 | next week or the week after
00:29:55.780 | And then we can go ahead and say fit
00:29:58.060 | Right, and it all looks pretty similar to usual. Interestingly,
00:30:04.020 | I only had to do three epochs; these kinds of models tend to train super quickly
00:30:09.620 | You can use the learning rate finder as per usual all the stuff you're familiar with will work fine
00:30:14.780 | And that was it, so this took, you know, about two seconds to train. There's no pretrained anything here
00:30:22.340 | This is from random scratch, right?
00:30:24.500 | So this is our validation set and we can compare it we have this is a mean squared error
00:30:31.140 | Not a root mean squared error, so we can take the square root
00:30:33.860 | So the last time I ran it, it was 0.776, and the square root of that is 0.88, and there are some benchmarks available for this data set
00:30:43.980 | And when I scrolled through and found the best benchmark I could find here, from this
00:30:48.560 | recommendation-system-specific library, they had 0.91. So we've got a better loss in two seconds
00:30:59.420 | Already, so that's good
00:31:01.420 | So that's basically how you can do collaborative filtering
00:31:06.060 | with the fast AI library without
00:31:09.740 | Thinking too much, but so now we're going to dig in and try and rebuild that we'll try and get to the point that we're getting
00:31:16.980 | something around
00:31:19.300 | 0.77 0.78 from scratch
00:31:21.300 | But if you want to do this yourself at home, you know without worrying about the detail
00:31:28.340 | That's you know, those three lines of code is all you need
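For reference, those three lines look roughly like this in the 2018-era fast.ai (0.7) library, as recalled from the lesson notebook; treat the exact names and signatures as approximate, and `ratings` and `path` as the already-loaded DataFrame and data directory.

```python
from fastai.learner import *       # 2018-era fast.ai (0.7)
from fastai.column_data import *

wd, n_factors = 2e-4, 50                        # weight decay and embedding size
val_idxs = get_cv_idxs(len(ratings))            # random validation indexes

cf = CollabFilterDataset.from_csv(path, 'ratings.csv', 'userId', 'movieId', 'rating')
learn = cf.get_learner(n_factors, val_idxs, 64, opt_fn=optim.Adam)
learn.fit(1e-2, 2, wds=wd, cycle_len=1, cycle_mult=2)
```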
00:31:31.460 | Okay, so we can get the predictions in the usual way and you know, we could for example plot
00:31:37.540 | sns is Seaborn; Seaborn is a really great plotting library. It sits on top of matplotlib
00:31:43.220 | It actually leverages matplotlib
00:31:45.940 | So anything you learn about matplotlib will help you with Seaborn. It's got a few nice little plots like this joint plot
00:31:51.340 | Here is I'm doing
00:31:54.140 | predictions
00:31:56.020 | against
00:31:57.340 | Against actuals. So these are my actuals
00:31:59.680 | These are my predictions and you can kind of see the the shape here is that as we predict higher numbers
00:32:05.180 | they actually are higher numbers and you can also see the histogram of the
00:32:09.360 | predictions and a histogram of the actuals. So I'm just plotting that to show you another interesting visualization
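A minimal sketch of that kind of plot, assuming `preds` and `y_valid` are 1-D arrays of the validation predictions and actual ratings:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.jointplot(x=preds, y=y_valid, kind='hex')   # hexbin scatter with marginal histograms
plt.show()
```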
00:32:16.020 | Could you please explain the n-factors
00:32:20.780 | Why is it set to 50? It's set to 50 because I tried a few things and that seemed to work
00:32:26.460 | That's all. What does it mean? It's the dimensionality of the embedding matrix
00:32:31.040 | Or to think of it another way, it's like, you know, rather than being five, it's 50
00:32:37.140 | Jeremy I have a question about suppose that your
00:32:46.300 | Recommendation system is more implicit. So you have zeros or ones instead of just
00:32:55.140 | Actual numbers, right? So basically we would then
00:32:57.700 | Need to use a classifier instead of a regressor
00:33:01.420 | Do you have to sample the negatives or something like that?
00:33:06.140 | So if you don't have it, you just have ones, let's say, just kind of implicit feedback? Oh,
00:33:11.380 | I'm not sure we'll get to that one in this class
00:33:14.260 | But what I will say is like in the case that you're just doing classification rather than regression
00:33:18.740 | We haven't actually built that in the library yet
00:33:22.100 | Maybe somebody this week wants to try adding it. It would only be a small number of lines of code. You basically have to change the
00:33:27.780 | activation function to be a sigmoid and you would have to change the
00:33:32.380 | Criterion or the loss function to be cross entropy
00:33:36.700 | rather than
00:33:39.540 | RMSE and that will give you a
00:33:41.880 | Classifier rather than a regressor. Those are the only things you'd have to change
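As a sketch of those two changes (not something that was in the library at the time), where `u` and `m` stand for the looked-up user and movie embedding vectors:

```python
import torch
import torch.nn.functional as F

def forward_classifier(u, m):
    # change 1: squash the dot product with a sigmoid so the output is a probability
    return torch.sigmoid((u * m).sum(1))

# change 2: use (binary) cross entropy instead of MSE as the loss
# loss = F.binary_cross_entropy(forward_classifier(u, m), targets.float())
```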
00:33:46.180 | So hopefully somebody this week will take up that challenge and by the time we come back next week. We will have that working
00:33:54.140 | So I said that we're basically doing a dot product right or you know a dot product is kind of the vector version
00:34:03.460 | I guess of this matrix product
00:34:05.620 | So we're basically doing each of these things times each of these things and then add it together
00:34:11.700 | That's a dot product. So let's just have a look at how we do that in PyTorch
00:34:17.300 | So we can create a tensor in PyTorch just using this little capital T thing
00:34:22.380 | You can just say that; that's the fast.ai version. The full version is torch.from_numpy or something
00:34:28.780 | But I've got it set up so you can pass it through pass in even a list of lists
00:34:33.260 | So this is going to create a torch tensor with 1 2 3 4 and then here's a torch tensor with 2 2 10 10
00:34:41.440 | Okay, so here are two
00:34:44.260 | Torch tensors, I didn't say dot CUDA. So they're not on the GPU. They're sitting on the CPU
00:34:50.340 | just FYI
00:34:52.820 | We can multiply them together
00:34:54.860 | Right and so anytime you have a mathematical operator between tensors in NumPy or PyTorch
00:35:02.420 | It will do element wise
00:35:05.100 | Assuming that they're the same dimensionality, which they are they're both 2 by 2
00:35:08.720 | Okay, and so here we've got
00:35:11.460 | 2 times 2 is 4
00:35:14.140 | 3 times 10 is 30, and so forth. Okay, so there's our a times b
00:35:17.940 | So if you think about basically what we want to do here is we want to take
00:35:23.540 | Okay, so I've got 1
00:35:31.580 | times 2 is 2, 2 times 2 is 4,
00:35:37.100 | 2 plus 4 is 6, and so that is actually the dot product between (1, 2) and (2, 2). And
00:35:44.380 | then here we've got 3 times 10 is 30,
00:35:48.900 | 4 times 10 is 40, and 30 plus 40 is 70
00:35:52.220 | So in other words a times B dot sum along the first dimension
00:35:57.860 | So that's summing up the columns. In other words across a row
00:36:01.260 | Okay, this thing here is doing the dot product of
00:36:06.980 | Each of these rows with each of these rows
00:36:09.740 | That makes sense and obviously we could do that with
00:36:13.540 | you know, some kind of matrix multiplication approach, but I'm trying to really do things with as little
00:36:19.860 | special case stuff as possible
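The same little experiment in plain PyTorch, using torch.tensor rather than fast.ai's T helper:

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[2., 2.], [10., 10.]])

print(a * b)           # element-wise product: [[2, 4], [30, 40]]
print((a * b).sum(1))  # sum across each row: [6, 70] -- the row-wise dot products
```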
00:36:22.820 | Okay, so that's what we're going to use for our dot products from now on so basically all we need to do now is
00:36:29.660 | Remember, the data we have is not in that crosstab format
00:36:35.420 | So in Excel we've got it in this crosstab format, but we've got it here in this
00:36:39.940 | listed format: user, movie, rating; user, movie...
00:36:43.340 | So conceptually we want to be like looking up this user
00:36:47.300 | Into our embedding matrix to find their 50 factors looking up that movie to find their 50 factors and then take the dot product
00:36:54.780 | of those two 50 long vectors
00:36:57.500 | So let's do that
00:37:04.220 | To do it we're going to build a layer our own custom
00:37:09.100 | neural net layer
00:37:11.860 | So the the more generic vocabulary we call this is we're going to build a pytorch module
00:37:18.980 | Okay, so a pytorch module is a very specific thing
00:37:23.500 | It's something that you can use as a layer and a neural net once you've created your own pytorch module
00:37:29.580 | You can throw it into a neural net
00:37:31.700 | And a module works by assuming we've already got one say called model
00:37:36.940 | You can pass in some things in parentheses, and it will calculate it right so assuming that we already have a module called dot product
00:37:45.460 | We can instantiate it like so
00:37:48.940 | To create our dot product object, and we can basically now treat that like a function
00:37:56.100 | All right, but the thing is it's not just a function because we'll be able to do things like take derivatives of it
00:38:02.620 | Stack them up together into a big
00:38:05.100 | Stack of neural network layers blah blah blah, right, so it's basically a function that we can kind of compose very conveniently
00:38:13.460 | So here how do we define a module which as you can see here returns a dot product well
00:38:20.700 | We have to create a Python class and so if you haven't done Python OO before
00:38:26.500 | You're going to have to learn because all pytorch modules are written in Python OO
00:38:32.260 | And it's one of the things I really like about pytorch is that it doesn't
00:38:36.020 | Reinvent totally new ways of doing things like TensorFlow does all the time in pytorch that you know really tend to use
00:38:44.780 | Pythonic ways to do things so in this case. How do you create you know some kind of new behavior you create a Python class?
00:38:52.220 | So Jeremy suppose that you have a lot of data
00:38:58.600 | not just a little bit of data, more than you can hold in memory. Will you be able to use fast.ai to solve collaborative filtering?
00:39:05.740 | Yes, absolutely
00:39:08.460 | It's it uses
00:39:12.780 | mini batch stochastic gradient descent which does it a batch at a time the
00:39:19.180 | This particular version is going to create a
00:39:27.660 | Pandas data frame and a pandas data frame has to live in memory
00:39:32.740 | Having said that you can get easily 512 gig
00:39:38.460 | You know instances on Amazon so like if you had a CSV that was bigger than 512 gig
00:39:43.660 | You know that would be impressive if that did happen
00:39:48.140 | I guess you would have to instead save that as a bcolz array and
00:39:51.780 | create a slightly different version that reads from a bcolz array, streaming it in, or maybe from a dask
00:39:58.380 | data frame, which also... so
00:40:01.100 | It would be easy to do I don't think I've seen
00:40:05.380 | Real world situations where you have 512 gigabyte collaborative filtering matrices, but yeah, we can do it
00:40:12.940 | Okay now
00:40:16.540 | This is PyTorch specific this next bit is that when you define like the actual work to be done which is here return
00:40:24.780 | user times movie dot sum
00:40:27.300 | You have to put it in a special method called forward
00:40:31.860 | Okay, and this is this idea that it's very likely going to be part of a neural net, right, and in a neural net the thing where you
00:40:37.820 | calculate the next
00:40:39.980 | Set of activations is called the forward pass and so that's doing a forward calculation
00:40:45.900 | Calculating the gradients is called the backward calculation
00:40:49.620 | We don't have to do that because PyTorch calculates that automatically so we just have to define
00:40:54.500 | Forward so we create a new class we define forward and here we write in our definition of dot product
00:41:01.800 | Okay, so that's it. So now that we've created this class definition. We can instantiate our
00:41:09.100 | Model right and we can call our model and get back the numbers we expected. Okay, so that's it
00:41:16.120 | That's how we create a custom
00:41:18.120 | PyTorch layer and if you compare that to like any other
00:41:23.080 | Library around pretty much. This is way easier
00:41:26.080 | Basically, I guess because we're leveraging
00:41:28.960 | What's already in Python?
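Here is a sketch of that custom layer, close to what the lesson describes:

```python
import torch
import torch.nn as nn

class DotProduct(nn.Module):
    def forward(self, u, m):
        return (u * m).sum(1)      # row-wise dot product, a whole mini-batch at a time

model = DotProduct()
a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[2., 2.], [10., 10.]])
print(model(a, b))                 # tensor([ 6., 70.]) -- calling the module runs forward
```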
00:41:31.720 | So let's go ahead and now create a more complex
00:41:35.080 | Module and we're going to basically do the same thing. We're going to have a forward again
00:41:41.720 | We're going to have our users times movies dot sum
00:41:44.960 | But we're going to do one more thing beforehand, which is we're going to create two
00:41:49.520 | Embedding matrices and then we're going to look up our users and our movies in those embedding matrices
00:41:56.360 | So let's go through and and do that
00:42:00.640 | so the first thing to realize is that
00:42:02.840 | The users the user IDs and the movie IDs may not be contiguous
00:42:09.680 | You know like they may be they start at a million and go to a million one thousand say, right? So if we just used
00:42:18.240 | Those IDs directly to look up into an embedding matrix
00:42:23.080 | We would have to create an embedding matrix of size one million one thousand right which we don't want to do
00:42:28.080 | so the first thing I do is to get a list of the
00:42:31.800 | unique user IDs and
00:42:34.520 | then I create a mapping from every user ID to a
00:42:39.360 | Contiguous integer this thing I've done here where I've created a
00:42:44.900 | dictionary which maps from every unique thing to a unique index is
00:42:50.960 | Well worth studying during the week because like it's super super handy
00:42:55.440 | It's something you very very often have to do in all kinds of machine learning
00:42:59.400 | All right, and so I won't go through it here
00:43:01.680 | It's easy enough to figure out if you can't figure it out just ask on the forum
00:43:04.920 | Anyway, so once we've got the mapping from user to a contiguous index
00:43:11.480 | We then can say let's now replace the user ID column
00:43:17.480 | With that contiguous index right so pandas dot apply applies an arbitrary function
00:43:24.680 | In Python Lambda is how you create an anonymous function on the fly and this anonymous function simply returns the index
00:43:32.560 | We do the same thing for movies, and so after that we now have the same ratings table we had before, but our
00:43:39.720 | IDs have been matched to contiguous
00:43:42.600 | Integers and therefore there are things that we can look up into an embedding matrix
00:43:47.480 | So let's get the count of our users and our movies
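A sketch of that contiguous-ID mapping, assuming the MovieLens ratings.csv has been loaded into a pandas DataFrame called `ratings`:

```python
u_uniq = ratings.userId.unique()
user2idx = {o: i for i, o in enumerate(u_uniq)}     # every user ID -> 0 .. n_users-1
ratings.userId = ratings.userId.apply(lambda x: user2idx[x])

m_uniq = ratings.movieId.unique()
movie2idx = {o: i for i, o in enumerate(m_uniq)}
ratings.movieId = ratings.movieId.apply(lambda x: movie2idx[x])

n_users, n_movies = len(u_uniq), len(m_uniq)        # counts used to size the embeddings
```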
00:43:52.760 | And let's now go ahead and try and create our
00:43:55.520 | Python version of this
00:43:58.800 | Okay, so
00:44:01.920 | Earlier on when we created our simplest possible
00:44:06.840 | Pytorch module there was no like
00:44:12.280 | State we didn't need a constructor
00:44:15.040 | Because we weren't like saying how many users are there or how many movies are there or how many factors?
00:44:20.320 | Do we want or whatever right anytime we want to do something like?
00:44:24.440 | This where we're passing in and saying we want to construct our
00:44:29.800 | Module with this number of users and this number of movies then we need a constructor
00:44:37.000 | for our class and you create a constructor in Python by defining a
00:44:42.600 | dunder init, that is, __init__ (underscore underscore init underscore underscore),
00:44:45.720 | special name, so this just creates a
00:44:49.840 | constructor. Then, if you haven't done OO before,
00:44:51.920 | you'll want to do some study during the week, but it's a pretty simple idea
00:44:57.160 | This is just the thing that when we create this object. This is what gets run, okay?
00:45:01.680 | Again special Python thing when you create your own constructor
00:45:06.920 | You have to call the parent class constructor
00:45:08.800 | And if you want to have all of the cool behavior of a Pytorch module you get that by inheriting
00:45:15.600 | from nn.Module, the neural net module. Okay, so basically by inheriting here and calling the superclass constructor,
00:45:23.560 | We now have a fully functioning Pytorch layer, okay, so now we have to give it some behavior
00:45:29.760 | And so we give it some behavior by storing some things in it all right, so here. We're going to create something called
00:45:37.400 | self.u, for users, and that is going to be an
00:45:42.200 | embedding layer
00:45:44.160 | Number of rows is n_users, number of columns is n_factors
00:45:48.440 | So that is exactly this, right: the number of rows is n_users, the number of columns is n_factors
00:45:56.920 | And then we'll have to do the same thing for movies
00:46:00.120 | All right, so that's going to go ahead and create these two
00:46:04.800 | randomly initialized arrays
00:46:10.400 | However, when you randomly initialize an array, it's important to randomly initialize it to a
00:46:16.040 | reasonable set of numbers, like a reasonable scale, right? If we randomly initialized them from, like, nought to a million
00:46:23.040 | then we would start out, you know, with these ratings being
00:46:27.720 | billions and billions in size, and that's going to be very hard to do gradient descent on
00:46:33.600 | So I just kind of manually figured out here, like, okay, about what size
00:46:39.800 | numbers are going to give me about the right ratings, and we know we want ratings between about nought and five
00:46:45.920 | So if we start out with stuff between about naught and 0.05, then we're going to get ratings of about the right level
00:46:54.160 | You can easily enough like that calculate that in neural nets. There are standard algorithms for
00:47:01.880 | basically doing that calculation, and the key algorithm is
00:47:09.360 | something called He initialization, from Kaiming He, and the basic idea
00:47:15.400 | Is that you take the
00:47:20.540 | Here you basically set the weights equal to a normal distribution
00:47:27.800 | With a standard deviation, which is basically inversely proportional to the number of things
00:47:39.380 | in the previous layer
00:47:41.380 | And so in our previous layer
00:47:44.100 | So in this case, if you basically take that
00:47:52.020 | nought to 0.05 and account for the fact that you've got
00:47:55.720 | 40... was it 40 or 50 things coming out of it?
00:47:59.820 | 50, 50 things coming out of it, then you're going to get something of about the right size
00:48:06.820 | PyTorch already has, like, He initialization
00:48:11.100 | classes there, so we don't normally in real life have to think about this; we can just call the existing initialization
00:48:17.780 | Functions, but we're trying to do this all like from scratch here. Okay without any
00:48:23.700 | special stuff going on
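As a formula, He (Kaiming) initialization in its usual ReLU form draws each weight from a normal distribution whose spread shrinks with the number of inputs to the layer, $n_{\mathrm{in}}$:

$$ W \sim \mathcal{N}\left(0,\ \sigma^{2}\right), \qquad \sigma = \sqrt{\frac{2}{n_{\mathrm{in}}}} $$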
00:48:26.380 | So there's quite a bit of PyTorch notation here. self.u, which we've already set to an instance of the embedding class,
00:48:37.620 | has a .weight
00:48:39.620 | attribute which contains the actual embedding matrix
00:48:43.900 | But what that contains,
00:48:48.700 | the actual embedding matrix, is not a tensor
00:48:52.980 | It's a variable a variable is exactly the same as a tensor in other words it supports the exact same
00:49:00.220 | operations as a tensor, but it also
00:49:04.500 | Does automatic differentiation?
00:49:06.500 | That's all a variable is basically
00:49:09.220 | To pull the tensor out of a variable you get its data attribute
00:49:15.740 | Okay, so this is now the tensor of the weight matrix of the self.u embedding
00:49:22.980 | And then something that's really handy to know is that all of the tensor functions in Pytorch
00:49:30.340 | You can stick an underscore at the end, and that means do it in place
00:49:34.500 | Right so this is say create a random uniform random number of an appropriate size
00:49:40.860 | For this tensor and don't return it, but actually fill in that matrix
00:49:46.860 | in place. Okay, so that's a super handy thing to know about. I mean, it's not rocket science; otherwise we would have to have gone,
00:49:55.220 | okay, there's the non-in-place version. It just saves us some typing, saves us some screen noise. That's all
00:50:14.900 | So now we've got our randomly initialized embedding weight matrices
00:50:19.860 | And so now the forward
00:50:22.980 | I'm actually going to use the same columnar model data that we used for
00:50:27.700 | Rossman
00:50:29.540 | And so it's actually going to be passed both categorical variables and continuous variables
00:50:34.080 | and in this case there are no
00:50:36.620 | continuous variables, so I'm just going to grab the
00:50:40.180 | zeroth column out of the categorical variables and call it users and the first column and call it movies okay, so I'm just kind of
00:50:48.660 | Too lazy to create my own. I'm not so much too lazy that we do have a special class for this
00:50:53.340 | But I'm trying to avoid creating a special class, so I'm just going to leverage this columnar model data class
00:50:58.920 | Okay, so we can basically grab our user and movies
00:51:03.020 | Mini batches right and remember this is not a single user in a single movie. This is going to be a whole mini batch of them
00:51:11.340 | We can now look up that mini batch of users in our embedding matrix U and the movies in
00:51:18.700 | our embedding matrix M
00:51:20.380 | All right, so this is like exactly the same as just doing an array look up to grab the the user ID numbered
00:51:26.820 | Value, but we're doing it a whole mini batch at a time
00:51:29.980 | Right and so it's because pytorch
00:51:32.340 | Can do a whole mini batch at a time with pretty much everything that we can get really easy speed up
00:51:37.580 | We don't have to write any loops on the whole to do everything through our mini batch
00:51:42.460 | And in fact if you do have a loop through your mini batch manually you don't get GPU acceleration
00:51:48.460 | That's really important to know right so you never want to loop have a for loop going through your mini batch
00:51:53.860 | You always want to do things in this kind of like whole mini batch at a time
00:51:58.120 | But pretty much everything in pytorch does things a whole mini batch at a time, so you shouldn't have to worry about it
00:52:04.060 | And then here's our product just like before all right so having defined
00:52:10.620 | That I'm now going to
00:52:16.980 | go ahead and say that my x values are
00:52:19.540 | everything except the rating and the timestamp
00:52:22.820 | in my ratings table, my y is my rating, and then I can just say okay, let's
00:52:27.920 | Grab a model data from a data frame using that X and that Y and here is our list of
00:52:35.380 | categorical variables
00:52:40.180 | And then so let's now instantiate that pytorch object
00:52:46.940 | All right, so we've now created that from scratch
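Putting those pieces together, here is a sketch of the module and the model data object, close to the lesson notebook but not guaranteed to match it line for line; `ratings`, `val_idxs`, `path`, `n_users` and `n_movies` are assumed from the earlier steps.

```python
import numpy as np
import torch.nn as nn
from fastai.column_data import ColumnarModelData   # 2018-era fast.ai (0.7)

n_factors = 50

class EmbeddingDot(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)   # rows: users, columns: factors
        self.m = nn.Embedding(n_movies, n_factors)  # rows: movies, columns: factors
        self.u.weight.data.uniform_(0, 0.05)        # in-place init at a sensible scale
        self.m.weight.data.uniform_(0, 0.05)

    def forward(self, cats, conts):                 # ColumnarModelData passes cats and conts
        users, movies = cats[:, 0], cats[:, 1]      # a whole mini-batch of IDs at once
        u, m = self.u(users), self.m(movies)        # embedding lookups
        return (u * m).sum(1)                       # row-wise dot product

x = ratings.drop(['rating', 'timestamp'], axis=1)
y = ratings['rating'].astype(np.float32)
data = ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], 64)
model = EmbeddingDot(n_users, n_movies)
```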
00:52:49.260 | And then the next thing we need to do is to create an optimizer, so this is part of pytorch
00:52:56.420 | The only fast AI thing here is this line right because it's like I don't think showing you
00:53:03.840 | How to build data sets and data loaders is interesting enough really we might do that in part two of the course
00:53:10.060 | And it's actually so straightforward like a lot of you are already doing it on the forums
00:53:15.740 | So I'm not going to show you that in this part
00:53:17.860 | But if you're interested feel free to to talk on the forums about it
00:53:21.940 | But I'm just going to basically take the the thing that feeds this data as a given particularly because these things are so flexible
00:53:28.420 | Right you you know if you've got stuff in a data frame. You can just use this you don't have to rewrite it
00:53:32.900 | So that's the only fast AI thing we're using so this is a pytorch thing and so
00:53:39.060 | Optim is the thing in pytorch that gives us an optimizer. We'll be learning about that
00:53:45.620 | very shortly
00:53:47.420 | So it's actually the thing that's going to update our weights
00:53:50.200 | Pytorch
00:53:53.740 | calls them the parameters of the model. So earlier on we said model equals EmbeddingDot blah blah blah
00:54:01.020 | All right, and because EmbeddingDot
00:54:03.580 | derives from nn.Module, we get all of the PyTorch module behavior, and one of the things we get for free
00:54:11.140 | Is the ability to say dot parameters?
00:54:14.260 | So that's pretty that's pretty handy right that's the thing that basically is going to automatically
00:54:20.860 | Give us a list of all of the weights in our model that have to be updated and so that's what gets passed to the optimizer
00:54:28.900 | We also passed the optimizer the learning rate
00:54:31.820 | The weight decay which we'll talk about later and momentum that we'll talk about later
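A sketch of that optimizer creation; the hyperparameter values are just plausible examples, not necessarily the ones used in the lesson:

```python
import torch.optim as optim

opt = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-5, momentum=0.9)
```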
00:54:41.300 | Okay, one other thing that I'm not going to do right now
00:54:44.060 | But we will do later is to write a training loop so the training loop is a thing that loops through each mini batch
00:54:51.620 | updates the weights to subtract the gradient times the learning rate
00:54:56.300 | There's a function in fast AI which is the training loop and it's
00:55:03.020 | It's pretty simple
00:55:10.140 | Here it is right for epoch in epochs
00:55:13.180 | This is just the thing that shows a progress bar so ignore this for X comma Y in my training data loader
00:55:20.900 | calculate the loss
00:55:24.260 | print out the loss in a progress bar, call any callbacks you have, and at the end
00:55:35.860 | call the metrics on the validation set. Right, so this just says: for each epoch, go through each mini-batch
00:55:43.900 | and do one step of our optimizer step is
00:55:48.680 | Basically going to take advantage of this optimizer, but we're rewriting that from scratch shortly
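For a sense of what that loop boils down to, here is a bare-bones hand-written version (not fast.ai's actual fit function), assuming the data loader yields (categorical batch, continuous batch, target):

```python
import torch.nn.functional as F

def simple_fit(model, data, epochs, opt, crit=F.mse_loss):
    for epoch in range(epochs):
        model.train()
        for x_cat, x_cont, y in data.trn_dl:   # one mini-batch at a time
            opt.zero_grad()
            loss = crit(model(x_cat, x_cont), y)
            loss.backward()                    # backward pass: compute the gradients
            opt.step()                         # update the weights
        print(epoch, loss.item())
```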
00:55:54.580 | So this is notice we're not using a learner
00:56:06.420 | Okay, we're just using a PyTorch module. So this fit thing, although it's part of fast.ai,
00:56:11.560 | is like lower down the layers of abstraction. Now, this is the thing that takes a
00:56:15.940 | regular PyTorch model, so if you ever want to, like,
00:56:15.940 | skip as much
00:56:19.220 | Fast AI stuff as possible like you've got some pipe torch model. You've got some code on the internet
00:56:24.580 | You basically want to run it
00:56:26.060 | But you don't want to write your own training loop, then this is this is what you want to do
00:56:30.340 | You want to call fast AI's fit function and so what you'll find is like
00:56:34.300 | The library is designed so that you can kind of dig in at any layer of abstraction
00:56:39.140 | You like right and so at this layer of abstraction. You're not going to get things like
00:56:45.340 | Stochastic gradient descent with restarts you're not going to get like differential learning rates like all that stuff
00:56:52.180 | That's in the learner like you could do it, but you'd have to write it all by by hand yourself
00:56:56.620 | Right and that's the downside of kind of going down to this level of abstraction
00:57:00.900 | The upside is that as you saw the code for this is very simple. It's just a simple training loop
00:57:07.060 | It takes a standard pytorch model
00:57:09.060 | So this is like this is a good thing for us to use here
00:57:12.460 | We can just call it and it looks exactly like what we're used to seeing, right? We get our
00:57:19.540 | validation and training loss for the three epochs
00:57:23.220 | now you'll notice that
00:57:26.220 | We wanted something around point seven six
00:57:31.260 | So we're not there, so in other words the default fast AI collaborative filtering algorithm is doing something
00:57:39.620 | Smarter than this so we're going to try and do that
00:57:45.020 | One thing that we can do, since we're calling, you know, this lower level fit function:
00:57:49.780 | there's no learning rate annealing, so we could do our own learning rate annealing. You can
00:57:54.140 | see here there's a fast AI function called set_lrs
00:57:57.220 | you can pass in a standard PyTorch optimizer and pass in your new learning rate and
00:58:02.540 | then call fit again. And so this is how we can manually do a learning rate schedule
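In plain PyTorch the same idea, changing the learning rate on an existing optimizer and then fitting again, is just a loop over the optimizer's parameter groups. This is a sketch of what such a helper does, not the library's actual source:

```python
def set_lr(opt, lr):
    # Change the learning rate on an existing optimizer in place; calling the
    # fit loop again afterwards continues training at the new rate.
    for param_group in opt.param_groups:
        param_group['lr'] = lr
```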
00:58:09.100 | And so you can see we've got a little bit better
00:58:13.140 | We still got a long way to go
00:58:15.580 | Okay, so I think what we might do is we might have a
00:58:20.740 | seven-minute break and then we're going to come back and try and improve this score a bit
00:58:28.240 | For those who are interested somebody was asking me at the break for a kind of a quick
00:58:41.740 | Walk through so this is totally optional, but if you go into the fast AI library, there's a model.py file
00:58:53.740 | That's where fit is which we're just looking at which goes through
00:58:57.420 | Each epoch in epochs and then goes through each X and Y in the mini batch and then it calls this
00:59:05.500 | Step function so the step function
00:59:12.000 | here, and you can see the key thing is it calculates the output from the model, the model's forward, right, and so if you remember
00:59:21.400 | our dot product
00:59:24.320 | We didn't actually call model dot forward we just called model parentheses and that's because the
00:59:31.280 | nn dot module
00:59:34.080 | automatically
00:59:35.520 | You know when you call it as if it's a function it passes it along to forward
00:59:39.640 | Okay, so that's what that's doing there, right, and then the rest of this we'll learn about shortly; it's just basically doing the
00:59:46.860 | The loss function and then the backward pass
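A tiny illustration of that point: calling an nn.Module instance like a function goes through __call__, which calls forward for you (the class and names here are made up for the example):

```python
import torch
import torch.nn as nn

class TinyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(3, 1)

    def forward(self, x):
        return self.lin(x)

m = TinyModule()
x = torch.randn(5, 3)
# m(x) routes through nn.Module.__call__, which invokes our forward();
# that's why the training loop never calls model.forward directly.
assert torch.equal(m(x), m.forward(x))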
00:59:51.920 | Okay, so for those who are interested
00:59:54.200 | That's that's kind of gives you a bit of a sense of how the code is structured if you want to look at it
00:59:59.160 | and as I say, like, the fast AI code is designed to
01:00:04.440 | both be world class performance, but also
01:00:08.680 | Pretty easy to read so like feel free like take a look at it
01:00:13.480 | And if you want to know what's going on just ask on the forums
01:00:16.400 | And if you know if you think there's anything that could be
01:00:19.400 | clearer
01:00:21.840 | Let us know
01:00:23.840 | Because yeah, the code... you know, we're going to be digging into the code more and more
01:00:28.880 | Okay, so let's try and improve this a little bit and let's start off by improving it in Excel
01:00:38.040 | So you might have noticed here that we've kind of got the idea that
01:00:42.320 | User 72
01:00:44.880 | You know like sci-fi modern movies with special effects, you know
01:00:49.760 | Whatever and movie number 27 is sci-fi and has special effects and not much dialogue
01:00:55.520 | but we're missing an important case, which is like
01:01:01.400 | User 72 is pretty enthusiastic on the whole and on average rates things highly, you know, and movie
01:01:12.080 | you know, it's just a popular movie
01:01:14.320 | you know, which just on average is rated higher
01:01:17.000 | so what we'd really like is to add a
01:01:20.040 | constant for the user and a constant for the movie and
01:01:24.880 | Remember in neural network terms we call that a bias
01:01:28.880 | That's we want to add a bias so we could easily do that and if we go into the bias tab here
01:01:35.040 | We've got the same data as before and we've got the same
01:01:38.560 | Latent factors as before and I've just got one extra
01:01:44.000 | Row here and one extra column here and you won't be surprised here that we now
01:01:50.800 | Take the same matrix multiplication as before and we add in
01:01:56.160 | that and we add in that
01:01:59.360 | Okay, so that's our bias
01:02:02.640 | So other than that we've got exactly the same loss function over here
01:02:07.720 | And so just like before we can now go ahead and solve that and now our changing variables include the
01:02:16.600 | bias and we can say solve and if we leave that for a little while it will come to a
01:02:24.320 | better result than we had before
01:02:26.600 | Okay, so that's the first thing we're going to do to improve our model, and there's really very little to show
01:02:35.000 | Just to
01:02:38.360 | Make the code a bit shorter
01:02:40.600 | I've defined a function called get_emb which takes a number of inputs and a number of factors
01:02:47.720 | so the number of rows in the embedding matrix and the width of the embedding matrix, creates the embedding and
01:02:55.040 | randomly initializes it. I don't know why I'm doing negative to positive here when it was zero to positive last time
01:03:00.440 | Honestly, it doesn't matter much as long as it's in the right ballpark
01:03:03.280 | And then we return that initialized embedding
01:03:06.780 | So now we need not just our users by factors, which I'll chuck into u, our movies by factors
01:03:14.680 | which I'll chuck into m, but we also need users by 1
01:03:18.440 | which we'll put into ub, user bias, and movies by 1 which we'll put into mb, movie bias
01:03:24.280 | Okay, so this is just doing a list comprehension
01:03:27.360 | Going through each of the tuples creating embedding for each of them and putting them into these things
01:03:32.800 | Okay, so now our forward is exactly the same as before
01:03:38.200 | U times m dot sum, and this is actually a little confusing because we're doing it in two steps
01:03:47.840 | Maybe to make it a bit easier. Let's pull this out
01:03:51.300 | Put it up here
01:03:54.240 | Put this in parentheses
01:03:56.960 | Okay, so maybe that looks a little bit more familiar
01:04:00.880 | All right, u times m dot sum, that's the same dot product, and then here we're just going to add in our user bias and
01:04:07.840 | our movie bias
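Here is a sketch of what that bias version of the model might look like in PyTorch; the class and attribute names are illustrative, matching the description above rather than the exact notebook code:

```python
import torch.nn as nn

class EmbeddingDotBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50):
        super().__init__()
        self.u  = nn.Embedding(n_users,  n_factors)   # users by factors
        self.m  = nn.Embedding(n_movies, n_factors)   # movies by factors
        self.ub = nn.Embedding(n_users,  1)           # users by 1: user bias
        self.mb = nn.Embedding(n_movies, 1)           # movies by 1: movie bias

    def forward(self, users, movies):
        # Same dot product as before, plus the per-user and per-movie bias.
        res = (self.u(users) * self.m(movies)).sum(1)
        return res + self.ub(users).squeeze() + self.mb(movies).squeeze()
```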
01:04:10.040 | Dot squeeze is the PyTorch thing that gets rid of an additional
01:04:17.480 | unit axis
01:04:19.480 | That's not going to make any sense if you haven't done broadcasting before
01:04:23.040 | I'm not going to do broadcasting in this course because we've already done it and we're doing it in the machine learning course
01:04:29.980 | But basically in short
01:04:32.200 | Broadcasting is what happens when you do something like this where um is a matrix
01:04:39.440 | self dot ub of users is a vector
01:04:42.000 | How do you add a vector to a matrix and basically what it does?
01:04:46.880 | Is it duplicates?
01:04:48.880 | the vector
01:04:50.400 | So that it makes it the same size as the matrix and the particular way whether it duplicates it across columns or down rows
01:04:57.240 | Or how it does it is called broadcasting the broadcasting rules are the same as numpy
01:05:02.700 | PyTorch didn't actually used to support broadcasting
01:05:06.100 | So I was actually the guy who first added broadcasting to PyTorch using an ugly hack and then the PyTorch authors did an awesome job
01:05:12.880 | Of supporting it actually inside the language
01:05:16.400 | So now you can use the same broadcasting operations in PyTorch as numpy
01:05:21.000 | If you haven't dealt with this before it's really important to learn it
01:05:26.540 | Because like it's it's kind of the most important fundamental way to do computations quickly in numpy and PyTorch
01:05:34.760 | It's the thing that lets you not have to do loops
01:05:37.100 | Could you imagine here if I had to loop through every row of this matrix and add, you know,
01:05:43.120 | this back to every row? It would be slow, it would be, you know, a lot more code
01:05:47.640 | And the idea of broadcasting it actually goes all the way back to
01:05:52.840 | APL which is a language designed in the 50s by an extraordinary guy called Ken Iverson
01:05:58.120 | APL was originally designed, or written out, as a new type of mathematical notation
01:06:04.200 | He has this great essay called
01:06:07.480 | Notation as a tool for thought and the idea was that like really good notation could actually make you think of better things
01:06:14.320 | And part of that notation is this idea of broadcasting. I'm incredibly enthusiastic about it, and we're going to use it plenty
01:06:25.080 | either watch the machine learning lesson or
01:06:29.360 | You know Google numpy broadcasting
01:06:32.760 | for information
01:06:35.560 | Anyway, so basically it works reasonably intuitively we can add on we can add the vectors to the matrix
01:06:43.160 | All right
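A quick made-up example of that broadcasting behaviour in PyTorch:

```python
import torch

um = torch.arange(6.).reshape(2, 3)    # a small "matrix" of values
bias = torch.tensor([10., 20., 30.])   # a vector

# Broadcasting: the vector is conceptually duplicated down the rows so it can
# be added to every row of the matrix, with no explicit loop.
print(um + bias)
# tensor([[10., 21., 32.],
#         [13., 24., 35.]])
```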
01:06:47.400 | Having done that, we're now going to do one more trick, which is, I think it was Yannet who asked earlier about could we
01:06:55.840 | Squish the ratings to be between 1 and 5
01:07:00.200 | and the answer is
01:07:03.360 | We could right and specifically what we could do is
01:07:07.680 | we could
01:07:10.560 | Put it through a sigmoid function
01:07:12.600 | All right, so to remind you the sigmoid function
01:07:17.000 | Looks like that right and this is that's one
01:07:24.200 | All right, we could put it through a sigmoid function
01:07:28.040 | So we could take like 4.96 and put it through a sigmoid function and like that. You know that's kind of high
01:07:34.360 | So it kind of be over here somewhere right and then we could multiply that
01:07:39.400 | sigmoid like the result of that by 5
01:07:42.600 | For example right and in this case we want it to be between 1 and 5 right so maybe we might multiply it by 4 and
01:07:51.080 | Add 1 instead, and that's the basic idea
01:07:54.680 | And so here is that trick we take
01:07:58.240 | The result so the result is basically the thing that comes straight out of the dot product plus the addition of the biases
01:08:05.720 | And put it through a sigmoid function now in pytorch
01:08:10.440 | Basically all of the functions you can do the tensors are available
01:08:16.520 | Inside this thing called capital F, and this is like totally standard in pytorch
01:08:23.040 | It's actually called torch dot nn dot functional
01:08:26.000 | But everybody including all of the pytorch docs import torch dot nn dot functional as capital F
01:08:32.220 | Right so capital F dot sigmoid means a function called sigmoid that is coming from
01:08:38.360 | torches
01:08:40.560 | Functional module right and so that's going to apply a sigmoid function to the result
01:08:45.520 | So I've squished them all between 0 and 1 using that nice little shape, and then I can multiply that by
01:08:52.040 | 5 minus 1, i.e. 4
01:08:54.040 | right, and then add on 1, and that's going to give me something between 1 and 5, okay, so
01:08:59.560 | Like there's no need to do this. I could comment it out, and it will still work right
01:09:06.100 | But now it has to come up with a set of calculations that are always between
01:09:10.560 | 1 and 5, right, whereas if I leave this in then it, like, makes it really easy
01:09:15.960 | It's basically like: oh, if you think this is a really good movie, just calculate a really high number
01:09:20.520 | if it's a really crappy movie, calculate a really low number, and I'll make sure it's in the right region
01:09:25.140 | So even though this isn't a neural network
01:09:27.160 | It's still a good example of this kind of thing: if you're doing any kind of parameter fitting
01:09:32.100 | try and make it so that the thing that you want your function to return
01:09:36.240 | is easy for it to return. Okay, so that's why we do that function squishing
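That squishing trick on its own, with made-up numbers (F.sigmoid still works, though newer PyTorch prefers torch.sigmoid):

```python
import torch
import torch.nn.functional as F

min_rating, max_rating = 1.0, 5.0
raw = torch.tensor([-3.0, 0.0, 4.96])   # raw dot-product-plus-bias outputs

# Squash into (0, 1) with a sigmoid, then rescale into the rating range.
pred = F.sigmoid(raw) * (max_rating - min_rating) + min_rating
print(pred)   # roughly tensor([1.19, 3.00, 4.97])
```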
01:09:42.320 | So we call this EmbeddingDotBias
01:09:47.280 | So we can create that in the same way as before you'll see here
01:09:50.960 | I'm calling dot CUDA to put it on the GPU because we're not using any learner stuff normally that'll happen for you
01:09:57.400 | But we have to manually say put it on the GPU
01:09:59.780 | This is the same as before create our optimizer
01:10:02.600 | Fit exactly the same as before and these numbers are looking good and again. We'll do a little
01:10:10.760 | Change to our learning rate and learning rate schedule, and we're down to 0.8. So we're actually pretty close
01:10:17.240 | pretty close
01:10:20.480 | So that's the key steps
01:10:28.080 | This is how
01:10:30.080 | This is how most
01:10:32.720 | Collaborative filtering is done
01:10:35.520 | And Yannet reminded me of an important point, which is that this is not
01:10:40.840 | strictly speaking a matrix factorization, because strictly speaking a matrix factorization would take that matrix by
01:10:50.200 | that matrix to create
01:10:52.760 | this matrix and
01:10:55.560 | remembering
01:10:58.560 | Anywhere that this is empty
01:11:05.040 | Like here or here
01:11:07.040 | We're putting in a zero
01:11:10.080 | Right we're saying if the original was empty put in a zero
01:11:14.400 | right now normally
01:11:16.880 | You can't do that with normal matrix factorization; normal matrix factorization creates the whole matrix
01:11:23.440 | And so it was a real problem actually
01:11:25.600 | When people used to try and use traditional linear algebra for this because when you have these sparse matrices like in practice
01:11:33.720 | this matrix doesn't have many gaps, because we picked the users that watch the most movies and the movies that are the most
01:11:40.980 | watched, but if you look at the whole matrix, it's mainly empty, and so traditional
01:11:46.200 | techniques treated empty as zero, and so, like, you basically have to predict a zero
01:11:52.440 | as if the fact that I haven't watched a movie means I don't like the movie, which gives terrible answers
01:11:57.740 | So this probabilistic matrix factorization approach
01:12:02.880 | takes advantage of the fact that our data structure
01:12:06.960 | Actually looks like this
01:12:09.720 | Rather than that cross tab right and so it's only calculating the loss for the user ID movie ID
01:12:16.000 | Combinations that actually appear that's exactly like user ID one movie ID one or two nine should be three
01:12:21.880 | It's actually three and a half so our loss is point five like there's nothing here. That's ever going to calculate a
01:12:29.280 | Prediction or a loss for a user movie combination that doesn't appear in this table
01:12:33.680 | By definition, the only stuff that can appear in a mini batch is what's in this table
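As a tiny made-up illustration of that point, the loss only ever touches the (user, movie, rating) rows that actually exist, never a dense crosstab:

```python
import torch
import torch.nn.functional as F

# Three observed rows in the long-format ratings table (made-up data).
users   = torch.tensor([0, 0, 1])
movies  = torch.tensor([1, 2, 0])
ratings = torch.tensor([3.5, 4.0, 2.0])

preds = torch.tensor([3.0, 4.5, 2.0])   # whatever the model predicted for those rows
loss  = F.mse_loss(preds, ratings)      # averaged over the observed rows only
print(loss)                             # tensor(0.1667)
```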
01:12:39.720 | And like a lot of this happened interestingly enough actually in the Netflix prize
01:12:48.660 | So before the Netflix prize came along
01:12:52.320 | this probabilistic matrix factorization had actually already been invented, but nobody noticed
01:12:59.440 | Alright, and then in the first year of the Netflix prize
01:13:01.960 | Someone wrote this like really really famous blog post where they basically said like hey check this out
01:13:08.040 | Incredibly simple technique works incredibly well and suddenly all the Netflix leaderboard entries work much much better
01:13:15.560 | And so you know that's quite a few years ago now, and this is like now
01:13:19.960 | Every collaborative filtering approach does this not every collaborative filtering approach adds this sigmoid thing by the way. It's not like
01:13:29.000 | Rocket science this is this is not like the NLP thing we saw last week
01:13:32.760 | Which is like hey, this is a new state-of-the-art like this is you know not particularly uncommon
01:13:37.100 | But there are still people that don't do this and it definitely helps a lot right to have this and so
01:13:42.600 | Actually you know what we could do is maybe now's a good time to have a look at the definition of this right so
01:13:51.400 | the column data
01:13:54.360 | module
01:13:56.040 | Contains all these definitions
01:13:58.040 | and we can now compare this to the thing we originally used which was
01:14:04.560 | Whatever came out of collab filter data set all right, so let's go to
01:14:10.040 | collab
01:14:13.560 | Filter data set here. It is and we called
01:14:17.640 | Get learner right so we can go down to get learner and that created a collab filter learner
01:14:25.800 | passing in the model from get_model, and here is get_model, so it created an EmbeddingDotBias, and
01:14:33.160 | so here is EmbeddingDotBias, and
01:14:37.080 | You can see here here. It is like. It's the same thing. There's the embedding for each of the things
01:14:43.040 | Here's our forward that does the u times i dot sum
01:14:47.880 | plus the two biases, and then the
01:14:49.960 | sigmoid, so in fact
01:14:51.960 | We have just actually rebuilt
01:14:54.680 | What's in the fast AI library literally?
01:14:56.840 | It's a little shorter and easier because we're taking advantage of the fact that there's a special
01:15:05.240 | collaborative filtering data set
01:15:08.360 | So we can actually, we're getting passed in the users and the items, and we don't have to pull them out of cats and conts
01:15:14.360 | But other than that this is exactly the same
01:15:17.440 | So hopefully you can see like the fast AI library is not some inscrutable code containing concepts
01:15:23.120 | You can never understand. We've actually just built up this entire thing from scratch ourselves
01:15:28.480 | And so why did we get?
01:15:31.640 | 0.76 rather than 0.8
01:15:35.720 | You know, I think it's simply because we used stochastic gradient descent with restarts and a cycle multiplier and an Adam optimizer
01:15:45.160 | You know like a few little
01:15:47.160 | training tricks
01:15:50.440 | So I'm looking at this and thinking that we could totally improve this model, but maybe by
01:15:57.800 | looking at the date and doing some tricks with the date. Yes, this is kind of just a regular
01:16:04.400 | kind of model in a way. Yeah, you can add more features. Yeah, it's actually exactly so, like, now that you've seen this
01:16:11.760 | You could now you know even if you didn't have
01:16:16.200 | embedding bias in a notebook that you've written yourself some other model that's in fast AI you could look at it in fast AI and
01:16:22.440 | Be like oh that does most of the things that I'd want to do, but it doesn't deal with time and so you could just go
01:16:28.680 | Oh, okay. Let's grab it. Copy it. You know pop it into my notebook and
01:16:33.560 | Let's create you know the better version
01:16:36.920 | Right, and then you can start playing that and you can now create your own
01:16:41.920 | model plus
01:16:45.040 | from the open source code here, and so
01:16:48.080 | Yeah, you're that's mentioning a couple things we could do we could try incorporating time stamps, so we could assume that maybe
01:16:53.980 | Well, maybe there's just like some
01:16:57.000 | For a particular user over time users tend to get more or less positive about movies
01:17:02.640 | Also remember there was the list of genres for each movie. Maybe we could incorporate that
01:17:09.600 | So one problem is it's a little bit difficult to incorporate that stuff
01:17:15.600 | into this EmbeddingDotBias model, because it's pretty custom, right, so what we're going to do next is
01:17:22.320 | we're going to try to create a
01:17:24.320 | neural net version of this
01:17:27.840 | So the basic idea here is
01:17:32.000 | We're going to
01:17:35.440 | Take exactly the same thing as we had before here's our list of users
01:17:38.720 | right and here is
01:17:41.520 | our embeddings
01:17:43.840 | All right, and here's our list of movies and
01:17:46.440 | here is our
01:17:49.200 | Embeddings right and so as you can see I've just kind of transposed
01:17:52.600 | The movie ones so that they're all in the same orientation
01:17:57.120 | And here is our user movie rating
01:18:00.680 | but de-crosstabbed, okay, so in the original format, so each row is a user movie rating
01:18:09.680 | So the first thing I do is I need to replace
01:18:14.000 | user 14
01:18:16.000 | with that users
01:18:18.040 | Contiguous index right and so I can do that in Excel using this match that basically says
01:18:25.400 | What you know how far down this list you have to go and it said
01:18:29.320 | User 14 was the first thing in that list
01:18:32.880 | Okay, user 29 was the second thing in that list, and so forth, okay?
01:18:37.920 | So this is the same as that thing that we did
01:18:42.040 | In our Python code where we basically created a dictionary to map this stuff
01:18:46.640 | So now we can for this particular user movie rating
01:18:51.540 | Combination we can look up
01:18:54.640 | the appropriate embedding
01:18:56.960 | Right and so you can see here what it's doing is it saying all right. Let's basically offset
01:19:04.000 | from the start of this list
01:19:07.720 | And the number of rows we're going to go down is equal to the user index and the number of columns
01:19:12.200 | We're going to go across is
01:19:13.960 | One two three four five okay, and so you can see what it does is it creates point one nine point six three point three one
01:19:19.960 | Here it is, point one nine. Okay, so this is literally
01:19:24.080 | what an embedding does, but remember
01:19:27.280 | This is exactly the same as
01:19:30.120 | doing a
01:19:32.200 | one hot encoding right because if instead this was a
01:19:37.280 | Vector containing one zero zero zero zero zero right, and we multiplied that by this matrix
01:19:44.440 | Then the only row it's going to return would be the first one okay, so
01:19:49.960 | So it's really useful to remember that embedding
01:19:54.000 | Actually just is a matrix product
01:19:56.680 | The only reason it exists the only reason it exists is because this is an optimization
01:20:03.200 | You know this lets PyTorch know like okay. This is just a matrix multiply
01:20:08.480 | But I guarantee you that you know this thing is one hot encoded
01:20:13.360 | Therefore you don't have to actually do the matrix multiply you can just do a direct look up
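You can check that equivalence directly in PyTorch: looking up an index in an embedding gives the same answer as multiplying a one hot vector by the embedding's weight matrix (sizes here are made up):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(4, 3)            # 4 "users", 3 latent factors each

idx = torch.tensor([2])             # look up user index 2
one_hot = torch.zeros(1, 4)
one_hot[0, 2] = 1.0

lookup = emb(idx)                   # the embedding "array look-up"
matmul = one_hot @ emb.weight       # the equivalent one-hot matrix multiply
print(torch.allclose(lookup, matmul))   # True
```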
01:20:17.540 | Okay, so that's literally all an embedding is: it is a computational
01:20:22.960 | performance thing for a particular kind of matrix multiply. All right, so that looks up that user
01:20:31.320 | and then we can look up that user's movie. All right, so here is movie ID
01:20:36.880 | movie ID
01:20:39.360 | 417 which apparently is index number 14 here. It is here, so it should have been point seven five point four seven
01:20:46.200 | Yes, it is point seven five point four seven, okay, so we've now got the user embedding and the movie embedding
01:20:53.120 | and rather than doing a
01:20:55.800 | dot product of
01:20:59.440 | those two
01:21:01.440 | Right which is what we do normally?
01:21:03.920 | Instead, what if we concatenate the two together into a single vector of length
01:21:12.660 | ten and then feed that into a neural net?
01:21:17.400 | Right and so anytime we've got you know a
01:21:23.280 | tensor of input activations or in this case a tensor of
01:21:29.320 | Actually, this is a tensor of output activations. This is coming out of an embedding layer
01:21:32.840 | We can chuck it in a neural net because neural nets we now know can calculate
01:21:39.040 | Anything okay including hopefully collaborative filtering, so let's try that
01:21:44.960 | So here is our embedding net
01:21:51.880 | This time I have not bothered to create a separate bias
01:21:59.520 | because instead the
01:22:01.760 | Linear layer in PyTorch already has a bias in it right so when we go
01:22:08.960 | nn.linear
01:22:11.280 | right
01:22:13.280 | Let's kind of draw this out
01:22:16.480 | So we've got our
01:22:20.520 | U matrix right and this is the number of users and this is the
01:22:28.040 | number of factors
01:22:29.760 | Right and we've got our M matrix
01:22:32.480 | All right, so here's our number of movies and here's our again number of factors all right, and so remember we look up a
01:22:43.240 | Single user
01:22:48.440 | We look up a single movie and let's grab them and concatenate them together
01:22:55.720 | Right so here's like the user part. Here's the movie part and then let's put that
01:23:01.320 | through a matrix product
01:23:04.600 | Right so that number of rows here is going to have to be the number of users plus the number of movies
01:23:10.920 | because that's how long that is and
01:23:13.600 | then the number of columns
01:23:16.240 | Can be anything we want?
01:23:21.600 | Because we're going to take that so in this case. We're going to pick 10 apparently so it's picked 10 and then we're going to
01:23:27.920 | stick that through a
01:23:29.920 | ReLU and
01:23:31.920 | Then stick that through another
01:23:35.000 | Matrix, which obviously needs to be of size 10 here
01:23:38.440 | And then the number of columns is a size 1 because we want to predict a single rating
01:23:49.760 | Okay, and so that's our kind of flow chart of what's going on right it is a standard
01:23:56.920 | what I'd call a one hidden layer neural net; it depends how you think of it, like there's kind of an embedding layer
01:24:03.820 | But because this is linear and this is linear the two together is really one linear layer, right? It's just a computational convenience
01:24:11.600 | So it's really got one hidden layer because it's just got one layer before this nonlinear activation
01:24:20.460 | so in order to create a
01:24:22.460 | linear layer with some number of rows and some number of columns, you just go nn.Linear in here
01:24:29.460 | In the machine learning class this week
01:24:33.560 | We learned how to create a linear layer from scratch by creating our own weight matrix and our own biases
01:24:40.680 | So if you want to check that out you can do so there right, but it's the same basic technique. We've already seen
01:24:49.240 | We create our embeddings we create our two linear layers
01:24:53.240 | That's all the stuff that we need to start with you know really if I wanted to make this more general
01:24:59.120 | I would have had another parameter here called like
01:25:02.400 | num hidden
01:25:05.640 | you know equals
01:25:07.640 | 10, and then this would be a parameter, and
01:25:13.080 | Then you could like more easily play around with different numbers of activations
01:25:17.400 | So when we say like okay in this layer. I'm going to create a layer with this many activations all I mean
01:25:23.700 | assuming it's a fully connected layer is
01:25:26.360 | My linear layer has how many columns in its weight matrix. That's how many activations it creates
01:25:33.040 | All right, so we grab our users and movies we put them through our embedding matrix, and then we concatenate them together
01:25:41.560 | Okay, so torch dot cat
01:25:43.560 | Concatenates them together on the first dimension, so in other words we concatenate the columns together to create longer rows
01:25:50.840 | Okay, so that's concatenating on dimension one
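For example, with made-up sizes:

```python
import torch

u_emb = torch.randn(2, 5)    # a mini-batch of 2 user embeddings, 5 factors each
m_emb = torch.randn(2, 5)    # the matching 2 movie embeddings

x = torch.cat([u_emb, m_emb], dim=1)    # concatenate along the columns
print(x.shape)    # torch.Size([2, 10]): longer rows, one per rating
```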
01:25:53.920 | Dropout we'll come back to in a moment; we've looked at that briefly
01:25:59.880 | So then having done that we'll put it through that linear layer we had
01:26:07.440 | We'll do our ReLU, and you'll notice that ReLU is again inside our capital F, nn.functional
01:26:15.120 | It's just a function so remember activation functions are basically things that take one activation in and spit one activation out
01:26:23.320 | in this case take in something that can have negatives or positives and
01:26:27.440 | Truncate the negatives to zero. That's all relu does
01:26:31.380 | And then here's our sigmoid
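Putting the pieces just described together, a sketch of that one-hidden-layer net might look like this; the class name, sizes and dropout probabilities are illustrative, not the exact notebook code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50, n_hidden=10,
                 min_rating=1.0, max_rating=5.0):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        self.lin1 = nn.Linear(n_factors * 2, n_hidden)   # hidden layer
        self.lin2 = nn.Linear(n_hidden, 1)               # down to one rating
        self.drop1, self.drop2 = nn.Dropout(0.75), nn.Dropout(0.75)
        self.min_rating, self.max_rating = min_rating, max_rating

    def forward(self, users, movies):
        # Concatenate the two embeddings into one long vector per rating.
        x = self.drop1(torch.cat([self.u(users), self.m(movies)], dim=1))
        x = self.drop2(F.relu(self.lin1(x)))             # linear, then ReLU
        x = self.lin2(x).squeeze()
        # Sigmoid squish, rescaled into the rating range, as before.
        return torch.sigmoid(x) * (self.max_rating - self.min_rating) + self.min_rating
```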
01:26:36.600 | So that's that that is now a genuine
01:26:39.980 | Neural network. I don't know if we get to call it deep. It's only got one hidden layer
01:26:44.880 | But it's definitely a neural network right and so we can now construct it we can put it on the GPU
01:26:50.540 | We can create an optimizer for it, and we can fit it
01:26:54.360 | Now you'll notice there's one other thing. I've been passing to fit which is
01:26:59.440 | What loss function are we trying to minimize?
01:27:02.480 | Okay, and this is the mean squared error loss and again. It's inside F
01:27:06.260 | Okay, pretty much all the functions are inside it, okay?
01:27:11.720 | One of the things that you have to pass fit is something saying like how do you score is what counts as good or bad?
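For example, F.mse_loss on a couple of made-up predictions:

```python
import torch
import torch.nn.functional as F

preds   = torch.tensor([3.0, 4.5])
targets = torch.tensor([3.5, 4.0])
print(F.mse_loss(preds, targets))   # tensor(0.2500), the mean of the squared errors
```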
01:27:18.680 | so Jeremy now that we have a
01:27:22.360 | real neural net, do we have to use the same number of embedding dimensions for users and movies? And that's a great question: you don't, no
01:27:30.440 | It's absolutely right, you don't, and so, like, we've got a lot of benefits here, right, because if we
01:27:36.040 | You know think about
01:27:39.560 | You know we're grabbing a user embedding we're concatenating it with a movie embedding which maybe is like I don't know some different size
01:27:51.960 | but then also perhaps we looked up the genre of the movie and like you know there's actually a
01:28:00.520 | Embedding matrix of like number of genres
01:28:02.960 | By I don't know
01:28:05.480 | Three or something and so like we could then concatenate like a genre embedding and then maybe the time stamp is in here as a continuous
01:28:12.780 | Number right and so then that whole thing we can then feed into
01:28:17.080 | you know
01:28:20.040 | Our neural net right and then at the end
01:28:27.120 | Remember our final non-linearity was a sigmoid right so we can now
01:28:31.080 | Recognize that this thing we did, where we did sigmoid times (max rating minus min rating) plus blah blah blah
01:28:36.560 | Is actually just another?
01:28:39.360 | Nonlinear activation function right and remember in our last layer
01:28:44.520 | We use generally different kinds of activation functions
01:28:48.200 | So as we said we don't need any activation function at all right we could just do
01:28:57.040 | But by not having any nonlinear activation function, we're just making it harder, so that's why we put the sigmoid in there as well, okay
01:29:07.160 | so we can then fit it in the usual way and
01:29:10.940 | There we go you know interestingly we actually got a better score than we did with our
01:29:17.800 | This model
01:29:22.000 | So it'll be interesting to try training this with stochastic gradient descent with restarts and see if it's actually better
01:29:27.480 | You know maybe you can play around with the number of hidden layers and the dropout and whatever else and see if you can
01:29:34.200 | Come up with you know get a better answer than
01:29:37.960 | Point
01:29:44.880 | Seven six ish
01:29:47.840 | Okay, so so general so this is like if you were going deep into collaborative filtering at your workplace
01:29:55.560 | Or whatever this wouldn't be a bad way to go like it's like I'd start out with like oh, okay
01:29:59.840 | Here's like a collaborative filtering data set in fast AI
01:30:02.800 | get_learner, there's, you know, not much I can send it; basically number of factors is about the only thing that I pass in
01:30:09.320 | I can learn for a while maybe try a few different approaches, and then you're like okay. There's like
01:30:17.040 | That's how I go if I use the defaults
01:30:19.200 | Okay, how do I make it better, and then I'd be like digging into the code and saying like okay?
01:30:24.600 | What would Jeremy actually do here? This is actually what I want, you know, and then play around with it
01:30:29.880 | So one of the nice things about the neural net approach
01:30:33.100 | is that, you know, as Yannet mentioned
01:30:36.720 | We can have different numbers of
01:30:39.380 | embeddings
01:30:41.640 | We can choose how many hidden activations, and we can also choose
01:30:46.280 | dropout right so
01:30:48.400 | So what we're actually doing is we haven't just got a ReLU; we're also going, like, okay, let's
01:30:55.560 | let's delete a few things at random
01:31:03.820 | All right, that's dropout. In this case we were deleting,
01:31:09.280 | after the first linear layer,
01:31:14.240 | 75% of them, all right, and then after the second linear layer 75% of them, so we can add a whole lot of regularization
01:31:19.960 | Yeah, so you know this it kind of feels like the this this embedding net
01:31:25.080 | You know you could you could change this again. We could like have it so that we can pass into the constructor
01:31:32.360 | Well, if we wanted to make it look as much as possible like what we had before, we could pass in ps
01:31:41.640 | ps equals 0.75
01:31:44.680 | comma 0.75
01:31:47.160 | I'm not sure this is the best API, but it's not terrible
01:31:50.700 | Probably, since we've only got exactly two layers, we could say p1 equals 0.75
01:31:57.280 | P2 equals 0.75
01:32:07.600 | So then this will be
01:32:09.840 | P1 this will be
01:32:14.440 | p2; you know, there we go. And, like, if you wanted to go further
01:32:21.200 | You could make it look more like our
01:32:25.240 | structured data learner: you could actually have a thing, this number of hidden,
01:32:31.800 | you know, maybe you could make it a list, and so then rather than creating exactly one
01:32:38.800 | hidden layer and one output layer, this could be a little loop that creates n
01:32:43.440 | hidden layers, each one of the size you want. So, like, this is all stuff you can play with during the week if you want to
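A sketch of that generalisation: pass in a list of hidden sizes and dropout ps and build the layers in a loop (the function name and numbers here are made up):

```python
import torch.nn as nn

def make_layers(n_in, n_hiddens, ps):
    # Build Linear -> ReLU -> Dropout blocks from lists of sizes and dropout
    # probabilities, then a final linear layer down to a single output.
    layers, prev = [], n_in
    for nh, p in zip(n_hiddens, ps):
        layers += [nn.Linear(prev, nh), nn.ReLU(), nn.Dropout(p)]
        prev = nh
    layers.append(nn.Linear(prev, 1))
    return nn.Sequential(*layers)

net = make_layers(100, n_hiddens=[10, 10], ps=[0.75, 0.75])
```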
01:32:49.120 | And I feel like if you've got a much smaller collaborative filtering data set
01:32:55.620 | you know, maybe you'd need like more regularization or whatever; if it's a much bigger one
01:33:00.560 | maybe more layers would help. I don't know, you know, I haven't seen
01:33:07.200 | Much discussion of this kind of neural network approach to collaborative filtering
01:33:10.880 | But I'm not a collaborative filtering expert, so maybe it's maybe it's around, but that'd be interesting thing to try
01:33:16.640 | So the next thing I wanted to do was to talk about
01:33:26.120 | The training loop, so what's actually happening inside the training loop?
01:33:34.120 | So at the moment we're basically passing off
01:33:37.600 | the actual updating of the weights to PyTorch's optimizer
01:33:43.840 | But what I want to do is like understand
01:33:47.640 | What that optimizer is is actually doing and we're also I also want to understand what this momentum term is doing
01:33:55.920 | so you'll find we have a
01:34:01.400 | spreadsheet called grad desk gradient descent
01:34:04.560 | And it's kind of designed to be read left to right, sorry, right to left, worksheet-wise
01:34:09.620 | so the rightmost worksheet
01:34:12.560 | Is some data right and we're going to implement gradient descent in Excel because obviously everybody wants to do deep learning in Excel and we've done
01:34:21.440 | Collaborative filtering in Excel we've done
01:34:24.080 | Convolutions in Excel so now we need SGD in Excel so we can replace Python once and for all okay, so
01:34:31.360 | Let's start by creating some data right and so here's
01:34:35.440 | you know here's some
01:34:38.560 | Independent you know I've got one column of X's you know and one column
01:34:44.720 | of y's, and these are actually directly linearly related, so this is random
01:34:51.640 | All right, and this one here is equal to X
01:34:54.860 | times 2 plus
01:34:59.320 | 30 okay, so
01:35:01.320 | Let's try and use Excel to take
01:35:05.640 | That data and try and learn
01:35:09.520 | those parameters
01:35:12.600 | Okay, that's going to be our goal
01:35:17.440 | So let's start with the most basic version of SGD
01:35:21.800 | And so the first thing I'm going to do is I'm going to run a macro so you can see what this looks like
01:35:26.520 | So I hit run, and it does five epochs; I do another five epochs, and another five epochs
01:35:33.680 | Okay, so
01:35:36.280 | The first one was pretty terrible; it's hard to see, so I just delete that first one to get better scaling
01:35:44.360 | All right, so you can see actually it's pretty constantly improving the loss right. This is the loss per epoch
01:35:52.360 | All right, so how do we do that? So let's reset it
01:35:55.040 | So here is my
01:36:00.120 | X's and my Y's and
01:36:02.200 | What I do is I start out by assuming
01:36:05.000 | Some intercept and some slope right so this is my randomly initialized weights
01:36:13.320 | So I have randomly initialized them both to one
01:36:16.120 | You could pick a different random number if you like, but I promise that I randomly picked the number one
01:36:23.400 | Twice there you go
01:36:26.080 | It was a random number between one and one
01:36:30.560 | So here is my intercept and slope. I'm just going to copy them over here right so you can literally see this is just equal
01:36:39.800 | Here is equals C2. Okay, so I'm going to start with my very first row of data x equals 40 and y equals 58
01:36:47.640 | And my goal is to come up
01:36:50.440 | After I look at this piece of data. I want to come up with a slightly better intercept and a slightly better slope
01:36:58.840 | So to do that I need to first of all basically figure out
01:37:04.920 | Which direction is is down in other words if I make my intercept a little bit higher
01:37:11.360 | Or a little bit lower would it make my error a little bit better or a little bit worse?
01:37:16.200 | So let's start out by calculating the error so to calculate the error the first thing we need is a prediction
01:37:22.400 | So the prediction is equal to the intercept
01:37:26.660 | Plus x times slope right so that is our
01:37:32.520 | Zero hidden layer neural network, okay?
01:37:35.560 | And so here is our error: it's equal to our prediction minus our actual, squared
01:37:41.840 | So we could like play around with this. I don't want my error to be 1849. I'd like it to be lower
01:37:48.240 | So what if we set the intercept to?
01:37:53.040 | 1849 goes to 1840 okay, so a higher intercept would be better
01:37:57.360 | Okay, what about the slope if I increase that?
01:38:01.720 | It goes from 1849
01:38:03.720 | To 1730 okay a higher slope would be better as well
01:38:07.520 | Not surprising because we know
01:38:10.120 | Actually that there should be 30 and 2
01:38:12.460 | So one way to
01:38:16.120 | Figure that out
01:38:18.480 | You know encode in the spreadsheet is to do literally what I just did
01:38:22.200 | It's to add a little bit to the intercept and the slope and see what happens
01:38:25.520 | And that's called finding the derivative through finite differencing right and so let's go ahead and do that
01:38:31.920 | So here is the value of my error if I add 0.01
01:38:40.480 | to my intercept, right, so it's C4 plus 0.01, and then I just put that into my linear function
01:38:46.680 | and then I subtract my actual, all squared, right, and so that causes my error to go down a bit as that's increasing
01:38:58.000 | Which one is that increasing? C4: increasing the intercept a little bit has caused my error to go down
01:39:03.900 | So what's the derivative well the derivative is equal to how much the dependent variable changed by?
01:39:10.100 | Divided by how much the independent variable changed by right and so there it is right our
01:39:16.040 | Dependent variable changed by that minus that
01:39:18.600 | Right and our independent variable we changed by 0.01
01:39:22.080 | So there is the estimated value of
01:39:26.080 | the derivative of the error with respect to the intercept
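The same finite differencing estimate in a few lines of Python, using the data point from the spreadsheet (variable names are mine):

```python
def finite_diff(f, x, eps=0.01):
    # Estimate df/dx by nudging x a little and seeing how much f changes,
    # exactly like the extra spreadsheet columns described above.
    return (f(x + eps) - f(x)) / eps

# The squared error of our tiny linear model at one data point (x=40, y=58).
x, y, a, b = 40.0, 58.0, 1.0, 1.0              # current intercept a and slope b
err_of_a = lambda a_: (a_ + b * x - y) ** 2    # error as a function of the intercept
err_of_b = lambda b_: (a + b_ * x - y) ** 2    # error as a function of the slope
print(finite_diff(err_of_a, a), finite_diff(err_of_b, b))   # both negative here
```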
01:39:28.080 | All right, so remember when people talk about derivatives
01:39:31.200 | this is all they're doing: they're saying, what's this value
01:39:35.120 | as we make this number smaller and smaller and smaller and smaller, as it limits to zero
01:39:42.720 | I'm not smart enough to think in terms of, like, derivatives and integrals and stuff like that, so whenever I think about this
01:39:49.360 | I always think about, you know, an actual, like, plus point oh one divided by point oh one, because, like, I just find that
01:39:55.960 | Easier just like I never think about probability density functions. I always think about actual probabilities about toss a coin
01:40:02.880 | Something happens three times
01:40:05.480 | So I always think like remember. It's it's totally fair to do this because a computer is
01:40:11.240 | Discrete it's not continuous like a computer can't do anything infinitely small anyway, right?
01:40:17.880 | So it's actually got to be calculating things at some level of precision right and our brains kind of need that as well
01:40:25.920 | So this is like my version of Geoffrey Hinton's trick: to visualize things in more than two dimensions
01:40:32.000 | You just like say twelve dimensions really quickly while visualizing it in two dimensions
01:40:35.860 | This is my equivalent you know to think about derivatives. Just think about division
01:40:41.920 | And like although all the mathematicians say no you can't do that
01:40:46.120 | You actually can; like, if you think of dy dx as being literally, you know, change in y over change in x, like
01:40:54.200 | the division, actually, like, the calculations still work, like, all the time, so
01:40:59.080 | Okay, so let's do the same thing now with changing
01:41:03.480 | my slope by a little bit
01:41:06.160 | And so here's the same thing right and so you can see both of these are negative
01:41:10.560 | Okay, so that's saying if I increase my intercept my loss goes down if I increase my slope my loss goes down
01:41:20.200 | Right and so my derivative of my error
01:41:24.180 | with respect to my slope is actually pretty high, and that's not surprising because
01:41:32.280 | It's actually
01:41:35.400 | You know the constant term is just being added whereas the slope is being multiplied by 40
01:41:40.000 | Okay now
01:41:46.800 | Finite differencing is all very well and good, but it's a big problem with finite differencing in
01:41:51.920 | High dimensional spaces and the problem is this right and this is like
01:41:56.700 | You don't need to learn
01:42:00.760 | How to calculate derivatives or integrals, but you need to learn how to think about them spatially right and so remember
01:42:07.580 | We have some
01:42:10.160 | Vector very high dimensional vector. It's got like a million items in it right
01:42:16.560 | And it's going through
01:42:18.560 | Some weight matrix right of size like 1 million by size a hundred thousand or whatever and it's spitting out something of size 100,000
01:42:30.560 | So you need to realize like there isn't like a gradient here, but it's like for every one of these things in this vector
01:42:38.820 | Right, there's a gradient in every direction
01:42:43.760 | You know in every part of the output
01:42:46.360 | All right, so it actually has
01:42:49.040 | Not a single gradient number not even a gradient
01:42:53.100 | Vector but a gradient matrix
01:42:56.860 | right and so this
01:43:00.220 | This is a lot to calculate right
01:43:03.880 | I would literally have to like add a little bit to this and see what happens to all of these
01:43:08.920 | Add a little bit to this see what happens to all of these right to fill in
01:43:13.800 | one column of this at a time, so that's going to be
01:43:17.660 | Horrendously slow like that that so that's why like if you're ever thinking like oh we can just do this with finite differencing
01:43:24.720 | Just remember like okay. We're dealing in the with these very high dimensional vectors where
01:43:30.560 | You know this this kind of
01:43:33.880 | Matrix calculus like all the concepts are identical
01:43:39.760 | But when you actually draw it out like this you suddenly realize like okay for each number I could change
01:43:45.760 | There's a whole bunch of numbers that impacts and I have this whole matrix of things to compute right and so
01:43:52.040 | Your gradient calculations can take up a lot of memory, and they can take up a lot of time
01:43:58.080 | So we want to find some way to do this
01:44:01.080 | more quickly
01:44:03.640 | And it's definitely well worth like spending time
01:44:08.480 | kind of studying these ideas of like
01:44:11.360 | you know the idea of like the gradients like look up things like Jacobian and
01:44:17.860 | Hessian
01:44:22.800 | They're the things that you want to search for to start
01:44:26.480 | unfortunately people normally write about them with you know lots of Greek letters and
01:44:34.000 | Blah blah blahs right, but there are some there are some nice
01:44:38.200 | You know intuitive explanations out there, and hopefully you can share them on the forum if you find them because this is stuff
01:44:44.760 | You really need to
01:44:46.760 | Really need to understand in here
01:44:49.120 | You know because
01:44:52.400 | You're trying to train something and it's not working properly and like later on we'll learn how to like look inside
01:44:58.160 | Pytorch to like actually get the values of the gradients, and you need to know like okay
01:45:02.560 | Well, how would I like plot the gradients you know?
01:45:05.640 | What would I consider unusual like you know these are the things that turn you into a really awesome?
01:45:11.040 | deep learning practitioner is when you can like debug your problems by like
01:45:15.760 | Grabbing the gradients and doing histograms of them and like knowing you know that you could like plot that all each layer my
01:45:22.340 | Average gradients getting worse or you know bigger or you know whatever
01:45:26.160 | Okay, so the trick to doing this more quickly is to do it
01:45:31.960 | analytically
01:45:33.200 | Rather than through finite differencing and so analytically is basically there is a list you probably all learned it at high school
01:45:41.440 | There is a literally a list of rules that for every
01:45:44.360 | Mathematical function there's a like this is the derivative of that function right so
01:45:50.400 | You probably remember a few of them
01:45:53.640 | for example
01:45:56.680 | X squared
01:45:59.480 | To X right and so we actually have here an X squared
01:46:03.840 | So here is our two times
01:46:06.400 | now the one that I actually want you
01:46:08.760 | to know is
01:46:11.920 | Not any of the individual rules, but I want you to know the chain rule
01:46:17.040 | right, which is
01:46:19.520 | You've got some function of some function of something
01:46:24.200 | Why is this important? I
01:46:26.920 | don't know... that's a linear layer, that's a ReLU, right, and
01:46:31.320 | Then we can kind of keep going backwards, right?
01:46:35.520 | Etc right a neural net is
01:46:40.080 | Just a function of a function of a function of a function where the innermost is you know it's basically linear
01:46:45.720 | ReLU
01:46:47.840 | linear
01:46:49.080 | ReLU
01:46:52.080 | dot dot dot, linear
01:46:54.540 | sigmoid or softmax
01:46:58.360 | All right, and so it's a function of a function of a function and so therefore to calculate the derivative of
01:47:05.680 | the weights in your model
01:47:08.520 | The loss of your model with respect to the weights of your model
01:47:12.440 | You're going to need to use the chain rule and
01:47:14.440 | Specifically whatever layer it is that you're up to like I want to calculate the derivative here
01:47:19.360 | I'm going to need to use all of these
01:47:21.760 | All of these ones because that's all that's that's the function that's being applied
01:47:25.680 | right and that's why they call this back propagation because the value of the derivative of
01:47:30.760 | that is
01:47:33.920 | equal to
01:47:35.800 | that derivative
01:47:37.800 | Now basically you can do it like this: you can say, let's call
01:47:41.520 | this, let's call that u, right, then it's simply equal to the
01:47:50.520 | derivative of that
01:47:52.320 | times the
01:47:54.320 | derivative of that, right: you just multiply them together
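Written out, the chain rule being used here is just:

$$\frac{d}{dx}\, f(g(x)) = f'(g(x)) \cdot g'(x)$$

so for a network that is a function of a function of a function, the derivative of the loss with respect to any layer's inputs is the product of the derivatives of all the layers that come after it.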
01:47:58.040 | So that's what back propagation is; like, it's not that back propagation is a new thing for you to learn
01:48:04.840 | It's not a new
01:48:07.120 | algorithm it is literally
01:48:09.240 | Take the derivative of every one of your layers and
01:48:13.480 | multiply them all together. So, like, it doesn't deserve a new name, right: "apply the chain rule to my layers"
01:48:21.920 | does not deserve a new name, but it gets one, because us neural networks folk really need to seem as clever as possible
01:48:29.520 | It's really important that everybody else thinks that we are way outside of their capabilities
01:48:34.920 | Right so the fact that you're here means that we failed because you guys somehow think that you're capable
01:48:40.920 | Right so remember. It's really important when you talk to other people that you say back propagation and
01:48:46.640 | Rectified linear unit rather than like multiply the layers
01:48:51.360 | gradients, or "replace negatives with zeros". Okay, so here we go: I've just gone ahead and
01:48:59.480 | grabbed the derivative; unfortunately there is no automatic differentiation in Excel yet
01:49:05.920 | So I did the alternative which is to paste the formula into Wolfram Alpha and got back the derivatives
01:49:12.120 | So there's the first derivative, and there's the second derivative
01:49:14.640 | Analytically, we only have one layer in this
01:49:18.240 | tiny little neural network, so we don't have to worry about the chain rule
01:49:22.880 | and we should see that this analytical derivative is pretty close to our estimated derivative from the finite differencing and
01:49:29.920 | Indeed it is right and we should see that these ones are pretty similar as well, and indeed they are right
01:49:36.440 | and if you're you know back when I
01:49:38.680 | implemented my own neural nets 20 years ago I
01:49:42.560 | You know had to actually calculate the derivatives
01:49:45.600 | And so I always would write like had something that would check the derivatives using finite differencing
01:49:50.800 | And so for those poor people that do have to write these things by hand
01:49:54.080 | You'll still see that they have like a finite differencing checker
01:49:58.280 | So if you ever do have to implement a derivative by hand, please make sure that you
01:50:03.800 | Have a finite differencing checker so that you can test it
01:50:07.480 | All right
01:50:09.480 | So there's our derivatives
01:50:11.480 | So we know that if we increase
01:50:14.080 | B, then we're going to get a slightly better loss, so let's increase B by a bit
01:50:20.340 | How much should we increase it by?
01:50:22.760 | Well we'll increase it by some multiple of this and the multiple
01:50:25.880 | we're going to choose is called a learning rate, and so here's our learning rate: so here's 1e neg 4
01:50:30.360 | Okay, so our new value
01:50:33.520 | Is equal to whatever it was before
01:50:37.960 | Minus our
01:50:42.040 | Derivative times our learning rate, okay, so we've gone from 1 to 1.01 and
01:50:48.400 | then a
01:50:51.280 | We've done the same thing so it's gone from 1 to
01:50:57.640 | So this is a special kind of mini batch: it's a mini batch of size 1, which we call online gradient descent, and online just means a mini batch of size 1
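One online-SGD step from the spreadsheet, written as a small Python sketch (variable names are mine; the derivatives are the analytical ones for the squared error of a + b*x against y):

```python
lr = 1e-4

def sgd_step(a, b, x, y):
    # One mini-batch-of-one update for the model y ≈ a + b*x with squared error.
    err = a + b * x - y
    da, db = 2 * err, 2 * err * x      # analytical derivatives of err**2
    return a - lr * da, b - lr * db

a, b = 1.0, 1.0                        # the "randomly" initialised intercept and slope
a, b = sgd_step(a, b, x=40.0, y=58.0)  # first row of data
print(a, b)                            # both nudged a little towards 30 and 2
```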
01:51:05.080 | So then we can go on to the next one: x is 86
01:51:11.440 | Y is 202 right. This is my intercept and slope copied across from the last row
01:51:19.880 | Okay, so here's my new y prediction. Here's my new error
01:51:24.480 | Here are my derivatives
01:51:27.400 | Here are my new a and B
01:51:29.480 | All right, so we keep doing that for every mini batch of one
01:51:33.120 | and two eventually
01:51:36.600 | We react run out the end of an epoch
01:51:39.280 | Okay, and so then at the end of an epoch we would grab
01:51:43.560 | our intercept and slope and
01:51:48.320 | Paste them back over here. That's our new values
01:51:52.220 | There we are and we can now continue again, right so we're now starting with
01:51:59.400 | Pops today see that in the wrong spot. It should be
01:52:03.640 | paste special transpose values
01:52:07.400 | All right
01:52:09.680 | Okay, so there's our new intercept. There's only slow possibly I've got those the wrong way around
01:52:13.920 | But anyway you get the idea and then we continue okay, so I recorded the world's tiniest macro
01:52:20.720 | which literally just
01:52:23.720 | Copies the final slope and puts it into the new slope copies the final intercept puts it into the new intercept
01:52:36.360 | does that
01:52:37.720 | five times, and after each time it grabs the root mean squared error and pastes it into the next
01:52:43.840 | spare area, and that is attached to this run button, and so that's going to go ahead and do that five times
01:52:50.280 | Okay, so that's stochastic gradient descent in Excel
01:52:55.240 | So it to turn this into a CNN right you would just replace
01:53:02.040 | This error function right and therefore this prediction with the output of that
01:53:08.120 | convolutional example spreadsheet
01:53:11.000 | Okay, and that then would be a CNN being trained with SGD, okay
01:53:18.320 | Now the problem is that you'll see when I run this
01:53:29.320 | It's kind of going very slowly right we know that we need to get to a slope of 2 and an intercept of 30
01:53:35.640 | And you can kind of see at this rate
01:53:37.800 | It's going to take a very long time
01:53:40.560 | Right and specifically
01:53:43.680 | It's like it keeps going the same direction, so it's like come on take a hint. That's a good direction
01:53:55.080 | So the "come on, take a hint, that's a good direction, please keep doing that but more" is called momentum
01:54:00.920 | Right so on our next spreadsheet
01:54:03.720 | We're going to implement momentum
01:54:06.960 | Okay, so
01:54:10.320 | What momentum does is?
01:54:12.800 | the same thing, and, to simplify this spreadsheet, I've removed the finite differencing columns; okay, other than that
01:54:20.520 | this is just the same, right, so we've still got our x's, our y's
01:54:24.720 | A's and B's our predictions
01:54:27.320 | Our error is now over here, okay?
01:54:30.600 | And here's our derivatives, okay?
01:54:34.440 | Our new calculation for this particular row
01:54:40.880 | Our new calculation here for our new a term, just like before, is: it's equal to whatever a was before
01:54:52.880 | minus
01:54:54.880 | okay, now this time I'm not taking the derivative, but I'm using some other number times the learning rate, so what's this other number?
01:55:04.240 | Okay, so this other number is equal to the derivative
01:55:11.520 | Times
01:55:15.160 | What's this K 1?
01:55:21.280 | 0.98 times
01:55:23.400 | the thing just above it
01:55:25.400 | Okay, so this is a linear
01:55:27.680 | interpolation
01:55:29.480 | between this row's derivative, or this mini batch's derivative, and
01:55:33.560 | Whatever direction we went last time
01:55:36.640 | Right so in other words keep going the same direction as you were before
01:55:42.120 | right, then update it a little bit, right, and so in our
01:55:48.000 | Python just before, we had a momentum of 0.9
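The momentum step being described, as a small Python sketch (the gradients below are made up; note that PyTorch's own SGD momentum uses a slightly different, non-interpolated formula by default):

```python
lr, beta = 1e-4, 0.9

def momentum_step(w, grad, prev_step):
    # Linearly interpolate between this mini-batch's gradient and whatever
    # direction we stepped last time: keep going the way we were going,
    # nudged by the new gradient.
    step = beta * prev_step + (1 - beta) * grad
    return w - lr * step, step

w, prev = 1.0, 0.0
for grad in [-1360.0, -1200.0, -1500.0]:    # made-up gradients, all the same sign
    w, prev = momentum_step(w, grad, prev)
print(w, prev)   # the step keeps growing while the gradients keep agreeing
```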
01:55:52.120 | Okay, so you can see what tends to happen is that our?
01:55:58.160 | negative kind of gets more and more negative right all the way up to like 2,000
01:56:04.480 | whereas with our standard SGD approach
01:56:11.320 | our derivatives are kind of all over the place, right? Sometimes there's 700, some negative, some positive hundreds
01:56:17.640 | You know so this is basically saying like yeah, if you've been going
01:56:21.360 | Down for quite a while keep doing that until finally here. It's like okay. That's that seems to be far enough
01:56:28.360 | So that's getting less and less and less negative
01:56:32.760 | right, until eventually we start going positive again
01:56:32.760 | So you can kind of see why it's called momentum
01:56:35.120 | It's like once you start traveling in a particular direction for a particular weight
01:56:39.680 | the wheels kind of start spinning, and then once the gradient turns around the other way
01:56:45.040 | It's like oh slow down. We've got this kind of momentum, and then finally turn back around
01:56:49.680 | All right, so when we do it this way
01:56:52.520 | All right, we can do exactly the same thing, right, and after five iterations we're at 89,
01:57:03.640 | whereas before, after five iterations, we were at 104, right, and after a few more, let's do maybe 15
01:57:12.560 | Okay, so it's 102 for us here
01:57:20.560 | It's going, right, so it's a bit better. It's not heaps better. You can still see like
01:57:33.000 | These numbers they're not
01:57:35.000 | Zipping along right, but it's definitely an improvement and it also gives us something else to tune
01:57:40.920 | Which is nice like so if this is kind of a well-behaved error surface right in other words like
01:57:46.480 | Although it might be bumpy along the way. There's kind of some overall
01:57:51.200 | Direction like imagine you're going down a hill right and there's like bumps on it right so the more momentum
01:57:57.980 | you've got, the more we're going to skip over the tops, right, so we could say like okay
01:58:01.220 | Let's increase our beta up to point nine eight
01:58:03.220 | right, and see if that like allows us to train a little faster, and whoa, look at that, suddenly we're at 22
01:58:09.640 | All right so one nice thing about things like momentum is it's like another parameter that you can tune to try and make your model
01:58:16.720 | train better in practice
01:58:18.720 | Basically everybody does this; you look at any ImageNet winner or whatever, they all use momentum
01:58:30.000 | okay, and
01:58:36.200 | Back over here, when we said use SGD, that basically means use the basic tab of our Excel spreadsheet
01:58:45.520 | But then momentum equals point nine
01:58:48.080 | means add in
01:58:50.800 | Put a point nine over here
01:58:53.720 | okay, and
01:58:55.760 | so that that's kind of your like
01:58:59.640 | default starting point
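In PyTorch terms, that default corresponds to something like the following sketch; the nn.Linear here is just a stand-in for whatever model you're actually training.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # stand-in model

# plain SGD is the "basic" tab; momentum=0.9 adds the momentum column
opt = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
```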
01:59:01.640 | So let's keep going and talk about
01:59:10.440 | So Adam is something which I
01:59:15.640 | was actually not right about earlier on in this course. I said we've been using Adam by default.
01:59:22.500 | We actually haven't; I've noticed that we've actually been using SGD with momentum by default, and the reason is
01:59:33.420 | that Adam, as you'll see, is much, much faster to learn with, but there have been some problems,
01:59:39.100 | which is that people haven't been getting quite as good final answers with Adam as they have with SGD with momentum.
01:59:45.660 | And that's why you'll see like all the, you know, ImageNet-winning
01:59:48.660 | solutions and so forth and all the academic papers always use SGD with momentum and not Adam. Adam
01:59:56.220 | seems to be a particular problem in NLP; people really haven't got Adam working well at all.
02:00:00.500 | The good news is it looks like this was solved two weeks ago.
02:00:08.760 | Basically, it turned out that the way people were dealing with the combination of weight decay and Adam
02:00:16.580 | had a nasty kind of bug in it, basically,
02:00:19.540 | and that's kind of carried through to every single library, and
02:00:24.060 | one of our students, Anand Saha, has actually just
02:00:27.580 | completed a prototype of adding this new version of Adam, which is called AdamW, into fastai
02:00:34.420 | And he's confirmed that he's getting both the faster
02:00:39.780 | performance and also the better accuracy. So hopefully we'll have this AdamW in fastai
02:00:47.780 | Ideally before next week. We'll see how we go very very soon
02:00:51.020 | So so it is worth telling you about about Adam
02:00:54.980 | So let's talk about it, it's actually incredibly simple
02:01:00.760 | But again, you know make sure you make it sound really complicated when you tell people so that you can look clever
02:01:07.180 | So here's the same spreadsheet again, right and here's our
02:01:12.620 | Randomly selected A and B again somehow it's still one. Here's our prediction. Here's our derivatives
02:01:20.020 | Okay, so now how we calculating our new A and our new B
02:01:23.800 | You can immediately see it's looking pretty hopeful because even by like row 10
02:01:30.460 | We're like we're seeing the numbers move a lot more. Alright, so this is looking pretty encouraging
02:01:36.300 | So how are we calculating this
02:01:40.300 | It's equal to our previous value of B
02:01:45.900 | Minus J8. Okay, so we're gonna have to find out what that is
02:01:50.140 | times
02:01:53.540 | our learning rate
02:01:55.580 | divided by the square root of L8. Okay, so we're gonna have to dig in and see what's going on
02:02:00.340 | One thing to notice here is that my learning rate is way higher than it used to be
02:02:05.860 | But then we're dividing it by this
02:02:08.540 | Big number. Okay, so let's start out by looking and seeing what this J8 thing is
02:02:16.940 | J8 is identical to what we had before J8 is equal to the linear interpolation of the derivative and
02:02:26.540 | the previous direction
02:02:29.740 | Okay, so that was easy
02:02:32.300 | So one part of atom is to use momentum in the way we just defined
02:02:37.300 | Okay, the second piece was to divide by square root L8, what is that?
02:02:43.620 | square root L8, okay is another linear interpolation of something and
02:02:50.260 | Something else and specifically it's a linear interpolation of
02:02:55.340 | F8 squared, okay. It's a linear interpolation of the derivative squared
02:03:02.660 | Along with the derivative squared last time. Okay, so in other words, we've got two pieces of
02:03:11.860 | momentum going on here. One is
02:03:14.180 | calculating the
02:03:17.140 | momentum
02:03:18.740 | version of the gradient the other is calculating the momentum version of the gradient squared and
02:03:25.140 | We often refer to this idea as a
02:03:29.420 | Exponentially weighted moving average in other words
02:03:33.540 | It's basically equal to the average of this one and the last one and the last one and the last one, but we're like multiplicatively
02:03:40.380 | decreasing the previous ones because we're multiplying it by
02:03:43.220 | 0.9 times 0.9 times 0.9 times 0.9. And so you actually see that for instance in the fast AI code
02:03:51.340 | If you look at fit
02:04:02.740 | We don't just calculate the average loss, right?
02:04:09.660 | because
02:04:10.860 | What I actually want, well, we certainly don't just report the loss for every mini-batch, because that just bounces around so much
02:04:16.700 | So instead I say average loss is equal to whatever the average loss was last time
02:04:26.500 | times 0.98
02:04:26.500 | Plus the loss this time times 0.02
02:04:29.900 | Right. So in other words, the fastai library,
02:04:33.700 | the thing that it's actually doing when you do like the learning rate finder or plot loss,
02:04:38.340 | It's actually showing you the exponentially weighted moving average of the loss
02:04:43.260 | Okay, so it's like a really handy concept. It appears quite a lot
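The smoothing itself is tiny. Here's a sketch of the idea (not the actual fastai source, and the batch_losses list is made up just to have something to smooth):

```python
batch_losses = [2.0, 1.4, 3.1, 0.9, 1.2, 2.5]   # pretend per-mini-batch losses
avg_loss, beta = None, 0.98

for loss in batch_losses:
    # exponentially weighted moving average: old average * 0.98 + new loss * 0.02
    avg_loss = loss if avg_loss is None else avg_loss * beta + loss * (1 - beta)
    print(f"raw {loss:.2f}  smoothed {avg_loss:.3f}")
```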
02:04:48.180 | Right the other handy concept to know about it's this idea of like you've got two numbers
02:04:54.500 | One of them is multiplied by some value. The other is multiplied by 1 minus that value
02:04:59.980 | So this is a linear interpolation with two values. You'll see it all the time and
02:05:05.300 | for some reason
02:05:08.100 | Deep learning people nearly always use the value alpha when they do this
02:05:12.380 | So like keep an eye out if you're reading a paper or something and you see like alpha times blah blah blah blah plus
02:05:18.860 | 1 minus alpha
02:05:21.380 | Times some other blah blah blah blah right immediately like when people read papers
02:05:27.100 | None of us like read everything in the equation. We look at it. We go Oh linear interpolation
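Written out in symbols (my notation), the pattern to spot is just

$$ y_t = \alpha\, x_t + (1 - \alpha)\, y_{t-1} $$

i.e. a linear interpolation between the new value $x_t$ and the running value $y_{t-1}$, weighted by $\alpha$.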
02:05:33.900 | Right and I say something I was just talking to Rachel about yesterday is like
02:05:37.940 | Whether we could start trying to find like a new way of writing papers where we literally refactor them
02:05:43.540 | Right like it'd be so much better to have written like
02:05:46.340 | linear interpolate
02:05:49.060 | Blah blah blah comma blah blah blah right because then you don't have to have that pattern recognition right but until we
02:05:55.940 | Convinced the world to change how they write papers
02:05:58.540 | This is what you have to do is you have to look you know
02:06:01.340 | Know what to look for right and once you do suddenly the huge page with formulas
02:06:07.420 | aren't bad at all; like, you often notice, for example, the two things in here, like they might be totally identical
02:06:14.900 | But this might be at time t and this might be at like time t minus 1 or something
02:06:19.060 | Right like it's very often these big ugly formulas turn out to be
02:06:23.100 | really, really simple, if only they had refactored them
02:06:28.580 | So what are we doing with this gradient squared?
02:06:31.100 | So what we were doing with the gradient squared is
02:06:35.860 | We were taking the square root, and then we were adjusting the learning rate by dividing the learning rate by that
02:06:43.740 | Okay, so gradient squared is always positive
02:06:49.020 | okay, and
02:06:51.740 | We're taking the exponentially weighted moving average of a bunch of things that are always positive
02:06:57.160 | And then we're taking the square root of that
02:06:59.160 | All right, so when is this number going to be high?
02:07:01.780 | It's going to be particularly high if there's like one big change,
02:07:05.840 | you know, if the gradient's got a lot of variation, right, so there's a high variance of gradient,
02:07:11.760 | then this G squared thing is going to be a really high number, whereas if it's like a constant
02:07:18.060 | amount, right, it's going to be smaller, because when you add things that are squared, the big squares
02:07:25.020 | kind of jump out much bigger, whereas if there wasn't much change, it's not going to be as big. So basically
02:07:32.040 | This number at the bottom here
02:07:34.860 | It's going to be high
02:07:37.700 | If our gradient is changing a lot now, what do you want to do if?
02:07:42.060 | You've got something which is like first negative and then positive and then small and then high
02:07:47.300 | right
02:07:49.660 | Well you probably want to be more careful right you probably don't want to take a big step
02:07:55.100 | Because you can't really trust it right so when the when the variance of the gradient is high
02:08:00.340 | We're going to divide our learning rate by a big number
02:08:03.220 | Whereas if our gradient is
02:08:06.060 | Very similar kind of size all the time then we probably feel pretty good about this step
02:08:11.340 | So we're dividing it by a small amount
02:08:13.460 | And so this is called an adaptive learning rate and like a lot of people will have this confusion about Adam
02:08:20.300 | I've seen it on the forum actually where people are like, isn't there some kind of adaptive learning rate where somehow you're like setting different
02:08:27.220 | Learning rates for different layers or something. It's like no not really
02:08:32.500 | Right all we're doing is we're just saying like just keep track of the average of the squares of the gradients and use that
02:08:40.620 | To adjust the learning rate, so there's still one learning rate
02:08:44.140 | Okay, in this case. It's one
02:08:47.300 | right, but effectively every parameter at every epoch is
02:08:52.340 | kind of getting a bigger jump if the gradient's been pretty constant for that weight, and a smaller jump
02:09:00.760 | Otherwise okay, and that's Adam. That's the entirety of Adam in
02:09:05.260 | Excel right so there's now no reason at all why you can't
02:09:09.100 | train ImageNet in Excel, because you've got access to all of the pieces you need
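To back up the claim that it really is that simple, here's the whole thing for our two-parameter line fit as a Python sketch. The names and constants are mine (standard defaults), and I've left out the bias-correction terms from the Adam paper to keep it short.

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(100); y = 2 * x + 30   # same toy data: slope 2, intercept 30
a, b = 1.0, 1.0
lr = 1.0                                  # note: much higher than before, as discussed
beta1, beta2, eps = 0.9, 0.999, 1e-8
m_a = m_b = v_a = v_b = 0.0               # EWMAs of the gradient and the gradient squared

for xi, yi in zip(x, y):
    err = (a * xi + b) - yi
    da, db = 2 * err * xi, 2 * err
    # momentum of the gradient
    m_a = beta1 * m_a + (1 - beta1) * da
    m_b = beta1 * m_b + (1 - beta1) * db
    # momentum of the gradient squared
    v_a = beta2 * v_a + (1 - beta2) * da ** 2
    v_b = beta2 * v_b + (1 - beta2) * db ** 2
    # adaptive step: divide the learning rate by the sqrt of the squared-gradient average
    a -= lr * m_a / (np.sqrt(v_a) + eps)
    b -= lr * m_b / (np.sqrt(v_b) + eps)
```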
02:09:13.860 | And so let's try this out run
02:09:16.660 | Okay, that's not bad, right: five iterations and we're straight up to 29 and 2, right, so the difference between, like,
02:09:26.780 | you know, standard SGD and this is huge, and basically the key difference was that it figured out that we need to be
02:09:35.820 | you know
02:09:37.060 | moving this number
02:09:38.860 | much faster, okay, and so and so it did and
02:09:42.620 | So you can see we've now got like
02:09:46.020 | two different parameters one is kind of the momentum for the gradient piece the other is the momentum for the gradient squared piece and
02:09:53.540 | I think they're called the betas;
02:09:59.300 | I think when you want to change them in PyTorch there's a thing called betas,
02:10:02.840 | which is just a tuple of two numbers you can change
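So in PyTorch, changing them looks something like this (the nn.Linear is just a stand-in model):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # stand-in model

# betas = (momentum for the gradient, momentum for the gradient squared)
opt = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```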
02:10:05.420 | Jeremy so
02:10:10.780 | So you said the
02:10:13.700 | yeah, I think I understand this concept of you know when they
02:10:17.700 | When a gradient is it goes up and down then you're not really sure
02:10:23.020 | Which direction should should go so you should kind of slow things down therefore you subtract that gradient from the learning rate
02:10:30.300 | So but how do you implement that how far do you go?
02:10:34.620 | I guess maybe I missed something early on you do you set a number somewhere we divide
02:10:41.460 | Yeah, we divide the learning rate
02:10:43.460 | Divided by the square root of the moving average gradient squared, so that's where we use it. Oh
02:10:50.940 | I'm sorry, can you be a little more specific? Sure, so D2 is the learning rate, which is 1, yeah; M27 is
02:11:00.620 | our moving average of the squared gradients
02:11:03.260 | So we just go d2 divided by square root m27
02:11:08.060 | That's it
02:11:12.220 | Okay, thanks I
02:11:14.220 | Have one question yeah, so
02:11:17.340 | The new method that you just mentioned, which is in the process of getting implemented in, yes, AdamW? Yeah, AdamW.
02:11:25.060 | How different is it from here? Okay, let's do that, so
02:11:31.100 | To understand Adam W. We have to understand weight decay
02:11:35.260 | And maybe we'll learn more about that later. Let's see how we go now with weight decay
02:11:40.340 | so the idea is that
02:11:42.340 | when you have
02:11:45.060 | Lots and lots of parameters like we do with you know most of the neural nets we train
02:11:50.540 | You very often have like more parameters than data points, so, you know, regularization becomes important
02:11:57.940 | And we've learned how to avoid overfitting by using dropout right which randomly deletes some activations
02:12:06.460 | In the hope that it's going to learn some kind of more resilient set of weights
02:12:11.020 | There's another kind of regularization
02:12:13.740 | We can use called weight decay or L2 regularization
02:12:17.740 | And it's actually a kind of classic statistical technique, and the idea is that we take our loss function
02:12:24.300 | Right so we take our like
02:12:26.660 | Error squared loss function and we add an additional piece to it
02:12:31.200 | Let's add weight decay right now
02:12:34.740 | The additional piece we add is
02:12:36.740 | To basically add the square of the weights, so we'd say plus
02:12:42.620 | B squared
02:12:45.540 | Plus a squared
02:12:48.540 | Okay, that is now
02:12:52.680 | Weight decay or L2 regularization and so the idea is that now
02:13:02.100 | the loss function wants to keep the weights small, right, because increasing the weights makes the loss worse, and
02:13:09.380 | So it's only going to increase the weights if the loss improves by more
02:13:15.180 | than the amount of that penalty. And in fact, to make this into proper weight decay, we then need some
02:13:21.140 | multiplier here, right, so if you remember back in our code here, we said weight decay wd=5e-4
02:13:31.780 | Okay, so to actually use the same weight decay. I would have to multiply by
02:13:34.780 | 0.0005
02:13:37.380 | So that's actually now the same weight decay, so
02:13:46.500 | if you have a really high weight decay, then it's going to set all the parameters to zero
02:13:50.500 | So it'll never over fit right because it can't set any parameter to anything
02:13:55.700 | And so as you gradually decrease the weight decay a few more weights
02:14:01.500 | Can actually be used right, but the ones that don't help much. It's still going to leave at zero or close to zero, right?
02:14:09.340 | So that's what that's what weight decay is is literally to change the loss function to add in this
02:14:17.420 | Sum of squares of weights
02:14:21.460 | times
02:14:23.420 | some parameter, some hyperparameter, you see.
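For our two-parameter example, that L2-penalised loss is just the following sketch (names are mine; wd plays the role of the 5e-4 above):

```python
import numpy as np

def l2_penalised_loss(a, b, x, y, wd=5e-4):
    mse = np.mean(((a * x + b) - y) ** 2)   # the original error-squared loss
    penalty = wd * (a ** 2 + b ** 2)        # weight decay: wd times the sum of squared weights
    return mse + penalty
```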
02:14:26.360 | the problem is that
02:14:30.060 | If you put that into the loss function as I have here
02:14:33.700 | Then it ends up in the moving average of gradients and the moving average of squares of gradients
02:14:40.100 | For Adam right and so basically we end up
02:14:44.780 | When there's a lot of variation
02:14:48.020 | we end up
02:14:50.780 | Decreasing the amount of weight decay, and if there's very little variation we end up increasing the amount of weight decay, so we end up
02:14:57.460 | basically saying
02:15:00.140 | penalizing parameters, you know, weights that are really high,
02:15:03.200 | unless their gradient varies a lot, which is never what we intended, right? That's just not the plan at all
02:15:12.060 | So the trick with Adam W is we basically remove
02:15:16.220 | Weight decay from here
02:15:19.020 | So it's not in the loss function. It's not in the G not in the G squared
02:15:23.100 | And we move it so that instead it's added directly to the update step:
02:15:30.260 | when we update with the learning rate, it's added there instead. So in other words,
02:15:34.800 | we would put the weight decay, or actually the gradient with the weight decay, in here when we calculate the new a and new b
02:15:41.740 | So it never ends up in our G and G squared
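Here's a sketch of a single-parameter AdamW-style step to make that concrete; the names are mine and it follows the idea as described here rather than any particular library's implementation.

```python
def adamw_step(a, grad, m, v, lr=1e-3, wd=5e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    # the moving averages only ever see the plain gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # weight decay is applied directly in the update step, outside m and v
    a = a - lr * (m / (v ** 0.5 + eps) + wd * a)
    return a, m, v
```

With the Adam-plus-L2 approach you would instead add wd * a to grad before the two moving-average lines, which is exactly how the penalty ends up leaking into m and v.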
02:15:45.940 | So that was like a super fast
02:15:48.620 | description which will probably only make sense if you listen to it three or four times on the video and then talk about it
02:15:54.820 | on the forum
02:16:03.620 | Yeah, but if you're interested, let me know, and we can also look at Anand's code that's implemented this
02:16:03.620 | And you know the the idea of using weight decay is it's a really helpful
02:16:11.980 | regularizer
02:16:15.020 | Because it's basically this way that we can kind of say like
02:16:17.940 | You know, please don't increase any of the weight values unless the
02:16:26.700 | you know improvement in the loss
02:16:28.700 | Is worth it?
02:16:31.540 | And so generally speaking pretty much all state-of-the-art models have both dropout and weight decay
02:16:38.220 | And I don't claim to know like how to set each one and how much of each to use,
02:16:44.320 | other than to say it's worth trying both
02:16:47.920 | To go back to the idea of embeddings
02:16:52.220 | Is there any way to interpret the final embeddings? Absolutely, we're gonna look at that next week. It's super fun
02:16:59.340 | It turns out that, you know, we'll learn what some of the worst movies of all time are.
02:17:03.540 | It's like, um, one of the John Travolta Scientology ones, like Battlefield Earth or something
02:17:11.820 | I think that was like the worst movie of all time according to our embeddings
02:17:19.420 | Do you have any recommendations for scaling the L2 penalty, or is that kind of based on how wide the nodes are or how many
02:17:27.100 | No, I have no
02:17:29.100 | suggestion at all; I kind of look for papers or Kaggle competitions or whatever's similar and frankly try to set it up
02:17:37.780 | the same. It seems like in a particular area, like computer vision object recognition,
02:17:44.620 | somewhere between 1e-4 and 1e-5 seems to work, you know
02:17:49.600 | actually in the Adam W paper
02:17:54.260 | The authors point out that with this new approach it actually becomes like it seems to be much more stable
02:17:59.220 | As to what the right weight decay amounts are so hopefully now when we start playing with it, we'll be able to have some
02:18:05.040 | definitive recommendations by the time we get to part 2
02:18:08.120 | All right. Well, that's 9 o'clock. So
02:18:11.340 | this week
02:18:14.420 | You know practice the thing that you're least familiar with so if it's like Jacobians and Hessians read about those if it's broadcasting
02:18:21.300 | Read about those; if it's understanding Python OO, read about that, you know, try and implement your own custom layers
02:18:27.560 | Read the fastai layers, you know, and talk on the forum about anything that you find
02:18:33.860 | Weird or confusing? All right. See you next week