
Lesson 5: Deep Learning 2018


Chapters

0:00 Intro
6:15 Kaggle Seedlings Competition
7:48 Learning Objectives
9:43 MovieLens Data
11:22 XLS
12:24 Matrix Factorization
37:03 PyTorch
40:13 Forward
41:31 Embedding matrices
43:48 Creating a Python module
47:47 Basic initialization
50:13 Minibatching
52:13 Creating an optimizer
54:40 Writing a training loop
55:56 Fast AI

Whisper Transcript

00:00:00.000 | Welcome back
00:00:02.000 | So we had a busy lesson last week and
00:00:06.480 | I was really thrilled to see that one of our master's students here at USF
00:00:15.400 | actually took what we learned
00:00:22.720 | with structured deep learning and turned it into a blog post which, as I suspected, has been
00:00:29.200 | Incredibly popular because it's just something
00:00:31.200 | people didn't know about, and so it actually ended up getting picked up by the Towards Data Science publication
00:00:38.960 | which I quite like actually; if you're interested in keeping up with what's going on in data science, it's quite a good Medium publication
00:00:45.400 | and so Karen
00:00:47.680 | talked about
00:00:49.440 | structured deep learning and basically introduced
00:00:51.800 | you know that the basic ideas that we learned about last week and
00:00:57.920 | It got picked up quite quite widely one of the one of the things I was pleased to see actually
00:01:02.920 | Sebastian Ruder, who I actually mentioned in last week's class as being one of my favorite researchers,
00:01:07.240 | Tweeted it and then somebody from Stitch Fix said oh, yeah, we've actually been doing that for ages, which is kind of cute
00:01:16.640 | I kind of knew that this is happening in industry a lot, and I've been telling people this is happening in industry a lot
00:01:21.600 | But nobody's been talking about it
00:01:23.040 | And now Karen's kind of published a blog saying hey, check out this cool thing, and now Stitch Fix is like, yeah
00:01:28.640 | We're doing that already so
00:01:30.640 | So that's been great
00:01:33.240 | Great to see and I think there's still a lot more that can be dug into with this structured deep learning stuff
00:01:40.720 | You know, to build on top of Karen's post, you could maybe experiment with some different data sets
00:01:46.320 | Maybe find some old Kaggle competitions and see if there are some competitions that you could now win with this, or some which it doesn't work
00:01:54.240 | for, which would be equally interesting
00:01:56.240 | and also like just
00:01:58.400 | Experimenting a bit with different amounts of dropout different layer sizes, you know
00:02:02.480 | Because nobody much has written about this. I don't think there's been any blog posts about this before that. I've seen anywhere
00:02:11.160 | There's a lot of unexplored territory. So I think there's a lot we could we could build on top of here
00:02:17.360 | And there's definitely a lot of interest. I saw one person on Twitter saying this is what I've been looking for ages
00:02:23.380 | another thing which I was pleased to see is
00:02:27.200 | Nikhil, who we saw with his
00:02:30.280 | cricket versus baseball
00:02:32.280 | predictor as well as his currency predictor after lesson one,
00:02:35.880 | Went on to
00:02:39.680 | download something a bit bigger, which was to download a couple of hundred images of actors, and he manually
00:02:46.840 | went through and checked them. Well,
00:02:49.080 | I think first of all he used Google to try and find ones with glasses and ones without, then he manually went through and
00:02:54.040 | checked that they had been put in the right spot
00:02:56.040 | and this is a good example of one where
00:02:58.240 | vanilla ResNet didn't do so well with just the last layer
00:03:03.280 | And so what Nikhil did was he went through and tried unfreezing the layers and using differential learning rates and got up to
00:03:10.240 | 100% accuracy, and the thing I like about these things that Nikhil is doing is the way he's
00:03:15.840 | He's not downloading a Kaggle data set. He's like deciding on a problem that he's going to try and solve
00:03:21.560 | He's going from scratch from Google
00:03:23.560 | And he's actually got a link here even to a suggested way to help you download images from Google
00:03:29.100 | So I think this is great and I actually gave a talk
00:03:33.000 | just this afternoon at Singularity University to a
00:03:36.260 | Executive team of one of the world's largest telecommunications companies and actually showed them this post
00:03:42.040 | because the
00:03:44.480 | Folks there were telling me that that all the vendors that come to them and tell them they need like
00:03:48.860 | Millions of images and huge data centers full of hardware, and you know they have to buy special
00:03:54.140 | Software that only these vendors can provide and I said like actually this person's been doing a course for three weeks now
00:04:01.040 | And look at what he's just done with a computer that cost him 60 cents an hour
00:04:04.980 | And they were so happy to hear that, like, okay, you know, this actually is within the reach of normal people
00:04:12.200 | I'm assuming Nikhil's a normal person. I haven't actually checked
00:04:15.680 | If you're proudly abnormal, Nikhil, I apologize
00:04:20.500 | I actually went and actually had a look at his cricket
00:04:24.760 | Classifier and I was really pleased to see that his code actually is the exact same code
00:04:30.280 | that we used in lesson one. I was hoping that would be the case. You know, the only thing he changed was
00:04:34.420 | the number of epochs I guess
00:04:37.120 | So this idea that we can take those four lines of code and reuse it to do other things
00:04:41.400 | It's definitely turned out to be true, and so these are good things to show like a your organization
00:04:47.880 | If you're anything like the executives at this big company. I spoke to today. There'll be a certain amount of like
00:04:54.120 | not so much surprise but almost like pushback, of like, if this was true... you know, they basically said if this was true
00:05:00.920 | somebody would have told us, so why isn't everybody doing this already? So I think you might have to actually show them
00:05:07.080 | You know maybe you can build your own with some internal data
00:05:10.240 | You've got at work or something like here. It is you know didn't cost me anything. It's all finished
00:05:19.920 | Vitaly, or Vitali, I don't know how to pronounce his name correctly, has done another very nice post,
00:05:23.840 | just an introductory post on how we train neural networks, and I wanted to point this one out as being like I think
00:05:31.120 | This is one of the participants in this course
00:05:34.000 | Who's just got a particular knack for technical communication, and I think we can all learn from you know from his posts about about good technical writing
00:05:41.640 | What I particularly like is that
00:05:45.800 | he assumes almost nothing; he has a kind of very chatty tone and describes everything
00:05:50.760 | But he also assumes that the reader is intelligent
00:05:53.140 | But, you know, he's not afraid to kind of say here's a paper or here's an equation or whatever
00:05:58.320 | But then he's going to go through and tell you exactly what that equation means
00:06:01.600 | So it's kind of like this nice mix of like writing for
00:06:05.920 | Respectfully for an intelligent audience, but also not assuming any particular background knowledge
00:06:15.640 | Then I made the mistake earlier this week of posting a picture of my first-place ranking on the Kaggle seedlings competition
00:06:23.400 | At which point five other fast AI students posted their pictures of them passing me over the next few days
00:06:29.820 | So this is the current leaderboard for the Kaggle plant seedlings competition
00:06:34.500 | I believe the top six are all fast.ai students or, at worst, one of its teachers
00:06:40.840 | And so I think this is like a really
00:06:45.000 | Look, James has just passed me; he's first
00:06:47.080 | This is a really good example of like
00:06:50.040 | what you can do this is
00:06:53.480 | I'm trying to think it was like a
00:06:56.160 | small number of thousands of images
00:06:58.920 | And most of the images were less than a hundred pixels by a hundred pixels
00:07:07.960 | And yet, you know, my approach was basically to say let's just run through the notebook
00:07:12.880 | we have with pretty much the defaults; it took me, I don't know, an hour
00:07:15.720 | And I think the other students did a little bit more than that
00:07:21.600 | But not a lot more and basically what this is saying is yeah, these these techniques
00:07:26.800 | work pretty reliably, to the point where people that aren't using the fast.ai library
00:07:32.400 | are, you know, really struggling
00:07:37.560 | I suspect all these are fast AI students. You might have to go down quite a way
00:07:41.720 | So I thought that was very interesting and really really cool
00:07:45.720 | So today we're going to
00:07:49.960 | Start what I would kind of call like the second half of this course
00:07:56.320 | so the first half of this course has been like
00:07:59.480 | getting through
00:08:01.800 | Like these are the applications that we can use this for
00:08:06.200 | The here's kind of the code you have to write
00:08:08.440 | Here's a fairly high level ish description of what it's doing
00:08:15.320 | We're kind of done with that bit, and what we're now going to do is go in reverse
00:08:20.440 | We're going to go back over all of those exact same things again
00:08:23.900 | But this time we're going to dig into the detail of every one of them, and we're going to look inside the source code of the fast.ai
00:08:29.480 | library to see what it's doing and try to replicate
00:08:32.920 | that so in a sense like there's not going to be a lot more
00:08:38.680 | best practices to show you; I've kind of shown you the best practices
00:08:45.880 | I know
00:08:46.560 | But I feel like for us to now build on top of those to debug those models to come back to part 2
00:08:52.520 | Where we're going to kind of try out some new things, you know, it really helps to understand what's going on
00:08:58.240 | Behind the scenes. Okay, so the goal here today is we're going to try and create a
00:09:04.680 | pretty effective collaborative filtering model
00:09:07.920 | Almost entirely from scratch so we'll use the kind of we'll use pytorch as a
00:09:15.120 | Automatic differentiation tool and as a GPU programming tool and not very much else. We'll try not to use its neural net features
00:09:22.200 | We'll try not to use
00:09:23.720 | Fast AI library any more than necessary. So that's the goal
00:09:29.320 | Let's go back and you know, we only very quickly look at collaborative filtering last time
00:09:33.400 | So let's let's go back and have a look at collaborative filtering. And so we're going to look at this
00:09:38.100 | movie lens data set
00:09:43.520 | the movie lens data set
00:09:45.520 | Basically is a list of ratings
00:09:48.560 | it's got a bunch of different users that are represented by some ID and a bunch of movies that are represented by some ID and
00:09:55.840 | a rating. It also has a timestamp; I haven't actually ever tried to use it
00:10:00.520 | I guess this is just like what what time did that person rate that movie?
00:10:04.640 | So that's all we're going to use for modeling is
00:10:09.560 | three columns user ID movie ID and rating and so thinking of that in kind of
00:10:17.000 | Structured data terms user ID and movie ID would be categorical variables
00:10:20.760 | we have two of them and rating would be a
00:10:24.160 | Would be a dependent variable
00:10:27.040 | We're not going to use this for modeling but we can use it for looking at stuff later
00:10:32.960 | We can grab a list of the names of the movies as well
00:10:35.920 | And you could use this genre information. I haven't tried it; I'd be interested if during the week anybody tries it and finds it helpful
00:10:43.200 | I guess you might not find it helpful. We'll see
00:10:49.040 | In order to kind of look at this better. I just grabbed
00:10:56.600 | Users that have watched the most movies and the movies that have been the most watched
00:11:00.320 | And made a cross tab of it right so this is exactly the same data
00:11:05.560 | But it's a subset, and now rather than being rows of user, movie, rating, we've got users down the side and
00:11:10.880 | movies across the top, with the ratings in the cells
00:11:14.480 | And so some users haven't watched some of these movies. That's why some of these are not a number, okay?
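A rough sketch (not the exact lesson code) of building that cross-tab with pandas, assuming the ratings CSV has been loaded into a DataFrame called `ratings` with userId, movieId and rating columns:

```python
import numpy as np
import pandas as pd

# Keep only the most active users and most-rated movies, then pivot into a cross-tab.
top_users  = ratings.groupby('userId')['rating'].count().nlargest(15).index
top_movies = ratings.groupby('movieId')['rating'].count().nlargest(15).index
subset = ratings[ratings.userId.isin(top_users) & ratings.movieId.isin(top_movies)]

# Users down the rows, movies across the columns, ratings in the cells;
# NaN appears wherever a user hasn't rated a movie.
crosstab = pd.crosstab(subset.userId, subset.movieId, subset.rating, aggfunc=np.sum)
print(crosstab)
```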
00:11:21.380 | Then I copied that into Excel
00:11:24.400 | And you'll see there's a thing called collab filter dot XLS if you don't see it there now
00:11:32.200 | I'll make sure I've got it there by tomorrow
00:11:37.160 | Here is where I've copied
00:11:40.360 | That table okay, so as I go through this like
00:11:44.280 | Set up of the problem and kind of how it's described and stuff if you're ever feeling
00:11:49.600 | lost feel free to
00:11:52.240 | Ask either directly or through the forum if you ask through the forum and somebody answers there
00:11:58.720 | I want you to answer it here
00:12:00.720 | but if somebody else asks a question you would like answered of course just like it and
00:12:06.880 | Yannet will keep an eye out for that, because since we're digging in
00:12:10.960 | To the details of what's going on behind the scenes. It's kind of important that at each stage you feel like okay. I can see
00:12:16.560 | What's going on?
00:12:19.040 | Okay, so we're actually not going to build a neural net to start with
00:12:30.000 | Instead we're going to do something called a matrix factorization
00:12:36.040 | The reason we're not going to build a neural net to start with is that it so happens. There's a really really simple
00:12:41.120 | Kind of way of solving these kinds of problems, which I'm going to show you and so if I scroll down
00:12:47.440 | I've basically what I've got here is the same the same thing, but this time these are my predictions
00:12:54.640 | Rather than my actuals, and I'm going to show you how I created these predictions, okay, so here are my actuals
00:13:00.240 | Right here are my predictions and
00:13:03.920 | then down here
00:13:06.000 | we have our
00:13:08.000 | score which is the
00:13:10.960 | square root of the mean of the squared differences,
00:13:13.200 | okay, so this is RMSE down here. Okay, so on average our
00:13:19.880 | randomly initialized model is out by 2.8
00:13:24.280 | So let me show you what this model is and I'm going to show you by saying how do we guess?
00:13:29.960 | how much user ID number 14
00:13:33.320 | likes movie ID number 27. And
00:13:36.520 | the prediction here, which at this stage is still just random, is 0.91
00:13:43.760 | So how are we calculating 0.91? And the answer is we're taking
00:13:48.680 | this vector here
00:13:52.000 | Dot product with this vector here so dot product means point seven one times point one nine
00:13:59.860 | plus point eight one times point six three, plus point seven four times point three one, and so forth. And in,
00:14:05.540 | You know linear algebra speak because one of them is a column and one of them is a row
00:14:09.900 | this is the same as a matrix product, so you can see here I've used the Excel function MMULT (matrix multiply)
00:14:15.260 | And that's my prediction
00:14:18.500 | Having said that if the original
00:14:23.340 | Rating doesn't exist at all
00:14:28.540 | Then I'm just going to set this to zero right because like there's no error in predicting something that hasn't happened
00:14:34.620 | Okay, so what I'm going to do is I'm basically going to say all right, every one of my
00:14:39.460 | predictions is not going to be a neural net. It's going to be a single
00:14:44.340 | matrix multiplication
00:14:46.980 | Now the matrix multiplication that it's doing in practice is between this matrix and this
00:14:59.460 | matrix, right, so each one of these is a single part of that
00:15:04.560 | So I randomly initialize these these are just random numbers
00:15:12.340 | That I've just pasted in here
00:15:15.380 | So I've basically started off with two
00:15:17.900 | Random matrices, and I've said let's assume for the time being that every rating can be represented as
00:15:26.780 | the matrix product of those two
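For anyone following along in Python rather than Excel, here is a rough sketch of the same setup: two randomly initialised factor matrices, a prediction for every cell via a dot product, and the RMSE computed only over the ratings that actually exist. The sizes and example ratings are made up for illustration.

```python
import numpy as np

np.random.seed(42)
n_users, n_movies, n_factors = 15, 15, 5               # sizes assumed from the cross-tab

user_factors  = np.random.rand(n_users, n_factors)     # randomly initialised user factors
movie_factors = np.random.rand(n_factors, n_movies)    # randomly initialised movie factors

preds = user_factors @ movie_factors                   # every prediction is one dot product

actuals = np.full((n_users, n_movies), np.nan)         # NaN where a user hasn't rated a movie
actuals[0, 0], actuals[0, 1], actuals[1, 0] = 4.0, 5.0, 3.0   # made-up example ratings

mask = ~np.isnan(actuals)                              # score only the ratings that exist
rmse = np.sqrt(np.mean((preds[mask] - actuals[mask]) ** 2))
print(preds[0, 0], rmse)
```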
00:15:29.380 | So then in excel you can actually do gradient descent
00:15:34.180 | You have to go to your options to the add-in section and check the box to say turn it on and once you do you'll
00:15:43.140 | See there's something there called solver
00:15:45.140 | And if I go solver it says okay, what's your?
00:15:49.180 | objective function, and you just choose the cell; so in this case we chose the cell that contains our root mean squared error, and
00:15:56.900 | Then it says okay. What do you want to?
00:16:00.580 | Change and you can see here
00:16:03.140 | We've selected this matrix and this matrix and so it's going to do a gradient descent
00:16:07.940 | for us by changing these matrices to try and, in this case, minimize this cell
00:16:15.860 | So GRG Nonlinear is a gradient descent method, so say solve, and you'll see it starts at 2.8 and
00:16:24.420 | Then down here you'll see that number is going down. It's not actually showing us what it's doing
00:16:30.740 | but we can see that the numbers going down, so
00:16:32.980 | this kind of has a
00:16:35.980 | neural-netty feel to it, in that we're doing a matrix product and we're doing gradient descent, but we don't have a
00:16:43.820 | Non-linear layer, and we don't have a second
00:16:47.300 | Linear layer on top of that so we don't get to call this deep learning
00:16:51.220 | So things where people do deep-learning-ish things, where they have kind of
00:16:55.500 | matrix products and gradient descents, but it's not deep, people tend to just call that shallow learning. Okay, so we're doing shallow learning here
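And here is a minimal sketch of what Solver's gradient descent is doing, written directly in PyTorch: repeatedly nudging the two factor matrices to reduce the RMSE on the observed cells. All the numbers below are illustrative.

```python
import torch

# Two randomly initialised factor matrices, optimised by plain gradient descent to
# minimise RMSE over the observed ratings only.
n_users, n_movies, n_factors = 15, 15, 5
u = torch.rand(n_users, n_factors, requires_grad=True)
m = torch.rand(n_factors, n_movies, requires_grad=True)

actuals = torch.full((n_users, n_movies), float('nan'))
actuals[0, 0], actuals[0, 1], actuals[1, 0] = 4.0, 5.0, 3.0    # made-up example ratings
mask = ~torch.isnan(actuals)

lr = 0.05
for step in range(1000):
    preds = u @ m
    loss = ((preds[mask] - actuals[mask]) ** 2).mean().sqrt()  # RMSE over observed cells
    loss.backward()
    with torch.no_grad():
        u -= lr * u.grad; m -= lr * m.grad                     # gradient descent step
        u.grad.zero_(); m.grad.zero_()
print(loss.item())
```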
00:17:03.460 | Alright, so I'm just going to go ahead and press escape to stop it because I'm sick of waiting
00:17:08.260 | and so you can see
00:17:12.060 | We've now got down to 0.39, all right, so for example
00:17:17.660 | it guessed that movie 27 for user 72 would get a
00:17:25.180 | 4.44 rating
00:17:28.380 | and it actually got a 4 rating, so you can see it's doing something quite useful
00:17:37.140 | So why is it doing something quite useful? I mean something to note here is
00:17:42.660 | The number of things we're trying to predict here is there's 225 of them
00:17:47.940 | Right and the number of things we're using to predict is that times two so 150 of them
00:17:55.100 | So it's not like we can just exactly fit we actually have to do some kind of
00:17:58.780 | machine learning here
00:18:01.340 | So basically what this is saying is that there does seem to be some
00:18:05.660 | way of
00:18:07.580 | making predictions in this way and
00:18:10.260 | So for those of you that have done some linear algebra
00:18:13.060 | And this is actually a matrix decomposition normally in linear algebra you would do this using a
00:18:18.980 | Analytical technique or using some techniques that are specifically designed for this purpose, but the nice thing is that we can use
00:18:26.580 | Gradient descent to solve pretty much everything including this
00:18:30.220 | I don't like to so much think of it from a linear algebra point of view though
00:18:34.340 | I like to think of it from an intuitive point of view which is this let's say movie. Sorry. Let's say movie ID 27 is
00:18:41.300 | Lord of the Rings
00:18:44.100 | part one and
00:18:46.100 | let's say
00:18:48.740 | And so let's say we're trying to make that prediction for user
00:18:53.500 | 272: are they going to like Lord of the Rings part one? And so conceptually
00:18:59.820 | for that particular movie, maybe there's five
00:19:04.780 | Numbers here and we could say like well
00:19:07.980 | What if the first one was like how much is it sci-fi and fantasy and the second one is like?
00:19:13.860 | How recent a movie and how much special effects is there you know and the one at the top might be like how dialogue driven?
00:19:20.980 | Is it right like let's say those kind of five these five numbers represented particular things about the movie and so if that was the case
00:19:29.260 | Then we could have the same five numbers for the user saying like okay
00:19:33.620 | How much does the user like sci-fi and fantasy how much does the user like?
00:19:37.980 | modern CGI-driven movies? How much does this user like
00:19:45.820 | dialogue-driven movies? And so if you then took that dot product
00:19:49.140 | you would expect to have a reasonable rating. Now the problem is
00:19:56.540 | We don't have this information for each user. We don't have the information for each movie, so we're just going to like assume
00:20:03.460 | That this is a reasonable
00:20:06.060 | Kind of way of thinking about this system, and let's and let's stochastic gradient descent try and find these numbers
00:20:11.500 | Right so so in other words these these factors
00:20:15.860 | We call these things factors these factors and we call them factors because you can multiply them together to create this
00:20:23.100 | They're factors in a linear algebra sense these factors. We call them latent factors because they're not actually
00:20:29.900 | This is not actually a
00:20:32.340 | Vector that we've like named and understood and like entered in manually we've kind of assumed
00:20:39.180 | That we can think of movie ratings this way
00:20:42.680 | we've assumed that we can think of them as a dot product of
00:20:47.180 | Some particular features about a movie and some particular features of what users like those kinds of movies, right?
00:20:54.220 | And then we've used gradient descent
00:20:56.220 | To just say okay try and find some numbers that that work
00:20:59.980 | So that's that's basically the technique right and it's kind of
00:21:05.660 | The and the entirety is in this spreadsheet right so that is collaborative filtering using what we call probabilistic matrix factorization
00:21:15.020 | And as you can see the whole thing is easy to do in an Excel spreadsheet and the entirety of it really is this single
00:21:21.500 | Thing which is a single matrix multiplication
00:21:24.240 | plus randomly initializing
00:21:27.140 | We'd like to know if it would be better to cap this to between zero and five, maybe? Yeah
00:21:36.300 | Yeah, we're going to do that later right. There's a whole lot of stuff. We can do improvements. This is like our
00:21:43.580 | Simple as possible starting point right so so what we're going to do now is we're going to try and implement this
00:21:49.180 | in Python
00:21:52.340 | And run it on the whole data set. Another question is how do you figure out how many...
00:21:57.700 | you know, how long should the vectors be? Why is it five? Yeah, yeah
00:22:03.860 | So something to think about
00:22:06.860 | Given that this is like movie 49, right?
00:22:10.820 | And we're looking at a rating for movie 49
00:22:13.700 | Think about this. This is actually an embedding matrix
00:22:20.540 | So this length is actually the size of the embedding matrix. I'm not saying this is an analogy
00:22:27.180 | I'm saying it literally this is literally an embedding matrix
00:22:30.600 | We could have a one hot encoding where 72
00:22:34.900 | Where a one is in the 72nd position and so we'd like to look it up, and it would return this list of five numbers
00:22:42.300 | So the question is actually how do we decide on the dimensionality of our embedding vectors?
00:22:47.660 | And the answer to that question is we have no idea
00:22:51.780 | We have to try a few things and see what works
00:22:55.340 | the underlying concept is you need to pick an embedding dimensionality, which is
00:23:03.740 | Enough to reflect the kind of true complexity of this causal system
00:23:08.700 | but not so big that you
00:23:11.740 | Have too many parameters that it could take forever to run or even with vectorization. It might overfit
00:23:17.820 | So what does it mean when the factor is negative then
00:23:24.780 | The factor being negative in the movie case would mean like this is not dialogue driven in fact
00:23:31.900 | It's like the opposite dialogue here is terrible a negative for the user would be like I actually
00:23:38.060 | dislike modern CGI movies. So it's not from zero to whatever, the range of the
00:23:44.620 | score, it can be negative? Is there even a maximum for the range of the score? No, there are no constraints at all here
00:23:51.740 | These are just standard embedding matrices
00:23:54.660 | Thanks
00:24:00.420 | Questions: the first question is why can we trust these embeddings, because if you take the number six
00:24:07.700 | it can be expressed as 1 times 6, or 6 times 1, or 2 times 3, or 3 times 2
00:24:11.840 | Also, are you saying like we could like reorder these five numbers in some other different order or like the value itself might be different
00:24:19.980 | As long as the product is something well, but you see we're using gradient descent to find the best numbers
00:24:26.660 | So like once we found a good minimum
00:24:30.620 | the idea is like
00:24:32.620 | Yeah, there are other numbers, but they don't give you as good an objective value
00:24:36.920 | And of course we should be checking that on a validation set really which we'll be doing in the Python version
00:24:43.460 | Okay, and the second question is when we have a new movie or a new user do we have to retrain the model?
00:24:49.180 | That is a really good question, and there isn't a straightforward answer to that
00:24:53.540 | Time permitting will come back to it
00:24:56.220 | But basically you would need to have like a kind of a new user
00:25:00.100 | Model or a new movie model that you would use initially
00:25:04.660 | And then over time yes, you would then have to retrain the model
00:25:09.620 | So like I don't know if they still do it
00:25:11.800 | But Netflix used to have this thing where, when you were first onboarded onto Netflix,
00:25:15.340 | They would say like what movies do you like?
00:25:17.340 | And you'd have to go through and like say a bunch of movies you like and it would then like train its model
00:25:28.140 | Could you just find the nearest movie to the new movie that you're trying to add?
00:25:33.420 | Yeah, you could use nearest neighbors for sure
00:25:35.700 | But the thing is initially at least in this case we have no
00:25:43.660 | Columns to describe a movie so if you had something about like the movies
00:25:48.720 | Genre release date who was in it or something you could have some kind of non collaborative filtering model
00:25:55.060 | And that was kind of what I meant. I like a new movie model. You'd have to have some some kind of predictors
00:26:00.540 | Okay, so a
00:26:04.660 | Lot of this is going to look familiar and and the way I'm going to do this is again
00:26:11.940 | It's kind of this top-down approach. We're going to start using a
00:26:15.020 | Few features of pytorch and fast AI and gradually we're going to redo it a few times in a few different ways
00:26:23.580 | Kind of doing a little bit deeper each time
00:26:25.580 | Regardless we do need a validation set so we can use our standard cross validation indexes approach to grab a random set of IDs
00:26:36.520 | This is something called weight decay
00:26:40.780 | Which we'll talk about later in the course for those of you that have done some machine learning
00:26:45.260 | It's L2 regularization basically
00:26:48.060 | And this is where we choose how big a embedding matrix do we want okay?
00:26:53.380 | So again, you know, here's where we get our model data object from CSV,
00:27:00.580 | Passing in that ratings file which remember
00:27:05.380 | Looks like that okay, so you'll see like stuff tends to look pretty familiar after a while
00:27:13.620 | And then you just have to pass in
00:27:21.820 | What are your rows effectively? What are your columns effectively, and what are your values effectively right so any any collaborative filtering?
00:27:29.580 | Recommendation system approach. There's basically a concept of like
00:27:33.440 | You know a user and an item
00:27:36.140 | Now they might not be users and items like if you're doing the Ecuadorian groceries competition
00:27:42.680 | There are stores and items and you're trying to predict. How many things are you going to sell at?
00:27:48.540 | This store of this type
00:27:50.900 | But generally speaking just this idea of like you've got a couple of kind of high cardinality
00:27:57.660 | Categorical variables and something that you're measuring and you're kind of conceptualizing and saying okay, we could predict
00:28:04.540 | The rating we can predict the value by doing this this dot product
00:28:09.140 | Interestingly this is kind of relevant to that that last question or suggestion an
00:28:16.660 | Identical way to think about this or to express this is to say
00:28:20.460 | when we're deciding
00:28:23.140 | Whether user 72 will like movie 27 is basically saying
00:28:29.500 | which other
00:28:31.940 | users liked movies that 72 liked and
00:28:36.580 | Which other movies were liked by people like?
00:28:43.140 | User 72 it turns out that these are basically two ways of saying the exact same thing
00:28:50.160 | So basically what collaborative filtering is doing?
00:28:52.300 | You know kind of conceptually is to say okay this movie and this user
00:28:58.420 | Which other movies are similar to it in terms of like?
00:29:02.160 | Similar people enjoyed them and which people are similar to this person based on people that like the same kind of movies
00:29:09.340 | so that's kind of the
00:29:11.580 | underlying
00:29:12.900 | Structure and anytime there's an underlying structure like this that kind of collaborative filtering approach is likely to be useful
00:29:18.840 | okay, so
00:29:21.860 | So you yeah, so there's basically two parts the two bits of your thing that you're factoring and then the value the dependent variable
00:29:29.120 | So as per usual we can take our model data and ask for a learner from it
00:29:35.420 | And we need to tell it what size embedding matrix to use
00:29:38.940 | How many sorry what validation set indexes to use what batch size to use and what optimizer?
00:29:45.740 | To use and we're going to be talking more about optimizers
00:29:49.100 | shortly
00:29:50.820 | We won't do Adam today, but we'll do Adam
00:29:53.120 | next week or the week after
00:29:55.780 | And then we can go ahead and say fit
00:29:58.060 | Right, and it all looks pretty similar to usual. Interestingly,
00:30:04.020 | I only had to do three epochs; these kinds of models tend to train super quickly
00:30:09.620 | You can use the learning rate finder as per usual all the stuff you're familiar with will work fine
00:30:14.780 | And that was it, so this took, you know, about two seconds to train. There's no pretrained anything here
00:30:22.340 | This is from random scratch, right?
00:30:24.500 | So this is our validation set and we can compare it we have this is a mean squared error
00:30:31.140 | Not a root mean squared error, so we can take the square root
00:30:33.860 | So the last time I ran it, it was 0.776, and the square root of that is 0.88, and there are some benchmarks available for this data set
00:30:43.980 | And when I scrolled through and found the best benchmark I could find here, from this
00:30:48.560 | recommendation-system-specific library, they had 0.91. So we've got a better loss in two seconds
00:30:59.420 | Already, so that's good
00:31:01.420 | So that's basically how you can do collaborative filtering
00:31:06.060 | with the fast AI library without
00:31:09.740 | Thinking too much, but so now we're going to dig in and try and rebuild that we'll try and get to the point that we're getting
00:31:16.980 | something around
00:31:19.300 | 0.77 0.78 from scratch
00:31:21.300 | But if you want to do this yourself at home, you know without worrying about the detail
00:31:28.340 | That's you know, those three lines of code is all you need
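For reference, those three lines look roughly like this in the 2018-era fast.ai (0.7) library, as recalled from the lesson notebook; treat the exact names and signatures as approximate, and `ratings` and `path` as the already-loaded DataFrame and data directory.

```python
from fastai.learner import *       # 2018-era fast.ai (0.7)
from fastai.column_data import *

wd, n_factors = 2e-4, 50                        # weight decay and embedding size
val_idxs = get_cv_idxs(len(ratings))            # random validation indexes

cf = CollabFilterDataset.from_csv(path, 'ratings.csv', 'userId', 'movieId', 'rating')
learn = cf.get_learner(n_factors, val_idxs, 64, opt_fn=optim.Adam)
learn.fit(1e-2, 2, wds=wd, cycle_len=1, cycle_mult=2)
```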
00:31:31.460 | Okay, so we can get the predictions in the usual way and you know, we could for example plot
00:31:37.540 | sns is Seaborn; Seaborn is a really great plotting library. It sits on top of matplotlib
00:31:43.220 | It actually leverages matplotlib
00:31:45.940 | So anything you learn about matplotlib will help you with Seaborn. It's got a few nice little plots like this joint plot
00:31:51.340 | Here is I'm doing
00:31:54.140 | predictions
00:31:56.020 | against
00:31:57.340 | Against actuals. So these are my actuals
00:31:59.680 | These are my predictions and you can kind of see the the shape here is that as we predict higher numbers
00:32:05.180 | they actually are higher numbers and you can also see the histogram of the
00:32:09.360 | predictions and a histogram of the actuals. So I'm just plotting that to show you another interesting visualization
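A minimal sketch of that kind of plot, assuming `preds` and `y_valid` are 1-D arrays of the validation predictions and actual ratings:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.jointplot(x=preds, y=y_valid, kind='hex')   # hexbin scatter with marginal histograms
plt.show()
```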
00:32:16.020 | Could you please explain the n-factors
00:32:20.780 | Why is it set to 50? It's set to 50 because I tried a few things and that seemed to work
00:32:26.460 | That's all. What does it mean? It's the dimensionality of the embedding matrix
00:32:31.040 | Or to think of it another way, it's like, you know, rather than being five, it's 50
00:32:37.140 | Jeremy I have a question about suppose that your
00:32:46.300 | Recommendation system is more implicit. So you have zeros or ones instead of just
00:32:55.140 | Actual numbers, right? So basically we would then
00:32:57.700 | Need to use a classifier instead of a regressor
00:33:01.420 | Do you have to sample the negatives or something like that?
00:33:06.140 | So if you don't have it, you just have ones, let's say, just kind of implicit feedback? Oh,
00:33:11.380 | I'm not sure we'll get to that one in this class
00:33:14.260 | But what I will say is like in the case that you're just doing classification rather than regression
00:33:18.740 | We haven't actually built that in the library yet
00:33:22.100 | Maybe somebody this week wants to try adding it. It would only be a small number of lines of code. You basically have to change the
00:33:27.780 | activation function to be a sigmoid and you would have to change the
00:33:32.380 | Criterion or the loss function to be cross entropy
00:33:36.700 | rather than
00:33:39.540 | RMSE and that will give you a
00:33:41.880 | Classifier rather than a regressor. Those are the only things you'd have to change
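As a sketch of those two changes (not something that was in the library at the time), where `u` and `m` stand for the looked-up user and movie embedding vectors:

```python
import torch
import torch.nn.functional as F

def forward_classifier(u, m):
    # change 1: squash the dot product with a sigmoid so the output is a probability
    return torch.sigmoid((u * m).sum(1))

# change 2: use (binary) cross entropy instead of MSE as the loss
# loss = F.binary_cross_entropy(forward_classifier(u, m), targets.float())
```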
00:33:46.180 | So hopefully somebody this week will take up that challenge and by the time we come back next week. We will have that working
00:33:54.140 | So I said that we're basically doing a dot product right or you know a dot product is kind of the vector version
00:34:03.460 | I guess of this matrix product
00:34:05.620 | So we're basically doing each of these things times each of these things and then add it together
00:34:11.700 | That's a dot product. So let's just have a look at how we do that in PyTorch
00:34:17.300 | So we can create a tensor in PyTorch just using this little capital T thing
00:34:22.380 | You can just say that; that's the fast.ai version. The full version is torch.from_numpy or something
00:34:28.780 | But I've got it set up so you can pass it through pass in even a list of lists
00:34:33.260 | So this is going to create a torch tensor with 1 2 3 4 and then here's a torch tensor with 2 2 10 10
00:34:41.440 | Okay, so here are two
00:34:44.260 | Torch tensors, I didn't say dot CUDA. So they're not on the GPU. They're sitting on the CPU
00:34:50.340 | just FYI
00:34:52.820 | We can multiply them together
00:34:54.860 | Right and so anytime you have a mathematical operator between tensors in NumPy or PyTorch
00:35:02.420 | It will do element wise
00:35:05.100 | Assuming that they're the same dimensionality, which they are they're both 2 by 2
00:35:08.720 | Okay, and so here we've got
00:35:11.460 | 2 times 2 is 4
00:35:14.140 | 3 times 10 is 30, and so forth. Okay, so there's our a times b
00:35:17.940 | So if you think about basically what we want to do here is we want to take
00:35:23.540 | Okay, so I've got 1
00:35:31.580 | times 2 is 2, 2 times 2 is 4,
00:35:37.100 | 2 plus 4 is 6, and so that is actually the dot product between (1, 2) and (2, 2). And
00:35:44.380 | then here we've got 3 times 10 is 30,
00:35:48.900 | 4 times 10 is 40, and 30 plus 40 is 70
00:35:52.220 | So in other words a times B dot sum along the first dimension
00:35:57.860 | So that's summing up the columns. In other words across a row
00:36:01.260 | Okay, this thing here is doing the dot product of
00:36:06.980 | Each of these rows with each of these rows
00:36:09.740 | That makes sense and obviously we could do that with
00:36:13.540 | you know, some kind of matrix multiplication approach, but I'm trying to really do things with as little
00:36:19.860 | special case stuff as possible
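The same little experiment in plain PyTorch, using torch.tensor rather than fast.ai's T helper:

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[2., 2.], [10., 10.]])

print(a * b)           # element-wise product: [[2, 4], [30, 40]]
print((a * b).sum(1))  # sum across each row: [6, 70] -- the row-wise dot products
```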
00:36:22.820 | Okay, so that's what we're going to use for our dot products from now on so basically all we need to do now is
00:36:29.660 | Remember, the data we have is not in that crosstab format
00:36:35.420 | So in Excel we've got it in this crosstab format, but we've got it here in this
00:36:39.940 | listed format: user, movie, rating; user, movie...
00:36:43.340 | So conceptually we want to be like looking up this user
00:36:47.300 | Into our embedding matrix to find their 50 factors looking up that movie to find their 50 factors and then take the dot product
00:36:54.780 | of those two 50 long vectors
00:36:57.500 | So let's do that
00:37:04.220 | To do it we're going to build a layer our own custom
00:37:09.100 | neural net layer
00:37:11.860 | So the the more generic vocabulary we call this is we're going to build a pytorch module
00:37:18.980 | Okay, so a pytorch module is a very specific thing
00:37:23.500 | It's something that you can use as a layer and a neural net once you've created your own pytorch module
00:37:29.580 | You can throw it into a neural net
00:37:31.700 | And a module works by assuming we've already got one say called model
00:37:36.940 | You can pass in some things in parentheses, and it will calculate it right so assuming that we already have a module called dot product
00:37:45.460 | We can instantiate it like so
00:37:48.940 | To create our dot product object, and we can basically now treat that like a function
00:37:56.100 | All right, but the thing is it's not just a function because we'll be able to do things like take derivatives of it
00:38:02.620 | Stack them up together into a big
00:38:05.100 | Stack of neural network layers blah blah blah, right, so it's basically a function that we can kind of compose very conveniently
00:38:13.460 | So here how do we define a module which as you can see here returns a dot product well
00:38:20.700 | We have to create a Python class and so if you haven't done Python OO before
00:38:26.500 | You're going to have to learn because all pytorch modules are written in Python OO
00:38:32.260 | And it's one of the things I really like about pytorch is that it doesn't
00:38:36.020 | Reinvent totally new ways of doing things like TensorFlow does all the time in pytorch that you know really tend to use
00:38:44.780 | Pythonic ways to do things so in this case. How do you create you know some kind of new behavior you create a Python class?
00:38:52.220 | So Jeremy suppose that you have a lot of data
00:38:58.600 | not just a little bit of data, more than you can hold in memory. Will you be able to use fast.ai to solve collaborative filtering?
00:39:05.740 | Yes, absolutely
00:39:08.460 | It's it uses
00:39:12.780 | mini batch stochastic gradient descent which does it a batch at a time the
00:39:19.180 | This particular version is going to create a
00:39:27.660 | Pandas data frame and a pandas data frame has to live in memory
00:39:32.740 | Having said that you can get easily 512 gig
00:39:38.460 | You know instances on Amazon so like if you had a CSV that was bigger than 512 gig
00:39:43.660 | You know that would be impressive if that did happen
00:39:48.140 | I guess you would have to instead save that as a bcolz array and
00:39:51.780 | create a slightly different version that reads from a bcolz array, streaming it in, or maybe from a dask
00:39:58.380 | data frame, which also... so
00:40:01.100 | It would be easy to do I don't think I've seen
00:40:05.380 | Real world situations where you have 512 gigabyte collaborative filtering matrices, but yeah, we can do it
00:40:12.940 | Okay now
00:40:16.540 | This is PyTorch specific this next bit is that when you define like the actual work to be done which is here return
00:40:24.780 | user times movie dot sum
00:40:27.300 | You have to put it in a special method called forward
00:40:31.860 | Okay, and this is this idea that it's very likely going to be part of a neural net, right, and in a neural net the thing where you
00:40:37.820 | calculate the next
00:40:39.980 | Set of activations is called the forward pass and so that's doing a forward calculation
00:40:45.900 | Calculating the gradients is called the backward calculation
00:40:49.620 | We don't have to do that because PyTorch calculates that automatically so we just have to define
00:40:54.500 | Forward so we create a new class we define forward and here we write in our definition of dot product
00:41:01.800 | Okay, so that's it. So now that we've created this class definition. We can instantiate our
00:41:09.100 | Model right and we can call our model and get back the numbers we expected. Okay, so that's it
00:41:16.120 | That's how we create a custom
00:41:18.120 | PyTorch layer and if you compare that to like any other
00:41:23.080 | Library around pretty much. This is way easier
00:41:26.080 | Basically, I guess because we're leveraging
00:41:28.960 | What's already in Python?
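Here is a sketch of that custom layer, close to what the lesson describes:

```python
import torch
import torch.nn as nn

class DotProduct(nn.Module):
    def forward(self, u, m):
        return (u * m).sum(1)      # row-wise dot product, a whole mini-batch at a time

model = DotProduct()
a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[2., 2.], [10., 10.]])
print(model(a, b))                 # tensor([ 6., 70.]) -- calling the module runs forward
```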
00:41:31.720 | So let's go ahead and now create a more complex
00:41:35.080 | Module and we're going to basically do the same thing. We're going to have a forward again
00:41:41.720 | We're going to have our users times movies dot sum
00:41:44.960 | But we're going to do one more thing beforehand, which is we're going to create two
00:41:49.520 | Embedding matrices and then we're going to look up our users and our movies in those embedding matrices
00:41:56.360 | So let's go through and and do that
00:42:00.640 | so the first thing to realize is that
00:42:02.840 | The users the user IDs and the movie IDs may not be contiguous
00:42:09.680 | You know like they may be they start at a million and go to a million one thousand say, right? So if we just used
00:42:18.240 | Those IDs directly to look up into an embedding matrix
00:42:23.080 | We would have to create an embedding matrix of size one million one thousand right which we don't want to do
00:42:28.080 | so the first thing I do is to get a list of the
00:42:31.800 | unique user IDs and
00:42:34.520 | then I create a mapping from every user ID to a
00:42:39.360 | Contiguous integer this thing I've done here where I've created a
00:42:44.900 | dictionary which maps from every unique thing to a unique index is
00:42:50.960 | Well worth studying during the week because like it's super super handy
00:42:55.440 | It's something you very very often have to do in all kinds of machine learning
00:42:59.400 | All right, and so I won't go through it here
00:43:01.680 | It's easy enough to figure out if you can't figure it out just ask on the forum
00:43:04.920 | Anyway, so once we've got the mapping from user to a contiguous index
00:43:11.480 | We then can say let's now replace the user ID column
00:43:17.480 | With that contiguous index right so pandas dot apply applies an arbitrary function
00:43:24.680 | In Python Lambda is how you create an anonymous function on the fly and this anonymous function simply returns the index
00:43:32.560 | We do the same thing for movies, and so after that we now have the same ratings table we had before, but our
00:43:39.720 | IDs have been matched to contiguous
00:43:42.600 | Integers and therefore there are things that we can look up into an embedding matrix
00:43:47.480 | So let's get the count of our users and our movies
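A sketch of that contiguous-ID mapping, assuming the MovieLens ratings.csv has been loaded into a pandas DataFrame called `ratings`:

```python
u_uniq = ratings.userId.unique()
user2idx = {o: i for i, o in enumerate(u_uniq)}     # every user ID -> 0 .. n_users-1
ratings.userId = ratings.userId.apply(lambda x: user2idx[x])

m_uniq = ratings.movieId.unique()
movie2idx = {o: i for i, o in enumerate(m_uniq)}
ratings.movieId = ratings.movieId.apply(lambda x: movie2idx[x])

n_users, n_movies = len(u_uniq), len(m_uniq)        # counts used to size the embeddings
```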
00:43:52.760 | And let's now go ahead and try and create our
00:43:55.520 | Python version of this
00:43:58.800 | Okay, so
00:44:01.920 | Earlier on when we created our simplest possible
00:44:06.840 | Pytorch module there was no like
00:44:12.280 | State we didn't need a constructor
00:44:15.040 | Because we weren't like saying how many users are there or how many movies are there or how many factors?
00:44:20.320 | Do we want or whatever right anytime we want to do something like?
00:44:24.440 | This where we're passing in and saying we want to construct our
00:44:29.800 | Module with this number of users and this number of movies then we need a constructor
00:44:37.000 | for our class and you create a constructor in Python by defining a
00:44:42.600 | dunder init, that is, __init__ (underscore underscore init underscore underscore),
00:44:45.720 | special name, so this just creates a
00:44:49.840 | constructor. Then, if you haven't done OO before,
00:44:51.920 | you'll want to do some study during the week, but it's a pretty simple idea
00:44:57.160 | This is just the thing that when we create this object. This is what gets run, okay?
00:45:01.680 | Again special Python thing when you create your own constructor
00:45:06.920 | You have to call the parent class constructor
00:45:08.800 | And if you want to have all of the cool behavior of a Pytorch module you get that by inheriting
00:45:15.600 | from nn.Module, the neural net module. Okay, so basically by inheriting here and calling the superclass constructor,
00:45:23.560 | We now have a fully functioning Pytorch layer, okay, so now we have to give it some behavior
00:45:29.760 | And so we give it some behavior by storing some things in it all right, so here. We're going to create something called
00:45:37.400 | self.u, for users, and that is going to be an
00:45:42.200 | embedding layer
00:45:44.160 | Number of rows is n_users, number of columns is n_factors
00:45:48.440 | So that is exactly this, right: the number of rows is n_users, the number of columns is n_factors
00:45:56.920 | And then we'll have to do the same thing for movies
00:46:00.120 | All right, so that's going to go ahead and create these two
00:46:04.800 | randomly initialized arrays
00:46:10.400 | However, when you randomly initialize an array, it's important to randomly initialize it to a
00:46:16.040 | reasonable set of numbers, like a reasonable scale, right? If we randomly initialized them from, like, nought to a million
00:46:23.040 | then we would start out, you know, with these ratings being
00:46:27.720 | billions and billions in size, and that's going to be very hard to do gradient descent on
00:46:33.600 | So I just kind of manually figured out here, like, okay, about what size
00:46:39.800 | numbers are going to give me about the right ratings, and we know we want ratings between about nought and five
00:46:45.920 | So if we start out with stuff between about naught and 0.05, then we're going to get ratings of about the right level
00:46:54.160 | You can easily enough like that calculate that in neural nets. There are standard algorithms for
00:47:01.880 | basically doing that calculation, and the key algorithm is
00:47:09.360 | something called He initialization, from Kaiming He, and the basic idea
00:47:15.400 | Is that you take the
00:47:20.540 | Here you basically set the weights equal to a normal distribution
00:47:27.800 | With a standard deviation, which is basically inversely proportional to the number of things
00:47:39.380 | in the previous layer
00:47:41.380 | And so in our previous layer
00:47:44.100 | So in this case, if you basically take that
00:47:52.020 | nought to 0.05 and account for the fact that you've got
00:47:55.720 | 40... was it 40 or 50 things coming out of it?
00:47:59.820 | 50, 50 things coming out of it, then you're going to get something of about the right size
00:48:06.820 | PyTorch already has, like, He initialization
00:48:11.100 | classes there, so we don't normally in real life have to think about this; we can just call the existing initialization
00:48:17.780 | Functions, but we're trying to do this all like from scratch here. Okay without any
00:48:23.700 | special stuff going on
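As a formula, He (Kaiming) initialization in its usual ReLU form draws each weight from a normal distribution whose spread shrinks with the number of inputs to the layer, $n_{\mathrm{in}}$:

$$ W \sim \mathcal{N}\left(0,\ \sigma^{2}\right), \qquad \sigma = \sqrt{\frac{2}{n_{\mathrm{in}}}} $$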
00:48:26.380 | So there's quite a bit of PyTorch notation here. self.u, which we've already set to an instance of the embedding class,
00:48:37.620 | has a .weight
00:48:39.620 | attribute which contains the actual embedding matrix
00:48:43.900 | But what that contains,
00:48:48.700 | the actual embedding matrix, is not a tensor
00:48:52.980 | It's a variable a variable is exactly the same as a tensor in other words it supports the exact same
00:49:00.220 | operations as a tensor, but it also
00:49:04.500 | Does automatic differentiation?
00:49:06.500 | That's all a variable is basically
00:49:09.220 | To pull the tensor out of a variable you get its data attribute
00:49:15.740 | Okay, so this is now the tensor of the weight matrix of the self.u embedding
00:49:22.980 | And then something that's really handy to know is that all of the tensor functions in Pytorch
00:49:30.340 | You can stick an underscore at the end, and that means do it in place
00:49:34.500 | Right so this is say create a random uniform random number of an appropriate size
00:49:40.860 | For this tensor and don't return it, but actually fill in that matrix
00:49:46.860 | in place. Okay, so that's a super handy thing to know about. I mean, it's not rocket science; otherwise we would have to have gone,
00:49:55.220 | okay, there's the non-in-place version. It just saves us some typing, saves us some screen noise. That's all
00:50:14.900 | So now we've got our randomly initialized embedding weight matrices
00:50:19.860 | And so now the forward
00:50:22.980 | I'm actually going to use the same columnar model data that we used for
00:50:27.700 | Rossman
00:50:29.540 | And so it's actually going to be passed both categorical variables and continuous variables
00:50:34.080 | and in this case there are no
00:50:36.620 | continuous variables, so I'm just going to grab the
00:50:40.180 | zeroth column out of the categorical variables and call it users and the first column and call it movies okay, so I'm just kind of
00:50:48.660 | Too lazy to create my own. I'm not so much too lazy that we do have a special class for this
00:50:53.340 | But I'm trying to avoid creating a special class, so I'm just going to leverage this columnar model data class
00:50:58.920 | Okay, so we can basically grab our user and movies
00:51:03.020 | Mini batches right and remember this is not a single user in a single movie. This is going to be a whole mini batch of them
00:51:11.340 | We can now look up that mini batch of users in our embedding matrix U and the movies in
00:51:18.700 | our embedding matrix M
00:51:20.380 | All right, so this is like exactly the same as just doing an array look up to grab the the user ID numbered
00:51:26.820 | Value, but we're doing it a whole mini batch at a time
00:51:29.980 | Right and so it's because pytorch
00:51:32.340 | Can do a whole mini batch at a time with pretty much everything that we can get really easy speed up
00:51:37.580 | We don't have to write any loops on the whole to do everything through our mini batch
00:51:42.460 | And in fact if you do have a loop through your mini batch manually you don't get GPU acceleration
00:51:48.460 | That's really important to know right so you never want to loop have a for loop going through your mini batch
00:51:53.860 | You always want to do things in this kind of like whole mini batch at a time
00:51:58.120 | But pretty much everything in pytorch does things a whole mini batch at a time, so you shouldn't have to worry about it
00:52:04.060 | And then here's our product just like before all right so having defined
00:52:10.620 | That I'm now going to
00:52:16.980 | go ahead and say that my x values are
00:52:19.540 | everything except the rating and the timestamp
00:52:22.820 | in my ratings table, my y is my rating, and then I can just say okay, let's
00:52:27.920 | Grab a model data from a data frame using that X and that Y and here is our list of
00:52:35.380 | categorical variables
00:52:40.180 | And then so let's now instantiate that pytorch object
00:52:46.940 | All right, so we've now created that from scratch
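Putting those pieces together, here is a sketch of the module and the model data object, close to the lesson notebook but not guaranteed to match it line for line; `ratings`, `val_idxs`, `path`, `n_users` and `n_movies` are assumed from the earlier steps.

```python
import numpy as np
import torch.nn as nn
from fastai.column_data import ColumnarModelData   # 2018-era fast.ai (0.7)

n_factors = 50

class EmbeddingDot(nn.Module):
    def __init__(self, n_users, n_movies):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)   # rows: users, columns: factors
        self.m = nn.Embedding(n_movies, n_factors)  # rows: movies, columns: factors
        self.u.weight.data.uniform_(0, 0.05)        # in-place init at a sensible scale
        self.m.weight.data.uniform_(0, 0.05)

    def forward(self, cats, conts):                 # ColumnarModelData passes cats and conts
        users, movies = cats[:, 0], cats[:, 1]      # a whole mini-batch of IDs at once
        u, m = self.u(users), self.m(movies)        # embedding lookups
        return (u * m).sum(1)                       # row-wise dot product

x = ratings.drop(['rating', 'timestamp'], axis=1)
y = ratings['rating'].astype(np.float32)
data = ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], 64)
model = EmbeddingDot(n_users, n_movies)
```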
00:52:49.260 | And then the next thing we need to do is to create an optimizer, so this is part of pytorch
00:52:56.420 | The only fast AI thing here is this line right because it's like I don't think showing you
00:53:03.840 | How to build data sets and data loaders is interesting enough really we might do that in part two of the course
00:53:10.060 | And it's actually so straightforward like a lot of you are already doing it on the forums
00:53:15.740 | So I'm not going to show you that in this part
00:53:17.860 | But if you're interested feel free to to talk on the forums about it
00:53:21.940 | But I'm just going to basically take the the thing that feeds this data as a given particularly because these things are so flexible
00:53:28.420 | Right you you know if you've got stuff in a data frame. You can just use this you don't have to rewrite it
00:53:32.900 | So that's the only fast AI thing we're using so this is a pytorch thing and so
00:53:39.060 | Optim is the thing in pytorch that gives us an optimizer. We'll be learning about that
00:53:45.620 | very shortly
00:53:47.420 | So it's actually the thing that's going to update our weights
00:53:50.200 | Pytorch
00:53:53.740 | calls them the parameters of the model. So earlier on we said model equals EmbeddingDot blah blah blah
00:54:01.020 | All right, and because EmbeddingDot
00:54:03.580 | derives from nn.Module, we get all of the PyTorch module behavior, and one of the things we get for free
00:54:11.140 | Is the ability to say dot parameters?
00:54:14.260 | So that's pretty that's pretty handy right that's the thing that basically is going to automatically
00:54:20.860 | Give us a list of all of the weights in our model that have to be updated and so that's what gets passed to the optimizer
00:54:28.900 | We also passed the optimizer the learning rate
00:54:31.820 | The weight decay which we'll talk about later and momentum that we'll talk about later
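A sketch of that optimizer creation; the hyperparameter values are just plausible examples, not necessarily the ones used in the lesson:

```python
import torch.optim as optim

opt = optim.SGD(model.parameters(), lr=1e-1, weight_decay=1e-5, momentum=0.9)
```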
00:54:41.300 | Okay, one other thing that I'm not going to do right now
00:54:44.060 | But we will do later is to write a training loop so the training loop is a thing that loops through each mini batch
00:54:51.620 | updates the weights to subtract the gradient times the learning rate
00:54:56.300 | There's a function in fast AI which is the training loop and it's
00:55:03.020 | It's pretty simple
00:55:10.140 | Here it is right for epoch in epochs
00:55:13.180 | This is just the thing that shows a progress bar so ignore this for X comma Y in my training data loader
00:55:20.900 | calculate the loss
00:55:24.260 | print out the loss in a progress bar, call any callbacks you have, and at the end
00:55:35.860 | call the metrics on the validation set. Right, so this just says: for each epoch, go through each mini-batch
00:55:43.900 | and do one step of our optimizer step is
00:55:48.680 | Basically going to take advantage of this optimizer, but we're rewriting that from scratch shortly
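For a sense of what that loop boils down to, here is a bare-bones hand-written version (not fast.ai's actual fit function), assuming the data loader yields (categorical batch, continuous batch, target):

```python
import torch.nn.functional as F

def simple_fit(model, data, epochs, opt, crit=F.mse_loss):
    for epoch in range(epochs):
        model.train()
        for x_cat, x_cont, y in data.trn_dl:   # one mini-batch at a time
            opt.zero_grad()
            loss = crit(model(x_cat, x_cont), y)
            loss.backward()                    # backward pass: compute the gradients
            opt.step()                         # update the weights
        print(epoch, loss.item())
```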
00:55:54.580 | So this is notice we're not using a learner
00:56:06.420 | Okay, we're just using a PyTorch module. So this fit thing, although it's part of fast.ai,
00:56:11.560 | is like lower down the layers of abstraction. Now, this is the thing that takes a
00:56:15.940 | regular PyTorch model, so if you ever want to, like,
00:56:15.940 | skip as much
00:56:19.220 | Fast AI stuff as possible like you've got some pipe torch model. You've got some code on the internet
00:56:24.580 | You basically want to run it
00:56:26.060 | But you don't want to write your own training loop, then this is this is what you want to do
00:56:30.340 | You want to call fast AI's fit function and so what you'll find is like
00:56:34.300 | The library is designed so that you can kind of dig in at any layer of abstraction
00:56:39.140 | You like right and so at this layer of abstraction. You're not going to get things like
00:56:45.340 | Stochastic gradient descent with restarts you're not going to get like differential learning rates like all that stuff
00:56:52.180 | That's in the learner like you could do it, but you'd have to write it all by by hand yourself
00:56:56.620 | Right and that's the downside of kind of going down to this level of abstraction
00:57:00.900 | The upside is that as you saw the code for this is very simple. It's just a simple training loop
00:57:07.060 | It takes a standard pytorch model
00:57:09.060 | So this is like this is a good thing for us to use here
00:57:12.460 | We can just call it and it looks exactly like what we're used to seeing, right? We get our
00:57:19.540 | validation and training loss for the three epochs
00:57:23.220 | now you'll notice that
00:57:26.220 | We wanted something around point seven six
00:57:31.260 | So we're not there, so in other words the default fast AI collaborative filtering algorithm is doing something
00:57:39.620 | Smarter than this so we're going to try and do that
00:57:45.020 | One thing that we can do, since we're calling, you know, this lower level fit function:
00:57:49.780 | there's no learning rate annealing, so we could do our own learning rate annealing. You can
00:57:54.140 | see here there's a fast AI function called set_lrs
00:57:57.220 | you can pass in a standard PyTorch optimizer and pass in your new learning rate and
00:58:02.540 | then call fit again. And so this is how we can manually do a learning rate schedule
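In plain PyTorch the same idea, changing the learning rate on an existing optimizer and then fitting again, is just a loop over the optimizer's parameter groups. This is a sketch of what such a helper does, not the library's actual source:

```python
def set_lr(opt, lr):
    # Change the learning rate on an existing optimizer in place; calling the
    # fit loop again afterwards continues training at the new rate.
    for param_group in opt.param_groups:
        param_group['lr'] = lr
```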
00:58:09.100 | And so you can see we've got a little bit better
00:58:13.140 | We still got a long way to go
00:58:15.580 | Okay, so I think what we might do is we might have a
00:58:20.740 | seven-minute break and then we're going to come back and try and improve this score a bit
00:58:28.240 | For those who are interested somebody was asking me at the break for a kind of a quick
00:58:41.740 | Walk through so this is totally optional, but if you go into the fast AI library, there's a model.py file
00:58:53.740 | That's where fit is which we're just looking at which goes through
00:58:57.420 | Each epoch in epochs and then goes through each X and Y in the mini batch and then it calls this
00:59:05.500 | Step function so the step function
00:59:12.000 | here, and you can see the key thing is it calculates the output from the model, the model's forward, right, and so if you remember
00:59:21.400 | our dot product
00:59:24.320 | We didn't actually call model dot forward we just called model parentheses and that's because the
00:59:31.280 | nn dot module
00:59:34.080 | automatically
00:59:35.520 | You know when you call it as if it's a function it passes it along to forward
00:59:39.640 | Okay, so that's what that's doing there, right, and then the rest of this we'll learn about shortly; it's just basically doing the
00:59:46.860 | The loss function and then the backward pass
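A tiny illustration of that point: calling an nn.Module instance like a function goes through __call__, which calls forward for you (the class and names here are made up for the example):

```python
import torch
import torch.nn as nn

class TinyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(3, 1)

    def forward(self, x):
        return self.lin(x)

m = TinyModule()
x = torch.randn(5, 3)
# m(x) routes through nn.Module.__call__, which invokes our forward();
# that's why the training loop never calls model.forward directly.
assert torch.equal(m(x), m.forward(x))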
00:59:51.920 | Okay, so for those who are interested
00:59:54.200 | That's that's kind of gives you a bit of a sense of how the code is structured if you want to look at it
00:59:59.160 | and as I say, like, the fast AI code is designed to
01:00:04.440 | both be world class performance, but also
01:00:08.680 | Pretty easy to read so like feel free like take a look at it
01:00:13.480 | And if you want to know what's going on just ask on the forums
01:00:16.400 | And if you know if you think there's anything that could be
01:00:19.400 | clearer
01:00:21.840 | Let us know
01:00:23.840 | Because yeah, the code... you know, we're going to be digging into the code more and more
01:00:28.880 | Okay, so let's try and improve this a little bit and let's start off by improving it in Excel
01:00:38.040 | So you might have noticed here that we've kind of got the idea that
01:00:42.320 | User 72
01:00:44.880 | You know like sci-fi modern movies with special effects, you know
01:00:49.760 | Whatever and movie number 27 is sci-fi and has special effects and not much dialogue
01:00:55.520 | but we're missing an important case, which is like
01:01:01.400 | User 72 is pretty enthusiastic on the whole and on average rates things highly, you know, and movie
01:01:12.080 | you know, it's just a popular movie
01:01:14.320 | you know, which just on average is rated higher
01:01:17.000 | so what we'd really like is to add a
01:01:20.040 | constant for the user and a constant for the movie and
01:01:24.880 | Remember in neural network terms we call that a bias
01:01:28.880 | That's we want to add a bias so we could easily do that and if we go into the bias tab here
01:01:35.040 | We've got the same data as before and we've got the same
01:01:38.560 | Latent factors as before and I've just got one extra
01:01:44.000 | Row here and one extra column here and you won't be surprised here that we now
01:01:50.800 | Take the same matrix multiplication as before and we add in
01:01:56.160 | that and we add in that
01:01:59.360 | Okay, so that's our bias
01:02:02.640 | So other than that we've got exactly the same loss function over here
01:02:07.720 | And so just like before we can now go ahead and solve that and now our changing variables include the
01:02:16.600 | bias and we can say solve and if we leave that for a little while it will come to a
01:02:24.320 | better result than we had before
01:02:26.600 | Okay, so that's the first thing we're going to do to improve our model, and there's really very little to show
01:02:35.000 | Just to
01:02:38.360 | Make the code a bit shorter
01:02:40.600 | I've defined a function called get_emb which takes a number of inputs and a number of factors
01:02:47.720 | so the number of rows in the embedding matrix and the width of the embedding matrix, creates the embedding and
01:02:55.040 | randomly initializes it. I don't know why I'm doing negative to positive here when it was zero to positive last time
01:03:00.440 | Honestly, it doesn't matter much as long as it's in the right ballpark
01:03:03.280 | And then we return that initialized embedding
01:03:06.780 | So now we need not just our users by factors, which I'll chuck into u, our movies by factors
01:03:14.680 | which I'll chuck into m, but we also need users by 1
01:03:18.440 | which we'll put into ub, user bias, and movies by 1 which we'll put into mb, movie bias
01:03:24.280 | Okay, so this is just doing a list comprehension
01:03:27.360 | Going through each of the tuples creating embedding for each of them and putting them into these things
01:03:32.800 | Okay, so now our forward is exactly the same as before
01:03:38.200 | U times m dot sum, and this is actually a little confusing because we're doing it in two steps
01:03:47.840 | Maybe to make it a bit easier. Let's pull this out
01:03:51.300 | Put it up here
01:03:54.240 | Put this in parentheses
01:03:56.960 | Okay, so maybe that looks a little bit more familiar
01:04:00.880 | All right, u times m dot sum, that's the same dot product, and then here we're just going to add in our user bias and
01:04:07.840 | our movie bias
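Here is a sketch of what that bias version of the model might look like in PyTorch; the class and attribute names are illustrative, matching the description above rather than the exact notebook code:

```python
import torch.nn as nn

class EmbeddingDotBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50):
        super().__init__()
        self.u  = nn.Embedding(n_users,  n_factors)   # users by factors
        self.m  = nn.Embedding(n_movies, n_factors)   # movies by factors
        self.ub = nn.Embedding(n_users,  1)           # users by 1: user bias
        self.mb = nn.Embedding(n_movies, 1)           # movies by 1: movie bias

    def forward(self, users, movies):
        # Same dot product as before, plus the per-user and per-movie bias.
        res = (self.u(users) * self.m(movies)).sum(1)
        return res + self.ub(users).squeeze() + self.mb(movies).squeeze()
```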
01:04:10.040 | Dot squeeze is the PyTorch thing that gets rid of an additional
01:04:17.480 | unit axis
01:04:19.480 | That's not going to make any sense if you haven't done broadcasting before
01:04:23.040 | I'm not going to do broadcasting in this course because we've already done it and we're doing it in the machine learning course
01:04:29.980 | But basically in short
01:04:32.200 | Broadcasting is what happens when you do something like this where um is a matrix
01:04:39.440 | self dot ub of users is a vector
01:04:42.000 | How do you add a vector to a matrix and basically what it does?
01:04:46.880 | Is it duplicates?
01:04:48.880 | the vector
01:04:50.400 | So that it makes it the same size as the matrix and the particular way whether it duplicates it across columns or down rows
01:04:57.240 | Or how it does it is called broadcasting the broadcasting rules are the same as numpy
01:05:02.700 | PyTorch didn't actually used to support broadcasting
01:05:06.100 | So I was actually the guy who first added broadcasting to PyTorch using an ugly hack and then the PyTorch authors did an awesome job
01:05:12.880 | Of supporting it actually inside the language
01:05:16.400 | So now you can use the same broadcasting operations in PyTorch as numpy
01:05:21.000 | If you haven't dealt with this before it's really important to learn it
01:05:26.540 | Because like it's it's kind of the most important fundamental way to do computations quickly in numpy and PyTorch
01:05:34.760 | It's the thing that lets you not have to do loops
01:05:37.100 | Could you imagine here if I had to loop through every row of this matrix and add, you know,
01:05:43.120 | this back to every row? It would be slow, it would be, you know, a lot more code
01:05:47.640 | And the idea of broadcasting it actually goes all the way back to
01:05:52.840 | APL which is a language designed in the 50s by an extraordinary guy called Ken Iverson
01:05:58.120 | APL was originally designed, or written out, as a new type of mathematical notation
01:06:04.200 | He has this great essay called
01:06:07.480 | Notation as a tool for thought and the idea was that like really good notation could actually make you think of better things
01:06:14.320 | And part of that notation is this idea of broadcasting. I'm incredibly enthusiastic about it, and we're going to use it plenty
01:06:25.080 | either watch the machine learning lesson or
01:06:29.360 | You know Google numpy broadcasting
01:06:32.760 | for information
01:06:35.560 | Anyway, so basically it works reasonably intuitively we can add on we can add the vectors to the matrix
01:06:43.160 | All right
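A quick made-up example of that broadcasting behaviour in PyTorch:

```python
import torch

um = torch.arange(6.).reshape(2, 3)    # a small "matrix" of values
bias = torch.tensor([10., 20., 30.])   # a vector

# Broadcasting: the vector is conceptually duplicated down the rows so it can
# be added to every row of the matrix, with no explicit loop.
print(um + bias)
# tensor([[10., 21., 32.],
#         [13., 24., 35.]])
```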
01:06:47.400 | Having done that, we're now going to do one more trick, which is, I think it was Yannet who asked earlier about could we
01:06:55.840 | Squish the ratings to be between 1 and 5
01:07:00.200 | and the answer is
01:07:03.360 | We could right and specifically what we could do is
01:07:07.680 | we could
01:07:10.560 | Put it through a sigmoid function
01:07:12.600 | All right, so to remind you the sigmoid function
01:07:17.000 | Looks like that right and this is that's one
01:07:24.200 | All right, we could put it through a sigmoid function
01:07:28.040 | So we could take like 4.96 and put it through a sigmoid function and like that. You know that's kind of high
01:07:34.360 | So it kind of be over here somewhere right and then we could multiply that
01:07:39.400 | sigmoid like the result of that by 5
01:07:42.600 | For example right and in this case we want it to be between 1 and 5 right so maybe we might multiply it by 4 and
01:07:51.080 | Add 1 instead, and that's the basic idea
01:07:54.680 | And so here is that trick we take
01:07:58.240 | The result so the result is basically the thing that comes straight out of the dot product plus the addition of the biases
01:08:05.720 | And put it through a sigmoid function now in pytorch
01:08:10.440 | Basically all of the functions you can do the tensors are available
01:08:16.520 | Inside this thing called capital F, and this is like totally standard in pytorch
01:08:23.040 | It's actually called torch dot nn dot functional
01:08:26.000 | But everybody including all of the pytorch docs import torch dot nn dot functional as capital F
01:08:32.220 | Right so capital F dot sigmoid means a function called sigmoid that is coming from
01:08:38.360 | torches
01:08:40.560 | Functional module right and so that's going to apply a sigmoid function to the result
01:08:45.520 | So I've squished them all between 0 and 1 using that nice little shape, and then I can multiply that by
01:08:52.040 | 5 minus 1, i.e. 4
01:08:54.040 | right, and then add on 1, and that's going to give me something between 1 and 5, okay, so
01:08:59.560 | Like there's no need to do this. I could comment it out, and it will still work right
01:09:06.100 | But now it has to come up with a set of calculations that are always between
01:09:10.560 | 1 and 5, right, whereas if I leave this in then it, like, makes it really easy
01:09:15.960 | It's basically like: oh, if you think this is a really good movie, just calculate a really high number
01:09:20.520 | if it's a really crappy movie, calculate a really low number, and I'll make sure it's in the right region
01:09:25.140 | So even though this isn't a neural network
01:09:27.160 | It's still a good example of this kind of thing: if you're doing any kind of parameter fitting
01:09:32.100 | try and make it so that the thing that you want your function to return
01:09:36.240 | is easy for it to return. Okay, so that's why we do that function squishing
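That squishing trick on its own, with made-up numbers (F.sigmoid still works, though newer PyTorch prefers torch.sigmoid):

```python
import torch
import torch.nn.functional as F

min_rating, max_rating = 1.0, 5.0
raw = torch.tensor([-3.0, 0.0, 4.96])   # raw dot-product-plus-bias outputs

# Squash into (0, 1) with a sigmoid, then rescale into the rating range.
pred = F.sigmoid(raw) * (max_rating - min_rating) + min_rating
print(pred)   # roughly tensor([1.19, 3.00, 4.97])
```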
01:09:42.320 | So we call this EmbeddingDotBias
01:09:47.280 | So we can create that in the same way as before you'll see here
01:09:50.960 | I'm calling dot CUDA to put it on the GPU because we're not using any learner stuff normally that'll happen for you
01:09:57.400 | But we have to manually say put it on the GPU
01:09:59.780 | This is the same as before create our optimizer
01:10:02.600 | Fit exactly the same as before and these numbers are looking good and again. We'll do a little
01:10:10.760 | Change to our learning rate and learning rate schedule, and we're down to 0.8. So we're actually pretty close
01:10:17.240 | pretty close
01:10:20.480 | So that's the key steps
01:10:28.080 | This is how
01:10:30.080 | This is how most
01:10:32.720 | Collaborative filtering is done
01:10:35.520 | And Yannet reminded me of an important point, which is that this is not
01:10:40.840 | strictly speaking a matrix factorization, because strictly speaking a matrix factorization would take that matrix by
01:10:50.200 | that matrix to create
01:10:52.760 | this matrix and
01:10:55.560 | remembering
01:10:58.560 | Anywhere that this is empty
01:11:05.040 | Like here or here
01:11:07.040 | We're putting in a zero
01:11:10.080 | Right we're saying if the original was empty put in a zero
01:11:14.400 | right now normally
01:11:16.880 | You can't do that with normal matrix factorization; normal matrix factorization creates the whole matrix
01:11:23.440 | And so it was a real problem actually
01:11:25.600 | When people used to try and use traditional linear algebra for this because when you have these sparse matrices like in practice
01:11:33.720 | this matrix doesn't have many gaps, because we picked the users that watch the most movies and the movies that are the most
01:11:40.980 | watched, but if you look at the whole matrix, it's mainly empty, and so traditional
01:11:46.200 | techniques treated empty as zero, and so, like, you basically have to predict a zero
01:11:52.440 | as if the fact that I haven't watched a movie means I don't like the movie, which gives terrible answers
01:11:57.740 | So this probabilistic matrix factorization approach
01:12:02.880 | takes advantage of the fact that our data structure
01:12:06.960 | Actually looks like this
01:12:09.720 | Rather than that cross tab right and so it's only calculating the loss for the user ID movie ID
01:12:16.000 | Combinations that actually appear that's exactly like user ID one movie ID one or two nine should be three
01:12:21.880 | It's actually three and a half so our loss is point five like there's nothing here. That's ever going to calculate a
01:12:29.280 | Prediction or a loss for a user movie combination that doesn't appear in this table
01:12:33.680 | By definition, the only stuff that can appear in a mini batch is what's in this table
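As a tiny made-up illustration of that point, the loss only ever touches the (user, movie, rating) rows that actually exist, never a dense crosstab:

```python
import torch
import torch.nn.functional as F

# Three observed rows in the long-format ratings table (made-up data).
users   = torch.tensor([0, 0, 1])
movies  = torch.tensor([1, 2, 0])
ratings = torch.tensor([3.5, 4.0, 2.0])

preds = torch.tensor([3.0, 4.5, 2.0])   # whatever the model predicted for those rows
loss  = F.mse_loss(preds, ratings)      # averaged over the observed rows only
print(loss)                             # tensor(0.1667)
```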
01:12:39.720 | And like a lot of this happened interestingly enough actually in the Netflix prize
01:12:48.660 | So before the Netflix prize came along
01:12:52.320 | this probabilistic matrix factorization had actually already been invented, but nobody noticed
01:12:59.440 | Alright, and then in the first year of the Netflix prize
01:13:01.960 | Someone wrote this like really really famous blog post where they basically said like hey check this out
01:13:08.040 | Incredibly simple technique works incredibly well and suddenly all the Netflix leaderboard entries work much much better
01:13:15.560 | And so you know that's quite a few years ago now, and this is like now
01:13:19.960 | Every collaborative filtering approach does this not every collaborative filtering approach adds this sigmoid thing by the way. It's not like
01:13:29.000 | Rocket science this is this is not like the NLP thing we saw last week
01:13:32.760 | Which is like hey, this is a new state-of-the-art like this is you know not particularly uncommon
01:13:37.100 | But there are still people that don't do this and it definitely helps a lot right to have this and so
01:13:42.600 | Actually you know what we could do is maybe now's a good time to have a look at the definition of this right so
01:13:51.400 | the column data
01:13:54.360 | module
01:13:56.040 | Contains all these definitions
01:13:58.040 | and we can now compare this to the thing we originally used which was
01:14:04.560 | Whatever came out of collab filter data set all right, so let's go to
01:14:10.040 | collab
01:14:13.560 | Filter data set here. It is and we called
01:14:17.640 | Get learner right so we can go down to get learner and that created a collab filter learner
01:14:25.800 | passing in the model from get_model, and here is get_model, so it created an EmbeddingDotBias, and
01:14:33.160 | so here is EmbeddingDotBias, and
01:14:37.080 | You can see here here. It is like. It's the same thing. There's the embedding for each of the things
01:14:43.040 | Here's our forward that does the u times i dot sum
01:14:47.880 | plus the two biases, and then the
01:14:49.960 | sigmoid, so in fact
01:14:51.960 | We have just actually rebuilt
01:14:54.680 | What's in the fast AI library literally?
01:14:56.840 | It's a little shorter and easier because we're taking advantage of the fact that there's a special
01:15:05.240 | collaborative filtering data set
01:15:08.360 | So we can actually, we're getting passed in the users and the items, and we don't have to pull them out of cats and conts
01:15:14.360 | But other than that this is exactly the same
01:15:17.440 | So hopefully you can see like the fast AI library is not some inscrutable code containing concepts
01:15:23.120 | You can never understand. We've actually just built up this entire thing from scratch ourselves
01:15:28.480 | And so why did we get?
01:15:31.640 | 0.76 rather than 0.8
01:15:35.720 | You know, I think it's simply because we used stochastic gradient descent with restarts and a cycle multiplier and an Adam optimizer
01:15:45.160 | You know like a few little
01:15:47.160 | training tricks
01:15:50.440 | So I'm looking at this and thinking that we could totally improve this model, but maybe by
01:15:57.800 | looking at the date and doing some tricks with the date. Yes, this is kind of just a regular
01:16:04.400 | kind of model in a way. Yeah, you can add more features. Yeah, it's actually exactly so, like, now that you've seen this
01:16:11.760 | You could now you know even if you didn't have
01:16:16.200 | embedding bias in a notebook that you've written yourself some other model that's in fast AI you could look at it in fast AI and
01:16:22.440 | Be like oh that does most of the things that I'd want to do, but it doesn't deal with time and so you could just go
01:16:28.680 | Oh, okay. Let's grab it. Copy it. You know pop it into my notebook and
01:16:33.560 | Let's create you know the better version
01:16:36.920 | Right, and then you can start playing that and you can now create your own
01:16:41.920 | model plus
01:16:45.040 | from the open source code here, and so
01:16:48.080 | Yeah, you're that's mentioning a couple things we could do we could try incorporating time stamps, so we could assume that maybe
01:16:53.980 | Well, maybe there's just like some
01:16:57.000 | For a particular user over time users tend to get more or less positive about movies
01:17:02.640 | Also remember there was the list of genres for each movie. Maybe we could incorporate that
01:17:09.600 | So one problem is it's a little bit difficult to incorporate that stuff
01:17:15.600 | into this EmbeddingDotBias model, because it's pretty custom, right, so what we're going to do next is
01:17:22.320 | we're going to try to create a
01:17:24.320 | neural net version of this
01:17:27.840 | So the basic idea here is
01:17:32.000 | We're going to
01:17:35.440 | Take exactly the same thing as we had before here's our list of users
01:17:38.720 | right and here is
01:17:41.520 | our embeddings
01:17:43.840 | All right, and here's our list of movies and
01:17:46.440 | here is our
01:17:49.200 | Embeddings right and so as you can see I've just kind of transposed
01:17:52.600 | The movie ones so that they're all in the same orientation
01:17:57.120 | And here is our user movie rating
01:18:00.680 | but de-crosstabbed, okay, so in the original format, so each row is a user movie rating
01:18:09.680 | So the first thing I do is I need to replace
01:18:14.000 | user 14
01:18:16.000 | with that users
01:18:18.040 | Contiguous index right and so I can do that in Excel using this match that basically says
01:18:25.400 | What you know how far down this list you have to go and it said
01:18:29.320 | User 14 was the first thing in that list
01:18:32.880 | Okay, user 29 was the second thing in that list, and so forth, okay?
01:18:37.920 | So this is the same as that thing that we did
01:18:42.040 | In our Python code where we basically created a dictionary to map this stuff
01:18:46.640 | So now we can for this particular user movie rating
01:18:51.540 | Combination we can look up
01:18:54.640 | the appropriate embedding
01:18:56.960 | Right and so you can see here what it's doing is it saying all right. Let's basically offset
01:19:04.000 | from the start of this list
01:19:07.720 | And the number of rows we're going to go down is equal to the user index and the number of columns
01:19:12.200 | We're going to go across is
01:19:13.960 | One two three four five okay, and so you can see what it does is it creates point one nine point six three point three one
01:19:19.960 | Here it is, point one nine. Okay, so this is literally
01:19:24.080 | what an embedding does, but remember
01:19:27.280 | This is exactly the same as
01:19:30.120 | doing a
01:19:32.200 | one hot encoding right because if instead this was a
01:19:37.280 | Vector containing one zero zero zero zero zero right, and we multiplied that by this matrix
01:19:44.440 | Then the only row it's going to return would be the first one okay, so
01:19:49.960 | So it's really useful to remember that embedding
01:19:54.000 | Actually just is a matrix product
01:19:56.680 | The only reason it exists the only reason it exists is because this is an optimization
01:20:03.200 | You know this lets PyTorch know like okay. This is just a matrix multiply
01:20:08.480 | But I guarantee you that you know this thing is one hot encoded
01:20:13.360 | Therefore you don't have to actually do the matrix multiply you can just do a direct look up
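You can check that equivalence directly in PyTorch: looking up an index in an embedding gives the same answer as multiplying a one hot vector by the embedding's weight matrix (sizes here are made up):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(4, 3)            # 4 "users", 3 latent factors each

idx = torch.tensor([2])             # look up user index 2
one_hot = torch.zeros(1, 4)
one_hot[0, 2] = 1.0

lookup = emb(idx)                   # the embedding "array look-up"
matmul = one_hot @ emb.weight       # the equivalent one-hot matrix multiply
print(torch.allclose(lookup, matmul))   # True
```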
01:20:17.540 | Okay, so that's literally all an embedding is: it is a computational
01:20:22.960 | performance thing for a particular kind of matrix multiply. All right, so that looks up that user
01:20:31.320 | and then we can look up that user's movie. All right, so here is movie ID
01:20:36.880 | movie ID
01:20:39.360 | 417 which apparently is index number 14 here. It is here, so it should have been point seven five point four seven
01:20:46.200 | Yes, it is point seven five point four seven, okay, so we've now got the user embedding and the movie embedding
01:20:53.120 | and rather than doing a
01:20:55.800 | dot product of
01:20:59.440 | those two
01:21:01.440 | Right which is what we do normally?
01:21:03.920 | Instead, what if we concatenate the two together into a single vector of length
01:21:12.660 | ten and then feed that into a neural net?
01:21:17.400 | Right and so anytime we've got you know a
01:21:23.280 | tensor of input activations or in this case a tensor of
01:21:29.320 | Actually, this is a tensor of output activations. This is coming out of an embedding layer
01:21:32.840 | We can chuck it in a neural net because neural nets we now know can calculate
01:21:39.040 | Anything okay including hopefully collaborative filtering, so let's try that
01:21:44.960 | So here is our embedding net
01:21:51.880 | This time I have not bothered to create a separate bias
01:21:59.520 | because instead the
01:22:01.760 | Linear layer in PyTorch already has a bias in it right so when we go
01:22:08.960 | nn.linear
01:22:11.280 | right
01:22:13.280 | Let's kind of draw this out
01:22:16.480 | So we've got our
01:22:20.520 | U matrix right and this is the number of users and this is the
01:22:28.040 | number of factors
01:22:29.760 | Right and we've got our M matrix
01:22:32.480 | All right, so here's our number of movies and here's our again number of factors all right, and so remember we look up a
01:22:43.240 | Single user
01:22:48.440 | We look up a single movie and let's grab them and concatenate them together
01:22:55.720 | Right so here's like the user part. Here's the movie part and then let's put that
01:23:01.320 | through a matrix product
01:23:04.600 | Right so that number of rows here is going to have to be the number of users plus the number of movies
01:23:10.920 | because that's how long that is and
01:23:13.600 | then the number of columns
01:23:16.240 | Can be anything we want?
01:23:21.600 | Because we're going to take that so in this case. We're going to pick 10 apparently so it's picked 10 and then we're going to
01:23:27.920 | stick that through a
01:23:29.920 | ReLU and
01:23:31.920 | Then stick that through another
01:23:35.000 | Matrix, which obviously needs to be of size 10 here
01:23:38.440 | And then the number of columns is a size 1 because we want to predict a single rating
01:23:49.760 | Okay, and so that's our kind of flow chart of what's going on right it is a standard
01:23:56.920 | what I'd call a one hidden layer neural net; it depends how you think of it, like there's kind of an embedding layer
01:24:03.820 | But because this is linear and this is linear the two together is really one linear layer, right? It's just a computational convenience
01:24:11.600 | So it's really got one hidden layer because it's just got one layer before this nonlinear activation
01:24:20.460 | so in order to create a
01:24:22.460 | linear layer with some number of rows and some number of columns, you just go nn.Linear in here
01:24:29.460 | In the machine learning class this week
01:24:33.560 | We learned how to create a linear layer from scratch by creating our own weight matrix and our own biases
01:24:40.680 | So if you want to check that out you can do so there right, but it's the same basic technique. We've already seen
01:24:49.240 | We create our embeddings we create our two linear layers
01:24:53.240 | That's all the stuff that we need to start with you know really if I wanted to make this more general
01:24:59.120 | I would have had another parameter here called like
01:25:02.400 | num hidden
01:25:05.640 | you know equals
01:25:07.640 | 10, and then this would be a parameter, and
01:25:13.080 | Then you could like more easily play around with different numbers of activations
01:25:17.400 | So when we say like okay in this layer. I'm going to create a layer with this many activations all I mean
01:25:23.700 | assuming it's a fully connected layer is
01:25:26.360 | My linear layer has how many columns in its weight matrix. That's how many activations it creates
01:25:33.040 | All right, so we grab our users and movies we put them through our embedding matrix, and then we concatenate them together
01:25:41.560 | Okay, so torch dot cat
01:25:43.560 | Concatenates them together on the first dimension, so in other words we concatenate the columns together to create longer rows
01:25:50.840 | Okay, so that's concatenating on dimension one
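For example, with made-up sizes:

```python
import torch

u_emb = torch.randn(2, 5)    # a mini-batch of 2 user embeddings, 5 factors each
m_emb = torch.randn(2, 5)    # the matching 2 movie embeddings

x = torch.cat([u_emb, m_emb], dim=1)    # concatenate along the columns
print(x.shape)    # torch.Size([2, 10]): longer rows, one per rating
```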
01:25:53.920 | Dropout we'll come back to in a moment; we've looked at that briefly
01:25:59.880 | So then having done that we'll put it through that linear layer we had
01:26:07.440 | We'll do our ReLU, and you'll notice that ReLU is again inside our capital F, nn.functional
01:26:15.120 | It's just a function so remember activation functions are basically things that take one activation in and spit one activation out
01:26:23.320 | in this case take in something that can have negatives or positives and
01:26:27.440 | Truncate the negatives to zero. That's all relu does
01:26:31.380 | And then here's our sigmoid
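Putting the pieces just described together, a sketch of that one-hidden-layer net might look like this; the class name, sizes and dropout probabilities are illustrative, not the exact notebook code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50, n_hidden=10,
                 min_rating=1.0, max_rating=5.0):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        self.lin1 = nn.Linear(n_factors * 2, n_hidden)   # hidden layer
        self.lin2 = nn.Linear(n_hidden, 1)               # down to one rating
        self.drop1, self.drop2 = nn.Dropout(0.75), nn.Dropout(0.75)
        self.min_rating, self.max_rating = min_rating, max_rating

    def forward(self, users, movies):
        # Concatenate the two embeddings into one long vector per rating.
        x = self.drop1(torch.cat([self.u(users), self.m(movies)], dim=1))
        x = self.drop2(F.relu(self.lin1(x)))             # linear, then ReLU
        x = self.lin2(x).squeeze()
        # Sigmoid squish, rescaled into the rating range, as before.
        return torch.sigmoid(x) * (self.max_rating - self.min_rating) + self.min_rating
```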
01:26:36.600 | So that's that that is now a genuine
01:26:39.980 | Neural network. I don't know if we get to call it deep. It's only got one hidden layer
01:26:44.880 | But it's definitely a neural network right and so we can now construct it we can put it on the GPU
01:26:50.540 | We can create an optimizer for it, and we can fit it
01:26:54.360 | Now you'll notice there's one other thing. I've been passing to fit which is
01:26:59.440 | What loss function are we trying to minimize?
01:27:02.480 | Okay, and this is the mean squared error loss and again. It's inside F
01:27:06.260 | Okay, pretty much all the functions are inside it, okay?
01:27:11.720 | One of the things that you have to pass fit is something saying like how do you score is what counts as good or bad?
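For example, F.mse_loss on a couple of made-up predictions:

```python
import torch
import torch.nn.functional as F

preds   = torch.tensor([3.0, 4.5])
targets = torch.tensor([3.5, 4.0])
print(F.mse_loss(preds, targets))   # tensor(0.2500), the mean of the squared errors
```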
01:27:18.680 | so Jeremy now that we have a
01:27:22.360 | real neural net, do we have to use the same number of embedding dimensions for users and movies? And that's a great question: you don't, no
01:27:30.440 | It's absolutely right, you don't, and so, like, we've got a lot of benefits here, right, because if we
01:27:36.040 | You know think about
01:27:39.560 | You know we're grabbing a user embedding we're concatenating it with a movie embedding which maybe is like I don't know some different size
01:27:51.960 | but then also perhaps we looked up the genre of the movie and like you know there's actually a
01:28:00.520 | Embedding matrix of like number of genres
01:28:02.960 | By I don't know
01:28:05.480 | Three or something and so like we could then concatenate like a genre embedding and then maybe the time stamp is in here as a continuous
01:28:12.780 | Number right and so then that whole thing we can then feed into
01:28:17.080 | you know
01:28:20.040 | Our neural net right and then at the end
01:28:27.120 | Remember our final non-linearity was a sigmoid right so we can now
01:28:31.080 | Recognize that this thing we did, where we did sigmoid times (max rating minus min rating) plus blah blah blah
01:28:36.560 | Is actually just another?
01:28:39.360 | Nonlinear activation function right and remember in our last layer
01:28:44.520 | We use generally different kinds of activation functions
01:28:48.200 | So as we said we don't need any activation function at all right we could just do
01:28:57.040 | But by not having any nonlinear activation function, we're just making it harder, so that's why we put the sigmoid in there as well, okay
01:29:07.160 | so we can then fit it in the usual way and
01:29:10.940 | There we go you know interestingly we actually got a better score than we did with our
01:29:17.800 | This model
01:29:22.000 | So it'll be interesting to try training this with stochastic gradient descent with restarts and see if it's actually better
01:29:27.480 | You know maybe you can play around with the number of hidden layers and the dropout and whatever else and see if you can
01:29:34.200 | Come up with you know get a better answer than
01:29:37.960 | Point
01:29:44.880 | Seven six ish
01:29:47.840 | Okay, so so general so this is like if you were going deep into collaborative filtering at your workplace
01:29:55.560 | Or whatever this wouldn't be a bad way to go like it's like I'd start out with like oh, okay
01:29:59.840 | Here's like a collaborative filtering data set in fast AI
01:30:02.800 | get_learner, there's, you know, not much I can send it; basically number of factors is about the only thing that I pass in
01:30:09.320 | I can learn for a while maybe try a few different approaches, and then you're like okay. There's like
01:30:17.040 | That's how I go if I use the defaults
01:30:19.200 | Okay, how do I make it better, and then I'd be like digging into the code and saying like okay?
01:30:24.600 | What would Jeremy actually do here? This is actually what I want, you know, and then play around with it
01:30:29.880 | So one of the nice things about the neural net approach
01:30:33.100 | is that, you know, as Yannet mentioned
01:30:36.720 | We can have different numbers of
01:30:39.380 | embeddings
01:30:41.640 | We can choose how many hidden activations, and we can also choose
01:30:46.280 | dropout right so
01:30:48.400 | So what we're actually doing is we haven't just got a ReLU; we're also going, like, okay, let's
01:30:55.560 | let's delete a few things at random
01:31:03.820 | All right, that's dropout. In this case we were deleting,
01:31:09.280 | after the first linear layer,
01:31:14.240 | 75% of them, all right, and then after the second linear layer 75% of them, so we can add a whole lot of regularization
01:31:19.960 | Yeah, so you know this it kind of feels like the this this embedding net
01:31:25.080 | You know you could you could change this again. We could like have it so that we can pass into the constructor
01:31:32.360 | Well, if we wanted to make it look as much as possible like what we had before, we could pass in ps
01:31:41.640 | ps equals 0.75
01:31:44.680 | comma 0.75
01:31:47.160 | I'm not sure this is the best API, but it's not terrible
01:31:50.700 | Probably, since we've only got exactly two layers, we could say p1 equals 0.75
01:31:57.280 | P2 equals 0.75
01:32:07.600 | So then this will be
01:32:09.840 | P1 this will be
01:32:14.440 | p2; you know, there we go. And, like, if you wanted to go further
01:32:21.200 | You could make it look more like our
01:32:25.240 | structured data learner: you could actually have a thing, this number of hidden,
01:32:31.800 | you know, maybe you could make it a list, and so then rather than creating exactly one
01:32:38.800 | hidden layer and one output layer, this could be a little loop that creates n
01:32:43.440 | hidden layers, each one of the size you want. So, like, this is all stuff you can play with during the week if you want to
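A sketch of that generalisation: pass in a list of hidden sizes and dropout ps and build the layers in a loop (the function name and numbers here are made up):

```python
import torch.nn as nn

def make_layers(n_in, n_hiddens, ps):
    # Build Linear -> ReLU -> Dropout blocks from lists of sizes and dropout
    # probabilities, then a final linear layer down to a single output.
    layers, prev = [], n_in
    for nh, p in zip(n_hiddens, ps):
        layers += [nn.Linear(prev, nh), nn.ReLU(), nn.Dropout(p)]
        prev = nh
    layers.append(nn.Linear(prev, 1))
    return nn.Sequential(*layers)

net = make_layers(100, n_hiddens=[10, 10], ps=[0.75, 0.75])
```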
01:32:49.120 | And I feel like if you've got a much smaller collaborative filtering data set
01:32:55.620 | you know, maybe you'd need like more regularization or whatever; if it's a much bigger one
01:33:00.560 | maybe more layers would help. I don't know, you know, I haven't seen
01:33:07.200 | Much discussion of this kind of neural network approach to collaborative filtering
01:33:10.880 | But I'm not a collaborative filtering expert, so maybe it's maybe it's around, but that'd be interesting thing to try
01:33:16.640 | So the next thing I wanted to do was to talk about
01:33:26.120 | The training loop, so what's actually happening inside the training loop?
01:33:34.120 | So at the moment we're basically passing off
01:33:37.600 | the actual updating of the weights to PyTorch's optimizer
01:33:43.840 | But what I want to do is like understand
01:33:47.640 | What that optimizer is is actually doing and we're also I also want to understand what this momentum term is doing
01:33:55.920 | so you'll find we have a
01:34:01.400 | spreadsheet called grad desk gradient descent
01:34:04.560 | And it's kind of designed to be read left to right, sorry, right to left, worksheet-wise
01:34:09.620 | so the rightmost worksheet
01:34:12.560 | Is some data right and we're going to implement gradient descent in Excel because obviously everybody wants to do deep learning in Excel and we've done
01:34:21.440 | Collaborative filtering in Excel we've done
01:34:24.080 | Convolutions in Excel so now we need SGD in Excel so we can replace Python once and for all okay, so
01:34:31.360 | Let's start by creating some data right and so here's
01:34:35.440 | you know here's some
01:34:38.560 | Independent you know I've got one column of X's you know and one column
01:34:44.720 | of y's, and these are actually directly linearly related, so this is random
01:34:51.640 | All right, and this one here is equal to X
01:34:54.860 | times 2 plus
01:34:59.320 | 30 okay, so
01:35:01.320 | Let's try and use Excel to take
01:35:05.640 | That data and try and learn
01:35:09.520 | those parameters
01:35:12.600 | Okay, that's going to be our goal
01:35:17.440 | So let's start with the most basic version of SGD
01:35:21.800 | And so the first thing I'm going to do is I'm going to run a macro so you can see what this looks like
01:35:26.520 | So I hit run, and it does five epochs; I do another five epochs, and another five epochs
01:35:33.680 | Okay, so
01:35:36.280 | The first one was pretty terrible; it's hard to see, so I just delete that first one to get better scaling
01:35:44.360 | All right, so you can see actually it's pretty constantly improving the loss right. This is the loss per epoch
01:35:52.360 | All right, so how do we do that? So let's reset it
01:35:55.040 | So here is my
01:36:00.120 | X's and my Y's and
01:36:02.200 | What I do is I start out by assuming
01:36:05.000 | Some intercept and some slope right so this is my randomly initialized weights
01:36:13.320 | So I have randomly initialized them both to one
01:36:16.120 | You could pick a different random number if you like, but I promise that I randomly picked the number one
01:36:23.400 | Twice there you go
01:36:26.080 | It was a random number between one and one
01:36:30.560 | So here is my intercept and slope. I'm just going to copy them over here right so you can literally see this is just equal
01:36:39.800 | Here is equals C2. Okay, so I'm going to start with my very first row of data x equals 40 and y equals 58
01:36:47.640 | And my goal is to come up
01:36:50.440 | After I look at this piece of data. I want to come up with a slightly better intercept and a slightly better slope
01:36:58.840 | So to do that I need to first of all basically figure out
01:37:04.920 | Which direction is is down in other words if I make my intercept a little bit higher
01:37:11.360 | Or a little bit lower would it make my error a little bit better or a little bit worse?
01:37:16.200 | So let's start out by calculating the error so to calculate the error the first thing we need is a prediction
01:37:22.400 | So the prediction is equal to the intercept
01:37:26.660 | Plus x times slope right so that is our
01:37:32.520 | Zero hidden layer neural network, okay?
01:37:35.560 | And so here is our error: it's equal to our prediction minus our actual, squared
01:37:41.840 | So we could like play around with this. I don't want my error to be 1849. I'd like it to be lower
01:37:48.240 | So what if we set the intercept to?
01:37:53.040 | 1849 goes to 1840 okay, so a higher intercept would be better
01:37:57.360 | Okay, what about the slope if I increase that?
01:38:01.720 | It goes from 1849
01:38:03.720 | To 1730 okay a higher slope would be better as well
01:38:07.520 | Not surprising because we know
01:38:10.120 | Actually that there should be 30 and 2
01:38:12.460 | So one way to
01:38:16.120 | Figure that out
01:38:18.480 | You know encode in the spreadsheet is to do literally what I just did
01:38:22.200 | It's to add a little bit to the intercept and the slope and see what happens
01:38:25.520 | And that's called finding the derivative through finite differencing right and so let's go ahead and do that
01:38:31.920 | So here is the value of my error if I add 0.01
01:38:40.480 | to my intercept, right, so it's C4 plus 0.01, and then I just put that into my linear function
01:38:46.680 | and then I subtract my actual, all squared, right, and so that causes my error to go down a bit as that's increasing
01:38:58.000 | Which one is that increasing? C4: increasing the intercept a little bit has caused my error to go down
01:39:03.900 | So what's the derivative well the derivative is equal to how much the dependent variable changed by?
01:39:10.100 | Divided by how much the independent variable changed by right and so there it is right our
01:39:16.040 | Dependent variable changed by that minus that
01:39:18.600 | Right and our independent variable we changed by 0.01
01:39:22.080 | So there is the estimated value of
01:39:26.080 | the derivative of the error with respect to the intercept
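The same finite differencing estimate in a few lines of Python, using the data point from the spreadsheet (variable names are mine):

```python
def finite_diff(f, x, eps=0.01):
    # Estimate df/dx by nudging x a little and seeing how much f changes,
    # exactly like the extra spreadsheet columns described above.
    return (f(x + eps) - f(x)) / eps

# The squared error of our tiny linear model at one data point (x=40, y=58).
x, y, a, b = 40.0, 58.0, 1.0, 1.0              # current intercept a and slope b
err_of_a = lambda a_: (a_ + b * x - y) ** 2    # error as a function of the intercept
err_of_b = lambda b_: (a + b_ * x - y) ** 2    # error as a function of the slope
print(finite_diff(err_of_a, a), finite_diff(err_of_b, b))   # both negative here
```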
01:39:28.080 | All right, so remember when people talk about derivatives
01:39:31.200 | this is all they're doing: they're saying, what's this value
01:39:35.120 | as we make this number smaller and smaller and smaller and smaller, as it limits to zero
01:39:42.720 | I'm not smart enough to think in terms of, like, derivatives and integrals and stuff like that, so whenever I think about this
01:39:49.360 | I always think about, you know, an actual, like, plus point oh one divided by point oh one, because, like, I just find that
01:39:55.960 | Easier just like I never think about probability density functions. I always think about actual probabilities about toss a coin
01:40:02.880 | Something happens three times
01:40:05.480 | So I always think like remember. It's it's totally fair to do this because a computer is
01:40:11.240 | Discrete it's not continuous like a computer can't do anything infinitely small anyway, right?
01:40:17.880 | So it's actually got to be calculating things at some level of precision right and our brains kind of need that as well
01:40:25.920 | So this is like my version of Geoffrey Hinton's trick: to visualize things in more than two dimensions
01:40:32.000 | You just like say twelve dimensions really quickly while visualizing it in two dimensions
01:40:35.860 | This is my equivalent you know to think about derivatives. Just think about division
01:40:41.920 | And like although all the mathematicians say no you can't do that
01:40:46.120 | You actually can; like, if you think of dy dx as being literally, you know, change in y over change in x, like
01:40:54.200 | the division, actually, like, the calculations still work, like, all the time, so
01:40:59.080 | Okay, so let's do the same thing now with changing
01:41:03.480 | my slope by a little bit
01:41:06.160 | And so here's the same thing right and so you can see both of these are negative
01:41:10.560 | Okay, so that's saying if I increase my intercept my loss goes down if I increase my slope my loss goes down
01:41:20.200 | Right and so my derivative of my error
01:41:24.180 | with respect to my slope is actually pretty high, and that's not surprising because
01:41:32.280 | It's actually
01:41:35.400 | You know the constant term is just being added whereas the slope is being multiplied by 40
01:41:40.000 | Okay now
01:41:46.800 | Finite differencing is all very well and good, but it's a big problem with finite differencing in
01:41:51.920 | High dimensional spaces and the problem is this right and this is like
01:41:56.700 | You don't need to learn
01:42:00.760 | How to calculate derivatives or integrals, but you need to learn how to think about them spatially right and so remember
01:42:07.580 | We have some
01:42:10.160 | Vector very high dimensional vector. It's got like a million items in it right
01:42:16.560 | And it's going through
01:42:18.560 | Some weight matrix right of size like 1 million by size a hundred thousand or whatever and it's spitting out something of size 100,000
01:42:30.560 | So you need to realize like there isn't like a gradient here, but it's like for every one of these things in this vector
01:42:38.820 | Right, there's a gradient in every direction
01:42:43.760 | You know in every part of the output
01:42:46.360 | All right, so it actually has
01:42:49.040 | Not a single gradient number not even a gradient
01:42:53.100 | Vector but a gradient matrix
01:42:56.860 | right and so this
01:43:00.220 | This is a lot to calculate right
01:43:03.880 | I would literally have to like add a little bit to this and see what happens to all of these
01:43:08.920 | Add a little bit to this see what happens to all of these right to fill in
01:43:13.800 | one column of this at a time, so that's going to be
01:43:17.660 | Horrendously slow like that that so that's why like if you're ever thinking like oh we can just do this with finite differencing
01:43:24.720 | Just remember like okay. We're dealing in the with these very high dimensional vectors where
01:43:30.560 | You know this this kind of
01:43:33.880 | Matrix calculus like all the concepts are identical
01:43:39.760 | But when you actually draw it out like this you suddenly realize like okay for each number I could change
01:43:45.760 | There's a whole bunch of numbers that impacts and I have this whole matrix of things to compute right and so
01:43:52.040 | Your gradient calculations can take up a lot of memory, and they can take up a lot of time
01:43:58.080 | So we want to find some way to do this
01:44:01.080 | more quickly
01:44:03.640 | And it's definitely well worth like spending time
01:44:08.480 | kind of studying these ideas of like
01:44:11.360 | you know the idea of like the gradients like look up things like Jacobian and
01:44:17.860 | Hessian
01:44:22.800 | They're the things that you want to search for to start
01:44:26.480 | unfortunately people normally write about them with you know lots of Greek letters and
01:44:34.000 | Blah blah blahs right, but there are some there are some nice
01:44:38.200 | You know intuitive explanations out there, and hopefully you can share them on the forum if you find them because this is stuff
01:44:44.760 | You really need to
01:44:46.760 | Really need to understand in here
01:44:49.120 | You know because
01:44:52.400 | You're trying to train something and it's not working properly and like later on we'll learn how to like look inside
01:44:58.160 | Pytorch to like actually get the values of the gradients, and you need to know like okay
01:45:02.560 | Well, how would I like plot the gradients you know?
01:45:05.640 | What would I consider unusual like you know these are the things that turn you into a really awesome?
01:45:11.040 | deep learning practitioner is when you can like debug your problems by like
01:45:15.760 | Grabbing the gradients and doing histograms of them and like knowing you know that you could like plot that all each layer my
01:45:22.340 | Average gradients getting worse or you know bigger or you know whatever
01:45:26.160 | Okay, so the trick to doing this more quickly is to do it
01:45:31.960 | analytically
01:45:33.200 | Rather than through finite differencing and so analytically is basically there is a list you probably all learned it at high school
01:45:41.440 | There is a literally a list of rules that for every
01:45:44.360 | Mathematical function there's a like this is the derivative of that function right so
01:45:50.400 | You probably remember a few of them
01:45:53.640 | for example
01:45:56.680 | X squared
01:45:59.480 | To X right and so we actually have here an X squared
01:46:03.840 | So here is our two times
01:46:06.400 | now the one that I actually want you
01:46:08.760 | to know is
01:46:11.920 | Not any of the individual rules, but I want you to know the chain rule
01:46:17.040 | right, which is
01:46:19.520 | You've got some function of some function of something
01:46:24.200 | Why is this important? I
01:46:26.920 | don't know... that's a linear layer, that's a ReLU, right, and
01:46:31.320 | Then we can kind of keep going backwards, right?
01:46:35.520 | Etc right a neural net is
01:46:40.080 | Just a function of a function of a function of a function where the innermost is you know it's basically linear
01:46:45.720 | ReLU
01:46:47.840 | linear
01:46:49.080 | ReLU
01:46:52.080 | dot dot dot, linear
01:46:54.540 | sigmoid or softmax
01:46:58.360 | All right, and so it's a function of a function of a function and so therefore to calculate the derivative of
01:47:05.680 | the weights in your model
01:47:08.520 | The loss of your model with respect to the weights of your model
01:47:12.440 | You're going to need to use the chain rule and
01:47:14.440 | Specifically whatever layer it is that you're up to like I want to calculate the derivative here
01:47:19.360 | I'm going to need to use all of these
01:47:21.760 | All of these ones because that's all that's that's the function that's being applied
01:47:25.680 | right and that's why they call this back propagation because the value of the derivative of
01:47:30.760 | that is
01:47:33.920 | equal to
01:47:35.800 | that derivative
01:47:37.800 | Now basically you can do it like this: you can say, let's call
01:47:41.520 | this, let's call that u, right, then it's simply equal to the
01:47:50.520 | derivative of that
01:47:52.320 | times the
01:47:54.320 | derivative of that, right: you just multiply them together
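Written out, the chain rule being used here is just:

$$\frac{d}{dx}\, f(g(x)) = f'(g(x)) \cdot g'(x)$$

so for a network that is a function of a function of a function, the derivative of the loss with respect to any layer's inputs is the product of the derivatives of all the layers that come after it.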
01:47:58.040 | So that's what back propagation is; like, it's not that back propagation is a new thing for you to learn
01:48:04.840 | It's not a new
01:48:07.120 | algorithm it is literally
01:48:09.240 | Take the derivative of every one of your layers and
01:48:13.480 | multiply them all together. So, like, it doesn't deserve a new name, right: "apply the chain rule to my layers"
01:48:21.920 | does not deserve a new name, but it gets one, because us neural networks folk really need to seem as clever as possible
01:48:29.520 | It's really important that everybody else thinks that we are way outside of their capabilities
01:48:34.920 | Right so the fact that you're here means that we failed because you guys somehow think that you're capable
01:48:40.920 | Right so remember. It's really important when you talk to other people that you say back propagation and
01:48:46.640 | Rectified linear unit rather than like multiply the layers
01:48:51.360 | gradients, or "replace negatives with zeros". Okay, so here we go: I've just gone ahead and
01:48:59.480 | grabbed the derivative; unfortunately there is no automatic differentiation in Excel yet
01:49:05.920 | So I did the alternative which is to paste the formula into Wolfram Alpha and got back the derivatives
01:49:12.120 | So there's the first derivative, and there's the second derivative
01:49:14.640 | Analytically, we only have one layer in this
01:49:18.240 | tiny little neural network, so we don't have to worry about the chain rule
01:49:22.880 | and we should see that this analytical derivative is pretty close to our estimated derivative from the finite differencing and
01:49:29.920 | Indeed it is right and we should see that these ones are pretty similar as well, and indeed they are right
01:49:36.440 | and if you're you know back when I
01:49:38.680 | implemented my own neural nets 20 years ago I
01:49:42.560 | You know had to actually calculate the derivatives
01:49:45.600 | And so I always would write like had something that would check the derivatives using finite differencing
01:49:50.800 | And so for those poor people that do have to write these things by hand
01:49:54.080 | You'll still see that they have like a finite differencing checker
01:49:58.280 | So if you ever do have to implement a derivative by hand, please make sure that you
01:50:03.800 | Have a finite differencing checker so that you can test it
01:50:07.480 | All right
01:50:09.480 | So there's our derivatives
01:50:11.480 | So we know that if we increase
01:50:14.080 | B, then we're going to get a slightly better loss, so let's increase B by a bit
01:50:20.340 | How much should we increase it by?
01:50:22.760 | Well we'll increase it by some multiple of this and the multiple
01:50:25.880 | we're going to choose is called a learning rate, and so here's our learning rate: so here's 1e neg 4
01:50:30.360 | Okay, so our new value
01:50:33.520 | Is equal to whatever it was before
01:50:37.960 | Minus our
01:50:42.040 | Derivative times our learning rate, okay, so we've gone from 1 to 1.01 and
01:50:48.400 | then a
01:50:51.280 | We've done the same thing so it's gone from 1 to
01:50:57.640 | So this is a special kind of mini batch: it's a mini batch of size 1, which we call online gradient descent, and online just means a mini batch of size 1
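One online-SGD step from the spreadsheet, written as a small Python sketch (variable names are mine; the derivatives are the analytical ones for the squared error of a + b*x against y):

```python
lr = 1e-4

def sgd_step(a, b, x, y):
    # One mini-batch-of-one update for the model y ≈ a + b*x with squared error.
    err = a + b * x - y
    da, db = 2 * err, 2 * err * x      # analytical derivatives of err**2
    return a - lr * da, b - lr * db

a, b = 1.0, 1.0                        # the "randomly" initialised intercept and slope
a, b = sgd_step(a, b, x=40.0, y=58.0)  # first row of data
print(a, b)                            # both nudged a little towards 30 and 2
```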
01:51:05.080 | So then we can go on to the next one: x is 86
01:51:11.440 | Y is 202 right. This is my intercept and slope copied across from the last row
01:51:19.880 | Okay, so here's my new y prediction. Here's my new error
01:51:24.480 | Here are my derivatives
01:51:27.400 | Here are my new a and B
01:51:29.480 | All right, so we keep doing that for every mini batch of one
01:51:33.120 | and two eventually
01:51:36.600 | We react run out the end of an epoch
01:51:39.280 | Okay, and so then at the end of an epoch we would grab
01:51:43.560 | our intercept and slope and
01:51:48.320 | Paste them back over here. That's our new values
01:51:52.220 | There we are and we can now continue again, right so we're now starting with
01:51:59.400 | Pops today see that in the wrong spot. It should be
01:52:03.640 | paste special transpose values
01:52:07.400 | All right
01:52:09.680 | Okay, so there's our new intercept. There's only slow possibly I've got those the wrong way around
01:52:13.920 | But anyway you get the idea and then we continue okay, so I recorded the world's tiniest macro
01:52:20.720 | which literally just
01:52:23.720 | Copies the final slope and puts it into the new slope copies the final intercept puts it into the new intercept
01:52:36.360 | does that
01:52:37.720 | five times, and after each time it grabs the root mean squared error and pastes it into the next
01:52:43.840 | spare area, and that is attached to this run button, and so that's going to go ahead and do that five times
01:52:50.280 | Okay, so that's stochastic gradient descent in Excel
01:52:55.240 | So it to turn this into a CNN right you would just replace
01:53:02.040 | This error function right and therefore this prediction with the output of that
01:53:08.120 | convolutional example spreadsheet
01:53:11.000 | Okay, and that then would be a CNN being trained with SGD, okay
01:53:18.320 | Now the problem is that you'll see when I run this
01:53:29.320 | It's kind of going very slowly right we know that we need to get to a slope of 2 and an intercept of 30
01:53:35.640 | And you can kind of see at this rate
01:53:37.800 | It's going to take a very long time
01:53:40.560 | Right and specifically
01:53:43.680 | It's like it keeps going the same direction, so it's like come on take a hint. That's a good direction
01:53:55.080 | So the "come on, take a hint, that's a good direction, please keep doing that but more" is called momentum
01:54:00.920 | Right so on our next spreadsheet
01:54:03.720 | We're going to implement momentum
01:54:06.960 | Okay, so
01:54:10.320 | What momentum does is?
01:54:12.800 | the same thing, and, to simplify this spreadsheet, I've removed the finite differencing columns; okay, other than that
01:54:20.520 | this is just the same, right, so we've still got our x's, our y's
01:54:24.720 | A's and B's our predictions
01:54:27.320 | Our error is now over here, okay?
01:54:30.600 | And here's our derivatives, okay?
01:54:34.440 | Our new calculation for this particular row
01:54:40.880 | Our new calculation here for our new a term, just like before, is: it's equal to whatever a was before
01:54:52.880 | minus
01:54:54.880 | okay, now this time I'm not taking the derivative, but I'm using some other number times the learning rate, so what's this other number?
01:55:04.240 | Okay, so this other number is equal to the derivative
01:55:11.520 | Times
01:55:15.160 | What's this K 1?
01:55:21.280 | 0.98 times
01:55:23.400 | the thing just above it
01:55:25.400 | Okay, so this is a linear
01:55:27.680 | interpolation
01:55:29.480 | between this row's derivative, or this mini batch's derivative, and
01:55:33.560 | Whatever direction we went last time
01:55:36.640 | Right so in other words keep going the same direction as you were before
01:55:42.120 | right, then update it a little bit, right, and so in our
01:55:48.000 | Python just before, we had a momentum of 0.9
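The momentum step being described, as a small Python sketch (the gradients below are made up; note that PyTorch's own SGD momentum uses a slightly different, non-interpolated formula by default):

```python
lr, beta = 1e-4, 0.9

def momentum_step(w, grad, prev_step):
    # Linearly interpolate between this mini-batch's gradient and whatever
    # direction we stepped last time: keep going the way we were going,
    # nudged by the new gradient.
    step = beta * prev_step + (1 - beta) * grad
    return w - lr * step, step

w, prev = 1.0, 0.0
for grad in [-1360.0, -1200.0, -1500.0]:    # made-up gradients, all the same sign
    w, prev = momentum_step(w, grad, prev)
print(w, prev)   # the step keeps growing while the gradients keep agreeing
```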
01:55:52.120 | Okay, so you can see what tends to happen is that our?
01:55:58.160 | negative kind of gets more and more negative right all the way up to like 2,000
01:56:04.480 | whereas with our standard SGD approach
01:56:11.320 | our derivatives are kind of all over the place, right? Sometimes there's 700, some negative, some positive hundreds
01:56:17.640 | You know so this is basically saying like yeah, if you've been going
01:56:21.360 | Down for quite a while keep doing that until finally here. It's like okay. That's that seems to be far enough
01:56:28.360 | So that's getting less and less and less negative
01:56:32.760 | right, until eventually we start going positive again
01:56:32.760 | So you can kind of see why it's called momentum
01:56:35.120 | It's like once you start traveling in a particular direction for a particular weight
01:56:39.680 | the wheels kind of start spinning, and then once the gradient turns around the other way
01:56:45.040 | It's like oh slow down. We've got this kind of momentum, and then finally turn back around
01:56:49.680 | All right, so when we do it this way
01:56:52.520 | All right, we can do exactly the same thing, right, and after five iterations we're at 89,
01:57:03.640 | whereas before, after five iterations, we were at 104, right, and after a few more, let's do maybe 15
01:57:12.560 | Okay, so it's 102 for us here
01:57:20.560 | It's going, right, so it's a bit better. It's not heaps better. You can still see like
01:57:33.000 | These numbers they're not
01:57:35.000 | Zipping along right, but it's definitely an improvement and it also gives us something else to tune
01:57:40.920 | Which is nice like so if this is kind of a well-behaved error surface right in other words like
01:57:46.480 | Although it might be bumpy along the way. There's kind of some overall
01:57:51.200 | Direction like imagine you're going down a hill right and there's like bumps on it right so the more momentum
01:57:57.980 | you've got, the more we're going to skip over the tops, right, so we could say like okay
01:58:01.220 | Let's increase our beta up to point nine eight
01:58:03.220 | right, and see if that like allows us to train a little faster, and whoa, look at that, suddenly we're at 22
01:58:09.640 | All right so one nice thing about things like momentum is it's like another parameter that you can tune to try and make your model
01:58:16.720 | train better in practice
01:58:18.720 | Basically everybody does this; you look at any ImageNet winner or whatever, they all use momentum
01:58:30.000 | okay, and
01:58:36.200 | Back over here, when we said use SGD, that basically means use the basic tab of our Excel spreadsheet
01:58:45.520 | But then momentum equals point nine
01:58:48.080 | means add in
01:58:50.800 | Put a point nine over here
01:58:53.720 | okay, and
01:58:55.760 | so that that's kind of your like
01:58:59.640 | default starting point
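In PyTorch terms, that default corresponds to something like the following sketch; the nn.Linear here is just a stand-in for whatever model you're actually training.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # stand-in model

# plain SGD is the "basic" tab; momentum=0.9 adds the momentum column
opt = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
```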
01:59:01.640 | So let's keep going and talk about
01:59:10.440 | So Adam is something which I
01:59:15.640 | was actually not right about earlier on in this course. I said we've been using Adam by default.
01:59:22.500 | We actually haven't; I've noticed that we've actually been using SGD with momentum by default, and the reason is
01:59:33.420 | that Adam, as you'll see, is much, much faster to learn with, but there have been some problems,
01:59:39.100 | which is that people haven't been getting quite as good final answers with Adam as they have with SGD with momentum.
01:59:45.660 | And that's why you'll see like all the, you know, ImageNet-winning
01:59:48.660 | solutions and so forth and all the academic papers always use SGD with momentum and not Adam. Adam
01:59:56.220 | seems to be a particular problem in NLP; people really haven't got Adam working well at all.
02:00:00.500 | The good news is it looks like this was solved two weeks ago.
02:00:08.760 | Basically, it turned out that the way people were dealing with the combination of weight decay and Adam
02:00:16.580 | had a nasty kind of bug in it, basically,
02:00:19.540 | and that's kind of carried through to every single library, and
02:00:24.060 | one of our students, Anand Saha, has actually just
02:00:27.580 | completed a prototype of adding this new version of Adam, which is called AdamW, into fastai
02:00:34.420 | And he's confirmed that he's getting both the faster
02:00:39.780 | performance and also the better accuracy. So hopefully we'll have this AdamW in fastai
02:00:47.780 | Ideally before next week. We'll see how we go very very soon
02:00:51.020 | So so it is worth telling you about about Adam
02:00:54.980 | So let's talk about it, it's actually incredibly simple
02:01:00.760 | But again, you know make sure you make it sound really complicated when you tell people so that you can look clever
02:01:07.180 | So here's the same spreadsheet again, right and here's our
02:01:12.620 | Randomly selected A and B again somehow it's still one. Here's our prediction. Here's our derivatives
02:01:20.020 | Okay, so now how we calculating our new A and our new B
02:01:23.800 | You can immediately see it's looking pretty hopeful because even by like row 10
02:01:30.460 | We're like we're seeing the numbers move a lot more. Alright, so this is looking pretty encouraging
02:01:36.300 | So how are we calculating this
02:01:40.300 | It's equal to our previous value of B
02:01:45.900 | Minus J8. Okay, so we're gonna have to find out what that is
02:01:50.140 | times
02:01:53.540 | our learning rate
02:01:55.580 | divided by the square root of L8. Okay, so we're gonna have to dig in and see what's going on
02:02:00.340 | One thing to notice here is that my learning rate is way higher than it used to be
02:02:05.860 | But then we're dividing it by this
02:02:08.540 | Big number. Okay, so let's start out by looking and seeing what this J8 thing is
02:02:16.940 | J8 is identical to what we had before J8 is equal to the linear interpolation of the derivative and
02:02:26.540 | the previous direction
02:02:29.740 | Okay, so that was easy
02:02:32.300 | So one part of atom is to use momentum in the way we just defined
02:02:37.300 | Okay, the second piece was to divide by square root L8, what is that?
02:02:43.620 | square root L8, okay is another linear interpolation of something and
02:02:50.260 | Something else and specifically it's a linear interpolation of
02:02:55.340 | F8 squared, okay. It's a linear interpolation of the derivative squared
02:03:02.660 | Along with the derivative squared last time. Okay, so in other words, we've got two pieces of
02:03:11.860 | momentum going on here. One is
02:03:14.180 | calculating the
02:03:17.140 | momentum
02:03:18.740 | version of the gradient the other is calculating the momentum version of the gradient squared and
02:03:25.140 | We often refer to this idea as a
02:03:29.420 | Exponentially weighted moving average in other words
02:03:33.540 | It's basically equal to the average of this one and the last one and the last one and the last one, but we're like multiplicatively
02:03:40.380 | decreasing the previous ones because we're multiplying it by
02:03:43.220 | 0.9 times 0.9 times 0.9 times 0.9. And so you actually see that for instance in the fast AI code
02:03:51.340 | If you look at fit
02:04:02.740 | We don't just calculate the average loss, right?
02:04:09.660 | because
02:04:10.860 | What I actually want, well, we certainly don't just report the loss for every mini-batch, because that just bounces around so much
02:04:16.700 | So instead I say average loss is equal to whatever the average loss was last time
02:04:26.500 | times 0.98
02:04:26.500 | Plus the loss this time times 0.02
02:04:29.900 | Right. So in other words, the fastai library,
02:04:33.700 | the thing that it's actually doing when you do like the learning rate finder or plot loss,
02:04:38.340 | It's actually showing you the exponentially weighted moving average of the loss
02:04:43.260 | Okay, so it's like a really handy concept. It appears quite a lot
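The smoothing itself is tiny. Here's a sketch of the idea (not the actual fastai source, and the batch_losses list is made up just to have something to smooth):

```python
batch_losses = [2.0, 1.4, 3.1, 0.9, 1.2, 2.5]   # pretend per-mini-batch losses
avg_loss, beta = None, 0.98

for loss in batch_losses:
    # exponentially weighted moving average: old average * 0.98 + new loss * 0.02
    avg_loss = loss if avg_loss is None else avg_loss * beta + loss * (1 - beta)
    print(f"raw {loss:.2f}  smoothed {avg_loss:.3f}")
```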
02:04:48.180 | Right the other handy concept to know about it's this idea of like you've got two numbers
02:04:54.500 | One of them is multiplied by some value. The other is multiplied by 1 minus that value
02:04:59.980 | So this is a linear interpolation with two values. You'll see it all the time and
02:05:05.300 | for some reason
02:05:08.100 | Deep learning people nearly always use the value alpha when they do this
02:05:12.380 | So like keep an eye out if you're reading a paper or something and you see like alpha times blah blah blah blah plus
02:05:18.860 | 1 minus alpha
02:05:21.380 | Times some other blah blah blah blah right immediately like when people read papers
02:05:27.100 | None of us like read everything in the equation. We look at it. We go Oh linear interpolation
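Written out in symbols (my notation), the pattern to spot is just

$$ y_t = \alpha\, x_t + (1 - \alpha)\, y_{t-1} $$

i.e. a linear interpolation between the new value $x_t$ and the running value $y_{t-1}$, weighted by $\alpha$.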
02:05:33.900 | Right and I say something I was just talking to Rachel about yesterday is like
02:05:37.940 | Whether we could start trying to find like a new way of writing papers where we literally refactor them
02:05:43.540 | Right like it'd be so much better to have written like
02:05:46.340 | linear interpolate
02:05:49.060 | Blah blah blah comma blah blah blah right because then you don't have to have that pattern recognition right but until we
02:05:55.940 | Convinced the world to change how they write papers
02:05:58.540 | This is what you have to do is you have to look you know
02:06:01.340 | Know what to look for right and once you do suddenly the huge page with formulas
02:06:07.420 | aren't bad at all; like, you often notice, for example, the two things in here, like they might be totally identical
02:06:14.900 | But this might be at time t and this might be at like time t minus 1 or something
02:06:19.060 | Right like it's very often these big ugly formulas turn out to be
02:06:23.100 | really, really simple, if only they had refactored them
02:06:28.580 | So what are we doing with this gradient squared?
02:06:31.100 | So what we were doing with the gradient squared is
02:06:35.860 | We were taking the square root, and then we were adjusting the learning rate by dividing the learning rate by that
02:06:43.740 | Okay, so gradient squared is always positive
02:06:49.020 | okay, and
02:06:51.740 | We're taking the exponentially weighted moving average of a bunch of things that are always positive
02:06:57.160 | And then we're taking the square root of that
02:06:59.160 | All right, so when is this number going to be high?
02:07:01.780 | It's going to be particularly high if there's like one big change,
02:07:05.840 | you know, if the gradient's got a lot of variation, right, so there's a high variance of gradient,
02:07:11.760 | then this G squared thing is going to be a really high number, whereas if it's like a constant
02:07:18.060 | amount, right, it's going to be smaller, because when you add things that are squared, the big squares
02:07:25.020 | kind of jump out much bigger, whereas if there wasn't much change, it's not going to be as big. So basically
02:07:32.040 | This number at the bottom here
02:07:34.860 | It's going to be high
02:07:37.700 | If our gradient is changing a lot now, what do you want to do if?
02:07:42.060 | You've got something which is like first negative and then positive and then small and then high
02:07:47.300 | right
02:07:49.660 | Well you probably want to be more careful right you probably don't want to take a big step
02:07:55.100 | Because you can't really trust it right so when the when the variance of the gradient is high
02:08:00.340 | We're going to divide our learning rate by a big number
02:08:03.220 | Whereas if our gradient is
02:08:06.060 | Very similar kind of size all the time then we probably feel pretty good about this step
02:08:11.340 | So we're dividing it by a small amount
02:08:13.460 | And so this is called an adaptive learning rate and like a lot of people will have this confusion about Adam
02:08:20.300 | I've seen it on the forum actually where people are like, isn't there some kind of adaptive learning rate where somehow you're like setting different
02:08:27.220 | Learning rates for different layers or something. It's like no not really
02:08:32.500 | Right all we're doing is we're just saying like just keep track of the average of the squares of the gradients and use that
02:08:40.620 | To adjust the learning rate, so there's still one learning rate
02:08:44.140 | Okay, in this case. It's one
02:08:47.300 | right, but effectively every parameter at every epoch is
02:08:52.340 | kind of getting a bigger jump if the gradient's been pretty constant for that weight, and a smaller jump
02:09:00.760 | Otherwise okay, and that's Adam. That's the entirety of Adam in
02:09:05.260 | Excel right so there's now no reason at all why you can't
02:09:09.100 | train ImageNet in Excel, because you've got access to all of the pieces you need
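To back up the claim that it really is that simple, here's the whole thing for our two-parameter line fit as a Python sketch. The names and constants are mine (standard defaults), and I've left out the bias-correction terms from the Adam paper to keep it short.

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(100); y = 2 * x + 30   # same toy data: slope 2, intercept 30
a, b = 1.0, 1.0
lr = 1.0                                  # note: much higher than before, as discussed
beta1, beta2, eps = 0.9, 0.999, 1e-8
m_a = m_b = v_a = v_b = 0.0               # EWMAs of the gradient and the gradient squared

for xi, yi in zip(x, y):
    err = (a * xi + b) - yi
    da, db = 2 * err * xi, 2 * err
    # momentum of the gradient
    m_a = beta1 * m_a + (1 - beta1) * da
    m_b = beta1 * m_b + (1 - beta1) * db
    # momentum of the gradient squared
    v_a = beta2 * v_a + (1 - beta2) * da ** 2
    v_b = beta2 * v_b + (1 - beta2) * db ** 2
    # adaptive step: divide the learning rate by the sqrt of the squared-gradient average
    a -= lr * m_a / (np.sqrt(v_a) + eps)
    b -= lr * m_b / (np.sqrt(v_b) + eps)
```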
02:09:13.860 | And so let's try this out run
02:09:16.660 | Okay, that's not bad, right: five iterations and we're straight up to 29 and 2, right, so the difference between, like,
02:09:26.780 | you know, standard SGD and this is huge, and basically the key difference was that it figured out that we need to be
02:09:35.820 | you know
02:09:37.060 | moving this number
02:09:38.860 | much faster, okay, and so and so it did and
02:09:42.620 | So you can see we've now got like
02:09:46.020 | two different parameters one is kind of the momentum for the gradient piece the other is the momentum for the gradient squared piece and
02:09:53.540 | I think they're called the betas;
02:09:59.300 | I think when you want to change them in PyTorch there's a thing called betas,
02:10:02.840 | which is just a tuple of two numbers you can change
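So in PyTorch, changing them looks something like this (the nn.Linear is just a stand-in model):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # stand-in model

# betas = (momentum for the gradient, momentum for the gradient squared)
opt = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```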
02:10:05.420 | Jeremy so
02:10:10.780 | So you said the
02:10:13.700 | yeah, I think I understand this concept of you know when they
02:10:17.700 | When a gradient is it goes up and down then you're not really sure
02:10:23.020 | Which direction should should go so you should kind of slow things down therefore you subtract that gradient from the learning rate
02:10:30.300 | So but how do you implement that how far do you go?
02:10:34.620 | I guess maybe I missed something early on you do you set a number somewhere we divide
02:10:41.460 | Yeah, we divide the learning rate
02:10:43.460 | Divided by the square root of the moving average gradient squared, so that's where we use it. Oh
02:10:50.940 | I'm sorry, can you be a little more specific? Sure, so D2 is the learning rate, which is 1, yeah; M27 is
02:11:00.620 | our moving average of the squared gradients
02:11:03.260 | So we just go d2 divided by square root m27
02:11:08.060 | That's it
02:11:12.220 | Okay, thanks I
02:11:14.220 | Have one question yeah, so
02:11:17.340 | The new method that you just mentioned, which is in the process of getting implemented in, yes, AdamW? Yeah, AdamW.
02:11:25.060 | How different is it from here? Okay, let's do that, so
02:11:31.100 | To understand Adam W. We have to understand weight decay
02:11:35.260 | And maybe we'll learn more about that later. Let's see how we go now with weight decay
02:11:40.340 | so the idea is that
02:11:42.340 | when you have
02:11:45.060 | Lots and lots of parameters like we do with you know most of the neural nets we train
02:11:50.540 | You very often have like more parameters than data points, so, you know, regularization becomes important
02:11:57.940 | And we've learned how to avoid overfitting by using dropout right which randomly deletes some activations
02:12:06.460 | In the hope that it's going to learn some kind of more resilient set of weights
02:12:11.020 | There's another kind of regularization
02:12:13.740 | We can use called weight decay or L2 regularization
02:12:17.740 | And it's actually a kind of classic statistical technique, and the idea is that we take our loss function
02:12:24.300 | Right so we take our like
02:12:26.660 | Error squared loss function and we add an additional piece to it
02:12:31.200 | Let's add weight decay right now
02:12:34.740 | The additional piece we add is
02:12:36.740 | To basically add the square of the weights, so we'd say plus
02:12:42.620 | B squared
02:12:45.540 | Plus a squared
02:12:48.540 | Okay, that is now
02:12:52.680 | Weight decay or L2 regularization and so the idea is that now
02:13:02.100 | the loss function wants to keep the weights small, right, because increasing the weights makes the loss worse, and
02:13:09.380 | So it's only going to increase the weights if the loss improves by more
02:13:15.180 | than the amount of that penalty. And in fact, to make this into proper weight decay, we then need some
02:13:21.140 | multiplier here, right, so if you remember back in our code here, we said weight decay wd=5e-4
02:13:31.780 | Okay, so to actually use the same weight decay. I would have to multiply by
02:13:34.780 | 0.0005
02:13:37.380 | So that's actually now the same weight decay, so
02:13:46.500 | if you have a really high weight decay, then it's going to set all the parameters to zero
02:13:50.500 | So it'll never over fit right because it can't set any parameter to anything
02:13:55.700 | And so as you gradually decrease the weight decay a few more weights
02:14:01.500 | Can actually be used right, but the ones that don't help much. It's still going to leave at zero or close to zero, right?
02:14:09.340 | So that's what that's what weight decay is is literally to change the loss function to add in this
02:14:17.420 | Sum of squares of weights
02:14:21.460 | times
02:14:23.420 | some parameter, some hyperparameter, you see.
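For our two-parameter example, that L2-penalised loss is just the following sketch (names are mine; wd plays the role of the 5e-4 above):

```python
import numpy as np

def l2_penalised_loss(a, b, x, y, wd=5e-4):
    mse = np.mean(((a * x + b) - y) ** 2)   # the original error-squared loss
    penalty = wd * (a ** 2 + b ** 2)        # weight decay: wd times the sum of squared weights
    return mse + penalty
```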
02:14:26.360 | the problem is that
02:14:30.060 | If you put that into the loss function as I have here
02:14:33.700 | Then it ends up in the moving average of gradients and the moving average of squares of gradients
02:14:40.100 | For Adam right and so basically we end up
02:14:44.780 | When there's a lot of variation
02:14:48.020 | we end up
02:14:50.780 | Decreasing the amount of weight decay, and if there's very little variation we end up increasing the amount of weight decay, so we end up
02:14:57.460 | basically saying
02:15:00.140 | penalizing parameters, you know, weights that are really high,
02:15:03.200 | unless their gradient varies a lot, which is never what we intended, right? That's just not the plan at all
02:15:12.060 | So the trick with Adam W is we basically remove
02:15:16.220 | Weight decay from here
02:15:19.020 | So it's not in the loss function. It's not in the G not in the G squared
02:15:23.100 | And we move it so that instead it's added directly to the update step:
02:15:30.260 | when we update with the learning rate, it's added there instead. So in other words,
02:15:34.800 | we would put the weight decay, or actually the gradient with the weight decay, in here when we calculate the new a and new b
02:15:41.740 | So it never ends up in our G and G squared
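Here's a sketch of a single-parameter AdamW-style step to make that concrete; the names are mine and it follows the idea as described here rather than any particular library's implementation.

```python
def adamw_step(a, grad, m, v, lr=1e-3, wd=5e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    # the moving averages only ever see the plain gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # weight decay is applied directly in the update step, outside m and v
    a = a - lr * (m / (v ** 0.5 + eps) + wd * a)
    return a, m, v
```

With the Adam-plus-L2 approach you would instead add wd * a to grad before the two moving-average lines, which is exactly how the penalty ends up leaking into m and v.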
02:15:45.940 | So that was like a super fast
02:15:48.620 | description which will probably only make sense if you listen to it three or four times on the video and then talk about it
02:15:54.820 | on the forum
02:16:03.620 | Yeah, but if you're interested, let me know, and we can also look at Anand's code that's implemented this
02:16:03.620 | And you know the the idea of using weight decay is it's a really helpful
02:16:11.980 | regularizer
02:16:15.020 | Because it's basically this way that we can kind of say like
02:16:17.940 | You know, please don't increase any of the weight values unless the
02:16:26.700 | you know improvement in the loss
02:16:28.700 | Is worth it?
02:16:31.540 | And so generally speaking pretty much all state-of-the-art models have both dropout and weight decay
02:16:38.220 | And I don't claim to know like how to set each one and how much of each to use,
02:16:44.320 | other than to say it's worth trying both
02:16:47.920 | To go back to the idea of embeddings
02:16:52.220 | Is there any way to interpret the final embeddings? Absolutely, we're gonna look at that next week. It's super fun
02:16:59.340 | It turns out that, you know, we'll learn what some of the worst movies of all time are.
02:17:03.540 | It's like, um, one of the John Travolta Scientology ones, like Battlefield Earth or something
02:17:11.820 | I think that was like the worst movie of all time according to our embeddings
02:17:19.420 | Do you have any recommendations for scaling the L2 penalty, or is that kind of based on how wide the nodes are or how many
02:17:27.100 | No, I have no
02:17:29.100 | suggestion at all; I kind of look for papers or Kaggle competitions or whatever's similar and frankly try to set it up
02:17:37.780 | the same. It seems like in a particular area, like computer vision object recognition,
02:17:44.620 | somewhere between 1e-4 and 1e-5 seems to work, you know
02:17:49.600 | actually in the Adam W paper
02:17:54.260 | The authors point out that with this new approach it actually becomes like it seems to be much more stable
02:17:59.220 | As to what the right weight decay amounts are so hopefully now when we start playing with it, we'll be able to have some
02:18:05.040 | definitive recommendations by the time we get to part 2
02:18:08.120 | All right. Well, that's 9 o'clock. So
02:18:11.340 | this week
02:18:14.420 | You know practice the thing that you're least familiar with so if it's like Jacobians and Hessians read about those if it's broadcasting
02:18:21.300 | Read about those; if it's understanding Python OO, read about that, you know, try and implement your own custom layers
02:18:27.560 | Read the fastai layers, you know, and talk on the forum about anything that you find
02:18:33.860 | Weird or confusing? All right. See you next week