Back to Index

Lesson 5: Deep Learning 2018


Chapters

0:00 Intro
6:15 Kaggle Seedlings Competition
7:48 Learning Objectives
9:43 MovieLens Data
11:22 XLS
12:24 Matrix Factorization
37:03 PyTorch
40:13 Forward
41:31 Embedding matrices
43:48 Creating a Python module
47:47 Basic initialization
50:13 Minibatching
52:13 Creating an optimizer
54:40 Writing a training loop
55:56 Fast AI

Transcript

Welcome back. So we had a busy lesson last week, and I was really thrilled to see that one of our master's students here at USF took what we learned about structured deep learning and turned it into a blog post which, as I suspected, has been incredibly popular, because it's just something people didn't know about. It actually ended up getting picked up by the Towards Data Science publication, which I quite like actually, if you're interested in keeping up with what's going on in data science.

It's quite good medium publication and so Karen talked about structured deep learning and basically introduced you know that the basic ideas that we learned about last week and It got picked up quite quite widely one of the one of the things I was pleased to see actually Sebastian Ruder who actually mentioned in last week's class as being one of my favorite researchers Tweeted it and then somebody from Stitch Fix said oh, yeah, we've actually been doing that for ages, which is kind of cute I Kind of know that this is happening in industry a lot and I've been telling people this is happening in industry a lot But nobody's been talking about it And now the Karen's kind of published a blog saying hey check out this cool thing and now Stitch Fix is like yeah We're doing that already so So that's been great Great to see and I think there's still a lot more that can be dug into with this structured deep learning stuff You know to build on top of Karen's post would be to maybe like experiment with some different data sets Maybe find some old Kaggle competitions and see like there's some competitions that you could now win with this or some which doesn't work For would be equally interesting and also like just Experimenting a bit with different amounts of dropout different layer sizes, you know Because nobody much has written about this.

I don't think there's been any blog posts about this before that. I've seen anywhere There's a lot of unexplored territory. So I think there's a lot we could we could build on top of here And there's definitely a lot of interest. I saw one person on Twitter saying this is what I've been looking for ages another thing which I was pleased to see is Nicky all who we saw his cricket versus baseball Predictor as well as his currency predictor after lesson one Went on to Download something a bit bigger which was to download a couple of hundred of images of actors and he manually Went through and checked which well I think first of all he like used Google to try and find ones with glasses and ones without then he manually went through and Check that they had been put in the right spot and this is a good example of one where Vanilla resnet didn't do so well with just the last layer And so what Nicole did was he went through and tried unfreezing the layers and using differential learning rates and got up to 100% accuracy and the thing I like about these things that Nicole is doing is the way he's He's not downloading a Kaggle data set.

He's like deciding on a problem that he's going to try and solve He's going from scratch from Google And he's actually got a link here even to a suggested way to help you download images from Google So I think this is great and I actually gave a talk just this afternoon at Singularity University to a Executive team of one of the world's largest telecommunications companies and actually showed them this post because the Folks there were telling me that that all the vendors that come to them and tell them they need like Millions of images and huge data centers full of hardware, and you know they have to buy special Software that only these vendors can provide and I said like actually this person's been doing a course for three weeks now And look at what he's just done with a computer that cost him 60 cents an hour And they were like they were so happy to hear that like okay.

They're, you know - this actually is in the reach of normal people. I'm assuming Nicole's a normal person, I haven't actually checked; if you're proudly abnormal, Nicole, I apologize. I actually went and had a look at his cricket classifier, and I was really pleased to see that his code actually is the exact same code that we used in lesson one.

I was hoping that would be the case. You know, the only thing he changed was the number of epochs, I guess. So this idea that we can take those four lines of code and reuse them to do other things has definitely turned out to be true, and so these are good things to show, like, at your organization, if you're anything like the executives at this big company

I spoke to today. There'll be a certain amount of like Not to surprise but almost like pushback of like if this was true somebody there's you know they basically said that this is true Somebody would have told us so like why isn't everybody doing this already so like I think you might have to actually show them You know maybe you can build your own with some internal data You've got at work or something like here.

It is you know didn't cost me anything. It's all finished Viddly or vitally I don't know how to pronounce his name correctly has done another very nice post on just an introductory post on how we train neural networks, and I wanted to point this one out as being like I think This is one of the participants in this course Who's just got a particular knack for technical communication, and I think we can all learn from you know from his posts about about good technical writing What I really like particularly is that he?

He assumes almost nothing - like, he has a kind of a very chatty tone and describes everything - but he also assumes that the reader is intelligent. So he's not afraid to say here's a paper, or here's an equation, or whatever, but then he's going to go through and tell you exactly what that equation means. So it's kind of this nice mix of writing respectfully for an intelligent audience, but also not assuming any particular background knowledge. Then I made the mistake earlier this week of posting a picture of my first-place position on the Kaggle seedlings competition, at which point five other fast AI students posted their pictures of them passing me over the next few days. So this is the current leaderboard for the Kaggle plant seedlings competition; I believe the top six are all fast AI students, or its teachers. And so I think this is like a really - oh look, James has just passed, he was first - this is a really good example of what you can do. This is, I'm trying to think, it was like a small number of thousands of images, and most of the images were less than a hundred pixels by a hundred pixels. And yet, you know, my approach was basically to say let's just run through the notebook we have, pretty much with the defaults; it took me, I don't know, an hour

I don't know an hour And I'm I think the other students doing a little bit more than that But not a lot more and basically what this is saying is yeah, these these techniques Work pretty reliably to a point where people that aren't using the fast AI libraries You know literally really struggling I suspect all these are fast AI students.

You might have to go down quite a way So I thought that was very interesting and really really cool So today we're going to Start what I would kind of call like the second half of this course so the first half of this course has been like getting through Like these are the applications that we can use this for The here's kind of the code you have to write Here's a fairly high level ish description of what it's doing and We're kind of we're kind of done for that bit and what we're now going to do is go in reverse We're going to go back over all of those exact same things again But this time we're going to dig into the detail of everyone and we're going to look inside the source code of a fast AI Library to see what it's doing and try to replicate that so in a sense like there's not going to be a lot more Best practices to show you like I've kind of shown you the best best practices I know But I feel like for us to now build on top of those to debug those models to come back to part 2 Where we're going to kind of try out some new things, you know, it really helps to understand what's going on Behind the scenes.

Okay, so the goal here today is we're going to try and create a pretty effective collaborative filtering model Almost entirely from scratch so we'll use the kind of we'll use pytorch as a Automatic differentiation tool and as a GPU programming tool and not very much else. We'll try not to use its neural net features We'll try not to use Fast AI library any more than necessary.

So that's the goal so Let's go back and you know, we only very quickly look at collaborative filtering last time So let's let's go back and have a look at collaborative filtering. And so we're going to look at this movie lens data set so the movie lens data set Basically is a list of ratings it's got a bunch of different users that are represented by some ID and a bunch of movies that are represented by some ID and Rating it also has a timestamp.

I haven't actually ever tried to use this I guess this is just like what what time did that person rate that movie? So that's all we're going to use for modeling is three columns user ID movie ID and rating and so thinking of that in kind of Structured data terms user ID and movie ID would be categorical variables we have two of them and rating would be a Would be a dependent variable We're not going to use this for modeling but we can use it for looking at stuff later We can grab a list of the names of the movies as well And you could use this genre information.
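
For reference, loading the data just described might look like this - a minimal sketch, where the path and file names assume the standard MovieLens ml-latest-small download rather than anything shown in the lesson:

```python
# Sketch: loading the MovieLens ratings and movie names described above.
import pandas as pd

path = 'data/ml-latest-small/'
ratings = pd.read_csv(path + 'ratings.csv')   # userId, movieId, rating, timestamp
movies = pd.read_csv(path + 'movies.csv')     # movieId, title, genres (for labelling later)
ratings.head()
```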

I haven't tried to be interested if during the week anybody tries it and finds it helpful I guess as you might not find it helpful. We'll see so In order to kind of look at this better. I just grabbed the Users that have watched the most movies and the movies that have been the most watched And made a cross tab of it right so this is exactly the same data But it's a subset and now rather than being user movie rating we've got user movie rating And so some users haven't watched some of these movies.

That's why some of these are not a number, okay? Then I copied that into Excel, and you'll see there's a thing called collab_filter.xls - if you don't see it there now, I'll make sure I've got it there by tomorrow - and here is where I've copied that table. Okay, so as I go through this set-up of the problem and how it's described and stuff, if you're ever feeling lost, feel free to ask, either directly or through the forum. If you ask through the forum and somebody answers it there, great; but if somebody else asks a question you would like answered, of course just like it, and Yannet will keep an eye out for that. Because we're digging into the details of what's going on behind the scenes,
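
The cross-tab just described could be built along these lines - a sketch, not the exact notebook code; it reuses the ratings frame loaded above:

```python
# The 15 most active users x the 15 most-rated movies, pivoted into a grid of
# ratings, with NaN wherever a user hasn't rated a movie.
top_users = ratings.groupby('userId')['rating'].count().nlargest(15).index
top_movies = ratings.groupby('movieId')['rating'].count().nlargest(15).index
subset = ratings[ratings.userId.isin(top_users) & ratings.movieId.isin(top_movies)]
crosstab = subset.pivot_table(index='userId', columns='movieId', values='rating')
```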

It's kind of important that at each stage you feel like okay. I can see What's going on? Okay, so we're actually not going to build a neural net to start with Instead we're going to do something called a matrix factorization The reason we're not going to build a neural net to start with is that it so happens.

There's a really, really simple kind of way of solving these kinds of problems, which I'm going to show you. And so if I scroll down, what I've got here is basically the same thing, but this time these are my predictions rather than my actuals, and I'm going to show you how I created these predictions. Okay, so here are my actuals, here are my predictions, and then down here we have our score, which is the square root of the average of the squared differences - so this is RMSE down here.

Okay, so on average our randomly initialized model is out by 2.8. So let me show you what this model is, and I'm going to show you by saying: how do we guess how much user ID number 14 likes movie ID number 27? And the prediction here - this is just, at this stage, still random - is 0.91. So how are we calculating 0.91? And the answer is we're taking it as

this vector here, dot product with this vector here. So dot product means 0.71 times 0.19, plus 0.81 times 0.63, plus 0.74 times 0.31, and so forth. And in linear algebra speak, because one of them is a column and one of them is a row, this is the same as a matrix product, so you can see here.
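
For reference, the same per-cell calculation in plain NumPy - the factor values below are illustrative, not the spreadsheet's actual numbers:

```python
import numpy as np

# Five latent factors for one user and one movie (made-up values).
user_factors = np.array([0.71, 0.81, 0.74, 0.45, 0.30])
movie_factors = np.array([0.19, 0.63, 0.31, 0.50, 0.24])

pred = (user_factors * movie_factors).sum()   # same as np.dot(user_factors, movie_factors)
```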

I've used the excel function matrix multiply And that's my prediction Having said that if the original Rating doesn't exist at all Then I'm just going to set this to zero right because like there's no error in predicting something that hasn't happened Okay, so what I'm going to do is I'm basically going to say all right every one of my rate Rate my predictions is not going to be a neural net.

It's going to be a single matrix multiplication now the matrix multiplication that it's doing is basically in practice is between like this matrix and this Matrix right so each one of these is a single part of that So I randomly initialize these these are just random numbers That I've just pasted in here So I've basically started off with two Random matrices, and I've said let's assume for the time being that every rating can be represented as The the matrix product of those two So then in excel you can actually do gradient descent You have to go to your options to the add-in section and check the box to say turn it on and once you do you'll See there's something there called solver And if I go solver it says okay, what's your?

Objective function, and you just choose the cell - so in this case we chose the cell that contains our root mean squared error. And then it says okay, what do you want to change? And you can see here we've selected this matrix and this matrix, and so it's going to do a gradient descent for us by changing these matrices to try and, in this case, minimize this cell. So GRG Nonlinear is a gradient descent method; so say solve, and you'll see it starts at 2.8, and then down here you'll see that number is going down.

It's not actually showing us what it's doing, but we can see that the number's going down. So this has kind of got a neural-net-y feel to it, in that we're doing a matrix product and we're doing a gradient descent, but we don't have a non-linear layer and we don't have a second linear layer on top of that, so we don't get to call this deep learning. Things where people have kind of matrix products and gradient descents, but it's not deep, people tend to just call that shallow learning - okay, so we're doing shallow learning here. Alright, so I'm just going to go ahead and press escape to stop it because I'm sick of waiting, and so you can see we've now got down to nought point three nine. So for example, it guessed that movie 27 for user 72 would get a 4.44 rating, and it actually got a 4 rating, so you can see it's doing something quite useful. So why is it doing something quite useful?
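
What the Solver is doing here can be sketched as ordinary gradient descent in NumPy. This is illustrative only: it reuses the cross-tab built earlier, treats missing ratings as contributing zero error, and the learning rate and iteration count are arbitrary:

```python
import numpy as np

np.random.seed(0)
grid = crosstab.values                 # 15 x 15 grid of ratings from the cross-tab above
mask = ~np.isnan(grid)                 # only score the cells that actually have a rating
target = np.nan_to_num(grid)
n_rated = mask.sum()

n_factors = 5
u = np.random.uniform(0, 0.05, (grid.shape[0], n_factors))   # user factors
m = np.random.uniform(0, 0.05, (n_factors, grid.shape[1]))   # movie factors

lr = 0.5
for step in range(10_000):
    err = (u @ m - target) * mask      # zero error where there was no rating
    u -= lr * (err @ m.T) / n_rated    # gradient step for the user factors
    m -= lr * (u.T @ err) / n_rated    # gradient step for the movie factors
    if step % 2000 == 0:
        print(np.sqrt((err ** 2).sum() / n_rated))   # RMSE over the rated cells
```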

I mean, something to note here is the number of things we're trying to predict: there's 225 of them. And the number of things we're using to predict them is 15 times 5, times two - so 150 of them. So it's not like we can just exactly fit; we actually have to do some kind of machine learning here. So basically what this is saying is that there does seem to be some way of making predictions in this way. And for those of you that have done some linear algebra, this is actually a matrix decomposition. Normally in linear algebra you would do this using an analytical technique, or using some techniques that are specifically designed for this purpose, but the nice thing is that we can use gradient descent to solve pretty much everything, including this. I don't like to so much think of it from a linear algebra point of view, though; I like to think of it from an intuitive point of view, which is this. Let's say movie

Sorry. Let's say movie ID 27 is Lord of the Rings part one and let's say Move and so let's say we're trying to make that prediction for user 272 are they going to like Lord of the Rings part one and so conceptually That particular movie. Maybe there's like there's four.

Sorry. There's five Numbers here and we could say like well What if the first one was like how much is it sci-fi and fantasy and the second one is like? How recent a movie and how much special effects is there you know and the one at the top might be like how dialogue driven?

Is it right like let's say those kind of five these five numbers represented particular things about the movie and so if that was the case Then we could have the same five numbers for the user saying like okay How much does the user like sci-fi and fantasy how much does the user like?

Modern this is a modern CGI driven movies. How much does this does this user like? Dialogue driven movies and so if you then took that cross product You would expect to have a good model right would expect to have a reasonable rating now the problem is We don't have this information for each user.

We don't have the information for each movie, so we're just going to like assume That this is a reasonable Kind of way of thinking about this system, and let's and let's stochastic gradient descent try and find these numbers Right so so in other words these these factors We call these things factors these factors and we call them factors because you can multiply them together to create this They're factors in a linear algebra sense these factors.

We call them latent factors because they're not actually This is not actually a Vector that we've like named and understood and like entered in manually we've kind of assumed That we can think of movie ratings this way we've assumed that we can think of them as a dot product of Some particular features about a movie and some particular features of what users like those kinds of movies, right?

And then we've used gradient descent To just say okay try and find some numbers that that work So that's that's basically the technique right and it's kind of The and the entirety is in this spreadsheet right so that is collaborative filtering using what we call probabilistic matrix factorization And as you can see the whole thing is easy to do in an Excel spreadsheet and the entirety of it really is this single Thing which is a single matrix multiplication plus randomly initializing We like to know if it would be better to cap this to zero and five maybe yeah Yeah, we're going to do that later right.

There's a whole lot of stuff. We can do improvements. This is like our Simple as possible starting point right so so what we're going to do now is we're going to try and implement this in Python And run it on the whole data set another question is how do you figure out how many?

You know, how long are these vectors? Why is it five? Yeah, yeah - so, something to think about. Given that this is like movie 49, right, and we're looking at a rating for movie 49, think about this: this is actually an embedding matrix, and so this length is actually the size of the embedding matrix.

I'm not saying this is an analogy I'm saying it literally this is literally an embedding matrix We could have a one hot encoding where 72 Where a one is in the 72nd position and so we'd like to look it up, and it would return this list of five numbers So the question is actually how do we decide on the dimensionality of our embedding vectors?

And the answer to that question is we have no idea We have to try a few things and see what works the underlying concept is you need to pick an embedding dimensionality, which is Enough to reflect the kind of true complexity of this causal system but not so big that you Have too many parameters that it could take forever to run or even with vectorization.

It might overfit So what does it mean when the factor is negative then The factor being negative in the movie case would mean like this is not dialogue driven in fact It's like the opposite dialogue here is terrible a negative for the user would be like I actually dislike modern CGI movies, so it's not from zero to whatever it's the range of Score it'd be negative is that range of score even like no no maximum.

No. There's no constraints at all here These are just standard embedding matrices Thanks Questions the first question is why do what why can we trust this embeddings because like if you take a number six It can be expressed as 1 into 6 or like 6 into 1 or 2 into 3 and 3 into 2 Also, are you saying like we could like reorder these five numbers in some other different order or like the value itself might be different As long as the product is something well, but you see we're using gradient descent to find the best numbers So like once we found a good minimum the idea is like Yeah, there are other numbers, but they don't give you as good an objective value And of course we should be checking that on a validation set really which we'll be doing in the Python version Okay, and the second question is when we have a new movie or a new user do we have to retrain the model?

That is a really good question, and there isn't a straightforward answer to that Time permitting will come back to it But basically you would need to have like a kind of a new user Model or a new movie model that you would use initially And then over time yes, you would then have to retrain the model So like I don't know if they still do it But Netflix used to have this thing that when you were first on board it onto Netflix They would say like what movies do you like?

And you'd have to go through and like say a bunch of movies you like and it would then like train its model Could you could you just find the nearest movie to the movie that you're trying to the new movie that you're trying to add? Yeah, you could use nearest neighbors for sure But the thing is initially at least in this case we have no Columns to describe a movie so if you had something about like the movies Genre release date who was in it or something you could have some kind of non collaborative filtering model And that was kind of what I meant.

I like a new movie model. You'd have to have some some kind of predictors Okay, so a Lot of this is going to look familiar and and the way I'm going to do this is again It's kind of this top-down approach. We're going to start using a Few features of pytorch and fast AI and gradually we're going to redo it a few times in a few different ways Kind of doing a little bit deeper each time Regardless we do need a validation set so we can use our standard cross validation indexes approach to grab a random set of IDs This is something called weight decay Which we'll talk about later in the course for those of you that have done some machine learning It's L2 regularization basically And this is where we choose how big a embedding matrix do we want okay?

So again, you know, here's where we get our model data object from CSV, passing in that ratings file, which remember looks like that. Okay, so you'll see stuff tends to look pretty familiar after a while, and then you just have to pass in: what are your rows effectively?

What are your columns effectively, and what are your values effectively right so any any collaborative filtering? Recommendation system approach. There's basically a concept of like You know a user and an item Now they might not be users and items like if you're doing the Ecuadorian groceries competition There are stores and items and you're trying to predict.

How many things are you going to sell at? This store of this type But generally speaking just this idea of like you've got a couple of kind of high cardinality Categorical variables and something that you're measuring and you're kind of conceptualizing and saying okay, we could predict The rating we can predict the value by doing this this dot product Interestingly this is kind of relevant to that that last question or suggestion an Identical way to think about this or to express this is to say when we're deciding Whether user 72 will like movie 27 is basically saying which other users liked movies that 72 liked and Which other movies were liked by people like?

User 72 it turns out that these are basically two ways of saying the exact same thing So basically what collaborative filtering is doing? You know kind of conceptually is to say okay this movie and this user Which other movies are similar to it in terms of like? Similar people enjoyed them and which people are similar to this person based on people that like the same kind of movies so that's kind of the underlying Structure and anytime there's an underlying structure like this that kind of collaborative filtering approach is likely to be useful okay, so So you yeah, so there's basically two parts the two bits of your thing that you're factoring and then the value the dependent variable So as per usual we can take our model data and ask for a learner from it And we need to tell it what size embedding matrix to use How many sorry what validation set indexes to use what batch size to use and what optimizer?

To use and we're going to be talking more about optimizers shortly We won't do Adam today, but we'll do Adam next week or the week after And then we can go ahead and say fit Right, and it all looks pretty similar interest. It's usual interestingly I only had to do three epochs like this kind of models into train super quickly You can use the learning rate finder as per usual all the stuff you're familiar with will work fine And that was it so this took you know about two seconds the train.
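
For reference, those lines look roughly like this in the 2018-era fastai (v0.7) API; the exact function and argument names here are from memory of that library, so treat them as approximate:

```python
# Sketch of the fastai 0.7 collaborative-filtering calls described above.
from fastai.learner import *
from fastai.column_data import *
from torch import optim

val_idxs = get_cv_idxs(len(ratings))   # random validation indices
wd = 2e-4                              # weight decay (L2 regularization)
n_factors = 50                         # size of each embedding vector

cf = CollabFilterDataset.from_csv(path, 'ratings.csv', 'userId', 'movieId', 'rating')
learn = cf.get_learner(n_factors, val_idxs, 64, opt_fn=optim.Adam)
learn.fit(1e-2, 2, wds=wd, cycle_len=1, cycle_mult=2)
```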

There's no pre trained anythings here This is from random scratch, right? So this is our validation set and we can compare it we have this is a mean squared error Not a root mean squared error, so we can take the square root So that last time I ran it was 0.776 and that's 0.88 and there's some benchmarks available for this data set And when I scrolled through and found the bench the best benchmark I could find here from this Recommendation system specific library they had 0.91.

So we've got a better loss in two seconds already, so that's good. So that's basically how you can do collaborative filtering with the fast AI library without thinking too much. But now we're going to dig in and try and rebuild that; we'll try and get to the point that we're getting something around 0.77, 0.78 from scratch. But if you want to do this yourself at home, you know, without worrying about the detail, those three lines of code are all you need. Okay, so we can get the predictions in the usual way, and we could for example plot - sns is seaborn; seaborn is a really great plotting library. It sits on top of matplotlib

It sits on top of matplotlib It actually leverages matplotlib So anything you learn about matplotlib will help you with seaborne. It's got a few like nice little plots like this joint plot Here is I'm doing predictions against Against actuals. So these are my actuals These are my predictions and you can kind of see the the shape here is that as we predict higher numbers they actually are higher numbers and you can also see the histogram of the Predictions and a histogram of the actions.

So I'm just kind of plotting that just to show you another interesting visualization. Could you please explain the n_factors - why is it set to 50? It's set to 50 because I tried a few things and that seemed to work, that's all. What does it mean? It's the dimensionality of the embedding matrix. Or to think of it in another way,

It's like how you know rather than being five. It's Jeremy I have a question about suppose that your Recommendation system is more implicit. So you have zeros or ones instead of just Actual numbers, right? So basically we would then Need to use a classifier instead of a regressor Have to sample the negative for something like that So if you don't have it, we just have once let's say like just kind of implicit feedback.

Oh I'm not sure we'll get to that one in this class But what I will say is like in the case that you're just doing classification rather than regression We haven't actually built that in the library yet Maybe somebody this week wants to try adding it. It would only be a small number of lines of code.

You basically have to change the activation function to be a sigmoid and you would have to change the Criterion or the loss function to be cross entropy rather than RMSE and that will give you a Classifier rather than a regressor. Those are the only things you'd have to change So hopefully somebody this week will take up that challenge and by the time we come back next week.

We will have that working Okay So I said that we're basically doing a dot product right or you know a dot product is kind of the vector version I guess of this matrix product So we're basically doing each of these things times each of these things and then add it together That's a dot product.

So let's just have a look at how we do that in PyTorch So we can create a tensor in PyTorch just using this little capital T thing You can just say that's the fast AI version the full version is torch dot from NumPy or something But I've got it set up so you can pass it through pass in even a list of lists So this is going to create a torch tensor with 1 2 3 4 and then here's a torch tensor with 2 2 10 10 Okay, so here are two Torch tensors, I didn't say dot CUDA.

So they're not on the GPU. They're sitting on the CPU just FYI We can multiply them together Right and so anytime you have a mathematical operator between tensors in NumPy or PyTorch It will do element wise Assuming that they're the same dimensionality, which they are they're both 2 by 2 Okay, and so here we've got 2 by 2 is 4 3 by 10 is 30 and so forth.

Okay, so there's our A times B So if you think about basically what we want to do here is we want to take Okay, so I've got 1 Times 2 is 2 2 times 2 is 4 2 plus 4 is 6 and so that is actually the dot product between 1 2 and 2 4 and Then here we've got 3 by 10 is 30 4 by 40 Sorry 4 by 10 is 40 30 and 40 is 70 So in other words a times B dot sum along the first dimension So that's summing up the columns.

In other words across a row Okay, this thing here is doing the dot product of Each of these rows with each of these rows That makes sense and obviously we could do that with You know some kind of matrix modification approach, but I'm trying to really do things with as little special case stuff as possible Okay, so that's what we're going to use for our dot products from now on so basically all we need to do now is Remember we have the data we have is not in that crosstab format So in Excel we've got it in this crosstab format, but we've got it here in this Listed format use a movie rating use a movie So conceptually we want to be like looking up this user Into our embedding matrix to find their 50 factors looking up that movie to find their 50 factors and then take the dot product of those two 50 long vectors So let's do that To do it we're going to build a layer our own custom neural net layer So the the more generic vocabulary we call this is we're going to build a pytorch module Okay, so a pytorch module is a very specific thing It's something that you can use as a layer and a neural net once you've created your own pytorch module You can throw it into a neural net And a module works by assuming we've already got one say called model You can pass in some things in parentheses, and it will calculate it right so assuming that we already have a module called dot product We can instantiate it like so To create our dot product object, and we can basically now treat that like a function All right, but the thing is it's not just a function because we'll be able to do things like take derivatives of it Stack them up together into a big Stack of neural network layers blah blah blah, right, so it's basically a function that we can kind of compose very conveniently So here how do we define a module which as you can see here returns a dot product well We have to create a Python class and so if you haven't done Python OO before You're going to have to learn because all pytorch modules are written in Python OO And it's one of the things I really like about pytorch is that it doesn't Reinvent totally new ways of doing things like TensorFlow does all the time in pytorch that you know really tend to use Pythonic ways to do things so in this case.

How do you create you know some kind of new behavior you create a Python class? So Jeremy suppose that you have a lot of data Not just a little bit of data. You can have a memory. Will you be able to use fast AI to solve corollary filtering? Yes, absolutely It's it uses mini batch stochastic gradient descent which does it a batch at a time the This particular version is going to create a Pandas data frame and a pandas data frame has to live in memory Having said that you can get easily 512 gig You know instances on Amazon so like if you had a CSV that was bigger than 512 gig You know that would be impressive if that did happen I guess you would have to instead save that as a B calls array and Create a slightly different version that reads from a B calls array to streaming in or maybe from a desk data frame which also so It would be easy to do I don't think I've seen Real world situations where you have 512 gigabyte collaborative filtering matrices, but yeah, we can do it Okay now This is PyTorch specific this next bit is that when you define like the actual work to be done which is here return user times movie dot sum You have to put it in a special method called forward Okay, and this is this idea that like it's very likely you're pretty neural net right and in a neural net the thing where you calculate the next Set of activations is called the forward pass and so that's doing a forward calculation The gradients is called the backward calculation We don't have to do that because PyTorch calculates that automatically so we just have to define Forward so we create a new class we define forward and here we write in our definition of dot product Okay, so that's it.
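
Roughly the class being described - a module whose forward is just that per-row dot product:

```python
import torch
import torch.nn as nn

class DotProduct(nn.Module):
    def forward(self, u, m):
        # element-wise multiply, then sum each row: one dot product per example
        return (u * m).sum(1)

model = DotProduct()
a = torch.Tensor([[1., 2.], [3., 4.]])
b = torch.Tensor([[2., 2.], [10., 10.]])
model(a, b)      # tensor([ 6., 70.])
```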

So now that we've created this class definition. We can instantiate our Model right and we can call our model and get back the numbers we expected. Okay, so that's it That's how we create a custom PyTorch layer and if you compare that to like any other Library around pretty much.

This is way easier Basically, I guess because we're leveraging What's already in Python? So let's go ahead and now create a more complex Module and we're going to basically do the same thing. We're going to have a forward again We're going to have our users times movies dot sum But we're going to do one more thing beforehand, which is we're going to create two Embedding matrices and then we're going to look up our users and our movies in those embedding matrices So let's go through and and do that so the first thing to realize is that The users the user IDs and the movie IDs may not be contiguous You know like they may be they start at a million and go to a million one thousand say, right?

So if we just used those IDs directly to look up into an embedding matrix, we would have to create an embedding matrix of size one million one thousand, which we don't want to do. So the first thing I do is to get a list of the unique user IDs, and then I create a mapping from every user ID to a contiguous integer. This thing I've done here, where I've created a dictionary which maps from every unique thing to a unique index, is well worth studying during the week, because it's super, super handy - it's something you very, very often have to do in all kinds of machine learning. All right, and so I won't go through it here; it's easy enough to figure out, and if you can't figure it out, just ask on the forum. Anyway, so once we've got the mapping from user to a contiguous index, we can then say: let's now replace the user ID column with that contiguous index. So pandas .apply applies an arbitrary function; in Python, lambda is how you create an anonymous function on the fly, and this anonymous function simply returns the index. Then we do the same thing for movies, and so after that we now have the same ratings table we had before, but our IDs have been mapped to contiguous integers, and therefore they are things that we can look up into an embedding matrix. So let's get the count of our users and our movies, and let's now go ahead and try and create our Python version of this
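
The id-to-contiguous-index mapping being described looks roughly like this:

```python
# Map the raw user and movie ids to contiguous integers 0..n-1, then count them.
u_uniq = ratings.userId.unique()
user2idx = {o: i for i, o in enumerate(u_uniq)}
ratings.userId = ratings.userId.apply(lambda x: user2idx[x])

m_uniq = ratings.movieId.unique()
movie2idx = {o: i for i, o in enumerate(m_uniq)}
ratings.movieId = ratings.movieId.apply(lambda x: movie2idx[x])

n_users = len(u_uniq)
n_movies = len(m_uniq)
```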

Do we want or whatever right anytime we want to do something like? This where we're passing in and saying we want to construct our Module with this number of users and this number of movies then we need a constructor for our class and you create a constructor in Python by defining a Dunder in it underscore underscore in it underscore underscore special name so this just creates a Constructor then if you haven't done over before You wanted to do some study during the week, but it's a pretty simple idea This is just the thing that when we create this object.

This is what gets run, okay? Again special Python thing when you create your own constructor You have to call the parent class constructor And if you want to have all of the cool behavior of a Pytorch module you get that by inheriting From an end up module neural net module okay, so basically by inheriting here and calling the superclass constructor We now have a fully functioning Pytorch layer, okay, so now we have to give it some behavior And so we give it some behavior by storing some things in it all right, so here.

We're going to create something called self.u, and that is going to be an embedding layer: number of rows is n_users, number of columns is n_factors. So that is exactly this, right - the number of rows is n_users, the number of columns is n_factors - and then we'll have to do the same thing for movies. All right, so that's going to go ahead and create these two randomly initialized arrays. However, when you randomly initialize an array, it's important to randomly initialize it to a reasonable set of numbers, like a reasonable scale. If we randomly initialized them from, like, nought to a million, then we would start out with ratings that are, you know, billions and billions in size, and that's going to be very hard to do gradient descent on. So I just kind of manually figured here about what size numbers are going to give me about the right ratings - we know we did ratings between about nought and five, so if we start out with stuff between about nought and 0.05, then we're going to get ratings of about the right level. You can easily enough calculate that; in neural nets

there are standard algorithms for basically doing that calculation, and the key algorithm is something called He initialization, from Kaiming He. The basic idea is that you basically set the weights equal to a normal distribution with a standard deviation which is basically inversely proportional to the number of things in the previous layer. So in this case, if you take that nought to 0.05 and multiply it by the fact that you've got - was it 40 or 50? - 50, 50 things coming out of it, then you're going to get something of about the right size. PyTorch already has He initialization there; we don't normally, in real life, have to think about this, we can just call the existing initialization functions. But we're trying to do this all from scratch here, okay, without any special stuff going on.

So there's quite a bit of PyTorch notation here. self.u we've already set to an instance of the Embedding class; it has a .weight attribute which contains the actual embedding matrix. The actual embedding matrix is not a tensor, it's a variable. A variable is exactly the same as a tensor - in other words, it supports the exact same operations as a tensor - but it also does automatic differentiation.

That's all a variable is, basically. To pull the tensor out of a variable, you get its data attribute. Okay, so this is now the tensor of the weight matrix of the self.u embedding. And then something that's really handy to know is that with all of the tensor functions in PyTorch, you can stick an underscore at the end, and that means do it in place. Right, so this is saying: create uniform random numbers of an appropriate size for this tensor, and don't return them, but actually fill in that matrix in place. Okay, so that's a super handy thing to know about. I mean, it wouldn't be rocket science otherwise; we would have to have gone

We would have to have gone Okay, there's the non in place version. That's what saves us some typing saves us some screen noise. That's all Okay So now we've got our randomly initialized embedding weight matrices And so now the forward I'm actually going to use the same columnar model data that we used for Rossman And so it's actually going to be passed both categorical variables and continuous variables and in this case there are no continuous variables, so I'm just going to grab the zeroth column out of the categorical variables and call it users and the first column and call it movies okay, so I'm just kind of Too lazy to create my own.

I'm not so much too lazy that we do have a special class for this But I'm trying to avoid creating a special class, so I'm just going to leverage this columnar model data class Okay, so we can basically grab our user and movies Mini batches right and remember this is not a single user in a single movie.

This is going to be a whole mini batch of them We can now look up that mini batch of users in our embedding matrix U and the movies in our embedding matrix M All right, so this is like exactly the same as just doing an array look up to grab the the user ID numbered Value, but we're doing it a whole mini batch at a time Right and so it's because pytorch Can do a whole mini batch at a time with pretty much everything that we can get really easy speed up We don't have to write any loops on the whole to do everything through our mini batch And in fact if you do have a loop through your mini batch manually you don't get GPU acceleration That's really important to know right so you never want to loop have a for loop going through your mini batch You always want to do things in this kind of like whole mini batch at a time But pretty much everything in pytorch does things a whole mini batch at a time, so you shouldn't have to worry about it And then here's our product just like before all right so having defined That I'm now going to Go ahead and say you're at my X values is Everything except the rating and the timestamp In my ratings table my Y is my rating and then I can just say okay.

Let's Grab a model data from a data frame using that X and that Y and here is our list of categorical variables Okay And then so let's now instantiate that pytorch object All right, so we've now created that from scratch And then the next thing we need to do is to create an optimizer, so this is part of pytorch The only fast AI thing here is this line right because it's like I don't think showing you How to build data sets and data loaders is interesting enough really we might do that in part two of the course And it's actually so straightforward like a lot of you are already doing it on the forums So I'm not going to show you that in this part But if you're interested feel free to to talk on the forums about it But I'm just going to basically take the the thing that feeds this data as a given particularly because these things are so flexible Right you you know if you've got stuff in a data frame.

You can just use this you don't have to rewrite it So that's the only fast AI thing we're using so this is a pytorch thing and so Optim is the thing in pytorch that gives us an optimizer. We'll be learning about that very shortly So it's actually the thing that's going to update our weights Pytorch Calls them the parameters of the model so earlier on we said model equals embedding dot blah blah blah All right, and because embedding dot Derives from nn dot module we get all of the pytorch module behavior and one of the things we got for free Is the ability to say dot parameters?

So that's pretty handy, right - that's the thing that is basically going to automatically give us a list of all of the weights in our model that have to be updated, and so that's what gets passed to the optimizer. We also pass the optimizer the learning rate, the weight decay (which we'll talk about later) and momentum (which we'll also talk about later). Okay, one other thing that I'm not going to do right now, but we will do later, is to write a training loop. So the training loop is the thing that loops through each mini batch and updates the weights to subtract the gradient times the learning rate. There's a function in fast AI which is the training loop, and it's pretty simple. Here it is: for epoch in epochs - this is just the thing that shows a progress bar, so ignore this - for x, y in my training data loader: calculate the loss, print out the loss in a progress bar, call any callbacks you have, and at the end call the metrics on the validation set. So there's just, for each epoch, go through each mini batch and do one step of our optimizer. Step is basically going to take advantage of this optimizer, but we're rewriting that from scratch shortly. So notice, this time we're not using a learner
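
A sketch of that wiring, again using the 2018-era fastai names as remembered (ColumnarModelData for the data, a plain PyTorch optimizer, and the low-level fit function), so the exact signatures may differ:

```python
import numpy as np
import torch.optim as optim
import torch.nn.functional as F
from fastai.column_data import ColumnarModelData
from fastai.model import fit

x = ratings.drop(['rating', 'timestamp'], axis=1)   # everything except rating and timestamp
y = ratings['rating'].astype(np.float32)

data = ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], 64)
model = EmbeddingDot(n_users, n_movies).cuda()
opt = optim.SGD(model.parameters(), 1e-1, weight_decay=1e-5, momentum=0.9)

fit(model, data, 3, opt, F.mse_loss)
```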

It's past it part of fast AI It's like lower down the layers of abstraction now. This is the thing that takes a regular pipe torch model, so if you ever want to like skip as much Fast AI stuff as possible like you've got some pipe torch model. You've got some code on the internet You basically want to run it But you don't want to write your own training loop, then this is this is what you want to do You want to call fast AI's fit function and so what you'll find is like The library is designed so that you can kind of dig in at any layer of abstraction You like right and so at this layer of abstraction.

You're not going to get things like Stochastic gradient descent with restarts you're not going to get like differential learning rates like all that stuff That's in the learner like you could do it, but you'd have to write it all by by hand yourself Right and that's the downside of kind of going down to this level of abstraction The upside is that as you saw the code for this is very simple.
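
In spirit, the loop being described is just this - a minimal hand-rolled version that leaves out the progress bar, callbacks and metrics, and assumes each mini-batch arrives as (inputs..., target):

```python
def simple_fit(model, data, epochs, opt, crit):
    for epoch in range(epochs):
        model.train()
        for *x, y in data.trn_dl:        # assumed batch layout: inputs..., target
            opt.zero_grad()
            loss = crit(model(*x), y)    # forward pass and loss
            loss.backward()              # backward pass: compute gradients
            opt.step()                   # update the weights
```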

It's just a simple training loop It takes a standard pytorch model So this is like this is a good thing for us to use here We can we just call it and it looks exactly like what we're we're used to see right we get our validation and training loss for the three plus now you'll notice that We wanted something around point seven six So we're not there so in other words the the default fast AI collaborative filtering algorithm is doing something Smarter than this so we're going to try and do that One thing that we can do since we're calling our you know this lower level fit function There's no learning rate and kneeling we could do our own learning rate and kneeling so you can hear it See here there's a fast AR function called set learning rates you can pass in a standard pytorch optimizer and pass in your new learning rate and Then call fit again.

And so this is how we can let manually do a learning rate schedule And so you can see we've got a little bit better 1.13 We still got a long way to go Okay, so I think what we might do is we might have a Seven-minute break and then we're going to come back and try and improve this score of it For those who are interested somebody was asking me at the break for a kind of a quick Walk through so this is totally optional, but if you go into the fast AI library, there's a model.py file And That's where fit is which we're just looking at which goes through Each epoch in epochs and then goes through each X and Y in the mini batch and then it calls this Step function so the step function is Here and you can see the key thing is it calculates the output from the model the models for N right and so if you remember our dot product We didn't actually call model dot forward we just called model parentheses and that's because the nn dot module automatically You know when you call it as if it's a function it passes it along to forward Okay, so that's that's what that's doing there right and then the rest of this will will learn about shortly.

Just basically doing the The loss function and then the backward pass Okay, so for those who are interested That's that's kind of gives you a bit of a sense of how the code is structured if you want to look at it and as I say like the the fast AI code is designed to both be world class performance, but also Pretty easy to read so like feel free like take a look at it And if you want to know what's going on just ask on the forums And if you know if you think there's anything that could be clearer Let us know Because yeah, the code is definitely know we're going to be digging into the code more and more Okay, so let's try and improve this a little bit and let's start off by improving it in Excel So you might have noticed here that we've kind of got the idea that User 72 You know like sci-fi modern movies with special effects, you know Whatever and movie number 27 is sci-fi and has special effects and not much dialogue but we're missing an important case, which is like User 72 is pretty enthusiastic on the whole and on average rates things highly highly, you know and movie 27 You know, it's just a popular movie You know which just on average it's higher so what we'd really like is to add a constant for the user and a constant for the movie and Remember in neural network terms we call that a bias That's we want to add a bias so we could easily do that and if we go into the bias tab here We've got the same data as before and we've got the same Latent factors as before and I've just got one extra Row here and one extra column here and you won't be surprised here that we now Take the same matrix multiplication as before and we add in that and we add in that Okay, so that's our bias So other than that we've got exactly the same loss function over here And so just like before we can now go ahead and solve that and now our changing variables include the bias and we can say solve and if we leave that for a little while it will come to a better result than we had before Okay, so that's the first thing we're going to do to improve our model and there's really very little show Just to Make the code a bit shorter I have to find a function called get embedding which takes a number of inputs and a number of factors so the number of rows and the embedding matrix and unposted with matrix creates the embedding and then Randomly initializes it.

I don't know why I'm doing negative to positive here and it's zero last time Honestly, it doesn't matter much as long as it's in the right ballpark And then we return that initialized embedding So now we need not just our users by factors, which I'll chuck into u, our movies by factors Which I've chuck into m, but we also need users by 1 Which we'll put into ub, user bias, and movies by 1 which we'll put into movie bias Okay, so this is just doing a list comprehension Going through each of the tuples creating embedding for each of them and putting them into these things Okay, so now our forward is exactly the same as before U times m dot sum and this is actually a little confusing because we're doing it in two two steps Maybe to make it a bit easier.

Let's pull this out Put it up here Put this in parentheses Okay, so maybe that looks a little bit more familiar All right, u times n dot sum that's the same dot product and then here we're just going to add in our user bias and our movie bias Dot squeeze is the PyTorch thing that adds an additional unit axis on That's not going to make any sense if you haven't done broadcasting before I'm not going to do broadcasting in this course because we've already done it and we're doing it in the machine learning course But basically in short Broadcasting is what happens when you do something like this where um is a matrix ub Self dot ub uses is a is a vector How do you add a vector to a matrix and basically what it does?

Is it duplicates? the vector So that it makes it the same size as the matrix and the particular way whether it duplicates it across columns or down rows Or how it does it is called broadcasting the broadcasting rules are the same as numpy PyTorch didn't actually used to support broadcasting So I was actually the guy who first added broadcasting to PyTorch using an ugly hack and then the PyTorch authors did an awesome job Of supporting it actually inside the language So now you can use the same broadcasting operations in PyTorch is numpy If you haven't dealt with this before it's really important to learn it Because like it's it's kind of the most important fundamental way to do computations quickly in numpy and PyTorch It's the thing that lets you not have to do loops Could you imagine here if I had to loop through every row of this matrix and add each you know?

This back to the every row it would be slow it would be you know a lot more code And the idea of broadcasting it actually goes all the way back to APL which is a language designed in the 50s by an extraordinary guy called Ken Iverson APL was originally a designed or written out as a new type of mathematical notation He has this great essay called Notation as a tool for thought and the idea was that like really good notation could actually make you think of better things And part of that notation is this idea of broadcasting.
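
Broadcasting in one tiny example - adding a length-3 vector to a 2x3 matrix behaves as if the vector were copied down the rows; the same rules apply in NumPy and PyTorch:

```python
import torch

mat = torch.Tensor([[1., 2., 3.],
                    [4., 5., 6.]])
vec = torch.Tensor([10., 20., 30.])
mat + vec    # tensor([[11., 22., 33.], [14., 25., 36.]])
```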

I'm incredibly enthusiastic about it, and we're going to use it plenty so either watch the machine learning lesson or You know Google numpy broadcasting for information Anyway, so basically it works reasonably intuitively we can add on we can add the vectors to the matrix All right Having done that we're now going to do one more trick.

Which is - I think it was Yannet who asked earlier whether we could squish the ratings to be between 1 and 5 - and the answer is we could. And specifically, what we could do is put it through a sigmoid function. All right, so to remind you, the sigmoid function looks like that, and this here is one. All right, we could put it through a sigmoid function: so we could take, like, 4.96 and put it through a sigmoid function, and like that - you know, that's kind of high

You know that's kind of high So it kind of be over here somewhere right and then we could multiply that sigmoid like the result of that by 5 For example right and in this case we want it to be between 1 and 5 right so maybe we might multiply it by 4 and Add 1 instance that's a basic idea And so here is that trick we take The result so the result is basically the thing that comes straight out of the dot product plus the addition of the biases And put it through a sigmoid function now in pytorch Basically all of the functions you can do the tensors are available Inside this thing called capital F, and this is like totally standard in pytorch It's actually called torch dot nn dot functional But everybody including all of the pytorch docs import torch dot nn dot functional as capital F Right so capital F dot sigmoid means a function called sigmoid that is coming from torches Functional module right and so that's going to apply a sigmoid function to the result So I've squished them all between 0 and 1 using that nice little shape, and then I can multiply that by 5 minus 1 plus 4 Right and then add on 1 and that's going to give me something between 1 and 5 okay, so Like there's no need to do this.

I could comment it out, and it will still work right But now it has to come up with a set of calculations that are always between 1 and 5 right where else if I leave this in then it's like makes it really easy It's basically like oh if you think this is a really good movie just calculate a really high number It's a really crappy movie cap a really low number, and I'll make sure it's in the right region So even though this isn't a neural network It's still a good example of this kind of like if you're doing any kind of parameter fitting Try and make it so that the thing that you want your function to return It's like it's easy for it to return that okay, so that's why we do that that function squishing So we call this embedding dot bias So we can create that in the same way as before you'll see here I'm calling dot CUDA to put it on the GPU because we're not using any learner stuff normally that'll happen for you But we have to manually say put it on the GPU This is the same as before create our optimizer Fit exactly the same as before and these numbers are looking good and again.

Again, we'll make a little change to our learning rate and learning rate schedule, and we're down to 0.8, so we're actually pretty close. Those are the key steps, and this is how most collaborative filtering is done. Yannet reminded me of an important point, which is that this is not strictly speaking a matrix factorization, because strictly speaking a matrix factorization would multiply that matrix by that matrix to create this whole matrix — and remember, anywhere the original is empty, like here or here, we'd be putting in a zero: we'd be saying if the original was empty, put in a zero. Normally you can't avoid that with classic matrix factorization, because classic matrix factorization reconstructs the whole matrix. That was a real problem when people used to try to use traditional linear algebra for this, because these matrices are sparse. In practice this little matrix doesn't have many gaps, because we picked the users that watch the most movies and the movies that are most watched, but if you look at the whole matrix it's mainly empty. Traditional techniques treated empty as zero, so you basically had to predict a zero — as if the fact that I haven't watched a movie means I don't like it — and that gives terrible answers. This probabilistic matrix factorization approach takes advantage of the fact that our data structure actually looks like this — user, movie, rating rows — rather than that cross-tab, and so it only calculates the loss for the user ID / movie ID combinations that actually appear. For a particular user and movie the rating should be 3, we predicted three and a half, so our loss is half a point; there's nothing here

that's ever going to calculate a prediction or a loss for a user/movie combination that doesn't appear in this table. By definition, the only things that can appear in a mini-batch are what's in this table. Interestingly enough, a lot of this played out in the Netflix Prize. Before the Netflix Prize came along, this probabilistic matrix factorization had actually already been invented, but nobody noticed. Then in the first year of the Netflix Prize someone wrote a really famous blog post where they basically said: hey, check this out, this incredibly simple technique works incredibly well — and suddenly all the Netflix leaderboard entries got much, much better. That's quite a few years ago now, and these days every collaborative filtering approach does this. Not every collaborative filtering approach adds the sigmoid thing, by the way.

The sigmoid isn't rocket science — this isn't like the NLP result we saw last week, where it was a new state of the art; it's not particularly uncommon — but there are still people who don't do it, and it definitely helps a lot. So actually, maybe now is a good time to have a look at the definition of this in the library. The column_data module contains all of these definitions, and we can compare what we've built to the thing we originally used, which was whatever came out of CollabFilterDataset. So let's go to CollabFilterDataset here.

Here it is. We called get_learner, so we can go down to get_learner, and that created a collab filter learner, passing in the model from get_model — and get_model created an EmbeddingDotBias. So here is EmbeddingDotBias, and you can see here —

here it is: it's the same thing. There's an embedding for each of the things, and here's our forward that does (u * i).sum(), adds the biases, and applies the sigmoid. So in fact we have literally just rebuilt what's in the fast.ai library. The library version is a little shorter and easier because it takes advantage of the fact that there's a special collaborative filtering dataset, so we get passed in the users and the items directly and don't have to pull them out of cats and conts — but other than that, it's exactly the same. Hopefully you can see that the fast.ai library is not some inscrutable code containing concepts you can never understand.

We've actually just built this entire thing up from scratch ourselves. So why did we get 0.76 with the library rather than 0.8 here? I think it's simply because there we used stochastic gradient descent with restarts, a cycle multiplier and the Adam optimizer — a few little training tricks. Then there was a question: looking at this, couldn't we totally improve this model, maybe by looking at the date and doing some tricks with the date?

Yes — this is kind of just a regular model in a way, and you can add more features. Exactly. Now that you've seen this, even if you didn't have EmbeddingDotBias in a notebook you'd written yourself, for some other model that's in fast.ai you could look at it in the library and go: oh, that does most of the things I want to do, but it doesn't deal with time — and so you could just go, okay,

let's grab it, copy it, pop it into my notebook and create the better version. Then you can start playing with that: you can create your own model on top of the open source code here. So yes, as Yannet mentioned, there are a couple of things we could do. We could try incorporating timestamps — maybe, for a particular user, over time users tend to get more or less positive about movies. Also, remember there was the list of genres for each movie.

Maybe we could incorporate that too. One problem is that it's a bit difficult to incorporate that stuff into this EmbeddingDotBias model, because it's pretty custom, so what we're going to do next is try to create a neural net version of it. The basic idea is this: we take exactly the same thing as before — here's our list of users and here are their embeddings, here's our list of movies and here are their embeddings (as you can see, I've transposed the movie ones so they're all in the same orientation) — and here is our user/movie/rating table, not the cross-tab: the original format, where each row is a user, a movie and a rating. The first thing I do is replace user 14 with that user's contiguous index. I can do that in Excel using MATCH, which basically says how far down this list you have to go, and it says user 14 was the first thing in the list, user 29 was the second thing in the list, and so forth.

This is the same as what we did in our Python code, where we created a dictionary to map the raw IDs to contiguous indexes. So now, for this particular user/movie/rating combination, we can look up the appropriate embedding. You can see here what it's doing:

it's basically an offset from the start of this list, where the number of rows we go down is equal to the user index and the number of columns we go across is one, two, three, four, five. And so you can see it picks out 0.19, 0.63, 0.31... — here it is, 0.19. That is literally what an embedding does. But remember, this is exactly the same as doing a one-hot encoding, because if instead this were a vector containing one, zero, zero, zero, zero, and we multiplied that by this matrix, then the only row it would return would be the first one. So it's really useful to remember that an embedding really is just a matrix product. The only reason it exists is that it's an optimization: it lets PyTorch know, okay,

this is just a matrix multiply, but I guarantee you the input is one-hot encoded, therefore you don't actually have to do the matrix multiply — you can just do a direct lookup. That's literally all an embedding is: a computational performance trick for a particular kind of matrix multiply. Here's a tiny demonstration of that equivalence.
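This is just an illustrative check (my own snippet, not from the lesson notebook) that an embedding lookup gives the same answer as multiplying a one-hot vector by the embedding's weight matrix:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(num_embeddings=6, embedding_dim=5)  # 6 users, 5 factors
idx = torch.tensor([0])                                 # first user

# 1) direct lookup, which is what nn.Embedding actually does
lookup = emb(idx)

# 2) the mathematically equivalent one-hot matrix multiply
one_hot = F.one_hot(idx, num_classes=6).float()        # shape (1, 6)
matmul = one_hot @ emb.weight                           # shape (1, 5)

print(torch.allclose(lookup, matmul))  # True: same numbers, the lookup is just faster
```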

So that looks up that user's embedding, and then we can do the same to look up that movie's embedding: here is movie ID 417, which is apparently index number 14, so it should come out as 0.75, 0.47... — and yes, it is 0.75, 0.47. We've now got the user embedding and the movie embedding, and rather than taking a dot product of those two, which is what we did before,

what if instead we concatenate the two together into a single vector of length ten and then feed that into a neural net? Any time we've got a tensor of activations — in this case it's actually a tensor of output activations, since

it's coming out of an embedding layer — we can chuck it into a neural net, because neural nets, as we now know, can calculate anything, hopefully including collaborative filtering. So let's try that. Here is our EmbeddingNet. This time I haven't bothered to create a separate bias, because the linear layer in PyTorch already has a bias in it — that's what nn.Linear gives you. Let's draw this out. We've got our U matrix, which is number of users by number of factors, and our M matrix, which is number of movies by number of factors. Remember, we look up a single user and a single movie, grab them, and concatenate them together. So here's the user part,

here's the movie part, and then let's put that through a matrix product. The number of rows in that weight matrix has to be the number of user factors plus the number of movie factors, because that's how long the concatenated vector is, and the number of columns can be anything we want,

because we're going to take that further. In this case we're going to pick 10, apparently, then we stick that through a ReLU, then through another matrix, which obviously needs to have 10 rows here, and whose number of columns is 1 because we want to predict a single rating. So that's our flow chart of what's going on. It's a standard — I'd call it a one-hidden-layer neural net; it depends how you think of it, because there's kind of an embedding layer, but since that is linear and this is linear, the two together is really one linear layer.

The embedding is just a computational convenience, so it really has one hidden layer, because there's just one layer before the nonlinear activation. In order to create a linear layer with some number of rows and some number of columns, you just go nn.Linear. In the machine learning class this week we learned how to create a linear layer from scratch by creating our own weight matrix and our own biases, so if you want to check that out you can do so there — it's the same basic technique.

We've already seen this. We create our embeddings and we create our two linear layers — that's all the stuff we need to start with. Really, if I wanted to make this more general, I would have had another parameter here called something like num_hidden=10; then that would be a parameter, and you could more easily play around with different numbers of activations. So when we say, okay, in this layer

I'm going to create a layer with this many activations, all I mean — assuming it's a fully connected layer — is: how many columns does my linear layer have in its weight matrix? That's how many activations it creates. So in forward we grab our users and movies, put them through our embedding matrices, and then concatenate them together. torch.cat concatenates them on dimension one — in other words, we concatenate the columns together to create longer rows. We'll come back to dropout in a moment;

we've looked at it briefly already. Having done that, we put it through the linear layer we created, then do our ReLU — and you'll notice that relu is again inside capital F, torch.nn.functional; it's just a function. Remember, activation functions are basically things that take one activation in and spit one activation out; in this case, it takes something that can be negative or positive and truncates the negatives to zero.

That's all ReLU does. And then here's our sigmoid. So that is now a genuine neural network — I don't know if we get to call it deep, since it's only got one hidden layer, but it's definitely a neural network — and we can construct it, put it on the GPU, create an optimizer for it, and fit it. A rough sketch of what that module might look like is below.
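As a rough sketch — not the exact lesson notebook code; the layer sizes, the dropout placement and values, and the argument names here are my own assumptions — a concatenate-then-feed-forward version could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50, n_hidden=10,
                 min_rating=1.0, max_rating=5.0):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        # the linear layers already include a bias, so no separate bias embeddings
        self.lin1 = nn.Linear(n_factors * 2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1, self.drop2 = nn.Dropout(0.75), nn.Dropout(0.75)
        self.min_rating, self.max_rating = min_rating, max_rating

    def forward(self, users, movies):
        # concatenate the two embeddings along dim 1: longer rows, same batch size
        x = torch.cat([self.u(users), self.m(movies)], dim=1)
        x = self.drop1(x)
        x = F.relu(self.lin1(x))      # the one hidden layer
        x = self.drop2(x)
        x = self.lin2(x).squeeze(1)   # one output activation: the rating
        # same sigmoid squishing trick as before
        return torch.sigmoid(x) * (self.max_rating - self.min_rating) + self.min_rating
```

In the fast.ai 0.7-era notebooks the training call was along the lines of `fit(model, data, 3, opt, F.mse_loss)`; treat that exact signature as an assumption if you're on a different version of the library.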

Now you'll notice there's one other thing I've been passing to fit, which is: what loss function are we trying to minimize? This is the mean squared error loss, and again it's inside F — pretty much all the functions are in there. One of the things you have to pass fit is something saying how you score it: what counts as good or bad.

"Jeremy, now that we have a real neural net, do we have to use the same number of embedding dimensions for users and movies?" That's a great question — no, you don't, and that's absolutely right. We get a lot of benefits here, because think about it: we're grabbing a user embedding and concatenating it with a movie embedding, which could be some different size; but then also perhaps we looked up the genre of the movie, and there's an embedding matrix of number-of-genres by, I don't know, three or something, so we could concatenate a genre embedding too; and then maybe the timestamp goes in as a continuous number. That whole thing we can then feed into our neural net. And at the end, remember, our final non-linearity was a sigmoid — so we can now recognize that this thing we did, sigmoid times (max rating minus min rating) plus min rating, is actually just another nonlinear activation function. Remember, in our last layer we generally use different kinds of activation functions.

Nonlinear activation function right and remember in our last layer We use generally different kinds of activation functions So as we said we don't need any activation function at all right we could just do that But by not having any nonlinear activation function, we're just making it harder, so that's why we put the sigmoid in there as well, okay so we can then fit it in the usual way and There we go you know interestingly we actually got a better score than we did with our This model So it'll be interesting to try training this with stochastic gradient descent with restarts and see if it's actually better You know maybe you can play around with the number of hidden layers and the dropout and whatever else and see if you can Come up with you know get a better answer than Point Seven six ish Okay, so so general so this is like if you were going deep into collaborative filtering at your workplace Or whatever this wouldn't be a bad way to go like it's like I'd start out with like oh, okay Here's like a collaborative data set 30 and fast AI Get learner there's you know not much I can send it basically number of factors is about the only thing that I pass in I can learn for a while maybe try a few different approaches, and then you're like okay.

that's how far I get if I use the defaults — how do I make it better? Then I'd be digging into the code and saying: okay, what did Jeremy actually do here, and is this actually what I want? One of the nice things about the neural net approach is that, as Yannet mentioned,

we can have different embedding sizes, we can choose how many hidden activations, and we can also choose dropout. So what we're actually doing is: we haven't just got a ReLU — we're also going, okay, let's delete a few things at random. That's dropout.

In this case we were deleting 75% of them after the first linear layer, and then 75% after the second linear layer, so we can add a whole lot of regularization. It kind of feels like this EmbeddingNet is something you could change again.

We could have it so that we pass things into the constructor. If we wanted to make it look as much as possible like what we had before, we could pass in ps=[0.75, 0.75]. I'm not sure that's the best API, but it's not terrible. Probably, since we've got exactly two layers,

we could say p1=0.75, p2=0.75, and then this one would use p1 and this one p2, and away we go. If you wanted to go further, you could make it look more like our structured data learner: you could have a number-of-hidden argument — maybe make it a list — and then, rather than creating exactly one hidden layer and one output layer,

this could be a little loop that creates n hidden layers, each one of the size you want. This is all stuff you can play with during the week if you want to; a rough sketch of that more general constructor is below.
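Here's one way that more general version could look — again just a sketch, with the argument names (n_hidden as a list/tuple, ps for the dropout probabilities) being my own choices rather than anything from the library:

```python
import torch
import torch.nn as nn

class EmbeddingNetFlexible(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50,
                 n_hidden=(10,), ps=(0.75,)):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        layers, n_in = [], n_factors * 2
        # a little loop: one hidden layer per entry in n_hidden,
        # each followed by a ReLU and its own dropout probability from ps
        for n_out, p in zip(n_hidden, ps):
            layers += [nn.Linear(n_in, n_out), nn.ReLU(), nn.Dropout(p)]
            n_in = n_out
        layers.append(nn.Linear(n_in, 1))   # final layer: one rating
        self.layers = nn.Sequential(*layers)

    def forward(self, users, movies):
        x = torch.cat([self.u(users), self.m(movies)], dim=1)
        return self.layers(x).squeeze(1)    # the same sigmoid squishing could go here too
```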

I feel like if you've got a much smaller collaborative filtering data set, maybe you'd need more regularization or whatever, whereas with a much bigger one, maybe more layers would help — I don't know. I haven't seen much discussion of this kind of neural network approach to collaborative filtering, but I'm not a collaborative filtering expert, so maybe it's around; it would be an interesting thing to try. So the next thing I wanted to do was to talk about the training loop: what's actually happening inside the training loop?

At the moment we're basically handing off the actual updating of the weights to PyTorch's optimizer, but what I want to do is understand what that optimizer is actually doing, and I also want to understand what this momentum term is doing. So we have a spreadsheet called graddesc (gradient descent), and it's designed to be read right to left, worksheet-wise. The rightmost worksheet is some data, and we're going to implement gradient descent in Excel — because obviously everybody wants to do deep learning in Excel, and we've done collaborative filtering in Excel, we've done convolutions in Excel, so now we need SGD in Excel so we can replace Python once and for all. Let's start by creating some data. I've got one column of x's and one column of y's, and these are directly linearly related: the x's are random, and each y is equal to x times 2 plus 30. So let's try to use Excel to take that data and learn those parameters — that's our goal. We'll start with the most basic version of SGD. The first thing I'm going to do is run a macro so you can see what this looks like: I hit run, and it does five epochs.

I do another five epochs, then another five epochs. The first value was pretty terrible — it makes the chart hard to see, so I just delete that first one to get better scaling. You can see it's pretty constantly improving the loss; this is the loss per epoch. So how do we do that?

Let's reset it. Here are my x's and my y's, and what I do is start out by assuming some intercept and some slope — these are my randomly initialized weights. I have randomly initialized them both to one. You could pick a different random number if you like, but I promise I randomly picked the number one, twice — it was a random number between one and one. So here are my intercept and slope.

I'm just going to copy them over here — you can literally see this is just =C1, and here is =C2. I'm going to start with my very first row of data: x equals 14 and y equals 58. My goal is, after I look at this piece of data,

to come up with a slightly better intercept and a slightly better slope. To do that, I first need to figure out which direction is down — in other words, if I make my intercept a little bit higher or a little bit lower, does my error get a little bit better or a little bit worse?

Let's start out by calculating the error. To calculate the error, the first thing we need is a prediction, and the prediction is equal to the intercept plus x times the slope — that's our zero-hidden-layer neural network. And here is our error: it's equal to our prediction minus our actual, squared. So we can play around with this.

I don't want my error to be 1849; I'd like it to be lower. So what if we set the intercept to 1.1? 1849 goes to 1840, so a higher intercept would be better. What about the slope — if I increase that, it goes from 1849 to 1730, so a higher slope would be better as well. Not surprising, because we know the answers should actually be 30 and 2. One way to figure that out — to encode it in the spreadsheet — is to do literally what I just did: add a little bit to the intercept and the slope and see what happens. That's called finding the derivative through finite differencing, so let's go ahead and do that. Here is the value of my error if I add 0.01 to my intercept — it's C4 plus 0.01, put into my linear function, then I subtract my actual and square it — and that causes my error to go down a bit.

That's are increasing my Which one is that increasing c4 increasing the intercept a little bit has caused my error to go down So what's the derivative well the derivative is equal to how much the dependent variable changed by? Divided by how much the independent variable changed by right and so there it is right our Dependent variable changed by that minus that Right and our independent variable we changed by 0.01 So there is the estimated value of the error DB All right, so remember when people talking about derivatives This is this is all they're doing is they're saying what's this value?

Remember, when people talk about derivatives, this is all they're doing: asking what this value is as we make the change smaller and smaller and smaller, as it limits to zero. I'm not smart enough to think in terms of derivatives and integrals and things like that, so whenever I think about this I always think about an actual "plus 0.01 divided by 0.01", because I just find that easier — just like I never think about probability density functions, I always think about actual probabilities: toss a coin, something happens three times.

I always think about actual probabilities about toss a coin Something happens three times So I always think like remember. It's it's totally fair to do this because a computer is Discrete it's not continuous like a computer can't do anything infinitely small anyway, right? So it's actually got to be calculating things at some level of precision right and our brains kind of need that as well So this is like my version of Jeffrey Hinton's like to visualize things in more than two dimensions You just like say twelve dimensions really quickly while visualizing it in two dimensions This is my equivalent you know to think about derivatives.

just think about division. Although the mathematicians say you can't do that, you actually can: if you think of dy/dx as literally being change in y over change in x, the calculations still work, pretty much all the time. Okay, so let's do the same thing now, changing my slope by a little bit. Here's the same calculation, and you can see both of these are negative: that's saying if I increase my intercept my loss goes down, and if I increase my slope my loss goes down. My derivative of the error with respect to the slope is actually pretty large, and that's not surprising, because the constant term is just being added whereas the slope is being multiplied by 14. Now, finite differencing is all very well and good, but there's a big problem with finite differencing in high-dimensional spaces, and the problem is this — and this is where you don't need to learn how to calculate derivatives or integrals, but you do need to learn how to think about them spatially. Remember, we have some very high-dimensional vector.

It's got, say, a million items in it, it's going through some weight matrix of size one million by a hundred thousand or whatever, and it's spitting out something of size 100,000. So you need to realize there isn't just "a gradient" here: for every one of the things in this vector, there's a gradient with respect to every part of the output. So it actually has not a single gradient number, not even a gradient vector, but a gradient matrix. That is a lot to calculate: with finite differencing I would literally have to add a little bit to this input and see what happens to all of these outputs, add a little bit to that one and see what happens to all of these, filling in one column at a time. That's going to be horrendously slow. So if you're ever thinking "oh, we can just do this with finite differencing", just remember:

we're dealing with these very high-dimensional vectors, where the matrix calculus concepts are all identical, but when you actually draw it out like this you suddenly realize: for each number I could change, there's a whole bunch of numbers it impacts, and I have this whole matrix of things to compute. Your gradient calculations can take up a lot of memory and a lot of time, so we want to find a way to do this more quickly. It's definitely worth spending time studying these ideas — look up things like "Jacobian" and "Hessian"; those are the things to search for to start with. Unfortunately people normally write about them with lots of Greek letters, but there are some nice intuitive explanations out there, and hopefully you can share them on the forum if you find them, because this is stuff you really need to understand intuitively. When you're trying to train something and it's not working properly — and later on we'll learn how to look inside PyTorch to actually get the values of the gradients — you need to know, okay, how would I plot the gradients,

and what would I consider unusual? These are the things that turn you into a really awesome deep learning practitioner: being able to debug your problems by grabbing the gradients, doing histograms of them, and knowing, for example, that you could plot whether at each layer your average gradients are getting bigger or smaller or whatever. Okay, so the trick to doing this more quickly is to do it analytically rather than through finite differencing. Analytically basically means there's a list — you probably all learned it at high school — literally a list of rules saying, for every mathematical function, this is its derivative. You probably remember a few of them; for example, x squared goes to 2x, and we actually have an x squared here, so here is our "two times". Now the one I actually want you to know is not any of the individual rules: I want you to know the chain rule, which is about having some function of some function of something. Why is this important?

Because that's a linear layer, that's a ReLU, and then we can keep going backwards, and so on. A neural net is just a function of a function of a function of a function, where the pieces are basically linear, ReLU, linear, ReLU, ..., linear, sigmoid or softmax. So it's a function of a function of a function, and therefore to calculate the derivative of the loss of your model with respect to the weights of your model, you're going to need the chain rule. Specifically, whatever layer you're up to — say I want to calculate the derivative here — I'm going to need to use all of these, because that's the whole function that's being applied. That's why they call this back propagation. Basically you can do it like this: call this inner part u, then the derivative is simply equal to the derivative of the outer function times the derivative of u — you just multiply them together. So that's what back propagation is. It's not a new thing for you to learn, it's not a new algorithm: it is literally "take the derivative of every one of your layers and multiply them all together". It doesn't really deserve a new name — "apply the chain rule to my layers" does not deserve a new name — but it gets one, because us neural networks folk really need to seem as clever as possible. It's really important that everybody else thinks we are way outside of their capabilities, so the fact that you're here means we failed, because you somehow think that you're capable.

It's really important when you talk to other people that you say back propagation and Rectified linear unit rather than like multiply the layers Gradients or replace negatives with zeros, okay, so so here we go so here is so I've just gone ahead and Grabbed the derivative unfortunately there is no automatic differentiation in Excel yet So I did the alternative which is to paste the formula into Wolfram Alpha and got back the derivatives So there's the first derivative, and there's the second derivative Analytically we only have one layer in this Infinite tiny small neural network, so we don't have to worry about the chain rule and we should see that this analytical derivative is pretty close to our estimated derivative from the finite differencing and Indeed it is right and we should see that these ones are pretty similar as well, and indeed they are right and if you're you know back when I implemented my own neural nets 20 years ago I You know had to actually calculate the derivatives And so I always would write like had something that would check the derivatives using finite differencing And so for those poor people that do have to write these things by hand You'll still see that they have like a finite differencing checker So if you ever do have to implement a derivative by hand, please make sure that you Have a finite differencing checker so that you can test it All right So there's our derivatives So we know that if we increase B, then we're going to get a slightly better loss, so let's increase B by a bit How much should we increase it by?

So there are our derivatives, and we know that if we increase the intercept we get a slightly better loss — so let's increase it by a bit. How much should we increase it by? We'll increase it by some multiple of the derivative, and the multiple we choose is called the learning rate; here's our learning rate, 1e-4. So our new value is equal to whatever it was before, minus our derivative times our learning rate. The intercept has gone from 1 to about 1.01, and doing the same thing for the slope, it's gone from 1 to about 1.12. (A minimal Python version of this update loop is sketched below.)
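Here's a plain-Python sketch of that same process — one row at a time, using the analytic derivatives — on data generated the same way as the spreadsheet (y = 2x + 30); the number of epochs and the learning rate are just the values mentioned in the lecture:

```python
import random

# make some data like the spreadsheet: y = x * 2 + 30
data = [(x, 2 * x + 30) for x in [random.uniform(0, 100) for _ in range(30)]]

intercept, slope, lr = 1.0, 1.0, 1e-4

for epoch in range(5):
    for x, y in data:                      # mini-batch of size 1: "online" SGD
        pred = intercept + slope * x       # our zero-hidden-layer "neural net"
        d_intercept = 2 * (pred - y)       # analytic derivatives of (pred - y)**2
        d_slope = 2 * (pred - y) * x
        intercept -= lr * d_intercept      # step each weight a little bit downhill
        slope -= lr * d_slope
    print(epoch, intercept, slope)
```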

It's a mini batch of size 1. Okay, so we call this online gradient descent Just means mini batch of size 1 so then we can go on to the next one x is 86 Y is 202 right. This is my intercept and slope copied across from the last row Okay, so here's my new y prediction.

here's my new error, here are my derivatives, and here are my new intercept and slope. We keep doing that for every mini-batch of one, and eventually we reach the end of an epoch. At the end of an epoch we grab our intercept and slope and paste them back over here —

those are our new values, there we are, and we can now continue again. So we're now starting with — oops, I pasted that in the wrong spot; it should be Paste Special, transpose, values. Okay, so there's our new intercept and there's our new slope (possibly I've got those the wrong way around, but you get the idea), and then we continue. I recorded the world's tiniest macro, which literally just copies the final slope and puts it into the new slope, copies the final intercept and puts it into the new intercept, does that five times, and after each time grabs the root mean squared error and pastes it into the next spare area — and that's attached to this Run button, so it will go ahead and do that five times. So that's stochastic gradient descent in Excel. To turn this into a CNN, you would just replace this error function — and therefore this prediction — with the output of the convolutional example spreadsheet, and that would then be a CNN being trained with SGD. Now, the problem is that when I run this, it's going very slowly. We know we need to get to a slope of 2 and an intercept of 30, and you can see that at this rate it's going to take a very long time. Specifically, it keeps going in the same direction, so it's like: come on, take a hint —

That's a good direction So the come on take a hint. That's a good direction. Please keep doing that but more is called momentum Right so on our next spreadsheet We're going to implement momentum Okay, so What momentum does is? The same thing and what to simplify this spreadsheet. I've removed the finite difference in columns, okay, other than that This is just the same right so he's still got our X's our Y's A's and B's our predictions Our error is now over here, okay?

And here are our derivatives. Our new calculation for this particular row — our new value for the a term, just like before — is equal to whatever a was before, minus... now this time it's not the derivative, it's some other number, times the learning rate. So what's this other number?

This other number is equal to the derivative times — what's in K1? — 0.02, plus 0.98 times the thing just above it. So it's a linear interpolation between this row's derivative (this mini-batch's derivative) and whatever direction we went last time. In other words: keep going in the same direction as you were before, then update it a little bit. In our Python code just before, we had a momentum of 0.9. (A sketch of this update is below.)
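As a sketch of that step (the 0.98/0.02 mix is the spreadsheet's choice; the gradients in the loop are made up purely for illustration), one weight's momentum update looks roughly like this:

```python
def momentum_step(weight, grad, prev_step, lr=1e-4, beta=0.98):
    # linear interpolation between this mini-batch's gradient and
    # whatever direction we were already going
    step = beta * prev_step + (1 - beta) * grad
    return weight - lr * step, step   # return the step so the next call can reuse it

weight, prev_step = 1.0, 0.0
for grad in [-86.0, -120.0, -95.0]:   # made-up gradients for illustration
    weight, prev_step = momentum_step(weight, grad, prev_step)
    print(weight, prev_step)
```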

You can see what tends to happen: this step gets more and more negative, all the way up to around 2,000, whereas with our standard SGD approach the derivatives are all over the place — sometimes negative 700, sometimes positive 100. So this is basically saying: if you've been going down for quite a while, keep doing that — until finally, here,

it's like: okay, that seems to be far enough, and it gets less and less negative until eventually we start going positive again. So you can see why it's called momentum: once you start traveling in a particular direction for a particular weight, the wheels start spinning, and then once the gradient turns around the other way, it's like: oh, slow down,

we've got this momentum, and then finally we turn back around. So when we do it this way, we can run exactly the same procedure, and after five iterations we're at 89, whereas before, after five iterations, we were at 104. And after a few more —

let's do maybe 15. Okay, so it's at 102 here — it's going; it's a bit better, not heaps better. You can still see these numbers aren't exactly zipping along, but it's definitely an improvement, and it also gives us something else to tune, which is nice. If this is a reasonably well-behaved error surface — in other words, although it might be bumpy along the way, there's some overall direction, like going down a hill that has bumps on it — then the more momentum you've got, the more you skip over the tops of the bumps. So we could say: okay, let's increase our beta up to 0.98,

There's kind of some overall Direction like imagine you're going down a hill right and there's like bumps on it right so the more momentum You get up. We're going to skipping over the tops right so we could say like okay Let's increase our beta up to point nine eight Right and see if that like allows us to train a little faster and whoa look at that suddenly with 22 All right so one nice thing about things like momentum is it's like another parameter that you can tune to try and make your model train better in practice Basically everybody does this every like you look at any like image net winner or whatever they all use momentum okay, and So Back over here when we said use SGD that basically means use the the basic tab of our Excel spreadsheet But then momentum equals point nine means add in Put a point nine over here okay, and so that that's kind of your like default starting point So let's keep going and talk about Adam So Adam is something which I Actually was not right earlier on in this course.

I said we'd been using Adam by default; we actually haven't — I've noticed we've actually been using SGD with momentum by default. The reason is that Adam, as you'll see, is much, much faster to learn with, but there have been some problems: people haven't been getting quite as good final answers with Adam as they have with SGD with momentum. That's why you'll see all the ImageNet-winning solutions and so forth, and all the academic papers, always use SGD with momentum. Adam seems to be a particular problem in NLP — people really haven't got Adam working there at all.

Well, the good news is that it looks like this was solved two weeks ago. It turned out that the way people were dealing with the combination of weight decay and Adam had a nasty kind of bug in it, and that had carried through to every single library. One of our students, Anand Saha, has just completed a prototype of adding this new version of Adam, called AdamW, into fast.ai, and he's confirmed he's getting both the faster performance and the better accuracy.

So hopefully we'll have AdamW in fast.ai ideally before next week — we'll see how we go, but very soon. So it is worth telling you about Adam. Let's talk about it: it's actually incredibly simple, but again, make sure you make it sound really complicated when you tell people, so you can look clever. Here's the same spreadsheet again, and here are our randomly selected a and b — again, somehow, still one.

Here's our prediction, here are our derivatives. So now, how are we calculating our new a and our new b? You can immediately see it's looking pretty hopeful, because even by row 10 we're seeing the numbers move a lot more — so this is looking pretty encouraging. How are we calculating this? It's equal to our previous value of b, minus J8

(we're going to have to find out what that is) times our learning rate, divided by the square root of L8. So we're going to have to dig in and see what's going on. One thing to notice here is that my learning rate is way higher than it used to be, but then we're dividing by this big number.

Let's start by looking at what this J8 thing is. J8 is identical to what we had before: it's the linear interpolation of the derivative and the previous direction. So that was easy — one part of Adam is to use momentum in exactly the way we just defined. The second piece is dividing by the square root of L8 — what is that?

L8 is another linear interpolation of something and something else — specifically, it's a linear interpolation of F8 squared: the derivative squared, along with this same squared-derivative average from last time. So in other words, we've got two pieces of momentum going on here:

one is calculating the momentum version of the gradient, the other is calculating the momentum version of the gradient squared. We often refer to this idea as an exponentially weighted moving average: it's basically an average of this value and the last one and the one before that and so on, but we're multiplicatively decreasing the older ones, because they get multiplied by 0.9 times 0.9 times 0.9 and so on. (A compact sketch of the whole Adam step is below.)
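Here is a minimal per-weight sketch of the Adam update as described here — just the two moving averages and the division by the square root, leaving out the bias-correction terms that the full algorithm (and PyTorch's optim.Adam) also applies; the gradients in the loop are made up for illustration:

```python
import math

def adam_step(weight, grad, m, v, lr=1.0, beta1=0.9, beta2=0.9, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # moving average of the gradient squared
    weight = weight - lr * m / (math.sqrt(v) + eps)
    return weight, m, v

weight, m, v = 1.0, 0.0, 0.0
for grad in [-86.0, -120.0, -95.0]:           # made-up gradients, just to show the shape of it
    weight, m, v = adam_step(weight, grad, m, v)
    print(weight)
```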

You actually see this, for instance, in the fast.ai code: if you look at fit, we don't just calculate the average loss — and we certainly don't just report the loss for every mini-batch, because that bounces around so much. Instead I say: the average loss is equal to whatever the average loss was last time, times 0.98, plus the loss this time, times 0.02.

So in other words, when the fast.ai library shows you something like the learning rate finder or the loss plot, it's actually showing you the exponentially weighted moving average of the loss. It's a really handy concept — it appears quite a lot. (A tiny version of that running average is sketched below.)
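Just to make that concrete, a smoothed loss like that can be maintained in a couple of lines (the 0.98/0.02 mix is the value mentioned here, and the losses are pretend numbers):

```python
avg_loss = 0.0
for loss in [2.3, 1.9, 2.5, 1.7, 1.6]:          # pretend per-mini-batch losses
    avg_loss = avg_loss * 0.98 + loss * 0.02    # exponentially weighted moving average
    print(round(avg_loss, 4))
```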

The other handy concept to know about is this idea of having two numbers, one of them multiplied by some value and the other multiplied by one minus that value — a linear interpolation between two values. You'll see it all the time, and for some reason deep learning people nearly always use the symbol alpha when they do this. So keep an eye out: if you're reading a paper and you see alpha times blah-blah-blah plus one-minus-alpha times some other blah-blah-blah — because when people read papers, none of us reads every symbol in the equation — you look at it and go: oh, linear interpolation.

We look at it. We go Oh linear interpolation Right and I say something I was just talking to Rachel about yesterday is like Whether we could start trying to find like a new way of writing papers where we literally refactor them Right like it'd be so much better to have written like linear interpolate Blah blah blah comma blah blah blah right because then you don't have to have that pattern recognition right but until we Convinced the world to change how they write papers This is what you have to do is you have to look you know Know what to look for right and once you do suddenly the huge page with formulas Aren't aren't bad at all like you often notice like for example the two things in here like they might be totally identical But this might be at time t and this might be at like time t minus 1 or something Right like it's very often these big ugly formulas turn out to be Really really simple if only they had repacked them Okay So what are we doing with this gradient squared?

What we do with the gradient squared is take the square root and then adjust the learning rate by dividing the learning rate by it. The gradient squared is always positive, so we're taking the exponentially weighted moving average of a bunch of things that are always positive, and then taking the square root of that. So when is this number going to be high?

It's going to be particularly high if the gradient has a lot of variation — if there's a high variance of gradient, then this g-squared thing is going to be a really big number, whereas if the gradient is a roughly constant amount it's going to be smaller, because when you square things, the big ones jump out much more, whereas if there isn't much change it's not going to be as big. So basically, this number at the bottom is going to be high if our gradient is changing a lot. Now, what do you want to do if

you've got something which is first negative, then positive, then small, then large? Well, you probably want to be more careful — you probably don't want to take a big step, because you can't really trust it. So when the variance of the gradient is high, we divide our learning rate by a big number, whereas if our gradient is

a very similar kind of size all the time, then we probably feel pretty good about the step, so we divide by a small amount. This is called an adaptive learning rate. A lot of people have a confusion about Adam — I've seen it on the forum — where they think there's some kind of adaptive learning rate where you're somehow setting different learning rates for different layers or something.

It's like: no, not really. All we're doing is keeping track of the average of the squares of the gradients and using that to adjust the learning rate. There's still one learning rate — in this case it's one — but effectively every parameter, at every step, gets a bigger jump if the gradient has been pretty constant for that weight, and a smaller jump otherwise. And that's Adam.

That's the entirety of Adam, in Excel, so there's now no reason at all why you can't train ImageNet in Excel — you've got access to all the pieces you need. Let's try it out: run. That's not bad — five epochs and we're straight up to 29 and 2. The difference between standard SGD and this is huge, and basically the key difference is that it figured out we need to be moving this number much faster, and so it did. You can see we've now got two different parameters: one is the momentum for the gradient piece, the other is the momentum for the gradient-squared piece. When you want to change them in PyTorch, there's an argument called betas, which is just a tuple of two numbers. "Jeremy, so — I think I understand this concept that when a gradient goes up and down you're not really sure which direction you should go, so you should slow things down, and therefore you adjust the learning rate by that gradient. But how do you implement that — how far do you go?

I guess maybe I missed something earlier on — do you set a number somewhere?" We divide: we divide the learning rate by the square root of the moving average of the gradient squared — that's where we use it. "Oh, I'm sorry, can you be a little more specific?" Sure: D2 is the learning rate,

which is 1, and M27 is our moving average of the squared gradients, so we just go D2 divided by the square root of M27. That's it. "Okay, thanks. I have one question: the new method you just mentioned, which is in the process of getting implemented — AdamW — how different is it from this?"

Okay, let's do that. To understand AdamW we have to understand weight decay — and maybe we'll learn more about that later; let's see how we go for now. The idea is that when you have lots and lots of parameters, as we do with most of the neural nets we train, you very often have more parameters than data points, so regularization becomes important. We've learned how to avoid overfitting using dropout, which randomly deletes some activations in the hope that the model learns a more resilient set of weights. There's another kind of regularization we can use, called weight decay or L2 regularization. It's a classic statistical technique, and the idea is that we take our loss function — our squared-error loss — and add an additional piece to it: the square of the weights. So we'd say plus b squared plus a squared. That is weight decay, or L2 regularization, and the idea is that now the loss function wants to keep the weights small, because increasing the weights makes the loss worse; it will only increase a weight if the loss improves by more than the amount of that penalty. In fact, to make this proper weight decay we need a multiplier on that penalty — if you remember, back in our notebook we said weight decay wd=5e-4,

I would have to multiply by 0.0005 So that's actually now the same weight decay, so If You have a really high weight decay that it's going to set all the parameters to zero So it'll never over fit right because it can't set any parameter to anything And so as you gradually decrease the weight decay a few more weights Can actually be used right, but the ones that don't help much.

It's still going to leave at zero or close to zero, right? So that's what that's what weight decay is is literally to change the loss function to add in this Sum of squares of weights times some parameter some some hyper parameter should see the problem is that If you put that into the loss function as I have here Then it ends up in the moving average of gradients and the moving average of squares of gradients For Adam right and so basically we end up When there's a lot of variation we end up Decreasing the amount of weight decay, and if there's very little variation we end up increasing the amount of weight decay, so we end up basically saying penalize parameters, you know weights that are really high Unless their gradient varies a lot, which is never what we intended right?

That's just not the plan at all. So the trick with AdamW is that we basically remove the weight decay from the loss function, so it's not in g and it's not in g-squared, and instead it's added directly into the update, when we update with the learning rate — in other words, we apply the weight decay right where we calculate the new a and new b, so it never ends up in our g and g-squared. That was a super fast description, which will probably only make sense if you listen to it three or four times on the video and then talk about it on the forum — but if you're interested, let me know and we can also look at Anand's code that implements it. (A rough sketch of the decoupled version is below.)
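As a rough sketch of the difference — not Anand's code or the exact AdamW paper formulation, just the idea of decoupling — the decay term is applied straight to the weight in the update step, and the gradient fed to the moving averages is the gradient of the plain, unpenalized loss:

```python
import math

def adamw_style_step(weight, grad, m, v, lr=1.0, beta1=0.9, beta2=0.9,
                     wd=5e-4, eps=1e-8):
    # grad is the gradient of the *unpenalized* loss: weight decay never
    # enters the moving averages m and v
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Adam-style step, plus a decoupled weight-decay term applied directly to the weight
    weight = weight - lr * (m / (math.sqrt(v) + eps) + wd * weight)
    return weight, m, v
```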

That's implemented this And you know the the idea of using weight decay is it's a really helpful regularizer Because it's basically this way that we can kind of say like You know, please don't increase any of the weight values unless the you know improvement in the loss Is worth it?

Generally speaking, pretty much all state-of-the-art models have both dropout and weight decay, and I don't claim to know how to set each one, or how much of each to use, other than to say it's worth trying both. "To go back to the idea of embeddings — is there any way to interpret the final user and movie embeddings?"

Absolutely — we're going to look at that next week; it's super fun. It turns out we'll learn what some of the worst movies of all time are: it's the John Travolta Scientology one, Battlefield Earth or something — I think that was the worst movie of all time according to our embeddings. "Do you have any recommendations for scaling the L2 penalty, or is that based on how wide the layers are, or how many?" No, I have no suggestion at all — I tend to look for papers or Kaggle competitions or whatever that are similar and, frankly, try to set it up the same. It seems like in a particular area, such as computer vision object recognition, somewhere between 1e-4 and 1e-5 seems to work. Actually, in the AdamW paper the authors point out that with this new approach it seems to be much more stable as to what the right weight decay amounts are, so hopefully once we start playing with it we'll have some more definitive recommendations by the time we get to part 2.

Well, that's 9 o'clock. So this week You know practice the thing that you're least familiar with so if it's like Jacobians and Hessians read about those if it's broadcasting Read about those if it's understanding Python. Oh read about that, you know, try and implement your own custom layers Read the fast AI layers, you know and and talk on the forum about anything that you find Weird or confusing?

All right. See you next week