
Lesson 8 - Practical Deep Learning for Coders 2022


Chapters

0:00 Neural net from scratch
4:46 Parameters in PyTorch
7:42 Embedding from scratch
12:21 Embedding interpretation
18:06 Collab filtering in fastai
22:11 Embedding distance
24:22 Collab filtering with DL
30:25 Embeddings for NLP
34:56 Embeddings for tabular
44:33 Convolutions
57:07 Optimizing convolutions
58:00 Pooling
65:12 Convolutions as matrix products
68:21 Dropout
74:27 Activation functions
80:41 Jeremy AMA
80:57 How do you stay motivated?
83:38 Skew towards big expensive models
86:25 How do you homeschool children
88:26 Walk-through as a separate course
89:59 How do you turn model into a business
92:46 Jeremy's productivity hacks
96:03 Final words

Transcript

So welcome to the last lesson of part 1 of practical deep learning for coders. It's been a really fun time doing this course and depending on when you're watching and listening to this you may want to check the forums or the fast.ai website to see whether we have a part 2 planned which is going to be sometime towards the end of 2022.

Or if it's already past that then maybe there's even a part 2 already on the website. So part 2 goes a lot deeper than part 1 technically in terms of getting to the point that you should be able to read and implement research papers and deploy models in a very kind of real life situation.

So yeah, last lesson we started on the collaborative filtering notebook, and this is where we got to, which is creating your own embedding module. This is a very cool place to start the lesson, because you're going to learn a lot about what's really going on.

And it's really important, before you dig into this, to make sure that you're really comfortable with the "05" linear model and neural net from scratch notebook. So if parts of this are not totally clear, put it aside and redo that notebook, because what we're looking at from here are the abstractions that PyTorch and fastai add on top of functionality that we've built ourselves from scratch.

So if you remember, in the neural network we built from scratch we initialized a number of coefficients, a couple of different layers and a bias term, and then as the model trained we updated those coefficients by going through each layer and subtracting out the gradients multiplied by the learning rate.

You've probably noticed that in PyTorch we don't have to go to all that trouble, and I wanted to show you how PyTorch does this. In PyTorch we don't have to keep track of what our coefficients or parameters or weights are; PyTorch does that for us. The way it does that is it looks inside our module and tries to find anything that looks like a neural network parameter, or a tensor of neural network parameters, and it keeps track of them. So here is a class we've created called T, which is a subclass of Module, and I've created one thing inside it, which is something with the attribute a.

So this is a in the T module, and it just contains three ones. The idea is that maybe we're creating a module and we're initializing some parameter that we want to train. Now, we can find out what trainable parameters, or just what parameters in general, PyTorch knows about in our model by instantiating our model and then asking for the parameters, which you then have to turn into a list; or, in fastcore, we have a thing called capital L, which is like a fancy list that prints out the number of items in the list and shows you those items.

Now, in this case, when we create our object of type T and ask for its parameters, we get told there are zero tensors of parameters, and a list with nothing in it. Now why is that? We actually said we wanted to create a tensor with three ones in it. How would we make those parameters?

Well, the answer is that the way you tell PyTorch what your parameters are is you actually just have to put them inside a special object called an nn.Parameter. This thing almost doesn't do anything. In fact, last time I checked, it quite literally had almost no code in it. Sometimes these things change, but let's take a look.

Yeah, okay, so it's about a dozen or twenty lines of code which does almost nothing: it's got a way of being copied, it's got a way of printing itself, it's got a way of saving itself, and it's got a way of being initialized. So Parameter hardly does anything. The key thing, though, is that when PyTorch checks to see which parameters it should update when it optimizes, it just looks for anything that's been wrapped in this Parameter class.

So if we do exactly the same thing as before, which is to set an attribute containing a tensor with three ones in it, but this time we wrap it in a Parameter, we now get told there's one parameter tensor in this model, and it contains a tensor with three ones. And you can see it also, by default, assumes that we're going to want requires_grad: it's assuming that anything that's a parameter is something that you want to calculate gradients for.
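
Here is a minimal sketch of the two cases just described; the class names below are made up for illustration, but the behaviour being pointed out (a plain tensor attribute is invisible to .parameters(), a wrapped one is tracked) is the same:

```python
import torch
from torch import nn

class TPlain(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.ones(3)                 # just a tensor: PyTorch does not track it

class TParam(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.ones(3))   # wrapped in nn.Parameter: tracked, requires_grad=True

print(list(TPlain().parameters()))   # []  (zero parameter tensors)
print(list(TParam().parameters()))   # one tensor of three ones, with requires_grad=True
```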

Now, most of the time we don't have to do this, because PyTorch provides lots of convenient things for us, such as what you've seen before: nn.Linear, which is something that also creates a tensor. So this would create a tensor of 1 by 3 without a bias term in it.

This has not been wrapped in an nn.Parameter, but that's okay: PyTorch knows that anything which is basically a layer in a neural net is going to be a parameter, so it automatically considers this a parameter. So here's exactly the same thing again: I construct my object of type T, I check for its parameters, and I can see there's one tensor of parameters, and there are our three things. And you'll notice that it's also automatically randomly initialized them, which again is generally what we want.

So PyTorch does go to some effort to try to make things easy for you. So this attribute a is a linear layer, and it's got a bunch of things in it. One of the things in it is the weights, and that's where you'll actually find the parameters: that is the thing of type Parameter. So a linear layer is something that contains attributes of type Parameter.
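
A hedged sketch of that last point: assigning an nn.Linear as an attribute is enough for its weight (itself an nn.Parameter) to show up in the module's parameters, randomly initialised:

```python
import torch
from torch import nn

class T(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(1, 3, bias=False)   # a layer, not an explicit nn.Parameter

t = T()
print(list(t.parameters()))   # one tensor of three randomly initialised weights
print(type(t.a.weight))       # <class 'torch.nn.parameter.Parameter'>
```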

Okay, so what we want to do is create something that works just like this did: something that creates a matrix which will be trained as we train the model. So an embedding is something which is going to create a matrix of this by this, it will be a parameter, and it's something we need to be able to index into, as we did here. So what is happening behind the scenes in PyTorch? It's nice to be able to create these things ourselves from scratch, because it means we really understand them.

So let's create that exact same module that we did last time, but this time we're going to use a function I've created called create_params. You pass in a size, such as in this case n_users by n_factors, and it's going to call torch.zeros to create a tensor of zeros of the size that you request. Then it's going to use a normal, or Gaussian, distribution of mean zero and standard deviation 0.01 to randomly initialize those, and it'll put the whole thing into an nn.Parameter. So this here is going to create an attribute called user_factors, which will be a parameter containing a tensor of normally distributed random numbers of this size, and because it's a parameter, it's going to be available in the module's parameters. So user_bias will be a vector of parameters, user_factors will be a matrix of parameters, movie_factors will be a matrix of n_movies by n_factors, and movie_bias will be a vector of n_movies. And the forward is the same as before, so in the forward we can do exactly what we did before. The thing is, when you put a tensor inside a parameter, it has all the exact same features that a tensor has, so for example we can index into it. So this whole thing is identical to what we had before, and that's actually, believe it or not, all that's required to replicate PyTorch's embedding layer from scratch.

So let's run those and see if it works, and there it is, it's training. When this is done, we'll be able to have a look at, for example, model.movie_bias, and here it is: it's a parameter containing a bunch of numbers that have been trained, as we'd expect. It's got 1665 things in it, because that's how many movies we have.

A question from Jonah Raphael was: does torch.zeros not produce all zeros? Yes, torch.zeros does produce all zeros, but remember that a method that ends in underscore changes in place the tensor it's being applied to. So if you look up PyTorch's normal_, you'll see it fills itself with elements sampled from the normal distribution. So this is actually modifying this tensor in place, and that's why we end up with something which isn't just zeros.
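
Here is a sketch of the from-scratch embedding module being described, following the create_params idea from the lesson notebook; the exact argument names and the y_range default are assumptions, but the structure (zeros, filled in place with normal_(0, 0.01) noise, wrapped in nn.Parameter, then indexed in forward) is the one just explained:

```python
import torch
from torch import nn

def create_params(size):
    # torch.zeros makes the tensor; normal_ fills it in place with N(0, 0.01) noise
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

class DotProductBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors  = create_params([n_users, n_factors])
        self.user_bias     = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias    = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users, movies = x[:, 0], x[:, 1]
        res = (self.user_factors[users] * self.movie_factors[movies]).sum(dim=1)
        res += self.user_bias[users] + self.movie_bias[movies]
        lo, hi = self.y_range
        return torch.sigmoid(res) * (hi - lo) + lo   # the sigmoid-and-scale trick from last lesson
```
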
Now, this is what I find really fun: we trained this model, but what did it do? How is it going about predicting who's going to like what movie? Well, one of the things that's happened is we've created this movie_bias parameter, which has been optimized, and what we can do is find which movie IDs have the highest numbers here and the lowest numbers. I think this is going to start with the lowest, and then we can look inside our data loaders and grab the names of those movies for each of those five lowest numbers. And what's happened here? Well, we can see, broadly speaking, that it has printed out some pretty crappy movies. Why is that? That's because when it does that matrix product that we saw in the Excel spreadsheet last week, it's trying to figure out who's going to like what movie based on previous movies people have enjoyed or not, and then it adds movie bias, which can be positive or negative and is a different number for each movie. So in order to do a good job of predicting whether you're going to like a movie or not, it has to know which movies are crap, and so the crap movies are going to end up with a very low movie bias parameter. So we can actually find out not only which movies do people really not like, but which movies do people like less than one would expect given the kind of movie that it is. So Lawnmower Man 2, for example: not only apparently is it a crappy movie, but based on the kind of movie it is (a high-tech, pop sci-fi movie), people who like those kinds of movies still don't like Lawnmower Man 2. That's what this is meaning. So it's kind of nice that we can use a model not just to predict things, but to understand things about the data.

If we sort by descending, it'll give us the exact opposite: here are movies that people enjoy even when they don't normally enjoy that kind of movie. So for example LA Confidential, a classic film noir detective movie with the Aussie Guy Pearce: even if you don't really like film noir detective movies, you might like this one. Silence of the Lambs, a classic suspense movie: even people who don't normally like serial killer suspense movies tend to like this one.

Now, the other thing we can do is not just look at what's happening in the bias. And by the way, we could do the same thing with users and find out which user just loves movies, even the crappy ones, just likes all movies, and vice versa. But we didn't just have bias, we also had movie_factors, which has got the number of movies as one axis and the number of factors as the other, and we passed in 50. What's in that huge matrix? Well, it's pretty hard to visualize such a huge matrix, and we're not going to talk about the details, but you can do something called PCA, which stands for principal component analysis, and that basically tries to compress those 50 columns down into three columns, and then we can draw a chart of the top two. So this is PCA component number one, and this is PCA component number two, and here's a bunch of movies, and this is a compressed view of the latent factors that it created. You can see that they obviously have some kind of meaning: over here towards the right we've got very pop, mainstream kind of movies, and over here on the left we've got more of the critically acclaimed, gritty kind of movies; towards the top we've got very action-oriented and sci-fi movies, and down towards the bottom we've got very dialogue-driven movies. Remember, we didn't program in any of these things, and we don't have any data at all about what movie is what kind of movie, but thanks to the magic of SGD we just told it to try and optimize these parameters, and the way it was able to predict who would like what movie was that it had to figure out what kinds of movies there are, or what kind of taste there is for each movie. So I think that's pretty interesting. So this is called visualizing embeddings, and then this is visualizing the bias.
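
As a hedged sketch of that interpretation step, assuming the from-scratch model above and a fastai CollabDataLoaders whose titles live in dls.classes['title'] (torch.pca_lowrank stands in here for whatever PCA routine you prefer):

```python
import torch

movie_bias = model.movie_bias.detach()
idxs = movie_bias.argsort()                                        # ascending: lowest bias first
print([dls.classes['title'][int(i)] for i in idxs[:5]])            # liked less than expected
print([dls.classes['title'][int(i)] for i in idxs.flip(0)[:5]])    # liked more than expected

# Compress the 50 latent factors down to a few principal components for plotting
factors = model.movie_factors.detach()
_, _, V = torch.pca_lowrank(factors, q=3)
coords = factors @ V    # (n_movies, 3); scatter-plot columns 0 and 1 to get the chart described
```
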
We obviously would rather not do everything by hand like this, and fastai provides an application for this, collab_learner, so we can create one, and this is going to look much the same as what we just had. We say how many latent factors we want and what the y range is, to use in the sigmoid and the multiply, and then we can call fit and away it goes. So let's see how it does. All right, it's done a bit better than our manual one. Let's take a look at the model it created. The model looks very similar to what we created in terms of the parameters: you can see these are the two embeddings and these are the two biases. And we can do exactly the same thing: we can look in that model, and you'll see it's not called "movies", it's "i" for items, so it's users and items. This is the item bias, so we can look at the item bias, grab the weights, sort, and we get a very similar result. In this case it's even more confident that LA Confidential is a movie that you should probably try watching even if you don't like those kinds of movies, and Titanic's right up there as well: even if you don't really like romancey kind of movies you might like this one, and even if you don't like classic detective you might like this one.

We can have a look at the source code for collab_learner, and we can see that use_nn is false by default, so our model is going to be of this type, EmbeddingDotBias. So we can take a look at that, and here it is, and look, this does look very similar. It's creating an embedding using the size we requested for each of users by factors and items by factors, for users and items, and then in the forward it's grabbing each thing from the embedding, it's doing the multiply, it's adding it up, and it's doing the sigmoid. So yeah, it looks exactly the same. Isn't that neat? So you can see that what's actually happening in real models is not that weird or magic.

Kurian is asking: is PCA useful in any other areas? And the answer is absolutely, and what I suggest you do, if you're interested, is check out our computational linear algebra course. It's five years old now, but this is stuff which hasn't changed for decades, really, and it will teach you all about things like PCA and so on. It's not nearly as directly practical as Practical Deep Learning for Coders, but it's definitely very interesting, and it's the kind of thing which, if you want to go deeper, can become pretty useful later along your path.

Okay, so here's something else interesting we can do. Let's grab the movie factors: that's in our model, it's the item weights, and it's the weight attribute that PyTorch creates. Now we can convert the movie Silence of the Lambs into its class ID, and we can do that with object-to-ID, o2i, for the titles, and so that's the movie index of Silence of the Lambs. What we can do now is look through all of the movies in our latent factors and calculate how far apart each embedding vector is from this one. This cosine similarity is very similar to the Euclidean distance (the root sum squared of the differences), but it normalizes it, so it's basically the angle between the vectors. So this is going to calculate how similar each movie is to Silence of the Lambs based on these latent factors, and then we can find which ID is the closest. So based on this embedding distance, the closest is Dial M for Murder, which makes a lot of sense.
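
A sketch of that embedding-distance lookup, assuming a fastai collab_learner whose item embedding is learn.model.i_weight and whose title string matches how it is stored in the data:

```python
from torch import nn

movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']     # object-to-id lookup
sims = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
closest = sims.argsort(descending=True)[1]    # index 0 is the movie itself
print(dls.classes['title'][int(closest)])
```
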
I'm not going to discuss it today, but in the book there's also some discussion about what's called the bootstrapping problem, which is the question of: if you've got a new company or a new product, how would you get started making recommendations, given that you don't have any previous history with which to make recommendations? That's a very interesting problem that you can read about in the book.

Now, that's one way to do collaborative filtering, where we do that matrix completion exercise using all those dot products. There's a different way, however, which is to use deep learning. To do it with deep learning, what we could do is basically create our user and item embeddings as per usual, and then create a sequential model. A sequential model is just layers of a deep learning neural network, in order. In forward we could just concatenate the user and item embeddings together and then do a ReLU, so this is basically a single hidden layer neural network, and then a linear layer at the end to create a single output. So this is the world's most simple neural net, exactly the same as the style we created back in our neural net from scratch, but we're using PyTorch's functionality to do it more easily. So in the forward here, in exactly the same way as before, we'll look up the user embeddings and we'll look up the item embeddings, and then this is new: this is where we concatenate those two things together and put them through our neural network, and then finally do our sigmoid.

One thing different this time is that we're going to ask fastai to figure out how big our embeddings should be. fastai has something called get_emb_sz, and it just uses a rule of thumb that says for 944 users we recommend 74-factor embeddings, and for 1665 movies (or is it the other way around, I can't remember) we recommend 102-factor embeddings. So that's what those sizes are. Now we can create that model, pop it into a learner, and fit in the usual way.

Rather than doing all that from scratch, you can do exactly the same thing we've done before, which is to call collab_learner, but pass in the parameter use_nn=True, and you can then say how big you want each layer. So this is going to create a two hidden layer deep learning neural net, the first with 100 activations and the second with 50, and then you can say fit and away it goes. Okay, so we got 0.87, so these are doing less well than our dot product version, which is not too surprising, because the dot product version is really trying to take advantage of our understanding of the problem domain. In practice, nowadays, a lot of companies create a combined model that has a dot product component and also has a neural net component. The neural net component is particularly helpful if you've got metadata, for example information about your users: when did they sign up, how old are they, what sex are they, where are they from. Those are all things that you could concatenate in with your embeddings, and ditto with metadata about the movie: how old is it, what genre is it, and so forth.
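
Here is a sketch of that deep-learning version: look up both embeddings, concatenate them, and push the result through a small neural net. The sizes in the comments are the ones the get_emb_sz rule of thumb suggested above; n_act=100 is just an illustrative hidden-layer width:

```python
import torch
from torch import nn

class CollabNN(nn.Module):
    def __init__(self, user_sz, item_sz, n_act=100, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors = nn.Embedding(*user_sz)    # e.g. (944, 74)
        self.item_factors = nn.Embedding(*item_sz)    # e.g. (1665, 102)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
        out = self.layers(torch.cat(embs, dim=1))     # concatenate, then the hidden layer
        lo, hi = self.y_range
        return torch.sigmoid(out) * (hi - lo) + lo
```

The collab_learner shortcut mentioned above would be something like collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100, 50]).
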
All right, so we've got a question from Jonah which I think is interesting, and the question is: is there an issue where the bias components are overwhelmingly determined by the non-experts in a genre? In general, actually, there's a more general issue, which is that in collaborative filtering recommendation systems, very often a small number of users or a small number of movies overwhelm everybody else, and the classic one is anime. A relatively small number of people watch anime, and that group of people watch a lot of anime. So in movie recommendations there's a classic problem, which is that every time people try to make a list of well-loved movies, all the top ones tend to be anime. You can imagine what's happening in the matrix completion exercise: there are some users that just really watch this one genre of movie, and they watch an awful lot of them. So in general you actually do have to be pretty careful about these subtle kinds of issues. We're not going to go into the details of how to deal with them, but they generally involve taking various kinds of ratios, or normalizing things, and so forth.

All right, so that's collaborative filtering. I wanted to show you something interesting then about embeddings, which is that embeddings are not just for collaborative filtering. In fact, if you've heard about embeddings before, you've probably heard about them in the context of natural language processing. So you might have been wondering, back when we did the Hugging Face Transformers stuff, how did we go about using text as inputs to models? We talked about how you can turn words into integers. Here's the poem: "I am Sam, I am Daniel, I am Sam, Sam I am, that Sam I am", etc. We can find a list of all the unique words in that poem and make this list here, and then we can give each of those words a unique ID, just arbitrarily; in this case it's actually alphabetical order, but it doesn't have to be. We kind of talked about that, and that's what we do with categories in general. But how do we turn those into lists of numbers? You might not be surprised to hear that what we do is create an embedding matrix. So here's an embedding matrix containing four latent factors for each word in the vocab: here's each word in the vocab, and here's the embedding matrix.

If we then want to present this poem to a neural net, what we do is we list out our poem, "I do not like that Sam I am, do you like green eggs and ham", etc., and then for each word we look it up. In Excel, for example, we use MATCH, so that will find this word over here, and find that it's word ID 8, and then we will find the eighth word and its embedding. So that's going to be 0.22, then 0.1, 0.01, and here it is: 0.22, 0.1, 0.1, etc. So this is the embedding matrix we end up with for this poem, and if you wanted to train and use a neural network on this poem, you'd basically turn it into this matrix of numbers. So this is what an embedding matrix looks like in an NLP model, and it works exactly the same way, as you can see. And then you can do exactly the same things in terms of interpretation of an NLP model, by looking at both the bias factors and the latent factors in a word embedding matrix.
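
A small sketch of the same lookup trick for text, using a toy vocab; the words and the four-factor size mirror the spreadsheet example, everything else is made up:

```python
import torch
from torch import nn

words = 'i am sam daniel do not like that you green eggs and ham'.split()
vocab = sorted(set(words))
stoi = {w: i for i, w in enumerate(vocab)}     # word -> integer id (alphabetical here, but arbitrary)

emb = nn.Embedding(len(vocab), 4)              # 4 latent factors per word, randomly initialised

sentence = 'i do not like green eggs and ham'.split()
ids = torch.tensor([stoi[w] for w in sentence])
print(emb(ids).shape)                          # torch.Size([8, 4]): one embedding row per word
```
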
So hopefully you're getting the idea here that our different models, and the inputs to them, are based on a relatively small number of basic principles, and these principles are generally things like "look up something in an array", and then inside the model we're basically multiplying things together, adding them up, and replacing the negatives with zeros. Hopefully you're getting the idea that what's going on inside a neural network is generally not that complicated, but it happens very quickly and at scale.

Now, it's not just collaborative filtering and NLP, but also tabular analysis. In chapter 9 of the book we talked about how random forests can be used for this; this is the thing where we're predicting the auction sale price of industrial heavy equipment like bulldozers. Instead of using a random forest, we can use a neural net. Now, in this data set there are some continuous columns and there are some categorical columns. I'm not going to go into the details too much, but in short, we can separate out the continuous columns and categorical columns using cont_cat_split, and that will automatically find which is which based on their data types. In this case it looks like the elapsed sale date (so I think it's the number of seconds or years or something since the start of the data set) is a continuous variable, and then here are the categorical variables: for example, there are six different product sizes, a couple of coupler systems, five thousand and fifty-nine model descriptions, six enclosures, seventeen tire sizes, and so forth.

So we can use fastai to basically say: okay, take that data frame, pass in the categorical and continuous variables, create some random splits, and say what the dependent variable is, and we can create data loaders from that. From that we can create a tabular learner, and basically what that's going to do is create a pretty regular multi-layer neural network, not that different to this one that we created by hand, and for each of the categorical variables it's going to create an embedding. I can actually show you this. We're going to use tabular_learner to create the learner, and tabular_learner is one, two, three, four, five, six, seven, eight, nine lines of code, and basically the main thing it does is create a TabularModel. And TabularModel, you're not going to understand all of it, but you might be surprised at how much you do. A TabularModel is a module; we pass in how big each embedding is going to be, and tabular_learner is going to call get_emb_sz, just like we did manually before, but automatically, so that's how it gets its embedding sizes. Then it's going to create an embedding for each of those embedding sizes, from number of inputs to number of factors. Dropout we're going to come back to later; batch norm we won't do till part two. Then it's going to create a layer for each of the layers we want, which is going to contain a linear layer followed by batch norm followed by dropout, and it's going to add the sigmoid range we've talked about at the very end. And the forward, this is the entire thing: if there are some embeddings, it'll go through and get each of the embeddings using the same indexing approach we've used before, it'll concatenate them all together, and then it'll run that through the layers of the neural net, which are these. So we don't know all of those details yet, but we know quite a few of them, so that's encouraging, hopefully. And once we've got that, we can do the standard lr_find and fit.
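
Here is a hedged sketch of that fastai tabular pipeline, assuming df is the already-prepared bulldozers DataFrame with a log-transformed SalePrice column (the column name, max_card, layer sizes, and y_range follow the book's chapter 9 and may need adjusting):

```python
from fastai.tabular.all import *

# Split columns into continuous and categorical based on cardinality and dtype
cont, cat = cont_cat_split(df, max_card=9000, dep_var='SalePrice')

to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=cat, cont_names=cont, y_names='SalePrice',
                   splits=RandomSplitter()(range_of(df)))
dls = to.dataloaders(bs=1024)

# tabular_learner builds a TabularModel: one embedding per categorical column
# (sizes from get_emb_sz), then linear + batch norm + dropout layers, then sigmoid_range
learn = tabular_learner(dls, layers=[500, 250], y_range=(8, 12), n_out=1,
                        loss_func=F.mse_loss)
learn.lr_find()
learn.fit_one_cycle(5, 1e-2)
```
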
Now, a very similar data set was used in a Kaggle competition (it wasn't this data set, sorry, it was a different one, about predicting the amount of sales in different stores), and the third place getter published a paper about their technique, and it's almost exactly the one I'm showing you here. They used this basic kind of technique, and one of the interesting things is that they used a lot less manual feature engineering than the other high-placed entries; they had a much simpler approach. They published a paper about their approach, so this is the team from this company, and they basically describe here exactly what I just showed you: these different embedding layers being concatenated together and then going through a couple of layers of a neural network. And it points out in the paper exactly what we learned in the last lesson, which is that embedding layers are exactly equivalent to linear layers on top of a one-hot encoded input. They found that their technique worked really well. One of the interesting things they also showed is that you can create your neural net, get your trained embeddings, and then put those embeddings into a random forest or gradient boosted tree, and your mean average percent error will dramatically improve. So you can actually combine random forests and embeddings, or gradient boosted trees and embeddings, which is really interesting.

Now, what I really wanted to show you, though, is what they then did. As I said, this was about predicting the amount that different products would sell for at different shops around Germany, and one of their embedding matrices was embeddings by region. They did, I think, a PCA (principal component analysis) of the embeddings for their German regions, and when they drew a chart of them, you can see that the locations that are close together in the embedding matrix are the same locations that are close together in Germany. You can see here are the blue ones, and here are the blue ones. And again, it's important to recognize that the data they used had no information about the location of these places; the fact that they are close together geographically is something that was figured out as being something that actually helped it to predict sales. In fact, they then did a plot where each of these dots is a store, showing for each pair of stores how far away they are in real life, in metric space, and then how far away they are in embedding space, and there's this very strong correlation. So it's kind of reconstructed the geography of Germany by figuring out how people shop. And similar for days of the week: there was no information really about days of the week, but when they put it on the embedding matrix, the days of the week Monday, Tuesday and Wednesday are close to each other, Thursday and Friday are close to each other, as you can see, and Saturday and Sunday are close to each other, and ditto for months of the year: January, February, March; April, May, June. So yeah, really interesting, cool stuff, I think, about what's actually going on inside a neural network.
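
As a rough, assumption-heavy sketch of the embeddings-into-a-random-forest idea mentioned above: replace each categorical code with its trained embedding vector and fit an ordinary sklearn model on the result (learn is the tabular learner from before; TabularModel keeps its embeddings in learn.model.embeds):

```python
import numpy as np
import torch
from sklearn.ensemble import RandomForestRegressor

def embed_features(learn, xs_cat, xs_cont):
    # xs_cat: (n, n_cat) integer category codes, xs_cont: (n, n_cont) floats
    embs = learn.model.embeds
    cols = [embs[i](torch.as_tensor(xs_cat[:, i], dtype=torch.long)).detach().numpy()
            for i in range(xs_cat.shape[1])]
    return np.concatenate(cols + [xs_cont], axis=1)

# X = embed_features(learn, cat_codes, cont_vals)
# rf = RandomForestRegressor(n_estimators=100).fit(X, y)
```
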
All right, let's take a 10-minute break, and I will see you back here at 7:10.

All right folks, this is something I think is really fun. We've looked at what goes into the start of a model, the input, and we've learned how inputs can be categories or embeddings (and embeddings are basically one-hot encoded categories with a little compute trick), or they can just be continuous numbers. We've looked at what comes out the other side, which is a bunch of activations, just a tensor of numbers, which we can use things like softmax to constrain to add up to one, and so forth. And we've looked at what can go in the middle, which is the matrix multiplies sandwiched together with rectified linear units. I mentioned that there are other things that can go in the middle as well, but we haven't really talked about what those other things are, so I thought we might look at one of the most important and interesting of the things that can go in the middle. What you'll see is that it turns out it's actually just another kind of matrix multiplication, which might not be obvious at first, but I'll explain. We're going to look at something called a convolution.

Convolutions are at the heart of a convolutional neural network, and the first thing to realize is that a convolutional neural network is very, very similar to the neural networks we've seen so far. It's got inputs, and it's got things that are a lot like, or actually are, a form of matrix multiplication sandwiched with activation functions, which can be rectified linear. But there's a particular thing which makes them very useful for computer vision, and I'm going to show you using this Excel spreadsheet that's in our repo, called conv-example, and we're going to look at it using an image from MNIST. MNIST is the world's most famous computer vision data set, I think because it was the first one which really showed image recognition being cracked. It's pretty small by today's standards: it's a data set of handwritten digits, each one 28 by 28 pixels. But back in the mid 90s, Yann LeCun showed really practically useful performance on this data set, and as a result convnets ended up being used in the American banking system for reading checks.

So here's an example of one of those digits. This is a seven that somebody drew, one of those ones with a stroke through it, and this is what it looks like; this is the image. It's just one of the images from MNIST, which I put into Excel. What you see in the next column is a version of the image where the horizontal lines are being recognized, and another one where the vertical lines are being recognized. And if you think back to that Zeiler and Fergus paper that talked about what the layers of a neural net do, this is absolutely an example of something that we know the first layer of a neural network tends to learn how to do. Now, how did I do this? I did this using something called a convolution.

So what we're going to do now is zoom in to this Excel notebook, and keep zooming in, and keep zooming in. Keep an eye on this image, and you'll see that once we zoom in enough, it's actually just made of numbers, which, as we saw in the very first lesson, is how images work: they're made of numbers. So here they are, numbers between zero and one (I've just rounded them off to the nearest decimal, they're actually more precise than that), and I used a little trick, Microsoft Excel's conditional formatting, to basically make the higher numbers more red. That's how I turned this Excel sheet into this picture. So here is the image as numbers. So let me show you how we went about creating this top edge detector. What we did was we created this formula. Don't worry about the max; let's focus on this part.
What it's doing is this: have a look at the colored-in areas. It's taking each of these cells and multiplying them by each of these cells, and then adding them up, and then we do the rectified linear part, which is: if that ends up less than zero, then make it zero. So this is like a rectified linear unit, but it's not doing the normal matrix product; it's doing the equivalent of a dot product, but just on these nine cells and with just these nine weights. You might not be surprised to hear that if I move one to the right, then now it's using the next nine cells, and if I move to the right quite a bit and down quite a bit, here it's using these nine cells. So it's still doing a dot product, which as we know is a form of matrix multiplication, but it's doing it in a way that takes advantage of the geometry of the situation: the things that are close to each other are being multiplied by this consistent group of the same nine weights each time. There are actually 28 by 28 numbers here, which is 784, but we don't have 784 parameters; we only have nine parameters.

So this is called a convolution. A convolution is where you basically slide this little three by three matrix across a bigger matrix, and at each location you do a dot product of the corresponding elements of that three by three with the corresponding elements of this three by three matrix of coefficients. Now, why does that create something that finds, as you see, top edges? Well, it's because of the particular way I constructed this three by three matrix: all of the cells in the row just above are going to get a one, all of the ones in the row just below are going to get a minus one, and all of the ones in the middle are going to get a zero.

So let's think about what happens somewhere like here, right? That is, let's try to find the right one, here it is. So here we're going to get 1 times 1 plus 1 times 1 plus 1 times 1 minus 1 times 1 minus 1 times 1 minus 1 times 1, we're going to get 0.

But what about up here? Here we're going to get 1 times 1 plus 1 times 1 plus 1 times 1, these do nothing because they're times 0, minus 1 times 0. So we're going to get 3. So we're only going to get 3, the highest possible number, in the situation where these are all as black as possible, or in this case as red as possible, and these are all white.

And so that's only going to happen at a horizontal edge. So the one underneath it does exactly the same thing, exactly the same formulas. Oopsie dozy. The one underneath are exactly the same formulas. The 3 by 3 sliding thing here, but this time we've got a different matrix, different little mini matrix of coefficients, which is all ones going down and all minus ones going down.

And so for exactly the same reason, this will only be 3 in situations where they're all 1 here and they're all 0 here. So you can think of a convolution as being a sliding window of little mini dot products of these little 3 by 3 matrices. And they don't have to be 3 by 3, right?

You could have, we could just have easily done 5 by 5, and then we'd have a 5 by 5 matrix of coefficients, or whatever, whatever size you like. So the size of this is called its kernel size. This is a 3 by 3 kernel for this convolution. So then, because this is deep learning, we just repeat the, we just repeat these steps again and again and again.
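
Here is a minimal sketch of that sliding-window edge detector in code, assuming img is a 28x28 MNIST digit scaled to 0-1 (a random tensor stands in for it here); F.conv2d does the same dot-product-at-every-location as the spreadsheet formulas, and relu clamps negatives to zero:

```python
import torch
import torch.nn.functional as F

top_edge = torch.tensor([[ 1.,  1.,  1.],    # row above: +1
                         [ 0.,  0.,  0.],    # middle row: 0
                         [-1., -1., -1.]])   # row below: -1

img = torch.rand(28, 28)                     # stand-in for the MNIST seven
out = F.conv2d(img[None, None],              # shape (batch=1, channels=1, 28, 28)
               top_edge[None, None])         # shape (out_channels=1, in_channels=1, 3, 3)
out = F.relu(out)                            # the rectified linear part of the formula
print(out.shape)                             # torch.Size([1, 1, 26, 26]): the window stays inside the edges
```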

So this is, this layer I'm calling conv1, it's the first convolutional layer. So conv2, it's going to be a little bit different, because on conv1 we only had a single channel input. It's just black and white, or you know, yeah, black and white, grayscale, one channel. But now we've got two channels.

We've got (let's make it a little smaller so we can see better) the horizontal edges channel and the vertical edges channel. And we'd have a similar thing in the first layer if it were a colour image: we'd have a red channel, a green channel, and a blue channel.

Our filter, our filter now contains a 3 by 3 by depth 2, or if you want to think of another way, 2 3 by 3 kernels, or 1 3 by 3 by 2 kernel. And we basically do exactly the same thing, which is we're going to multiply each of these by each of these and sum them up.

But then we do it for the second bit as well: we multiply each of these by each of these and sum them up. And so that gives us this, and I think I just picked some random numbers here for the filter, right? So this is now going to be something which can combine the two channels: each of the red ones times each of the blue ones, that's here, plus each of the green ones times each of the mauve ones, that's here.

So this first filter is being applied to the horizontal edge detector and the second filter is being applied to the vertical edge detector. And as a result we can end up with something that combines features of the two things. And so then we can have a second channel over here, which is just a different bunch of convolutions for each of the two channels, this one times this one.

Again you can see the colors. So what we could do is if, you know, once we kind of get to the end, we'll end up, as I'll show you how in a moment, we'll end up with a single set of 10 activations, one per digit we're recognising, 0 to 9, or in this case I think we could just create one, you know, maybe we're just trying to recognise nothing but the number, number seven, or not the number seven, so we could just have one activation.

And then we would back propagate through this using SGD in the usual way and that is going to end up optimising these numbers. So in this case I manually put in the numbers I knew would create edge detectors. In real life you start with random numbers and then you use SGD to optimise these parameters.
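
A short sketch of that second-layer idea: the input now has two channels (the horizontal-edge and vertical-edge activations), so each filter is 3 by 3 by 2, and in real life its numbers start out random and get optimised by SGD:

```python
import torch
import torch.nn.functional as F

acts = torch.rand(1, 2, 26, 26)       # two channels of first-layer activations
w = torch.randn(2, 2, 3, 3)           # 2 output channels, each a 3x3x2 filter, randomly initialised
out = F.relu(F.conv2d(acts, w))
print(out.shape)                      # torch.Size([1, 2, 24, 24])
```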

Okay so there's a few things we can do next and I'm going to show you the way that was more common a few years ago and then I'll explain some changes that have been made more recently. What happened a few years ago was we would then take these these activations, which as you can see these activations now kind of in a grid pattern, and we would do something called max pooling.

And max pooling is kind of like a convolution, it's a sliding window, but this time as the sliding window goes across, so here we're up to here, we don't do a dot product over a filter, but instead we just take a maximum. See here, just this is the maximum of these four numbers and if we go across a little bit this is the maximum of these four numbers.

Go across a bit, go across a bit and so forth, oh that goes off the edge. And you can see what happens when this is called a 2 by 2 max pooling. So you can see what happens with a 2 by 2 max pooling, we end up losing half of our activations on each dimension.

So we're going to end up with only one quarter of the number of activations we used to have. And that's actually a good thing because if we keep on doing convolution, max pool, convolution, max pool, we're going to get fewer and fewer and fewer activations until eventually we'll just have one left, which is what we want.

That's effectively what we used to do, but the other thing I mentioned is we didn't normally keep going until there's only one left. What we used to then do is we'd basically say okay at some point we're going to take all of the activations that are left and we're going to basically just do a dot product of those with a bunch of coefficients, not as a convolution but just as a normal linear layer, and this is called the dense layer.

And then we would add them all up. So we basically end up with our final big dot product of all of the max pooled activations by all of the weights, and we do that for each channel. And so that would give us our final activation. And as I say here, MNIST would actually have 10 activations, so you'd have a separate set of weights for each of the digits you're predicting, and then softmax after that.
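
Here is a hedged sketch of that older conv / max-pool / dense pattern for MNIST-sized input; the channel counts are arbitrary, the structure is the point:

```python
import torch
from torch import nn

old_style = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),    # 28x28 -> 28x28, 8 channels
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 2x2 max pooling: 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 14x14 -> 7x7
    nn.Flatten(),                                 # 16 * 7 * 7 activations left
    nn.Linear(16 * 7 * 7, 10),                    # the dense layer: one output per digit
)

x = torch.rand(64, 1, 28, 28)                     # a batch of MNIST-sized images
print(old_style(x).shape)                         # torch.Size([64, 10]); softmax would follow
```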

Okay, nowadays we do things very slightly differently. Nowadays we normally don't have max pool layers. Instead, when we do our sliding window like this one here, we don't normally move it along one cell at a time. Let's go back and see: currently we're starting in column G; if I go one to the right, the next one is column H, and if I go one to the right again, the next one starts in column I.

So you can see it's sliding the three by three window along one column at a time. Nowadays what we tend to do instead is we generally skip one, so we would normally only look at every second position: after doing column I, we would skip column J and go straight to column K.

And that's called a stride-2 convolution. We do that both across the rows and down the columns, and what that means is that every time we do a convolution we reduce our effective feature size, our grid size, by two on each axis, so it reduces the number of activations by four in total.

So that's basically instead of doing max pooling. And then the other thing that we do differently is nowadays we don't normally have a single dense layer at the end, a single matrix multiply at the end. But instead what we do, we generally keep doing stride two convolutions. So each one's going to reduce the grid size by two by two.

We keep going down until we've got about a seven by seven grid. And then we do a single pooling at the end. And we don't normally do max pool nowadays; instead we do an average pool. So we average the activations of each one of the seven by seven features.
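
And here is a matching sketch of the modern pattern: stride-2 convolutions instead of max pooling, then a single average pool over the final 7x7 grid (again, the channel counts are just illustrative):

```python
import torch
from torch import nn

modern = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),    # 28x28 -> 14x14
    nn.ReLU(),
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),   # 14x14 -> 7x7
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                                 # average over the 7x7 grid
    nn.Flatten(),
    nn.Linear(16, 10),
)

x = torch.rand(64, 1, 28, 28)
print(modern(x).shape)                                       # torch.Size([64, 10])
```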

This is actually quite important to know, because if you think about what that means, it means that something like an ImageNet-style image detector is going to end up with a seven by seven grid, trying to say, is this a bear? And in each of the parts of the seven by seven grid it's basically saying, is there a bear in this part of the photo?

Is there a bear in this part of the photo? Is there a bear in this part of the photo? And then it takes the average of those 49 seven by seven predictions to decide whether there's a bear in the photo. That works very well if it's basically a photo of a bear, right?

Because if the bear is big and takes up most of the frame, then most of those seven by seven bits are bits of a bear. On the other hand, if it's a teeny tiny bear in the corner, then potentially only one of those 49 squares has a bear in it.

And even worse, if it's like a picture of lots and lots of different things, only one of which is a bear, it could end up not being a great bear detector. And so this is where like the details of how we construct our model turn out to be important.

And so if you're trying to find like just one part of a photo that has a small bear in it, you might decide to use maximum pooling instead of average pooling. Because max pooling will just say, "I think this is a picture of a bear if any one of those 49 bits of my grid has something that looks like a bear in it." So these are, you know, these are potentially important details which often get hand-waved over.

Although, you know, again, like the key thing here is that this is happening right at the very end, right? That max pool or that average pool. And actually FastAI handles this for you. We do a special thing which we kind of independently invented. I think we did it first, which is we do both max pool and average pool and we concatenate them together.

We call that concat pooling, and that has since been reinvented in at least one paper. And so that means that you don't have to think too much about it, because we're going to try both for you basically.
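
Here is a small sketch of that concat pooling idea spelled out by hand (fastai's built-in version is AdaptiveConcatPool2d; this is just the same idea written out):

```python
import torch
from torch import nn

class ConcatPool2d(nn.Module):
    def __init__(self):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d(1)
        self.max = nn.AdaptiveMaxPool2d(1)

    def forward(self, x):
        # concatenating doubles the channel count, so later layers can use either signal
        return torch.cat([self.max(x), self.avg(x)], dim=1)

feats = torch.rand(64, 16, 7, 7)        # the final 7x7 grid of activations
print(ConcatPool2d()(feats).shape)      # torch.Size([64, 32, 1, 1])
```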

So I mentioned that this is actually really just matrix multiplication. To show you that, I'm going to show you some images created by a guy called Matthew Kleinsmith, who did this back in, I think, the very first ever course, or it might have been the first part two course. And he basically pointed out that, in a certain way of thinking about it, it turns out that convolution is the same thing as a matrix multiply. So I want to show you how he shows this. He basically says, "Okay, let's take this 3x3 image and a 2x2 kernel containing the coefficients alpha, beta, gamma, delta." And so, as we slide the window over, each of the colors are multiplied together: red by red, plus green by green, plus, what is that, orange by orange, plus blue by blue, gives you this.

And so to put it another way, algebraically P equals alpha times A plus beta times B, etc. And so then as we slide to this part, we're multiplying again, red by red, green by green, and so forth. So we can say Q equals alpha times B plus beta times C, etc.

And so this is how we calculate a convolution using the approach we just described as a sliding window. But here's another way of thinking about it. We could say, "Okay, we've got all these different things, A, B, C, D, E, F, G, H, J. Let's put them all into a single vector and then let's create a single matrix that has alpha, alpha, alpha, alpha, beta, beta, beta, beta, etc.

And then if we do this matrix multiplied by this vector, we get this with these gray zeros in the appropriate places, which gives us this, which is the same as this. And so this shows that a convolution is actually a special kind of matrix multiplication. It's a matrix multiplication where there are some zeros that are fixed and some numbers that are forced to be the same.
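
A small numeric check of that claim, using made-up values: a 2x2 convolution over a 3x3 image gives the same answer as a 4x9 matrix (with tied weights and fixed zeros) times the flattened image:

```python
import torch
import torch.nn.functional as F

img = torch.arange(9.).reshape(1, 1, 3, 3)        # the 3x3 image, flattened in row order
k = torch.tensor([[1., 2.],
                  [3., 4.]])                      # stands in for alpha, beta, gamma, delta

conv_out = F.conv2d(img, k[None, None]).flatten() # the sliding-window version: P, Q, R, S

# Build the equivalent matrix: one row per output position, zeros everywhere else
M = torch.zeros(4, 9)
for row, (r, c) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):   # top-left of each window
    for dr in range(2):
        for dc in range(2):
            M[row, (r + dr) * 3 + (c + dc)] = k[dr, dc]

matmul_out = M @ img.flatten()
print(torch.allclose(conv_out, matmul_out))       # True
```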

Now in practice it's going to be faster to do it this way, but it's a useful kind of thing to think about, I think, that just to realize like, "Oh, it's just another of these special types of matrix multiplications." Okay, I think, well let's look at one more thing because there was one other thing that we saw and I mentioned we would look at in the tabular model, which is called dropout.

And I actually have this in my Excel spreadsheet. If you go to the conv example dropout page, you'll see we've actually got a little bit more stuff here. We've got the same input as before and the same first convolution as before and the same second convolution as before. And then we've got a bunch of random numbers.

They're showing as 0s and 1s, but that's just because they're rounded off; they're actually random floats between 0 and 1. Over here, we're then saying, "If..." Let's have a look. So way up here, I'll zoom in a bit, I've got a dropout factor.

Let's change this say to 0.5. There we go. So over here, this is something that says if the random number in the equivalent place is greater than 0.5, then 1, otherwise 0. And so here's a whole bunch of 1s and 0s. Now this thing here is called a dropout mask.

Now what happens is, over here we multiply the dropout mask by our filtered image. And what that means is we end up with exactly the same image we started with (here's the image we started with), but it's corrupted: random bits of it have been deleted.

And it depends on the amount of dropout we use: if we change it to, say, 0.2, not very much of it is deleted at all, so it's still very easy to recognize. Or else, if we use lots of dropout, say 0.8, it's almost impossible to see what the number was.

And then we use this as the input to the next layer. So that seems weird. Why would we delete some data at random from our processed image from our activations after a layer of the convolutions? Well the reason is that a human is able to look at this corrupted image and still recognize it's a seven.

And the idea is that a computer should be able to as well. And if we randomly delete different bits of the activations each time, then the computer is forced to learn the underlying real representation rather than overfitting. You can think of this as data augmentation, but it's data augmentation not for the inputs, but data augmentation for the activations.

So this is called a dropout layer. And so dropout layers are really helpful for avoiding overfitting. And you can decide how much you want to compromise between good generalization, so avoiding overfitting, versus getting something that works really well on the training data. And so the more dropout you use, the less good it's going to be on the training data, but the better it ought to generalize.
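
Here is a hedged sketch of what that dropout mask is doing, alongside the usual PyTorch layer (nn.Dropout additionally rescales by 1/(1-p) during training and switches itself off at eval time):

```python
import torch
from torch import nn

acts = torch.rand(1, 8, 14, 14)                  # activations after a convolutional layer
p = 0.5                                          # the dropout factor from the spreadsheet

# By hand: a random binary mask zeroes out activations with probability p
mask = (torch.rand_like(acts) > p).float()
dropped = acts * mask

# In practice you just use the built-in layer
drop = nn.Dropout(p)
drop.train()
print(drop(acts).shape)                          # same shape, with random elements zeroed
```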

And so this comes from a paper by Geoffrey Hinton's group quite a few years ago now. Ruslan's now at Apple, I think, and then Krizhevsky and Hinton went on to Google Brain. And you can see here they've got this picture of a fully connected neural network, two layers, just like the one we built.

And here look they're kind of randomly deleting some of the activations. And all that's left is these connections. And so that's a different bunch that's going to be deleted, each batch. I thought this is an interesting point. So dropout, which is super important, was actually developed in a master's thesis.

And it was rejected from the main neural networks conference, then called NIPS, now called NeurIPS. So it ended up being disseminated through arXiv, which is a preprint server. And it's just been pointed out on our chat that Ilya was one of the founders of OpenAI. I don't know what happened to Nitish.

I think he went to Google Brain as well, maybe. Yeah, so you know peer review is a very fallible thing in both directions. And it's great that we have preprint servers so we can read stuff like this even if reviewers decide it's not worthy. It's been one of the most important papers ever.

Okay, now I think that's given us a good tour now. We've really seen quite a few ways of dealing with input to a neural network, quite a few of the things that can happen in the middle of a neural network. We've only talked about rectified linear units, which is this one here, zero if x is less than zero or x otherwise.

These are some of the other activations you can use. Don't use this one, of course, because you end up with a linear model. But they're all just different functions. I should mention, it turns out these don't matter very much. Basically, pretty much any non-linearity works fine. So we don't spend much time talking about activation functions, even in part two of the course, just a little bit.

So, yeah, so we understand there's our inputs. They can be one hot encoded or embeddings, which is a computational shortcut. There are sandwiched layers of matrix multipliers and activation functions. The matrix multipliers can sometimes be special cases, such as the convolutions or the embeddings. The output can go through some tweaking, such as softmax.

And then, of course, you've got the loss function, such as cross entropy loss or mean squared error or absolute error. But there's nothing too crazy going on in there. So I feel like we've got a good sense now of what goes inside a wide range of neural nets. You're not going to see anything too weird from here.

And we've also seen a wide range of applications. So before you come back to do part two, you know, what now? And we're going to have a little AMA session here. And in fact, one of the questions was "what now?", so this is quite good. One thing I strongly suggest is, if you've got this far, it's probably worth you investing your time in reading Radek's book, Meta Learning.

And so meta-learning is very heavily based on the kind of teachings of fast AI over the last few years and is all about how to learn deep learning and learn pretty much anything. Yeah, because, you know, you've got to this point, you may as well know how to get to the next point as well as possible.

And the main thing you'll see that Radek talks about, or one of the main things, is practicing and writing. So if you've kind of zipped through the videos on, you know, 2x and haven't done any exercises, you know, go back and watch the videos again. You know, a lot of the best students end up watching them two or three times, probably more like three times, and actually go through and code as you watch, you know, and experiment.

You know, write posts, blog posts about what you're doing. Spend time on the forum, both helping others and seeing other people's answers to questions. Read the success stories on the forum and of people's projects to get inspiration for things you could try. One of the most important things to do is to get together with other people.

For example, you can do, you know, a Zoom study group, in fact, on our Discord, which you can find through our forum. There's always study groups going on, or you can create your own, you know, a study group to go through the book together. Yeah, and of course, you know, build stuff.

And sometimes it's tricky to always be able to build stuff for work, because maybe there isn't, you're not quite in the right area, or they're not quite ready to try out deep learning yet. But that's okay. You know, build some hobby projects, build some stuff just for fun, or build some stuff that you're passionate about.

Yeah, so it's really important to not just put the videos away and go away and do something else, because you'll forget everything you've learned and you won't have practiced. So one of our community members went on to create an activation function, for example, which is Mish, which, as Tanishq just reminded me on our forums, is now used in many of the state-of-the-art networks around the world, which is pretty cool.

And he's now at MILA, I think, one of the top research labs in the world. I wonder how that's doing. Let's have a look, go to Google Scholar. Nice, 486 citations. They're doing great. All right, let's have a look at how our AMA topic is going and pick out some of the highest ranked AMAs.

Okay. So the first one is from Lucas, and actually maybe I should, actually let's switch our view here. So our first AMA is from Lucas, and Lucas asks, "How do you stay motivated? I often find myself overwhelmed in this field. There are so many new things coming up that I feel like I have to put so much energy just to keep my head above the waterline." Yeah, that's a very interesting question.

I mean, I think, Lucas, the important thing is to realize you don't have to know everything, you know. In fact, nobody knows everything. And that's okay. What people do is they take an interest in some area, and they follow that, and they try and do the best job they can of keeping up with some little sub area.

And if your little sub area is too much to keep up on, pick a sub-sub area. Yeah, there's no need for it to be demotivating that there are a lot of people doing a lot of interesting work in a lot of different subfields. That's cool, you know.

It used to be kind of dull back when there were basically only five labs in the world working on neural nets. And yeah, from time to time, you know, take a dip into other areas that maybe you're not following as closely. But when you're just starting out, you'll find that things are not changing that fast at all, really.

They can kind of look that way because people are always putting out press releases about their new tweaks. But fundamentally, the stuff that is in the course now is not that different to what was in the course five years ago. The foundations haven't changed. And it's not that different, in fact, to the convolutional neural network that Yann LeCun used on MNIST back in 1996.

It's, you know, the basic ideas I've described are forever, you know: the way the inputs work, the sandwiches of matrix multipliers and activation functions, and the stuff you do to the final layer. Everything else is tweaks. And the more you learn about those basic ideas, the more you'll recognize those tweaks as simple little tricks that you'll be able to quickly get your head around.

So then Lucas goes on to ask, or to comment: another thing that constantly bothers me is that I feel the field is getting more and more skewed towards bigger and more computationally expensive models and huge amounts of data. I keep wondering if, some years from now, I would still be able to train reasonable models with a single GPU, or if everything is going to require a compute cluster.

Yeah, that's a great question. I get that a lot. But interestingly, you know, I've been teaching people machine learning and data science stuff for nearly 30 years. And I've had a variation of this question throughout. And the reason is that engineers always want to push the envelope on the biggest computers they can find, you know; that's just this, like, fun thing engineers love to do.

And by definition, they're going to get slightly better results than people doing exactly the same thing on smaller computers. So it always looks like, oh, you need big computers to be state of the art. But that's actually never true, right? Because there's always smarter ways to do things, not just bigger ways to do things.

And so, you know, when you look at fast.ai's DAWNBench success, when we trained ImageNet faster than anybody had trained it before, on standard GPUs, you know, me and a bunch of students, that was not meant to happen. You know, Google was working very hard with their TPU introduction to try to show how good they were.

Intel was using like 256 PCs in parallel or something. But yeah, you know, we used common sense and smarts and showed what can be done. You know, it's also a case of picking the problems you solve. So I probably would not go head to head against Codex and try to create code from English descriptions.

You know, because that's a problem that does probably require very large neural nets and very large amounts of data. But if you pick areas in different domains, you know, there's still huge areas where much smaller models are still going to be state of the art. So hopefully that helped answer your question.

Let's see what else we got here. So Daniel has obviously been following my journey with teaching my daughter math. Yeah, so I homeschool my daughter. And Daniel asks, how do you homeschool young children, science in general and math in particular? Would you share your experiences by blogging or in lectures someday?

Yeah, I could do that. So I actually spent quite a few months just reading research papers about education recently. So I do probably have a lot I probably need to talk about at some stage. But yeah, broadly speaking, I lean into using computers and tablets a lot more than most people.

Because actually, there are an awful lot of really great apps that are super compelling. They're adaptive, so they go at the right speed for the student. And they're fun. And I really like my daughter to have fun. You know, I really don't like to force her to do things. And for example, there's a really cool app called DragonBox Algebra 5+, which teaches algebra to five-year-olds by using a really fun computer game involving helping dragon eggs to hatch.

And it turns out that, yeah, the basic ideas of algebra are no more complex than the basic ideas that we do in other kindergarten math. And all the parents I know of who have given their kids DragonBox Algebra 5+ have found that their kids successfully learned algebra. So that would be an example.

But yeah, we should talk about this more at some point. All right, let's see what else we've got here. So Farah says the walkthroughs have been a game changer for me. The knowledge and tips you shared in those sessions are skills required to become an effective machine learning practitioner and utilize fastai more effectively.

Have you considered making the walkthroughs a more formal part of the course, doing a separate software engineering course, or continuing live coding sessions between part one and two? So yes, I am going to keep doing live coding sessions. At the moment, we've switched those to focus specifically on APL.

And then in a couple of weeks, they're going to be moving to fast.ai study groups. And then after that, they'll gradually turn back into more live coding sessions. But yeah, the thing I try to do in my live coding or study groups or whatever is definitely to try to show the foundational techniques that just make life easier as a coder or a data scientist.

When I say foundational, I mean, yeah, the stuff which you can reuse again and again and again, like learning regular expressions really well, or knowing how to use a VM or understanding how to use the terminal and command line, you know, all that kind of stuff. Never goes out of style.

It never gets old. And yeah, I do plan to at some point hopefully actually do a course really all about that stuff specifically. But yeah, for now, the best approach is to follow along with the live coding and stuff. Okay, WGPubs, which is Wade, asks: how do you turn a model into a business?

Specifically, how does a coder with little or no startup experience turn an ML-based Gradio prototype into a legitimate business venture? Okay, I plan to do a course about this at some point as well. So, you know, obviously, there isn't a two-minute version of this. But the key thing with creating a legitimate business venture is to solve a legitimate problem, you know, a problem that people need solved, and which they will pay you to solve.

And so it's important not to start with your fun Gradio prototype as the basis of your business, but instead start with: here's a problem I want to solve. And generally speaking, you should try to pick a problem that you understand better than most people. So it's either a problem that you face day to day in your work, or in some hobby or passion that you have, or that your club has, or your local school has, or that your spouse deals with in their workplace, you know; it's something where you understand that there's something that doesn't work as well as it ought to.

Particularly something where you think to yourself, you know, if they just used deep learning here, or some algorithm here, or some better compute here, that problem would go away. And that's the start of a business. And so then my friend Eric Ries wrote a book called The Lean Startup, where he describes what you do next, which is basically: you fake it. You create what he calls the minimum viable product, something that solves that problem and takes you as little time as possible to create. It could be very manual, it can be loss-making, that's fine, you know. Even the bit in the middle where you're like, oh, there's going to be a neural net here, it's fine to launch without the neural net and do everything by hand.

You're just trying to find out if people are going to pay for this, and if this is actually useful. And then once you have, you know, hopefully confirmed that the need is real, that people will pay for it, and that you can solve the need, you can gradually make it less and less of a fake, you know, and do more and more to get the product to where you want it to be.

Okay, I don't know how to pronounce the name M-I-W-O-J-C. M-I-W-O-J-C says: Jeremy, can you share some of your productivity hacks? From the content you produce, it may seem you work 24 hours a day. Okay, I certainly don't do that. I think one of my main productivity hacks actually is not to work too hard, or rather, not to work too much.

I spend probably fewer hours a day working than most people, I would guess. But I think I do a couple of things differently when I'm working. One is that I've spent at least half of every working day since I was about 18 learning or practicing something new. Could be a new language, could be a new algorithm, could be something I read about.

And nearly all of that time, therefore, I've been doing that thing more slowly than I would if I just used something I already knew. Which often drives my co-workers crazy, because they're like, you know, why aren't you focusing on getting that thing done? But as a result of that 50% of the time, I'm constantly, you know, building up this kind of exponentially improving base of expertise in a wide range of areas.

And so now I do find, you know, I can do things often orders of magnitude faster than people around me, or certainly many multiples faster than people around me, because I, you know, know a whole bunch of tools and skills and ideas which other people don't necessarily know.

So like, I think that's one thing that's been helpful. And then another is, yeah, like trying to really not overdo things, like get good sleep and eat well and exercise well. And also, I think it's a case of like tenacity, you know, I've noticed a lot of people give up much earlier than I do.

So, yeah, if you just keep going until something's actually finished, then that's going to put you in a small minority, to be honest. Most people don't do that. And when I say finished, I mean finish something really nicely. And I particularly like coding, so I try to do a lot of coding-related stuff.

So I create things like nbdev, and nbdev makes it much, much easier for me to finish something nicely, you know. So in my kind of chosen area, I've spent quite a bit of time trying to make sure it's really easy for me to, like, get out a blog post, get out a Python library, get out a notebook analysis, whatever.

So, yeah, trying to make the things I want to do easier, so that I'll do them more. So, well, thank you, everybody. That's been a lot of fun. Really appreciate you taking the time to go through this course with me. Yeah, if you enjoyed it, it would really help if you would give it a like on YouTube, because it really helps other people find the course; it goes into the YouTube recommendation system.

And please do come and help other beginners on forums.fast.ai. It's a great way to learn yourself, trying to teach other people. And yeah, I hope you'll join us in part two. Thanks, everybody, very much. I've really enjoyed this process, and I hope to get to meet more of you in person in the future.

Bye.