Lesson 8 - Practical Deep Learning for Coders 2022
Chapters
0:00 Neural net from scratch
4:46 Parameters in PyTorch
7:42 Embedding from scratch
12:21 Embedding interpretation
18:06 Collab filtering in fastai
22:11 Embedding distance
24:22 Collab filtering with DL
30:25 Embeddings for NLP
34:56 Embeddings for tabular
44:33 Convolutions
57:07 Optimizing convolutions
58:00 Pooling
65:12 Convolutions as matrix products
68:21 Dropout
74:27 Activation functions
80:41 Jeremy AMA
80:57 How do you stay motivated?
83:38 Skew towards big expensive models
86:25 How do you homeschool children
88:26 Walk-through as a separate course
89:59 How do you turn model into a business
92:46 Jeremy's productivity hacks
96:03 Final words
00:00:00.000 |
So welcome to the last lesson of part 1 of practical deep learning for coders. It's been 00:00:17.980 |
a really fun time doing this course and depending on when you're watching and listening to this 00:00:28.920 |
you may want to check the forums or the fast.ai website to see whether we have a part 2 planned 00:00:36.160 |
which is going to be sometime towards the end of 2022. Or if it's already past that 00:00:45.280 |
then maybe there's even a part 2 already on the website. So part 2 goes a lot deeper than 00:00:51.920 |
part 1 technically in terms of getting to the point that you should be able to read 00:00:59.440 |
and implement research papers and deploy models in a very kind of real life situation. So 00:01:09.700 |
yeah, last lesson we started on the collaborative filtering notebook and we were looking at 00:01:26.960 |
collaborative filtering and this is where we got to which is creating your own embedding 00:01:30.740 |
module and this is a very cool place to start the lesson because you're going to learn a 00:01:36.320 |
lot about what's really going on. And it's really important before you dig into this 00:01:42.960 |
to make sure that you're really comfortable with the 05 linear model and neural net from 00:01:51.560 |
scratch notebook. So if parts of this are not totally clear put it aside and redo this 00:02:01.240 |
notebook because what we're looking at from here are kind of the abstractions that PyTorch 00:02:08.760 |
and fastai add on top of functionality that we've built ourselves from scratch. So if 00:02:16.480 |
you remember in the neural network from scratch we built we initialized a number of coefficients 00:02:24.320 |
a couple of different layers and a bias term and then during as the model trained we updated 00:02:34.960 |
those coefficients by going through each layer of them and subtracting out the gradients 00:02:40.880 |
by the learning rate. You've probably noticed that in PyTorch we don't have to go to all 00:02:50.440 |
that trouble, and I wanted to show you how PyTorch does this. In PyTorch we don't have to 00:02:56.760 |
keep track of what our coefficients or parameters or weights are. PyTorch does that for us and 00:03:06.680 |
the way it does that is it looks inside our module and it tries to find anything that 00:03:15.960 |
looks like a neural network parameter or a tensor of neural network parameters and it 00:03:22.640 |
keeps track of them and so here is a class we've created called T which is a subclass 00:03:27.760 |
of module and I've created one thing inside it which is something with the attribute A. 00:03:33.760 |
So this is A in the T module and it just contains three ones and so the idea is you know maybe 00:03:41.600 |
we're creating a module and we're initializing some parameter that we want to train. Now 00:03:46.760 |
we can find out what trainable parameters or just what parameters in general PyTorch 00:03:52.920 |
knows about in our model by instantiating our model and then asking for the parameters 00:04:01.880 |
which you then have to turn into a list, or in fastai we have a thing called capital 00:04:06.080 |
L which is like a fancy list which prints out the number of items in the list and shows 00:04:11.420 |
you those items. Now in this case when we create our object of type T and ask for its parameters 00:04:19.760 |
we get told there are zero tensors of parameters and a list with nothing in it. Now why is 00:04:25.880 |
that? We actually said we wanted to create a tensor with three ones in it, so how would 00:04:29.960 |
we make those parameters? Well, the answer is that the way you tell 00:04:38.520 |
PyTorch what your parameters are is you actually just have to put them inside a special object 00:04:44.600 |
called an nn.parameter. This thing almost doesn't really do anything. In fact last time 00:04:51.920 |
I checked it really quite literally had almost no code in it sometimes these things change 00:04:55.840 |
but let's take a look. Yeah okay so it's about a dozen lines of code or 20 lines of code 00:05:08.840 |
which does almost nothing it's got a way of being copied it's got a way of printing itself 00:05:14.040 |
it's got a way of saving itself and it's got a way of being initialized. So parameter hardly 00:05:21.000 |
does anything the key thing is though that when PyTorch checks to see which parameters 00:05:27.200 |
should it update when it optimizes it just looks for anything that's been wrapped in 00:05:33.080 |
this parameter class. So if we do exactly the same thing as before which is to set an 00:05:37.700 |
attribute containing a tensor with three ones in it but this case we wrap it in a parameter 00:05:47.000 |
we now get told okay there's one parameter tensor in this model and it contains a tensor 00:05:55.080 |
with three ones and you can see it also actually by default assumes that we're going to want 00:06:01.080 |
require gradient it's assuming that anything that's a parameter is something that you want 00:06:04.960 |
to calculate gradients for. Now most of the time we don't have to do this because PyTorch 00:06:11.320 |
provides lots of convenient things for us such as what you've seen before nn.linear 00:06:18.960 |
which is something that also contain creates a tensor so this would contain a create a 00:06:24.920 |
tensor of 1 by 3 without a bias term in it. This has not been wrapped in an nn.parameter 00:06:32.320 |
but that's okay PyTorch knows that anything which is basically a layer in a neural net 00:06:39.320 |
is going to be a parameter so it automatically considers this a parameter. So here's exactly 00:06:46.080 |
the same thing again I construct my object of type T I've checked for its parameters and 00:06:51.160 |
I can see there's one tensor of parameters, and there's our three things, and you'll notice 00:06:55.960 |
that it's also automatically randomly initialized them, which again is generally what we want. 00:07:04.280 |
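Here's a minimal sketch of those three cases, loosely following the notebook's little T class (the exact notebook code may differ slightly):

```python
import torch
from torch import nn

class T(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.ones(3)                # plain tensor: PyTorch does not track it
        self.b = nn.Parameter(torch.ones(3))  # wrapped: tracked, requires_grad=True by default
        self.c = nn.Linear(1, 3, bias=False)  # layers register their own parameters

t = T()
print(list(t.parameters()))
# only b and c.weight appear (c.weight randomly initialised); the plain tensor a does not
```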
So PyTorch does go to some effort to try to make things easy for you. So this attribute 00:07:15.880 |
A is a linear layer, and it's got a bunch of things in it; one of the things in it is 00:07:31.160 |
the weights, and that's where you'll actually find the parameters, which are of type Parameter, 00:07:37.040 |
so a linear layer is something that contains attributes of type parameter. Okay so what 00:07:43.880 |
we want to do is we want to create something that works just like this did which is something 00:07:51.520 |
that creates a matrix which will be trained as we train the model. Okay so an embedding 00:08:03.760 |
is something which yeah it's going to create a matrix of this by this and it will be a 00:08:13.000 |
parameter and it's something that yeah we need to be able to index into as we did here 00:08:19.480 |
and so yeah what is what is happening behind the scenes you know we're in PyTorch it's 00:08:24.640 |
nice to be able to create these things ourselves in Scratch because it means we really understand 00:08:29.360 |
it and so let's create that exact same module that we did last time but this time we're 00:08:38.680 |
going to use a function I've created called create_params. You pass in a size, such as 00:08:46.040 |
in this case n_users by n_factors, and it's going to call torch.zeros to create a tensor 00:08:57.960 |
of zeros of the size that you request and then it's going to do normal random distributions 00:09:07.840 |
or a Gaussian distribution of mean zero standard deviation 0.01 to randomly initialize those 00:09:14.960 |
and it'll put the whole thing into an nn.parameter so that so this here is going to create an 00:09:20.760 |
attribute called user factors which will be a parameter containing some tensor of normally 00:09:28.480 |
distributed random numbers of this size, and because it's a parameter, that's going 00:09:40.160 |
to be stored inside the module and be available via its parameters. 00:09:51.720 |
so user bias will be a vector of parameters user factors will be a matrix of parameters 00:09:59.000 |
movie factors will be a matrix of n_movies by n_factors, and movie bias will be a vector of 00:10:04.760 |
n movies and this is the same as before so now in the forward we can do exactly what 00:10:10.400 |
we did before the thing is when you put a tensor inside a parameter it has all the exact 00:10:18.240 |
same features that a tensor has so for example we can index into it so this whole thing is 00:10:30.560 |
identical to what we had before and so that's actually believe it or not all that's required 00:10:35.800 |
to replicate pytorches embedding layer from scratch so let's run those and see if it works 00:10:47.440 |
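For reference, here is roughly what that looks like, a sketch along the lines of the book's chapter 8 code (the notebook's version may differ in small details):

```python
import torch
from torch import nn

def create_params(size):
    # zeros, filled in place with N(0, 0.01) noise, wrapped so PyTorch tracks it
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

class DotProductBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors=50, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors  = create_params([n_users, n_factors])
        self.user_bias     = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias    = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):                     # x[:, 0] is the user id, x[:, 1] the movie id
        users  = self.user_factors[x[:, 0]]   # indexing a Parameter works just like a tensor
        movies = self.movie_factors[x[:, 1]]
        res = (users * movies).sum(dim=1)
        res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
        lo, hi = self.y_range                 # sigmoid_range, written out by hand
        return torch.sigmoid(res) * (hi - lo) + lo
```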
and there it is it's training so we'll be able to have a look when this is done at for 00:10:53.760 |
example model dot let's have a look movie bias 00:11:09.520 |
and here it is right it's a parameter containing a bunch of numbers that have been trained 00:11:18.440 |
as we'd expect it's got 1665 things in because that's how many movies we have so a question 00:11:26.360 |
from Jonah Raphael was: does torch.zeros not produce all zeros? Yes, torch.zeros does 00:11:36.360 |
produce all zeros but remember a method that ends in underscore changes in place the tensor 00:11:44.320 |
it's being applied to and so if you look up pytorch normal underscore you'll see it fills 00:11:59.280 |
itself with elements sampled from the normal distribution so this is actually modifying 00:12:07.640 |
this tensor in place and so that's why we end up with something which isn't just zeros 00:12:20.880 |
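To make that concrete, here's a tiny check you could run:

```python
import torch

t = torch.zeros(3, 4)
t.normal_(0, 0.01)   # trailing underscore: fills t in place with N(0, 0.01) samples
print(t)             # no longer all zeros
```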
Now this is the bit I find really fun. We trained this model, but what did it do? How is 00:12:32.520 |
it going about predicting who's going to like what movie? Well, one of the things that's 00:12:39.120 |
happened is we've created this movie bias parameter which has been optimized and what 00:12:48.720 |
we could do is we could find which movie IDs have the highest numbers here and the lowest 00:13:00.280 |
numbers. So I think this is going to start with the lowest, and then we can look 00:13:04.240 |
inside our data loaders and grab the names of those movies for each of those five lowest 00:13:10.500 |
numbers. 00:13:22.400 |
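A rough sketch of that step (this assumes the trained learn and dls from the collaborative filtering notebook, with the attribute names of the from-scratch model above):

```python
movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]                   # the five lowest-bias movies
print([dls.classes['title'][i] for i in idxs])

idxs = movie_bias.argsort(descending=True)[:5]    # the five highest-bias movies
print([dls.classes['title'][i] for i in idxs])
```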
And what's happened here? Well, we can see, broadly speaking, that it has printed out some pretty crappy movies. And why is that? Well, it's because when it does that matrix 00:13:29.680 |
product that we saw in the Excel spreadsheet last week it's trying to figure out who's 00:13:36.160 |
going to like what movie based on previous movies people have enjoyed or not and then 00:13:41.880 |
it adds movie bias which can be positive or negative that's a different number for each 00:13:46.080 |
movie so in order to do a good job of predicting whether you're going to like a movie or not 00:13:51.920 |
it has to know which movies are crap and so the crap movies are going to end up with a 00:13:56.720 |
very low movie bias parameter, and so we can actually find out not 00:14:08.040 |
only which movies do people really not like, but which movies do people like less 00:14:13.120 |
than one would expect given the kind of movie that it is. So Lawnmower Man 2, for example: not 00:14:23.300 |
only apparently is it a crappy movie but based on the kind of movie it is you know it's kind 00:14:29.680 |
of like a high-tech pop kind of sci-fi movie people who like those kinds of movies still 00:14:37.400 |
don't like lawnmower man 2 so that's what this is meaning so it's kind of nice that 00:14:42.120 |
we can like use a model not just to predict things but to understand things about the 00:14:46.560 |
data. So if we sort by descending it'll give us the exact opposite: here are movies that 00:14:55.760 |
people enjoy even when they don't normally enjoy that kind of movie so for example LA 00:15:02.360 |
confidential classic kind of film noir detective movie with the Aussie Guy Pearce even if you 00:15:10.600 |
don't really like film noir detective movies you might like this one you know silence of 00:15:19.200 |
the lambs classic kind of I guess you'd say like horror kind of not horror is it suspense 00:15:26.960 |
movie even people who don't normally like kind of serial killer suspense movies tend 00:15:31.120 |
to like this this one now the other thing we can do is not just look at what's happening 00:15:40.280 |
in in the bias oh and by the way we could do the same thing with users and find out 00:15:44.240 |
like which user just loves movies even the crappy ones you know just likes all movies 00:15:52.240 |
and vice versa but what about the other thing we didn't just have bias we also had movie 00:15:58.600 |
factors which has got the number of movies as one axis and the number of factors as the 00:16:05.440 |
other and we passed in 50 what's in that huge matrix well pretty hard to visualize such 00:16:12.600 |
a huge matrix and we're not going to talk about the details but you can do something 00:16:16.720 |
called PCA which stands for principal component analysis and that basically tries to compress 00:16:21.920 |
those 50 columns down into three columns, and then we can draw a chart of the top two. 00:16:34.120 |
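Here's a hypothetical sketch of that compression in plain PyTorch (the notebook uses fastai helpers; learn and dls are assumed from the model trained above, and the subset of movies to plot is arbitrary):

```python
import torch
import matplotlib.pyplot as plt

movie_w = learn.model.movie_factors.detach().cpu()   # n_movies x 50 latent factors
U, S, V = torch.pca_lowrank(movie_w, q=3)            # compress 50 columns down to 3
coords = (movie_w - movie_w.mean(0)) @ V[:, :2]      # project onto the top 2 components

idxs = list(range(200))                              # plot a readable subset of movies
plt.figure(figsize=(10, 10))
plt.scatter(coords[idxs, 0], coords[idxs, 1], s=5)
for i in idxs:
    plt.text(coords[i, 0], coords[i, 1], dls.classes['title'][i], fontsize=7)
```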
so this is PCA component number one and this is PCA component number two and here's a bunch 00:16:43.080 |
of movies and this is a compressed view of these latent factors that it created and you 00:16:51.840 |
can see that they obviously have some kind of meaning right so over here towards the 00:16:57.120 |
right we've got kind of you know very pop mainstream kind of movies and over here on 00:17:05.200 |
the left we've got more of the kind of critically acclaimed gritty kind of movies and then towards 00:17:13.440 |
the top we've got very kind of action-oriented and sci-fi movies and then down towards the 00:17:19.440 |
bottom we've got very dialogue driven movies so remember we didn't program in any of these 00:17:26.360 |
things and we don't have any data at all about what movie is what kind of movie but thanks 00:17:34.440 |
to the magic of SGD we just told it to please try and optimize these parameters and the 00:17:44.240 |
way it was able to predict who would like what movie was it had to figure out what kinds 00:17:49.920 |
of movies are there or what kind of taste is there for each movie so I think that's 00:17:56.040 |
pretty interesting so this is called visualizing embeddings and then this is visualizing the 00:18:04.480 |
bias. We obviously would rather not do everything by hand like this, or even like this, and fast 00:18:28.720 |
AI provides an application for this, collab_learner, and so we can create one, and this 00:18:35.080 |
is going to look much the same as what we just had we're going to say how many latent 00:18:38.200 |
factors we want and what the y range is to do the sigmoid in the multiply and then we 00:18:44.040 |
can do fit and away it goes so let's see how it does all right so it's done a bit better 00:19:01.440 |
than our manual one let's take a look at the model it created the model looks very similar 00:19:11.120 |
to what we created in terms of the parameters you can see these are the two embeddings and 00:19:15.560 |
these are the two biases and we can do exactly the same thing we can look in that model and 00:19:21.680 |
we can find the you'll see it's not called movies it's i for items it's users and items 00:19:26.840 |
this is the item bias so we can look at the item bias grab the weights sort and we get 00:19:34.280 |
a very similar result in this case it's very even more confident that LA Confidential is 00:19:38.800 |
a movie that you should probably try watching even if you don't like those kind of movies 00:19:42.440 |
and titanic's right up there as well even if you don't really like romancy kind of movies 00:19:47.100 |
you might like this one even if you don't like classic detective you might like this 00:19:52.480 |
one you know we can have a look at the source code for collab learner and we can see that 00:20:12.140 |
let's see, use_nn is false by default, so our model is going to be of this type, 00:20:17.040 |
EmbeddingDotBias, so we can take a look at that. Here it is, and look, this does look very similar: 00:20:27.240 |
okay it's creating an embedding using the size we requested for each of users by factors 00:20:37.840 |
and items by factors and users and items and then it's grabbing each thing from the embedding 00:20:44.520 |
in the forward, and it's doing the multiply, and it's adding it up, and it's doing the sigmoid. 00:20:55.040 |
so yeah it looks looks exactly the same isn't that neat so you can see that what's actually 00:21:03.680 |
happening in real models is not yeah it's not it's not that weird or magic so Kurian 00:21:18.320 |
is asking is PCA useful in any other areas and the answer is absolutely and what I suggest 00:21:27.000 |
you do if you're interested is check out our computational linear algebra course; 00:21:40.120 |
it's five years old now but it I mean this is stuff which hasn't changed for decades 00:21:44.880 |
really and this will teach you all about things like PCA and stuff like that it's it's not 00:21:55.200 |
nearly as directly practical as practical deep learning for coders but it's definitely 00:22:00.160 |
like very interesting and it's the kind of thing which if you want to go deeper you know 00:22:05.480 |
it's it can become pretty useful later along your path okay so here's something else interesting 00:22:17.240 |
we can do let's grab the movie factors so that's in our model it's the item weights 00:22:23.640 |
and it's the weight attribute that PyTorch creates okay and now we can convert the movie 00:22:31.880 |
Silence of the Lambs into its class ID, and we can do that with o2i (object to ID) for 00:22:39.760 |
the titles and so that's the movie index of Silence of the Lambs and what we can do now 00:22:45.520 |
is we can look through all of the movies in our latent factors and calculate how far apart 00:22:55.080 |
each embedding vector is from this one, and this cosine similarity 00:23:02.760 |
is very similar to basically the Euclidean distance, you know, the kind of root sum 00:23:09.700 |
squared of the differences, but it normalizes it, so it's basically the angle between the 00:23:18.000 |
vectors. So this is going to calculate how similar each movie is to the Silence of the 00:23:23.280 |
Lambs based on these latent factors, and so then we can find which index is the closest. 00:23:36.920 |
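Roughly the code being described (learn and dls are assumed from the collab_learner section, and the exact title string has to match the one in the dataset):

```python
from torch import nn

movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
closest = distances.argsort(descending=True)[1]   # [0] is the movie itself
print(dls.classes['title'][closest])
```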
So based on this embedding distance the closest is Dial M for Murder, which makes a lot of 00:23:45.020 |
sense I'm not going to discuss it today but in the book there's also some discussion about 00:24:01.320 |
what's called the bootstrapping problem which is the question of like if you've got a new 00:24:05.800 |
company or a new product how would you get started with making recommendations given 00:24:13.080 |
that you don't have any previous history with which to make recommendations and that's a 00:24:16.600 |
very interesting problem that you can read about in the book 00:24:26.600 |
now that's one way to do collaborative filtering which is where we create that do that matrix 00:24:38.400 |
completion exercise using all those dot products there's a different way however which is we 00:24:43.760 |
can use deep learning and to do it with deep learning what we could do is we can we could 00:24:55.360 |
basically create our user and item embeddings as per usual and then we could create a sequential 00:25:02.400 |
model so sequential model is just layers of a deep learning neural network in order and 00:25:11.040 |
what we could do is we could just concatenate so in forward we could just concatenate the 00:25:17.520 |
user and item embeddings together and then do a value so this is this is basically a 00:25:25.840 |
single hidden layer neural network and then a linear layer at the end to create a single 00:25:30.240 |
output so this is a very you know world's most simple neural net exactly the same as 00:25:37.760 |
the style that we created back here in our neural net from scratch this is exactly the 00:25:46.320 |
same but we're using pytorch as functionality to do it more easily so in the forward here 00:25:57.080 |
we're going to in the same exactly the same way as we have before we'll look up the user 00:26:01.280 |
embeddings and we'll look up the item embeddings and then this is new this is where we concatenate 00:26:07.320 |
those two things together and put it through our neural network and then finally do our 00:26:12.000 |
sigmoid now one thing different this time is that we're going to ask fastai to figure 00:26:24.560 |
out how big our embeddings should be and so fastai has something called get embedding 00:26:28.560 |
sizes and it just uses a rule of thumb that says that for 944 users we recommend 74 factor 00:26:37.240 |
embeddings and for 1665 movies or is it the other way around I can't remember we recommend 00:26:43.640 |
102-factor embeddings, so that's what those sizes are. So now we can create that 00:26:52.840 |
model and we can pop it into a learner and fit in the usual way and so rather than doing 00:27:08.780 |
all that from scratch what you can do is you can do exactly the same thing that we've done 00:27:13.160 |
before which is to call collaborative learner but you can pass in the parameter use neural 00:27:21.200 |
network equals true and you can then say how big do you want each layer so this is going 00:27:26.680 |
to create a two hidden layer deep learning neural net the first will have 1500 and the 00:27:31.960 |
second will have 50 and then you can say fit and away it goes 00:27:48.080 |
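Here's a hedged sketch of both versions, following the book's chapter 8 closely (layer sizes, learning rates and weight decay are the book's values, and dls is the CollabDataLoaders from earlier):

```python
from fastai.collab import *
from fastai.tabular.all import *
import torch
from torch import nn

embs = get_emb_sz(dls)   # rule-of-thumb sizes, e.g. [(944, 74), (1665, 102)]

class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
        x = self.layers(torch.cat(embs, dim=1))   # concatenate, then run the little MLP
        return sigmoid_range(x, *self.y_range)

model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)

# or let fastai build it for you: two hidden layers (the book uses sizes 100 and 50)
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100, 50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```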
Okay, so here are our results: we got 0.87, so this is doing less well than our dot product version, 00:27:57.200 |
which is not too surprising because kind of the dot product version is really trying to 00:28:01.480 |
take advantage of our understanding of the problem domain in practice nowadays a lot 00:28:08.920 |
of companies kind of combine they kind of create a combined model that have a as a dot 00:28:14.280 |
product component and also has a neural net component the neural net components particularly 00:28:22.100 |
helpful if you've got metadata for example information about your users like when did 00:28:28.440 |
they sign up how old are they what sex are they you know where are they from and then 00:28:33.960 |
those are all things that you could concatenate in with your embeddings and ditto with metadata 00:28:39.960 |
about the movie how old is it what genre is it and so forth all right so we've got a question 00:28:51.480 |
from Jonah which I think is interesting and the question is is there an issue where the 00:28:56.480 |
bias components are overwhelmingly determined by the non-experts in a genre in general actually 00:29:10.720 |
there's a there's a more general issue which is in collaborative filtering recommendation 00:29:16.440 |
systems very often a small number of users or a small number of movies overwhelm everybody 00:29:27.480 |
else and the classic one is anime a relatively small number of people watch anime and those 00:29:36.400 |
group of people watch a lot of anime so in movie recommendations like there's a classic 00:29:42.240 |
problem which is every time people try to make a list of well-loved movies all the top 00:29:47.360 |
ones end up being anime, and so you can imagine what's happening in the matrix completion 00:29:51.480 |
exercise is that there are yeah some some users that just you know really watch this 00:29:59.440 |
one genre of movie and they watch an awful lot of them so in general you've actually 00:30:05.600 |
do have to be pretty careful about, you know, these kinds of subtle issues, and yeah, 00:30:14.280 |
we won't go into the details about how to deal with them, but they generally involve kind of taking 00:30:17.300 |
various kinds of ratios or normalizing things or so forth all right so that's collaborative 00:30:32.320 |
filtering and I wanted to show you something interesting then about embeddings which is 00:30:38.920 |
that embeddings are not just for collaborative filtering and in fact if you've heard about 00:30:45.760 |
embeddings before you've probably heard about them in the context of natural language processing 00:30:51.160 |
so you might have been wondering back when we did the hugging face transformers stuff 00:30:56.880 |
how did we go about you know using text as inputs to models and we talked about how you 00:31:06.480 |
can turn words into integers we make a list so here's here's the movie a certain movie 00:31:12.920 |
here's the poem I am Sam I am Daniel I am Sam Sam I am that Sam I am etc etc we can 00:31:22.760 |
find a list of all the unique words in that poem and make this list here and then we can 00:31:28.240 |
give each of those words a unique ID just arbitrarily well actually in this case it's 00:31:36.280 |
alphabetical order but it doesn't have to be and so we kind of talked about that and 00:31:40.640 |
that's what we do with categories in general but how do we turn those into like you know 00:31:46.640 |
lists of random numbers and you might not be surprised to hear what we do is we create 00:31:51.920 |
an embedding matrix so here's an embedding matrix containing four latent factors for 00:32:01.260 |
each word in the vocab so here's each word in the vocab and here's the embedding matrix 00:32:06.640 |
so if we then want to present this poem to a neural net then what we do is we list out 00:32:19.600 |
our poem: I do not like that Sam I am, do you like green eggs and ham, etc. Then for each 00:32:26.280 |
word we look it up. So in Excel, for example, we use MATCH, so that will find this word over 00:32:34.300 |
here and find it is word ID 8, and then we will find the eighth row of the embedding matrix, 00:32:52.080 |
and that gives us its values: 0.22, then 0.1, then 0.01, and so on. 00:33:14.660 |
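The same lookup in code, with made-up numbers (a small illustration, not the spreadsheet's exact values):

```python
import torch
from torch import nn

poem = "I do not like that Sam I am do you like green eggs and ham".lower().split()
vocab = sorted(set(poem))                     # unique words
word2id = {w: i for i, w in enumerate(vocab)}

emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)   # 4 latent factors per word
ids = torch.tensor([word2id[w] for w in poem])
word_vectors = emb(ids)                       # shape: (len(poem), 4), one row per word
print(word_vectors.shape)
```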
so this is the embedding matrix we end up with for this poem and so if you wanted to 00:33:23.700 |
train or use and train neural network on this poem you basically turn it into this matrix 00:33:29.300 |
of numbers and so this is what an embedding matrix looks like in an NLP model and it works 00:33:38.580 |
exactly the same way as you can see and then you can do exactly the same things in terms 00:33:45.700 |
of interpretation of an NLP model by looking at both the bias factors and the latent factors 00:33:57.440 |
in a word embedding matrix so hopefully you're getting the idea here that our you know our 00:34:10.420 |
different models you know the inputs to them that they're based on a relatively small number 00:34:16.700 |
of kind of basic principles, and these principles are generally things like look up something 00:34:22.380 |
in an array, and then we know inside the model we're basically multiplying things together, 00:34:29.140 |
adding them up, and replacing the negatives with zeros. So hopefully you're getting the idea 00:34:33.100 |
that what's going on inside a neural network is generally not that complicated but it happens 00:34:39.720 |
very quickly and at scale. Now it's not just collaborative filtering and NLP but also tabular 00:34:59.180 |
analysis so in chapter 9 of the book we've talked about how random forests can be used 00:35:08.380 |
for this which was for this is for the thing where we're predicting the auction sale price 00:35:13.780 |
of industrial heavy equipment like bulldozers instead of using a random forest we can use 00:35:20.320 |
a neural net now in this data set there are some continuous columns and there are some 00:35:36.180 |
categorical columns now I'm not going to go into the details too much but in short the 00:35:43.980 |
we can separate out the continuous columns and categorical columns using cont cat split 00:35:50.420 |
and that will automatically find which is which based on their data types and so in 00:35:58.020 |
this case it looks like okay so continuous columns the elapsed sale date so I think it's 00:36:07.260 |
the number of seconds or years or something since the start of the data set is a continuous 00:36:13.300 |
variable and then here are the cut the categorical variables so for example there are six different 00:36:21.340 |
product sizes, and two coupler systems, five thousand and fifty-nine model descriptions, 00:36:27.700 |
six enclosures seventeen tire sizes and so forth so we can use fast AI basically to say 00:36:42.740 |
okay we'll take that data frame and pass in the categorical and continuous variables and 00:36:49.820 |
create some random splits and what's the dependent variable and we can create data loaders from 00:36:56.620 |
that and from that we can create a tabular learner and basically what that's going to 00:37:07.140 |
do is it's going to create a pretty regular multi-layer neural network not that different 00:37:15.420 |
to this one that we created by hand and each of the categorical variables it's going to 00:37:26.780 |
create an embedding for it and so I can actually show you this right so we're going to use 00:37:31.700 |
tabular learner to create the learner and so tabular learner is one two three four five 00:37:39.020 |
six seven eight nine lines of code and basically the main thing it does is create a tabular 00:37:43.620 |
model and so then tabular model you're not going to understand all of it but you might 00:37:51.420 |
be surprised at how much so a tabular model is a module we're going to be passing in how 00:37:58.220 |
big is each embedding going to be and tabular learner what's that passing in it's going 00:38:08.620 |
to call get embedding sizes just like we did manually before automatically so that's how 00:38:15.180 |
it gets its embedding sizes and then it's going to create an embedding for each of those 00:38:22.780 |
embedding sizes from number of inputs to number of factors dropout we're going to come back 00:38:30.140 |
to later batch norm we won't do till part two so then it's going to create a layer for 00:38:37.500 |
each of the layers we want which is going to contain a linear layer followed by batch 00:38:42.740 |
norm followed by dropout it's going to add the sigmoid range we've talked about at the 00:38:48.060 |
very end and so the forward this is the entire thing if there's some embeddings it'll go 00:38:56.460 |
through and get each of the embeddings using the same indexing approach we've used before 00:39:01.340 |
it'll concatenate them all together and then it'll run it through the layers of the neural 00:39:08.740 |
net which are these so yeah we don't know all of those details yet but we know quite 00:39:16.140 |
a few of them so that's encouraging hopefully 00:39:32.020 |
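A hedged sketch of that pipeline using fastai's tabular API (column names and the y_range follow the book's bulldozers chapter, and df is assumed to already hold the log of the sale price in SalePrice; the notebook's details may differ):

```python
from fastai.tabular.all import *

cont, cat = cont_cat_split(df, max_card=9000, dep_var='SalePrice')

splits = RandomSplitter()(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=cat, cont_names=cont,
                   y_names='SalePrice', splits=splits)
dls = to.dataloaders(bs=1024)

# one embedding per categorical column (sized by get_emb_sz's rule of thumb),
# concatenated with the continuous columns and fed through two linear+ReLU layers
learn = tabular_learner(dls, layers=[500, 250], y_range=(8, 12),
                        n_out=1, loss_func=F.mse_loss)
```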
and once we've got that we can do the standard LR find and fit now this exact data set was 00:39:43.620 |
used in a Kaggle competition (well, a similar data set was in a Kaggle competition) and the third 00:39:52.100 |
place getter published a paper about their technique, and it's almost exactly 00:39:57.700 |
the one I'm showing you here. So it wasn't this data set, it was a different one; 00:40:04.500 |
it was about predicting the amount of sales in different 00:40:12.540 |
stores, but they used this same basic kind of technique, and one of the interesting things 00:40:21.540 |
is that they used a lot less manual feature engineering than the other high placed entries 00:40:29.340 |
like they had a much simpler approach. And they published 00:40:33.460 |
a paper about their approach, so this is the 00:40:48.960 |
team from this company and they basically describe here exactly what I just showed you 00:40:55.220 |
these different embedding layers being concatenated together and then going through a couple of 00:41:00.220 |
layers of a neural network and it's showing here it points out in the paper exactly what 00:41:06.340 |
we learned in the last lesson which is embedding layers are exactly equivalent to linear layers 00:41:11.780 |
on top of a one hot encoded input and yeah they found that their their technique worked 00:41:23.860 |
really well one of the interesting things they also showed is that you can take you 00:41:28.860 |
can create your neural net get your trained embeddings and then you can put those embeddings 00:41:34.980 |
into a random forest or gradient boosted tree, and your mean average percent error will dramatically 00:41:42.620 |
improve. So you can actually combine random forests and embeddings, or gradient boosted 00:41:50.280 |
trees and embeddings which is really interesting now what I really wanted to show you though 00:41:55.940 |
is what they then did so as I said this was a thing about the predicted amount that different 00:42:02.180 |
products would sell for at different shops around Germany, and what they did was this: 00:42:10.260 |
one of their embedding matrices was embeddings by region, and then they did 00:42:15.660 |
what I think is a PCA, a principal component analysis, of the embeddings for their German 00:42:22.940 |
regions, and when they plotted a chart of them, you can see that the locations that are close 00:42:31.980 |
together in the embedding matrix are the same locations that are close together in Germany 00:42:38.540 |
so you can see here's the blue ones and here's the blue ones and again it's important to 00:42:42.860 |
recognize that the data that they used had no information about the location of these 00:42:48.240 |
places the fact that they are close together geographically is something that was figured 00:42:56.300 |
out as being something that actually helped it to predict sales and so in fact they then 00:43:05.220 |
did a plot showing each of these dots is a shop a store and it's showing for each pair 00:43:13.140 |
of stores how far away is it in real life in metric space and then how far away is it 00:43:21.340 |
in embedding space and there's this very strong correlation right so it's you know it's kind 00:43:28.220 |
of reconstructed somehow this kind of the kind of the geography of Germany by figuring 00:43:34.700 |
out how how people shop and similar for days of the week so there was no information really 00:43:41.900 |
about days of the week but when they put it on the embedding matrix the days of the week 00:43:47.620 |
Monday Tuesday Wednesday close to each other Thursday Friday close to each other as you 00:43:51.980 |
can see Saturday and Sunday close to each other and ditto for months of the year January 00:43:56.860 |
February March April May June. So yeah, really interesting, cool stuff, I think, to see what's actually 00:44:07.140 |
going on inside a neural network. All right, let's take a 10-minute break and 00:44:29.620 |
I will see you back here at 7:10. All right folks, this is something I think is really fun, which 00:44:40.020 |
is we're going to we've looked at what goes into the the start of a model the input we've 00:44:49.180 |
learned about how they can be categories or embeddings and embeddings are basically kind 00:44:55.020 |
of one-hot encoded categories with a little compute trick, or they can just be 00:44:59.340 |
continuous numbers we've learned about what comes the other out the other side which is 00:45:04.180 |
a bunch of activation so just a bunch a tensor of numbers which we can use things like softmax 00:45:11.220 |
to constrain them to add up to one and and so forth and we've looked at what can go in 00:45:20.700 |
the middle, which is the matrix multiplies sandwiched together with, you know, rectified 00:45:29.820 |
linear units and I mentioned that there are other things that can go in the middle as 00:45:36.020 |
well but we haven't really talked about what those other things are so I thought we might 00:45:42.340 |
look at one of the most important and interesting version of things that can go in the middle 00:45:48.900 |
but what you'll see is it turns out it's actually just another kind of matrix multiplication 00:45:54.220 |
which might not be obvious at first but I'll explain we're going to look at something called 00:45:57.620 |
a convolution and convolutions are at the heart of a convolutional neural network so 00:46:03.020 |
the first thing to realize is a convolutional neural network is very very very similar to 00:46:07.260 |
the neural networks we've seen so far it's got imports it's got things that are a lot 00:46:12.380 |
like or actually are a form of matrix multiplication sandwich with activation functions which can 00:46:17.140 |
be rectified linear but there's a particular thing which makes them very useful for computer 00:46:25.020 |
vision and I'm going to show you using this excel spreadsheet that's in our repo called 00:46:31.420 |
conv example and we're going to look at it using an image from MNIST so MNIST is kind 00:46:39.740 |
of the world's most famous computer vision data set I think because it was like the first 00:46:45.740 |
one really which really showed image recognition being being cracked it's pretty small by today's 00:46:54.300 |
standards it's a data set of handwritten digits each one is 28 by 28 pixels but it yeah you 00:47:02.860 |
know back in the mid 90s Jan LeCun showed you know really practically useful performance 00:47:09.820 |
on this data set and as a result ended up with convnets being used in the American banking 00:47:17.420 |
system for reading checks so here's an example of one of those digits this is a seven that 00:47:22.660 |
somebody drew it's one of those ones with a stroke through it and this is what it looks 00:47:27.020 |
like this is this is the image and so I got it from this is just one of the images from 00:47:34.140 |
MNIST which I put into excel and what you see in the in the next column is a version 00:47:48.780 |
of the image where the horizontal lines are being recognized and another one where the 00:47:55.940 |
vertical lines are being recognized and if you think back to that Zyla and Fergus paper 00:48:00.420 |
that talked about what the layers of a neural net does this is absolutely an example of 00:48:04.600 |
something that we we know that the first layer of a neural network tends to learn how to 00:48:11.140 |
do now how did I do this I did this using something called a convolution and so what 00:48:17.900 |
we're going to do now is we're going to zoom in to this Excel notebook we're going to keep 00:48:23.260 |
zooming in, we're going to keep zooming in, so take a look, keep an eye on this 00:48:28.460 |
image and you'll see that once we zoom in enough it's actually just made of numbers 00:48:33.420 |
which as we discussed in the very first in the very first lesson we saw how images are 00:48:42.500 |
made of numbers so here they are right here are the numbers between zero and one and what 00:48:50.200 |
I just did is I just used a little trick I used Microsoft Excel's conditional formatting 00:49:00.020 |
to basically make things the higher numbers more red so that's how I turn this Excel sheet 00:49:07.340 |
and I've just rounded it off to the nearest decimal but it's actually they're actually 00:49:12.140 |
bigger than that and so yeah so here is the image as numbers and so let me show you how 00:49:22.940 |
we went about creating this top edge detector what we did was we created this formula don't 00:49:34.300 |
worry about the max let's focus on this what it's doing is have a look at the colored in 00:49:42.540 |
areas it's taking each of these cells and multiplying them by each of these cells and 00:49:54.900 |
then adding them up and then we do the rectified linear part which is if that ends up less 00:50:03.780 |
than zero then make it zero so this is a this is like a rectified linear unit but it's not 00:50:12.260 |
doing the normal matrix product it's doing the equivalent of a dot product but just on 00:50:20.820 |
these nine cells and with just these nine weights so you might not be surprised to hear 00:50:27.460 |
that if I move now one to the right then now it's using the next nine cells right so if 00:50:37.340 |
I move like to the right quite a bit and down quite a bit here it's using these nine cells 00:50:45.700 |
so it's still doing a dot product right which as we know is a form of matrix multiplication 00:50:52.640 |
but it's doing it in this way where it's kind of taking advantage of the of the geometry 00:50:56.580 |
of this situation that the things that are close to each other are being multiplied by 00:51:01.700 |
this consistent group of the same nine weights each time because there's actually 28 by 28 00:51:08.440 |
numbers here, right, which is 28 times 28, that's 784, but we don't 00:51:21.060 |
have 784 parameters; we only have nine parameters, and so this is 00:51:26.340 |
called a convolution so a convolution is where you basically slide this kind of little three 00:51:33.980 |
by three matrix across a bigger matrix and at each location you do a dot product of the 00:51:40.780 |
corresponding elements of that three by three with the corresponding elements of this three 00:51:45.260 |
by three matrix of coefficients now why does that create something that finds as you see 00:51:51.860 |
top edges well it's because of the particular way I constructed this three by three matrix 00:51:58.340 |
what I said was that all of the rows just above so these ones are going to get a one 00:52:09.580 |
and all of the ones just below are going to get a minus one and all of the ones in the 00:52:14.140 |
middle are going to get a zero. So let's think about what happens somewhere like 00:52:19.640 |
here, right? That is, let's try to find the right one, here it is. So here we're 00:52:32.200 |
going to get 1 times 1 plus 1 times 1 plus 1 times 1 minus 1 times 1 minus 1 00:52:39.240 |
times 1 minus 1 times 1, we're going to get 0. But what about up here? Here we're 00:52:48.940 |
going to get 1 times 1 plus 1 times 1 plus 1 times 1, these do nothing because 00:52:57.840 |
they're times 0, minus 1 times 0. So we're going to get 3. So we're only going to 00:53:04.100 |
get 3, the highest possible number, in the situation where these are all as black 00:53:10.820 |
as possible, or in this case as red as possible, and these are all white. And so 00:53:15.980 |
that's only going to happen at a horizontal edge. So the one underneath it 00:53:25.460 |
does exactly the same thing, exactly the same formulas. Oopsie dozy. The one 00:53:33.880 |
underneath are exactly the same formulas. The 3 by 3 sliding thing here, but this 00:53:40.480 |
time we've got a different matrix, different little mini matrix of 00:53:43.960 |
coefficients, which is all ones going down and all minus ones going down. And 00:53:48.760 |
so for exactly the same reason, this will only be 3 in situations where they're 00:53:55.720 |
all 1 here and they're all 0 here. So you can think of a convolution as being a 00:54:05.200 |
sliding window of little mini dot products of these little 3 by 3 00:54:11.540 |
matrices. And they don't have to be 3 by 3, right? You could have, we could just 00:54:15.820 |
have easily done 5 by 5, and then we'd have a 5 by 5 matrix of coefficients, or 00:54:22.360 |
whatever, whatever size you like. So the size of this is called its kernel size. 00:54:27.700 |
This is a 3 by 3 kernel for this convolution. So then, because this is deep 00:54:38.100 |
learning, we just repeat the, we just repeat these steps again and again and 00:54:42.900 |
again. So this is, this layer I'm calling conv1, it's the first convolutional 00:54:47.520 |
layer. So conv2, it's going to be a little bit different, because on conv1 00:54:52.220 |
we only had a single channel input. It's just black and white, or you know, yeah, 00:54:56.640 |
black and white, grayscale, one channel. But now we've got two channels. We've got 00:55:05.600 |
the, let's make it a little smaller so we can see better, we've got the horizontal 00:55:12.680 |
edges channel and the vertical edges channel. And we'd have a similar thing in 00:55:17.640 |
the first layer of its color. We'd have a red channel, a green channel, and blue 00:55:20.800 |
channel. So now our, our filter, this has got the filter, this little mini matrix 00:55:44.720 |
our filter now contains a 3 by 3 by depth 2, or if you want to think of 00:55:52.960 |
another way, 2 3 by 3 kernels, or 1 3 by 3 by 2 kernel. And we basically do 00:55:59.200 |
exactly the same thing, which is we're going to multiply each of these by each 00:56:04.120 |
of these and sum them up. But then we do it for the second bit as well, we 00:56:09.120 |
multiply each of these by each of these and sum them up. And so that gives us, and 00:56:17.000 |
then I think I just picked some random numbers here, right? So this is going to 00:56:20.560 |
now be something which can combine, oh sorry, the second one, the second set, so 00:56:25.120 |
it's, sorry, each of the red ones by each of the blue ones, that's here, plus each 00:56:32.240 |
of the green ones times each of the mauve ones, that's here. So this first filter 00:56:38.200 |
is being applied to the horizontal edge detector and the second filter is being 00:56:43.520 |
applied to the vertical edge detector. And as a result we can end up with 00:56:47.520 |
something that combines features of the two things. And so then we can have a 00:56:53.440 |
second channel over here, which is just a different bunch of convolutions for each 00:57:01.720 |
of the two channels, this one times this one. Again you can see the colors. So what 00:57:08.440 |
we could do is if, you know, once we kind of get to the end, we'll end up, as I'll 00:57:14.160 |
show you how in a moment, we'll end up with a single set of 10 activations, one 00:57:22.720 |
per digit we're recognising, 0 to 9, or in this case I think we could just 00:57:28.080 |
create one, you know, maybe we're just trying to recognise nothing but the 00:57:30.840 |
number, number seven, or not the number seven, so we could just have one 00:57:33.640 |
activation. And then we would back propagate through this using SGD in the 00:57:40.920 |
usual way and that is going to end up optimising these numbers. So in this case 00:57:46.680 |
I manually put in the numbers I knew would create edge detectors. In real life 00:57:51.800 |
you start with random numbers and then you use SGD to optimise these parameters. 00:57:56.240 |
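Here's the same idea in PyTorch rather than Excel (a sketch; img here is just a stand-in for a real 28x28 MNIST digit scaled to 0-1):

```python
import torch
import torch.nn.functional as F

img = torch.rand(28, 28)                       # stand-in for a real MNIST digit

top_edge  = torch.tensor([[ 1.,  1.,  1.],
                          [ 0.,  0.,  0.],
                          [-1., -1., -1.]])
left_edge = top_edge.t()                       # the same idea rotated, for vertical edges

# conv2d wants (batch, channels, height, width) for the input
# and (out_channels, in_channels, kh, kw) for the kernels
x = img[None, None]                               # shape (1, 1, 28, 28)
k = torch.stack([top_edge, left_edge])[:, None]   # shape (2, 1, 3, 3)
feats = F.relu(F.conv2d(x, k, padding=1))         # two 28x28 feature maps
print(feats.shape)                                # torch.Size([1, 2, 28, 28])
```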
Okay, so there's a few things we can do next, and I'm going to 00:58:05.040 |
show you the way that was more common a few years ago and then I'll explain some 00:58:09.800 |
changes that have been made more recently. What happened a few years ago 00:58:13.720 |
was we would then take these activations, which as you can see are now 00:58:20.920 |
arranged in a kind of grid pattern, and we would do something called max 00:58:27.280 |
pooling. And max pooling is kind of like a convolution, it's a sliding window, but 00:58:32.400 |
this time as the sliding window goes across, so here we're up to here, we don't 00:58:37.520 |
do a dot product over a filter, but instead we just take a maximum. See here, 00:58:43.520 |
just this is the maximum of these four numbers and if we go across a little bit 00:58:48.880 |
this is the maximum of these four numbers. Go across a bit, go across a bit 00:58:54.600 |
and so forth, oh that goes off the edge. And you can see what happens when this 00:59:00.800 |
is called a 2 by 2 max pooling. So you can see what happens with a 2 by 2 max 00:59:11.380 |
pooling, we end up losing half of our activations on each dimension. So we're 00:59:20.600 |
going to end up with only one quarter of the number of activations we used to 00:59:24.920 |
have. And that's actually a good thing because if we keep on doing convolution, 00:59:32.120 |
max pool, convolution, max pool, we're going to get fewer and fewer and fewer 00:59:37.360 |
activations until eventually we'll just have one left, which is what we want. 00:59:44.600 |
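A quick illustration of 2x2 max pooling on some made-up activations:

```python
import torch
import torch.nn.functional as F

acts = torch.randn(1, 1, 4, 4)          # (batch, channels, height, width)
pooled = F.max_pool2d(acts, kernel_size=2)
print(acts.shape, '->', pooled.shape)   # (1, 1, 4, 4) -> (1, 1, 2, 2): a quarter of the activations
```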
That's effectively what we used to do, but the other thing I mentioned is we 00:59:51.400 |
didn't normally keep going until there's only one left. What we used to then do is 00:59:56.000 |
we'd basically say okay at some point we're going to take all of the 00:59:59.960 |
activations that are left and we're going to basically just do a dot product of 01:00:08.120 |
those with a bunch of coefficients, not as a convolution but just as a normal 01:00:13.680 |
linear layer, and this is called the dense layer. And then we would add them 01:00:19.520 |
all up. So we basically end up with our final big dot product of all of the max 01:00:29.240 |
pooled activations by all of the weights, and we do that for each channel. And so 01:00:34.880 |
that would give us our final activation. And as I say here, MNIST would actually 01:00:40.480 |
have 10 activations, so you'd have a separate set of weights for each of the 01:00:44.240 |
digits you're predicting, and then softmax after that. Okay, nowadays we do 01:00:50.920 |
things very slightly differently. Nowadays we normally don't have max pool 01:00:55.160 |
layers, but instead what we normally do is when we do our sliding window like 01:01:03.200 |
this one here, we don't normally - let's go back to C - so when I go one to the 01:01:11.880 |
right, so currently we're starting in cell column G, if I go one to the right 01:01:18.040 |
the next one is column H, and if I go one to the right the next one starts in 01:01:22.680 |
column I. So you can see it's sliding the window every three by three. 01:01:27.160 |
Nowadays what we tend to do instead is we generally skip one. So we would 01:01:32.240 |
normally only look at every second. So we would after doing column I, we would skip 01:01:38.640 |
columns J and would go straight to column K. And that's called a stride to 01:01:43.480 |
convolution. We do that both across the rows and down the columns. And what that 01:01:47.400 |
means is every time we do a convolution we reduce our effective kind of feature 01:01:53.840 |
size, grid size, by two on each axis. So it reduces it by four in total. So that's 01:02:01.840 |
basically instead of doing max pooling. And then the other thing that we do 01:02:07.840 |
differently is nowadays we don't normally have a single dense layer at 01:02:14.600 |
the end, a single matrix multiply at the end. But instead what we do, we generally 01:02:19.000 |
keep doing stride two convolutions. So each one's going to reduce the grid size 01:02:23.760 |
by two by two. We keep going down until we've got about a seven by seven grid. And 01:02:30.560 |
then we do a single pooling at the end. And we don't normally do max pool 01:02:35.080 |
nowadays. Instead we do an average pool. So we average the the activations of each 01:02:43.360 |
one of the seven by seven features. This is actually quite important to know 01:02:49.160 |
because if you think about what that means, it means that something like an 01:02:54.840 |
ImageNet style image detector is going to end up with a seven by seven grid. Let's 01:03:01.600 |
try to say is this a bear? And in each of the parts of the seven by seven grid 01:03:06.040 |
it's basically saying is there a bear in this part of the photo? Is there a bear 01:03:09.600 |
in this part of the photo? Is there a bear in this part of the photo? And then 01:03:12.800 |
it takes the average of those 49 seven by seven predictions to decide whether 01:03:17.560 |
there's a bear in the photo. That works very well if it's basically a photo of a 01:03:24.400 |
bear, right? Because most you know if it's if the bear is big and takes up most of 01:03:28.600 |
the frame then most of those seven by seven bits are bits of a bear. On the 01:03:34.880 |
other hand, if it's a teeny tiny bear in the corner, then potentially only one of 01:03:40.560 |
those 49 squares has a bear in it. And even worse, if it's like a picture of 01:03:46.800 |
lots and lots of different things, only one of which is a bear, it could end up 01:03:50.640 |
not being a great bear detector. And so this is where like the details of how we 01:03:56.720 |
construct our model turn out to be important. And so if you're trying to 01:04:02.640 |
find like just one part of a photo that has a small bear in it, you might decide 01:04:09.120 |
to use maximum pooling instead of average pooling. Because max 01:04:13.200 |
pooling will just say, "I think this is a picture of a bear if any one of those 01:04:17.760 |
49 bits of my grid has something that looks like a bear in it." So these are, you 01:04:25.120 |
know, these are potentially important details which often get hand-waved over. 01:04:33.120 |
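A minimal sketch of the modern pattern just described: stride-2 convolutions instead of max pooling, then a single average pool at the end (the sizes here are illustrative, not any particular fastai model):

```python
import torch
from torch import nn

def conv(ni, nf):
    # a stride-2 convolution halves the grid size on each axis
    return nn.Sequential(nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1), nn.ReLU())

simple_cnn = nn.Sequential(
    conv(1, 8),                  # 28x28 -> 14x14
    conv(8, 16),                 # 14x14 -> 7x7
    nn.AdaptiveAvgPool2d(1),     # average over the 7x7 grid -> 1x1
    nn.Flatten(),
    nn.Linear(16, 10))           # ten activations, one per digit

x = torch.randn(64, 1, 28, 28)   # a batch of MNIST-sized images
print(simple_cnn(x).shape)       # torch.Size([64, 10])
```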
Although, you know, again, like the key thing here is that this is happening 01:04:42.800 |
right at the very end, right? That max pool or that average pool. And actually 01:04:46.800 |
FastAI handles this for you. We do a special thing which we kind of 01:04:51.160 |
independently invented. I think we did it first, which is we do both max pool and 01:04:56.440 |
average pool and we concatenate them together. We call that concat pooling. And 01:05:00.880 |
that has since been reinvented in at least one paper. And so that means that 01:05:08.880 |
you don't have to think too much about it because we're going to try both for 01:05:11.800 |
you basically. So I mentioned that this is actually really just matrix 01:05:19.840 |
multiplication. And to show you that, I'm going to show you some images created by 01:05:28.320 |
a guy called Matthew Kleinsmith, who did this in, I think, the 01:05:31.720 |
very first ever course, or it might have been the first part two course. And he 01:05:39.720 |
basically pointed out that in a certain way of thinking about it, it turns out 01:05:44.520 |
that convolution is the same thing as a matrix multiplication. So I want to show you 01:05:49.320 |
how he shows this. He basically says, "Okay, let's take this 3x3 image and a 01:05:56.560 |
2x2 kernel containing the coefficients alpha, beta, gamma, delta." And so in this, as 01:06:06.760 |
we slide the window over, each of the colors, each of the colors are multiplied 01:06:16.160 |
together, red by red plus green by green plus, what is that, orange by orange plus 01:06:20.880 |
blue by blue gives you this. And so to put it another way, 01:06:24.400 |
algebraically P equals alpha times A plus beta times B, etc. And so then as we 01:06:37.800 |
slide to this part, we're multiplying again, red by red, green by green, and so 01:06:42.800 |
forth. So we can say Q equals alpha times B plus beta times C, etc. And so this is 01:06:48.600 |
how we calculate a convolution using the approach we just described as a sliding 01:06:53.560 |
window. But here's another way of thinking about it. We could say, "Okay, 01:07:04.080 |
we've got all these different things, A, B, C, D, E, F, G, H, I. Let's put them all into a 01:07:12.120 |
single vector and then let's create a single matrix that has alpha, alpha, alpha, 01:07:20.680 |
alpha, beta, beta, beta, beta, etc. And then if we do this matrix multiplied by 01:07:27.560 |
this vector, we get this with these gray zeros in the appropriate places, which 01:07:37.040 |
gives us this, which is the same as this. And so this shows that a convolution is 01:07:47.360 |
actually a special kind of matrix multiplication. It's a matrix 01:07:51.240 |
multiplication where there are some zeros that are fixed and some numbers 01:07:55.800 |
that are forced to be the same. Now in practice it's going to be faster to do it 01:08:02.560 |
this way, but it's a useful kind of thing to think about, I think, just to 01:08:08.160 |
realize, "Oh, it's just another of these special types of matrix multiplications." 01:08:12.920 |
multiplications." Okay, I think, well let's look at one more thing because there was 01:08:31.360 |
one other thing that we saw and I mentioned we would look at in the 01:08:34.320 |
tabular model, which is called dropout. And I actually have this in my Excel 01:08:39.880 |
spreadsheet. If you go to the conv example dropout page, you'll see we've 01:08:52.000 |
actually got a little bit more stuff here. We've got the same input as before 01:08:55.120 |
and the same first convolution as before and the same second convolution as 01:09:00.080 |
before. And then we've got a bunch of random numbers. They're showing as 0s and 01:09:14.160 |
1s, but that's just because they're rounding off; they're 01:09:17.880 |
actually random numbers that are floats between 0 and 1. Over here, we're 01:09:30.480 |
then saying, "If..." Let's have a look. So way up here, I'll zoom in a bit, I've got a 01:09:47.720 |
dropout factor. Let's change this say to 0.5. There we go. So over here, this is 01:09:56.760 |
something that says if the random number in the equivalent place is greater than 01:10:03.080 |
0.5, then 1, otherwise 0. And so here's a whole bunch of 1s and 0s. Now this thing 01:10:10.960 |
here is called a dropout mask. Now what happens is we multiply over here, we 01:10:20.080 |
multiply the dropout mask and we multiply it by our filtered image. And what that 01:10:28.320 |
means is we end up with the same image we started with, here's the image 01:10:35.160 |
we started with, but corrupted: random bits of it have been deleted. And 01:10:42.560 |
based on the amount of dropout we use, so if we change it to say 0.2, not very much 01:10:50.720 |
of it is deleted at all, so it's still very easy to recognize. Or else if we use lots 01:10:55.080 |
of dropout, say 0.8, it's almost impossible to see what the number was. And then we 01:11:04.000 |
use this as the input to the next layer. So that seems weird. Why would we delete 01:11:13.040 |
some data at random from our processed image from our activations after a layer 01:11:22.640 |
of the convolutions? Well the reason is that a human is able to look at this 01:11:29.200 |
corrupted image and still recognize it's a seven. And the idea is that a computer 01:11:34.400 |
should be able to as well. And if we randomly delete different bits of the 01:11:40.720 |
activations each time, then the computer is forced to learn the underlying real 01:11:50.640 |
representation rather than overfitting. You can think of this as data 01:11:56.200 |
augmentation, but it's data augmentation not for the inputs, but data augmentation 01:12:01.640 |
for the activations. So this is called a dropout layer. And so dropout layers are 01:12:09.200 |
really helpful for avoiding overfitting. And you can decide how much you want to 01:12:20.760 |
compromise between good generalization, so avoiding overfitting, 01:12:27.560 |
versus getting something that works really well on the training data. And so 01:12:32.440 |
the more dropout you use, the less good it's going to be on the training data, 01:12:37.000 |
but the better it ought to generalize. And so this comes from a paper by Geoffrey 01:12:49.240 |
Hinton's group quite a few years ago now. Ruslan Salakhutdinov is now at Apple, I think. And then 01:12:57.480 |
Krizhevsky and Hinton went on to join Google Brain. And you can see here 01:13:04.600 |
they've got this picture of a like fully connected neural network, two layers just 01:13:08.360 |
like the one we built. And here look they're kind of randomly deleting some 01:13:12.040 |
of the activations. And all that's left is these connections. And so that's a 01:13:16.520 |
different bunch that's going to be deleted, each batch. I thought 01:13:27.560 |
this is an interesting point. So dropout, which is super important, was actually 01:13:32.600 |
developed in a master's thesis. And it was rejected from the main neural 01:13:37.640 |
networks conference, then called NIPS, now called NeurIPS. So it ended up being 01:13:43.160 |
disseminated through Archive, which is a preprint server. And it's just been 01:13:52.600 |
pointed out on our chat that Ilya Sutskever was one of the founders of OpenAI. I don't 01:13:59.880 |
know what happened to Nitish Srivastava. I think he went to Google Brain as well, maybe. 01:14:05.560 |
Yeah, so you know peer review is a very fallible thing in both directions. And 01:14:15.480 |
it's great that we have preprint servers so we can read stuff like this even if 01:14:19.000 |
reviewers decide it's not worthy. It's been one of the most important papers ever. 01:14:30.840 |
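Here is a minimal sketch of the dropout-mask idea described above (a hand-rolled illustration, not PyTorch's actual nn.Dropout; note that standard implementations also rescale the kept activations by 1/(1-p) during training so their expected value is unchanged):

```python
import torch

def dropout(x, p=0.5, training=True):
    """Zero out each activation with probability p (only while training)."""
    if not training or p == 0.0:
        return x
    # The dropout mask: 1 where a random number exceeds p, 0 elsewhere
    mask = (torch.rand_like(x) > p).float()
    # Rescale the survivors by 1/(1-p) so the expected activation is unchanged
    return x * mask / (1 - p)

acts = torch.randn(2, 5)
print(dropout(acts, p=0.5))                   # roughly half the activations zeroed
print(dropout(acts, p=0.5, training=False))   # unchanged at inference time
```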
Okay, I think that's given us a good tour now. We've really seen quite a few 01:14:38.120 |
ways of dealing with input to a neural network, quite a few of the things that 01:14:41.480 |
can happen in the middle of a neural network. We've only talked about rectified 01:14:45.480 |
linear units, which is this one here, zero if x is less than zero or x otherwise. 01:14:55.800 |
These are some of the other activations you can use. Don't use this one, of course, 01:15:01.640 |
because you end up with a linear model. But they're all just different functions. 01:15:06.360 |
I should mention, it turns out these don't matter very much. 01:15:10.440 |
Basically, pretty much any non-linearity works fine. So we don't spend much time 01:15:20.040 |
talking about activation functions, even in part two of the course, just a little bit. 01:15:23.960 |
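For reference, here is a quick PyTorch sketch of a few common non-linearities (which particular ones appear on the slide may differ):

```python
import torch

x = torch.linspace(-3, 3, 7)

relu       = x.clamp(min=0)                  # zero if x < 0, otherwise x
leaky_relu = torch.where(x > 0, x, 0.1 * x)  # a small slope instead of a hard zero
sigmoid    = torch.sigmoid(x)                # squashes everything into (0, 1)
tanh       = torch.tanh(x)                   # squashes everything into (-1, 1)
identity   = x                               # the one not to use: stacking these
                                             # just collapses back into a linear model
```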
So, yeah, so we understand there's our inputs. They can be one-hot encoded or embeddings, 01:15:32.440 |
which is a computational shortcut. There are sandwiched layers of matrix multipliers 01:15:40.360 |
and activation functions. The matrix multipliers can sometimes be special cases, 01:15:44.360 |
such as the convolutions or the embeddings. The output can go through some tweaking, 01:15:52.200 |
such as softmax. And then, of course, you've got the loss function, such as cross entropy loss 01:15:58.440 |
or mean squared error or absolute error. But there's nothing too crazy going on in there. 01:16:08.920 |
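Putting that summary together as one small, hypothetical PyTorch sketch (all the sizes and names here are made up): an embedding input, sandwiched layers of matrix multiplies and activation functions, and cross entropy as the loss, which folds the softmax tweak in:

```python
import torch
import torch.nn as nn

n_categories, n_emb, n_hidden, n_classes = 100, 16, 32, 10   # made-up sizes

model = nn.Sequential(
    nn.Embedding(n_categories, n_emb),  # input: an embedding (a computational
                                        # shortcut for one-hot times a matrix)
    nn.Linear(n_emb, n_hidden),         # matrix multiply ...
    nn.ReLU(),                          # ... activation function ...
    nn.Linear(n_hidden, n_classes),     # ... matrix multiply: the "sandwich"
)

loss_func = nn.CrossEntropyLoss()       # softmax plus negative log likelihood in one

xb = torch.randint(0, n_categories, (8,))   # a mini-batch of 8 category ids
yb = torch.randint(0, n_classes, (8,))
loss = loss_func(model(xb), yb)
loss.backward()                         # gradients for every parameter, ready for an optimizer
```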
So I feel like we've got a good sense now of what goes inside 01:16:16.280 |
a wide range of neural nets. You're not going to see anything too weird from here. 01:16:22.520 |
And we've also seen a wide range of applications. 01:16:29.560 |
So the question is, you know, what now? And we're going to have a little AMA session here. And in fact, 01:16:41.720 |
one of the questions was what now? So this is quite good. 01:16:50.600 |
One thing I strongly suggest is if you've got this far, it's probably worth 01:16:56.680 |
you investing your time in reading Radek's book, which is Meta Learning. 01:17:06.200 |
And so Meta Learning is very heavily based on the kind of teachings of fast.ai over the last few 01:17:16.440 |
years and is all about how to learn deep learning and learn pretty much anything. 01:17:22.520 |
Yeah, because, you know, you've got to this point, 01:17:28.920 |
you may as well know how to get to the next point as well as possible. 01:17:40.200 |
And the main thing you'll see that Radek talks about, or one of the main things, is 01:17:51.240 |
practicing and writing. So if you've kind of zipped through the videos on, you know, 01:18:01.240 |
2x and haven't done any exercises, you know, go back and watch the videos again. You know, 01:18:06.760 |
a lot of the best students end up watching them two or three times, probably more like three times, 01:18:11.800 |
and actually go through and code as you watch, you know, and experiment. 01:18:17.320 |
You know, write posts, blog posts about what you're doing. 01:18:22.520 |
Spend time on the forum, both helping others and seeing other people's answers to questions. 01:18:33.160 |
Read the success stories on the forum and of people's projects to get inspiration for 01:18:38.280 |
things you could try. One of the most important things to do is to get together with other people. 01:18:44.120 |
For example, you can do, you know, a Zoom study group, in fact, on our Discord, 01:18:50.600 |
which you can find through our forum. There's always study groups going on, 01:18:54.440 |
or you can create your own, you know, a study group to go through the book together. 01:19:02.200 |
Yeah, and of course, you know, build stuff. And sometimes it's tricky to 01:19:07.400 |
always be able to build stuff for work, because maybe 01:19:14.120 |
you're not quite in the right area, or they're not quite ready to try out deep learning yet. 01:19:17.560 |
But that's okay. You know, build some hobby projects, build some stuff just for fun, 01:19:24.920 |
or build some stuff that you're passionate about. 01:19:29.800 |
Yeah, so it's really important to not just put the videos away and go away and do something else, 01:19:35.960 |
because you'll forget everything you've learned and you won't have practiced. 01:19:50.440 |
One of our community members, for example, went on to create an activation function, 01:19:57.320 |
which is Mish, which, as Tanishq just reminded me on our forums, is now 01:20:04.760 |
used in many of the state-of-the-art networks around the world, 01:20:10.840 |
which is pretty cool. And he's now at Mila, I think, one of the top research labs in the world. 01:20:20.600 |
I wonder how that's doing. Let's have a look, go to Google Scholar. 01:20:35.960 |
All right, let's have a look at how our AMA topic is going and pick out some questions. 01:20:57.400 |
So the first one is from Lucas, and actually maybe I should, actually let's switch our view here. 01:21:09.400 |
So our first AMA is from Lucas, and Lucas asks, "How do you stay motivated? 01:21:19.960 |
I often find myself overwhelmed in this field. There are so many new things coming up that I 01:21:26.520 |
feel like I have to put so much energy just to keep my head above the waterline." 01:21:31.240 |
Yeah, that's a very interesting question. I mean, I think, Lucas, the important thing is to realize 01:21:41.560 |
you don't have to know everything, you know. In fact, nobody knows everything. 01:21:49.240 |
And that's okay. What people do is they take an interest in some area, and they follow that, 01:21:59.400 |
and they try and do the best job they can of keeping up with some little sub area. And if 01:22:06.600 |
your little sub area is too much to keep up on, pick a sub sub area. Yeah, there's nothing like, 01:22:13.960 |
there's no need for it to be demotivating that there's a lot of people doing a lot of interesting 01:22:17.800 |
work and a lot of different sub fields. That's cool, you know. It used to be kind of dull, 01:22:23.080 |
back when there were only basically five labs in the world working on neural nets. 01:22:27.000 |
And yeah, from time to time, you know, take a dip into other areas that maybe you're not following 01:22:35.880 |
as closely. But when you're just starting out, you'll find that things are not changing that fast 01:22:43.080 |
at all, really. They can kind of look that way because people are always putting out press 01:22:47.320 |
releases about their new tweaks. But fundamentally, the stuff that is in the course now is not that 01:22:55.800 |
different to what was in the course five years ago. The foundations haven't changed. And it's 01:23:03.640 |
not that different, in fact, to the convolutional neural network that Yann LeCun used on MNIST back 01:23:09.480 |
in 1996. It's, you know, the basic ideas I've described are forever, you know, the way the 01:23:18.280 |
inputs work and the sandwiches of matrix multipliers and activation functions and the 01:23:22.520 |
stuff you do to the final layer, you know; everything else is tweaks. And the more you learn 01:23:28.760 |
about those basic ideas, the more you'll recognize those tweaks as simple little tricks that you'll 01:23:35.560 |
be able to quickly get your head around. So then Lucas goes on to ask or to comment, another thing 01:23:41.560 |
that constantly bothers me as I feel the field is getting more and more skewed towards bigger and 01:23:46.200 |
more computationally expensive models and huge amounts of data. I keep wondering if in some years 01:23:52.440 |
now, I would still be able to train reasonable models with a single GPU, or if everything is 01:23:57.800 |
going to require a compute cluster. Yeah, that's a great question. I get that a lot. 01:24:04.680 |
But interestingly, you know, I've been teaching people machine learning and data science stuff 01:24:13.560 |
for nearly 30 years. And I've had a variation of this question throughout. And the reason is that 01:24:22.760 |
engineers always want to push the envelope on, like, the biggest computers they can find, 01:24:30.760 |
you know, that's just this, like, fun thing engineers love to do. And by definition, 01:24:35.880 |
they're going to get slightly better results than people doing exactly the same thing on smaller 01:24:42.280 |
computers. So it always looks like, oh, you need big computers to be state of the art. 01:24:50.440 |
But that's actually never true, right? Because there's always smarter ways to do things, 01:24:59.400 |
not just bigger ways to do things. And so, you know, when you look at fast.ai's 01:25:04.360 |
DAWNBench success, when we trained ImageNet faster than anybody had trained it before, 01:25:11.240 |
on standard GPUs, you know, me and a bunch of students, that was not meant to happen. 01:25:18.440 |
You know, Google was working very hard with their TPU introduction to try to show how good they 01:25:23.000 |
were. Intel was using like 256 PCs in parallel or something. But yeah, you know, we used common 01:25:34.680 |
sense and smarts and showed what can be done. You know, it's also a case of picking the problems 01:25:41.960 |
you solve. So I would probably not be going head to head up against Codex and trying to 01:25:49.560 |
create code from English descriptions. You know, because that's a problem that does probably 01:25:58.520 |
require very large neural nets and very large amounts of data. But if you pick areas in different 01:26:05.640 |
domains, you know, there's still huge areas where much smaller models are still going to be state 01:26:13.960 |
of the art. So hopefully that helped answer your question. Let's see what else we got here. 01:26:22.600 |
So Daniel has obviously been following my journey with teaching my daughter math. Yeah, 01:26:33.960 |
so I homeschool my daughter. And Daniel asks, how do you homeschool young children, 01:26:39.160 |
science in general and math in particular? Would you share your experiences by blogging or in 01:26:46.680 |
lectures someday? Yeah, I could do that. So I actually spent quite a few months just reading 01:26:55.480 |
research papers about education recently. So I probably do have a lot I need to talk about 01:27:02.120 |
at some stage. But yeah, broadly speaking, I lean into using computers and tablets a lot more than 01:27:15.000 |
most people. Because actually, there's an awful lot of really great apps that are super compelling. 01:27:19.960 |
They're adaptive, so they go at the right speed for the student. And they're fun. And I really like 01:27:28.760 |
my daughter to have fun. You know, I really don't like to force her to do things. 01:27:33.320 |
And for example, there's a really cool app called DragonBox Algebra 5+, which teaches 01:27:42.440 |
algebra to five year olds by using a really fun computer game involving helping dragon eggs to 01:27:47.880 |
hatch. And it turns out that, yeah, the basic ideas of algebra are no more complex than 01:27:56.040 |
the other basic ideas that we do in kindergarten math. And all the parents I know of who have given 01:28:02.200 |
their kids DragonBox Algebra 5+, their kids have successfully learned algebra. So that would 01:28:08.200 |
be an example. But yeah, we should talk about this more at some point. All right, let's see what else 01:28:22.120 |
we've got here. So Farah says the walkthroughs have been a game changer for me. The knowledge and tips 01:28:33.640 |
you shared in those sessions are skills required to become an effective machine learning practitioner 01:28:38.040 |
and utilize fast AI more effectively. Have you considered making the walkthroughs a more formal 01:28:42.840 |
part of the course, doing a separate software engineering course, or continuing live coding 01:28:47.400 |
sessions between part one and two? So yes, I am going to keep doing live coding sessions. 01:28:52.200 |
At the moment, we've switched those specifically to focusing on APL. And then in a couple of weeks, 01:28:59.000 |
they're going to switch to fastai study groups. And then after that, they'll gradually turn back 01:29:04.040 |
into more live coding sessions. But yeah, the thing I try to do in my live coding or study groups, 01:29:12.200 |
whatever, is definitely to try to show the foundational techniques that just make life easier as a coder 01:29:20.360 |
or a data scientist. When I say foundational, I mean, yeah, the stuff which you can reuse again 01:29:26.440 |
and again and again, like learning regular expressions really well, or knowing how to 01:29:32.040 |
use a VM or understanding how to use the terminal and command line, you know, all that kind of stuff. 01:29:40.200 |
Never goes out of style. It never gets old. And yeah, I do plan to 01:29:45.080 |
at some point hopefully actually do a course really all about that stuff specifically. But yeah, 01:29:54.280 |
for now, the best approach is follow along with the live coding and stuff. 01:29:58.120 |
Okay, WGPubs, which is Wade, asks, how do you turn a model into a business? 01:30:06.760 |
Specifically, how does a coder with little or no startup experience turn an ML-based 01:30:12.120 |
Gradio prototype into a legitimate business venture? Okay, I plan to do a course about 01:30:17.560 |
this at some point as well. So, you know, obviously, there isn't a two minute version 01:30:26.920 |
to this. But the key thing with creating a legitimate business venture is to solve 01:30:35.560 |
a legitimate problem, you know, a problem that people need 01:30:39.880 |
solving, and which they will pay you to solve. And so it's important not to start with your 01:30:47.480 |
fun Gradio prototype as the basis of your business, but instead start with, here's a problem I want 01:30:54.520 |
to solve. And generally speaking, you should try to pick a problem that you understand better than 01:31:03.480 |
most people. So it's either a problem that you face day to day in your work, or in some hobby, 01:31:09.400 |
your passion that you have, or that, you know, your club has, or your local school has, or that 01:31:15.080 |
your spouse deals with in their workplace, you know, it's something where you understand that 01:31:23.240 |
there's something that doesn't work as well as it ought to. Particularly something where you think 01:31:30.280 |
yourself, you know, if they just used deep learning here, or some algorithm here, or some 01:31:37.480 |
better compute here, that problem would go away. And that's, that's the start of a business. 01:31:45.400 |
And so then my friend Eric Ries wrote a book called The Lean Startup, where he describes 01:31:54.040 |
what you do next, which is basically you fake it, you create, so he calls it the minimum viable 01:31:59.640 |
product, you create something that solves that problem, that takes you as little time as possible 01:32:05.320 |
to create, it could be very manual, it can be loss making, it's fine, you know, even the bit 01:32:10.280 |
in the middle where you're like, oh, there's going to be a neural net here, it's fine to like launch 01:32:14.600 |
without the neural net and do everything by hand. You're just trying to find out if people are going 01:32:19.880 |
to pay for this, and this is actually useful. And then once you have, you know, hopefully confirmed 01:32:26.120 |
that the need is real, that people will pay for it, and you can solve the need, you can gradually 01:32:30.920 |
make it less and less of a fake, you know, and do, you know, more and more getting the 01:32:38.760 |
product to where you want it to be. Okay, I don't know how to pronounce the name M-I-W-O-J-C. 01:32:53.160 |
M-I-W-O-J-C says, Jeremy, can you share some of your productivity hacks 01:32:59.640 |
from the content you produce, it may seem you work 24 hours a day. 01:33:03.960 |
Okay, I certainly don't do that. I think one of my main productivity hacks actually is not to work 01:33:12.840 |
too hard, or at least, not to work too much. I spend probably fewer hours 01:33:21.400 |
a day working than most people, I would guess. But I think I do a couple of things differently 01:33:28.120 |
when I'm working. One is I've spent half, at least half of every working day since I was about 18, 01:33:37.080 |
learning or practicing something new. Could be a new language, could be a new algorithm, 01:33:45.080 |
could be something I read about. And nearly all of that time, therefore, I've been doing that thing 01:33:53.160 |
more slowly than I would if I just used something I already knew. 01:33:58.680 |
Which often drives my co-workers crazy, because they're like, you know, why aren't you focusing 01:34:07.080 |
on getting that thing done? But in the other 50% of the time, I'm constantly, you know, building up 01:34:14.600 |
this kind of exponentially improving base of expertise in a wide range of areas. And so now 01:34:21.960 |
I do find, you know, I can do things, often orders of magnitude faster than people around me, or 01:34:31.160 |
certainly many multiples faster than people around me, because I, you know, know a whole bunch of 01:34:36.280 |
tools and skills and ideas which, yeah, no, other people don't necessarily know. So like, I think 01:34:43.960 |
that's one thing that's been helpful. And then another is, yeah, like trying to really 01:34:47.160 |
not overdo things, like get good sleep and eat well and exercise well. 01:34:53.480 |
And also, I think it's a case of like tenacity, you know, I've noticed a lot of people 01:35:04.280 |
give up much earlier than I do. So, yeah, if you just keep going until something's actually 01:35:15.640 |
finished, then that's going to put you in a small minority, to be honest. Most people don't do that. 01:35:23.080 |
And when I say finished, like finish something really nicely. And I try to make it like, so I 01:35:28.840 |
particularly like coding, and so I try to do a lot of coding-related stuff. So I create things 01:35:33.880 |
like nbdev, and nbdev makes it much, much easier for me to finish something nicely, you know. 01:35:40.600 |
So in my kind of chosen area, I've spent quite a bit of time trying to make sure it's really easy 01:35:47.400 |
for me to like, get out a blog post, get out a Python library, get out a notebook analysis, 01:35:53.560 |
whatever. So, yeah, trying to make these things I want to do easier, and so then I'll do them more. 01:36:01.160 |
So, well, thank you, everybody. That's been a lot of fun. Really appreciate you taking the time to go 01:36:13.080 |
through this course with me. Yeah, if you enjoyed it, it would really help if you would give a like 01:36:21.000 |
on YouTube, because it really helps other people find the course, goes into the YouTube recommendation 01:36:26.600 |
system. And please do come and help other beginners on forums.fast.ai. It's a great way to learn 01:36:34.440 |
yourself, is to try to teach other people. And yeah, I hope you'll join us in part two. 01:36:42.120 |
Thanks everybody very much. I've really enjoyed this process, and I hope to get to meet more of