
Machine Learning 1: Lesson 8


Chapters

0:0 Intro
2:0 Random forests
5:20 Recognition
7:30 Data
12:55 Tensors
16:40 Negative log likelihood
33:49 Slicing
38:20 Neural Network
44:15 Getting Started
47:15 PyTorch
47:43 GPU
53:24 AWS
56:38 PyTorch

Whisper Transcript

00:00:00.000 | So, I don't want to embarrass Rachel, but I'm very excited that Rachel's here.
00:00:07.440 | So this is Rachel, for those of you that don't know.
00:00:11.920 | She's not quite back on her feet after her illness, but well enough to at least come
00:00:15.600 | to at least part of this lesson, so don't worry if she can't stay for the whole thing.
00:00:19.720 | I'm really glad she's here because Rachel actually wrote the vast majority of the lesson
00:00:24.240 | we're going to see.
00:00:25.240 | I think it's a really, really cool work, so I'm glad she's going to at least see it being
00:00:29.760 | taught, even if unfortunately she's not teaching herself.
00:00:37.640 | Good thanksgiving present, best thanksgiving present.
00:00:41.700 | So as we discussed at the end of last lesson, we're kind of moving from the decision tree
00:00:47.400 | ensembles to neural nets, broadly defined, and as we discussed, random forests and decision
00:00:55.840 | trees are limited by the fact in the end that they're basically doing nearest neighbors.
00:01:07.680 | All they can do is to return the average of a bunch of other points.
00:01:11.560 | And so they can't extrapolate out to, if you're thinking what happens if I increase my prices
00:01:16.000 | by 20% and you've never priced at that level before, or what's going to happen to sales
00:01:21.920 | next year, and obviously we've never seen next year before, it's very hard to extrapolate,
00:01:27.440 | it's also hard if it can only make around log base 2 of N decisions, and so if there's a time series
00:01:38.080 | it needs to fit to that takes like 4 steps to get to the right time area, then suddenly
00:01:43.760 | there's not many decisions left for it to make, so there's this limited amount of computation
00:01:48.800 | that it can do, so there's a limited complexity of relationship that it can model.
00:01:56.000 | Can I ask about one more drawback of random forests that I've had?
00:02:05.620 | So if we have data as categorical variables which are not in sequential order, so for
00:02:12.360 | random forests we encode them and treat them as numbers, let's say we have 20 cardinality
00:02:18.080 | in 1 to 20, so the result that random forest gives is something like less than 5, less
00:02:26.520 | than 6, but if the categories are not sequential, not in any order, what does that mean?
00:02:37.660 | So if you've got like, let's go back to bulldozers, EROPS with AC, OROPS, and whatever, and
00:03:03.320 | we arbitrarily label them like this.
00:03:09.160 | And so actually we know that all that really mattered was if it had air conditioning.
00:03:13.880 | So what's going to happen?
00:03:15.640 | Well it's basically going to say like, okay, if I group it into those together and those
00:03:22.320 | together, that's an interesting break, just because it so happens that the air conditioning
00:03:28.280 | ones all are going to end up in the right-hand side.
00:03:31.680 | And then having done that, it's then going to say okay, well within the group with the
00:03:37.280 | 2 and 3, it's going to notice that it's furthermore going to have to split it into two more groups.
00:03:42.120 | So eventually it's going to get there, it's going to pull out that category, it's going
00:03:47.960 | to take more splits than we would ideally like.
00:03:50.600 | So it's kind of similar to the fact that for it to model a line, it can only do it with
00:03:56.040 | lots of splits and only approximately.
00:03:58.080 | So random forest is fine with categories that are not sequential also?
00:04:02.160 | Yeah, so it can do it, it's just like in some way it's sub-optimal because we just need
00:04:06.360 | to do more break points than we would have liked, but it gets there, it does a pretty
00:04:10.840 | good job.
00:04:11.840 | And so even although random forests do have some deficiencies, they're incredibly powerful,
00:04:19.760 | particularly because they have so few assumptions, they really had to screw up, and it's kind
00:04:25.320 | of hard to actually win a Kaggle competition with a random forest, but it's very easy to
00:04:30.560 | get top 10%.
00:04:32.160 | So in real life where often that third decimal place doesn't matter, random forests are often
00:04:37.800 | what you end up doing.
00:04:41.280 | But for some things like this Ecuadorian groceries competition, it's very, very hard to get a
00:04:46.200 | good result with a random forest because there's a huge time series component and nearly everything
00:04:52.400 | is these two massively high cardinality categorical variables, which is the store and the item,
00:04:58.640 | and so there's very little there to even throw at a random forest, and the difference between
00:05:05.000 | every pair of stores is kind of different in different ways.
00:05:09.360 | So there are some things that are just hard to get even relatively good results with a
00:05:15.440 | random forest.
00:05:17.040 | Another example is recognizing numbers.
00:05:22.640 | You can get like okay results with a random forest, but in the end the relationship between
00:05:29.760 | the spatial structure turns out to be important.
00:05:33.960 | And you kind of want to be able to do computations like finding edges or whatever that kind of
00:05:39.720 | carry forward through the computation.
00:05:42.620 | So just doing a clever nearest neighbors like a random forest turns out not to be ideal.
00:05:53.300 | So for stuff like this, neural networks turn out that they are ideal.
00:05:57.360 | Neural networks turn out to be something that works particularly well for both things like
00:06:02.360 | the Ecuadorian groceries competition, so forecasting sales over time by store and by item, and
00:06:09.140 | for things like recognizing digits and for things like turning voice into speech.
00:06:15.640 | And so it's kind of nice between these two things, neural nets and random forests, we
00:06:20.200 | kind of cover the territory.
00:06:22.520 | I haven't needed to use anything other than these two things for a very long time.
00:06:28.520 | And we'll actually learn, I don't know what course exactly, but at some point we'll learn
00:06:32.880 | also how to combine the two, because you can combine the two in really cool ways.
00:06:38.080 | So here's a picture from Adam Geitge of an image.
00:06:43.800 | So an image is just a bunch of numbers, and each of those numbers is 0 to 255, and the
00:06:51.400 | dark ones are close to 255, the light ones are close to 0.
00:06:55.880 | So here is an example of a digit from this MNIST dataset.
00:07:01.360 | MNIST is a really old, it's like a hello world of neural networks.
00:07:06.040 | And so here's an example.
00:07:07.600 | And so there are 28 by 28 pixels.
00:07:12.080 | If it was color, there would be three of these, one for red, one for green, one for blue.
00:07:19.620 | So our job is to look at the array of numbers and figure out that this is the number 8,
00:07:27.160 | which is tricky.
00:07:29.300 | How do we do that?
00:07:31.720 | So we're going to use a small number of fast.ai pieces, and we're gradually going to remove
00:07:39.340 | more and more and more until by the end we'll have implemented our own neural network from
00:07:44.960 | scratch, our own training loop from scratch, and our own matrix multiplication from scratch.
00:07:50.460 | So we're gradually going to dig in further and further.
00:07:55.640 | So the data for MNIST, which is the name of this very famous dataset, is available from
00:08:02.720 | here, and we have a thing in fastai.io called get_data, which will grab it from a URL and
00:08:09.880 | store it on your computer, unless it's already there, in which case it will just go ahead
00:08:13.960 | and use it.
00:08:15.760 | And then we've got a little function here called load_mnist, which simply loads it up.
00:08:22.160 | You'll see that it's zipped, so we can just use Python's gzip to open it up.
00:08:28.640 | And then it's also pickled.
00:08:30.380 | So if you have any kind of Python object at all, you can use this built-in Python library
00:08:37.320 | called pickle to dump it out onto your disk, share it around, load it up later, and you
00:08:45.120 | get back the same Python object you started with.
00:08:48.440 | So you've already seen something like this with pandas' feather format.
00:08:55.560 | Pickle is not just for pandas, and it's not specific to any one kind of object.
00:08:59.040 | It works for nearly every Python object, which might lead to the question why didn't we use
00:09:06.120 | pickle for a pandas data frame?
00:09:08.760 | And the answer is pickle works for nearly every Python object, but it's probably not
00:09:15.200 | like optimal for nearly any Python object.
00:09:19.320 | So because we were looking at pandas data frames with over 100 million rows, we really
00:09:24.640 | want to save that quickly, and so feather is a format that's specifically designed for
00:09:29.480 | that purpose, and so it's going to do that really fast.
00:09:32.080 | If we tried to pickle it, it would have taken a lot longer.
00:09:38.240 | Also note that pickle files are only for Python, so you can't give them to somebody else, whereas
00:09:44.200 | a feather file you can hand around.
00:09:47.960 | So it's worth knowing that pickle exists because if you've got some dictionary or some kind
00:09:54.440 | of object floating around that you want to save for later or send to somebody else, you
00:09:59.000 | can always just pickle it.
00:10:02.640 | So in this particular case, the folks at deeplearning.net were kind enough to provide a pickled version.
00:10:11.280 | Pickle has changed slightly over time, and so for old pickle files like this one, which
00:10:19.000 | was created with Python 2, you have to tell it that it was encoded using this
00:10:23.080 | particular Python 2 character set.
00:10:26.000 | But other than that, Python 2 and 3 can normally open each other's pickle files.
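A minimal sketch of what that loading function might look like (the function and file names follow the lesson's description; the encoding argument is the Python 2 character-set workaround just mentioned):

```python
import gzip
import pickle

def load_mnist(filename):
    # The file is gzipped, and the object inside was pickled under Python 2,
    # so we pass an explicit encoding when unpickling from Python 3.
    with gzip.open(filename, 'rb') as f:
        return pickle.load(f, encoding='latin-1')
```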
00:10:32.400 | So once we've loaded that in, we load it in like so.
00:10:37.160 | And so this thing which we're doing here, this is called destructuring.
00:10:41.280 | Destructuring means that loadMnist is giving us back a tuple of tuples, and so if we have
00:10:48.320 | on the left-hand side of the equal sign a tuple of tuples, we can fill all these things in.
00:10:53.280 | So we're given back a tuple of training data, a tuple of validation data, and a tuple of
00:10:58.760 | test data.
00:10:59.760 | In this case, I don't care about the test data, so I just put it into a variable called
00:11:04.080 | underscore. Python people tend to think of underscore as being a special
00:11:12.900 | variable which we put things we're going to throw away into.
00:11:16.440 | It's actually not special, but it's really common.
00:11:19.260 | If you see something assigned to underscore, it probably means you're just throwing it
00:11:22.760 | away.
00:11:23.760 | By the way, in a Jupyter notebook, it does have a special meaning which is the last cell
00:11:29.780 | that you calculate is always available in underscore.
00:11:36.960 | So then the first thing in that tuple is itself a tuple, and so we're going to stick that
00:11:42.060 | into x and y for our training data, and then the second one goes into x and y for our validation
00:11:48.520 | data.
00:11:49.520 | So that's called destructuring, and it's pretty common in lots of languages.
00:11:55.560 | Some languages don't support it, but those that do, life becomes a lot easier.
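For example, with made-up values just to show the pattern:

```python
# load_mnist returns ((x, y), (x_valid, y_valid), (x_test, y_test));
# a matching tuple of tuples on the left unpacks it in one line.
((x, y), (x_valid, y_valid), _) = ((1, 2), (3, 4), (5, 6))
# x == 1, y == 2, x_valid == 3, y_valid == 4; the test tuple is thrown into _
```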
00:11:59.680 | So as soon as I look at some new dataset, I just check out what have I got.
00:12:04.420 | So what's its type?
00:12:05.420 | Okay, it's a NumPy array.
00:12:08.240 | What's its shape?
00:12:09.980 | It's 50,000 by 784.
00:12:12.560 | And then what about the dependent variables?
00:12:13.960 | That's an array, its shape is 50,000.
00:12:18.480 | So this image is not of length 784, it's of size 28 by 28.
00:12:28.360 | So what happened here?
00:12:30.240 | Well, we could guess, and we can check on the website, it turns out we would be right,
00:12:34.360 | that all they did was they took the second row and concatenated it to the first row, and
00:12:38.640 | the third row and concatenated it to that, and the fourth row and concatenated it to
00:12:41.600 | that.
00:12:42.600 | So in other words, they took this whole 28 by 28 and flattened it out into a single 1D
00:12:48.000 | array.
00:12:49.000 | Does that make sense?
00:12:50.000 | So it's going to be of size 28 squared.
00:12:54.640 | This is not normal by any means, so don't think everything you see is going to be like
00:13:00.320 | this.
00:13:01.320 | Most of the time when people share images, they share them as JPGs or PNGs, you load
00:13:05.920 | them up, you get back a nice 2D array.
00:13:09.120 | But in this particular case, for whatever reason, the thing that they pickled was flattened
00:13:15.080 | out to be 784.
00:13:18.520 | And this word "flatten" is very common with working with tensors.
00:13:24.800 | So when you flatten a tensor, it just means that you're turning it into a lower rank tensor
00:13:30.480 | than you started with.
00:13:31.480 | In this case, we started with a rank2 tensor matrix for each image, and we turned each
00:13:37.640 | one into a rank1 tensor, i.e. a vector.
00:13:41.680 | So overall the whole thing is a rank2 tensor rather than a rank3 tensor.
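A small NumPy sketch of that flattening, with the shapes from the lesson (the variable names are illustrative):

```python
import numpy as np

imgs = np.zeros((50000, 28, 28))     # rank 3 tensor: images x rows x columns
flat = imgs.reshape(50000, 28 * 28)  # each 28x28 matrix becomes a length-784 vector
print(flat.shape)                    # (50000, 784): a rank 2 tensor
```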
00:13:50.360 | So just to remind us of the jargon here, this in math we would call a vector.
00:14:05.000 | In computer science we would call it a 1D array, but because deep learning people have
00:14:12.400 | to come across as smarter than everybody else, we have to call this a rank1 tensor.
00:14:21.040 | They all mean the same thing, more or less, unless you're a physicist, in which case this
00:14:25.640 | means something else, and you get very angry at the deep learning people because you say
00:14:28.920 | it's not a tensor.
00:14:31.040 | So there you go.
00:14:32.560 | Don't blame me, this is just what people say.
00:14:35.720 | So this is either a matrix or a 2D array or a rank2 tensor.
00:14:48.640 | And so once we start to get into 3 dimensions, we start to run out of mathematical names,
00:14:55.040 | which is why we start to be nice, just to say rank3 tensor.
00:14:58.600 | And so there's actually nothing special about vectors and matrices that make them in any
00:15:02.880 | way more important than rank3 tensors or rank4 tensors or whatever.
00:15:08.560 | So I try not to use the terms vector and matrix where possible because I don't really think
00:15:15.120 | they're any more special than any other rank of tensor.
00:15:19.720 | So it's good to get used to thinking of this as a rank2 tensor.
00:15:26.360 | And then the rows and columns, if we're computer science people, we would call this dimension
00:15:43.080 | 0 and dimension 1, but if we're deep learning people we would call this axis 0 or axis 1.
00:15:57.480 | And then just to be really confusing, if you're an image person, this is the first axis and
00:16:03.680 | this is the second axis.
00:16:04.800 | So if you think about TVs, 1920x1080, columns by rows, everybody else, including deep learning
00:16:14.200 | and mathematicians, rows by columns.
00:16:17.240 | So this is pretty confusing if you use the Python imaging library.
00:16:21.920 | You get back columns by rows, pretty much everything else, rows by columns.
00:16:26.680 | So be careful.
00:16:29.680 | Because they hate us, because they're bad people, I guess.
00:16:44.000 | Particularly in deep learning, a whole lot of different areas have come together, like
00:16:47.160 | information theory, computer vision, statistics, signal processing, and you've ended up with
00:16:52.520 | this hodgepodge of nomenclature in deep learning.
00:16:56.080 | Then every version of things will be used.
00:16:59.520 | So today we're going to hear about something that's called either negative log likelihood
00:17:04.360 | or binomial or categorical cross entropy, depending on where you come from.
00:17:09.520 | We've already seen something that's called either one-hot encoding or dummy variables
00:17:13.360 | depending on where you come from.
00:17:14.520 | And really it's just like the same concept gets somewhat independently invented in different
00:17:19.480 | fields and eventually they find their way to machine learning and then we don't know
00:17:24.080 | what to call them, so we call them all of the above, something like that.
00:17:29.720 | So I think that's what's happened with computer vision rows and columns.
00:17:34.600 | So there's this idea of normalizing data, which is subtracting out the mean and dividing
00:17:42.120 | by the standard deviation.
00:17:44.720 | So a question for you, often it's important to normalize the data so that we can more
00:17:54.040 | easily train a model.
00:17:56.480 | Do you think it would be important to normalize the independent variables for a random forest
00:18:02.420 | if we're training a random forest?
00:18:04.360 | I'm going to be honest, I don't know why, we don't need to normalize, I just know that
00:18:12.960 | we don't.
00:18:13.960 | We don't.
00:18:14.960 | Okay.
00:18:15.960 | Does anybody want to think about why?
00:18:17.520 | Cara.
00:18:18.520 | It wouldn't matter because each scaling and transformation we can have will be applied
00:18:26.760 | to each row and we will be computing means as we were doing, like local averages.
00:18:33.520 | And at the end we will of course want to de-normalize it back to give, so it wouldn't change the
00:18:39.480 | result.
00:18:40.480 | I'm talking about the independent variables, not the dependent variable.
00:18:43.480 | I thought you asked about dependent variables.
00:18:45.640 | Okay, who wants to have a go, Matthew?
00:18:52.120 | It might be because we just care about the relationship between the independent variables
00:18:55.920 | and the dependent variable, so scale doesn't really matter.
00:18:59.800 | Okay, go on.
00:19:00.800 | Cat, why?
00:19:02.800 | Because at each split point we can just divide to see, regardless of what scale you're on,
00:19:14.960 | what minimizes variance.
00:19:16.160 | Right, so really the key is that when we're deciding where to split, all that matters
00:19:22.640 | is the order, all that matters is how they're sorted.
00:19:26.180 | So if we subtract the mean and divide by the standard deviation, they're still sorted in
00:19:31.120 | the same order.
00:19:32.120 | Remember when we implemented the random forest, we said sort them, and then we completely
00:19:37.720 | ignored the values, we just said now add on one thing from the dependent at a time.
00:19:44.320 | So random forests only care about the sort order of the independent variables, they don't
00:19:49.840 | care at all about their size.
00:19:52.720 | And so that's why they're wonderfully immune to outliers, because they totally ignore the
00:19:57.080 | fact that it's an outlier.
00:19:58.480 | They only care about which one's higher than what other thing.
00:20:02.320 | So this is an important concept, it doesn't just appear in random forests, it occurs in
00:20:06.160 | some metrics as well.
00:20:07.780 | For example, area under the ROC curve, you come across a lot, that area under the ROC
00:20:13.760 | curve completely ignores scale and only cares about sort.
00:20:18.980 | We saw something else when we did the dendrogram, Spearman's correlation is a rank correlation,
00:20:25.160 | only cares about order, not about scale.
00:20:29.180 | So random forests, one of the many wonderful things about them are that we can completely
00:20:34.080 | ignore a lot of these statistical distribution issues.
00:20:38.960 | But we can't for deep learning, because for deep learning we're trying to train a parameterized
00:20:44.560 | model.
00:20:45.560 | So we do need to normalize our data.
00:20:48.160 | If we don't, then it's going to be much harder to create a network that trains effectively.
00:20:54.480 | So we grab the mean and the standard deviation of our training data and subtract out the
00:20:58.560 | mean, divide by the standard deviation, and that gives us a mean of 0 and a standard deviation
00:21:04.080 | of 1.
00:21:05.760 | Now for our validation data, we need to use the standard deviation and mean from the training
00:21:11.600 | data.
00:21:12.800 | We have to normalize it the same way.
00:21:15.280 | Just like categorical variables, we had to make sure they had the same indexes mapped
00:21:20.160 | to the same levels for a random forest, or missing values, we had to make sure we had
00:21:26.280 | the same median used when we were replacing the missing values.
00:21:30.760 | You need to make sure anything you do in the training set, you do exactly the same thing
00:21:34.680 | in the test and validation set.
00:21:36.400 | So here I'm subtracting out the training set mean and dividing by the training set standard deviation.
00:21:39.920 | So this is not exactly 0, this is not exactly 1, but it's pretty close.
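In code, that looks roughly like this; note that the validation set is normalized with the training set's statistics:

```python
mean = x.mean()
std = x.std()

x = (x - mean) / std              # training data now has mean ~0, std ~1
x_valid = (x_valid - mean) / std  # same coefficients, so a given value keeps the same meaning
```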
00:21:45.440 | And so in general, if you find you try something on a validation set or a test set and it's
00:21:50.440 | like much, much, much worse than your training set, it's probably because you normalized it
00:21:57.680 | in an inconsistent way or encoded categories in an inconsistent way or something like that.
00:22:04.320 | So let's take a look at some of this data.
00:22:07.720 | So we've got 10,000 images in the validation set, and each one is a rank 1 tensor of length 784.
00:22:16.720 | In order to display it, I want to turn it into a rank 2 tensor of 28x28.
00:22:23.000 | So NumPy has a reshape function that takes a tensor in and reshapes it to whatever size
00:22:33.480 | tensor you request.
00:22:35.860 | Now if you think about it, you only need to tell it about if there are d axes, you only
00:22:42.840 | need to tell it about d-1 of the axes you want, because the last one it can figure out
00:22:47.160 | for itself.
00:22:48.920 | So in total, there are 10,000 x 784 numbers here altogether.
00:22:56.240 | So if you say, well I want my last axes to be 28x28, then you can figure out that this
00:23:02.360 | must be 10,000, otherwise it's not going to fit.
00:23:08.360 | So if you put -1, it says make it as big or as small as you have to make it fit.
00:23:14.120 | So you can see here it figured out it has to be 10,000.
00:23:18.100 | So you'll see this used in neural net, software, pre-processing and stuff like that all the
00:23:25.000 | time.
00:23:26.000 | I could have written 10,000 here, but I try to get into a habit of any time I'm referring
00:23:30.580 | to how many items are in my input, I tend to use -1 because it just means later on I
00:23:37.240 | could use a sub-sample, this code wouldn't break.
00:23:41.280 | I could do some kind of stratified sampling, if it was unbalanced, this code wouldn't break.
00:23:46.160 | So by using this kind of approach of saying -1 here for the size, it just makes it more
00:23:51.840 | resilient to change this later, it's a good habit to get into.
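A sketch of that reshape, letting NumPy infer the first axis:

```python
x_imgs = x_valid.reshape(-1, 28, 28)  # -1: work out this axis from 10,000*784 / (28*28)
print(x_imgs.shape)                   # (10000, 28, 28)
```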
00:23:56.700 | So this kind of idea of being able to take tensors and reshape them and change axes around
00:24:04.360 | and stuff like that is something you need to be totally do without thinking, because
00:24:11.640 | it's going to happen all the time.
00:24:13.360 | So for example, here's one, I tried to read in some images, they were flattened, I need
00:24:18.200 | to unflatten them into a bunch of matrices, reshape, bang.
00:24:23.720 | I read some images in with OpenCV, and it turns out OpenCV orders the channels blue,
00:24:30.840 | green, red, everything else expects them to be red, green, blue, I need to reverse the
00:24:35.400 | last axis, how do you do that?
00:24:38.880 | I read in some images with Python imaging library, it reads them as rows by columns
00:24:45.720 | by channels, PyTorch expects channels by rows by columns, how do I transform that?
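Hedged sketches of those last two transformations (the image arrays here are made up; only the indexing pattern matters):

```python
import numpy as np

im_bgr = np.random.rand(224, 224, 3)      # e.g. read with OpenCV: rows x columns x BGR
im_rgb = im_bgr[..., ::-1]                # reverse the last axis: BGR -> RGB

im_hwc = np.random.rand(224, 224, 3)      # PIL-style: rows x columns x channels
im_chw = np.transpose(im_hwc, (2, 0, 1))  # PyTorch-style: channels x rows x columns
```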
00:24:52.600 | So these are all things you need to be able to do without thinking, like straight away,
00:24:58.400 | because it happens all the time and you never want to be sitting there thinking about it
00:25:02.960 | for ages.
00:25:03.960 | So make sure you spend a lot of time over the week just practicing with things like
00:25:08.560 | all the stuff we're going to see today, reshaping, slicing, reordering dimensions, stuff like
00:25:15.120 | that.
00:25:16.120 | And so the best way is to create some small tensors yourself and start thinking like,
00:25:21.600 | okay, what shall I experiment with?
00:25:23.360 | So here, can we pass that over there?
00:25:26.900 | Do you mind if I backtrack a little bit?
00:25:29.160 | Of course, I love it.
00:25:30.240 | So back in normalize, you say, you might have gone over this, but I'm still like wrestling
00:25:37.040 | with it a little bit, saying many machine learning algorithms behave better when the
00:25:39.920 | data is normalized, but you also just said that scales don't really matter.
00:25:44.760 | I said it doesn't matter for random forests.
00:25:48.920 | So random forests are just going to spit things based on order.
00:25:52.780 | And so we love them.
00:25:53.920 | We love random forests for the way they're so immune to worrying about distributional
00:25:58.800 | assumptions.
00:25:59.800 | But we're not doing random forests, we're doing deep learning, and deep learning does
00:26:02.520 | care.
00:26:03.520 | So if we have a parametric model then we should scale, and if we have a non-parametric model then we should
00:26:12.200 | not scale.
00:26:13.200 | Can we generalize?
00:26:14.200 | No, not quite, because k nearest neighbors is non-parametric and scale matters a hell
00:26:19.720 | of a lot.
00:26:20.720 | So I would say things involving trees generally are just going to split at a point, and so
00:26:26.280 | probably you don't care about scale.
00:26:28.760 | But you probably just need to think, is this an algorithm that uses order or does it use
00:26:35.000 | specific numbers?
00:26:39.320 | Can you please give us an intuition of why it needs scale, just because that will clarify
00:26:45.960 | some of the issues?
00:26:46.960 | Not until we get to doing SGD.
00:26:49.280 | So we're going to get to that.
00:26:50.680 | So for now, we're just going to take my word for it.
00:26:53.400 | Can you pass it to Daniel?
00:26:55.440 | So this is probably a dumb question, but can you explain a little bit more what you mean
00:26:59.720 | by scale?
00:27:00.760 | Because I guess when I think of scale, I'm like, oh, all the numbers should be generally
00:27:04.520 | the same size.
00:27:05.520 | That's exactly what we mean.
00:27:07.960 | But is that like the case with the cats and dogs that we went over with the deep learning?
00:27:12.880 | You could have a small cat and a larger cat, but it would still know that those were both
00:27:17.100 | cats.
00:27:18.100 | Oh, I guess this is one of these problems where language gets overloaded.
00:27:22.120 | So in computer vision, when we scale an image, we're actually increasing the size of the image.
00:27:27.880 | In this case, we're scaling the actual pixel values.
00:27:32.160 | So in both cases, scaling means to make something bigger and smaller.
00:27:35.440 | In this case, we're taking the numbers from 0 to 255 and making them so that they have
00:27:39.560 | an average of 0 and a standard deviation of 1.
00:27:43.160 | Jeremy, could you please explain, is it by column, by row?
00:27:49.440 | By pixel.
00:27:50.440 | By pixel.
00:27:51.440 | In general, when you're scaling, I'm just not thinking about every picture, but I'm
00:27:59.480 | kind of an input to how much you're learning.
00:28:01.640 | Yeah, sure.
00:28:02.640 | So it's a little bit subtle.
00:28:04.280 | But in this case, I've just got a single mean and a single standard deviation.
00:28:08.240 | So it's basically, on average, how much black is there.
00:28:14.520 | And so on average, we have a mean and a standard deviation across all the pixels.
00:28:23.080 | In computer vision, we would normally do it by channel.
00:28:26.120 | So we would normally have one number for red, one number for green, one number for blue.
00:28:33.280 | In general, you need a different set of normalization coefficients for each thing you would expect
00:28:41.280 | to behave differently.
00:28:43.000 | So if we were doing a structured data set where we've got income, distance in kilometers,
00:28:50.120 | and number of children, you'd need three separate normalization coefficients for those.
00:28:54.800 | They're very different kinds of things.
00:28:57.520 | So it's kind of like a bit domain-specific here.
00:29:02.040 | In this case, all of the pixels are levels of gray, so we've just got a single scaling
00:29:08.560 | number.
00:29:09.560 | Where else you could imagine if they were red versus green versus blue, you could need
00:29:12.480 | to scale those channels in different ways.
00:29:15.360 | Can you pass that back, please?
00:29:19.720 | So I'm having a bit of trouble imagining what would happen if we don't normalize in this
00:29:24.040 | case.
00:29:25.040 | So we'll get there.
00:29:28.120 | So this is kind of what Yannette was saying, why do we normalize?
00:29:31.400 | And for now we're normalizing because I say we have to.
00:29:35.320 | When we get to looking at stochastic gradient descent, we'll basically discover that if
00:29:40.400 | you -- basically to skip ahead a little bit, we're going to be doing a matrix multiply
00:29:46.480 | by a bunch of weights.
00:29:47.960 | We're going to pick those weights in such a way that when we do the matrix multiply,
00:29:52.200 | we're going to try to keep the numbers at the same scale that they started out as.
00:29:56.360 | And that's going to basically require the initial numbers.
00:29:59.800 | We're going to have to know what their scale is.
00:30:02.440 | So basically it's much easier to create a single kind of neural network architecture
00:30:07.440 | that works for lots of different kinds of inputs if we know that they're consistently
00:30:11.240 | going to be mean zero, standard deviation one.
00:30:14.520 | That would be the short answer.
00:30:16.920 | But we'll learn a lot more about it.
00:30:18.560 | And if in a couple of lessons you're still not quite sure why, let's come back to it
00:30:24.120 | because it's a really interesting thing to talk about.
00:30:26.600 | Yes, I'm just trying to visualize the axes we're working with here.
00:30:31.720 | So under plots, when you write x_valid.shape, we get 10,000 by 784.
00:30:37.480 | Does that mean that we brought in 10,000 pictures of that dimension?
00:30:41.800 | Exactly.
00:30:42.800 | Okay.
00:30:43.800 | And then in the next line, when you choose to reshape it, is there a reason why you put
00:30:47.800 | 28, 28 on as a y or z coordinates, or is there a reason why they're in that order?
00:30:53.920 | Yeah, there is.
00:30:56.560 | Pretty much all neural network libraries assume that the first axis is kind of the equivalent
00:31:04.080 | of a row.
00:31:05.080 | It's like a separate thing.
00:31:06.080 | It's a sentence, or an image, or an example of sales, or whatever.
00:31:12.760 | So I want each image to be a separate item of the first axis, so that leaves two more
00:31:20.980 | axes for the rows and columns of the images.
00:31:23.480 | That's pretty standard.
00:31:24.640 | That's totally standard.
00:31:25.640 | Yeah, I don't think I've ever seen a library that doesn't work that way.
00:31:28.880 | Can you pass it to our bureau?
00:31:39.120 | So while normalizing the validation data, I saw you have used mean of x and standard
00:31:45.600 | deviation of x data, training data only.
00:31:49.160 | So shouldn't we use mean and standard deviation of validation data?
00:31:54.000 | You mean like join them together, or?
00:31:56.280 | Separately calculating mean.
00:31:58.240 | No, because then you would be normalizing the validation set using different numbers.
00:32:03.720 | So now the meaning of this pixel has a value of 3 in the validation set has a different
00:32:10.540 | meaning to the meaning of 3 in the training set.
00:32:13.880 | It would be like if we had days of the week encoded such that Monday was a 1 in the training
00:32:21.600 | set and was a 0 in the validation set.
00:32:25.360 | We've got now two different sets where the same number has a different meaning.
00:32:29.080 | So let me give you an example.
00:32:33.880 | Let's say we were doing full color images and our training set contained green frogs,
00:32:42.760 | green snakes, and gray elephants, and we're trying to figure out which was which.
00:32:47.000 | We normalize using each channel mean, and then we have a validation set and a test set which
00:32:56.040 | are just green frogs and green snakes.
00:32:59.800 | So if we were to normalize by the validation sets statistics, we would end up saying things
00:33:06.000 | on average are green, and so we would remove all the greenness out.
00:33:11.000 | And so we would now fail to recognize the green frogs and the green snakes effectively.
00:33:17.280 | So we actually want to use the same normalization coefficients that we were training on.
00:33:22.480 | And for those of you doing the deep learning class, we actually go further than that.
00:33:25.840 | When we use a pre-trained network, we have to use the same normalization coefficients
00:33:29.720 | that the original authors trained on.
00:33:31.840 | So the idea is that a number needs to have this consistent meaning across every data
00:33:39.160 | set where you use it.
00:33:42.360 | Can you pass it to us, Mehta?
00:33:49.380 | That means when you are looking at the test set, you normalize the test set based on this
00:33:54.720 | mean and set.
00:33:55.720 | That's right.
00:33:56.720 | Okay.
00:33:57.720 | So the validation y values are just a rank 1 tensor of length 10,000.
00:34:12.400 | Remember there's this kind of weird Python thing where a tuple with just one thing in
00:34:17.200 | it needs a trailing comma.
00:34:19.760 | So this is a rank1 tensor of length 10,000.
00:34:22.840 | And so here's an example of something from that, it's just the number 3.
00:34:27.140 | So that's our labels.
00:34:28.940 | So here's another thing you need to be able to do in your sleep, slicing into a tensor.
00:34:35.960 | So in this case, we're slicing into the first axis with zero.
00:34:40.760 | That means we're grabbing the first slice.
00:34:44.200 | So because this is a single number, this is going to reduce the rank of the tensor by one.
00:34:48.920 | It's going to turn it from a 3-dimensional tensor into a 2-dimensional tensor.
00:34:52.760 | So you can see here, this is now just a matrix, and then we're going to grab 10 through 14
00:34:59.240 | inclusive rows, 10 through 14 inclusive columns, and here it is.
00:35:03.760 | So this is the kind of thing you need to be super comfortable grabbing pieces out, looking
00:35:08.800 | at the numbers, and looking at the picture.
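For example, using the shapes and indices from the lesson:

```python
x_imgs[0].shape          # (28, 28): indexing with a single number drops one rank
x_imgs[0, 10:15, 10:15]  # rows 10-14 and columns 10-14 of the first image, a 5x5 matrix
```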
00:35:13.320 | So here's an example of a little piece of that first image.
00:35:18.560 | And so you kind of want to get used to this idea that if you're working with something
00:35:23.000 | like pictures or audio, this is something your brain is really good at interpreting.
00:35:28.080 | So keep showing pictures of what you're doing whenever you can.
00:35:33.080 | But also remember behind the scenes, they're numbers.
00:35:35.760 | So if something's going weird, print out a few of the actual numbers.
00:35:39.640 | You might find somehow some of them have become infinity, or they're all zero, or whatever.
00:35:45.480 | So use this interactive environment to explore the data as you go.
00:35:51.660 | Just a quick, I guess, semantic question.
00:36:02.520 | Why when it's a tensor of rank 3 is it stored as x, y, z instead of, to me it would make
00:36:10.280 | more sense to store it as like a list of 2D tensors.
00:36:15.880 | Let's do it as either, right?
00:36:18.540 | Let's look at this as a 3D.
00:36:23.160 | So here's a 3D.
00:36:25.160 | So a 3D tensor is formatted as showing a list of 2D tensors basically.
00:36:31.320 | But when you're extracting it, if you're extracting the first one, why isn't it x images square
00:36:39.320 | brackets zero, closed square brackets, and then a second set of square brackets?
00:36:43.520 | Oh, because that has a different meaning, right?
00:36:46.260 | So it's kind of the difference between tensors and jagged arrays, right?
00:36:53.280 | So basically if you do something like that, that says take the second list item and from
00:37:02.040 | it grab the third list item.
00:37:03.840 | And so we tend to use that when we have something called a jagged array, which is where each
00:37:07.440 | subarray may be of a different length, right?
00:37:10.480 | Where else we have like a single object of three dimensions.
00:37:17.240 | And so we're trying to say like which little piece of it do we want.
00:37:21.740 | And so the idea is that that is a single slice object to go in and grab that piece out.
00:37:31.680 | Okay, so here's an example of a few of those images along with their labels.
00:37:42.480 | And this kind of stuff you want to be able to do pretty quickly with matplotlib.
00:37:47.480 | It's going to help you a lot in life in your exam.
00:37:51.440 | So you can have a look at what Rachel wrote here when she wrote plots.
00:37:56.280 | We can use add_subplot to basically create those little separate plots, and you need
00:38:04.280 | to know that imshow is how we basically take a numpy array and draw it as a picture.
00:38:11.360 | And then we've also added the title on top.
00:38:18.480 | So there it is.
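Rachel's plots helper isn't reproduced in the transcript, but a rough approximation of that pattern looks like this:

```python
import matplotlib.pyplot as plt

def plots(imgs, titles=None, figsize=(12, 6)):
    fig = plt.figure(figsize=figsize)
    for i, img in enumerate(imgs):
        ax = fig.add_subplot(1, len(imgs), i + 1)  # one little subplot per image
        ax.imshow(img, cmap='gray')                # draw the NumPy array as a picture
        if titles is not None:
            ax.set_title(titles[i])                # label on top
        ax.axis('off')
```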
00:38:19.960 | So let's now take that data and try to build a neural network with it.
00:38:31.680 | And so a neural network -- and sorry, this is going to be a lot of review for those of
00:38:36.200 | you already doing deep learning -- a neural network is just a particular mathematical
00:38:41.160 | function or a class of mathematical functions.
00:38:43.200 | But it's a really important class because it has the property, it supports what's called
00:38:47.480 | the universal approximation theorem, which means that a neural network can approximate
00:38:54.280 | any other function arbitrarily closely.
00:38:58.160 | So in other words, in theory it can do anything as long as we make it big enough.
00:39:05.720 | So this is very different to a function like 3x+5, which can only do one thing.
00:39:13.380 | It's a very specific function.
00:39:15.820 | For the class of functions ax+b, which can only represent lines of different slopes moving
00:39:23.080 | it up and down different amounts, or even the function ax^2+bx+c+sin(d), again, only
00:39:32.080 | can represent a very specific subset of relationships.
00:39:36.480 | The neural network, however, is a function that can represent any other function to arbitrarily
00:39:42.280 | close accuracy.
00:39:45.000 | So what we're going to do is we're going to learn how to take a function, let's take
00:39:48.760 | ax+b, and we're going to learn how to find its parameters, in this case a and b,
00:39:54.840 | which allow it to fit as closely as possible to a set of data.
00:39:58.840 | And so this here is showing an example from a notebook that we'll be looking at in the
00:40:04.200 | deep learning course, which basically shows what happens when we use something called
00:40:07.520 | stochastic gradient descent to try and set a and b.
00:40:11.520 | Basically what happens is we're going to pick a random a to start with, a random b to start
00:40:17.320 | with, and then we're going to basically figure out do I need to increase or decrease a to
00:40:23.400 | make the line closer to the dots, do I need to increase or decrease b to make the line
00:40:28.840 | closer to the dots, and then just keep increasing and decreasing a and b lots and lots of times.
00:40:34.880 | So that's what we're going to do.
00:40:36.120 | And to answer the question do I need to increase or decrease a and b, we're going to take the
00:40:40.200 | derivative.
00:40:41.600 | So the derivative of the function with respect to a and b tells us how will that function
00:40:46.760 | change as we change a and b.
00:40:49.480 | So that's basically what we're going to do.
00:40:52.000 | But we're not going to start with just a line, the idea is we're going to build up to actually
00:40:57.280 | having a neural net, and so it's going to be exactly the same idea, but because it's
00:41:01.680 | an infinitely flexible function, we're going to be able to use this exact same technique
00:41:07.000 | to fit arbitrarily complex relationships.
00:41:11.640 | That's basically the idea.
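A tiny NumPy sketch of that idea, fitting a and b to noisy data with gradient descent (the data, learning rate, and step count are all made up for illustration):

```python
import numpy as np

x = np.random.rand(100)
y_true = 3 * x + 5 + np.random.randn(100) * 0.1  # points scattered around a known line

a, b = np.random.randn(), np.random.randn()      # start with random parameters
lr = 0.1
for _ in range(1000):
    y_pred = a * x + b
    # derivatives of mean squared error with respect to a and b
    da = (2 * (y_pred - y_true) * x).mean()
    db = (2 * (y_pred - y_true)).mean()
    a -= lr * da  # nudge each parameter downhill
    b -= lr * db
# a and b should now be close to 3 and 5
```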
00:41:13.440 | So then what you need to know is that a neural net is actually a very simple thing.
00:41:21.280 | A neural net actually is something which takes as input, let's say we've got a vector, does
00:41:34.920 | a matrix product by that vector, so this is of size, let's draw this properly, so this
00:41:45.320 | is size r, this is r by c, and a matrix product will spit out something of size c.
00:41:56.960 | And then we do something called a nonlinearity, which is basically we're going to throw away
00:42:00.640 | all the negative values, so it's basically max(0,x).
00:42:06.400 | And then we're going to put that through another matrix multiply, and then we're going to put
00:42:11.200 | that through another max(0,x), and we're going to put that through another matrix multiply,
00:42:17.720 | and so on, until eventually we end up with the single vector that we want.
00:42:24.080 | So in other words, each stage of our neural network is the key thing going on is a matrix
00:42:31.720 | multiply, so in other words, a linear function.
00:42:35.600 | So basically deep learning, most of the calculation is lots and lots of linear functions, but
00:42:41.880 | between each one we're going to replace the negative numbers with zeros.
00:42:48.360 | The short answer is if you apply a linear function to a linear function to a linear function,
00:43:04.680 | it's still just a linear function, so it's totally useless.
00:43:09.200 | But if you throw away the negatives, that's actually a nonlinear transformation.
00:43:13.720 | So it turns out that if you apply a linear function to the thing where you threw away
00:43:18.720 | the negatives, apply that to a linear function, that creates a neural network, and it turns
00:43:23.480 | out that's the thing that can approximate any other function arbitrarily closely.
00:43:28.700 | So this tiny little difference actually makes all the difference.
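A minimal NumPy sketch of that stack of operations, with made-up sizes:

```python
import numpy as np

x = np.random.rand(784)         # input vector of size r
w1 = np.random.randn(784, 100)  # first weight matrix, r by c
w2 = np.random.randn(100, 10)   # second weight matrix

h = np.maximum(0, x @ w1)       # matrix product, then throw away the negatives
out = h @ w2                    # another matrix product gives the final vector of 10 outputs
```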
00:43:32.480 | And if you're interested in it, check out the deep learning video where we cover this
00:43:36.880 | because I actually show a nice visual, intuitive proof, not something that I created, but something
00:43:45.020 | that Michael Nielsen created, or if you want to skip straight to his website, you can go
00:43:50.520 | to Michael Nielsen's website (search for Michael Nielsen universal), there we go, Neural Networks
00:44:03.840 | and Deep Learning, chapter 4, and he's got a really nice walkthrough basically with lots
00:44:09.840 | of animations where you can see why this works.
00:44:18.040 | I feel like the hardest thing with getting started with technical writing on the internet
00:44:27.240 | is just posting your first thing.
00:44:33.980 | If you do a search for Rachel Thomas Medium blog, you'll find this, we'll put it on the
00:44:40.080 | Lesson Wiki, where she actually says the top advice she would give to her younger self
00:44:46.840 | would be to start blogging sooner.
00:44:50.080 | And she has both reasons why you should do it, some examples of places she's blogged
00:44:58.880 | and it's turned out to be great for her and her career, but then some tips about how to
00:45:03.440 | get started.
00:45:04.840 | I remember when I first suggested to Rachel she might think about blogging because she
00:45:08.640 | had so much interesting to say, and at first she was kind of surprised at the idea that
00:45:14.160 | she could blog.
00:45:15.760 | And now people come up to us at conferences and they're like, "You're Rachel Thomas, I
00:45:20.960 | love your writing!"
00:45:21.960 | So I've kind of seen that transition from wow, could I blog, to being known as a strong
00:45:30.040 | technical author.
00:45:32.120 | So check out this article if you still need convincing or if you're wondering how to get
00:45:38.760 | started.
00:45:39.760 | And since the first one is the hardest, maybe your first one should be something really
00:45:44.880 | easy for you to write.
00:45:47.080 | So it could be like, here's a summary of the first 15 minutes of lesson 3 of our machine
00:45:55.440 | learning course, here's why it's interesting, here's what we learned.
00:45:59.120 | Or it could be like, here's a summary of how I used a random forest to solve a particular
00:46:07.920 | problem in my practicum.
00:46:09.120 | I often get questions like, oh my practicum, my organization, we've got sensitive commercial
00:46:14.520 | data.
00:46:15.520 | That's fine, just find another dataset and do it on that instead to show the example,
00:46:23.400 | or anonymize all of the values and change the names of the variables or whatever.
00:46:30.200 | You can talk to your employer or your practicum partner to make sure that they're comfortable
00:46:37.440 | with whatever it is you're writing.
00:46:39.160 | In general though, people love it when their interns and staff blog about what they're
00:46:46.240 | working on because it makes them look super cool.
00:46:48.720 | It's like, hey, I'm an intern working at this company and I wrote this post about this cool
00:46:55.240 | analysis I did and then other people would be like, wow, that looks like a great company
00:46:58.600 | to work for.
00:46:59.600 | So generally speaking, you should find people are pretty supportive.
00:47:04.400 | Besides which, there's lots and lots of datasets out there available, so even if you can't
00:47:10.880 | base it on the work you're doing, you can find something similar for sure.
00:47:16.700 | So we're going to start building our neural network, we're going to build it using something
00:47:21.000 | called PyTorch.
00:47:22.000 | PyTorch is a library that basically looks a lot like NumPy, but when you create some
00:47:33.280 | code with PyTorch, you can run it on the GPU rather than the CPU.
00:47:39.920 | So the GPU is something which is basically going to be probably at least an order of
00:47:49.200 | magnitude, possibly hundreds of times faster than the code that you might write for the
00:47:53.480 | CPU, particularly for stuff involving lots of linear algebra.
00:47:58.720 | So with deep learning, neural nets, if you don't have a GPU, you can do it on the CPU.
00:48:07.520 | But it's going to be frustratingly slow.
00:48:13.280 | Your Mac does not have a GPU that we can use for this, because for what we're doing
00:48:21.480 | today, we need an Nvidia GPU.
00:48:25.400 | I would actually much prefer that we could use your Macs because competition is great.
00:48:29.760 | But Nvidia were really the first ones to create a GPU which did a good job of supporting general
00:48:35.240 | purpose computing on graphics processing units, GPGPU.
00:48:39.040 | So in other words, that means using a GPU for things other than playing computer games.
00:48:44.640 | They created a framework called CUDA, it's a very good framework, it's pretty much universally
00:48:52.440 | used in deep learning.
00:48:54.360 | If you don't have an Nvidia GPU, you can't use it, no current Macs have an Nvidia GPU.
00:49:01.840 | Most laptops of any kind don't have an Nvidia GPU.
00:49:04.720 | If you're interested in doing deep learning on your laptop, the good news is that you
00:49:09.600 | need to buy one which is really good for playing computer games on.
00:49:14.720 | There's a place called Xotic PC, gaming laptops, where you can go and buy yourself a great
00:49:21.000 | laptop for doing deep learning.
00:49:24.240 | You can tell your parents that you need the money to do deep learning.
00:49:34.560 | So you'll generally find a whole bunch of laptops with names like Predator and Viper
00:49:41.840 | with pictures of robots and stuff, StealthPro, Radar, Leopard.
00:49:54.080 | Having said that, I don't know that many people that do much deep learning on their laptop.
00:49:57.960 | Most people will log into a cloud environment.
00:50:01.080 | By far the easiest I know of to use is called Crestle.
00:50:05.480 | With Crestle, you can basically sign up and straight away the first thing you get is thrown
00:50:12.640 | straight into a Jupyter notebook, backed by a GPU, costs 60 cents an hour with all of the
00:50:19.040 | fast AI libraries and data already available.
00:50:23.720 | So that makes life really easy.
00:50:27.760 | It's less flexible and in some ways less fast than using AWS, which is the Amazon Web Services
00:50:37.200 | option.
00:50:38.200 | It costs a little bit more, 90 cents an hour rather than 60 cents an hour, but it's very
00:50:45.160 | likely that your employer is already using that.
00:50:49.960 | It's like it's good to get to know anyway.
00:50:52.520 | They've got more different choices around GPUs and it's a good choice.
00:50:56.500 | If you Google for GitHub Student Pack, if you're a student, you can get $150 of credits
00:51:05.840 | straight away pretty much, and that's a really good way to get started.
00:51:09.080 | Daniel, did you have a question?
00:51:14.200 | I just wanted to know your opinion on, I know that Intel recently published an open source
00:51:21.440 | way of boosting regular packages that they claim is equivalent, like if you use the bottom
00:51:27.480 | tier GPU on your CPU, if you use their boost packages, you can get the same performance.
00:51:35.160 | Do you know anything about that?
00:51:36.160 | Yeah, I do.
00:51:37.160 | It's a good question.
00:51:38.160 | And actually, Intel makes some great numerical programming libraries, particularly this one
00:51:42.780 | called MKL, the Math Kernel Library. They definitely make things faster than not using
00:51:51.560 | those libraries, but if you look at a graph of performance over time, GPUs have consistently
00:51:59.120 | throughout the last 10 years, including now, delivered about 10 times more floating point operations
00:52:06.040 | per second than the equivalent CPU, and they're generally about a fifth of the price for that
00:52:13.400 | performance.
00:52:19.840 | And then because of that, everybody doing anything with deep learning basically does
00:52:24.140 | it on Nvidia GPUs, and therefore using anything other than Nvidia GPUs is currently very annoying.
00:52:32.020 | So slower, more expensive, more annoying. I really hope there will be more activity around
00:52:36.620 | AMD GPUs in particular in this area, but AMD has got literally years of catching up to
00:52:42.080 | do, so it might take a while.
00:52:46.800 | So I just wanted to point out that you can also buy things such as a GPU extender to
00:52:51.100 | a laptop that's also kind of like maybe a first-step solution before you really want
00:52:55.720 | to put something on.
00:52:56.720 | Yeah, I think for like 300 bucks or so, you can buy something that plugs into your Thunderbolt
00:53:01.920 | port if you have a Mac, and then for another 500 or 600 bucks you can buy a GPU to plug
00:53:06.080 | into that. Having said that, for about a thousand bucks you can actually create a pretty good
00:53:11.600 | GPU-based desktop, and so if you're considering that, the fast.ai forums have lots of threads
00:53:19.560 | where people help each other spec out something at a particular price point.
00:53:23.920 | So to start with, let's say use Crestle, and then when you're ready to invest a few extra
00:53:32.560 | minutes getting going, use AWS. To use AWS, you're basically talking to the folks online
00:53:47.640 | as well.
00:53:55.500 | So AWS, when you get there, go to EC2. EC2, there's lots of stuff on AWS. EC2 is the bit
00:54:04.520 | where we get to rent computers by the hour. Now we're going to need a GPU-based instance.
00:54:14.200 | Unfortunately when you first sign up for AWS, they don't give you access to them, so you
00:54:19.240 | have to request that access. So go to limits, up in the top left, and the main GPU instance
00:54:27.240 | we'll be using is called the P2. So scroll down to P2, and here p2.xlarge, you need to
00:54:34.240 | make sure that that number is not zero. If you've just got a new account, it probably
00:54:38.040 | is zero, which means you won't be allowed to create one, so you have to go request limit
00:54:41.800 | increase. And the trick there is when it asks you why you want the limit increase, type
00:54:47.920 | fast.ai because AWS knows to look out, and they know that fast.ai people are good people,
00:54:54.040 | so they'll do it quite quickly. That takes a day or two, generally speaking, to go through.
00:55:00.560 | So once you get the email saying you've been approved for P2 instances, you can then go
00:55:05.760 | back here and say Launch Instance, and so we've basically set up one that has everything you
00:55:13.560 | need. So if you click on Community AMI, and AMI is an Amazon machine image, it's basically
00:55:19.640 | a completely set up computer. So if you type fastai, all one word, you'll find here fast.ai
00:55:28.920 | DL Part 1 version 2 for the P2. So that's all set up, ready to go. So if you click on Select,
00:55:38.360 | and it'll say, "Okay, what kind of computer do you want?" And so we have to say, "I want
00:55:43.400 | a GPU compute type, and specifically I want a P2 extra large." And then you can say Review
00:55:53.320 | and Launch. I'm assuming you already know how to deal with SSH keys and all that kind of
00:55:58.240 | stuff. If you don't, check out the introductory tutorials and workshop videos that we have
00:56:04.500 | online, or Google around for SSH keys. Very important skill to know anyway.
00:56:12.880 | So hopefully you get through all that. You have something running on a GPU with the fast.ai
00:56:20.040 | repo. If you use Crestle, just cd fastai2, the repo is already there, git pull. AWS,
00:56:29.280 | cd fastai, the repo is already there, git pull. If it's your own computer, you'll just
00:56:34.800 | have to git clone, and away you go.
00:56:39.480 | So part of all of those is PyTorch is pre-installed. So PyTorch basically means we can write code
00:56:46.040 | that looks a lot like NumPy, but it's going to run really quickly on the GPU. Secondly,
00:56:53.760 | since we need to know which direction and how much to move our parameters to improve
00:56:59.400 | our loss, we need to know the derivative of functions. PyTorch has this amazing thing
00:57:05.440 | where any code you write using the PyTorch library, it can automatically take the derivative
00:57:11.120 | of that for you. So we're not going to look at any calculus in this course, and I don't
00:57:16.240 | look at any calculus in any of my courses or in any of my work basically ever in terms
00:57:21.180 | of actually calculating derivatives myself, because I've never had to. It's done for me
00:57:27.800 | by the library. So as long as you write the Python code, the derivative is done. So the
00:57:32.600 | only calculus you really need to know to be an effective practitioner is what does it
00:57:37.480 | mean to be a derivative? And you also need to know the chain rule, which we'll come to.
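For example, a couple of lines of PyTorch show the library taking a derivative for you:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = x0^2 + x1^2
y.backward()         # PyTorch works out dy/dx automatically
print(x.grad)        # tensor([4., 6.]), i.e. 2*x
```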
00:57:47.160 | So we're going to start out kind of top-down, create a neural net, and we're going to assume
00:57:51.860 | a whole bunch of stuff. And gradually we're going to dig into each piece. So to create
00:57:57.800 | neural nets, we need to import the PyTorch neural net library. PyTorch, funnily enough,
00:58:04.200 | is not called PyTorch, it's called Torch. Torch.nn is the PyTorch subsection that's responsible
00:58:12.860 | for neural nets. So we'll call that nn. And then we're going to import a few bits out
00:58:17.240 | of fast.ai just to make life a bit easier for us.
00:58:21.680 | So here's how you create a neural network in PyTorch. The simplest possible neural network.
00:58:28.920 | You say sequential, and sequential means I am now going to give you a list of the layers
00:58:34.000 | that I want in my neural network. So in this case, my list has two things in it. The first
00:58:41.760 | thing says I want a linear layer. So a linear layer is something that's basically going
00:58:46.760 | to do y=ax+b. But as a matrix multiply, not univariate, obviously. So it's going to
00:58:57.280 | do a matrix product, basically. So the input to the matrix product is going to be a vector
00:59:03.480 | of length 28 times 28, because that's how many pixels we have. And the output needs
00:59:11.200 | to be of size 10. We'll talk about why in a moment. But for now, this is how we define
00:59:16.360 | a linear layer. And then again, we're going to dig into this in detail, but every linear
00:59:21.040 | layer just about in neural nets has to have a nonlinearity after it. And we're going to
00:59:26.200 | learn about this particular nonlinearity in a moment. It's called the softmax. And if you've
00:59:30.360 | done the DL course, you've already seen this. So that's how we define a neural net. This
00:59:35.280 | is a two-layer neural net. There's also kind of an implicit additional first layer, which
00:59:40.920 | is the input. But with PyTorch, you don't have to explicitly mention the input. But normally
00:59:46.680 | we think conceptually like the input image is kind of also a layer. Because we're kind
00:59:54.360 | of doing things pretty manually with PyTorch, we're not taking advantage of any of the conveniences
01:00:00.040 | in fast.ai for building this stuff. We have to then write .cuda, which tells PyTorch
01:00:05.600 | to copy this neural network across to the GPU. So from now on, that network is going
01:00:12.240 | to be actually running on the GPU. If we didn't say that, it would run on the CPU. So that
01:00:19.400 | gives us back a neural net, a very simple neural net. So we're then going to try and
01:00:25.480 | fit the neural net to some data. So we need some data. So fast.ai has this concept of
01:00:32.000 | a model data object, which is basically something that wraps up training data, validation data,
01:00:38.680 | and optionally test data. And so to create a model data object, you can just say I want
01:00:44.520 | to create some image classifier data, I'm going to grab it from some arrays, and you
01:00:49.720 | just say this is the path that I'm going to save any temporary files, this is my training
01:00:54.920 | data arrays, and this is my validation data arrays. And so that just returns an object
01:01:02.520 | that's going to wrap that all up, and so we're going to be able to fit to that data.
01:01:07.480 | So now that we have a neural net, and we have some data, we're going to come back to this
01:01:12.080 | in a moment, but we basically say what loss function do we want to use, what optimizer
01:01:16.520 | do we want to use, and then we say fit. We say fit this network to this data going over
01:01:26.480 | every image once using this loss function, this optimizer, and print out these metrics.
01:01:33.760 | And this says here, this is 91.8% accurate. So that's the simplest possible neural net.
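For reference, here is a rough pure-PyTorch sketch of what that Sequential net and the single-epoch fit amount to. The names x_train and y_train are hypothetical arrays of flattened 28x28 images and digit labels; the fast.ai fit call wraps up a loop along these lines:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# x_train / y_train are hypothetical arrays of flattened 28x28 images and digit labels
net = nn.Sequential(
    nn.Linear(28 * 28, 10),   # the linear layer: y = ax + b as a matrix product
    nn.LogSoftmax(dim=-1),    # the nonlinearity discussed shortly
).cuda()

loss_fn = nn.NLLLoss()                    # negative log likelihood loss
opt = optim.Adam(net.parameters())

x = torch.tensor(x_train, dtype=torch.float32).cuda()
y = torch.tensor(y_train, dtype=torch.long).cuda()

for xb, yb in zip(x.split(64), y.split(64)):   # mini-batches of 64 images
    loss = loss_fn(net(xb), yb)
    loss.backward()        # gradients of the loss with respect to the weights
    opt.step()             # nudge the weights a little in the right direction
    opt.zero_grad()
```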
01:01:43.840 | So what that's doing is it's creating a matrix multiplication followed by a nonlinearity,
01:01:54.280 | and then it's trying to find the values for this matrix which basically fit the data as
01:02:03.040 | well as possible, that end up predicting this is a 1, this is a 9, this is a 3. And so we
01:02:09.320 | need some definition for as well as possible. And so the general term for that thing is
01:02:14.600 | called the loss function. So the loss function is the function that's going to be lower if
01:02:20.560 | this is better. Just like with random forests, we had this concept of information gain, and
01:02:26.160 | we got to pick what function do you want to use to define information gain, and we were
01:02:31.000 | mainly looking at root mean squared error. Most machine learning algorithms have something
01:02:37.560 | very similar, which we call the loss. So the loss is how we score how good we are. And so in the end
01:02:43.680 | we're going to calculate the derivative of the loss with respect to the weight matrix
01:02:50.080 | that we're multiplying by to figure out how to update it. So we're going to use something
01:02:56.080 | called negative log likelihood loss. So negative log likelihood loss is also known as cross
01:03:04.040 | entropy. They're literally the same thing. There's two versions, one called binary cross
01:03:10.680 | entropy, or binary negative log likelihood, and another called categorical cross entropy.
01:03:16.520 | The same thing, one is for when you've only got a 0 or 1 dependent, the other is if you've
01:03:21.960 | got like cat, dog, airplane or horse, or 0, 1, through 9, and so forth. So what we've
01:03:28.960 | got here is the binary version of cross entropy. And so here is the definition. I think maybe
01:03:38.200 | the easiest way to understand this definition is to look at an example. So let's say we're
01:03:43.440 | trying to predict cat vs dog. 1 is cat, 0 is dog. So here we've got cat, dog, dog, cat.
01:03:53.960 | And here are our predictions. We said 90% sure it's a cat, 90% sure it's a dog, 80% sure
01:04:03.280 | it's a dog, 80% sure it's a cat. So we can then calculate the binary cross entropy by
01:04:11.600 | calling our function. So it's going to say, okay, for the first one we've got y = 1, so
01:04:17.160 | it's going to be 1 times log of 0.9, plus 1 - y, 1 - 1, is 0, so that's going to be skipped.
01:04:31.360 | And then the second one is going to be a 0, so it's going to be 0 times something, so
01:04:35.800 | that's going to be skipped. And the second part will be 1 - 0. So this is 1 times log
01:04:43.240 | of 1 - p, 1 - 0.1 is 0.9. So in other words, the first piece and the second piece of this
01:04:52.640 | are going to give exactly the same number. Which makes sense because the first one we
01:04:57.400 | said we were 90% confident it was a cat, and it was. And the second we said we were 90%
01:05:03.600 | confident it was a dog, and it was. So in each case the loss is coming from the fact
01:05:09.600 | that we could have been more confident. So if we said we were 100% confident the loss
01:05:15.120 | would have been 0. So let's look at that in Excel.
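In code, the definition being walked through looks roughly like this; a NumPy sketch matching the cat/dog example above (the notebook's exact function may differ slightly):

```python
import numpy as np

def binary_loss(y, p):
    # mean of -( y*log(p) + (1-y)*log(1-p) )
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

acts  = np.array([1, 0, 0, 1])          # cat, dog, dog, cat
preds = np.array([0.9, 0.1, 0.2, 0.8])  # our predicted probability of "cat"
binary_loss(acts, preds)                # about 0.16
```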
01:05:21.240 | So here are our predictions, 0.9, 0.1, 0.2, 0.8, and here are our actuals, 1, 0, 0, 1. So here's 1 - the
01:05:34.080 | prediction, here is log of our prediction, here is log of 1 - our prediction, and so
01:05:45.160 | then here is our sum. So if you think about it, and I want you to think about this during
01:05:54.640 | the week, you could replace this with an if statement rather than y. Because y is always
01:06:03.880 | 1 or 0, then it's only ever going to use either this or this. So you could replace this with
01:06:10.040 | an if statement. So I'd like you during the week to try to rewrite this with an if statement.
01:06:18.360 | And then see if you can then scale it out to be a categorical cross-entropy. So categorical
01:06:25.360 | cross-entropy works this way. Let's say we were trying to predict 3 and then 6 and then
01:06:30.920 | 7 and then 2. So suppose we were trying to predict a 3, and the thing we actually predicted
01:06:38.160 | was a 5. Or we were trying to predict a 3 and we accidentally predicted a 9. Being 5
01:06:47.160 | instead of 3 is no better than being 9
01:06:54.120 | instead of 3. So we're not actually going to say how far away the actual number is,
01:06:59.760 | we're going to express it differently. Or to put it another way, what if we're trying
01:07:03.840 | to predict cats, dogs, horses and airplanes? How far away is cat from horse? So we're going
01:07:11.040 | to express these a little bit differently. Rather than thinking of it as a 3, let's think
01:07:15.240 | of it as a vector with a 1 in the third location. And rather than thinking of it as a 6, let's
01:07:24.240 | think of it as a vector of zeros with a 1 in the sixth location. So in other words, one
01:07:29.480 | hot encoding. So let's one hot encode a dependent variable.
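For example, one quick way to one-hot encode those targets (a NumPy sketch):

```python
import numpy as np

y = np.array([3, 6, 7, 2])    # the targets from the example above
one_hot = np.eye(10)[y]       # each row is all zeros with a single 1 at the target index
one_hot[0]                    # array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])
```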
01:07:34.800 | And so that way now, rather than trying to predict a single number, let's predict 10
01:07:42.000 | numbers. Let's predict what's the probability that it's a 0, what's the probability it's
01:07:46.920 | a 1, what's the probability that it's a 2, and so forth. And so let's say we're trying
01:07:52.400 | to predict a 2, then here is our binary cross entropy, sorry, categorical cross entropy.
01:08:00.760 | So it's just saying, okay, did this one predict correctly or not, how far off was it, and
01:08:06.880 | so forth for each one. And so add them all up. So categorical cross entropy is identical
01:08:13.240 | to binary cross entropy, we just have to add it up across all of the categories.
01:08:20.140 | So try and turn the binary cross entropy function in Python into a categorical cross entropy
01:08:25.640 | in Python, and maybe create both the version with the if statement and the version with the
01:08:29.800 | sum and the product.
01:08:36.040 | So that's why in our PyTorch we had 10 as the output dimensionality for this matrix
01:08:47.200 | because when we multiply by a matrix with 10 columns, we're going to end up with something
01:08:53.180 | of length 10, which is what we want. We want to have 10 predictions.
01:09:03.520 | So that's the loss function that we're using. So then we can fit the model, and what it
01:09:12.560 | does is it goes through every image this many times, in this case it's just looking at every
01:09:19.320 | image once, and going to slightly update the values in that weight matrix based on those
01:09:26.800 | gradients.
01:09:29.040 | And so once we've trained it, we can then say predict using this model on the validation
01:09:37.920 | set. And now that spits out something of 10,000 by 10. Can somebody tell me why is this of
01:09:47.120 | shape these predictions? Why are they of shape 10,000 by 10?
01:09:51.000 | Go for it Chris, it's right next to you.
01:09:56.120 | Well it's because we have 10,000 images we're training on.
01:10:02.240 | 10,000 images we're training on, or rather validating on in this case, but same thing.
01:10:06.440 | So 10,000 we're validating on, so that's the first axis, and the second axis is because
01:10:11.520 | we actually make 10 predictions per image.
01:10:13.560 | Good, exactly. So each one of these rows is the probabilities that it's a 0, that it's
01:10:18.640 | a 1, that it's a 2, that it's a 3, and so forth.
01:10:24.160 | So in math, there's a really common operation we do called argmax. When I say it's common,
01:10:30.720 | it's funny, at high school I never saw argmax, first year undergrad I never saw argmax, but
01:10:39.560 | somehow after university everything's about argmax. So it's one of these things that's
01:10:44.160 | for some reason not really taught at school, but it actually turns out to be super critical.
01:10:48.340 | And so argmax is both something that you'll see in math, and it's just written out in
01:10:51.980 | full, argmax. It's in numpy, it's in pytorch, it's super important.
01:10:58.640 | And what it does is it says, let's take this array of preds, and let's figure out on this
01:11:05.520 | axis, remember axis 1 is columns, so across as Chris said, the 10 predictions for each
01:11:12.600 | row, let's find which prediction has the highest value, and return, not that, if it just said
01:11:19.240 | max it would return the value, argmax returns the index of the value. So by saying argmax
01:11:27.140 | axis equals 1, it's going to return the index, which is actually the number itself. So let's
01:11:34.160 | grab the first 5, so for the first one it thinks it's a 3, then it thinks the next one's
01:11:39.320 | an 8, the next one's a 6, the next one's a 9, the next one's a 6 again. So that's how
01:11:44.400 | we can convert our probabilities back into predictions.
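A small sketch of that conversion, plus the accuracy check that comes next; probs and y_valid are hypothetical names for the 10,000 by 10 prediction array and the validation labels:

```python
import numpy as np

# probs: the 10,000 x 10 array of per-class predictions; y_valid: the validation labels
preds = probs.argmax(axis=1)   # index of the largest value in each row = the predicted digit
preds[:5]                      # e.g. array([3, 8, 6, 9, 6])
(preds == y_valid).mean()      # accuracy: fraction of predictions that match the ground truth
```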
01:11:51.280 | So if we save that away, call it preds, we can then say, okay, when does preds equal
01:11:58.420 | the ground truth? So that's going to return an array of bools, which we can treat as 1s
01:12:04.760 | and 0s, and the mean of a bunch of 1s and 0s is just the average, so that gives us the
01:12:11.360 | accuracy, so there's our 91.8%. And so you want to be able to replicate the numbers you
01:12:18.320 | see, and here it is, there's our 91.8%. So when we train this, the last thing it tells
01:12:24.640 | us is whatever metric we asked for, and we asked for accuracy, okay. So the last thing
01:12:33.200 | it tells us is our metric, which is accuracy, and then before that we get the training set
01:12:37.760 | loss, and the loss is again whatever we asked for, negative log likelihood, and the second
01:12:43.460 | thing is the validation set loss. PyTorch doesn't use the word loss, they use the word
01:12:49.240 | criterion, so you'll see here, crit. So that's criterion equals loss. This is what loss function
01:12:55.920 | we want to use, they call that the criterion. Same thing.
01:13:03.480 | So here's how we can recreate that accuracy. So now we can go ahead and plot 8 of the images
01:13:13.300 | along with their predictions. And we've got 3, 8, 6, 9, wrong, 5, wrong, okay. And you
01:13:21.920 | can see why they're wrong. This is pretty close to a 9, it's just missing a little cross
01:13:26.680 | at the top. This is pretty close to a 5, it's got a little bit of the extra here, right.
01:13:32.080 | So we've made a start, and all we've done so far is we haven't actually created a deep
01:13:38.800 | neural net, we've actually got only one layer. So what we've actually done is we've created
01:13:44.280 | a logistic regression. So a logistic regression is literally what we just built, and you could
01:13:51.080 | try and replicate this with sklearn's logistic regression package. When I did it, I got similar
01:13:58.440 | accuracy, but this version ran much faster because this is running on the GPU, whereas
01:14:04.480 | sklearn runs on the CPU. So even for something like logistic regression, we can implement
01:14:11.520 | it very quickly with PyTorch. How can you pass that to Ian?
01:14:16.200 | So when we're creating our net, we have to do .cuda, what would be the consequence of
01:14:23.520 | not doing that? Would it just not run?
01:14:26.040 | It wouldn't run quickly. It will run on the CPU. Can you pass it to Jake?
01:14:35.280 | So with the neural network, why is it that we have to do a linear layer followed by a nonlinearity?
01:14:46.040 | So the short answer is because that's what the universal approximation theorem says is
01:14:51.080 | a structure which can give you arbitrarily accurate functions for any functional form.
01:14:57.680 | So the long answer is the details of why the universal approximation theorem works. Another
01:15:04.520 | version of the short answer is that's the definition of a neural network. So the definition
01:15:08.760 | of a neural network is a linear layer followed by an activation function, followed by a linear
01:15:14.800 | layer, followed by an activation function, etc.
01:15:19.040 | We go into a lot more detail of this in the deep learning course. But for this purpose,
01:15:25.600 | it's enough to know that it works. So far, of course, we haven't actually built a deep
01:15:31.600 | neural net at all. We've just built a logistic regression. And so at this point, if you think
01:15:37.440 | about it, all we're doing is we're taking every input pixel and multiplying it by a weight
01:15:42.240 | for each possible outcome. So we're basically saying on average, the number 1 has these
01:15:50.800 | pixels turned on, the number 2 has these pixels turned on, and that's why it's not terribly
01:15:54.920 | accurate. That's not how digit recognition works in real life, but that's all we've built
01:16:03.240 | so far.
01:16:04.240 | Michael Nielsen has this great website called Neural Networks and Deep Learning, and his
01:16:28.360 | chapter 4 is actually kind of famous now. In it, he does this walkthrough of basically
01:16:35.240 | showing that a neural network can approximate any other function to arbitrarily close accuracy
01:16:48.680 | as long as it's big enough.
01:16:50.800 | And we walk through this in a lot of detail in the deep learning course. But the basic
01:16:56.080 | trick is that he shows that with a few different numbers, you can basically cause these things
01:17:04.580 | to create little boxes. You can move the boxes up and down, you can move them around, you
01:17:09.240 | can join them together to eventually create collections of towers, which you can use to
01:17:15.520 | approximate any kind of surface.
01:17:19.480 | So that's basically the trick, and so all we need to do given that is to kind of find
01:17:29.680 | the parameters for each of the linear functions in that neural network, so to find the weights
01:17:36.760 | in each of the matrices. So far, we've got just one matrix, and so we've just built a
01:17:44.580 | simple logistic regression.
01:17:51.160 | Just a small note, I just want to confirm that when you showed examples of the images
01:17:55.840 | which were misclassified, they look rectangular, so it's just that while rendering the pixels
01:18:00.640 | are being scaled differently. So are they still 28 by 28 squares?
01:18:04.400 | They are 28 by 28. I think they're square, I think they just look rectangular because
01:18:08.120 | they've got titles on the top. I'm not sure. Good question. I don't know. Anyway, they
01:18:11.880 | are square. Matplotlib does often fiddle around with what it considers black versus white
01:18:20.440 | and having different size axes and stuff, so you do have to be very careful there sometimes.
01:18:32.540 | Hopefully this will now make more sense because what we're going to do is dig in a layer deeper
01:18:36.920 | and define logistic regression without using nn.Sequential, without using nn.Linear, without
01:18:43.640 | using nn.LogSoftmax. So we're going to do nearly all of the layer definition from scratch.
01:18:52.480 | So to do that, we're going to have to define a PyTorch module. A PyTorch module is basically
01:18:59.000 | either a neural net or a layer in a neural net, which is actually kind of a powerful
01:19:04.160 | concept in itself. Basically anything that can kind of behave like a neural net can itself
01:19:09.280 | be part of another neural net. And so this is like how we can construct particularly
01:19:14.380 | powerful architectures combining lots of other pieces.
01:19:19.520 | So to create a PyTorch module, just create a Python class, but it has to inherit from
01:19:25.840 | nn.module. So we haven't done inheritance before. Other than that, this is all the same
01:19:32.980 | concepts we've seen in OO already.
01:19:36.880 | Basically if you put something in parentheses here, what it means is that our class gets
01:19:41.400 | all of the functionality of this class for free. It's called subclassing it. So we're
01:19:47.120 | going to get all of the capabilities of a neural network module that the PyTorch authors
01:19:51.920 | have provided, and then we're going to add additional functionality to it.
01:19:57.880 | When you create a subclass, there is one key thing you need to remember to do, which is
01:20:02.720 | when you initialize your class, you have to first of all initialize the superclass. So
01:20:09.280 | the superclass is the nn.module. So the nn.module has to be built before you can start adding
01:20:16.560 | your pieces to it. And so this is just like something you can copy and paste into every
01:20:21.300 | one of your modules. You just say super().__init__(), which just means construct the superclass first.
01:20:31.920 | Having done that, we can now go ahead and define our weights and our bias. So our weights is
01:20:39.400 | the weight matrix. It's the actual matrix that we're going to multiply our data by.
01:20:44.440 | And as we discussed, it's going to have 28x28 rows and 10 columns. And that's because if
01:20:51.760 | we take an image which we flattened out into a 28x28 length vector, then we can multiply
01:21:01.160 | it by this weight matrix to get back out a length 10 vector, which we can then use to
01:21:11.360 | consider it as a set of predictions.
01:21:15.640 | So that's our weight matrix. Now the problem is that we don't just want y=ax, we want y=ax+b.
01:21:26.240 | So the +b in neural nets is called bias, and so as well as defining weights, we're also
01:21:32.160 | going to define the bias. And so since this thing is going to spit out for every image something
01:21:38.640 | of length 10, that means that we need to create a vector of length 10 to be our biases. In
01:21:46.920 | other words, for everything 0, 1, 2, 3, up to 9, we're going to have a different +b that
01:21:53.800 | we'll be adding.
01:21:56.440 | So we've got our data matrix here, which is of length 10,000 by 28x28. And then we've got
01:22:15.720 | our weight matrix, which is 28x28 rows by 10. So if we multiply those together, we get something
01:22:31.720 | of size 10,000 by 10. And then we want to add on our bias, like so. And so when we add
01:22:57.760 | on, and we're going to learn a lot more about this later, but when we add on a vector like
01:23:03.120 | this, it basically is going to get added to every row. So the bias is going to get added
01:23:11.800 | to every row.
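A tiny sketch of that broadcasting behaviour:

```python
import torch

a = torch.randn(10000, 10)   # e.g. the result of the (10000 x 784) @ (784 x 10) product
b = torch.randn(10)          # one bias per output column
(a + b).shape                # torch.Size([10000, 10]) -- b gets added to every row
```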
01:23:13.880 | So we first of all define those. And so to define them, we've created a tiny little function
01:23:19.100 | called get_weights, which is over here, which basically just creates some normally distributed
01:23:25.160 | random numbers. So torch.randn returns a tensor filled with random numbers from a normal distribution.
01:23:33.760 | We have to be a bit careful though. When we do deep learning, like when we add more linear
01:23:39.200 | layers later, imagine if we have a matrix which on average tends to increase the size
01:23:47.800 | of the inputs we give to it. If we then multiply by lots of matrices of that size, it's going
01:23:54.200 | to make the numbers bigger and bigger and bigger, like exponentially bigger. Or what
01:23:59.440 | if it made them a bit smaller? It's going to make them smaller and smaller and smaller
01:24:03.000 | exponentially smaller.
01:24:04.960 | So because a deep network applies lots of linear layers, if on average they result in
01:24:11.400 | things a bit bigger than they started with, or a bit smaller than they started with, it's
01:24:16.320 | going to exponentially multiply that difference. So we need to make sure that the weight matrix
01:24:23.800 | is of an appropriate size that the mean of the inputs basically is not going to change.
01:24:32.960 | So it turns out that if you use normally distributed random numbers and divide them by the number
01:24:40.640 | of rows in the weight matrix, it turns out that particular random initialization keeps
01:24:48.280 | your numbers at about the right scale. So this idea that if you've done linear algebra,
01:24:55.080 | basically if the first eigenvalue is bigger than 1 or smaller than 1, it's going to cause
01:25:01.640 | the gradients to get bigger and bigger, or smaller and smaller, that's called gradient
01:25:06.240 | explosion.
01:25:08.040 | So we'll talk more about this in the deep learning course, but if you're interested,
01:25:12.280 | you can look up Kaiming He initialization, and read all about this concept. But for now,
01:25:21.240 | it's probably just enough to know that if you use this type of random number generation,
01:25:27.680 | you're going to get random numbers that are nicely behaved. You're going to start out
01:25:32.000 | with an input, which is mean 0, standard deviation 1. Once you put it through this set of random
01:25:38.120 | numbers, you'll still have something that's about mean 0, standard deviation 1. That's
01:25:42.400 | basically the goal.
01:25:45.240 | One nice thing about PyTorch is that you can play with this stuff. So torch.randn, try
01:25:52.320 | it out. Every time you see a function being used, run it and take a look. And so you'll
01:25:57.520 | see it looks a lot like NumPy, but it doesn't return a NumPy array, it returns a tensor.
01:26:06.060 | And in fact, now I'm GPU programming. I just multiplied that matrix by 3 very quickly on
01:26:20.920 | the GPU. So that's how we do GPU programming with PyTorch.
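For example, something along these lines (guarded so it also runs on a machine without a GPU):

```python
import torch

t = torch.randn(3, 4)             # looks a lot like a NumPy array, but it's a tensor
if torch.cuda.is_available():
    t = t.cuda()                  # copy it across to the GPU
t * 3                             # the multiplication runs on the GPU if one is available
```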
01:26:29.080 | So this is our weight matrix. As I said, we create one that's 28x28 by 10, and one that's just rank 1,
01:26:36.640 | of length 10, for the biases. We have to make them parameters. This is basically telling PyTorch
01:26:42.920 | which things to update when it does SGD. That's a very minor technical detail.
01:26:49.600 | So having created the weight matrices, we then define a special method with the name
01:26:54.560 | forward. This is a special method. The name forward has a special meaning in PyTorch.
01:27:01.800 | A method called forward in PyTorch is the name of the method that will get called when
01:27:07.360 | your layer is calculated. So if you create a neural net or a layer, you have to define
01:27:14.080 | forward. And it's going to get passed the data from the previous layer.
01:27:20.520 | So our definition is to do a matrix multiplication of our input data times our weights and add
01:27:28.480 | on the biases. So that's it. That's what happened earlier on when we said nn.linear. It created
01:27:37.800 | this thing for us.
01:27:41.840 | Now unfortunately though, we're not getting a 28x28 long vector. We're getting a 28 row
01:27:48.560 | by 28 column matrix, so we have to flatten it. Unfortunately in PyTorch, they tend to
01:27:55.800 | rename things. What NumPy calls reshape, they call view. So view means reshape. So you can
01:28:05.160 | see here we end up with something where the number of images we're going to leave the
01:28:09.840 | same, and then we're going to replace the rows and columns with a single axis, with -1 again meaning
01:28:17.720 | as long as required.
01:28:20.160 | So this is how we flatten something using PyTorch. So we flatten it, do a matrix multiply,
01:28:26.920 | and then finally we do a softmax. So softmax is the activation function we use. If you
01:28:35.200 | look in the deep learning repo, you'll find something called entropy example, where you'll
01:28:41.440 | see an example of softmax. But a softmax simply takes the outputs from our final layer, so
01:28:49.040 | we get our outputs from our linear layer, and what we do is we go e to the power of each output,
01:28:58.660 | and then we take that number and we divide it by the sum of those exponentials. That's called softmax.
01:29:06.640 | Why do we do that? Well, because we're dividing this by the sum, that means that the sum of
01:29:13.680 | those itself must add to 1, and that's what we want. We want the probabilities of all
01:29:20.320 | the possible outcomes add to 1.
01:29:23.200 | Furthermore, because we're using the exponential, that means we know that every one of these is between
01:29:28.640 | 0 and 1, and probabilities we know should be between 0 and 1.
01:29:34.680 | And then finally, because we're using the exponential, it tends to mean that slightly bigger values
01:29:43.680 | in the input turn into much bigger values in the output. So you'll see generally speaking
01:29:47.960 | in my softmax, there's going to be one big number and lots of small numbers. And that's
01:29:53.160 | what we want, because we know that the output is one hot encoded.
01:29:58.120 | So in other words, a softmax activation function, the softmax nonlinearity, is something that
01:30:05.080 | returns things that behave like probabilities, and where one of those probabilities is more
01:30:10.800 | likely to be kind of high and the other ones are more likely to be low. And we know that's
01:30:15.520 | what we want to map to our one hot encoding, so a softmax is a great activation function
01:30:22.400 | to use to help the neural net, make it easier for the neural net to map to the output that
01:30:30.040 | you wanted.
01:30:31.040 | And this is what we generally want. When we're designing neural networks, we try to come
01:30:35.140 | up with little architectural tweaks that make it as easy for it as possible to match the
01:30:41.680 | output that we know we want.
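Written out, softmax is short enough to try yourself; a sketch of the formula just described:

```python
import torch

def softmax(x):
    e = torch.exp(x)                        # e to the power of each output
    return e / e.sum(dim=-1, keepdim=True)  # divide by the sum so the results add to 1

logits = torch.tensor([1.0, 3.0, 0.5])
softmax(logits)   # all between 0 and 1, sum to 1, and the biggest input dominates
```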
01:30:45.560 | So that's basically it, right? Rather than doing Sequential and using nn.Linear and nn.LogSoftmax,
01:30:52.040 | we've defined it from scratch. We can now say, just like before, our net is equal to
01:30:57.600 | that class .cuda and we can say .fit and we get to within a slight random deviation exactly
01:31:04.120 | the same output.
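Pulling that together, the from-scratch module looks roughly like this. The class name LogReg is just illustrative, and get_weights follows the description above rather than any exact library code:

```python
import torch
import torch.nn as nn

def get_weights(*dims):
    # normally distributed random numbers, divided by the number of rows so that
    # activations keep roughly the same scale from layer to layer
    return nn.Parameter(torch.randn(dims) / dims[0])

class LogReg(nn.Module):          # illustrative name for the module described above
    def __init__(self):
        super().__init__()                      # construct the superclass first
        self.l1_w = get_weights(28 * 28, 10)    # the weight matrix
        self.l1_b = get_weights(10)             # the biases, one per output

    def forward(self, x):
        x = x.view(x.size(0), -1)               # flatten each 28x28 image into a vector
        x = x @ self.l1_w + self.l1_b           # matrix multiply plus bias
        # log of the softmax, since the negative log likelihood loss used earlier
        # expects log-probabilities
        return torch.log(torch.softmax(x, dim=-1))

net = LogReg().cuda()   # then fit it exactly as before
```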
01:31:07.200 | So what I'd like you to do during the week is to play around with torch.randn to generate
01:31:12.380 | some random tensors, torch.matmul to start multiplying them together, adding them up,
01:31:18.480 | try to make sure that you can rewrite softmax yourself from scratch, try to fiddle around
01:31:24.720 | a bit with reshaping view, all that kind of stuff. So by the time you come back next week,
01:31:30.720 | you feel pretty comfortable with PyTorch.
01:31:33.120 | And if you Google for PyTorch tutorial, you'll see there's a lot of great material actually
01:31:38.960 | on the PyTorch website to help you along, basically showing you how to create tensors
01:31:46.800 | and modify them and do operations on them.
01:31:50.040 | All right, great. Yes, you had a question. Can you pass it over?
01:31:58.000 | So I see that the forward is the layer that gets applied after each of the linear layers.
01:32:02.960 | Well, not quite. The forward is just the definition of the module. So this is like how we're implementing
01:32:11.120 | linear.
01:32:12.120 | Does that mean after each linear layer, we have to apply the same function? Or
01:32:17.000 | can we do a log softmax after layer one and then apply some other function after layer
01:32:23.480 | two if we have like a multi-layer neural network?
01:32:28.120 | So normally we define neural networks like so. We just say here is a list of the layers
01:32:41.240 | we want. You don't have to write your own forward. All we did just now was to say instead of
01:32:50.160 | doing this, let's not use any of this at all, but write it all by hand ourselves.
01:32:56.680 | So you can write as many layers as you like in any order you like here. The point was
01:33:04.060 | that here we're not using any of that. We've written our own matmul plus bias, our own
01:33:13.520 | softmax. This is just Python code. You can write whatever Python code inside forward
01:33:20.920 | that you like to define your own neural net.
01:33:26.200 | You won't normally do this yourself. Normally you'll just use the layers that PyTorch provides
01:33:30.700 | and you'll use Sequential to put them together, or even more likely you'll download a predefined
01:33:35.720 | architecture and use that. We're just doing this to learn how it works behind the scenes.
01:33:42.120 | Alright, great. Thanks everybody.