Lesson 8 (2019) - Deep Learning from the Foundations
Chapters
0:00 Introduction
0:40 Overview
5:12 Bottom Up
6:47 Why Swift
10:50 Swift for TensorFlow
14:50 The Game
17:00 Why do this
18:47 Homework
19:42 Remember Part 1
21:01 Three Steps to Training a Good Model
23:10 Reading Papers
25:25 Symbols
28:10 Jupyter Notebooks
32:56 Run Notebook
34:13 Notebook to Script
36:00 Standard Library
38:38 Plot
39:52 Matrix Multiplication
44:36 Removing Python
45:12 Elementwise Addition
46:02 Frobenius Norm
48:50 Recap
49:26 Replace inner loop
51:51 Broadcasting
57:05 Columns
00:00:00.000 |
So, welcome back to part two of what previously was called Practical Deep Learning for Coders, 00:00:08.960 |
but part two is not called that, as you will see. 00:00:11.680 |
It's called Deep Learning from the Foundations. 00:00:15.680 |
It's lesson eight because it's lesson eight of the full journey, lesson one of part two, 00:00:20.440 |
or lesson eight, mod seven, as we sometimes call it. 00:00:26.540 |
So those of you, I know a lot of you do every year's course and keep coming back. 00:00:30.280 |
For those of you doing that, this will not look at all familiar to you. 00:00:36.240 |
We're really excited about it and hope you like it as well. 00:00:40.520 |
The basic idea of Deep Learning from the Foundations is that we are going to implement much of the fastai library from the foundations. 00:00:49.640 |
Now, I'll talk about exactly what I mean by foundations in a moment, but it basically means from scratch. 00:00:55.880 |
So we'll be looking at basic matrix calculus and creating a training loop from scratch 00:01:01.200 |
and creating an optimizer from scratch and lots of different layers and architectures 00:01:05.360 |
and so forth, and not just to create some kind of dumbed down library that's not useful 00:01:11.760 |
for anything, but to actually build from scratch something you can train cutting edge world-class models with. 00:01:22.080 |
I don't think anybody's ever done this before. 00:01:24.500 |
So I don't exactly know how far we'll get, but this is the journey that we're on. 00:01:30.600 |
So in the process, we will be having to read and implement papers, because the fastai library is full of implemented papers. 00:01:40.160 |
So you're not going to be able to do this if you're not reading and implementing papers. 00:01:44.160 |
Along the way, we'll be implementing much of PyTorch as well, as you'll see. 00:01:52.520 |
We'll also be going deeper into solving some applications that are not kind of fully baked 00:02:00.960 |
into the fast AI library yet, so it's going to require a lot of custom work. 00:02:05.720 |
So things like object detection, sequence to sequence with attention, transformer and 00:02:10.680 |
the Transformer-XL, CycleGAN, audio, stuff like that. 00:02:17.360 |
We'll also be doing a deeper dive into some performance considerations like doing distributed 00:02:22.200 |
multi-GPU training, using the new just-in-time compiler, which we'll just call JIT from now on. 00:02:34.720 |
And then the last two lessons, implementing some subset of that in Swift. 00:02:44.000 |
So this is otherwise known as impractical deep learning for coders. 00:02:48.880 |
Because really none of this is stuff that you're going to go and use right away. 00:02:55.520 |
Part one was like, oh, we've been spending 20 minutes on this. 00:02:58.160 |
You can now create a world-class vision classification model. 00:03:03.960 |
This is not that, because you already know how to do that. 00:03:08.200 |
And so back in the earlier years, part two used to be more of the same thing, but it 00:03:13.520 |
was kind of like more advanced types of model, more advanced architectures. 00:03:19.400 |
But there's a couple of reasons we've changed this year. 00:03:21.600 |
The first is so many papers come out now, because this whole area has increased in scale 00:03:28.360 |
so quickly, that I can't pick out for you the 12 papers to do in the next seven weeks 00:03:34.560 |
that you really need to know, because there's too many. 00:03:37.880 |
And it's also kind of pointless, because once you get into it, you realize that all the 00:03:41.120 |
papers pretty much say minor variations on the same thing. 00:03:46.080 |
So instead, what I want to be able to do is show you the foundations that let you read 00:03:50.840 |
the 12 papers you care about and realize like, oh, that's just that thing with this minor 00:03:56.800 |
And I now have all the tools I need to implement that and test it and experiment with it. 00:04:01.340 |
So that's kind of a really key issue in why we want to go in this direction. 00:04:09.600 |
Also it's increasingly clear that, you know, we used to call part two cutting edge deep 00:04:14.720 |
learning for coders, but it's increasingly clear that the cutting edge of deep learning 00:04:19.120 |
is really about engineering, not about papers. 00:04:24.880 |
The difference between really effective people in deep learning and the rest is really about 00:04:29.200 |
who can, like, make things in code that work properly. 00:04:36.120 |
So really, the goal of this part two is to deepen your practice so you can understand, 00:04:44.920 |
you know, the things that you care about, and build the things you care about, and have them work the way you want. 00:04:56.200 |
And so it's impractical in the sense that like none of these are things that you're 00:05:00.200 |
going to go probably straight away and say, here's this thing I built, right? 00:05:06.340 |
Because Swift, we're actually going to be learning a language in a library that as you'll 00:05:10.480 |
see is far from ready for use, and I'll describe why we're doing that in a moment. 00:05:16.080 |
So part one of this course was top down, right? 00:05:20.960 |
So that you got the context you needed to understand, you got the motivation you needed 00:05:24.520 |
to keep going, and you got the results you needed to make it useful. 00:05:30.920 |
And we started doing some bottom up at the end of part one, right? 00:05:34.980 |
But really bottom up lets you, when you've built everything from the bottom yourself, 00:05:41.560 |
then you can see the connections between all the different things. 00:05:44.380 |
You can see they're all variations of the same thing, you know? 00:05:47.500 |
And then you can customize, rather than picking algorithm A or algorithm B, you create your 00:05:52.600 |
own algorithm to solve your own problem doing just the things you need it to do. 00:05:58.040 |
And then you can make sure that it performs well, that you can debug it, profile it, maintain 00:06:05.760 |
it, because you understand all of the pieces. 00:06:08.720 |
So normally when people say bottom up in this world, in this field, they mean bottom up with math -- but here it means bottom up with code. 00:06:20.780 |
So today, step one will be to implement matrix multiplication from scratch in Python. 00:06:30.240 |
Because bottom up with code means that you can experiment really deeply on every part 00:06:38.360 |
You can see exactly what's going in, exactly what's coming out, and you can figure out 00:06:41.820 |
why your model's not training well, or why it's slow, or why it's giving the wrong answer, 00:06:52.280 |
And to be clear, we are only talking about Swift in the last two lessons, right? 00:06:55.800 |
You know, our focus, as I'll describe, is still very much Python and PyTorch, right? 00:07:03.580 |
But there's something very exciting going on. 00:07:07.320 |
The first exciting thing is this guy's face you see here: Chris Lattner. 00:07:11.400 |
Chris is unique, as far as I know, as being somebody who has built, I think, what is the 00:07:17.320 |
world's most widely used compiler framework, LLVM. 00:07:21.920 |
He's built the default C and C++ compiler for the Mac, namely Clang. 00:07:29.600 |
And he's built what's probably the world's fastest growing fairly new computer language, Swift. 00:07:37.840 |
And he's now dedicating his life to deep learning, right? 00:07:42.160 |
So we haven't had somebody from that world come into our world before. 00:07:47.040 |
And so when you actually look at stuff like the internals of something like TensorFlow, 00:07:53.060 |
it looks like something that was built by a bunch of deep learning people, not by a bunch of compiler people. 00:07:59.840 |
And so I've been wanting for over 20 years for there to be a good numerical programming 00:08:07.040 |
language that was built by somebody that really gets programming languages. 00:08:14.480 |
So we've had, like, in the early days it was Lisp-Stat in Lisp, and then it was R, and then it was Python. 00:08:22.440 |
None of these languages were built to be good at data analysis. 00:08:30.400 |
They weren't built by people that really deeply understood compilers. 00:08:34.720 |
They certainly weren't built for today's kind of modern, highly parallel processor situation 00:08:44.780 |
And so we've got this unique situation where for the first time, you know, a really widely 00:08:49.960 |
used language, a really well-designed language from the ground up, is actually being targeted 00:08:58.980 |
towards numeric programming and deep learning. 00:09:01.360 |
So there's no way I'm missing out on that boat. 00:09:05.600 |
And I don't want you to miss out on it either. 00:09:09.160 |
I should mention there's another language which you could possibly put in there, which 00:09:12.440 |
is a language called Julia, which has maybe as much potential. 00:09:17.720 |
But it's, you know, it's about ten times less used than Swift. 00:09:26.280 |
So I'd say, like, maybe there's two languages which you might want to seriously consider here: Swift and Julia. 00:09:40.400 |
But that's one of the things I'm excited about for it. 00:09:42.780 |
So I actually spent some time over the Christmas break kind of digging into numeric programming 00:09:50.920 |
And I was delighted to find that I could create code from scratch that was competitive with 00:10:00.520 |
the fastest hand-tuned vendor linear algebra libraries, even though I was, and remain, pretty new to Swift. 00:10:11.640 |
I found it was a language that, you know, was really delightful. 00:10:18.760 |
And I could write everything in Swift, you know, rather than having to kind of get to 00:10:24.160 |
some layer where it's like, oh, that's cuDNN now, or that's MKL now, or whatever. 00:10:34.080 |
And so the really exciting news, as I'm sure you've heard, is that Chris Lattner himself 00:10:39.720 |
is going to come and join us for the last two lessons, and we're going to teach Swift for deep learning together. 00:10:48.520 |
So Swift for deep learning means Swift for TensorFlow. 00:10:53.360 |
That's specifically the library that Chris and his team at Google are working on. 00:11:01.480 |
We will call that S4TF when I write it down, because I couldn't be bothered typing the whole thing out every time. 00:11:15.400 |
And interestingly, they're the opposite of each other. 00:11:19.800 |
PyTorch and Python's pros are you can get stuff done right now with this amazing ecosystem, 00:11:30.800 |
You know, it's just a really great practical system for solving problems. 00:11:38.440 |
And to be clear, Swift for TensorFlow is not. 00:11:42.040 |
It's not any of those things right now, right? 00:11:49.280 |
You have to learn a whole new language if you don't know Swift already. 00:11:53.880 |
Now, I'm not sure about Swift in particular, but the kind of Swift for TensorFlow and Swift 00:11:58.640 |
for deep learning and even Swift for numeric programming. 00:12:01.360 |
I was kind of surprised when I got into it to find there was hardly any documentation 00:12:06.480 |
about Swift for numeric programming, even though I was pretty delighted by the experience. 00:12:11.960 |
People have had this view that Swift is kind of for iPhone programming. 00:12:18.400 |
I guess that's kind of how it was marketed, right? 00:12:20.840 |
But actually it's an incredibly well-designed, incredibly powerful language. 00:12:27.680 |
And then TensorFlow, I mean, to be honest, I'm not a huge fan of TensorFlow in general. 00:12:33.600 |
I mean, if I was, we wouldn't have switched away from it. 00:12:42.520 |
And the bits of it I particularly don't like are largely the bits that Swift for TensorFlow 00:12:48.120 |
But I think long-term, the kind of things I see happening, like there's this fantastic 00:12:54.200 |
new kind of compiler project called MLIR, which Chris is also co-leading, which I think actually 00:13:01.400 |
has the potential long-term to allow Swift to replace most of the yucky bits or maybe 00:13:07.520 |
even all of the yucky bits of TensorFlow with stuff where Swift is actually talking directly to LLVM. 00:13:13.800 |
You'll be hearing a lot more about LLVM in the coming weeks -- in the last two lessons in particular. 00:13:18.560 |
Basically, it's the compiler infrastructure that kind of everybody uses, that Julia uses, 00:13:28.920 |
And Swift is this kind of almost this thin layer on top of it, where when you write stuff 00:13:34.400 |
in Swift, it's really easy for LLVM to compile it down to super-fast optimized code. 00:13:46.120 |
With Python, as you'll see today, we almost never actually write Python code. 00:13:51.880 |
We write code in Python that gets turned into some other language or library, and that's what actually gets run. 00:13:58.560 |
And this mismatch, this impedance mismatch between what I'm trying to write and what 00:14:03.680 |
actually gets run makes it very hard to do the kind of deep dives that we're going to do. 00:14:14.760 |
So I'm excited about getting involved in these very early days for impractical deep learning 00:14:20.440 |
in Swift for TensorFlow, because it means that me and those of you that want to follow along 00:14:27.080 |
can be the pioneers in something that I think is going to take over this field. 00:14:35.320 |
We'll be the ones that understand it really well. 00:14:37.960 |
And in your portfolio, you can actually point at things and say, "That library that everybody uses? I wrote part of that." 00:18:44.440 |
This piece of documentation that's like on the Swift for TensorFlow website, I wrote that. 00:14:51.700 |
So let's put that aside for the next five weeks. 00:14:57.600 |
And let's try to create a really high bar for the Swift for TensorFlow team to have to live up to. 00:15:06.760 |
We're going to try to implement as much of fast AI and many parts of PyTorch as we can 00:15:11.880 |
and then see if the Swift for TensorFlow team can help us build that in Swift in five weeks' 00:15:19.320 |
So the goal is to recreate fast AI from the foundations and much of PyTorch like matrix 00:15:26.200 |
multiplication, a lot of torch.nn, torch.optim, Dataset, DataLoader, from the foundations. 00:15:35.880 |
The game we're going to play is we're only allowed to use these bits. 00:15:39.000 |
We're allowed to use pure Python, anything in the Python standard library, any non-data 00:15:45.920 |
science modules, so like a requests library for HTTP or whatever, we can use PyTorch but 00:15:53.000 |
only for creating arrays, random number generation, and indexing into arrays. 00:16:00.440 |
We can use the fastai.datasets library because that's the thing that has access to like MNIST 00:16:04.440 |
and stuff, so we don't have to worry about writing our own HTTP stuff. 00:16:09.480 |
We don't have to write our own plotting library. 00:16:13.800 |
So we're going to try and recreate all of this from that. 00:16:18.440 |
And then the rules are that each time we have replicated some piece of fastai or PyTorch 00:16:24.560 |
from the foundations, we can then use the real version if we want to, okay? 00:16:34.240 |
What I've discovered as I started doing that is that I started actually making things a 00:16:39.840 |
So I'm now realizing that fastai version 1 is kind of a disappointment because there 00:16:43.720 |
was a whole lot of things I could have done better. 00:16:47.600 |
As you go along this journey, you'll find decisions that I made, or the PyTorch team made, 00:16:51.480 |
or whatever where you think, what if they'd made a different decision there? 00:16:55.880 |
And you can maybe come up with more examples of things that we could do differently, right? 00:17:03.960 |
So why do this? Well, the main reason is so that you can really experiment, right? 00:17:08.320 |
So you can really understand what's going on in your models, what's really going on in your experiments. 00:17:13.000 |
And you'll actually find that in the experiments that we're going to do in the next couple 00:17:17.400 |
of classes, we're going to actually come up with some new insights. 00:17:22.600 |
If you can create something from scratch yourself, you know that you understand it. 00:17:28.160 |
And then once you've created something from scratch and you really understand it, then you can tweak it all. 00:17:32.840 |
But you suddenly realize that there's not this object detection system over here and this classification system over there. 00:17:41.640 |
They're all like a kind of semi-arbitrary bunch of particular knobs and choices. 00:17:47.040 |
And that it's pretty likely that your particular problem would want a different set of knobs 00:17:57.120 |
For those of you looking to contribute to open source -- to fast AI or to PyTorch -- you'll be very well placed to do so. 00:18:04.640 |
Because you'll understand how it's all built up. 00:18:06.320 |
You'll understand what bits are working well, which bits need help. 00:18:09.560 |
You know how to contribute tests or documentation or new features or create your own libraries. 00:18:17.200 |
And for those of you interested in going deeper into research, you'll be implementing papers, 00:18:23.280 |
which means you'll be able to correlate the code that you're writing with the paper that you're reading. 00:18:28.160 |
And if you're a poor mathematician like I am, then you'll find that you'll be getting 00:18:33.720 |
a much better understanding of papers that you might otherwise have thought were beyond 00:18:39.360 |
And you realize that all those Greek symbols actually just map to pieces of code that you're already very familiar with. 00:18:48.080 |
So there are a lot of opportunities in part one to blog and to do interesting things, 00:18:57.160 |
In part two, you can be doing homework that's actually at the cutting edge, actually doing 00:19:02.080 |
experiments people haven't done before, making observations people haven't made before. 00:19:06.800 |
Because you're getting to the point where you're a more competent deep learning practitioner 00:19:12.560 |
than the vast majority that are out there, and we're looking at stuff that other people 00:19:17.960 |
So please try doing lots of experiments, particularly in your domain area, and consider writing about what you find. 00:19:32.080 |
So write stuff down for the you of six months ago. 00:19:40.720 |
Okay, so I am going to be assuming that you remember the contents of part one, which was all of this. 00:19:55.200 |
In practice, it's very unlikely you remember all of these things because nobody's perfect. 00:20:00.480 |
So what I'm actually expecting you to do is as I'm going on about something which you're 00:20:05.040 |
thinking, I don't know what he's talking about, that you'll go back and watch the video about that topic. 00:20:12.020 |
Don't just keep blasting forwards, because I'm assuming that you already know the content of part one. 00:20:19.280 |
Particularly if you're less confident about the second half of part one, where we went 00:20:24.040 |
a little bit deeper into what's an activation, and what's a parameter really, and exactly how SGD works. 00:20:30.360 |
Particularly in today's lesson, I'm going to assume that you really get that stuff. 00:20:36.400 |
So if you don't, then go back and re-look at those videos. 00:20:42.020 |
Go back to that SGD from scratch and take your time. 00:20:47.600 |
I've kind of designed this course to keep most people busy up until the next course. 00:20:55.720 |
So feel free to take your time and dig deeply. 00:21:02.040 |
So the most important thing, though, is we're going to try and make sure that you can train 00:21:07.780 |
And there are three steps to training a really good model. 00:21:11.560 |
Step one is to create something with way more capacity than you need, and basically no regularization, and overfit. 00:21:23.320 |
So what does overfit mean? It means that your training loss is lower than your validation loss? No. 00:21:33.840 |
A well-fit model will almost always have training loss lower than the validation loss. 00:21:40.000 |
Remember that overfitting means you have actually, personally, seen your validation error getting worse. 00:21:47.200 |
Until you see that happening, you're not overfitting. 00:21:54.520 |
And then step two is to reduce that overfitting. And then step three -- okay, there is no step three. 00:21:57.640 |
Well, I guess step three is to visualize the inputs and outputs and stuff like that, right? 00:22:01.960 |
That is to experiment and see what's going on. 00:22:12.560 |
It's not really that hard -- these are basically the five things that you can do, in order. If you can get more data, you should. 00:22:21.560 |
If you can do more data augmentation, you should. 00:22:24.000 |
If you can use a more generalizable architecture, you should. 00:22:27.040 |
And then if all those things are done, then you can start adding regularization like drop-out, 00:22:31.920 |
or weight decay -- but remember, at that point, you're reducing the effective capacity of your model. 00:22:42.080 |
So it's less good than the first three things. 00:22:44.240 |
And then last of all, reduce the architecture complexity. 00:22:48.920 |
And most people, most beginners especially, start with reducing the complexity of the 00:22:54.680 |
architecture, but that should be the last thing that you try. 00:22:58.440 |
Unless your architecture is so complex that it's too slow for your problem, okay? 00:23:04.400 |
So that's kind of a summary of what we want to be able to do, that we learned about in part one. 00:23:14.560 |
So we're going to be reading papers, which we didn't really do in part one. 00:23:18.680 |
And papers look something like this, which if you're anything like me, that's terrifying. 00:23:25.320 |
But I'm not going to lie, it's still the case that when I start looking at a new paper, 00:23:30.520 |
every single time I think I'm not smart enough to understand this, I just can't get past 00:23:37.880 |
that immediate reaction, because I just look at this stuff and I just go, that's not something I'm going to be able to understand. 00:23:43.800 |
But then I remember, this is the Adam paper, and you've all seen Adam implemented in a single cell of Excel. 00:23:53.320 |
When it actually comes down to it, every time I do get to the point where I understand it and 00:23:57.520 |
I've implemented a paper, I go, oh my God, that's all it is, right? 00:24:03.100 |
So a big part of reading papers, especially if you're less mathematically inclined than 00:24:07.720 |
I am, is just getting past the fear of the Greek letters. 00:24:16.920 |
There are lots of them, right? And it's very hard to read something that you can't actually pronounce. 00:24:25.800 |
Because you're just saying to yourself, oh, squiggle bracket one plus squiggle one, G 00:24:31.920 |
And it's like all the squiggles, you just get lost, right? 00:24:34.520 |
So believe it or not, it actually really helps to go and learn the Greek alphabet so you 00:24:40.160 |
can pronounce alpha times one plus beta one, right? 00:24:45.320 |
Whenever you can start talking to other people about it, you can actually read it out loud. 00:24:54.400 |
Note that the people that write these papers are generally not selected for their outstanding ability to communicate clearly. 00:25:02.860 |
So you will often find that there'll be a blog post or a tutorial that does a better 00:25:08.880 |
job of explaining the concept than the paper does. 00:25:11.960 |
So don't be afraid to go and look for those as well, but do go back to the paper, right? 00:25:16.560 |
Because in the end, the paper's the one that's hopefully got it mainly right. 00:25:26.080 |
One of the tricky things about reading papers is the equations have symbols and you don't 00:25:30.200 |
know what they mean and you can't Google for them. 00:25:33.760 |
So a couple of good resources, if you see symbols you don't recognize, Wikipedia has 00:25:39.480 |
an excellent list of mathematical symbols page that you can scroll through. 00:25:44.800 |
And even better, Detexify is a website where you can draw a symbol you don't recognize 00:25:51.040 |
and it uses the power of machine learning to find similar symbols. 00:25:57.200 |
There are lots of symbols that look a bit the same, so you will have to use some level of judgment. 00:26:01.680 |
But the thing that it shows here is the LaTeX name, and you can then Google for that LaTeX name to find out what it means. 00:26:20.640 |
Here's what we're going to do over the next couple of lessons. 00:26:24.040 |
We're going to try to create a pretty competent modern CNN model. 00:26:31.960 |
And we actually already have this bit because we did that in the last course, right? 00:26:39.840 |
We already have our layers for creating a ResNet. 00:26:44.720 |
So we just have to do all these things, okay, to get us from here to here. 00:26:53.080 |
After that we're going to go a lot further, right? 00:26:56.920 |
So today we're going to try to get to at least the point where we've got the backward pass done. 00:27:02.120 |
So remember, we're going to build a model that takes an input array and we're going 00:27:07.280 |
to try and create a simple, fully connected network, right? 00:27:12.360 |
So we're going to start with some input, do a matrix multiply, do a ReLU, do a matrix multiply, and then a loss function. 00:27:21.200 |
And so that's a forward pass and that'll tell us our loss. 00:27:24.960 |
And then we will calculate the gradients of the loss with respect to the 00:27:32.520 |
weights and biases, in order to basically multiply them by some learning rate, which we will 00:27:38.520 |
then subtract off the parameters to get our new set of parameters. 00:27:46.280 |
So to get to our fully connected backward pass, we will need to first of all have the 00:27:51.680 |
fully connected forward pass and the fully connected forward pass means we will need 00:27:55.480 |
to have some initialized parameters, and we'll need a ReLU, and we will also need to be able to do matrix multiplication. 00:28:19.600 |
And what I'm showing you here is how I'm going to go about building up our library in Jupyter 00:28:28.580 |
A lot of very smart people have assured me that it is impossible to do effective library 00:28:36.080 |
development in Jupyter notebooks, which is a shame, because I've built a library in Jupyter notebooks. 00:28:43.440 |
So anyway, people will often tell you things are impossible, but I will tell you my point 00:28:47.920 |
of view, which is that I've been programming for over 30 years and in the time I've been 00:28:55.560 |
using Jupyter notebooks to do my development, I would guess I'm about two to three times more productive. 00:29:02.160 |
I've built a lot more useful stuff in the last two or three years than I did beforehand. 00:29:08.640 |
I'm not saying you have to do things this way either, but this is how I develop and 00:29:14.280 |
hopefully you find some of this useful as well. 00:29:22.400 |
We can't just create one giant notebook with our whole library. 00:29:26.360 |
Somehow we have to be able to pull out those little gems -- those bits of code where we think, oh, this is worth keeping. 00:29:33.520 |
We have to be able to pull that out into a package that we reuse. 00:29:37.520 |
So in order to tell our system that here is a cell that I want you to keep and reuse, 00:29:44.480 |
I use this special comment, hash export at the top of the cell. 00:29:50.200 |
And then I have a program called notebook to script, which goes through the notebook 00:29:56.800 |
and finds those cells and puts them into a Python module. 00:30:03.640 |
So if I run this cell -- okay, so if I run this cell and then I head over, and notice I don't 00:30:12.640 |
have to type all of 'exports' because I have tab completion, even for file names, in Jupyter notebooks. 00:30:20.240 |
So tab is enough, and I could either run this here, or I could go back to my console and run it there. 00:30:30.560 |
Okay, so that says: converted exports.ipynb to nb_00.py. 00:30:36.640 |
And what I've done is I've made it so that these things go into a directory called exp 00:30:47.360 |
So you can see other than a standard header, it's got the contents of that one cell. 00:30:52.240 |
So now I can import that at the top of my next notebook: from exp.nb_00 import *. 00:31:01.560 |
And I can create a test that that variable equals that value. 00:31:12.900 |
And notice there's a lot of test frameworks around, but it's not always helpful to use them. 00:31:19.480 |
Like here we've created a test framework or the start of one. 00:31:23.880 |
I've created a function called test, which checks whether A and B return true or false 00:31:29.720 |
based on this comparison function by using assert. 00:31:34.460 |
And then I've created something called test_eq, which calls test, passing in A and 00:31:42.360 |
B and the equals comparison operator, so if they're not equal we get an assertion error. 00:31:51.960 |
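For reference, here's roughly what that little test framework looks like in code. This is a sketch based on the description above; the exact code in the course's exported module may differ slightly.

```python
import operator

def test(a, b, cmp, cname=None):
    # generic test: assert that cmp(a, b) holds, with a readable error message
    if cname is None: cname = cmp.__name__
    assert cmp(a, b), f"{cname}:\n{a}\n{b}"

def test_eq(a, b):
    # the common case: assert that a equals b
    test(a, b, operator.eq, '==')

TEST = 'test'
test_eq(TEST, 'test')   # passes silently; a mismatch raises AssertionError
```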
So we've been able to write a test, which so far has basically tested that our little export system works. 00:32:00.560 |
We probably want to be able to run these tests somewhere other than just inside a notebook. 00:32:04.440 |
So we have a little program called run_notebook.py, and you pass it the name of a notebook. 00:32:16.920 |
So I should save this one with our failing test so you can see it fail. 00:32:24.600 |
So first time it passed and then I make the failing test and you can see here it is assertion 00:32:27.960 |
error and tells you exactly where it happened. 00:32:31.840 |
So we now have an automatable unit testing framework in our Jupyter Notebook. 00:32:41.240 |
I'll point out that the contents of these two Python scripts, let's look at them. 00:32:51.840 |
So the first one was run_notebook.py, which is our test runner. 00:32:58.280 |
So there's a thing called nbformat -- so if you conda install nbformat, then it basically 00:33:04.060 |
lets you execute a notebook and it prints out any errors. 00:33:10.500 |
You'll notice that I'm using a library called fire. 00:33:14.920 |
Fire is a really neat library that lets you take any function like this one and automatically convert it into a command line interface. 00:33:23.640 |
So here I've got a function called run notebook and then it says fire, run notebook. 00:33:28.960 |
So if I now go python run_notebook.py, then it says, oh, this function received no value for its argument. 00:33:39.920 |
So you can see that what it did was it converted my function into a command line interface, 00:33:47.200 |
And it handles things like optional arguments and classes and it's super useful, particularly 00:33:52.920 |
for this kind of Jupyter-first development, because you can grab stuff that's in Jupyter 00:33:57.560 |
and turn it into a script often by just copying and pasting the function or exporting it. 00:34:07.120 |
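Here's a minimal sketch of that pattern -- a function wrapped with fire.Fire so it becomes a command line script. The use of nbformat and nbconvert's ExecutePreprocessor here is an assumption about how such a runner could work, not necessarily the exact contents of the course's run_notebook.py.

```python
#!/usr/bin/env python
# a hypothetical run_notebook.py: execute a notebook and surface any errors
import nbformat
import fire
from nbconvert.preprocessors import ExecutePreprocessor

def run_notebook(path):
    "Executes the notebook at `path`; any failing cell raises an exception."
    nb = nbformat.read(path, as_version=4)
    ExecutePreprocessor(timeout=600).preprocess(nb, {})
    print('done')

if __name__ == '__main__':
    fire.Fire(run_notebook)   # turns run_notebook into a CLI: python run_notebook.py <path>
```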
The other one notebook to script is not much more complicated. 00:34:17.880 |
It's one screen of code, which again, the main thing here is to call fire, which calls 00:34:23.400 |
this one function, and you'll see basically it uses json.load, because notebooks are JSON. 00:34:29.880 |
The reason I mention this to you is that Jupyter notebooks come with this whole kind 00:34:35.920 |
of ecosystem of libraries and APIs and stuff like that, but you don't necessarily need any of it. 00:34:44.880 |
I find that just doing json.load is the easiest way. 00:34:49.240 |
And specifically, I build my Jupyter notebook infrastructure inside Jupyter notebooks. 00:34:55.000 |
So here's how it looks, right: import json, json.load this file, and it gives you an array of cells, 00:35:03.640 |
and there's the contents of 'source' for my first cell, right? 00:35:08.880 |
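As a sketch of that idea (the filename here is just an example): a notebook is plain JSON, so pulling out the exported cells needs nothing beyond the standard library.

```python
import json

nb = json.load(open('00_exports.ipynb'))   # a notebook is a dict with a 'cells' list
cells = nb['cells']
print(cells[0]['source'])                  # the source lines of the first cell

# cells marked for export start with the special '#export' comment
exported = [c for c in cells
            if c['cell_type'] == 'code'
            and c['source'] and c['source'][0].startswith('#export')]
```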
So if you do want to play around with doing stuff in Jupyter notebooks, it's a really great 00:35:13.560 |
environment for kind of automating stuff and running scripts on it and stuff like that. 00:35:23.420 |
So that's the entire contents of our development infrastructure. 00:35:30.720 |
One of the great things about having unit tests in notebooks is that when one does fail, 00:35:36.960 |
you open up a notebook, which can have prose saying: this is what this test does. 00:35:44.060 |
You can see all the stuff above it that's setting up all the context for it. 00:35:48.800 |
It's a really great way to fix those failing tests, because you've got the whole truly literate programming environment around them. 00:36:00.840 |
So before we start doing matrix multiply, we need some matrices to multiply. 00:36:06.500 |
So these are some of the things that are allowed by our rules. 00:36:09.480 |
We've got some stuff that's part of the standard library. 00:36:12.560 |
This is the fastai datasets library to let us grab the datasets we need; some more standard library stuff; 00:36:18.220 |
and PyTorch, which we're only allowed to use for indexing and array creation; and matplotlib. 00:36:27.600 |
So to grab MNIST, we can use fastai datasets to download it. 00:36:34.160 |
And then we can use a standard library gzip to open it. 00:36:40.560 |
So in Python, the kind of standard serialization format is called pickle. 00:36:44.660 |
And so this MNIST version on deeplearning.net is stored in that format. 00:36:49.620 |
And so it basically gives us a tuple of tuples of datasets, like so: x_train, y_train, x_valid, y_valid. 00:36:58.800 |
It actually contains NumPy arrays, but NumPy arrays are not allowed in our foundations. 00:37:08.700 |
So we can just use the Python map to map the tensor function over each of these four arrays, converting them to PyTorch tensors. 00:37:19.200 |
A lot of you will be more familiar with NumPy arrays than PyTorch tensors. 00:37:26.040 |
But you know, everything you can do in NumPy arrays, you can also do in PyTorch tensors, 00:37:31.300 |
you can also do it on the GPU and have all this nice deeplearning infrastructure. 00:37:36.800 |
So it's a good idea to get used to using PyTorch tensors, in my opinion. 00:37:42.280 |
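Here's roughly what that loading code looks like; the download helper and URL follow the lesson's description, but treat the exact names as assumptions.

```python
import gzip, pickle
from fastai import datasets   # only used to download/cache the file, per our rules
from torch import tensor

MNIST_URL = 'http://deeplearning.net/data/mnist/mnist.pkl'
path = datasets.download_data(MNIST_URL, ext='.gz')

# the file is a gzipped pickle containing (train, valid, test) tuples of NumPy arrays
with gzip.open(path, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')

# convert the NumPy arrays to PyTorch tensors
x_train, y_train, x_valid, y_valid = map(tensor, (x_train, y_train, x_valid, y_valid))
```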
So we can now grab the number of rows and number of columns in the training set. 00:37:51.520 |
So here's MNIST, hopefully pretty familiar to you already. 00:37:56.320 |
It's 50,000 rows by 784 columns, and the y data looks something like this. 00:38:04.120 |
The y shape is just 50,000 rows, and the minimum and maximum of the dependent variable are zero and nine. 00:38:16.260 |
So the n should be equal to the shape of the y, should be equal to 50,000. 00:38:24.920 |
The number of columns should be equal to 28 by 28, because that's how many pixels there are. 00:38:31.320 |
And we're just using that test equals function that we created just above. 00:38:44.160 |
So we can grab an image and pass that to imshow, after reshaping it to a 28 by 28 matrix. 00:38:51.160 |
I think we saw it a few times in part one, but get very familiar with it. 00:38:54.420 |
This is how we reshape our 784-long vector into a 28 by 28 matrix that's suitable for plotting. 00:39:06.900 |
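In code, that reshape-and-plot step is just this (continuing from the tensors loaded above):

```python
import matplotlib.pyplot as plt

img = x_train[0]              # one row of the training set: a flattened 784-pixel image
plt.imshow(img.view(28, 28))  # reshape to 28x28 so imshow can display it
plt.show()
```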
And let's start by creating a simple linear model. 00:39:13.920 |
So for a linear model, we're going to need to basically have something where y equals 00:39:19.540 |
a x plus b. And so our a will be a bunch of weights. 00:39:26.480 |
So it's going to be a 784 by 10 matrix, because we've got 784 coming in and 10 going out. 00:39:35.680 |
So that's going to allow us to take in our independent variable and map it to something of size 10. 00:39:44.320 |
And then for our bias, we'll just start with 10 zeros. 00:39:49.720 |
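As a sketch, that initialization is just two tensor creations (allowed by our rules; the course notebook later scales the weights for better initialization):

```python
import torch

weights = torch.randn(784, 10)  # a: maps 784 input pixels to 10 outputs
bias = torch.zeros(10)          # b: one bias per output
```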
So if we're going to do y equals a x plus b, then we're going to need a matrix multiplication. 00:39:56.560 |
So almost everything we do in deep learning is basically matrix multiplication or a variant of it. 00:40:08.180 |
So you want to be very comfortable with matrix multiplication. 00:40:12.480 |
So this cool website, matrixmultiplication.xyz, shows us exactly what happens when we multiply two matrices. 00:40:23.320 |
So we take the first row of the first matrix and the first column of the second, and we multiply each of 00:40:29.240 |
them element-wise, and then we add them up, and that gives us that one. 00:40:36.680 |
And now you can see we've got two sets going on at the same time, so that gives us two 00:40:39.520 |
more, and then two more, and then the final one. 00:40:55.240 |
We've got the loop of this thing scrolling down here. 00:41:01.200 |
We've got the loop of these things going across -- they're really columns, so we flip them around. 00:41:03.960 |
And then we've got the loop of the multiply and add. 00:41:13.360 |
And notice this is not going to work unless the number of columns here and the number of rows here match up. 00:41:26.960 |
So let's grab the number of rows and columns of A, and the number of rows and columns of 00:41:33.440 |
B, and make sure that AC equals BR, just to double check. 00:41:40.160 |
And then let's create something of size AR by BC, because the size of this is going to 00:41:45.000 |
be AR by BC with zeros in, and then have our three loops. 00:42:02.800 |
OK, so right in the middle, the result in c[i, j] is going to be a[i, k] times b[k, j], summed over k. 00:42:16.040 |
And this is the vast majority of what we're going to be doing in deep learning. 00:42:20.800 |
So get very, very comfortable with that equation, because we're going to be seeing it in three 00:42:27.040 |
or four different variants of notation and style in the next few weeks, in the next few 00:42:35.440 |
And it's got kind of a few interesting things going on. 00:42:49.840 |
And look, it's got to be the same number in each place, because this is the bit where 00:42:53.160 |
we're multiplying together the element-wise things. 00:42:57.960 |
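Putting those three loops together, here's a minimal sketch of the pure-Python version being described (variable names are illustrative):

```python
import torch

def matmul(a, b):
    ar, ac = a.shape          # rows and columns of a
    br, bc = b.shape          # rows and columns of b
    assert ac == br           # inner dimensions must match
    c = torch.zeros(ar, bc)   # the result is ar rows by bc columns
    for i in range(ar):
        for j in range(bc):
            for k in range(ac):
                c[i, j] += a[i, k] * b[k, j]
    return c
```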
So let's create a nice small version, grab the first five rows of the validation set. 00:43:10.560 |
And so here are their sizes: 5 by 784, because we just grabbed the first five rows, and 784 by 10. 00:43:23.200 |
And so now we can go ahead and do that matrix multiplication. 00:43:28.560 |
And it's given us -- as you would expect -- 00:43:40.800 |
a five row by 10 column output. 00:43:54.800 |
So it's going to take about 50,000 seconds to do a single matrix multiplication in Python. 00:44:02.920 |
So imagine doing MNIST where every layer for every pass took about 10 hours. 00:44:14.300 |
So that's why we don't really write things in Python. 00:44:17.800 |
Like, when we say Python is too slow, we don't mean 20% too slow -- we mean many thousands of times too slow. 00:44:27.940 |
So let's see if we can speed this up by 50,000 times. 00:44:33.080 |
Because if we could do that, it might just be fast enough. 00:44:36.260 |
So the way we speed things up is we start in the innermost loop. 00:44:43.840 |
So the way to make Python faster is to remove Python. 00:44:49.840 |
And the way we remove Python is by passing our computation down to something that's written 00:44:54.760 |
in something other than Python, like PyTorch. 00:44:58.520 |
Because PyTorch behind the scenes is using a library called ATen. 00:45:04.720 |
And so we want to get this computation pushed down to the ATen library. 00:45:07.480 |
So the way we do that is to take advantage of something called element-wise operations. 00:45:14.800 |
For example, if I have two tensors, A and B, both of length three, I can add them together. 00:45:23.080 |
And when I add them together, it simply adds together the corresponding items. 00:45:31.360 |
Or I could do less than, in which case it's going to do element-wise less than. 00:45:37.340 |
So what percentage of A is less than the corresponding item of B? (a < b).float().mean(). 00:45:48.120 |
We can do element-wise operations on things not just of rank one, but on tensors of rank two and higher. 00:45:58.660 |
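For example (the values here are just illustrative):

```python
import torch

a = torch.tensor([10., 6, -4])
b = torch.tensor([2., 8, 7])

a + b                    # tensor([12., 14.,  3.]): element-wise addition
a < b                    # tensor([False,  True,  True]): element-wise comparison
(a < b).float().mean()   # the fraction of elements of a that are less than b
```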
So here's our rank two tensor, M. Let's calculate the Frobenius norm. 00:46:07.160 |
How many people know about the Frobenius norm? 00:46:12.240 |
And it looks kind of terrifying, right, but actually it's just this. 00:46:18.520 |
It's just the matrix times itself, dot sum, dot square root -- (m*m).sum().sqrt(). 00:46:25.080 |
So here's the first time we're going to start trying to translate some equations into code 00:46:34.720 |
So this says, when you see something like A with two sets of double lines around it, 00:46:41.640 |
and an F underneath, that means we are calculating the Frobenius norm. 00:46:46.720 |
So any time you see this, and you will, it actually pops up semi-regularly in deep learning 00:46:50.600 |
literature, when you see this, what it actually means is this function. 00:46:57.500 |
As you probably know, capital sigma means sum, and this says we're going to sum over two for loops. 00:47:05.240 |
The first for loop will be called i, and we'll go from 1 to n. 00:47:10.760 |
And the second for loop will also be called j, and will also go from 1 to n. 00:47:16.360 |
And in these nested for loops, we're going to grab something out of the matrix A -- the element a_ij. 00:47:25.120 |
We're going to square it, and then we're going to add all of those together, and then we'll take the square root. 00:47:42.860 |
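Written out (this is the standard form, as on the Wikipedia page), the formula just described is:

```latex
\|A\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}^{2}}
```

which, in code, is just `(m*m).sum().sqrt()` as above.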
Now, I can't actually write LaTeX from scratch -- and yet I did create this Jupyter notebook, so it looks a lot like I created some LaTeX, 00:47:47.920 |
which is certainly the impression I like to give people sometimes. 00:47:51.120 |
But the way I actually write LaTeX is I find somebody else who wrote it, and then I copy 00:47:56.480 |
So the way you do this most of the time is you Google for Frobenius Norm, you find the 00:48:01.680 |
wiki page for Frobenius Norm, you click edit next to the equation, and you copy and paste the LaTeX. 00:48:11.680 |
And put dollar signs, or even two dollar signs, around it. 00:48:19.760 |
Method two is, if it's in a paper on arXiv -- did you know on arXiv you can click on download 00:48:26.440 |
other formats in the top right, and then download source, and that will actually give you the 00:48:32.080 |
original tech source, and then you can copy and paste their LaTeX. 00:48:38.040 |
So I'll be showing you a bunch of equations during these lessons, and I can promise you they were all copied and pasted. 00:48:48.920 |
All right, so you now know how to implement the Frobenius norm from scratch in PyTorch. 00:48:58.480 |
You could also have written it, of course, as m.pow(2), but that would be illegal under 00:49:08.160 |
our rules. We're not allowed to use pow yet, so that's why we did it that way. 00:49:14.680 |
So that's just doing the element-wise multiplication of a rank two tensor with itself. 00:49:20.880 |
One times one, two times two, three times three, etc. 00:49:27.120 |
So that is enough information to replace this loop, because this loop is just going through 00:49:36.120 |
the first row of A and the first column of B, and doing an element-wise multiplication 00:49:45.800 |
So our new version is going to have two loops, not three. 00:49:50.440 |
So this is all the same, but now we've replaced the inner loop, and you'll see that basically 00:49:59.520 |
it looks exactly the same as before, but where it used to say k, it now says colon. 00:50:04.680 |
So in pytorch and numpy, colon means the entirety of that axis. 00:50:11.860 |
So, Rachel, help me remember the order of rows and columns when we talk about matrices -- it's rows by columns. 00:50:34.560 |
So this is row number i, the whole row, and this is column number j, the whole column. 00:50:39.160 |
So multiply all of column j by all of row i, and that gives us back a rank one tensor, which we then sum up. 00:50:49.160 |
That's exactly the same as what we had before. 00:50:55.280 |
We've removed one line of code, and it's 178 times faster. 00:51:00.360 |
So we successfully got rid of that inner loop. 00:51:03.720 |
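Here's a sketch of what the two-loop version looks like, with the innermost loop replaced by an element-wise multiply and a sum:

```python
import torch

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            # row i of a times column j of b, element-wise, then summed --
            # this multiply-and-sum runs in PyTorch's C code, not in Python
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c
```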
And so now this is running in C. We didn't really write Python here. 00:51:10.360 |
We wrote kind of a Python-ish thing that said, please call this C code for us. 00:51:20.720 |
We can't really check that it's exactly equal, because floats can change slightly depending on how they were calculated. 00:51:28.640 |
So instead, let's create something called near, which calls torch.allclose with some tolerance. 00:51:35.480 |
And then we'll create a test_near function that calls our test function using our near comparison. 00:51:46.880 |
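A sketch of those two helpers, reusing the `test` function from the earlier testing snippet (the tolerances here are illustrative):

```python
import torch

def near(a, b):
    # floats can differ slightly depending on how they were computed,
    # so compare to within a tolerance rather than exactly
    return torch.allclose(a, b, rtol=1e-3, atol=1e-5)

def test_near(a, b):
    test(a, b, near)   # `test` is the assert-based helper defined earlier
```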
So we've now got our matrix multiplication at 65 microseconds. 00:51:52.760 |
Now we need to get rid of this loop, because now this is our innermost loop. 00:51:58.200 |
And to do that, we're going to have to use something called broadcasting. 00:52:08.200 |
So broadcasting is about the most powerful tool we have in our toolbox for writing code 00:52:16.560 |
in Python that runs at C speed -- or in fact, with PyTorch, if you put it on the GPU, at CUDA speed. 00:52:27.900 |
It allows us to get rid of nearly all of our loops, as you'll see. 00:52:33.960 |
The term "broadcasting" comes from NumPy, but the idea actually goes all the way back 00:52:43.400 |
And it's a really, really powerful technique. 00:52:46.480 |
A lot of people consider it a different way of programming, where we get rid of all of 00:52:50.360 |
our for loops and replace them with these implicit, broadcasted loops. 00:53:01.940 |
Remember our tensor A, which contains 10, 6, -4? 00:53:06.000 |
If you say A greater than 0, then on the left-hand side you've got a rank one tensor, and on the right-hand side a scalar. 00:53:16.640 |
And the reason why is that this value 0 is broadcast three times. 00:53:22.800 |
It becomes 0, 0, 0, and then it does an element-wise comparison. 00:53:28.000 |
So every time, for example, you've normalized a dataset by subtracting the mean and divided 00:53:33.560 |
by the standard deviation in a kind of one line like this, you've actually been broadcasting. 00:53:45.420 |
So A plus one also broadcasts a scalar to a tensor. 00:53:52.680 |
Here we can multiply our rank two tensor by two. 00:53:57.340 |
So there's the simplest kind of broadcasting. 00:53:59.520 |
And any time you do that, you're not operating at Python speed, you're operating at C or 00:54:15.800 |
So here's a rank one tensor C. And here's our previous rank two tensor M. So M's shape is 3 by 3, C's shape is just 3 -- and yet we can add them: 00:54:36.320 |
10, 20, 30 plus 1, 2, 3, 10, 20, 30 plus 4, 5, 6, 10, 20, 30 plus 7, 8, 9. 00:54:46.560 |
It's broadcast this row across each row of the matrix. 00:54:57.800 |
So this, there's no loop, but it sure looks as if there was a loop. 00:55:08.600 |
So we can write c.expand_as(m), and it shows us what C would look like when broadcast to M's shape. 00:55:23.000 |
So you can see M plus T is the same as C plus M. So basically it's creating or acting as 00:55:33.240 |
if it's creating this bigger rank two tensor. 00:55:39.740 |
So this is pretty cool because it now means that any time we need to do something between 00:55:44.040 |
a vector and a matrix, we can do it at C speed with no loop. 00:55:51.920 |
Now you might be worrying though that this looks pretty memory intensive if we're kind 00:55:56.060 |
of turning all of our rows into big matrices, but fear not. 00:56:00.260 |
Because you can look inside the actual memory used by PyTorch. 00:56:05.040 |
So here T is a 3 by 3 matrix, but t.storage() tells us that actually it's only storing one 00:56:15.760 |
row of three numbers: 10, 20, 30. t.shape tells us that T knows it's meant to be a 3 by 3 matrix. 00:56:21.280 |
And t.stride() tells us that it knows that when it's going from column to column, it 00:56:30.800 |
should take one step through the storage, but when it goes from row to row, it should take zero steps. 00:56:35.780 |
And so that's how come it repeats 10, 20, 30, 10, 20, 30, 10, 20, 30. 00:56:40.680 |
So this is a really powerful thing that appears in pretty much every linear algebra library 00:56:44.680 |
you'll come across: this idea that you can actually create tensors that behave like higher rank, bigger things without using any extra memory. 00:56:56.680 |
It basically means that this broadcasting functionality gives us C-like speed with no extra memory overhead. 00:57:04.680 |
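To see all of that concretely (values as in the example above):

```python
import torch

c = torch.tensor([10., 20, 30])
m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

c + m               # c is broadcast across each row of m
t = c.expand_as(m)  # what c "looks like" when broadcast to m's shape

t.storage()         # only the original three numbers: 10.0, 20.0, 30.0
t.shape             # torch.Size([3, 3])
t.stride()          # (0, 1): zero steps between rows, one step between columns
```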
Okay, what if we wanted to take a column instead of a row? 00:57:12.820 |
So in other words, a rank 2 tensor of shape 3, 1. 00:57:21.520 |
We can create a rank 2 tensor of shape (3, 1) from a rank 1 tensor by using the unsqueeze method. 00:57:33.040 |
Unsqueeze adds an additional dimension of size 1 to wherever we ask for it. 00:57:39.880 |
So unsqueeze 0, let's check this out, unsqueeze 0 is of shape 1, 3, it puts the new dimension 00:57:52.000 |
Unsqueeze 1 is shape 3, 1, it creates the new axis in position 1. 00:57:58.760 |
So unsqueeze(0) looks a lot like C, but now rather than being a rank 1 tensor, it's a rank 2 tensor. 00:58:09.840 |
See how it's got two square brackets around it? 00:58:16.880 |
And more interestingly, c.unsqueeze(1) now looks like a column, right? 00:58:23.340 |
It's also a rank 2 tensor, but it's 3 rows by one column. 00:58:30.400 |
Why is that interesting? Well, actually, before we get to that, I'll just mention that writing .unsqueeze is a bit clunky. 00:58:39.540 |
So PyTorch and NumPy have a neat trick, which is that you can index into an array with a 00:58:46.520 |
special value none, and none means squeeze a new axis in here please. 00:58:53.960 |
So you can see that c[None, :] is exactly the same shape, (1, 3), as c.unsqueeze(0). 00:59:03.160 |
And c[:, None] is exactly the same shape as c.unsqueeze(1). 00:59:09.120 |
So I hardly ever use unsqueeze unless I'm like particularly trying to demonstrate something 00:59:12.740 |
for teaching purposes, I pretty much always use none. 00:59:15.520 |
Apart from anything else, I can add multiple additional axes this way; whereas with unsqueeze you 00:59:22.320 |
have to go unsqueeze, unsqueeze, unsqueeze. 00:59:30.880 |
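Concretely, the two notations give the same shapes:

```python
import torch

c = torch.tensor([10., 20, 30])

c.unsqueeze(0).shape    # torch.Size([1, 3]): new leading axis, looks like a row
c.unsqueeze(1).shape    # torch.Size([3, 1]): new trailing axis, looks like a column

# indexing with None does the same thing, and can add several axes at once
c[None, :].shape        # torch.Size([1, 3])
c[:, None].shape        # torch.Size([3, 1])
c[None, :, None].shape  # torch.Size([1, 3, 1])
```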
The reason we did all that is because if we go c[:, None] -- so in other words we turn 00:59:40.800 |
it into a column, kind of a columnar shape, so it's now of shape (3, 1) -- then .expand_as doesn't 00:59:50.520 |
now say 10, 20, 30, 10, 20, 30, 10, 20, 30, but it says 10, 10, 10, 20, 20, 20, 30, 30, 30. 00:59:56.720 |
So in other words it's getting broadcast along columns instead of rows. 01:00:01.520 |
So as you might expect, if I take that and add it to M, then I get the result of broadcasting that column across M. 01:00:42.640 |
So we can use the rows and columns functions in Excel to get the rows and columns of this matrix. 01:00:51.560 |
Here is a 3 by 1, rank 2 tensor, again rows and columns. 01:01:07.760 |
So here's what happens if we broadcast this to be the shape of M. 01:01:21.360 |
And here is the result of that, C plus M. And here's what happens if we broadcast the row version instead. 01:01:43.360 |
So basically what's happening is when we broadcast, it's taking the thing which has a 01:01:51.240 |
unit axis and is kind of effectively copying that unit axis so it is as long as the larger tensor. 01:02:01.180 |
But it doesn't really copy it, it just pretends as if it's been copied. 01:02:14.400 |
So this was the loop we were trying to get rid of, going through each of range BC. 01:02:27.760 |
So now we are not anymore going through that loop. 01:02:30.840 |
So now, rather than setting c[i, j], we can set the entire row c[i]. 01:02:42.900 |
Every time there's a trailing colon in NumPy or PyTorch, you can delete it optionally. 01:02:52.920 |
So before, we had a few of those, right, let's see if we can find one. 01:03:05.840 |
So I'm claiming we could have got rid of that. 01:03:13.040 |
And similar thing: any time you see any number of colon-commas at the start, you can replace them with a single ellipsis. 01:03:22.760 |
Which in this case doesn't save us anything because there's only one of these. 01:03:25.520 |
But if you've got like a really high-rank tensor, that can be super convenient, especially 01:03:30.040 |
if you want to do something where the rank of the tensor could vary. 01:03:33.400 |
You don't know how big it's going to be ahead of time. 01:03:40.880 |
So we're going to set the whole of row i -- and we don't need that colon, though it doesn't hurt to have it there. 01:03:48.760 |
And we're going to set it to the whole of row I of A. 01:03:55.800 |
And then now that we've got row I of A, that is a rank 1 tensor. 01:04:05.880 |
So it's now got a new trailing axis -- and see how this is minus 1? 01:04:18.440 |
We could also have written it like that with a special value none. 01:04:28.200 |
So this is of now length whatever the size of A is, which is AR. 01:04:55.280 |
And so this is going to get broadcast over this, which is exactly what we want. 01:05:03.280 |
And then, so that's going to return -- because it broadcast -- it's actually going to return a rank 2 tensor. 01:05:09.640 |
And then that rank 2 tensor, we want to sum it up over the rows. 01:05:15.640 |
And so sum, you can give it a dimension argument to say which axis to sum over. 01:05:22.280 |
So this one is kind of our most mind-bending broadcast of the lesson. 01:05:28.160 |
So I'm going to leave this as a bit of homework for you to go back and convince yourself as to why this works. 01:05:35.380 |
So maybe put it in Excel, or do it on paper, if it's not already clear to you why this works. 01:05:43.840 |
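For reference, here's a sketch of that one-loop broadcast version (the course notebook may use unsqueeze(-1) where this uses [:, None]):

```python
import torch

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # a[i] has shape (ac,); the trailing None makes it (ac, 1), which
        # broadcasts against b's (ac, bc) to give an (ac, bc) product;
        # summing over dim 0 collapses the k axis, leaving row i of the result
        c[i] = (a[i][:, None] * b).sum(dim=0)
    return c
```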
But this is sure handy, because before we were using broadcasting, we were at 1.39 milliseconds. 01:05:53.680 |
After using that broadcasting, we're down to 250 microseconds. 01:05:58.720 |
So at this point, we're now 3,200 times faster than Python. 01:06:07.760 |
Once you get used to this style of coding, getting rid of these loops I find really reduces the number of mistakes I make. 01:06:16.040 |
It takes a while to get used to, but once you're used to it, it's a really comfortable way of programming. 01:06:23.960 |
Once you get to kind of higher ranked tensors, this broadcasting can start getting a bit confusing. 01:06:30.680 |
So what you need to do, instead of trying to keep it all in your head, is apply the simple broadcasting rules. 01:06:39.920 |
The rules here, in NumPy and PyTorch and TensorFlow, are all the same. 01:06:45.120 |
What we do is we compare the shapes element-wise. 01:06:51.280 |
So let's look at a slightly interesting example. 01:06:58.280 |
Here is our rank 1 tensor C, and let's insert a leading unit axis. 01:07:11.720 |
And here's the version with a, sorry, this one's a preceding axis. 01:07:25.760 |
So just to remind you, that looks like a column. 01:07:34.120 |
What if we went c[None, :] * c[:, None] -- what on earth is that? 01:07:51.740 |
What happens is it says, okay, you want to multiply this by this, element-wise, right? 01:07:58.360 |
This is asterisk, so element-wise multiplication. 01:08:01.280 |
And it broadcasts this to be the same number of rows as that, like so. 01:08:08.880 |
And it broadcasts this to be the same number of columns as that, like so. 01:08:14.520 |
And then it simply multiplies those together. 01:08:22.000 |
So the rule that it's using, you can do the same thing with greater than, right? 01:08:27.840 |
The rule that it's using is: let's look at the two shapes, (1, 3) and (3, 1), and see if they're compatible. 01:08:34.760 |
They're compatible if, element-wise, they're either the same number or one of them is 1. 01:08:42.640 |
So in this case, 1 is compatible with 3 because one of them is 1. 01:08:47.400 |
And 3 is compatible with 1 because one of them is 1. 01:08:51.160 |
And so what happens is, if it's 1, that dimension is broadcast to make it the same size as the other one. 01:09:03.600 |
So this one was multiplied 3 times down the rows, and this one was multiplied 3 times across the columns. 01:09:11.180 |
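A quick sketch of that compatibility rule in action:

```python
import torch

c = torch.tensor([10., 20, 30])

c[None, :].shape          # torch.Size([1, 3]): a row
c[:, None].shape          # torch.Size([3, 1]): a column

# (1, 3) and (3, 1) are compatible: wherever a dimension is 1, it is
# (virtually) copied to match the other, so the result is 3 by 3
c[None, :] * c[:, None]   # the outer product: row i is c[i] times [10, 20, 30]
c[None, :] > c[:, None]   # the same broadcasting works for comparisons
```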
And then there's one more rule, which is that they don't even have to be the same rank, 01:09:16.880 |
So something that we do a lot with image normalization is we normalize images by channel, right? 01:09:24.480 |
So you might have an image which is 256 by 256 by 3. 01:09:28.160 |
And then you've got the per-channel mean, which is just a rank 1 tensor of size 3. 01:09:33.960 |
They're actually compatible, because what it does is, anywhere that there's a missing dimension, 01:09:42.200 |
it inserts a leading dimension of size 1. 01:09:44.580 |
So that's why you can actually normalize by channel without writing any loops. 01:09:52.920 |
Mind you, in PyTorch, it's actually channel by height by width, so it's slightly different. 01:10:02.440 |
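As a sketch: for PyTorch's channel-by-height-by-width layout you add the unit axes explicitly (the mean and std values here are made up); in a height-by-width-by-channel layout the leading unit axes would be inserted for you automatically.

```python
import torch

img = torch.rand(3, 256, 256)           # channel x height x width
mean = torch.tensor([0.4, 0.5, 0.6])    # hypothetical per-channel means
std = torch.tensor([0.2, 0.2, 0.2])     # hypothetical per-channel stds

# mean has shape (3,); mean[:, None, None] has shape (3, 1, 1),
# which broadcasts across the height and width dimensions
normalized = (img - mean[:, None, None]) / std[:, None, None]
```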
We're going to take a break, but we're getting pretty close. 01:10:06.320 |
My goal was to make our Python code 50,000 times faster, we're up to 4,000 times faster. 01:10:14.000 |
And the reason this is really important is because if we're going to be like doing our 01:10:19.120 |
own stuff, like building things that people haven't built before, we need to know how 01:10:25.520 |
to write code that we can write quickly and concisely, but that operates fast enough to actually be useful. 01:10:32.720 |
And so this broadcasting trick is perhaps the most important trick to know about. 01:10:38.700 |
So let's have a six-minute break, and I'll see you back here at 8 o'clock. 01:10:44.520 |
So broadcasting, when I first started teaching deep learning here, and I asked how many people 01:10:54.960 |
are familiar with broadcasting -- this is back when we used to do it in Theano -- almost 01:10:58.840 |
no hands went up, so I used to kind of say this is like my secret magic trick. 01:11:04.640 |
I think it's really cool, it's kind of really cool that now half of you have already heard 01:11:08.120 |
of it, and it's kind of sad because it's now not my secret magic trick. 01:11:11.440 |
It's like here's something half of you already knew, but the other half of you, there's a 01:11:17.800 |
reason that people are learning this quickly and it's because it's super cool. 01:11:23.840 |
How many people here know Einstein summation notation? 01:11:30.680 |
So it's not as cool as broadcasting, but it is still very, very cool. 01:11:38.800 |
And this is a technique which I don't think it was invented by Einstein, I think it was 01:11:43.240 |
popularized by Einstein as a way of dealing with these high rank tensor kind of reductions 01:11:49.600 |
that he used in general relativity, I think. 01:11:55.440 |
This is the innermost part of our original matrix multiplication for loop, remember? 01:12:05.120 |
And here's the version when we removed the innermost loop and replaced it with an element-wise 01:12:10.960 |
And you'll notice that what happened was that the repeated K got replaced with a colon. 01:12:21.280 |
What if I move, okay, so first of all, let's get rid of the names of everything. 01:12:29.640 |
And let's move this to the end and put it after an arrow. 01:12:39.720 |
And let's keep getting rid of the names of everything. 01:12:50.040 |
And get rid of the commas and replace spaces with commas. 01:13:03.160 |
And now I just created Einstein summation notation. 01:13:07.280 |
So Einstein summation notation is like a mini language. 01:13:15.160 |
And what it says is, however many, so there's an arrow, right, and on the left of the arrow 01:13:19.880 |
is the input and on the right of the arrow is the output. 01:13:26.080 |
Well they're delimited by comma, so in this case there's two inputs. 01:13:36.800 |
So this is a rank two input and this is another rank two input and this is a rank two output. 01:13:45.880 |
This one is of size i by k, this one is of size k by j, and the output is of size i by j. 01:13:54.440 |
When you see the same letter appearing in different places, it's referring to the same 01:14:00.360 |
So this is of size i, and the output also has i rows. 01:14:10.200 |
So we know how to go from the input shape to the output shape. 01:14:16.360 |
You look for any place that a letter is repeated and you do a dot product over that dimension. 01:14:23.920 |
In other words, it's just like the way we replaced k with colon. 01:14:29.720 |
So this is going to create something of size i by j by doing dot products over these shared 01:14:42.040 |
So that's how you write matrix multiplication with Einstein summation notation. 01:14:48.120 |
And then all you just do is go torch.einsum. 01:14:52.560 |
If you go to the PyTorch einsum docs, or the docs of most of the major libraries, you can find lots of examples. 01:15:03.640 |
You can use it for transpose, diagonalization, tracing, all kinds of things, batch-wise versions. 01:15:11.960 |
So for example, if PyTorch didn't have batch-wise matrix multiplication, I could just create it with einsum. 01:15:25.920 |
So there's all kinds of things you can kind of invent. 01:15:28.400 |
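As a rough sketch of what that looks like in code (the shapes here are just made-up examples):

    import torch

    a = torch.randn(5, 784)
    b = torch.randn(784, 10)

    # Matrix multiplication: k is repeated, so we dot-product over it.
    c = torch.einsum('ik,kj->ij', a, b)        # shape (5, 10)

    # Batch-wise matrix multiplication: just add a batch letter that appears
    # in both inputs and in the output.
    xb = torch.randn(16, 5, 784)
    wb = torch.randn(16, 784, 10)
    cb = torch.einsum('bik,bkj->bij', xb, wb)  # shape (16, 5, 10)

    # Transpose: swap the letters in the output.
    at = torch.einsum('ij->ji', a)             # shape (784, 5)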
And often it's quite handy if you kind of need to put a transpose in somewhere or tweak things a bit. 01:15:37.360 |
So this einsum matmul has now taken us down to 57 microseconds. 01:15:44.080 |
So we're now 16,000 times faster than Python. 01:15:53.680 |
It's a travesty that this exists, because we've got a little mini language inside Python, inside a string. 01:16:03.560 |
You shouldn't be writing programming languages inside a string. 01:16:07.920 |
This is as bad as a regex, you know; like regular expressions are also mini languages inside a string. 01:16:14.160 |
You want your languages to be typed and have IntelliSense and be things that 01:16:19.400 |
you can, like, you know, extend; none of which this mini language does. 01:16:24.840 |
It's amazing, but there's so few things that it actually does, right? 01:16:29.440 |
What I actually want to be able to do is create like any kind of arbitrary combination of 01:16:35.000 |
any axes and any operations and any reductions I like, in any order, in the actual language I'm writing in. 01:16:48.840 |
J and K are the languages that kind of came out of APL. 01:16:52.200 |
This is a kind of a series of languages that have been around for about 60 years, and everybody's pretty much ignored them. 01:17:01.680 |
My hope is that things like Swift and Julia will give us this, like the ability to actually 01:17:09.720 |
write stuff in actual Swift and actual Julia that we can run in an actual debugger and 01:17:14.680 |
use an actual profiler and do arbitrary stuff that's really fast. 01:17:20.360 |
But actually, Swift seems like it might go even quite a bit faster than einsum in an 01:17:28.480 |
even more flexible way, thanks to this new compiler infrastructure called MLIR, which 01:17:33.560 |
actually builds off this and really exciting new research in the compiler world, kind of 01:17:38.040 |
been coming over the last few years, particularly coming out of a system called Halide, which 01:17:42.960 |
is H-A-L-I-D-E, which is this super cool language that basically showed it's possible to create 01:17:49.840 |
a language that can create very, very, very, like totally optimized linear algebra computations 01:18:00.520 |
And since that came along, there's been all kinds of cool research using these techniques 01:18:07.320 |
like something called polyhedral compilation, which kind of have the promise that we're 01:18:13.640 |
going to be able to hopefully, within the next couple of years, write Swift code that 01:18:20.240 |
runs as fast as the next thing I'm about to show you, because the next thing I'm about 01:18:24.360 |
to show you is the PyTorch operation called matmul. 01:18:30.720 |
And matmul takes about 18 microseconds, which is 50,000 times faster than Python. 01:18:41.400 |
Well, if you think about what you're doing when you do a matrix multiply of something 01:18:45.800 |
that's like 50,000 by 784 by 784 by 10, these are things that aren't going to fit in the cache. 01:18:58.200 |
So if you do the kind of standard thing of going down all the rows and across all the 01:19:01.400 |
columns, by the time you've got to the end and you go back to exactly the same column 01:19:05.360 |
again, it forgot the contents and has to go back to RAM and pull it in again. 01:19:10.240 |
So if you're smart, what you do is you break your matrix up into little smaller matrices 01:19:16.280 |
And that way, everything is kind of in cache and it goes super fast. 01:19:19.440 |
Now, normally, to do that, you have to write kind of assembly language code, particularly 01:19:25.360 |
if you want to kind of get it all running in your vector processor. 01:19:29.200 |
And that's how you get these 18 microseconds. 01:19:32.300 |
So currently, to get a fast matrix multiply, things like PyTorch, they don't even write 01:19:37.540 |
it themselves, they basically push that off to something called a BLAS, B-L-A-S, a BLAS 01:19:42.920 |
is a Basic Linear Algebra Subprograms library, where companies like Intel and AMD and NVIDIA write these things for you. 01:19:52.720 |
So you can look up cuBLAS, for example, and this is like NVIDIA's version of BLAS. 01:19:57.920 |
Or you could look up MKL and this is Intel's version of BLAS and so forth. 01:20:04.640 |
And this is kind of awful because, you know, the programmer is limited to this, like, subset of things that the BLAS happens to provide. 01:20:15.800 |
And to use it, you don't really get to write it in Python, you kind of have to write the 01:20:21.200 |
one thing that happens to be turned into that pre-existing BLAS call. 01:20:26.000 |
So this is kind of why we need to do better, right? 01:20:29.560 |
And there are people working on this; there are people actually on Chris Lattner's team working on this. 01:20:36.600 |
You know, there's some really cool stuff like there's something called Tensor Comprehensions, 01:20:41.720 |
which originally came out of the PyTorch world, and I think they're now inside Chris's team 01:20:47.040 |
at Google, where people are basically saying, hey, here are ways to compile these much more general operations into fast code. 01:20:53.360 |
And this is what we want as more advanced practitioners. 01:20:57.840 |
Anyway, for now, in PyTorch world, we're stuck at this level, which is to recognize that there 01:21:05.680 |
are some things where this is, you know, three times faster than the best we can do in an even 01:21:15.920 |
slightly flexible way, and if we compare it to the actually flexible way, which is broadcasting, we had 254 microseconds, so it's a lot faster than that too. 01:21:30.280 |
So wherever possible today, we want to use operations that are predefined in our library, 01:21:37.160 |
particularly for things that kind of operate over lots of rows and columns, the things 01:21:41.480 |
where this memory caching stuff is going to be complicated. 01:21:49.280 |
Matrix multiplication is so common and useful that it's actually got its own operator, which is the @ sign. 01:21:56.240 |
These are actually calling the exact same code. 01:22:02.360 |
@ is not actually just matrix multiplication; @ covers a much broader array of kind of 01:22:10.440 |
tensor reductions across different levels of axes. 01:22:15.120 |
So it's worth checking out what matmul can do, because often it'll be able to handle 01:22:19.800 |
things like batch wise or matrix versus vectors, don't think of it as being only something 01:22:25.760 |
that can do rank two by rank two, because it's a little bit more flexible. 01:22:30.560 |
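As a quick sketch of that flexibility (shapes made up for illustration):

    import torch

    a = torch.randn(5, 784)
    w = torch.randn(784, 10)
    c1 = a @ w                     # rank 2 by rank 2: shape (5, 10)
    c2 = torch.matmul(a, w)        # the same thing, spelled out

    v = torch.randn(784)
    c3 = a @ v                     # matrix times vector: shape (5,)

    ab = torch.randn(16, 5, 784)
    wb = torch.randn(16, 784, 10)
    c4 = ab @ wb                   # batch-wise matrix multiply: shape (16, 5, 10)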
OK, so that's that we have matrix multiplication, and so now we're allowed to use it. 01:22:39.000 |
And so we're going to use it to try to create a forward pass, which means we first need 01:22:45.760 |
a ReLU and matrix initialization, because remember, a model contains parameters which start out random. 01:22:55.880 |
And then we use the gradients to gradually update them with SGD. 01:23:11.600 |
So let's start by importing nb_01, and I just copied and pasted the three lines we used 01:23:18.560 |
to grab the data, and I'm just going to pop them into a function so we can use it to grab 01:23:24.680 |
And now that we know about broadcasting, let's create a normalization function that takes 01:23:29.760 |
our tensor and subtracts the means and divides by the standard deviation. 01:23:35.800 |
So now let's grab our data, OK, and pop it into x_train, y_train, x_valid, y_valid. 01:23:42.320 |
Let's grab the mean and standard deviation, and notice that they're not 0 and 1. 01:23:51.720 |
And we're going to be seeing a lot of why we want them to be 0 and 1 over the next couple 01:24:02.120 |
So that means that we need to subtract the mean and divide by the standard deviation, but 01:24:09.720 |
we don't subtract the validation set's mean and divide by the validation set's standard deviation; we use the training set's mean and standard deviation for both. 01:24:14.640 |
Because if we did, those two data sets would be on totally different scales, right? 01:24:20.200 |
So if the training set was mainly green frogs, and the validation set was mainly red frogs, 01:24:28.200 |
right, then if we normalize with the validation sets mean and variance, we would end up with 01:24:34.880 |
them both having the same average coloration, and we wouldn't be able to tell the two apart, 01:24:41.320 |
So that's an important thing to remember when normalizing, is to always make sure your validation 01:24:45.440 |
and training set are normalized in the same way. 01:24:48.920 |
So after doing that, our mean is pretty close 01:25:00.800 |
to 0, and our standard deviation is very close to 1, and it would be nice to have something that tests that. 01:25:07.000 |
So let's create a test near 0 function, and then test that the mean is near 0, and 1 minus 01:25:13.200 |
the standard deviation is near 0, and that's all good. 01:25:18.200 |
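Roughly, that looks like this; it's a sketch in the spirit of the notebook (x_train and x_valid are the tensors grabbed above), not necessarily the exact code.

    def normalize(x, m, s):
        return (x - m) / s

    train_mean, train_std = x_train.mean(), x_train.std()
    x_train = normalize(x_train, train_mean, train_std)
    # note: the validation set is normalized with the *training* set's statistics
    x_valid = normalize(x_valid, train_mean, train_std)

    def test_near_zero(a, tol=1e-3):
        assert a.abs() < tol, f"Near zero: {a}"

    test_near_zero(x_train.mean())
    test_near_zero(1 - x_train.std())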
Let's define n and m and c the same as before, so n is the size of the training set, m is the number 01:25:25.440 |
of columns, and c is the number of activations we're going to eventually need in our model, and let's try to create our model. 01:25:35.040 |
Okay, so the model is going to have one hidden layer, and normally we would want the final 01:25:45.800 |
output to have 10 activations, because we would use cross-entropy against those 10 activations, 01:25:52.360 |
but to simplify things for now, we're going to not use cross-entropy, we're going to use 01:25:56.660 |
mean squared error, which means we're going to have one activation, okay, which makes 01:26:01.040 |
no sense from our modeling point of view, we'll fix that later, but just to simplify things for now. 01:26:05.500 |
So let's create a simple neural net with a single hidden layer and a single output activation, 01:26:15.880 |
So let's pick a hidden size, so the number of hidden units, nh, will be 50, okay, so for our two layers, 01:26:22.040 |
we're going to need two weight matrices and two bias vectors. 01:26:26.240 |
So here are our two weight matrices, W1 and W2, so they're random numbers, normal random 01:26:33.800 |
numbers of size m, which is the number of columns, 784, by nh, the number of hidden units, and then nh by 1 for the second one. 01:26:45.500 |
Now our inputs now are mean zero, standard deviation 1, the inputs to the first layer. 01:26:54.360 |
We want the inputs to the second layer to also be mean zero, standard deviation 1. 01:27:05.080 |
Because if we just grab some normal random numbers and then we define a function called 01:27:14.160 |
linear, this is our linear layer, which is X by W plus B, and then create T, which is 01:27:20.760 |
the activation of that linear layer with our validation set and our weights and biases. 01:27:28.000 |
We have a mean of minus 5 and a standard deviation of 27, which is terrible. 01:27:36.560 |
So I'm going to let you work through this at home, but once you actually look at what 01:27:43.040 |
happens when you multiply those things together and add them up, as you do in matrix multiplication, 01:27:49.160 |
you'll see that you're not going to end up with 0, 1. 01:27:51.760 |
But if instead you divide by square root m, so root 784, then it's actually damn good. 01:28:07.800 |
So this is a simplified version of something which PyTorch calls Kaiming initialization, 01:28:15.200 |
named after Kaiming He, who wrote a paper, or was the lead writer of a paper, that we're going to look at in a moment. 01:28:24.400 |
So the weights: randn gives you random numbers with a mean of 0 and a standard deviation of 1. 01:28:35.100 |
So if you divide by root m, they will have a mean of 0 and a standard deviation of 1 over root m. 01:28:46.520 |
So in general, normal random numbers of mean 0 and standard deviation of 1 over root of 01:28:57.000 |
whatever this is, so here it's m and here it's nh, will give you an output of 0, 1. 01:29:04.640 |
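In code, the scaling looks something like this sketch, with m = 784 and nh = 50 as in the lecture (the input here is a random stand-in for the normalized data):

    import math
    import torch

    m, nh = 784, 50
    x = torch.randn(10000, m)          # stand-in for the normalized validation set

    def lin(x, w, b):
        return x @ w + b

    # naive init: the standard deviation of the activations blows up
    # (around 27 in the lecture)
    w1, b1 = torch.randn(m, nh), torch.zeros(nh)
    t = lin(x, w1, b1)

    # scaled init: divide by sqrt(m), and the output stays near mean 0, std 1
    w1 = torch.randn(m, nh) / math.sqrt(m)
    w2 = torch.randn(nh, 1) / math.sqrt(nh)
    t = lin(x, w1, b1)
    print(t.mean(), t.std())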
Now this may seem like a pretty minor issue, but as we're going to see in the next couple 01:29:08.960 |
of lessons, it's like the thing that matters when it comes to training neural nets. 01:29:14.760 |
It's actually, in the last few months, people have really been noticing how important this is. 01:29:21.480 |
There are things like fix-up initialization, where these folks actually trained a 10,000-layer 01:29:32.800 |
deep neural network with no normalization layers, just by basically doing careful initialization. 01:29:42.420 |
So it's really, people are really spending a lot of time now thinking like, okay, how do we initialize our networks well? 01:29:49.800 |
And you know, we've had a lot of success with things like one cycle training and super convergence, 01:29:56.480 |
which is all about what happens in those first few iterations, and it really turns out that it's all about initialization. 01:30:04.980 |
So we're going to be spending a lot of time studying this in depth. 01:30:09.960 |
So the first thing I'm going to point out is that this is actually not how our first layer is defined. 01:30:17.520 |
Our first layer is actually defined like this. 01:30:23.920 |
So ReLU is just grab our data and replace any negatives with zeros. 01:30:32.640 |
Now there's lots of ways I could have written this. 01:30:35.200 |
But if you can do it with something that's like a single function in PyTorch, it's almost 01:30:39.160 |
always faster because that thing's generally written in C for you. 01:30:42.280 |
So try to find the thing that's as close to what you want as possible. 01:30:51.880 |
And unfortunately, that does not have a mean zero and standard deviation of one. 01:31:06.480 |
Okay, so we had some data that had a mean of zero and a standard deviation of one. 01:31:23.240 |
And then we took everything that was smaller than zero and removed it. 01:31:33.880 |
So that obviously does not have a mean of zero, and it obviously now has about half the variance that it used to have. 01:31:44.680 |
So this was one of the fantastic insights and one of the most extraordinary papers of 01:31:55.440 |
It was the paper from the 2015 ImageNet winners, led by the person we've mentioned, Kaiming He. 01:32:04.280 |
Kaiming at that time was at Microsoft Research. 01:32:12.400 |
Reading papers from competition winners is a very, very good idea because they tend to 01:32:17.600 |
be, you know, normal papers will have like one tiny tweak that they spend pages and pages 01:32:22.720 |
trying to justify why they should be accepted into NeurIPS, whereas competition winners 01:32:27.520 |
have 20 good ideas and only time to mention them in passing. 01:32:31.400 |
This paper introduced us to ResNets, PReLU layers, and Kaiming initialization, amongst other things. 01:32:48.340 |
Section 2.2, Initialization of Filter Weights for Rectifiers. 01:32:52.800 |
A rectifier is a rectified linear unit, and a rectifier network is any neural network with rectifiers (ReLUs) in it. 01:33:01.120 |
This is only 2015, but it already reads like something from another age in so many ways. 01:33:07.060 |
Like even the words "rectifier units" and "traditional sigmoid activation networks"; no one uses sigmoid activations much anymore. 01:33:16.820 |
So when you read these papers, you kind of have to keep these things in mind. 01:33:19.840 |
They describe what happens if you train very deep models with more than eight layers. 01:33:31.360 |
But anyway, they said that in the old days, people used to initialize these with random Gaussians. 01:33:39.960 |
It's just a fancy word for normal or bell shaped. 01:33:44.840 |
And when you do that, they tend to not train very well. 01:33:49.960 |
And the reason why, they point out, or actually Glorot and Bengio pointed out. 01:33:58.820 |
So you'll see two initializations come up all the time. 01:34:01.600 |
One is either Kaiming or He initialization, which is this one, or the other you'll see 01:34:05.840 |
a lot is Glorot or Xavier initialization, again, named after Xavier Glorot. 01:34:20.760 |
And one of the things you'll notice if you read it is it's very readable. 01:34:29.120 |
And the actual final result they come up with is it's incredibly simple. 01:34:35.200 |
And we're actually going to be re-implementing much of the stuff in this paper over the next 01:34:41.440 |
But basically, they describe one suggestion for how to initialize neural nets. 01:34:52.480 |
And they suggest this particular approach, which is root six over the root of the number 01:35:00.360 |
of input filters plus the number of output filters. 01:35:05.720 |
And so what happened was Kaiming He and that team pointed out that that does not account 01:35:12.760 |
for the impact of a ReLU, the thing that we just noticed. 01:35:20.400 |
If your variance halves each layer and you have a massive deep network with like eight 01:35:27.600 |
layers, then you've got one over two to the eight squishes. 01:35:33.880 |
And if you want to be fancy like the fix up people with 10,000 layers, forget it, right? 01:35:49.080 |
They replace the one on the top with a two on the top. 01:35:53.160 |
So this, which is not to take anything away from this, it's a fantastic paper, right? 01:35:57.800 |
But in the end, the thing they do is to stick a two on the top. 01:36:02.400 |
So we can do that by taking that exact equation we just used and sticking a two on the top. 01:36:09.560 |
And if we do, then the result is much closer. 01:36:16.120 |
It's not perfect, right, but it actually varies quite a lot. 01:36:21.440 |
Sometimes it's further away, but it's certainly a lot better than it was. 01:36:32.160 |
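In code, "sticking a two on the top" is just a tiny change to the scaling shown earlier (again a sketch):

    import math, torch

    m, nh = 784, 50
    # Kaiming / He init for a layer followed by a ReLU: std = sqrt(2 / m)
    w1 = torch.randn(m, nh) * math.sqrt(2 / m)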
So your homework for this week is to read 2.2 of the ResNet paper. 01:36:32.160 |
And what you'll see is that they describe what happens in the forward pass of a neural 01:36:48.000 |
And they point out that for the conv layer, this is the response, Y equals WX plus B. 01:36:53.040 |
Now if you're concentrating, that might be confusing because a conv layer isn't quite 01:36:59.480 |
Y equals WX plus B. A conv layer has a convolution. 01:37:02.880 |
But you remember in part one, I pointed out this neat article from Matt Kleinsmith where 01:37:08.800 |
he showed that the convolutions in CNNs actually are just matrix multiplications with a bunch of tied weights and zeros. 01:37:18.360 |
So this is basically all they're saying here. 01:37:20.740 |
So sometimes there are these kind of like throwaway lines in papers that are actually really important. 01:37:27.200 |
So they point out that you can just think of this as a linear layer. 01:37:30.800 |
And then they basically take you through, step by step, what happens to the variance of your activations as they go through the network. 01:37:41.820 |
And so just try to get to this point here, get as far as backward propagation case. 01:37:46.400 |
So you've got about, I don't know, six paragraphs to read. 01:37:54.520 |
Maybe this one is new if you haven't seen it before: capital pi. 01:37:56.740 |
It's exactly the same as sigma, but instead of doing a sum, you do a product. 01:38:02.120 |
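In other words, capital pi does for products what capital sigma does for sums:

    \prod_{l=1}^{L} x_l = x_1 \cdot x_2 \cdots x_L
    \qquad \text{just as} \qquad
    \sum_{l=1}^{L} x_l = x_1 + x_2 + \cdots + x_L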
So this is a great way to kind of warm up your paper reading muscles: try to read this section. 01:38:08.480 |
And then if that's going well, you can keep going with the backward propagation case, because it's very similar. 01:38:17.440 |
And as we'll see in a moment, the backward pass does a matrix multiply with a transpose 01:38:22.800 |
So the backward pass is slightly different, but it's nearly the same. 01:38:27.200 |
And so then at the end of that, they will eventually come up with their suggestion. 01:38:40.880 |
They suggest root two over nL, where nL is the number of input activations. 01:38:55.900 |
That is called Kaiming initialization, and it gives us a pretty nice variance. 01:39:03.600 |
And the reason it doesn't give us a very nice mean is because, as we saw, we deleted everything below zero. 01:39:09.280 |
So naturally, our mean is now about half, not zero. 01:39:14.680 |
I haven't seen anybody talk about this in the literature, but something I was just trying 01:39:20.480 |
over the last week is something kind of obvious, which is to replace ReLU with not just x.clamp_min(0.) but x.clamp_min(0.) minus 0.5. 01:39:31.440 |
And in my brief experiments, that seems to help. 01:39:35.160 |
So there's another thing that you could try out and see if it actually helps or if I'm 01:39:40.240 |
It certainly returns you to the correct mean. 01:39:44.760 |
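That tweaked ReLU looks something like the sketch below; treat it as an experiment to try, not an established result.

    def relu(x):
        # clamp at zero as usual, then shift down by 0.5 so the mean of the
        # activations comes back towards zero
        return x.clamp_min(0.) - 0.5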
Okay, so now that we have this formula, we can replace it with init.kaiming_normal_, which implements it for us. 01:39:57.720 |
And let's check that it does the same thing, and it does, okay? 01:40:06.840 |
So again, we've got this about half mean and bit under one standard deviation. 01:40:12.960 |
You'll notice here I had to add something extra, which is mode equals fan out. 01:40:22.440 |
What it means is explained here, fan in or fan out, fan in preserves the magnitude of 01:40:30.360 |
variance in the forward pass, fan out preserves the magnitudes in the backward pass. 01:40:35.120 |
Basically, all it's saying is, are you dividing by root m or root nh? 01:40:42.680 |
Because if you divide by root m, as you'll see in that part of the paper I was suggesting 01:40:46.760 |
you read, that will keep the variance at one during the forward pass. 01:40:50.920 |
But if you use nh, it will give you the right unit variance in the backward pass. 01:40:57.240 |
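For reference, the call in question looks roughly like this (why mode='fan_out' is needed here is explained next):

    import torch
    from torch.nn import init

    m, nh = 784, 50
    w1 = torch.zeros(m, nh)
    # fill w1 in place with Kaiming-normal random numbers
    init.kaiming_normal_(w1, mode='fan_out')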
So it's weird that I had to say fan_out, because according to the documentation, that's for preserving the variance in the backward pass. 01:41:08.080 |
Well, it's because our weight shape is 784 by 50, but if you actually create a linear 01:41:18.920 |
layer with PyTorch of the same dimensions, it creates it of 50 by 784. 01:41:30.600 |
And these are the kind of things that it's useful to know how to dig into. 01:41:37.500 |
So to find out how it's working, you have to look in the source code. 01:41:40.640 |
So you can either set up Visual Studio code or something like that and kind of set it 01:41:48.320 |
Or you can just do it here with question mark, question mark. 01:41:51.840 |
And you can see that this is the forward function, and it calls something called f.linear. 01:41:59.240 |
In PyTorch, capital F always refers to the torch.nn.functional module, because you like 01:42:07.680 |
it's used everywhere, so they decided that's worth a single letter. 01:42:11.220 |
So torch.nn.functional.linear is what it calls, and let's look at how that's defined. 01:42:23.340 |
So now we know in PyTorch, a linear layer doesn't just do a matrix product; it multiplies by the transpose of the weight matrix. 01:42:23.340 |
So in other words, it's actually going to turn this into 784 by 50 and then do it. 01:42:37.160 |
And so that's why we kind of had to give it the opposite information when we were trying 01:42:41.800 |
to do it with our linear layer, which doesn't have transpose. 01:42:45.800 |
So the main reason I show you that is to kind of show you how you can dig in to the PyTorch 01:42:52.400 |
And when you come across these kind of questions, you want to be able to answer them yourself. 01:42:57.580 |
Which also then leads to the question, if this is how linear layers can be initialized, 01:43:05.400 |
What does PyTorch do for convolutional layers? 01:43:08.300 |
So we could look inside torch.nn.Conv2d, and when I looked at it, I noticed that it basically contains no implementation itself. 01:43:17.420 |
All of the code actually gets passed down to something called _ConvNd. 01:43:22.480 |
And so you need to know how to find these things. 01:43:26.160 |
And so if you go to the very bottom, you can find the file name it's in. 01:43:30.240 |
And so you see this is actually torch.nn.modules.conv. 01:43:34.840 |
So we can find torch.nn.modules.conv._ConvNd. 01:43:45.740 |
And it calls kaiming_uniform_, which is basically the same as kaiming_normal_, but with uniform rather than normal random numbers. 01:43:52.260 |
But it has a special multiplier of math.sqrt(5). 01:44:01.760 |
And in my experiments, this seems to work pretty badly, as you'll see. 01:44:08.600 |
So it's kind of useful to look inside the code. 01:44:11.040 |
And when you're writing your own code, presumably somebody put this here for a reason. 01:44:15.440 |
Wouldn't it have been nice if they had a URL above it with a link to the paper that they're 01:44:19.200 |
implementing so we could see what's going on? 01:44:22.720 |
So it's always a good idea, always, to put some comments in your code to let the next person know what's going on. 01:44:29.600 |
So that particular thing, I have a strong feeling, isn't great, as you'll see. 01:44:43.240 |
We've already designed our own new activation function. 01:44:51.960 |
But it's this kind of level of tweak, which is kind of-- when people write papers, this 01:44:57.840 |
is the level of-- it's like a minor change to one line of code. 01:45:00.960 |
It'll be interesting to see how much it helps. 01:45:03.400 |
But if I use it, then you can see here, yep, now I have a mean of 0 thereabouts. 01:45:12.080 |
And interestingly, I've also noticed it helps my variance a lot. 01:45:15.760 |
All of my variance, remember, was generally around 0.7 to 0.8. 01:45:21.640 |
So it helps both, which makes sense as to why I think I'm seeing these better results. 01:45:44.160 |
And remember, in PyTorch, a model can just be a function. 01:45:48.800 |
It's just a function that does one linear layer, one ReLU layer, and one more linear layer. 01:45:58.040 |
And OK, it takes eight milliseconds to run the model on the validation set. 01:46:09.160 |
Add an assert to make sure the shape seems sensible. 01:46:13.700 |
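The model-as-a-function is roughly this sketch; it assumes the relu, lin, w1, b1, w2, b2 and x_valid from earlier.

    def model(xb):
        l1 = lin(xb, w1, b1)     # first linear layer
        l2 = relu(l1)            # activation
        l3 = lin(l2, w2, b2)     # second linear layer, one output activation
        return l3

    preds = model(x_valid)
    assert preds.shape == (x_valid.shape[0], 1)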
So the next thing we need for our forward pass is a loss function. 01:46:18.080 |
And as I said, we're going to simplify things for now by using mean squared error, even though it makes no sense for classification. 01:46:26.240 |
Our model is returning something of size 10,000 by 1. 01:46:31.720 |
But mean squared error, you would expect it just to be a single vector of size 10,000. 01:46:40.640 |
In PyTorch, the thing to add a unit axis we've learned is called squeeze-- sorry, unsqueeze. 01:46:47.120 |
The thing to get rid of a unit axis, therefore, is called squeeze. 01:46:49.760 |
So we just go output.squeeze to get rid of that unit axis. 01:46:55.300 |
But actually, now I think about it-- this is lazy. 01:46:59.400 |
Because output.squeeze gets rid of all unit axes. 01:47:03.080 |
And we very commonly see on the fast.ai forums people saying that their code's broken. 01:47:11.020 |
And it's that one case where maybe they had a batch size of size 1. 01:47:15.080 |
And so that 1,1 would get squeezed down to a scalar. 01:47:20.380 |
So rather than just calling squeeze, it's actually better to say which dimension you 01:47:23.800 |
want to squeeze, which we could write either 1 or minus 1, it would be the same thing. 01:47:28.320 |
And this is going to be more resilient now to that weird edge case of a batch size of 01:47:33.880 |
OK, so output minus target, squared, mean: that's mean squared error. 01:47:41.000 |
So remember, in PyTorch, loss functions can just be functions. 01:47:46.560 |
For mean squared error, we're going to have to make sure these are floats. 01:48:05.520 |
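So the loss function sketch is just this (targets assumed already converted to floats):

    def mse(output, targ):
        # squeeze only the trailing unit axis, not every unit axis, so a
        # batch size of 1 doesn't get collapsed to a scalar
        return (output.squeeze(-1) - targ).pow(2).mean()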
What we need is a backward pass, because that's the thing that tells us how to update our parameters. 01:48:13.600 |
OK, how much do you want to know about matrix calculus? 01:48:20.160 |
But if you want to know everything about matrix calculus, I can point you to this excellent 01:48:25.040 |
paper by Terence Parr and Jeremy Howard, which tells you everything about matrix calculus 01:48:25.040 |
So this is a few weeks work to get through, but it absolutely assumes nothing at all. 01:48:44.360 |
So basically, Terrence and I both felt like, oh, we don't know any of this stuff. 01:48:58.200 |
And so this will take you all the way up to knowing everything that you need for deep 01:49:36.320 |
And we stick it through the first linear layer, lin1. 01:49:42.960 |
And then we take the output of that, and we put it through the function ReLU. 01:50:10.080 |
And then we take the output of that, and we put it through the second linear layer, lin2. 01:50:14.920 |
And then we take the output of that, and we put it through the function MSE. 01:50:20.380 |
And strictly speaking, MSE has a second argument, which is the actual target value. 01:50:30.840 |
And we want the gradient of the output with respect to the input. 01:50:40.240 |
So it's a function of a function of a function of a function of a function. 01:50:43.480 |
So if we simplify that down a bit, we could just say, what if it's just like y equals 01:50:49.680 |
f of x-- sorry, y equals f of u and u equals f of x. 01:51:16.360 |
If that doesn't look familiar to you, or you've forgotten it, go to Khan Academy. 01:51:20.680 |
Khan Academy has some great tutorials on the chain rule. 01:51:23.920 |
But this is actually the thing we need to know. 01:51:26.440 |
Because once you know that, then all you need to know is the derivative of each bit on its 01:51:32.040 |
own, and you just multiply them all together. 01:51:37.400 |
And if you ever forget the chain rule, just cross-multiply. 01:51:41.960 |
So that would be dy/du times du/dx; cross out the du's, and you get dy/dx. 01:51:50.680 |
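Written out, the chain rule for a composed function is:

    y = f(u), \quad u = g(x)
    \qquad \Longrightarrow \qquad
    \frac{dy}{dx} \;=\; \frac{dy}{du} \cdot \frac{du}{dx}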
And if you went to a fancy school, they would have told you not to do that. 01:51:55.400 |
They said you can't treat calculus like this, because they're special magic small things. 01:52:04.840 |
There's actually a different way of treating calculus called the calculus of infinitesimals, 01:52:12.540 |
And you suddenly realize you actually can do this exact thing. 01:52:17.580 |
So any time you see a derivative, just remember that all it's actually doing is it's taking 01:52:25.640 |
some function, and it's saying, as you go across a little bit, how much do you go up? 01:52:34.600 |
And then it's taking that change in y divided by that change in x. 01:52:41.240 |
That's literally what it is, where y and x, you must make them small numbers. 01:52:46.160 |
So they behave very sensibly when you just think of them as a small change in y over 01:52:52.760 |
a small change in x, as I just did, showing you the chain rule. 01:52:57.280 |
So to do the chain rule, we're going to have to start with the very last function. 01:53:04.720 |
The very last function on the outside was the loss function, mean squared error. 01:53:12.600 |
So we need the gradient of the loss with respect to the output of the previous layer. 01:53:26.000 |
So, with respect to the output of the previous layer: the MSE is just input minus target, squared. 01:53:34.740 |
And so the derivative of that is just 2 times (input minus target), because the derivative of something squared is 2 times that something. 01:53:47.520 |
Now the thing is that for the chain rule, I'm going to need to multiply all these things together. 01:53:53.280 |
So if I store it inside the .g attribute of the previous layer (because remember, this 01:54:00.520 |
is the previous layer, and the input of MSE is the same thing as the output of the previous layer), 01:54:10.640 |
then, if I store it away in here, I can then quite comfortably refer to it later. 01:54:22.960 |
So ReLU is this, okay, what's the gradient there? It's 1 where the input was positive and 0 where it was negative. 01:54:37.400 |
So therefore, that's the gradient of the ReLU. 01:54:45.280 |
But we need the chain rule, so we need to multiply this by the gradient of the next layer, which we stored away in .g. 01:55:01.880 |
So same thing for the linear layer, the gradient is simply, and this is where the matrix calculus 01:55:08.000 |
comes in, the gradient of a matrix product is simply the matrix product with the transpose. 01:55:13.980 |
So you can either read all that stuff I showed you, or you can take my word for it. 01:55:22.640 |
Here's the function which does the forward pass that we've already seen, and then it does the backward pass. 01:55:29.640 |
It calls each of the gradients backwards, right, in reverse order, because we know we need the later layers' gradients to compute the earlier ones. 01:55:35.300 |
And you can notice that every time we're passing in the result of the forward pass, and it 01:55:42.340 |
also has access, as we discussed, to the gradient of the next layer. 01:55:52.280 |
So when people say, as they love to do, backpropagation is not just the chain rule, they're basically wrong. 01:56:01.320 |
Backpropagation is the chain rule, where we just save away all the intermediate calculations so we don't have to calculate them again. 01:56:18.280 |
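Putting that together in code, the backward pass looks roughly like this sketch, following the .g convention described above; it assumes the lin, relu, mse, w1, b1, w2, b2 from earlier, and the exact names in the notebook may differ.

    def mse_grad(inp, targ):
        # gradient of the loss with respect to the output of the previous layer;
        # the mean in MSE contributes the 1/n factor
        inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

    def relu_grad(inp, out):
        # gradient of ReLU, times the gradient of the next layer (chain rule)
        inp.g = (inp > 0).float() * out.g

    def lin_grad(inp, out, w, b):
        # gradients of a linear layer with respect to its input, weights and bias
        inp.g = out.g @ w.t()
        w.g = inp.t() @ out.g
        b.g = out.g.sum(0)

    def forward_and_backward(inp, targ):
        # forward pass
        l1 = lin(inp, w1, b1)
        l2 = relu(l1)
        out = lin(l2, w2, b2)
        loss = mse(out, targ)          # never actually used in the backward pass
        # backward pass, in reverse order
        mse_grad(out, targ)
        lin_grad(l2, out, w2, b2)
        relu_grad(l1, l2)
        lin_grad(inp, l1, w1, b1)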
One interesting thing here is this value here, loss: we never actually 01:56:26.980 |
use it, because the loss never actually appears in the gradients. 01:56:33.980 |
I mean, just by the way, you still probably want it to be able to print it out, or whatever, 01:56:38.800 |
but it's actually not something that appears in the gradients. 01:56:42.400 |
So that's it, so w1.g, w2.g, et cetera, they now contain all of our gradients, which we're going to use to train our model. 01:56:55.180 |
And so let's cheat and use PyTorch autograd to check our results, because PyTorch can 01:57:03.860 |
So let's clone all of our weights and biases and input, and then turn on requires_grad_. 01:57:14.420 |
So requires_grad_ is how you take a PyTorch tensor and turn it into a magical autogradified tensor. 01:57:22.460 |
So what it's now going to do is everything that gets calculated with test tensor, it's 01:57:26.380 |
basically going to keep track of what happened. 01:57:30.340 |
So it basically keeps track of these steps, so that then it can do these things. 01:57:35.200 |
It's not actually that magical, you could totally write it yourself, you just need to 01:57:39.460 |
make sure that each time you do an operation, you remember what it is, and so then you can 01:57:48.000 |
Okay, so now that we've done requires-grad, we can now just do the forward pass like so, 01:57:56.880 |
that gives us loss in PyTorch, you say loss.backward, and now we can test that, and remember PyTorch 01:58:03.080 |
doesn't store things in .g, it stores them in .grad, and we can test them, and all of 01:58:09.380 |
our gradients were correct, or at least they're the same as PyTorch's. 01:58:15.120 |
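A sketch of that check, continuing from the code above; it assumes forward_and_backward has already been run so the .g attributes exist, and that the targets are floats.

    import torch

    def with_grad(t):
        return t.clone().requires_grad_(True)

    w1g, b1g, w2g, b2g, xg = map(with_grad, (w1, b1, w2, b2, x_train))

    def forward(inp, targ):
        l2 = relu(lin(inp, w1g, b1g))
        out = lin(l2, w2g, b2g)
        return mse(out, targ)

    loss = forward(xg, y_train)
    loss.backward()

    # compare our hand-computed gradients with PyTorch's
    assert torch.allclose(w1.g, w1g.grad, atol=1e-3)
    assert torch.allclose(b2.g, b2g.grad, atol=1e-3)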
So that's pretty interesting, right, I mean that's an actual neural network that kind 01:58:22.840 |
of contains all the main pieces that we're going to need, and we've written all these 01:58:28.520 |
pieces from scratch, so there's nothing magical here, but let's do some cool refactoring. 01:58:35.960 |
I really love this refactoring, and this is massively inspired by and very closely stolen 01:58:42.040 |
from the PyTorch API, but it's kind of interesting, I didn't have the PyTorch API in mind as I 01:58:47.400 |
did this, but as I kept refactoring, I kind of noticed like, oh, I just recreated the PyTorch API. 01:58:55.500 |
So let's take each of our layers, relu and linear, and create classes, right, and for 01:59:04.280 |
the forward, let's use dunder call, __call__. Now, do you remember that dunder call means that we 01:59:10.040 |
can now treat this as if it was a function? Right, so if you call an instance of this class just with parentheses, it calls this function. 01:59:17.260 |
And let's save the input, let's save the output, and let's return the output, right, 01:59:25.120 |
and then backward, do you remember this was our backward pass, okay, so it's exactly the 01:59:30.000 |
same as we had before, okay, but we're going to save it inside self.inp.g, so this 01:59:35.880 |
is exactly the same code as we had here, okay, but I've just moved the forward and backward into a class. 01:59:45.240 |
So here's linear, forward, exactly the same, but each time I'm saving the input, I'm saving 01:59:51.720 |
the output, I'm returning the output, and then here's our backward. 02:00:02.560 |
One thing to notice, the backward pass here, for linear, we don't just want the gradient 02:00:11.720 |
of the outputs with respect to the inputs, we also need the gradient of the outputs with 02:00:17.640 |
respect to the weights and the output with respect to the biases, right, so that's why 02:00:21.560 |
we've got three lots of dot g's going on here, okay, so there's our linear layers forward 02:00:32.360 |
and backward, and then we've got our mean squared error, okay, so there's our forward, 02:00:46.160 |
and we'll save away both the input and the target for using later, and there's our gradient, 02:00:51.000 |
again, same as before, two times input minus target. 02:00:55.940 |
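As classes, that refactoring looks something like this sketch (same logic as the functions above, just wrapped up):

    class Relu():
        def __call__(self, inp):
            self.inp = inp
            self.out = inp.clamp_min(0.)
            return self.out
        def backward(self):
            self.inp.g = (self.inp > 0).float() * self.out.g

    class Lin():
        def __init__(self, w, b):
            self.w, self.b = w, b
        def __call__(self, inp):
            self.inp = inp
            self.out = inp @ self.w + self.b
            return self.out
        def backward(self):
            self.inp.g = self.out.g @ self.w.t()
            self.w.g = self.inp.t() @ self.out.g
            self.b.g = self.out.g.sum(0)

    class Mse():
        def __call__(self, inp, targ):
            self.inp, self.targ = inp, targ
            self.out = (inp.squeeze(-1) - targ).pow(2).mean()
            return self.out
        def backward(self):
            self.inp.g = (2. * (self.inp.squeeze(-1) - self.targ).unsqueeze(-1)
                          / self.targ.shape[0])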
So with this refactoring, we can now create our model, we can just say let's create a model 02:01:02.320 |
class and create something called dot layers with a list of all of our layers, all right, 02:01:07.720 |
notice I'm not using any PyTorch machinery, this is all from scratch, let's define loss 02:01:14.200 |
and then let's define call, and it's going to go through each layer and say x equals 02:01:19.440 |
l(x), so this is how I do that function composition, we're just calling the function on the result 02:01:24.880 |
of the previous thing, okay, and then at the other very end call self dot loss on that, 02:01:31.640 |
and then for backward we do the exact opposite, we go self dot loss dot backward and then 02:01:36.240 |
we go through the reversed layers and call backward on each one, right, and remember the 02:01:41.120 |
backward passes are going to save the gradient away inside the dot g, so with that, let's 02:01:51.640 |
just set all of our gradients to none so that we know we're not cheating, we can then create 02:01:55.600 |
our model, right, this class Model, and call it, and we can call it as if it was a function 02:02:03.480 |
because we have dunder call, right, so this is going to call __call__, and then we 02:02:10.720 |
can call backward, and then we can check that our gradients are correct, right, so that's 02:02:17.080 |
nice, one thing that's not nice is, holy crap that took a long time, let's run it, there 02:02:26.800 |
we go, 3.4 seconds, so that was really really slow, so we'll come back to that, I don't 02:02:36.840 |
like duplicate code, and there's a lot of duplicate code here: self.inp = inp, return 02:02:42.040 |
self.out, that's messy, so let's get rid of it. So what we could do is we could create 02:02:48.520 |
a new class called Module, which basically does the self.inp = inp and return 02:02:54.340 |
self.out for us. And so now we're not going to use dunder call to implement our 02:03:01.240 |
forward; we're going to have __call__ call something called self.forward, which we will initially 02:03:06.140 |
set to raise an exception, not implemented, and backward is going to call self.bwd, passing 02:03:14.160 |
in the things that we just saved. And so now Relu has something called forward, which just 02:03:20.540 |
has that, and backward just has that, so we're now basically back to where we were. 02:03:27.360 |
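Here is roughly what that refactor looks like, including the Model class described a moment ago; it's a sketch in the same style as before, with w.g computed as the input transposed times the output gradient (the matrix-product form mentioned shortly).

    class Module():
        def __call__(self, *args):
            self.args = args
            self.out = self.forward(*args)
            return self.out
        def forward(self):
            raise NotImplementedError()
        def backward(self):
            # pass the saved output and inputs back to the gradient code
            self.bwd(self.out, *self.args)

    class Relu(Module):
        def forward(self, inp):
            return inp.clamp_min(0.)
        def bwd(self, out, inp):
            inp.g = (inp > 0).float() * out.g

    class Lin(Module):
        def __init__(self, w, b):
            self.w, self.b = w, b
        def forward(self, inp):
            return inp @ self.w + self.b
        def bwd(self, out, inp):
            inp.g = out.g @ self.w.t()
            self.w.g = inp.t() @ out.g        # the outer-product-and-sum, as a matmul
            self.b.g = out.g.sum(0)

    class Mse(Module):
        def forward(self, inp, targ):
            return (inp.squeeze(-1) - targ).pow(2).mean()
        def bwd(self, out, inp, targ):
            inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / targ.shape[0]

    class Model():
        def __init__(self, w1, b1, w2, b2):
            self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
            self.loss = Mse()
        def __call__(self, x, targ):
            for l in self.layers:
                x = l(x)
            return self.loss(x, targ)
        def backward(self):
            self.loss.backward()
            for l in reversed(self.layers):
                l.backward()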
so now look how neat that is, and we also realized that this thing we were doing to calculate 02:03:39.880 |
the derivative of the output of the linear layer with respect to the weights, where we're 02:03:46.840 |
doing an unsqueeze and an unsqueeze, which is basically a big outer product and a sum, 02:03:51.840 |
we could actually re-express that with einsum, okay, and when we do that, so our code is 02:04:00.580 |
now neater, and our 3.4 seconds is down to 143 milliseconds, okay, so thank you again 02:04:08.320 |
to einsum, so you'll see this now, look, model equals model, loss equals bla, bla dot backward, 02:04:19.320 |
and now the gradients are all there, that looks almost exactly like PyTorch, and so 02:04:25.780 |
we can see why, why it's done this way, why do we have to inherit from nn.Module, 02:04:31.960 |
why do we have to define forward, this is why, right, it lets PyTorch factor out all 02:04:37.600 |
this duplicate stuff, so all we have to do is do the implementation, so I think that's 02:04:43.680 |
pretty fun, and then once we realized, we thought more about it, more like what are 02:04:49.840 |
we doing with this einsum, and we actually realized that it's exactly the same as just 02:04:54.360 |
doing the input transposed times the output gradient, so we replaced the einsum with a matrix product, 02:05:01.600 |
and that's 140 milliseconds, and so now we've basically implemented nn.Linear and nn.Module, 02:05:10.520 |
so let's now use nn.Linear and nn.Module, because we're allowed to, 02:05:17.140 |
that's the rules, and the forward pass is almost exactly the same speed as our forward 02:05:21.760 |
pass, and their backward pass is about twice as fast, I'm guessing that's because we're 02:05:29.360 |
calculating all of the gradients, and they're not calculating all of them, only the ones 02:05:34.080 |
they need, but it's basically the same thing. So at this point, we're ready in the next 02:05:42.960 |
lesson to do a training loop. We have something, we have a multi-layer fully connected neural 02:05:52.320 |
network, what He's paper would call a rectifier network, we have matrix multiply organized, 02:06:01.440 |
we have our forward and backward passes organized, it's all nicely refactored out into classes 02:06:06.960 |
and a module class, so in the next lesson, we will see how far we can get, hopefully 02:06:13.820 |
we will build a high quality, fast ResNet, and we're also going to take a very deep dive 02:06:23.120 |
into optimizers and callbacks and training loops and normalization methods. Any questions 02:06:31.200 |
before we go? No? That's great. Okay, thanks everybody, see you on the forums.