
Lesson 8 (2019) - Deep Learning from the Foundations


Chapters

0:00 Introduction
0:40 Overview
5:12 Bottom-up
6:47 Why Swift
10:50 Swift for TensorFlow
14:50 The Game
17:00 Why do this
18:47 Homework
19:42 Remember Part 1
21:01 Three Steps to Training a Good Model
23:10 Reading Papers
25:25 Symbols
28:10 Jupyter Notebooks
32:56 Run Notebook
34:13 Notebook to Script
36:00 Standard Library
38:38 Plot
39:52 Matrix Multiplication
44:36 Removing Python
45:12 Elementwise Addition
46:02 Frobenius Norm
48:50 Recap
49:26 Replace inner loop
51:51 Broadcasting
57:05 Columns

Whisper Transcript

00:00:00.000 | So, welcome back to part two of what previously was called Practical Deep Learning for Coders,
00:00:08.960 | but part two is not called that, as you will see.
00:00:11.680 | It's called Deep Learning from the Foundations.
00:00:15.680 | It's lesson eight because it's lesson eight of the full journey, lesson one of part two,
00:00:20.440 | or lesson eight, mod seven, as we sometimes call it.
00:00:26.540 | So those of you, I know a lot of you do every year's course and keep coming back.
00:00:30.280 | For those of you doing that, this will not look at all familiar to you.
00:00:33.720 | It's a very different part two.
00:00:36.240 | We're really excited about it and hope you like it as well.
00:00:40.520 | The basic idea of deep learning from the foundations is that we are going to implement much of
00:00:46.200 | the fast AI library from foundations.
00:00:49.640 | Now, talk about exactly what I mean by foundations in a moment, but it basically means from scratch.
00:00:55.880 | So we'll be looking at basic matrix calculus and creating a training loop from scratch
00:01:01.200 | and creating an optimizer from scratch and lots of different layers and architectures
00:01:05.360 | and so forth, and not just to create some kind of dumbed down library that's not useful
00:01:11.760 | for anything, but to actually build from scratch something you can train cutting edge world-class
00:01:17.720 | models with.
00:01:19.440 | So that's the goal.
00:01:21.080 | We've never done it before.
00:01:22.080 | I don't think anybody's ever done this before.
00:01:24.500 | So I don't exactly know how far we'll get, but this is the journey that we're on.
00:01:29.360 | We'll see how we go.
00:01:30.600 | So in the process, we will be having to read and implement papers because the fast AI library
00:01:37.040 | is full of implemented papers.
00:01:40.160 | So you're not going to be able to do this if you're not reading and implementing papers.
00:01:44.160 | Along the way, we'll be implementing much of PyTorch as well, as you'll see.
00:01:52.520 | We'll also be going deeper into solving some applications that are not kind of fully baked
00:02:00.960 | into the fast AI library yet, so it's going to require a lot of custom work.
00:02:05.720 | So things like object detection, sequence to sequence with attention, the Transformer and
00:02:10.680 | the Transformer XL, CycleGAN, audio, stuff like that.
00:02:17.360 | We'll also be doing a deeper dive into some performance considerations like doing distributed
00:02:22.200 | multi-GPU training, using the new just-in-time compiler, which we'll just call JIT from now
00:02:28.560 | on, CUDA and C++, stuff like that.
00:02:32.680 | So that's the first five lessons.
00:02:34.720 | And then the last two lessons, implementing some subset of that in Swift.
00:02:44.000 | So this is otherwise known as impractical deep learning for coders.
00:02:48.880 | Because really none of this is stuff that you're going to go and use right away.
00:02:52.960 | It's kind of the opposite of part one.
00:02:55.520 | Part one was like, oh, we've been spending 20 minutes on this.
00:02:58.160 | You can now create a world-class vision classification model.
00:03:03.960 | This is not that, because you already know how to do that.
00:03:08.200 | And so back in the earlier years, part two used to be more of the same thing, but it
00:03:13.520 | was kind of like more advanced types of model, more advanced architectures.
00:03:19.400 | But there's a couple of reasons we've changed this year.
00:03:21.600 | The first is so many papers come out now, because this whole area has increased in scale
00:03:28.360 | so quickly, that I can't pick out for you the 12 papers to do in the next seven weeks
00:03:34.560 | that you really need to know, because there's too many.
00:03:37.880 | And it's also kind of pointless, because once you get into it, you realize that all the
00:03:41.120 | papers pretty much say minor variations on the same thing.
00:03:46.080 | So instead, what I want to be able to do is show you the foundations that let you read
00:03:50.840 | the 12 papers you care about and realize like, oh, that's just that thing with this minor
00:03:55.800 | tweak.
00:03:56.800 | And I now have all the tools I need to implement that and test it and experiment with it.
00:04:01.340 | So that's kind of a really key issue in why we want to go in this direction.
00:04:09.600 | Also it's increasingly clear that, you know, we used to call part two cutting edge deep
00:04:14.720 | learning for coders, but it's increasingly clear that the cutting edge of deep learning
00:04:19.120 | is really about engineering, not about papers.
00:04:24.880 | The difference between really effective people in deep learning and the rest is really about
00:04:29.200 | who can, like, make things in code that work properly.
00:04:33.820 | And there's very few of those people.
00:04:36.120 | So really, the goal of this part two is to deepen your practice so you can understand,
00:04:44.920 | you know, the things that you care about and build the things you care about and have them
00:04:48.120 | work and perform at a reasonable speed.
00:04:53.120 | So that's where we're trying to head to.
00:04:56.200 | And so it's impractical in the sense that like none of these are things that you're
00:05:00.200 | going to go probably straight away and say, here's this thing I built, right?
00:05:05.340 | Probably Swift.
00:05:06.340 | Because Swift, we're actually going to be learning a language and a library that as you'll
00:05:10.480 | see is far from ready for use, and I'll describe why we're doing that in a moment.
00:05:16.080 | So part one of this course was top down, right?
00:05:20.960 | So that you got the context you needed to understand, you got the motivation you needed
00:05:24.520 | to keep going, and you got the results you needed to make it useful.
00:05:29.400 | But bottom up is useful too.
00:05:30.920 | And we started doing some bottom up at the end of part one, right?
00:05:34.980 | But really bottom up lets you, when you've built everything from the bottom yourself,
00:05:41.560 | then you can see the connections between all the different things.
00:05:44.380 | You can see they're all variations of the same thing, you know?
00:05:47.500 | And then you can customize, rather than picking algorithm A or algorithm B, you create your
00:05:52.600 | own algorithm to solve your own problem doing just the things you need it to do.
00:05:58.040 | And then you can make sure that it performs well, that you can debug it, profile it, maintain
00:06:05.760 | it, because you understand all of the pieces.
00:06:08.720 | So normally when people say bottom up in this world, in this field, they mean bottom up
00:06:15.080 | with math.
00:06:16.880 | I don't mean that.
00:06:18.160 | I mean bottom up with code, right?
00:06:20.780 | So today, step one will be to implement matrix multiplication from scratch in Python.
00:06:30.240 | Because bottom up with code means that you can experiment really deeply on every part
00:06:36.440 | of every bit of the system.
00:06:38.360 | You can see exactly what's going in, exactly what's coming out, and you can figure out
00:06:41.820 | why your model's not training well, or why it's slow, or why it's giving the wrong answer,
00:06:46.200 | or whatever.
00:06:48.440 | So why Swift?
00:06:51.000 | What are these two lessons about?
00:06:52.280 | And be clear, we are only talking the last two lessons, right?
00:06:55.800 | You know, our focus, as I'll describe, is still very much Python and PyTorch, right?
00:07:03.580 | But there's something very exciting going on.
00:07:07.320 | The first exciting thing is this guy's face you see here, Chris Lattner.
00:07:11.400 | Chris is unique, as far as I know, as being somebody who has built, I think, what is the
00:07:17.320 | world's most widely used compiler framework, LLVM.
00:07:21.920 | He's built the default C and C++ compiler for Mac, being Clang.
00:07:29.600 | And he's built what's probably like the world's fastest growing fairly new computer language,
00:07:35.920 | being Swift.
00:07:37.840 | And he's now dedicating his life to deep learning, right?
00:07:42.160 | So we haven't had somebody from that world come into our world before.
00:07:47.040 | And so when you actually look at stuff like the internals of something like TensorFlow,
00:07:53.060 | it looks like something that was built by a bunch of deep learning people, not by a
00:07:57.080 | bunch of compiler people, right?
00:07:59.840 | And so I've been wanting for over 20 years for there to be a good numerical programming
00:08:07.040 | language that was built by somebody that really gets programming languages.
00:08:12.000 | And it's never happened, you know?
00:08:14.480 | So we've had like, in the early days, it was Lisp-Stat in Lisp, and then it was R and then
00:08:21.440 | it was Python.
00:08:22.440 | None of these languages were built to be good at data analysis.
00:08:30.400 | They weren't built by people that really deeply understood compilers.
00:08:34.720 | They certainly weren't built for today's kind of modern, highly parallel processor situation
00:08:40.160 | we're in.
00:08:42.580 | But Swift was, Swift is, right?
00:08:44.780 | And so we've got this unique situation where for the first time, you know, a really widely
00:08:49.960 | used language, a really well-designed language from the ground up, is actually being targeted
00:08:58.980 | towards numeric programming and deep learning.
00:09:01.360 | So there's no way I'm missing out on that boat.
00:09:05.600 | And I don't want you to miss out on it either.
00:09:09.160 | I should mention there's another language which you could possibly put in there, which
00:09:12.440 | is a language called Julia, which has maybe as much potential.
00:09:17.720 | But it's, you know, it's about ten times less used than Swift.
00:09:22.200 | It doesn't have the same level of community.
00:09:24.880 | But I would still say it's super exciting.
00:09:26.280 | So I'd say, like, maybe there's two languages which you might want to seriously consider
00:09:31.720 | picking one and spending some time with it.
00:09:36.080 | Julia is actually further along.
00:09:38.600 | Swift is very early days in this world.
00:09:40.400 | But that's one of the things I'm excited about for it.
00:09:42.780 | So I actually spent some time over the Christmas break kind of digging into numeric programming
00:09:49.480 | in Swift.
00:09:50.920 | And I was delighted to find that I could create code from scratch that was competitive with
00:10:00.520 | the fastest hand-tuned vendor linear algebra libraries, even though I am -- was and remain
00:10:09.880 | pretty incompetent at Swift.
00:10:11.640 | I found it was a language that, you know, was really delightful.
00:10:15.120 | It was expressive.
00:10:16.120 | It was concise.
00:10:17.120 | But it was also very performant.
00:10:18.760 | And I could write everything in Swift, you know, rather than having to kind of get to
00:10:24.160 | some layer where it's like, oh, that's cuDNN now or that's MKL now or whatever.
00:10:29.080 | So that got me pretty enthusiastic.
00:10:34.080 | And so the really exciting news, as I'm sure you've heard, is that Chris Lattner himself
00:10:39.720 | is going to come and join us for the last two lessons, and we're going to teach Swift
00:10:45.200 | for deep learning together.
00:10:48.520 | So Swift for deep learning means Swift for TensorFlow.
00:10:53.360 | That's specifically the library that Chris and his team at Google are working on.
00:11:01.480 | We will call that S4TF when I write it down because I couldn't be bothered typing
00:11:04.840 | Swift for TensorFlow every time.
00:11:07.920 | Swift for TensorFlow has some pros and cons.
00:11:12.960 | PyTorch has some pros and cons.
00:11:15.400 | And interestingly, they're the opposite of each other.
00:11:19.800 | PyTorch and Python's pros are you can get stuff done right now with this amazing ecosystem,
00:11:27.840 | fantastic documentation and tutorials.
00:11:30.800 | You know, it's just a really great practical system for solving problems.
00:11:38.440 | And to be clear, Swift for TensorFlow is not.
00:11:42.040 | It's not any of those things right now, right?
00:11:44.160 | It's really early.
00:11:47.280 | Almost nothing works.
00:11:49.280 | You have to learn a whole new language if you don't know Swift already.
00:11:52.480 | There's very little ecosystem.
00:11:53.880 | Now, I'm not sure about Swift in particular, but the kind of Swift for TensorFlow and Swift
00:11:58.640 | for deep learning and even Swift for numeric programming.
00:12:01.360 | I was kind of surprised when I got into it to find there was hardly any documentation
00:12:06.480 | about Swift for numeric programming, even though I was pretty delighted by the experience.
00:12:11.960 | People have had this view that Swift is kind of for iPhone programming.
00:12:18.400 | I guess that's kind of how it was marketed, right?
00:12:20.840 | But actually it's an incredibly well-designed, incredibly powerful language.
00:12:27.680 | And then TensorFlow, I mean, to be honest, I'm not a huge fan of TensorFlow in general.
00:12:33.600 | I mean, if I was, we wouldn't have switched away from it.
00:12:37.000 | But it's getting a lot better.
00:12:38.880 | TensorFlow 2 is certainly improving.
00:12:42.520 | And the bits of it I particularly don't like are largely the bits that Swift for TensorFlow
00:12:47.120 | will avoid.
00:12:48.120 | But I think long-term, the kind of things I see happening, like there's this fantastic
00:12:54.200 | new kind of compiler project called MLIR, which Chris is also co-leading, which I think actually
00:13:01.400 | has the potential long-term to allow Swift to replace most of the yucky bits or maybe
00:13:07.520 | even all of the yucky bits of TensorFlow with stuff where Swift is actually talking directly
00:13:11.920 | to LLVM.
00:13:13.800 | You'll be hearing a lot more about LLVM in the coming, in the last two weeks, the last
00:13:17.560 | two lessons.
00:13:18.560 | Basically, it's the compiler infrastructure that kind of everybody uses, that Julia uses,
00:13:26.680 | that Clang uses.
00:13:28.920 | And Swift is this kind of almost this thin layer on top of it, where when you write stuff
00:13:34.400 | in Swift, it's really easy for LLVM to compile it down to super-fast optimized code.
00:13:43.640 | Which is like the opposite of Python.
00:13:46.120 | With Python, as you'll see today, we almost never actually write Python code.
00:13:51.880 | We write code in Python that gets turned into some other language or library, and that's
00:13:56.960 | what gets run.
00:13:58.560 | And this mismatch, this impedance mismatch between what I'm trying to write and what
00:14:03.680 | actually gets run makes it very hard to do the kind of deep dives that we're going to
00:14:07.600 | do in this course, as you'll see.
00:14:11.280 | It's kind of a frustrating experience.
00:14:14.760 | So I'm excited about getting involved in these very early days for impractical deep learning
00:14:20.440 | in Swift for TensorFlow, because it means that me and those of you that want to follow along
00:14:27.080 | can be the pioneers in something that I think is going to take over this field.
00:14:34.000 | We'll be the first in there.
00:14:35.320 | We'll be the ones that understand it really well.
00:14:37.960 | And in your portfolio, you can actually point at things and say, "That library that everybody
00:14:42.080 | uses? I wrote that."
00:14:44.440 | This piece of documentation that's like on the Swift for TensorFlow website, I wrote that.
00:14:49.040 | That's the opportunity that you have.
00:14:51.700 | So let's put that aside for the next five weeks.
00:14:57.600 | And let's try to create a really high bar for the Swift for TensorFlow team to have to
00:15:03.720 | try to re-implement before six weeks' time.
00:15:06.760 | We're going to try to implement as much of fast AI and many parts of PyTorch as we can
00:15:11.880 | and then see if the Swift for TensorFlow team can help us build that in Swift in five weeks'
00:15:18.320 | time.
00:15:19.320 | So the goal is to recreate fast AI from the foundations and much of PyTorch like matrix
00:15:26.200 | multiplication, a lot of torch.nn, torch.optim, dataset, data loader from the foundations.
00:15:34.040 | And this is the game we're going to play.
00:15:35.880 | The game we're going to play is we're only allowed to use these bits.
00:15:39.000 | We're allowed to use pure Python, anything in the Python standard library, any non-data
00:15:45.920 | science modules, so like a requests library for HTTP or whatever, we can use PyTorch but
00:15:53.000 | only for creating arrays, random number generation, and indexing into arrays.
00:16:00.440 | We can use the fastai.datasets library because that's the thing that has access to like MNIST
00:16:04.440 | and stuff, so we don't have to worry about writing our own HTTP stuff.
00:16:08.480 | And we can use matplotlib.
00:16:09.480 | We don't have to write our own plotting library.
00:16:11.800 | That's it.
00:16:12.800 | That's the game.
00:16:13.800 | So we're going to try and recreate all of this from that.
00:16:18.440 | And then the rules are that each time we have replicated some piece of fastai or PyTorch
00:16:24.560 | from the foundations, we can then use the real version if we want to, okay?
00:16:30.560 | So that's the game we're going to play.
00:16:34.240 | What I've discovered as I started doing that is that I started actually making things a
00:16:37.680 | lot better than fastai.
00:16:39.840 | So I'm now realizing that fastai version 1 is kind of a disappointment because there
00:16:43.720 | was a whole lot of things I could have done better.
00:16:45.960 | And so you'll find the same thing.
00:16:47.600 | As you go along this journey, you'll find decisions that I made or the PyTorch team made,
00:16:51.480 | or whatever, where you think, what if they'd made a different decision there?
00:16:55.880 | And you can maybe come up with more examples of things that we could do differently, right?
00:17:02.680 | So why would you do this?
00:17:03.960 | Well, the main reason is so that you can really experiment, right?
00:17:08.320 | So you can really understand what's going on in your models, what's really going on in
00:17:11.480 | your training.
00:17:13.000 | And you'll actually find that in the experiments that we're going to do in the next couple
00:17:17.400 | of classes, we're going to actually come up with some new insights.
00:17:22.600 | If you can create something from scratch yourself, you know that you understand it.
00:17:28.160 | And then once you've created something from scratch and you really understand it, then
00:17:30.920 | you can tweak everything, right?
00:17:32.840 | But you suddenly realize that there's not this object detection system and this convnet
00:17:38.840 | architecture and that optimizer.
00:17:41.640 | They're all like a kind of semi-arbitrary bunch of particular knobs and choices.
00:17:47.040 | And that it's pretty likely that your particular problem would want a different set of knobs
00:17:51.960 | and choices.
00:17:53.080 | So you can change all of these things.
00:17:57.120 | For those of you looking to contribute to open source, to fast AI or to PyTorch, you'll
00:18:03.040 | be able to, right?
00:18:04.640 | Because you'll understand how it's all built up.
00:18:06.320 | You'll understand what bits are working well, which bits need help.
00:18:09.560 | You know how to contribute tests or documentation or new features or create your own libraries.
00:18:17.200 | And for those of you interested in going deeper into research, you'll be implementing papers,
00:18:23.280 | which means you'll be able to correlate the code that you're writing with the paper that
00:18:27.160 | you're reading.
00:18:28.160 | And if you're a poor mathematician like I am, then you'll find that you'll be getting
00:18:33.720 | a much better understanding of papers that you might otherwise have thought were beyond you.
00:18:39.360 | And you realize that all those Greek symbols actually just map to pieces of code that you're
00:18:44.120 | already very familiar with.
00:18:48.080 | So there are a lot of opportunities in part one to blog and to do interesting things,
00:18:55.240 | but the opportunities are much greater now.
00:18:57.160 | In part two, you can be doing homework that's actually at the cutting edge, actually doing
00:19:02.080 | experiments people haven't done before, making observations people haven't made before.
00:19:06.800 | Because you're getting to the point where you're a more competent deep learning practitioner
00:19:12.560 | than the vast majority that are out there, and we're looking at stuff that other people
00:19:16.960 | haven't looked at before.
00:19:17.960 | So please try doing lots of experiments, particularly in your domain area, and consider writing
00:19:26.880 | things down, especially if it's not perfect.
00:19:32.080 | So write stuff down for the you of six months ago.
00:19:38.880 | That's your audience.
00:19:40.720 | Okay, so I am going to be assuming that you remember the contents of part one, which was
00:19:51.520 | these things.
00:19:52.520 | Here's the contents of part one.
00:19:55.200 | In practice, it's very unlikely you remember all of these things because nobody's perfect.
00:20:00.480 | So what I'm actually expecting you to do is as I'm going on about something which you're
00:20:05.040 | thinking I don't know what he's talking about, that you'll go back and watch the video about
00:20:10.680 | that thing.
00:20:12.020 | Don't just keep blasting forwards, because I'm assuming that you already know the content
00:20:17.200 | of part one.
00:20:19.280 | Particularly if you're less confident about the second half of part one, where we went
00:20:24.040 | a little bit deeper into what's an activation, and what's a parameter really, and exactly
00:20:28.680 | how does SGD work.
00:20:30.360 | Particularly in today's lesson, I'm going to assume that you really get that stuff.
00:20:36.400 | So if you don't, then go back and re-look at those videos.
00:20:42.020 | Go back to that SGD from scratch and take your time.
00:20:47.600 | I've kind of designed this course to keep most people busy up until the next course.
00:20:55.720 | So feel free to take your time and dig deeply.
00:21:02.040 | So the most important thing, though, is we're going to try and make sure that you can train
00:21:05.960 | really good models.
00:21:07.780 | And there are three steps to training a really good model.
00:21:11.560 | Step one is to create something with way more capacity you need, and basically no regularization,
00:21:18.320 | and overfit.
00:21:20.140 | So overfit means what?
00:21:23.320 | It means that your training loss is lower than your validation loss?
00:21:30.640 | No, it doesn't mean that.
00:21:31.960 | Remember, it doesn't mean that.
00:21:33.840 | A well-fit model will almost always have training loss lower than the validation loss.
00:21:40.000 | Remember that overfit means you have actually personally seen your validation error getting
00:21:45.200 | worse.
00:21:46.200 | Okay?
00:21:47.200 | Until you see that happening, you're not overfitting.
00:21:49.360 | So step one is overfit.
00:21:51.280 | And then step two is reduce overfitting.
00:21:54.520 | And then step three, okay, there is no step three.
00:21:57.640 | Well, I guess step three is to visualize the inputs and outputs and stuff like that, right?
00:22:01.960 | That is to experiment and see what's going on.
00:22:05.640 | So one is pretty easy normally, right?
00:22:10.660 | Two is the hard bit.
00:22:12.560 | It's not really that hard, but it's basically these are the five things that you can do
00:22:18.080 | in order of priority.
00:22:19.320 | If you can get more data, you should.
00:22:21.560 | If you can do more data augmentation, you should.
00:22:24.000 | If you can use a more generalizable architecture, you should.
00:22:27.040 | And then if all those things are done, then you can start adding regularization like drop-out,
00:22:31.920 | or weight decay, but remember, at that point, you're reducing the effective capacity of
00:22:41.080 | your model.
00:22:42.080 | So it's less good than the first three things.
00:22:44.240 | And then last of all, reduce the architecture complexity.
00:22:48.920 | And most people, most beginners especially, start with reducing the complexity of the
00:22:54.680 | architecture, but that should be the last thing that you try.
00:22:58.440 | Unless your architecture is so complex that it's too slow for your problem, okay?
00:23:04.400 | So that's kind of a summary of what we want to be able to do that we learned about in
00:23:09.080 | part one.
00:23:11.880 | Okay.
00:23:14.560 | So we're going to be reading papers, which we didn't really do in part one.
00:23:18.680 | And papers look something like this, which if you're anything like me, that's terrifying.
00:23:25.320 | But I'm not going to lie, it's still the case that when I start looking at a new paper,
00:23:30.520 | every single time I think I'm not smart enough to understand this, I just can't get past
00:23:37.880 | that immediate reaction because I just look at this stuff and I just go, that's not something
00:23:42.800 | that I understand.
00:23:43.800 | But then I remember, this is the Adam paper, and you've all seen Adam implemented in one
00:23:49.540 | cell of Microsoft Excel, right?
00:23:53.320 | When it actually comes down to it, every time I do get to the point where I understand if
00:23:57.520 | I've implemented a paper, I go, oh my God, that's all it is, right?
00:24:03.100 | So a big part of reading papers, especially if you're less mathematically inclined than
00:24:07.720 | I am, is just getting past the fear of the Greek letters.
00:24:14.280 | I'll say something else about Greek letters.
00:24:16.920 | There are lots of them, right? And it's very hard to read something that you can't actually
00:24:24.160 | pronounce, right?
00:24:25.800 | Because you're just saying to yourself, oh, squiggle bracket one plus squiggle one, G
00:24:30.040 | squiggle one minus squiggle.
00:24:31.920 | And it's like all the squiggles, you just get lost, right?
00:24:34.520 | So believe it or not, it actually really helps to go and learn the Greek alphabet so you
00:24:40.160 | can pronounce alpha times one plus beta one, right?
00:24:45.320 | Whenever you can start talking to other people about it, you can actually read it out loud.
00:24:49.440 | It makes a big difference.
00:24:50.940 | So learn to pronounce the Greek letters.
00:24:54.400 | Note that the people that write these papers are generally not selected for their outstanding
00:24:59.720 | clarity of communication, right?
00:25:02.860 | So you will often find that there'll be a blog post or a tutorial that does a better
00:25:08.880 | job of explaining the concept than the paper does.
00:25:11.960 | So don't be afraid to go and look for those as well, but do go back to the paper, right?
00:25:16.560 | Because in the end, the paper's the one that's hopefully got it mainly right.
00:25:22.680 | Okay.
00:25:26.080 | One of the tricky things about reading papers is the equations have symbols and you don't
00:25:30.200 | know what they mean and you can't Google for them.
00:25:33.760 | So a couple of good resources, if you see symbols you don't recognize, Wikipedia has
00:25:39.480 | an excellent list of mathematical symbols page that you can scroll through.
00:25:44.800 | And even better, de-techify is a website where you can draw a symbol you don't recognize
00:25:51.040 | and it uses the power of machine learning to find similar symbols.
00:25:57.200 | There are lots of symbols that look a bit the same, so you will have to use some level
00:26:00.140 | of judgment, right?
00:26:01.680 | But the thing that it shows here is the LaTeX name and you can then Google for the LaTeX
00:26:07.240 | name to find out what that thing means.
00:26:12.080 | Okay.
00:26:14.320 | So let's start.
00:26:20.640 | Here's what we're going to do over the next couple of lessons.
00:26:24.040 | We're going to try to create a pretty competent modern CNN model.
00:26:31.960 | And we actually already have this bit because we did that in the last course, right?
00:26:39.840 | We already have our layers for creating a ResNet.
00:26:42.180 | We actually got a pretty good result.
00:26:44.720 | So we just have to do all these things, okay, to get us from here to here.
00:26:51.680 | This is just the next couple of lessons.
00:26:53.080 | After that we're going to go a lot further, right?
00:26:56.920 | So today we're going to try to get to at least the point where we've got the backward
00:27:00.720 | pass going, right?
00:27:02.120 | So remember, we're going to build a model that takes an input array and we're going
00:27:07.280 | to try and create a simple, fully connected network, right?
00:27:10.240 | So it's going to have one hidden layer.
00:27:12.360 | So we're going to start with some input, do a matrix multiply, do a ReLU, do a matrix
00:27:17.440 | multiply, do a loss function, okay?
00:27:21.200 | And so that's a forward pass and that'll tell us our loss.
00:27:24.960 | And then we will calculate the gradients of the loss with respect to the
00:27:32.520 | weights and biases in order to basically multiply them by some learning rate, which we will
00:27:38.520 | then subtract off the parameters to get our new set of parameters.
00:27:43.120 | And we'll repeat that lots of times.
00:27:46.280 | So to get to our fully connected backward pass, we will need to first of all have the
00:27:51.680 | fully connected forward pass and the fully connected forward pass means we will need
00:27:55.480 | to have some initialized parameters and we'll need a ReLU and we will also need to be able
00:28:01.440 | to do matrix multiplication.
00:28:04.660 | So let's start there.
00:28:11.440 | So let's start at the 00_exports notebook.
00:28:19.600 | And what I'm showing you here is how I'm going to go about building up our library in Jupyter
00:28:26.700 | notebooks.
00:28:28.580 | A lot of very smart people have assured me that it is impossible to do effective library
00:28:36.080 | development in Jupyter notebooks, which is a shame because I've built a library in Jupyter
00:28:41.760 | notebooks.
00:28:43.440 | So anyway, people will often tell you things are impossible, but I will tell you my point
00:28:47.920 | of view, which is that I've been programming for over 30 years and in the time I've been
00:28:55.560 | using Jupyter notebooks to do my development, I would guess I'm about two to three times
00:29:00.160 | more productive.
00:29:02.160 | I've built a lot more useful stuff in the last two or three years than I did beforehand.
00:29:08.640 | I'm not saying you have to do things this way either, but this is how I develop and
00:29:14.280 | hopefully you find some of this useful as well.
00:29:16.840 | So I'll show you how.
00:29:20.880 | We need to do a couple of things.
00:29:22.400 | We can't just create one giant notebook with our whole library.
00:29:26.360 | Somehow we have to be able to pull out those little gems, those bits of code where we think,
00:29:31.280 | oh, this is good.
00:29:32.520 | Let's keep this.
00:29:33.520 | We have to be able to pull that out into a package that we reuse.
00:29:37.520 | So in order to tell our system that here is a cell that I want you to keep and reuse,
00:29:44.480 | I use this special comment, hash export at the top of the cell.
00:29:50.200 | And then I have a program called notebook to script, which goes through the notebook
00:29:56.800 | and finds those cells and puts them into a Python module.
00:30:02.520 | So let me show you.
00:30:03.640 | So if I run this cell, okay, so if I run this cell and then I head over and notice I don't
00:30:12.640 | have to type all of exports because I have tab completion, even for file names in Jupyter
00:30:19.240 | notebook.
00:30:20.240 | So tab is enough and I could either run this here or I could go back to my console and
00:30:28.200 | run it.
00:30:29.200 | So let's run it here.
00:30:30.560 | Okay, so that says converted exports.ipynb to nb_00.
00:30:36.640 | And what I've done is I've made it so that these things go into a directory called exp
00:30:40.920 | for exported modules.
00:30:43.320 | And here is that nb_00.
00:30:46.360 | And there it is, right?
00:30:47.360 | So you can see other than a standard header, it's got the contents of that one cell.
00:30:52.240 | So now I can import that at the top of my next notebook: from exp.nb_00 import *.
00:31:01.560 | And I can create a test that that variable equals that value.
00:31:07.960 | So let's see.
00:31:10.900 | It does.
00:31:11.900 | Okay.
00:31:12.900 | And notice there's a lot of test frameworks around, but it's not always helpful to use
00:31:18.480 | them.
00:31:19.480 | Like here we've created a test framework or the start of one.
00:31:23.880 | I've created a function called test, which checks whether A and B return true or false
00:31:29.720 | based on this comparison function by using assert.
00:31:34.460 | And then I've created something called test_eq, which calls test passing in A and
00:31:38.840 | B and operator.eq.
00:31:41.360 | Okay.
00:31:42.360 | So if they're wrong, assertion error equals test, test one.
00:31:49.960 | Whoops.
00:31:50.960 | Okay.
00:31:51.960 | So we've been able to write a test, which so far has basically tested that our little
00:31:55.960 | module exporter thing works correctly.
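As a rough sketch, test helpers along those lines might look like the following (the exact signatures here are an assumption based on the description above, not the lesson's code verbatim):

```python
import operator

def test(a, b, cmp, cname=None):
    # assert that the comparison function holds, with a readable failure message
    if cname is None:
        cname = cmp.__name__
    assert cmp(a, b), f"{cname}:\n{a}\n{b}"

def test_eq(a, b):
    # equality test built on the generic helper, comparing with operator.eq
    test(a, b, operator.eq, "==")
```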
00:32:00.560 | We probably want to be able to run these tests somewhere other than just inside a notebook
00:32:04.440 | . So we have a little program called run_notebook.py and you pass it the name of a notebook.
00:32:14.560 | And it runs it.
00:32:16.920 | So I should save this one with our failing test so you can see it fail.
00:32:24.600 | So first time it passed and then I make the failing test and you can see here it is assertion
00:32:27.960 | error and tells you exactly where it happened.
00:32:30.840 | Okay.
00:32:31.840 | So we now have an automatable unit testing framework in our Jupyter Notebook.
00:32:41.240 | I'll point out that the contents of these two Python scripts, let's look at them.
00:32:51.840 | So the first one was run_notebook.py, which is our test runner.
00:32:56.180 | There is the entirety of it, right?
00:32:58.280 | So there's a thing called nbformat, so if you conda install nbformat, then it basically
00:33:04.060 | lets you execute a notebook and it prints out any errors.
00:33:08.140 | So that's the entirety of that.
00:33:10.500 | You'll notice that I'm using a library called fire.
00:33:14.920 | Fire is a really neat library that lets you take any function like this one and automatically
00:33:20.600 | converts it into a command line interface.
00:33:23.640 | So here I've got a function called run notebook and then it says fire, run notebook.
00:33:28.960 | So if I now go python run_notebook.py, then it says, oh, this function received no value,
00:33:37.000 | path, usage, run notebook, path.
00:33:39.920 | So you can see that what it did was it converted my function into a command line interface,
00:33:46.200 | which is really great.
00:33:47.200 | And it handles things like optional arguments and classes and it's super useful, particularly
00:33:52.920 | for this kind of Jupyter-first development, because you can grab stuff that's in Jupyter
00:33:57.560 | and turn it into a script often by just copying and pasting the function or exporting it.
00:34:02.720 | And then just add this one line of code.
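For reference, a test runner in this style might look roughly like the sketch below. It assumes the `nbformat`/`nbconvert` execution API and the `fire` package; the exact script in the course repo may differ.

```python
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor
import fire

def run_notebook(path):
    "Execute the notebook at `path`, raising an error if any cell fails."
    nb = nbformat.read(path, as_version=4)
    ExecutePreprocessor(timeout=600).preprocess(nb, {})
    print('done')

if __name__ == '__main__':
    fire.Fire(run_notebook)  # exposes run_notebook as a command line interface
```

Running `python run_notebook.py` with no arguments then produces fire's usage message, as described above.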
00:34:07.120 | The other one, notebook2script, is not much more complicated.
00:34:17.880 | It's one screen of code, which again, the main thing here is to call fire, which calls
00:34:23.400 | this one function and you'll see basically it uses JSON.load because notebooks are JSON.
00:34:29.880 | The reason I mentioned this to you is that Jupyter notebook comes with this whole kind
00:34:35.920 | of ecosystem of libraries and APIs and stuff like that.
00:34:40.640 | And on the whole, I hate them.
00:34:43.080 | I find it's just JSON.
00:34:44.880 | I find that just doing JSON.load is the easiest way.
00:34:49.240 | And specifically I build my Jupyter notebook infrastructure inside Jupyter notebooks.
00:34:55.000 | So here's how it looks, right, import JSON, JSON.load this file and gives you an array
00:35:03.640 | and there's the contents of source, my first row, right?
00:35:08.880 | So if you do want to play around with doing stuff in Jupyter notebooks, it's a really great
00:35:13.560 | environment for kind of automating stuff and running scripts on it and stuff like that.
00:35:20.080 | So there's that.
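The core idea of notebook2script, sketched under the assumption that we only keep code cells whose first line starts with `#export` (the output file naming is hypothetical):

```python
import json, re
from pathlib import Path

def notebook2script(fname):
    "Copy #export-tagged code cells from a notebook into a module under exp/."
    fname = Path(fname)
    nb = json.load(open(fname, encoding='utf-8'))   # a notebook is just JSON
    code_cells = [''.join(c['source']) for c in nb['cells'] if c['cell_type'] == 'code']
    exports = [src for src in code_cells if src.strip().startswith('#export')]
    num = re.search(r'\d+', fname.stem)
    out = Path('exp') / f"nb_{num.group(0) if num else '00'}.py"
    out.parent.mkdir(exist_ok=True)
    out.write_text('\n\n'.join(exports))
    print(f'Converted {fname} to {out}')
```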
00:35:22.320 | All right.
00:35:23.420 | So that's the entire contents of our development infrastructure.
00:35:28.440 | We now have a test.
00:35:29.440 | Let's make it pass again.
00:35:30.720 | One of the great things about having unit tests in notebooks is that when one does fail,
00:35:36.960 | you open up a notebook, which can have pros saying, this is what this test does.
00:35:42.160 | It's implementing this part of this paper.
00:35:44.060 | You can see all the stuff above it that's setting up all the context for it.
00:35:46.960 | You can check in each input and output.
00:35:48.800 | It's a really great way to fix those failing tests because you've got the whole truly literate
00:35:54.680 | programming experience all around it.
00:35:57.740 | So I think that works great.
00:35:59.840 | Okay.
00:36:00.840 | So before we start doing matrix multiply, we need some matrices to multiply.
00:36:06.500 | So these are some of the things that are allowed by our rules.
00:36:09.480 | We've got some stuff that's part of the standard library.
00:36:12.560 | This is the fastai.datasets library to let us grab the data sets we need; some more standard
00:36:17.220 | library stuff.
00:36:18.220 | We're only allowed to use this for indexing and array creation, and matplotlib.
00:36:23.000 | There you go.
00:36:25.260 | So let's grab MNIST.
00:36:27.600 | So to grab MNIST, we can use fastai.datasets to download it.
00:36:34.160 | And then we can use a standard library gzip to open it.
00:36:38.480 | And then we can pickle.load it.
00:36:40.560 | So in Python, the kind of standard serialization format is called pickle.
00:36:44.660 | And so this MNIST version on deeplearning.net is stored in that, in that format.
00:36:49.620 | And so it basically gives us a tuple of tuples of data sets, like so: x train, y
00:36:55.680 | train, x valid, y valid.
00:36:58.800 | It actually contains NumPy arrays, but NumPy arrays are not allowed in our foundations.
00:37:05.720 | So we have to convert them into tensors.
00:37:08.700 | So we can just use the Python map to map the tensor function over each of these four arrays
00:37:16.600 | to get back four tensors.
00:37:19.200 | A lot of you will be more familiar with NumPy arrays than PyTorch tensors.
00:37:26.040 | But you know, everything you can do in NumPy arrays, you can also do in PyTorch tensors,
00:37:31.300 | you can also do it on the GPU and have all this nice deeplearning infrastructure.
00:37:36.800 | So it's a good idea to get used to using PyTorch tensors, in my opinion.
00:37:42.280 | So we can now grab the number of rows and number of columns in the training set.
00:37:49.200 | And we can take a look.
00:37:51.520 | So here's MNIST, hopefully pretty familiar to you already.
00:37:56.320 | It's 50,000 rows by 784 columns, and the y data looks something like this.
00:38:04.120 | The y shape is just 50,000 rows, and the minimum and maximum of the dependent variable is zero
00:38:11.360 | to nine.
00:38:12.360 | So hopefully that all looks pretty familiar.
00:38:14.680 | So let's add some tests.
00:38:16.260 | So the n should be equal to the shape of the y, should be equal to 50,000.
00:38:24.920 | The number of columns should be equal to 28 by 28, because that's how many pixels there
00:38:29.120 | are in MNIST, and so forth.
00:38:31.320 | And we're just using that test equals function that we created just above.
00:38:38.400 | So now we can plot it.
00:38:40.520 | Okay, so we've got a float tensor.
00:38:44.160 | And we pass that to imshow after casting it to a 28 by 28.
00:38:49.960 | Dot view is really important.
00:38:51.160 | I think we saw it a few times in part one, but get very familiar with it.
00:38:54.420 | This is how we reshape our 784-long vector into a 28 by 28 matrix that's suitable for
00:39:03.040 | plotting.
00:39:04.040 | Okay, so there's our data.
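Putting that data-setup step together, here is a hedged sketch using the `test_eq` helper sketched earlier; it assumes fastai v1's `datasets.download_data` helper and the deeplearning.net pickle layout, so treat the URL and keyword arguments as assumptions:

```python
from fastai import datasets
import pickle, gzip
from torch import tensor
import matplotlib.pyplot as plt

MNIST_URL = 'http://deeplearning.net/data/mnist/mnist.pkl'
path = datasets.download_data(MNIST_URL, ext='.gz')        # download via fastai.datasets
with gzip.open(path, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')

# convert the NumPy arrays into PyTorch tensors
x_train, y_train, x_valid, y_valid = map(tensor, (x_train, y_train, x_valid, y_valid))

n, c = x_train.shape
test_eq(n, y_train.shape[0])   # 50,000 rows
test_eq(c, 28 * 28)            # 784 pixels per image

# reshape the first flat 784-long row into a 28x28 image and plot it
plt.imshow(x_train[0].view(28, 28))
```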
00:39:06.900 | And let's start by creating a simple linear model.
00:39:13.920 | So for a linear model, we're going to need to basically have something where y equals
00:39:19.540 | a x plus b. And so our a will be a bunch of weights.
00:39:26.480 | So it's going to be a 784 by 10 matrix, because we've got 784 coming in and 10 going out.
00:39:35.680 | So that's going to allow us to take in our independent variable and map it to something
00:39:40.880 | which we compare to our dependent variable.
00:39:44.320 | And then for our bias, we'll just start with 10 zeros.
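In code, a minimal sketch of those parameters (plain `torch.randn` here; the lesson refines the scaling of this initialization later, so treat it as a placeholder):

```python
import torch

weights = torch.randn(784, 10)  # 784 pixels in, 10 digit classes out
bias = torch.zeros(10)          # one bias per output class, starting at zero
```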
00:39:49.720 | So if we're going to do y equals a x plus b, then we're going to need a matrix multiplication.
00:39:56.560 | So almost everything we do in deep learning is basically matrix multiplication or a variant
00:40:04.360 | thereof affine functions, as we call them.
00:40:08.180 | So you want to be very comfortable with matrix multiplication.
00:40:12.480 | So this cool website, matrixmultiplication.xyz, shows us exactly what happens
00:40:17.200 | when we multiply these two matrices.
00:40:23.320 | So we take the first row of the first matrix and the first column of the second, and we multiply each of
00:40:29.240 | them element-wise, and then we add them up, and that gives us that one.
00:40:36.680 | And now you can see we've got two sets going on at the same time, so that gives us two
00:40:39.520 | more, and then two more, and then the final one.
00:40:45.580 | And that's our matrix multiplication.
00:40:47.800 | So we have to do that.
00:40:52.000 | So we've got a few loops going on.
00:40:55.240 | We've got the loop of this thing scrolling down here.
00:40:58.960 | We've got the loop of these two rows.
00:41:01.200 | They're really columns, so we flip them around.
00:41:03.960 | And then we've got the loop of the multiply and add.
00:41:06.920 | So we're going to need three loops.
00:41:10.700 | And so here's our three loops.
00:41:13.360 | And notice this is not going to work unless the number of columns here and the number
00:41:23.640 | of rows here are the same.
00:41:26.960 | So let's grab the number of rows and columns of A, and the number of rows and columns of
00:41:33.440 | B, and make sure that AC equals BR, just to double check.
00:41:40.160 | And then let's create something of size AR by BC, because the size of this is going to
00:41:45.000 | be AR by BC with zeros in, and then have our three loops.
00:41:54.320 | And then right in the middle, let's do that.
00:42:02.800 | OK, so right in the middle, the result in i, j is going to be a[i, k] times b[k, j].
00:42:16.040 | And this is the vast majority of what we're going to be doing in deep learning.
00:42:20.800 | So get very, very comfortable with that equation, because we're going to be seeing it in three
00:42:27.040 | or four different variants of notation and style in the next few weeks, in the next few
00:42:33.440 | minutes.
00:42:35.440 | And it's got kind of a few interesting things going on.
00:42:38.040 | This I here appears also over here.
00:42:41.720 | This J here appears also over here.
00:42:45.000 | And then the K in the loop appears twice.
00:42:49.840 | And look, it's got to be the same number in each place, because this is the bit where
00:42:53.160 | we're multiplying together the element-wise things.
00:42:56.640 | So there it is.
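Written out, the three-loop version described above looks roughly like this:

```python
import torch

def matmul(a, b):
    ar, ac = a.shape          # rows, columns of a
    br, bc = b.shape          # rows, columns of b
    assert ac == br           # inner dimensions must match
    c = torch.zeros(ar, bc)   # result is ar by bc
    for i in range(ar):           # loop over rows of the result
        for j in range(bc):       # loop over columns of the result
            for k in range(ac):   # the multiply-and-add loop
                c[i, j] += a[i, k] * b[k, j]
    return c
```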
00:42:57.960 | So let's create a nice small version, grab the first five rows of the validation set.
00:43:02.760 | We'll call that M1.
00:43:04.000 | And grab our weight matrix.
00:43:05.000 | We'll call that M2.
00:43:07.960 | Grab our weight matrix, call that M2.
00:43:10.560 | And then here's their sizes, five, because we just grabbed the first five rows, five
00:43:15.400 | by 784, OK, multiplied by 784 by 10.
00:43:20.280 | So these match as they should.
00:43:23.200 | And so now we can go ahead and do that matrix multiplication.
00:43:26.560 | And it's done, OK?
00:43:28.560 | And it's given us 50,000-- sorry, length of-- sorry.
00:43:34.720 | It's given us T1.shape.
00:43:40.800 | As you would expect, a five rows by 10 column output.
00:43:45.240 | And it took about a second.
00:43:48.640 | So it took about a second for five rows.
00:43:51.520 | Our data set, MNIST, is 50,000 rows.
00:43:54.800 | So it's going to take about 50,000 seconds to do a single matrix multiplication in Python.
00:44:02.920 | So imagine doing MNIST where every layer for every pass took about 10 hours.
00:44:12.560 | Not going to work, right?
00:44:14.300 | So that's why we don't really write things in Python.
00:44:17.800 | Like, when we say Python is too slow, we don't mean 20% too slow.
00:44:24.600 | We mean thousands of times too slow.
00:44:27.940 | So let's see if we can speed this up by 50,000 times.
00:44:33.080 | Because if we could do that, it might just be fast enough.
00:44:36.260 | So the way we speed things up is we start in the innermost loop.
00:44:41.160 | And we make each bit faster.
00:44:43.840 | So the way to make Python faster is to remove Python.
00:44:49.840 | And the way we remove Python is by passing our computation down to something that's written
00:44:54.760 | in something other than Python, like PyTorch.
00:44:58.520 | Because PyTorch behind the scenes is using a library called ATen.
00:45:04.720 | And so we want to get this going down to the ATen library.
00:45:07.480 | So the way we do that is to take advantage of something called element-wise operations.
00:45:12.680 | So you've seen them before.
00:45:14.800 | For example, if I have two tensors, A and B, both of length three, I can add them together.
00:45:23.080 | And when I add them together, it simply adds together the corresponding items.
00:45:27.960 | So that's called element-wise addition.
00:45:31.360 | Or I could do less than, in which case it's going to do element-wise less than.
00:45:37.340 | So what percentage of A is less than the corresponding item of B, A less than B dot float dot mean.
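A quick sketch of those element-wise operations (the example values are just an assumption for illustration):

```python
import torch

a = torch.tensor([10., 6., -4.])
b = torch.tensor([2., 8., 7.])

a + b                   # tensor([12., 14.,  3.])  -- element-wise addition
a < b                   # tensor([False,  True,  True]) -- element-wise comparison
(a < b).float().mean()  # the fraction of elements of a that are less than b
```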
00:45:48.120 | We can do element-wise operations on things not just of rank one, but we could do it on
00:45:53.640 | a rank two tensor, also known as a matrix.
00:45:58.660 | So here's our rank two tensor, M. Let's calculate the Frobenius norm.
00:46:07.160 | How many people know about the Frobenius norm?
00:46:10.040 | Right, almost nobody.
00:46:12.240 | And it looks kind of terrifying, right, but actually it's just this.
00:46:18.520 | It's a matrix times itself dot sum dot square root.
00:46:25.080 | So here's the first time we're going to start trying to translate some equations into code
00:46:31.720 | to help us understand these equations.
00:46:34.720 | So this says, when you see something like A with two sets of double lines around it,
00:46:41.640 | and an F underneath, that means we are calculating the Frobenius norm.
00:46:46.720 | So any time you see this, and you will, it actually pops up semi-regularly in deep learning
00:46:50.600 | literature, when you see this, what it actually means is this function.
00:46:57.500 | As you probably know, capital sigma means sum, and this says we're going to sum over
00:47:03.120 | two for loops.
00:47:05.240 | The first for loop will be called i, and we'll go from 1 to n.
00:47:10.760 | And the second for loop will also be called j, and will also go from 1 to n.
00:47:16.360 | And in these nested for loops, we're going to grab something out of a matrix A, that
00:47:22.640 | position i, j.
00:47:25.120 | We're going to square it, and then we're going to add all of those together, and then we'll
00:47:30.120 | take the square root.
00:47:33.400 | Which is that.
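So under the game's rules, the Frobenius norm of a matrix m is just this (using a small example matrix in the style of the one shown in the lesson):

```python
import torch

m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])

# square every element, sum them all, take the square root
frob = (m * m).sum().sqrt()
```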
00:47:37.440 | Now I have something to admit to you.
00:47:40.440 | I can't write LaTeX.
00:47:42.860 | And yet I did create this Jupyter notebook, so it looks a lot like I created some LaTeX,
00:47:47.920 | which is certainly the impression I like to give people sometimes.
00:47:51.120 | But the way I actually write LaTeX is I find somebody else who wrote it, and then I copy
00:47:56.480 | So the way you do this most of the time is you Google for Frobenius Norm, you find the
00:48:01.680 | wiki page for Frobenius Norm, you click edit next to the equation, and you copy and paste it.
00:48:09.720 | So that's a really good way to do it.
00:48:11.680 | And chuck dollar signs or even two dollar signs around it.
00:48:13.960 | Two dollar signs make it a bit bigger.
00:48:16.880 | So that's way one to get equations.
00:48:19.760 | Method two is if it's in a paper on arXiv, did you know on arXiv you can click on download
00:48:26.440 | other formats in the top right, and then download source, and that will actually give you the
00:48:32.080 | original TeX source, and then you can copy and paste their LaTeX.
00:48:38.040 | So I'll be showing you a bunch of equations during these lessons, and I can promise you
00:48:41.880 | one thing, I wrote none of them by hand.
00:48:44.800 | So this one was stolen from Wikipedia.
00:48:48.920 | All right, so you now know how to implement the Frobenius Norm from scratch in TensorFlow.
00:48:58.480 | You could also have written it, of course, as m.pow(2), but that would be illegal under
00:49:07.160 | our rules.
00:49:08.160 | We're not allowed to use pow yet, so that's why we did it that way.
00:49:14.680 | So that's just doing the element-wise multiplication of a rank two tensor with itself.
00:49:20.880 | One times one, two times two, three times three, etc.
00:49:27.120 | So that is enough information to replace this loop, because this loop is just going through
00:49:36.120 | the first row of A and the first column of B, and doing an element-wise multiplication
00:49:43.760 | and sum.
00:49:45.800 | So our new version is going to have two loops, not three.
00:49:48.680 | Here it is.
00:49:50.440 | So this is all the same, but now we've replaced the inner loop, and you'll see that basically
00:49:59.520 | it looks exactly the same as before, but where it used to say k, it now says colon.
00:50:04.680 | So in pytorch and numpy, colon means the entirety of that axis.
00:50:11.860 | So Rachel, help me remember the order of rows and columns when we talk about matrices. What
00:50:21.480 | is the song?
00:50:22.480 | Row by column, row by column.
00:50:25.920 | So that's the song.
00:50:26.920 | So i is the row number.
00:50:30.120 | So this is row number i, the whole row.
00:50:34.560 | And this is column number j, the whole column.
00:50:39.160 | So multiply all of column j by all of row i, and that gives us back a rank one tensor,
00:50:46.840 | which we add up.
00:50:49.160 | That's exactly the same as what we had before.
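In code, the inner loop collapses to a single element-wise line, roughly:

```python
import torch

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            # all of row i of a times all of column j of b, summed up
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c
```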
00:50:51.520 | And so now that takes 1.45 milliseconds.
00:50:55.280 | We've removed one line of code, and it's 178 times faster.
00:51:00.360 | So we successfully got rid of that inner loop.
00:51:03.720 | And so now this is running in C. We didn't really write Python here.
00:51:10.360 | We wrote kind of a Python-ish thing that said, please call this C code for us.
00:51:16.040 | And that made it 178 times faster.
00:51:19.140 | Let's check that it's right.
00:51:20.720 | We can't really check that it's equal, because floats are sometimes changed slightly, depending
00:51:27.200 | on how you calculate them.
00:51:28.640 | So instead, let's create something called near, which calls torch.allclose to some tolerance.
00:51:35.480 | And then we'll create a test_near function that calls our test function using our near
00:51:39.400 | comparison.
00:51:42.880 | And let's see.
00:51:44.880 | Passes.
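A sketch of those comparison helpers, building on the `test` function from the earlier sketch (the exact tolerances are an assumption):

```python
import torch

def near(a, b):
    # treat two float tensors as equal if they match to within a small tolerance
    return torch.allclose(a, b, rtol=1e-3, atol=1e-5)

def test_near(a, b):
    test(a, b, near)  # uses the generic `test` helper defined earlier
```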
00:51:46.880 | So we've now got our matrix multiplication at 65 microseconds.
00:51:52.760 | Now we need to get rid of this loop, because now this is our innermost loop.
00:51:58.200 | And to do that, we're going to have to use something called broadcasting.
00:52:02.000 | Who here is familiar with broadcasting?
00:52:05.200 | About half.
00:52:07.200 | That's what I figured.
00:52:08.200 | So broadcasting is about the most powerful tool we have in our toolbox for writing code
00:52:16.560 | in Python that runs at C speed, or in fact, with PyTorch, if you put it on the GPU, it's
00:52:24.360 | going to run at CUDA speed.
00:52:27.900 | It allows us to get rid of nearly all of our loops, as you'll see.
00:52:33.960 | The term "broadcasting" comes from NumPy, but the idea actually goes all the way back
00:52:38.280 | to APL from 1962.
00:52:43.400 | And it's a really, really powerful technique.
00:52:46.480 | A lot of people consider it a different way of programming, where we get rid of all of
00:52:50.360 | our for loops and replace them with these implicit, broadcasted loops.
00:52:58.600 | In fact, you've seen broadcasting before.
00:53:01.940 | Remember our tensor A, which contains 10, 6, 4?
00:53:06.000 | If you say A greater than 0, then on the left-hand side, you've got to rank one tensor.
00:53:12.160 | On the right-hand side, you've got a scalar.
00:53:14.360 | And yet somehow it works.
00:53:16.640 | And the reason why is that this value 0 is broadcast three times.
00:53:22.800 | It becomes 0, 0, 0, and then it does an element-wise comparison.
00:53:28.000 | So every time, for example, you've normalized a dataset by subtracting the mean and divided
00:53:33.560 | by the standard deviation in a kind of one line like this, you've actually been broadcasting.
00:53:39.400 | You're broadcasting a scalar to a tensor.
00:53:45.420 | So A plus one also broadcasts a scalar to a tensor.
00:53:50.600 | And the tensor doesn't have to be rank one.
00:53:52.680 | Here we can multiply our rank two tensor by two.
00:53:57.340 | So there's the simplest kind of broadcasting.
00:53:59.520 | And any time you do that, you're not operating at Python speed, you're operating at C or
00:54:06.280 | CUDA speed.
00:54:07.800 | So that's good.
00:54:11.100 | We can also broadcast a vector to a matrix.
00:54:15.800 | So here's a rank one tensor C. And here's our previous rank two tensor M. So M's shape
00:54:24.840 | is 3, 3, C's shape is 3.
00:54:29.660 | And yet M plus C does something.
00:54:34.320 | What did it do?
00:54:36.320 | 10, 20, 30 plus 1, 2, 3, 10, 20, 30 plus 4, 5, 6, 10, 20, 30 plus 7, 8, 9.
00:54:46.560 | It's broadcast this row across each row of the matrix.
00:54:54.200 | And it's doing that at C speed.
00:54:57.800 | So this, there's no loop, but it sure looks as if there was a loop.
00:55:04.480 | C plus M does exactly the same thing.
00:55:08.600 | So we can write C dot expand as M. And it shows us what C would look like when broadcast
00:55:19.440 | to M. 10, 20, 30, 10, 20, 30, 10, 20, 30.
00:55:23.000 | So you can see M plus T is the same as C plus M. So basically it's creating or acting as
00:55:33.240 | if it's creating this bigger rank two tensor.
00:55:39.740 | So this is pretty cool because it now means that any time we need to do something between
00:55:44.040 | a vector and a matrix, we can do it at C speed with no loop.
00:55:51.920 | Now you might be worrying though that this looks pretty memory intensive if we're kind
00:55:56.060 | of turning all of our rows into big matrices, but fear not.
00:56:00.260 | Because you can look inside the actual memory used by PyTorch.
00:56:05.040 | So here T is a 3 by 3 matrix, but T dot storage tells us that actually it's only storing one
00:56:12.040 | copy of that data.
00:56:15.760 | T dot shape tells us that T knows it's meant to be a 3 by 3 matrix.
00:56:21.280 | And T dot stride tells us that it knows that when it's going from column to column, it
00:56:26.780 | should take one step through the storage.
00:56:30.800 | But when it goes from row to row, it should take zero steps.
00:56:35.780 | And so that's how come it repeats 10, 20, 30, 10, 20, 30, 10, 20, 30.
00:56:40.680 | So this is a really powerful thing that appears in pretty much every linear algebra library
00:56:44.680 | you'll come across is this idea that you can actually create tensors that behave like higher
00:56:51.600 | rank things than they're actually stored as.
00:56:55.680 | So this is really neat.
00:56:56.680 | It basically means that this broadcasting functionality gives us C like speed with no
00:57:01.760 | additional memory overhead.
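Here is a rough sketch of what was just shown, assuming `c` and `m` hold the same example values as before: `expand_as` gives a view with a zero stride between rows rather than copying the data.

```python
import torch

c = torch.tensor([10., 20, 30])
m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

t = c.expand_as(m)       # looks like three copies of the row
print(m + c)             # same result as m + t
print(t.storage())       # only one copy of 10, 20, 30 is actually stored
print(t.shape)           # torch.Size([3, 3])
print(t.stride())        # (0, 1): zero steps between rows, one step between columns
```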
00:57:04.680 | Okay, what if we wanted to take a column instead of a row?
00:57:12.820 | So in other words, a rank 2 tensor of shape 3, 1.
00:57:21.520 | We can create a rank 2 tensor of shape 3, 1 from a rank 1 tensor by using the unsqueeze
00:57:30.240 | method.
00:57:33.040 | Unsqueeze adds an additional dimension of size 1 to wherever we ask for it.
00:57:39.880 | So unsqueeze 0, let's check this out, unsqueeze 0 is of shape 1, 3, it puts the new dimension
00:57:49.500 | in position 0.
00:57:52.000 | Unsqueeze 1 is shape 3, 1, it creates the new axis in position 1.
00:57:58.760 | So unsqueeze 0 looks a lot like C, but now rather than being a rank 1 tensor, it's now
00:58:07.960 | a rank 2 tensor.
00:58:09.840 | See how it's got two square brackets around it?
00:58:12.680 | See how its size is 1, 3?
00:58:16.880 | And more interestingly, C.unsqueeze 1 now looks like a column, right?
00:58:23.340 | It's also a rank 2 tensor, but it's 3 rows by one column.
00:58:28.280 | Why is this interesting?
00:58:30.400 | Because we can say, well actually before we do, I'll just mention writing .unsqueeze is
00:58:37.920 | kind of clunky.
00:58:39.540 | So PyTorch and NumPy have a neat trick, which is that you can index into an array with a
00:58:46.520 | special value none, and none means insert a new axis in here please.
00:58:53.960 | So you can see that C none colon is exactly the same shape, 1, 3, as C.unsqueeze 0.
00:59:03.160 | And C colon, none is exactly the same shape as C.unsqueeze 1.
00:59:09.120 | So I hardly ever use unsqueeze unless I'm like particularly trying to demonstrate something
00:59:12.740 | for teaching purposes, I pretty much always use none.
00:59:15.520 | Apart from anything else, I can add additional axes this way, or else with unsqueeze you
00:59:22.320 | have to go to unsqueeze, unsqueeze, unsqueeze.
00:59:25.200 | So this is handy.
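A small sketch of the unsqueeze versus None-indexing equivalence being described (the values of `c` are again an assumption):

```python
import torch

c = torch.tensor([10., 20, 30])

print(c.unsqueeze(0).shape)    # torch.Size([1, 3]) -- new axis in position 0
print(c.unsqueeze(1).shape)    # torch.Size([3, 1]) -- new axis in position 1
print(c[None, :].shape)        # torch.Size([1, 3]) -- same as unsqueeze(0)
print(c[:, None].shape)        # torch.Size([3, 1]) -- same as unsqueeze(1)
print(c[None, :, None].shape)  # torch.Size([1, 3, 1]) -- several new axes in one go
```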
00:59:28.960 | So why did we do all that?
00:59:30.880 | The reason we did all that is because if we go C colon, none, so in other words we turn
00:59:40.800 | it into a column, kind of a columnar shape, so it's now of shape 3, 1, .expandas, it doesn't
00:59:50.520 | now say 10, 20, 30, 10, 20, 30, 10, 20, 30, but it says 10, 10, 10, 20, 20, 20, 30, 30, 30.
00:59:56.720 | So in other words it's getting broadcast along columns instead of rows.
01:00:01.520 | So as you might expect, if I take that and add it to M, then I get the result of broadcasting
01:00:12.880 | the column.
01:00:14.340 | So it's now not 11, 22, 33, but 11, 12, 13.
01:00:25.120 | So everything makes more sense in Excel.
01:00:29.120 | Let's look.
01:00:30.320 | So here's broadcasting in Excel.
01:00:33.400 | Here is a 1, 3 shape rank 2 tensor.
01:00:42.640 | So we can use the rows and columns functions in Excel to get the rows and columns of this
01:00:49.400 | object.
01:00:51.560 | Here is a 3 by 1, rank 2 tensor, again rows and columns.
01:00:59.960 | And here is a 3 by 3, rank 2 tensor.
01:01:05.240 | As you can see, rows by columns.
01:01:07.760 | So here's what happens if we broadcast this to be the shape of M.
01:01:21.360 | And here is the result of that C plus M. And here's what happens if we broadcast this to
01:01:29.480 | that shape.
01:01:31.420 | And here is the result of that addition.
01:01:34.760 | And there it is, 11, 12, 13, 24, 25, 26.
01:01:43.360 | So basically what's happening is when we broadcast, it's taking the thing which has a
01:01:51.240 | unit axis and is kind of effectively copying that unit axis so it is as long as the larger
01:01:58.980 | tensor on that axis.
01:02:01.180 | But it doesn't really copy it, it just pretends as if it's been copied.
01:02:06.780 | So we can use that to get rid of our loop.
01:02:14.400 | So this was the loop we were trying to get rid of, going through each of range BC.
01:02:26.240 | And so here it is.
01:02:27.760 | So now we are not anymore going through that loop.
01:02:30.840 | So now rather than setting CI comma J, we can set the entire row of CI.
01:02:38.720 | This is the same as CI comma colon.
01:02:42.900 | Every time there's a trailing colon in NumPy or PyTorch, you can delete it optionally.
01:02:51.440 | You don't have to.
01:02:52.920 | So before, we had a few of those, right, let's see if we can find one.
01:03:03.840 | Here's one.
01:03:04.840 | Comma colon.
01:03:05.840 | So I'm claiming we could have got rid of that.
01:03:07.640 | Let's see.
01:03:08.800 | Yep, still torch size 1 comma 3.
01:03:13.040 | And similar thing, any time you see any number of colon commas at the start, you can replace
01:03:19.660 | them with a single ellipsis.
01:03:22.760 | Which in this case doesn't save us anything because there's only one of these.
01:03:25.520 | But if you've got like a really high-rank tensor, that can be super convenient, especially
01:03:30.040 | if you want to do something where the rank of the tensor could vary.
01:03:33.400 | You don't know how big it's going to be ahead of time.
01:03:40.880 | So we're going to set the whole of row I, and we don't need that colon, though it doesn't
01:03:46.840 | matter if it's there.
01:03:48.760 | And we're going to set it to the whole of row I of A.
01:03:55.800 | And then now that we've got row I of A, that is a rank 1 tensor.
01:04:01.840 | So let's turn it into a rank 2 tensor.
01:04:05.880 | So it's now got a new-- and see how this is minus 1?
01:04:11.520 | So minus 1 always means the last dimension.
01:04:15.840 | So how else could we have written that?
01:04:18.440 | We could also have written it like that with a special value none.
01:04:28.200 | So this is now of length equal to the number of columns of A, which is AC.
01:04:38.440 | So it's of shape AC comma 1.
01:04:43.760 | So that is a rank 2 tensor.
01:04:50.120 | And B is also a rank 2 tensor.
01:04:51.960 | That's the entirety of our matrix.
01:04:55.280 | And so this is going to get broadcast over this, which is exactly what we want.
01:05:01.320 | We want it to get rid of that loop.
01:05:03.280 | And then, so that's going to return, because it broadcast, it's actually going to return
01:05:07.680 | a rank 2 tensor.
01:05:09.640 | And then that rank 2 tensor, we want to sum it up over the rows.
01:05:15.640 | And so sum, you can give it a dimension argument to say which axis to sum over.
01:05:22.280 | So this one is kind of our most mind-bending broadcast of the lesson.
01:05:28.160 | So I'm going to leave this as a bit of homework for you to go back and convince yourself as
01:05:34.300 | to why this works.
01:05:35.380 | So maybe put it in Excel or do it on paper if it's not already clear to you why this
01:05:42.280 | works.
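If it helps, here is a sketch of the broadcast version of the inner loop being described, following the naming convention used earlier (`ar`, `ac`, `br`, `bc` for the shapes); the random test matrices are just stand-ins:

```python
import torch

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # a[i] is rank 1 (length ac); the trailing unit axis makes it a column (ac, 1),
        # which broadcasts against b (ac by bc), then we sum over the rows (dim=0).
        c[i] = (a[i].unsqueeze(-1) * b).sum(dim=0)
    return c

a = torch.randn(5, 4)
b = torch.randn(4, 3)
assert torch.allclose(matmul(a, b), a @ b, atol=1e-5)
```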
01:05:43.840 | But this is sure handy, because before we did that broadcasting, we were at 1.39 milliseconds.
01:05:53.680 | After using that broadcasting, we're down to 250 microseconds.
01:05:58.720 | So at this point, we're now 3,200 times faster than Python.
01:06:06.680 | And it's not just speed.
01:06:07.760 | Once you get used to this style of coding, getting rid of these loops I find really reduces
01:06:13.020 | a lot of errors in my code.
01:06:16.040 | It takes a while to get used to, but once you're used to it, it's a really comfortable
01:06:18.960 | way of programming.
01:06:23.960 | Once you get to kind of higher ranked tensors, this broadcasting can start getting a bit
01:06:29.680 | complicated.
01:06:30.680 | So what you need to do instead of trying to keep it all in your head is apply the simple
01:06:35.240 | broadcasting rules.
01:06:38.920 | Here are the rules.
01:06:39.920 | Same here; in NumPy and PyTorch and TensorFlow, it's all the same rules.
01:06:45.120 | What we do is we compare the shapes element-wise.
01:06:51.280 | So let's look at a slightly interesting example.
01:06:58.280 | Here is our rank1 tensor C, and let's insert a leading unit axis.
01:07:06.280 | So this is a shape 1,3.
01:07:08.200 | See how there's two square brackets?
01:07:11.720 | And here's the version with a, sorry, this one's a preceding axis.
01:07:17.420 | This one's a trailing axis.
01:07:18.680 | So this is shape 3,1.
01:07:23.960 | And we should take a look at that.
01:07:25.760 | So just to remind you, that looks like a column.
01:07:34.120 | What if we went C, none, colon, times C, colon, none, what on earth is that?
01:07:43.640 | And so let's go back to Excel.
01:07:47.680 | Here's our row version.
01:07:49.920 | Here's our column version.
01:07:51.740 | What happens is it says, okay, you want to multiply this by this, element-wise, right?
01:07:57.120 | This is not the at sign.
01:07:58.360 | This is asterisk, so element-wise multiplication.
01:08:01.280 | And it broadcasts this to be the same number of rows as that, like so.
01:08:08.880 | And it broadcasts this to be the same number of columns as that, like so.
01:08:14.520 | And then it simply multiplies those together.
01:08:20.720 | That's it, right?
01:08:22.000 | So the rule that it's using, you can do the same thing with greater than, right?
01:08:27.840 | The rule that it's using is, let's look at the two shapes, 1, 3 and 3,1, and see if they're
01:08:33.760 | compatible.
01:08:34.760 | They're compatible if, element-wise, they're either the same number or one of them is 1.
01:08:42.640 | So in this case, 1 is compatible with 3 because one of them is 1.
01:08:47.400 | And 3 is compatible with 1 because one of them is 1.
01:08:51.160 | And so what happens is, if it's 1, that dimension is broadcast to make it the same size as the
01:08:57.160 | bigger one, okay?
01:08:59.600 | So 3,1 became 3,3.
01:09:03.600 | So this one was multiplied 3 times down the rows, and this one was multiplied 3 times
01:09:08.200 | down the columns.
01:09:11.180 | And then there's one more rule, which is that they don't even have to be the same rank,
01:09:15.880 | right?
01:09:16.880 | So something that we do a lot with image normalization is we normalize images by channel, right?
01:09:24.480 | So you might have an image which is 256 by 256 by 3.
01:09:28.160 | And then you've got the per-channel mean, which is just a rank 1 tensor of size 3.
01:09:33.960 | They're actually compatible because what it does is, anywhere that there's a missing dimension,
01:09:39.440 | it inserts a 1 there at the start.
01:09:42.200 | It inserts leading dimensions and inserts a 1.
01:09:44.580 | So that's why actually you can normalize by channel without writing any loop.
01:09:52.920 | Mind you, in PyTorch, it's actually channel by height by width, so it's slightly different.
01:09:58.080 | But this is the basic idea.
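A minimal sketch of the broadcasting rules just stated; the image shape and the use of a height-by-width-by-channel layout are assumptions for illustration (PyTorch itself stores images channel-first, as noted above):

```python
import torch

c = torch.tensor([10., 20, 30])

# Shapes (1, 3) and (3, 1): element-wise, each pair is either equal or 1,
# so both dimensions get broadcast and the result is (3, 3).
outer = c[None, :] * c[:, None]
print(outer.shape)                    # torch.Size([3, 3])

# Different ranks: missing leading dimensions are treated as 1, so a per-channel
# mean of shape (3,) broadcasts as (1, 1, 3) against a (256, 256, 3) image.
img = torch.rand(256, 256, 3)
channel_mean = img.mean(dim=(0, 1))   # shape (3,)
normed = img - channel_mean           # broadcast over height and width, no loop
print(normed.shape)                   # torch.Size([256, 256, 3])
```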
01:10:00.740 | So this is super cool.
01:10:02.440 | We're going to take a break, but we're getting pretty close.
01:10:06.320 | My goal was to make our Python code 50,000 times faster, we're up to 4,000 times faster.
01:10:14.000 | And the reason this is really important is because if we're going to be like doing our
01:10:19.120 | own stuff, like building things that people haven't built before, we need to know how
01:10:25.520 | to write code that we can write quickly and concisely, but operates fast enough that it's
01:10:31.160 | actually useful, right?
01:10:32.720 | And so this broadcasting trick is perhaps the most important trick to know about.
01:10:38.700 | So let's have a six-minute break, and I'll see you back here at 8 o'clock.
01:10:44.520 | So broadcasting, when I first started teaching deep learning here, and I asked how many people
01:10:54.960 | are familiar with broadcasting, this is back when we used to do it in Theano, almost
01:10:58.840 | no hands went up, so I used to kind of say this is like my secret magic trick.
01:11:04.640 | I think it's really cool, it's kind of really cool that now half of you have already heard
01:11:08.120 | of it, and it's kind of sad because it's now not my secret magic trick.
01:11:11.440 | It's like here's something half of you already knew, but the other half of you, there's a
01:11:17.800 | reason that people are learning this quickly and it's because it's super cool.
01:11:22.680 | Here's another magic trick.
01:11:23.840 | How many people here know Einstein summation notation?
01:11:27.480 | Okay, good, good, almost nobody.
01:11:30.680 | So it's not as cool as broadcasting, but it is still very, very cool.
01:11:37.080 | Let me show you, right?
01:11:38.800 | And this is a technique which I don't think was invented by Einstein; I think it was
01:11:43.240 | popularized by Einstein as a way of dealing with these high-rank tensor kind of reductions
01:11:49.600 | that are used in general relativity.
01:11:53.960 | Here's the trick.
01:11:55.440 | This is the innermost part of our original matrix multiplication for loop, remember?
01:12:05.120 | And here's the version when we removed the innermost loop and replaced it with an element-wise
01:12:09.960 | product.
01:12:10.960 | And you'll notice that what happened was that the repeated K got replaced with a colon.
01:12:17.640 | Okay, so what's this?
01:12:21.280 | What if I move, okay, so first of all, let's get rid of the names of everything.
01:12:29.640 | And let's move this to the end and put it after an arrow.
01:12:39.720 | And let's keep getting rid of the names of everything.
01:12:50.040 | And get rid of the commas and replace spaces with commas.
01:13:00.240 | Okay.
01:13:03.160 | And now I just created Einstein summation notation.
01:13:07.280 | So Einstein summation notation is like a mini language.
01:13:11.960 | You put it inside a string, right?
01:13:15.160 | And what it says is, however many, so there's an arrow, right, and on the left of the arrow
01:13:19.880 | is the input and on the right of the arrow is the output.
01:13:24.000 | How many inputs do you have?
01:13:26.080 | Well they're delimited by comma, so in this case there's two inputs.
01:13:32.520 | The inputs, what's the rank of each input?
01:13:35.200 | It's however many letters there are.
01:13:36.800 | So this is a rank two input and this is another rank two input and this is a rank two output.
01:13:43.960 | How big are the inputs?
01:13:45.880 | This is one is the size i by k, this one is the size k by j, and the output is of size
01:13:53.120 | i by j.
01:13:54.440 | When you see the same letter appearing in different places, it's referring to the same
01:13:58.520 | size dimension.
01:14:00.360 | So this is of size i, and the output also has i rows.
01:14:06.120 | This has j columns.
01:14:07.400 | The output also has j columns.
01:14:09.200 | Alright.
01:14:10.200 | So we know how to go from the input shape to the output shape.
01:14:13.880 | What about the k?
01:14:16.360 | You look for any place that a letter is repeated and you do a dot product over that dimension.
01:14:23.920 | In other words, it's just like the way we replaced k with colon.
01:14:29.720 | So this is going to create something of size i by j by doing dot products over these shared
01:14:37.960 | k's, which is matrix multiplication.
01:14:42.040 | So that's how you write matrix multiplication with Einstein summation notation.
01:14:48.120 | And then all you do is go torch dot einsum.
01:14:52.560 | If you go to the PyTorch einsum docs, or the docs of most of the major libraries, you can find
01:15:01.560 | all kinds of cool examples of einsum.
01:15:03.640 | You can use it for transpose, diagonalization, tracing, all kinds of things, batch wise versions
01:15:10.840 | of just about everything.
01:15:11.960 | So for example, if PyTorch didn't have batch-wise matrix multiplication, I could just create it.
01:15:23.760 | There's batch-wise matrix multiplication.
01:15:25.920 | So there's all kinds of things you can kind of invent.
01:15:28.400 | And often it's quite handy if you kind of need to put a transpose in somewhere or tweak
01:15:32.680 | things to be a little bit different.
01:15:34.320 | You can use this.
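A short sketch of the einsum usage described here, for plain and batch-wise matrix multiplication; the shapes are arbitrary stand-ins:

```python
import torch

a = torch.randn(5, 4)
b = torch.randn(4, 3)

# 'ik,kj->ij': the repeated k is summed over, which is matrix multiplication.
c = torch.einsum('ik,kj->ij', a, b)
assert torch.allclose(c, a @ b, atol=1e-5)

# Adding a leading batch index b to everything gives batch-wise matmul.
xb = torch.randn(10, 5, 4)
yb = torch.randn(10, 4, 3)
zb = torch.einsum('bik,bkj->bij', xb, yb)
assert torch.allclose(zb, torch.bmm(xb, yb), atol=1e-5)
```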
01:15:35.440 | So that's Einstein summation notation.
01:15:37.360 | So einsum matmul, and that's now taken us down to 57 microseconds.
01:15:44.080 | So we're now 16,000 times faster than Python.
01:15:49.320 | I will say something about einsum, though.
01:15:53.680 | It's a travesty that this exists because we've got a little mini language inside Python in
01:15:59.600 | a string.
01:16:00.600 | I mean, that's horrendous.
01:16:03.560 | You shouldn't be writing programming languages inside a string.
01:16:07.920 | This is as bad as a regex, you know, like regular expressions are also mini languages
01:16:13.160 | inside a string.
01:16:14.160 | You want your languages to be typed and have IntelliSense and be things that
01:16:19.400 | you can, you know, extend, which this mini language doesn't really let you do.
01:16:24.840 | It's amazing, but there's so few things that it actually does, right?
01:16:29.440 | What I actually want to be able to do is create like any kind of arbitrary combination of
01:16:35.000 | any axes and any operations and any reductions I like in any order in the actual language
01:16:40.680 | I'm writing in, right?
01:16:42.800 | So that's actually what APL does.
01:16:46.780 | That's actually what J and K do.
01:16:48.840 | These are the J and K are the languages that kind of came out of APL.
01:16:52.200 | This is a kind of a series of languages that have been around for about 60 years and everybody's
01:16:57.600 | pretty much failed to notice.
01:17:01.680 | My hope is that things like Swift and Julia will give us this, like the ability to actually
01:17:09.720 | write stuff in actual Swift and actual Julia that we can run in an actual debugger and
01:17:14.680 | use an actual profiler and do arbitrary stuff that's really fast.
01:17:20.360 | But actually, Swift seems like it might go even quite a bit faster than einsum in an
01:17:28.480 | even more flexible way, thanks to this new compiler infrastructure called MLIR, which
01:17:33.560 | actually builds off this and really exciting new research in the compiler world, kind of
01:17:38.040 | been coming over the last few years, particularly coming out of a system called Halide, which
01:17:42.960 | is H-A-L-I-D-E, which is this super cool language that basically showed it's possible to create
01:17:49.840 | a language that can create very, very, very, like totally optimized linear algebra computations
01:17:57.840 | in a really flexible, convenient way.
01:18:00.520 | And since that came along, there's been all kinds of cool research using these techniques
01:18:07.320 | like something called polyhedral compilation, which kind of have the promise that we're
01:18:13.640 | going to be able to hopefully, within the next couple of years, write Swift code that
01:18:20.240 | runs as fast as the next thing I'm about to show you, because the next thing I'm about
01:18:24.360 | to show you is the PyTorch operation called matmul.
01:18:30.720 | And matmul takes 18 microseconds, which is 50,000 times faster than Python.
01:18:39.920 | Why is it so fast?
01:18:41.400 | Well, if you think about what you're doing when you do a matrix multiply of something
01:18:45.800 | that's like 50,000 by 784 by 784 by 10, these are things that aren't going to fit in the
01:18:56.280 | cache in your CPU.
01:18:58.200 | So if you do the kind of standard thing of going down all the rows and across all the
01:19:01.400 | columns, by the time you've got to the end and you go back to exactly the same column
01:19:05.360 | again, it forgot the contents and has to go back to RAM and pull it in again.
01:19:10.240 | So if you're smart, what you do is you break your matrix up into little smaller matrices
01:19:14.540 | and you do a little bit at a time.
01:19:16.280 | And that way, everything is kind of in cache and it goes super fast.
01:19:19.440 | Now, normally, to do that, you have to write kind of assembly language code, particularly
01:19:25.360 | if you want to kind of get it all running in your vector processor.
01:19:29.200 | And that's how you get these 18 microseconds.
01:19:32.300 | So currently, to get a fast matrix multiply, things like PyTorch, they don't even write
01:19:37.540 | it themselves, they basically push that off to something called a BLAS, B-L-A-S, a BLAS
01:19:42.920 | is a Basic Linear Algebra Subprograms Library, where companies like Intel and AMD and NVIDIA
01:19:51.120 | write these things for you.
01:19:52.720 | So you can look up cuBLAS, for example, and this is like NVIDIA's version of BLAS.
01:19:57.920 | Or you could look up MKL and this is Intel's version of BLAS and so forth.
01:20:04.640 | And this is kind of awful because, you know, the programmer is limited to this like subset
01:20:11.480 | of things that your BLAS can handle.
01:20:15.800 | And to use it, you don't really get to write it in Python, you kind of have to write the
01:20:21.200 | one thing that happens to be turned into that pre-existing BLAS call.
01:20:26.000 | So this is kind of why we need to do better, right?
01:20:29.560 | And there are people working on this and there are people actually in Chris Latner's team
01:20:35.600 | working on this.
01:20:36.600 | You know, there's some really cool stuff like there's something called Tensor Comprehensions,
01:20:41.720 | which is like really originally came in PyTorch, and I think they're now inside Chris's team
01:20:47.040 | at Google, where people are basically saying, hey, here are ways to like compile these much
01:20:52.360 | more general things.
01:20:53.360 | And this is what we want as more advanced practitioners.
01:20:57.840 | Anyway, for now, in PyTorch world, we're stuck at this level, which is to recognize that there
01:21:05.680 | are some things where this is, you know, three times faster than the best we can do in an even
01:21:13.520 | vaguely flexible way.
01:21:15.920 | And if we compare it to the actually flexible way, which is broadcasting, we had 254, yeah,
01:21:26.680 | so still over 10 times better, right?
01:21:30.280 | So wherever possible today, we want to use operations that are predefined in our library,
01:21:37.160 | particularly for things that kind of operate over lots of rows and columns, the things
01:21:41.480 | we're kind of dealing with this memory caching stuff is going to be complicated.
01:21:46.600 | So keep an eye out for that.
01:21:49.280 | Matrix multiplication is so common and useful that it's actually got its own operator, which
01:21:55.000 | is at.
01:21:56.240 | These are actually calling the exact same code.
01:21:58.160 | So they're the exact same speed.
01:22:02.360 | At is not actually just matrix multiplication; at covers a much broader array of kind of
01:22:10.440 | tensor reductions across different levels of axes.
01:22:15.120 | So it's worth checking out what matmul can do, because often it'll be able to handle
01:22:19.800 | things like batch-wise or matrix versus vector; don't think of it as being only something
01:22:25.760 | that can do rank two by rank two, because it's a little bit more flexible.
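A quick sketch of that flexibility, with stand-in shapes: `@` and `matmul` call the same code, and both handle matrix-vector and batched cases as well as rank 2 by rank 2.

```python
import torch

a = torch.randn(5, 4)
b = torch.randn(4, 3)
v = torch.randn(4)

assert torch.allclose(a.matmul(b), a @ b)   # @ is the same operation as matmul
print((a @ v).shape)                        # matrix @ vector -> torch.Size([5])

xb = torch.randn(10, 5, 4)
yb = torch.randn(10, 4, 3)
print((xb @ yb).shape)                      # batched matmul -> torch.Size([10, 5, 3])
```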
01:22:30.560 | OK, so that's that we have matrix multiplication, and so now we're allowed to use it.
01:22:39.000 | And so we're going to use it to try to create a forward pass, which means we first need
01:22:45.760 | value and matrix initialization, because remember, a model contains parameters which start out
01:22:53.680 | randomly initialized.
01:22:55.880 | And then we use the gradients to gradually update them with SGD.
01:23:00.400 | So let's do that.
01:23:07.720 | So here is notebook 02.
01:23:11.600 | So let's start by importing nb_01, and I just copied and pasted the three lines we used
01:23:18.560 | to grab the data, and I'm just going to pop them into a function so we can use it to grab
01:23:22.160 | MNIST when we need it.
01:23:24.680 | And now that we know about broadcasting, let's create a normalization function that takes
01:23:29.760 | our tensor and subtracts the means and divides by the standard deviation.
01:23:35.800 | So now let's grab our data, OK, and pop it into x_train, y_train, x_valid, y_valid.
01:23:42.320 | Let's grab the mean and standard deviation, and notice that they're not 0 and 1.
01:23:48.720 | Why would they be?
01:23:49.720 | Right?
01:23:50.720 | But we want them to be 0 and 1.
01:23:51.720 | And we're going to be seeing a lot of why we want them to be 0 and 1 over the next couple
01:23:56.880 | of lessons.
01:23:57.880 | But for now, let's just take my word for it.
01:24:00.520 | We want them to be 0 and 1.
01:24:02.120 | So that means that we need to subtract the mean, divide by the standard deviation, but
01:24:07.680 | not for the validation set.
01:24:09.720 | We don't subtract the validation set's mean and divide by the validation set's standard
01:24:13.640 | deviation.
01:24:14.640 | Because if we did, those two data sets would be on totally different scales, right?
01:24:20.200 | So if the training set was mainly green frogs, and the validation set was mainly red frogs,
01:24:28.200 | right, then if we normalize with the validation sets mean and variance, we would end up with
01:24:34.880 | them both having the same average coloration, and we wouldn't be able to tell the two apart,
01:24:40.320 | right?
01:24:41.320 | So that's an important thing to remember when normalizing, is to always make sure your validation
01:24:45.440 | and training set are normalized in the same way.
01:24:48.920 | So after doing that, our mean is pretty close
01:25:00.800 | to 0, and our standard deviation is very close to 1, and it would be nice to have something
01:25:05.360 | to easily check that these are true.
01:25:07.000 | So let's create a test near 0 function, and then test that the mean is near 0, and 1 minus
01:25:13.200 | the standard deviation is near 0, and that's all good.
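A minimal sketch of the normalization and the test helper being described; the function names follow the notebook's style, and the random data here is just a stand-in for MNIST:

```python
import torch

def normalize(x, mean, std):
    return (x - mean) / std

def test_near_zero(a, tol=1e-3):
    assert a.abs() < tol, f"Near zero: {a}"

# Stand-in data just to show the flow; in the lesson this is the MNIST training
# and validation sets.
x_train = torch.randn(1000, 784) * 3 + 2
x_valid = torch.randn(200, 784) * 3 + 2

train_mean, train_std = x_train.mean(), x_train.std()
x_train = normalize(x_train, train_mean, train_std)
# The validation set is normalized with the *training* statistics.
x_valid = normalize(x_valid, train_mean, train_std)

test_near_zero(x_train.mean())
test_near_zero(1 - x_train.std())
```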
01:25:18.200 | Let's define N and M and C the same as before, so the size of the training set and the number
01:25:25.440 | of activations we're going to eventually need in our model being C, and let's try to create
01:25:31.120 | our model.
01:25:35.040 | Okay, so the model is going to have one hidden layer, and normally we would want the final
01:25:45.800 | output to have 10 activations, because we would use cross-entropy against those 10 activations,
01:25:52.360 | but to simplify things for now, we're going to not use cross-entropy, we're going to use
01:25:56.660 | mean squared error, which means we're going to have one activation, okay, which makes
01:26:01.040 | no sense from our modeling point of view, we'll fix that later, but just to simplify
01:26:04.200 | things for now.
01:26:05.500 | So let's create a simple neural net with a single hidden layer and a single output activation,
01:26:13.600 | which we're going to use mean squared error.
01:26:15.880 | So let's pick a hidden size, so the number of hidden units we'll make 50, okay, so for our two layers,
01:26:22.040 | we're going to need two weight matrices and two bias vectors.
01:26:26.240 | So here are our two weight matrices, W1 and W2, so they're random numbers, normal random
01:26:33.800 | numbers of size M, which is the number of columns, 784, by NH, number of hidden, and
01:26:40.560 | then this one is NH by 1.
01:26:45.500 | Now our inputs now are mean zero, standard deviation 1, the inputs to the first layer.
01:26:54.360 | We want the inputs to the second layer to also be mean zero, standard deviation 1.
01:27:00.680 | Well, how are we going to do that?
01:27:05.080 | Because if we just grab some normal random numbers and then we define a function called
01:27:14.160 | linear, this is our linear layer, which is X by W plus B, and then create T, which is
01:27:20.760 | the activation of that linear layer with our validation set and our weights and biases.
01:27:28.000 | We have a mean of minus 5 and a standard deviation of 27, which is terrible.
01:27:36.560 | So I'm going to let you work through this at home, but once you actually look at what
01:27:43.040 | happens when you multiply those things together and add them up, as you do in matrix multiplication,
01:27:49.160 | you'll see that you're not going to end up with 0, 1.
01:27:51.760 | But if instead you divide by square root m, so root 784, then it's actually damn good.
01:28:07.800 | So this is a simplified version of something which PyTorch calls Kaiming initialization,
01:28:15.200 | named after Kaiming He, who wrote a paper, or was the lead writer of a paper, that we're
01:28:20.720 | going to look at in a moment.
01:28:24.400 | So the weights, rand n gives you random numbers with a mean of 0 and a standard deviation
01:28:33.400 | of 1.
01:28:35.100 | So if you divide by root m, it will have a mean of 0 and a standard deviation of 1 over
01:28:40.440 | root m.
01:28:41.440 | So we can test this.
01:28:46.520 | So in general, normal random numbers of mean 0 and standard deviation of 1 over root of
01:28:57.000 | whatever this is, so here it's m and here it's nh, will give you an output of 0, 1.
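Roughly, the simplified initialization and linear layer being described look like this; the sizes and the random stand-in for the normalized input are assumptions:

```python
import math
import torch

m, nh = 784, 50                  # input size and hidden size
x = torch.randn(10000, m)        # stand-in for normalized inputs: mean 0, std 1

w1 = torch.randn(m, nh) / math.sqrt(m)   # scale the weights by 1/sqrt(m)
b1 = torch.zeros(nh)

def lin(x, w, b):
    return x @ w + b

t = lin(x, w1, b1)
print(t.mean(), t.std())         # roughly 0 and 1
```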
01:29:04.640 | Now this may seem like a pretty minor issue, but as we're going to see in the next couple
01:29:08.960 | of lessons, it's like the thing that matters when it comes to training neural nets.
01:29:14.760 | It's actually, in the last few months, people have really been noticing how important this is.
01:29:21.480 | There are things like fix-up initialization, where these folks actually trained a 10,000-layer
01:29:32.800 | deep neural network with no normalization layers, just by basically doing careful initialization.
01:29:42.420 | So it's really, people are really spending a lot of time now thinking like, okay, how
01:29:47.640 | we initialize things is really important.
01:29:49.800 | And you know, we've had a lot of success with things like one cycle training and super convergence,
01:29:56.480 | which is all about what happens in those first few iterations, and it really turns out that
01:30:02.760 | it's all about initializations.
01:30:04.980 | So we're going to be spending a lot of time studying this in depth.
01:30:09.960 | So the first thing I'm going to point out is that this is actually not how our first
01:30:15.360 | layer is defined.
01:30:17.520 | Our first layer is actually defined like this.
01:30:19.980 | It's got a ReLU on it.
01:30:22.000 | So first let's define ReLU.
01:30:23.920 | So ReLU is just grab our data and replace any negatives with zeros.
01:30:29.760 | That's all Clamp min means.
01:30:32.640 | Now there's lots of ways I could have written this.
01:30:35.200 | But if you can do it with something that's like a single function in PyTorch, it's almost
01:30:39.160 | always faster because that thing's generally written in C for you.
01:30:42.280 | So try to find the thing that's as close to what you want as possible.
01:30:45.840 | There's a lot of functions in PyTorch.
01:30:47.840 | So that's a good way of implementing ReLU.
01:30:51.880 | And unfortunately, that does not have a mean zero and standard deviation of one.
01:30:59.200 | Why not?
01:31:01.360 | Well, where's my stylus?
01:31:06.480 | Okay, so we had some data that had a mean of zero and a standard deviation of one.
01:31:23.240 | And then we took everything that was smaller than zero and removed it.
01:31:33.880 | So that obviously does not have a mean of zero and it obviously now has about half the
01:31:40.680 | standard deviation that it used to have.
01:31:44.680 | So this was one of the fantastic insights and one of the most extraordinary papers of
01:31:53.720 | the last few years.
01:31:55.440 | It was the paper from the 2015 ImageNet winners led by the person we've mentioned, Kaiming
01:32:04.280 | Kaiming at that time was at Microsoft Research.
01:32:07.720 | And this is full of great ideas.
01:32:12.400 | Reading papers from competition winners is a very, very good idea because they tend to
01:32:17.600 | be, you know, normal papers will have like one tiny tweak that they spend pages and pages
01:32:22.720 | trying to justify why they should be accepted into NeurIPS, whereas competition winners
01:32:27.520 | have 20 good ideas and only time to mention them in passing.
01:32:31.400 | This paper introduced us to ResNets, PReLU layers, and Kaiming initialization, amongst others.
01:32:39.800 | So here is section 2.2.
01:32:48.340 | Section 2.2, initialization of filter weights or rectifiers.
01:32:51.800 | What's a rectifier?
01:32:52.800 | A rectifier is a rectified linear unit or rectifier network is any neural network with
01:32:58.880 | rectifier linear units in it.
01:33:01.120 | This is only 2015, but it already reads like something from another age in so many ways.
01:33:07.060 | Like even the word rectifier units and traditional sigmoid activation networks, no one uses sigmoid
01:33:13.080 | activations anymore, you know.
01:33:15.160 | So a lot's changed since 2015.
01:33:16.820 | So when you read these papers, you kind of have to keep these things in mind.
01:33:19.840 | They describe how what happens if you train very deep models with more than eight layers.
01:33:29.480 | So things have changed, right?
01:33:31.360 | But anyway, they said that in the old days, people used to initialize these with random
01:33:36.100 | Gaussian distributions.
01:33:37.920 | So this is a Gaussian distribution.
01:33:39.960 | It's just a fancy word for normal or bell shaped.
01:33:44.840 | And when you do that, they tend to not train very well.
01:33:49.960 | And the reason why, they point out, or actually Glorot and Bengio pointed out.
01:33:55.120 | Let's look at that paper.
01:33:58.820 | So you'll see two initializations come up all the time.
01:34:01.600 | One is either Kaiming or He initialization, which is this one, or the other you'll see
01:34:05.840 | a lot is Glorot or Xavier initialization, again, named after Xavier Glorot.
01:34:14.960 | This is a really interesting paper to read.
01:34:16.640 | It's a slightly older one.
01:34:17.640 | It's from 2010.
01:34:18.640 | It's been massively influential.
01:34:20.760 | And one of the things you'll notice if you read it is it's very readable.
01:34:27.120 | It's very practical.
01:34:29.120 | And the actual final result they come up with is it's incredibly simple.
01:34:35.200 | And we're actually going to be re-implementing much of the stuff in this paper over the next
01:34:39.520 | couple of lessons.
01:34:41.440 | But basically, they describe one suggestion for how to initialize neural nets.
01:34:52.480 | And they suggest this particular approach, which is root six over the root of the number
01:35:00.360 | of input filters plus the number of output filters.
01:35:05.720 | And so what happened was Kaiming He and that team pointed out that that does not account
01:35:12.760 | for the impact of a ReLU, the thing that we just noticed.
01:35:18.840 | So this is a big problem.
01:35:20.400 | If your variance halves each layer and you have a massive deep network with like eight
01:35:27.600 | layers, then you've got a one over two to the eight squish.
01:35:32.200 | Like by the end, it's all gone.
01:35:33.880 | And if you want to be fancy like the fix up people with 10,000 layers, forget it, right?
01:35:41.200 | Your gradients have totally disappeared.
01:35:43.240 | So this is totally unacceptable.
01:35:45.360 | So they do something super genius smart.
01:35:49.080 | They replace the one on the top with a two on the top.
01:35:53.160 | So this, which is not to take anything away from this, it's a fantastic paper, right?
01:35:57.800 | But in the end, the thing they do is to stick a two on the top.
01:36:02.400 | So we can do that by taking that exact equation we just used and sticking a two on the top.
01:36:09.560 | And if we do, then the result is much closer.
01:36:16.120 | It's not perfect, right, but it actually varies quite a lot.
01:36:19.200 | It's really random.
01:36:20.440 | Sometimes it's quite close.
01:36:21.440 | Sometimes it's further away, but it's certainly a lot better than it was.
01:36:25.040 | So that's good.
01:36:27.560 | And it's really worth reading.
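Here is a minimal sketch of the "two on the top" fix just described, comparing the plain 1/sqrt(m) scaling with the He/Kaiming sqrt(2/m) scaling after a ReLU; sizes and data are stand-ins:

```python
import math
import torch

def relu(x):
    return x.clamp_min(0.)

m, nh = 784, 50
x = torch.randn(10000, m)

# With 1/sqrt(m), the ReLU roughly halves the variance at each layer...
w1 = torch.randn(m, nh) / math.sqrt(m)
print(relu(x @ w1).std())            # noticeably below 1

# ...so Kaiming init puts a 2 on top: sqrt(2/m).
w1 = torch.randn(m, nh) * math.sqrt(2 / m)
print(relu(x @ w1).std())            # much closer to 1 (the mean is still about 0.5)
```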
01:36:32.160 | So your homework for this week is to read 2.2 of the ResNet paper.
01:36:41.240 | And what you'll see is that they describe what happens in the forward pass of a neural
01:36:48.000 | And they point out that for the conv layer, this is the response, Y equals WX plus B.
01:36:53.040 | Now if you're concentrating, that might be confusing because a conv layer isn't quite
01:36:59.480 | Y equals WX plus B. A conv layer has a convolution.
01:37:02.880 | But you remember in part one, I pointed out this neat article from Matt Clinesmith where
01:37:08.800 | he showed that CNNs in convolutions actually are just matrix multiplications with a bunch
01:37:15.840 | of zeros and some tied weights.
01:37:18.360 | So this is basically all they're saying here.
01:37:20.740 | So sometimes there are these kind of like throwaway lines in papers that are actually
01:37:24.840 | quite deep and worth thinking about.
01:37:27.200 | So they point out that you can just think of this as a linear layer.
01:37:30.800 | And then they basically take you through step by step what happens to the variance of your
01:37:38.080 | network depending on the initialization.
01:37:41.820 | And so just try to get to this point here, get as far as backward propagation case.
01:37:46.400 | So you've got about, I don't know, six paragraphs to read.
01:37:51.720 | None of the math notation is weird.
01:37:54.520 | Maybe this one is if you haven't seen this before.
01:37:56.740 | This is exactly the same as sigma, but instead of doing a sum, you do a product.
01:38:02.120 | So this is a great way to kind of warm up your paper reading muscles is to try and read
01:38:05.760 | this section.
01:38:08.480 | And then if that's going well, you can keep going with the backward propagation case because
01:38:13.840 | the forward pass does a matrix multiply.
01:38:17.440 | And as we'll see in a moment, the backward pass does a matrix multiply with a transpose
01:38:21.640 | of the matrix.
01:38:22.800 | So the backward pass is slightly different, but it's nearly the same.
01:38:27.200 | And so then at the end of that, they will eventually come up with their suggestion.
01:38:35.720 | Let's see if we can find it.
01:38:37.360 | Oh yeah, here it is.
01:38:40.880 | They suggest root two over nL, where nL is the number of input activations.
01:38:51.240 | Okay.
01:38:53.240 | So that's what we're using.
01:38:55.900 | That is called Kaiming initialization, and it gives us a pretty nice variance.
01:39:00.500 | It doesn't give us a very nice mean though.
01:39:03.600 | And the reason it doesn't give us a very nice mean is because as we saw, we deleted everything
01:39:07.020 | below the axis.
01:39:09.280 | So naturally, our mean is now half, not zero.
01:39:14.680 | I haven't seen anybody talk about this in the literature, but something I was just trying
01:39:20.480 | over the last week is something kind of obvious, which is to replace ReLU with not just x.clamp_min(0.),
01:39:28.280 | but x.clamp_min(0.) minus 0.5.
01:39:31.440 | And in my brief experiments, that seems to help.
01:39:35.160 | So there's another thing that you could try out and see if it actually helps or if I'm
01:39:38.600 | just imagining things.
01:39:40.240 | It certainly returns you to the correct mean.
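A sketch of that shifted-ReLU experiment; this is the tweak being tried out here, not an established technique, and the sizes are assumptions:

```python
import math
import torch

def shifted_relu(x):
    # The experiment described above: clamp, then subtract 0.5 to re-centre the mean.
    return x.clamp_min(0.) - 0.5

m, nh = 784, 50
x = torch.randn(10000, m)
w1 = torch.randn(m, nh) * math.sqrt(2 / m)

t = shifted_relu(x @ w1)
print(t.mean(), t.std())   # mean back near 0; std also somewhat closer to 1
```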
01:39:44.760 | Okay, so now that we have this formula, we can replace it with init.kaiming_normal_ according
01:39:54.720 | to our rules, because it's the same thing.
01:39:57.720 | And let's check that it does the same thing, and it does, okay?
01:40:06.840 | So again, we've got this about half mean and bit under one standard deviation.
01:40:12.960 | You'll notice here I had to add something extra, which is mode equals fan out.
01:40:18.200 | What does that mean?
01:40:22.440 | What it means is explained here, fan in or fan out, fan in preserves the magnitude of
01:40:30.360 | variance in the forward pass, fan out preserves the magnitudes in the backward pass.
01:40:35.120 | Basically, all it's saying is, are you dividing by root m or root nh?
01:40:42.680 | Because if you divide by root m, as you'll see in that part of the paper I was suggesting
01:40:46.760 | you read, that will keep the variance at one during the forward pass.
01:40:50.920 | But if you use nh, it will give you the right unit variance in the backward pass.
01:40:57.240 | So it's weird that I had to say fan out, because according to the documentation, that's for
01:41:01.080 | the backward pass to keep the unit variance.
01:41:05.680 | So why did I need that?
01:41:08.080 | Well, it's because our weight shape is 784 by 50, but if you actually create a linear
01:41:18.920 | layer with PyTorch of the same dimensions, it creates it of 50 by 784.
01:41:27.040 | It's the opposite.
01:41:29.120 | So how can that possibly work?
01:41:30.600 | And these are the kind of things that it's useful to know how to dig into.
01:41:33.840 | So how is this working?
01:41:37.500 | So to find out how it's working, you have to look in the source code.
01:41:40.640 | So you can either set up Visual Studio code or something like that and kind of set it
01:41:44.760 | up so you can jump between things.
01:41:46.880 | It's a nice way to do it.
01:41:48.320 | Or you can just do it here with question mark, question mark.
01:41:51.840 | And you can see that this is the forward function, and it calls something called f.linear.
01:41:59.240 | In PyTorch, capital F always refers to the torch.nn.functional module, because you like
01:42:07.680 | it's used everywhere, so they decided that's worth a single letter.
01:42:11.220 | So torch.nn.functional.linear is what it calls, and let's look at how that's defined.
01:42:18.320 | input.matmul(weight.t()), where t means transpose.
01:42:23.340 | So now we know in PyTorch, a linear layer doesn't just do a matrix product.
01:42:28.780 | It does a matrix product with a transpose.
01:42:32.080 | So in other words, it's actually going to turn this into 784 by 50 and then do it.
01:42:37.160 | And so that's why we kind of had to give it the opposite information when we were trying
01:42:41.800 | to do it with our linear layer, which doesn't have transpose.
01:42:45.800 | So the main reason I show you that is to kind of show you how you can dig in to the PyTorch
01:42:49.480 | source code, see exactly what's going on.
01:42:52.400 | And when you come across these kind of questions, you want to be able to answer them yourself.
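To make that concrete, here is a small sketch of the point being made: nn.Linear stores its weight as (out_features, in_features) and F.linear does a transposed matmul, which is why a weight stored the other way round, as ours is, needs mode='fan_out' in kaiming_normal_. The shapes are assumptions.

```python
import torch
from torch.nn import init
import torch.nn.functional as F

m, nh = 784, 50

# Our weight is stored as (in_features, out_features) and used as x @ w,
# so we ask for fan_out to get the scaling we want for the forward pass.
w1 = torch.zeros(m, nh)
init.kaiming_normal_(w1, mode='fan_out')

# nn.Linear stores the transpose and F.linear undoes it at call time.
lin_layer = torch.nn.Linear(m, nh)
print(lin_layer.weight.shape)              # torch.Size([50, 784])

x = torch.randn(3, m)
out = F.linear(x, lin_layer.weight, lin_layer.bias)   # x @ weight.t() + bias
assert torch.allclose(out, x @ lin_layer.weight.t() + lin_layer.bias, atol=1e-6)
```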
01:42:57.580 | Which also then leads to the question, if this is how linear layers can be initialized,
01:43:03.360 | what about convolutional layers?
01:43:05.400 | What does PyTorch do for convolutional layers?
01:43:08.300 | So we could look inside torch.nn.Conv2d, and when I looked at it, I noticed that it basically
01:43:14.480 | doesn't have any code.
01:43:15.480 | It just has documentation.
01:43:17.420 | All of the code actually gets passed down to something called _ConvNd.
01:43:22.480 | And so you need to know how to find these things.
01:43:26.160 | And so if you go to the very bottom, you can find the file name it's in.
01:43:30.240 | And so you see this is actually torch.nn.modules.conv.
01:43:34.840 | So we can find torch.nn.modules.conv._ConvNd.
01:43:41.080 | And so here it is.
01:43:42.920 | And here's how it initializes things.
01:43:45.740 | And it calls kaiming_uniform_, which is basically the same as kaiming_normal_, but it's uniform
01:43:50.800 | instead.
01:43:52.260 | But it has a special multiplier of math.sqrt(5).
01:43:57.440 | And that is not documented anywhere.
01:43:59.240 | I have no idea where it comes from.
01:44:01.760 | And in my experiments, this seems to work pretty badly, as you'll see.
01:44:08.600 | So it's kind of useful to look inside the code.
01:44:11.040 | And when you're writing your own code, presumably somebody put this here for a reason.
01:44:15.440 | Wouldn't it have been nice if they had a URL above it with a link to the paper that they're
01:44:19.200 | implementing so we could see what's going on?
01:44:22.720 | So it's always a good idea, always to put some comments in your code to let the next
01:44:26.520 | person know what the hell are you doing?
01:44:29.600 | So that particular thing, I have a strong feeling, isn't great, as you'll see.
01:44:36.000 | So we're going to try this thing.
01:44:38.960 | It's subtracting 0.5 from our ReLU.
01:44:41.320 | So this is pretty cool, right?
01:44:43.240 | We've already designed our own new activation function.
01:44:48.960 | Is it great?
01:44:49.960 | Is it terrible?
01:44:50.960 | I don't know.
01:44:51.960 | But it's this kind of level of tweak, which is kind of-- when people write papers, this
01:44:57.840 | is the level of-- it's like a minor change to one line of code.
01:45:00.960 | It'll be interesting to see how much it helps.
01:45:03.400 | But if I use it, then you can see here, yep, now I have a mean of 0 thereabouts.
01:45:12.080 | And interestingly, I've also noticed it helps my variance a lot.
01:45:15.760 | All of my variance, remember, was generally around 0.7 to 0.8.
01:45:19.200 | But now it's generally above 0.8.
01:45:21.640 | So it helps both, which makes sense as to why I think I'm seeing these better results.
01:45:28.720 | So now we have ReLU.
01:45:30.860 | We have a linear.
01:45:32.640 | We have init.
01:45:34.000 | So we can do a forward pass.
01:45:37.920 | So we're now up to here.
01:45:42.920 | And so here it is.
01:45:44.160 | And remember, in PyTorch, a model can just be a function.
01:45:47.660 | And so here's our model.
01:45:48.800 | It's just a function that does one linear layer, one ReLU layer, and one more linear
01:45:54.260 | layer.
01:45:56.680 | And let's try running it.
01:45:58.040 | And OK, it takes eight milliseconds to run the model on the validation set.
01:46:03.000 | So it's plenty fast enough to train.
01:46:06.200 | It's looking good.
01:46:09.160 | Add an assert to make sure the shape seems sensible.
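A sketch of the model-as-a-function forward pass being described, with the initialization from earlier; the validation data here is a random stand-in:

```python
import math
import torch

def lin(x, w, b): return x @ w + b
def relu(x):      return x.clamp_min(0.) - 0.5

m, nh = 784, 50
x_valid = torch.randn(10000, m)             # stand-in for the normalized validation set

w1 = torch.randn(m, nh) * math.sqrt(2 / m); b1 = torch.zeros(nh)
w2 = torch.randn(nh, 1) / math.sqrt(nh);    b2 = torch.zeros(1)

def model(xb):
    l1 = lin(xb, w1, b1)    # first linear layer
    l2 = relu(l1)           # ReLU
    return lin(l2, w2, b2)  # second linear layer, one output activation

preds = model(x_valid)
assert preds.shape == torch.Size([x_valid.shape[0], 1])
```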
01:46:13.700 | So the next thing we need for our forward pass is a loss function.
01:46:18.080 | And as I said, we're going to simplify things for now by using mean squared error, even
01:46:23.400 | though that's obviously a dumb idea.
01:46:26.240 | Our model is returning something of size 10,000 by 1.
01:46:31.720 | But mean squared error, you would expect it just to be a single vector of size 10,000.
01:46:37.680 | So I want to get rid of this unit axis.
01:46:40.640 | In PyTorch, the thing to add a unit axis we've learned is called squeeze-- sorry, unsqueeze.
01:46:47.120 | The thing to get rid of a unit axis, therefore, is called squeeze.
01:46:49.760 | So we just go output.squeeze to get rid of that unit axis.
01:46:55.300 | But actually, now I think about it-- this is lazy.
01:46:59.400 | Because output.squeeze gets rid of all unit axes.
01:47:03.080 | And we very commonly see on the fastAO forums people saying that their code's broken.
01:47:08.260 | And it's when they've got squeeze.
01:47:11.020 | And it's that one case where maybe they had a batch size of size 1.
01:47:15.080 | And so that 1,1 would get squeezed down to a scalar.
01:47:19.240 | And things would break.
01:47:20.380 | So rather than just calling squeeze, it's actually better to say which dimension you
01:47:23.800 | want to squeeze, which we could write either 1 or minus 1, it would be the same thing.
01:47:28.320 | And this is going to be more resilient now to that weird edge case of a batch size of
01:47:32.740 | size 1.
01:47:33.880 | OK, so output minus target, squared, mean-- that's mean squared error.
01:47:41.000 | So remember, in PyTorch, loss functions can just be functions.
01:47:46.560 | For mean squared error, we're going to have to make sure these are floats.
01:47:48.920 | So let's convert them.
01:47:50.340 | So now we can calculate some predictions.
01:47:53.400 | That's the shape of our predictions.
01:47:55.240 | And we can calculate our mean squared error.
01:47:57.560 | So there we go.
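For reference, a sketch of that loss function with the explicit squeeze dimension; the predictions and targets below are stand-ins:

```python
import torch

def mse(output, targ):
    # squeeze(-1) rather than squeeze(): only drop the trailing unit axis,
    # so a batch of size 1 is not accidentally collapsed to a scalar.
    return (output.squeeze(-1) - targ).pow(2).mean()

preds = torch.randn(10000, 1)
y_valid = torch.randint(0, 10, (10000,)).float()   # targets converted to float
print(mse(preds, y_valid))
```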
01:47:58.560 | So we've done a forward pass.
01:48:01.440 | So we're up to here.
01:48:03.520 | A forward pass is useless.
01:48:05.520 | What we need is a backward pass, because that's the thing that tells us how to update our
01:48:09.680 | parameters.
01:48:11.200 | So we need gradients.
01:48:13.600 | OK, how much do you want to know about matrix calculus?
01:48:17.440 | I don't know.
01:48:19.160 | It's up to you.
01:48:20.160 | But if you want to know everything about matrix calculus, I can point you to this excellent
01:48:25.040 | paper by Terrence Parr and Jeremy Howard, which tells you everything about matrix calculus
01:48:34.320 | from scratch.
01:48:37.480 | So this is a few weeks work to get through, but it absolutely assumes nothing at all.
01:48:44.360 | So basically, Terrence and I both felt like, oh, we don't know any of this stuff.
01:48:52.560 | Let's learn all of it and tell other people.
01:48:55.080 | And so we wrote it with that in mind.
01:48:58.200 | And so this will take you all the way up to knowing everything that you need for deep
01:49:03.920 | learning.
01:49:06.000 | You can actually get away with a lot less.
01:49:09.320 | But if you're here, maybe it's worth it.
01:49:12.560 | But I'll tell you what you do need to know.
01:49:15.040 | What you need to know is the chain rule.
01:49:20.520 | Because let me point something out.
01:49:26.480 | We start with some input.
01:49:33.920 | We start with some input.
01:49:36.320 | And we stick it through the first linear layer.
01:49:40.260 | And then we stick it through ReLU.
01:49:42.960 | And then we stick it through the second linear layer.
01:49:46.120 | And then we stick it through MSE.
01:49:49.160 | And that gives us our predictions.
01:49:51.920 | Or to put it another way, we start with x.
01:50:00.840 | And we put it through the function lin1.
01:50:05.040 | And then we take the output of that, and we put it through the function ReLU.
01:50:10.080 | And then we take the output of that, and we put it through the function lin2.
01:50:14.920 | And then we take the output of that, and we put it through the function MSE.
01:50:20.380 | And strictly speaking, MSE has a second argument, which is the actual target value.
01:50:30.840 | And we want the gradient of the output with respect to the input.
01:50:40.240 | So it's a function of a function of a function of a function of a function.
01:50:43.480 | So if we simplify that down a bit, we could just say, what if it's just like y equals
01:50:49.680 | f of x-- sorry, y equals f of u and u equals f of x.
01:51:00.640 | So that's like a function of a function.
01:51:02.480 | Simplify it a little bit.
01:51:04.160 | And the derivative is that.
01:51:13.480 | That's the chain rule.
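Written out, the rule being referred to is just this (with g for the inner function, to avoid reusing f):

```latex
% If y = f(u) and u = g(x), then
\frac{dy}{dx} \,=\, \frac{dy}{du}\cdot\frac{du}{dx}
```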
01:51:16.360 | If that doesn't look familiar to you, or you've forgotten it, go to Khan Academy.
01:51:20.680 | Khan Academy has some great tutorials on the chain rule.
01:51:23.920 | But this is actually the thing we need to know.
01:51:26.440 | Because once you know that, then all you need to know is the derivative of each bit on its
01:51:32.040 | own, and you just multiply them all together.
01:51:37.400 | And if you ever forget the chain rule, just cross-multiply.
01:51:41.960 | So that would be dy/du times du/dx; cross out the du's, and you get dy/dx.
01:51:50.680 | And if you went to a fancy school, they would have told you not to do that.
01:51:55.400 | They said you can't treat calculus like this, because they're special magic small things.
01:52:03.040 | Actually you can.
01:52:04.840 | There's actually a different way of treating calculus called the calculus of infinitesimals,
01:52:10.720 | where all of this just makes sense.
01:52:12.540 | And you suddenly realize you actually can do this exact thing.
01:52:17.580 | So any time you see a derivative, just remember that all it's actually doing is it's taking
01:52:25.640 | some function, and it's saying, as you go across a little bit, how much do you go up?
01:52:34.600 | And that it's dividing that change in y divided by that change in x.
01:52:41.240 | That's literally what it is, where y and x, you must make them small numbers.
01:52:46.160 | So they behave very sensibly when you just think of them as a small change in y over
01:52:52.760 | a small change in x, as I just did, showing you the chain rule.
01:52:57.280 | So to do the chain rule, we're going to have to start with the very last function.
01:53:04.720 | The very last function on the outside was the loss function, mean squared error.
01:53:09.800 | So we just do each bit separately.
01:53:12.600 | So the gradient of the loss with respect to output of previous layer.
01:53:26.000 | So the output of the previous layer, the MSE is just input minus target squared.
01:53:34.740 | And so the derivative of that is just 2 times input minus target, because the derivative
01:53:38.840 | of blah squared is 2 times blah.
01:53:42.920 | So that's it.
01:53:44.680 | Now I need to store that gradient somewhere.
01:53:47.520 | Now the thing is that for the chain rule, I'm going to need to multiply all these things
01:53:51.760 | together.
01:53:53.280 | So if I store it inside the dot g attribute of the previous layer, because remember this
01:54:00.520 | is the previous layer, because the input of MSE is the same as
01:54:07.800 | the output of the previous layer.
01:54:10.640 | So if I store it away in here, I can then quite comfortably refer to it.
01:54:16.600 | So here, look, ReLU, let's do ReLU.
01:54:22.960 | So ReLU is this, okay, what's the gradient there?
01:54:33.240 | What's the gradient there?
01:54:37.400 | So therefore, that's the gradient of the ReLU.
01:54:42.480 | It's just imp greater than 0.
01:54:45.280 | But we need the chain rule, so we need to multiply this by the gradient of the next layer, which
01:54:53.520 | remember we store it away.
01:54:56.900 | So we can just grab it.
01:54:59.100 | So this is really cool.
01:55:01.880 | So same thing for the linear layer, the gradient is simply, and this is where the matrix calculus
01:55:08.000 | comes in, the gradient of a matrix product is simply the matrix product with the transpose.
01:55:13.980 | So you can either read all that stuff I showed you, or you can take my word for it.
01:55:19.520 | So here's the cool thing, right?
01:55:22.640 | Here's the function which does the forward pass that we've already seen, and then it
01:55:28.640 | goes backwards.
01:55:29.640 | It calls each of the gradients backwards, right, in reverse order, because we know we
01:55:33.080 | need that for the chain rule.
01:55:35.300 | And you can notice that every time we're passing in the result of the forward pass, and it
01:55:42.340 | also has access, as we discussed, to the gradient of the next layer.
01:55:49.800 | This is called backpropagation, right?
01:55:52.280 | So when people say, as they love to do, backpropagation is not just the chain rule, they're basically
01:56:00.080 | lying to you.
01:56:01.320 | Backpropagation is the chain rule, where we just save away all the intermediate calculations
01:56:06.240 | so we don't have to calculate them again.
01:56:09.560 | So this is a full forward and backward pass.
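A sketch of those gradient functions and the combined pass, storing each gradient in a .g attribute as described; the exact names follow the notebook's style and are an assumption:

```python
import torch

def mse_grad(inp, targ):
    # gradient of ((inp.squeeze(-1) - targ)**2).mean() with respect to inp
    inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

def relu_grad(inp, out):
    # gradient of relu times the next layer's gradient (chain rule)
    inp.g = (inp > 0).float() * out.g

def lin_grad(inp, out, w, b):
    # gradients of x @ w + b: matrix products with a transpose
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)

def forward_and_backward(inp, targ, w1, b1, w2, b2):
    # forward pass
    l1 = inp @ w1 + b1
    l2 = l1.clamp_min(0.)
    out = l2 @ w2 + b2
    loss = (out.squeeze(-1) - targ).pow(2).mean()  # computed, but not needed for the gradients

    # backward pass, in reverse order, reusing the saved intermediate results
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)
```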
01:56:18.280 | One interesting thing here is this value here, loss, this value here, loss, we never actually
01:56:26.980 | use it, because the loss never actually appears in the gradients.
01:56:33.980 | I mean, just by the way, you still probably want it to be able to print it out, or whatever,
01:56:38.800 | but it's actually not something that appears in the gradients.
01:56:42.400 | So that's it, so w1.g, w2.g, et cetera, they now contain all of our gradients, which we're
01:56:51.320 | going to use for the optimizer.
01:56:55.180 | And so let's cheat and use PyTorch autograd to check our results, because PyTorch can
01:57:02.000 | do this for us.
01:57:03.860 | So let's clone all of our weights and biases and input, and then turn on requires_grad
01:57:13.420 | for all of them.
01:57:14.420 | So requires_grad_ is how you take a PyTorch tensor and turn it into a magical autogradified
01:57:21.060 | PyTorch tensor.
01:57:22.460 | So what it's now going to do is everything that gets calculated with this tensor, it's
01:57:26.380 | basically going to keep track of what happened.
01:57:30.340 | So it basically keeps track of these steps, so that then it can do these things.
01:57:35.200 | It's not actually that magical, you could totally write it yourself, you just need to
01:57:39.460 | make sure that each time you do an operation, you remember what it is, and so then you can
01:57:44.760 | just go back through them in reverse order.
01:57:48.000 | Okay, so now that we've done requires_grad_, we can now just do the forward pass like so,
01:57:56.880 | and that gives us the loss; in PyTorch you say loss.backward(), and now we can test that, and remember PyTorch
01:58:03.080 | doesn't store things in .g, it stores them in .grad, and we can test them, and all of
01:58:09.380 | our gradients were correct, or at least they're the same as PyTorch's.
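The check against autograd might look something like this sketch, assuming `x_train` and `y_train` tensors and the hand-computed `.g` attributes from the pass above:

```python
import torch

# clone everything and ask PyTorch to track gradients on the clones
xt  = x_train.clone().requires_grad_(True)
w1a = w1.clone().requires_grad_(True); b1a = b1.clone().requires_grad_(True)
w2a = w2.clone().requires_grad_(True); b2a = b2.clone().requires_grad_(True)

def forward(inp, targ):
    # same forward pass, written with the autograd-tracked tensors
    l2 = (inp @ w1a + b1a).clamp_min(0.)
    return mse(l2 @ w2a + b2a, targ)

loss = forward(xt, y_train)
loss.backward()   # autograd fills in .grad on every requires_grad tensor

# PyTorch stores its gradients in .grad, ours are in .g
assert torch.allclose(w1a.grad, w1.g, atol=1e-3)
assert torch.allclose(w2a.grad, w2.g, atol=1e-3)
assert torch.allclose(b2a.grad, b2.g, atol=1e-3)
```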
01:58:15.120 | So that's pretty interesting, right, I mean that's an actual neural network that kind
01:58:22.840 | of contains all the main pieces that we're going to need, and we've written all these
01:58:28.520 | pieces from scratch, so there's nothing magical here, but let's do some cool refactoring.
01:58:35.960 | I really love this refactoring, and this is massively inspired by and very closely stolen
01:58:42.040 | from the PyTorch API, but it's kind of interesting, I didn't have the PyTorch API in mind as I
01:58:47.400 | did this, but as I kept refactoring, I kind of noticed like, oh, I just recreated the
01:58:52.920 | PyTorch API, that makes perfect sense.
01:58:55.500 | So let's take each of our layers, relu and linear, and create classes, right, and for
01:59:04.280 | the forward, let's use dunder call, now do you remember that dunder call (__call__) means that we
01:59:10.040 | can now treat this as if it was a function, right, so if you call this class just with
01:59:13.800 | parentheses, it calls this function.
01:59:17.260 | And let's save the input, let's save the output, and let's return the output, right,
01:59:25.120 | and then backward, do you remember this was our backward pass, okay, so it's exactly the
01:59:30.000 | same as we had before, okay, but we're going to save it inside self.inp.g, so this
01:59:35.880 | is exactly the same code as we had here, okay, but I've just moved the forward and backward
01:59:42.200 | into the same class, right?
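So the Relu class is roughly this sketch:

```python
class Relu():
    def __call__(self, inp):
        # save the input and the output so backward can get at them later
        self.inp = inp
        self.out = inp.clamp_min(0.)
        return self.out

    def backward(self):
        # same gradient as before, written onto the saved input's .g
        self.inp.g = (self.inp > 0.).float() * self.out.g
```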
01:59:45.240 | So here's linear, forward, exactly the same, but each time I'm saving the input, I'm saving
01:59:51.720 | the output, I'm returning the output, and then here's our backward.
02:00:02.560 | One thing to notice, the backward pass here, for linear, we don't just want the gradient
02:00:11.720 | of the outputs with respect to the inputs, we also need the gradient of the outputs with
02:00:17.640 | respect to the weights and of the outputs with respect to the biases, right, so that's why
02:00:21.560 | we've got three lots of dot g's going on here, okay, so there's our linear layer's forward
02:00:32.360 | and backward, and then we've got our mean squared error, okay, so there's our forward,
02:00:46.160 | and we'll save away both the input and the target for use later, and there's our gradient,
02:00:51.000 | again, same as before, two times input minus target.
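And the linear and MSE classes follow the same pattern; a sketch:

```python
class Lin():
    def __init__(self, w, b):
        self.w, self.b = w, b

    def __call__(self, inp):
        self.inp = inp
        self.out = inp @ self.w + self.b
        return self.out

    def backward(self):
        # three .g's: input, weights and bias
        self.inp.g = self.out.g @ self.w.t()
        self.w.g = (self.inp.unsqueeze(-1) * self.out.g.unsqueeze(1)).sum(0)
        self.b.g = self.out.g.sum(0)

class Mse():
    def __call__(self, inp, targ):
        self.inp, self.targ = inp, targ
        self.out = (inp.squeeze(-1) - targ).pow(2).mean()
        return self.out

    def backward(self):
        # same as before: 2 * (input - target), with the 1/n from the mean
        self.inp.g = 2. * (self.inp.squeeze(-1) - self.targ).unsqueeze(-1) / self.targ.shape[0]
```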
02:00:55.940 | So with this refactoring, we can now create our model, we can just say let's create a model
02:01:02.320 | class and create something called dot layers with a list of all of our layers, all right,
02:01:07.720 | notice I'm not using any PyTorch machinery, this is all from scratch, let's define loss
02:01:14.200 | and then let's define call, and it's going to go through each layer and say x equals
02:01:19.440 | l(x), so this is how I do that function composition, we're just calling the function on the result
02:01:24.880 | of the previous thing, okay, and then at the very end call self dot loss on that,
02:01:31.640 | and then for backward we do the exact opposite, we go self dot loss dot backward and then
02:01:36.240 | we go through the reversed layers and call backward on each one, right, and remember the
02:01:41.120 | backward passes are going to save the gradient away inside the dot g, so with that, let's
02:01:51.640 | just set all of our gradients to none so that we know we're not cheating, we can then create
02:01:55.600 | our model, right, this class Model, and call it, and we can call it as if it was a function
02:02:03.480 | because we have dunder call, right, so this is going to call dunder call, and then we
02:02:10.720 | can call backward, and then we can check that our gradients are correct, right, so that's nice.
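The model class, and the way it gets called, is roughly this (assuming the Relu, Lin and Mse classes sketched above, the weight tensors w1, b1, w2, b2, and x_train/y_train data):

```python
class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
        self.loss = Mse()

    def __call__(self, x, targ):
        # function composition: feed each layer the previous layer's output
        for l in self.layers:
            x = l(x)
        return self.loss(x, targ)

    def backward(self):
        # reverse order for the chain rule; each backward writes into .g
        self.loss.backward()
        for l in reversed(self.layers):
            l.backward()

model = Model(w1, b1, w2, b2)
loss = model(x_train, y_train)   # works because of __call__
model.backward()
# the gradients are now sitting in w1.g, b1.g, w2.g, b2.g
```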
02:02:17.080 | One thing that's not nice is, holy crap, that took a long time, let's run it, there
02:02:26.800 | we go, 3.4 seconds, so that was really, really slow, so we'll come back to that.
02:02:36.840 | I don't like duplicate code, and there's a lot of duplicate code here, self dot inp equals inp, return
02:02:42.040 | self dot out, that's messy, so let's get rid of it, so what we could do is we could create
02:02:48.520 | a new class called Module, which basically does the self dot inp equals inp, and return
02:02:54.340 | self dot out for us, and so now we're not going to use dunder call to implement our
02:03:01.240 | forward, we're going to have dunder call call something called self dot forward, which we will initially
02:03:06.140 | set to raise an exception saying not implemented, and backward is going to call self dot bwd, passing
02:03:14.160 | in the things that we just saved, and so now Relu has something called forward, which just
02:03:20.540 | has that, so we're now basically back to where we were, and backward just has that — look how neat that is.
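A sketch of that Module refactoring, with Relu redone on top of it:

```python
class Module():
    def __call__(self, *args):
        # save the inputs and the output, then delegate to the subclass's forward
        self.args = args
        self.out = self.forward(*args)
        return self.out

    def forward(self, *args):
        raise Exception('not implemented')

    def backward(self):
        # hand the saved output and inputs to the subclass's bwd
        self.bwd(self.out, *self.args)

class Relu(Module):
    def forward(self, inp):
        return inp.clamp_min(0.)

    def bwd(self, out, inp):
        inp.g = (inp > 0.).float() * out.g
```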
02:03:27.360 | And we also realized that this thing we were doing to calculate
02:03:39.880 | the derivative of the output of the linear layer with respect to the weights, where we're
02:03:46.840 | doing an unsqueeze and an unsqueeze, which is basically a big outer product and a sum,
02:03:51.840 | we could actually re-express that with einsum, okay, and when we do that, our code is
02:04:00.580 | now neater, and our 3.4 seconds is down to 143 milliseconds, okay, so thank you again to einsum.
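The einsum version of the weight gradient is a one-liner; here's a sketch of the linear layer rewritten on top of the Module base class from above:

```python
import torch

class Lin(Module):
    def __init__(self, w, b):
        self.w, self.b = w, b

    def forward(self, inp):
        return inp @ self.w + self.b

    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()
        # 'bi,bj->ij': outer product of each input row with each output-gradient
        # row, summed over the batch dimension b -- the same result as the
        # unsqueeze/unsqueeze/sum version, but much faster
        self.w.g = torch.einsum('bi,bj->ij', inp, out.g)
        self.b.g = out.g.sum(0)
```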
02:04:08.320 | So you'll see this now, look: model equals Model, loss equals the model called on the data, loss dot backward,
02:04:19.320 | and now the gradients are all there, that looks almost exactly like PyTorch, and so
02:04:25.780 | we can see why, why it's done this way, why do we have to inherit from nn dot module,
02:04:31.960 | why do we have to define forward, this is why, right, it lets PyTorch factor out all
02:04:37.600 | this duplicate stuff, so all we have to do is do the implementation, so I think that's
02:04:43.680 | pretty fun. And then we thought more about it, more like what are
02:04:49.840 | we doing with this einsum, and we actually realized that it's exactly the same as just
02:04:54.360 | doing the input transposed times the output gradient, so we replaced the einsum with a matrix product,
02:05:01.600 | and that's 140 milliseconds.
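In other words, all three forms compute the same [in_features, out_features] matrix; a quick self-contained check of the equivalence (the tensor sizes here are just hypothetical placeholders):

```python
import torch

inp   = torch.randn(64, 50)   # a batch of layer inputs
out_g = torch.randn(64, 1)    # the gradient saved on this layer's output

w_g_outer  = (inp.unsqueeze(-1) * out_g.unsqueeze(1)).sum(0)   # outer product + sum
w_g_einsum = torch.einsum('bi,bj->ij', inp, out_g)             # einsum
w_g_matmul = inp.t() @ out_g                                   # plain matrix product

assert torch.allclose(w_g_outer, w_g_einsum, atol=1e-5)
assert torch.allclose(w_g_einsum, w_g_matmul, atol=1e-5)
```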
02:05:10.520 | So now we've basically implemented nn dot linear and nn dot module, so let's now use nn dot linear and nn dot module, because we're allowed to,
02:05:17.140 | that's the rules, and their forward pass is almost exactly the same speed as our forward
02:05:21.760 | pass, and their backward pass is about twice as fast, I'm guessing that's because we're
02:05:29.360 | calculating all of the gradients, and they're not calculating all of them, only the ones
02:05:34.080 | they need, but it's basically the same thing.
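For comparison, a rough sketch of the same network written with nn.Module and nn.Linear; the layer sizes (784 inputs, 50 hidden units, 1 output) and the x_train/y_train names are just placeholders for whatever data is being used:

```python
import torch
from torch import nn

class Model(nn.Module):
    def __init__(self, n_in, n_hidden, n_out):
        super().__init__()
        # nn.Linear owns its own weights and biases; autograd handles backward
        self.layers = nn.ModuleList([nn.Linear(n_in, n_hidden),
                                     nn.ReLU(),
                                     nn.Linear(n_hidden, n_out)])
        self.loss = nn.MSELoss()

    def forward(self, x, targ):
        for l in self.layers:
            x = l(x)
        return self.loss(x.squeeze(-1), targ)

model = Model(784, 50, 1)
loss = model(x_train, y_train.float())   # MSE needs a float target
loss.backward()
# autograd now puts the gradients in each parameter's .grad
```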
02:05:42.960 | So at this point, we're ready in the next lesson to do a training loop. We have a multi-layer fully connected neural
02:05:52.320 | network, what their paper would call a rectified network, we have matrix multiply organized,
02:06:01.440 | we have our forward and backward passes organized, it's all nicely refactored out into classes
02:06:06.960 | and a module class, so in the next lesson, we will see how far we can get, hopefully
02:06:13.820 | we will build a high quality, fast ResNet, and we're also going to take a very deep dive
02:06:23.120 | into optimizers and callbacks and training loops and normalization methods. Any questions
02:06:31.200 | before we go? No? That's great. Okay, thanks everybody, see you on the forums.
02:06:37.440 | (audience applauds)