
Lesson 8 (2019) - Deep Learning from the Foundations


Chapters

0:00 Introduction
0:40 Overview
5:12 Bottom Up
6:47 Why Swift
10:50 Swift for TensorFlow
14:50 The Game
17:00 Why Do This
18:47 Homework
19:42 Remember Part 1
21:01 Three Steps to Training a Good Model
23:10 Reading Papers
25:25 Symbols
28:10 Jupyter Notebooks
32:56 Run Notebook
34:13 Notebook to Script
36:00 Standard Library
38:38 Plot
39:52 Matrix Multiplication
44:36 Removing Python
45:12 Element-wise Addition
46:02 Frobenius Norm
48:50 Recap
49:26 Replace Inner Loop
51:51 Broadcasting
57:05 Columns

Transcript

So, welcome back to part two of what previously was called Practical Deep Learning for Coders, but part two is not called that, as you will see. It's called Deep Learning from the Foundations. It's lesson eight because it's lesson eight of the full journey, lesson one of part two, or lesson eight, mod seven, as we sometimes call it.

So those of you, I know a lot of you do every year's course and keep coming back. For those of you doing that, this will not look at all familiar to you. It's a very different part two. We're really excited about it and hope you like it as well.

The basic idea of Deep Learning from the Foundations is that we are going to implement much of the fast AI library from the foundations. I'll talk about exactly what I mean by foundations in a moment, but it basically means from scratch. So we'll be looking at basic matrix calculus, and creating a training loop from scratch, and creating an optimizer from scratch, and lots of different layers and architectures and so forth, and not just to create some kind of dumbed-down library that's not useful for anything, but to actually build from scratch something you can train cutting-edge, world-class models with.

So that's the goal. We've never done it before. I don't think anybody's ever done this before. So I don't exactly know how far we'll get, but this is the journey that we're on. We'll see how we go. So in the process, we will be having to read and implement papers because the fast AI library is full of implemented papers.

So you're not going to be able to do this if you're not reading and implementing papers. Along the way, we'll be implementing much of PyTorch as well, as you'll see. We'll also be going deeper into solving some applications that are not kind of fully baked into the fast AI library yet, so it's going to require a lot of custom work.

So things like object detection, sequence-to-sequence with attention, the Transformer and the Transformer-XL, CycleGAN, audio, stuff like that. We'll also be doing a deeper dive into some performance considerations, like doing distributed multi-GPU training, using the new just-in-time compiler, which we'll just call JIT from now on, CUDA and C++, stuff like that.

So that's the first five lessons. And then the last two lessons, implementing some subset of that in Swift. So this is otherwise known as impractical deep learning for coders. Because really none of this is stuff that you're going to go and use right away. It's kind of the opposite of part one.

Part one was like, oh, we've been spending 20 minutes on this. You can now create a world-class vision classification model. This is not that, because you already know how to do that. And so back in the earlier years, part two used to be more of the same thing, but it was kind of like more advanced types of model, more advanced architectures.

But there's a couple of reasons we've changed this year. The first is so many papers come out now, because this whole area has increased in scale so quickly, that I can't pick out for you the 12 papers to do in the next seven weeks that you really need to know, because there's too many.

And it's also kind of pointless, because once you get into it, you realize that all the papers pretty much say minor variations on the same thing. So instead, what I want to be able to do is show you the foundations that let you read the 12 papers you care about and realize like, oh, that's just that thing with this minor tweak.

And I now have all the tools I need to implement that and test it and experiment with it. So that's kind of a really key issue in why we want to go in this direction. Also it's increasingly clear that, you know, we used to call part two cutting edge deep learning for coders, but it's increasingly clear that the cutting edge of deep learning is really about engineering, not about papers.

The difference between really effective people in deep learning and the rest is really about who can make things in code that work properly. And there's very few of those people. So really, the goal of this part two is to deepen your practice so you can understand, you know, the things that you care about, and build the things you care about, and have them work and perform at a reasonable speed.

So that's where we're trying to head. And so it's impractical in the sense that none of these are things that you're probably going to go and use straight away and say, here's this thing I built, right? Particularly Swift. Because with Swift, we're actually going to be learning a language and a library that, as you'll see, are far from ready for use, and I'll describe why we're doing that in a moment.

So part one of this course was top down, right? So that you got the context you needed to understand, you got the motivation you needed to keep going, and you got the results you needed to make it useful. But bottom up is useful too. And we started doing some bottom up at the end of part one, right?

But really bottom up lets you, when you've built everything from the bottom yourself, then you can see the connections between all the different things. You can see they're all variations of the same thing, you know? And then you can customize, rather than picking algorithm A or algorithm B, you create your own algorithm to solve your own problem doing just the things you need it to do.

And then you can make sure that it performs well, that you can debug it, profile it, maintain it, because you understand all of the pieces. So normally when people say bottom up in this world, in this field, they mean bottom up with math. I don't mean that. I mean bottom up with code, right?

So today, step one will be to implement matrix multiplication from scratch in Python. Because bottom up with code means that you can experiment really deeply on every part of every bit of the system. You can see exactly what's going in, exactly what's coming out, and you can figure out why your model's not training well, or why it's slow, or why it's giving the wrong answer, or whatever.

So why Swift? What are these two lessons about? And to be clear, we are only talking about the last two lessons, right? You know, our focus, as I'll describe, is still very much Python and PyTorch, right? But there's something very exciting going on. The first exciting thing is this guy's face you see here, Chris Lattner.

Chris is unique, as far as I know, as being somebody who has built, I think, what is the world's most widely used compiler framework, LLVM. He's built the default C and C++ compiler for Mac, being Clang. And he's built what's probably like the world's fastest growing fairly new computer language, being Swift.

And he's now dedicating his life to deep learning, right? So we haven't had somebody from that world come into our world before. And so when you actually look at stuff like the internals of something like TensorFlow, it looks like something that was built by a bunch of deep learning people, not by a bunch of compiler people, right?

And so I've been wanting for over 20 years for there to be a good numerical programming language that was built by somebody that really gets programming languages. And it's never happened, you know? So we've had, like, in the early days, LispStat in Lisp, and then it was R, and then it was Python.

None of these languages were built to be good at data analysis. They weren't built by people that really deeply understood compilers. They certainly weren't built for today's kind of modern, highly parallel processor situation we're in. But Swift was, Swift is, right? And so we've got this unique situation where for the first time, you know, a really widely used language, a really well-designed language from the ground up, is actually being targeted towards numeric programming and deep learning.

So there's no way I'm missing out on that boat. And I don't want you to miss out on it either. I should mention there's another language which you could possibly put in there, which is a language called Julia, which has maybe as much potential. But it's, you know, it's about ten times less used than Swift.

It doesn't have the same level of community. But I would still say it's super exciting. So I'd say, like, maybe there's two languages which you might want to seriously consider picking one and spending some time with it. Julia is actually further along. Swift is very early days in this world.

But that's one of the things I'm excited about for it. So I actually spent some time over the Christmas break kind of digging into numeric programming in Swift. And I was delighted to find that I could create code from scratch that was competitive with the fastest hand-tuned vendor linear algebra libraries, even though I am -- was and remain pretty incompetent at Swift.

I found it was a language that, you know, was really delightful. It was expressive. It was concise. But it was also very performant. And I could write everything in Swift, you know, rather than having to kind of get to some layer where it's like, oh, that's cuDNN now, or that's MKL now, or whatever.

So that got me pretty enthusiastic. And so the really exciting news, as I'm sure you've heard, is that Chris Lattner himself is going to come and join us for the last two lessons, and we're going to teach Swift for deep learning together. So Swift for deep learning means Swift for TensorFlow.

That's specifically the library that Chris and his team at Google are working on. We will call that S4TF when I write it down, because I can't be bothered typing Swift for TensorFlow every time. Swift for TensorFlow has some pros and cons. PyTorch has some pros and cons.

And interestingly, they're the opposite of each other. PyTorch and Python's pros are you can get stuff done right now with this amazing ecosystem, fantastic documentation and tutorials. You know, it's just a really great practical system for solving problems. And to be clear, Swift for TensorFlow is not. It's not any of those things right now, right?

It's really early. Almost nothing works. You have to learn a whole new language if you don't know Swift already. There's very little ecosystem. Now, I don't mean Swift itself so much as Swift for TensorFlow, Swift for deep learning, and even Swift for numeric programming: I was kind of surprised when I got into it to find there was hardly any documentation about Swift for numeric programming, even though I was pretty delighted by the experience.

People have had this view that Swift is kind of for iPhone programming. I guess that's kind of how it was marketed, right? But actually it's an incredibly well-designed, incredibly powerful language. And then TensorFlow, I mean, to be honest, I'm not a huge fan of TensorFlow in general. I mean, if I was, we wouldn't have switched away from it.

But it's getting a lot better. TensorFlow 2 is certainly improving. And the bits of it I particularly don't like are largely the bits that Swift for TensorFlow will avoid. But I think long-term, the kind of things I see happening, like there's this fantastic new kind of compiler project called MLIR, which Chris is also co-leading, which I think actually has the potential long-term to allow Swift to replace most of the yucky bits or maybe even all of the yucky bits of TensorFlow with stuff where Swift is actually talking directly to LLVM.

You'll be hearing a lot more about LLVM in the last two lessons. Basically, it's the compiler infrastructure that kind of everybody uses, that Julia uses, that Clang uses. And Swift is almost this thin layer on top of it, where when you write stuff in Swift, it's really easy for LLVM to compile it down to super-fast optimized code.

Which is like the opposite of Python. With Python, as you'll see today, we almost never actually write Python code. We write code in Python that gets turned into some other language or library, and that's what gets run. And this mismatch, this impedance mismatch between what I'm trying to write and what actually gets run makes it very hard to do the kind of deep dives that we're going to do in this course, as you'll see.

It's kind of a frustrating experience. So I'm excited about getting involved in these very early days for impractical deep learning in Swift for TensorFlow, because it means that me and those of you that want to follow along can be the pioneers in something that I think is going to take over this field.

We'll be the first in there. We'll be the ones that understand it really well. And in your portfolio, you can actually point at things and say, "That library that everybody uses? I wrote that. This piece of documentation that's on the Swift for TensorFlow website? I wrote that." That's the opportunity that you have.

So let's put that aside for the next five weeks. And let's try to create a really high bar for the Swift for TensorFlow team to have to try to re-implement before six weeks' time. We're going to try to implement as much of fast AI and many parts of PyTorch as we can and then see if the Swift for TensorFlow team can help us build that in Swift in five weeks' time.

So the goal is to recreate fast AI from the foundations, and much of PyTorch, like matrix multiplication, a lot of torch.nn, torch.optim, Dataset, DataLoader, from the foundations. And this is the game we're going to play. The game we're going to play is we're only allowed to use these bits.

We're allowed to use pure Python, anything in the Python standard library, any non-data science modules, so like a requests library for HTTP or whatever, we can use PyTorch but only for creating arrays, random number generation, and indexing into arrays. We can use the fastai.datasets library because that's the thing that has access to like MNIST and stuff, so we don't have to worry about writing our own HTTP stuff.

And we can use matplotlib. We don't have to write our own plotting library. That's it. That's the game. So we're going to try and recreate all of this from that. And then the rules are that each time we have replicated some piece of fastai or PyTorch from the foundations, we can then use the real version if we want to, okay?

So that's the game we're going to play. What I've discovered as I started doing that is that I started actually making things a lot better than fastai. So I'm now realizing that fastai version 1 is kind of a disappointment because there was a whole lot of things I could have done better.

And so you'll find the same thing. As you go along this journey, you'll find decisions that I made, or that the PyTorch team made, or whatever, where you think, what if they'd made a different decision there? And you can maybe come up with more examples of things that we could do differently, right?

So why would you do this? Well, the main reason is so that you can really experiment, right? So you can really understand what's going on in your models, what's really going on in your training. And you'll actually find that in the experiments that we're going to do in the next couple of classes, we're going to actually come up with some new insights.

If you can create something from scratch yourself, you know that you understand it. And then once you've created something from scratch and you really understand it, then you can tweak everything, right? But you suddenly realize that there's not this object detection system and this convnet architecture and that optimizer.

They're all like a kind of semi-arbitrary bunch of particular knobs and choices. And that it's pretty likely that your particular problem would want a different set of knobs and choices. So you can change all of these things. For those of you looking to contribute to open source, to fast AI or to PyTorch, you'll be able to, right?

Because you'll understand how it's all built up. You'll understand what bits are working well, which bits need help. You know how to contribute tests or documentation or new features or create your own libraries. And for those of you interested in going deeper into research, you'll be implementing papers, which means you'll be able to correlate the code that you're writing with the paper that you're reading.

And if you're a poor mathematician like I am, then you'll find that you'll be getting a much better understanding of papers that you might otherwise have thought were beyond you. And you realize that all those Greek symbols actually just map to pieces of code that you're already very familiar with.

So there are a lot of opportunities in part one to blog and to do interesting things, but the opportunities are much greater now. In part two, you can be doing homework that's actually at the cutting edge, actually doing experiments people haven't done before, making observations people haven't made before.

Because you're getting to the point where you're a more competent deep learning practitioner than the vast majority that are out there, and we're looking at stuff that other people haven't looked at before. So please try doing lots of experiments, particularly in your domain area, and consider writing things down, especially if it's not perfect.

So write stuff down for the you of six months ago. That's your audience. Okay, so I am going to be assuming that you remember the contents of part one, which was these things. Here's the contents of part one. In practice, it's very unlikely you remember all of these things because nobody's perfect.

So what I'm actually expecting you to do is as I'm going on about something which you're thinking I don't know what he's talking about, that you'll go back and watch the video about that thing. Don't just keep blasting forwards, because I'm assuming that you already know the content of part one.

Particularly if you're less confident about the second half of part one, where we went a little bit deeper into what's an activation, and what's a parameter really, and exactly how does SGD work. Particularly in today's lesson, I'm going to assume that you really get that stuff. So if you don't, then go back and re-look at those videos.

Go back to that SGD from scratch and take your time. I've kind of designed this course to keep most people busy up until the next course. So feel free to take your time and dig deeply. So the most important thing, though, is we're going to try and make sure that you can train really good models.

And there are three steps to training a really good model. Step one is to create something with way more capacity you need, and basically no regularization, and overfit. So overfit means what? It means that your training loss is lower than your validation loss? No. No, it doesn't mean that.

Remember, it doesn't mean that. A well-fit model will almost always have training loss lower than the validation loss. Remember that overfit means you have actually personally seen your validation error getting worse. Okay? Until you see that happening, you're not overfitting. So step one is overfit. And then step two is reduce overfitting.

And then step three, okay, there is no step three. Well, I guess step three is to visualize the inputs and outputs and stuff like that, right? That is to experiment and see what's going on. So one is pretty easy normally, right? Two is the hard bit. It's not really that hard, but it's basically these are the five things that you can do in order of priority.

If you can get more data, you should. If you can do more data augmentation, you should. If you can use a more generalizable architecture, you should. And then if all those things are done, then you can start adding regularization like drop-out, or weight decay, but remember, at that point, you're reducing the effective capacity of your model.

So it's less good than the first three things. And then last of all, reduce the architecture complexity. And most people, most beginners especially, start with reducing the complexity of the architecture, but that should be the last thing that you try. Unless your architecture is so complex that it's too slow for your problem, okay?

So that's kind of a summary of what we want to be able to do that we learned about in part one. Okay. So we're going to be reading papers, which we didn't really do in part one. And papers look something like this, which if you're anything like me, that's terrifying.

But I'm not going to lie, it's still the case that when I start looking at a new paper, every single time I think I'm not smart enough to understand this, I just can't get past that immediate reaction because I just look at this stuff and I just go, that's not something that I understand.

But then I remember, this is the Adam paper, and you've all seen Adam implemented in one cell of Microsoft Excel, right? When it actually comes down to it, every time I do get to the point where I understand if I've implemented a paper, I go, oh my God, that's all it is, right?

So a big part of reading papers, especially if you're less mathematically inclined than I am, is just getting past the fear of the Greek letters. I'll say something else about Greek letters. There are lots of them, right? And it's very hard to read something that you can't actually pronounce, right?

Because you're just saying to yourself, oh, squiggle bracket one plus squiggle one, G squiggle one minus squiggle. And it's like all the squiggles, you just get lost, right? So believe it or not, it actually really helps to go and learn the Greek alphabet so you can pronounce alpha times one plus beta one, right?

Whenever you can start talking to other people about it, you can actually read it out loud. It makes a big difference. So learn to pronounce the Greek letters. Note that the people that write these papers are generally not selected for their outstanding clarity of communication, right? So you will often find that there'll be a blog post or a tutorial that does a better job of explaining the concept than the paper does.

So don't be afraid to go and look for those as well, but do go back to the paper, right? Because in the end, the paper's the one that's hopefully got it mainly right. Okay. One of the tricky things about reading papers is the equations have symbols and you don't know what they mean and you can't Google for them.

So a couple of good resources: if you see symbols you don't recognize, Wikipedia has an excellent list of mathematical symbols page that you can scroll through. And even better, Detexify is a website where you can draw a symbol you don't recognize, and it uses the power of machine learning to find similar symbols.

There are lots of symbols that look a bit the same, so you will have to use some level of judgment, right? But the thing that it shows here is the LaTeX name and you can then Google for the LaTeX name to find out what that thing means. Okay. So let's start.

Here's what we're going to do over the next couple of lessons. We're going to try to create a pretty competent modern CNN model. And we actually already have this bit because we did that in the last course, right? We already have our layers for creating a ResNet. We actually got a pretty good result.

So we just have to do all these things, okay, to get us from here to here. This is just the next couple of lessons. After that we're going to go a lot further, right? So today we're going to try to get to at least the point where we've got the backward pass going, right?

So remember, we're going to build a model that takes an input array, and we're going to try and create a simple, fully connected network, right? So it's going to have one hidden layer. So we're going to start with some input, do a matrix multiply, do a ReLU, do a matrix multiply, do a loss function, okay?

And so that's a forward pass, and that'll tell us our loss. And then we will calculate the gradients of the loss with respect to the weights and biases, in order to basically multiply them by some learning rate, which we will then subtract from the parameters to get our new set of parameters.

And we'll repeat that lots of times. So to get to our fully connected backward pass, we will first of all need the fully connected forward pass, and the fully connected forward pass means we will need to have some initialized parameters, and we'll need a ReLU, and we will also need to be able to do matrix multiplication.
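As a rough sketch of where we're heading (all names, shapes, and the MSE stand-in loss here are placeholders, and I'm using PyTorch's autograd for the backward pass just to show the shape of the loop; the whole point of the lesson is that we'll build these pieces by hand):

```python
import torch

# Placeholder data and roughly-scaled random parameters, just to make the sketch runnable.
x, y = torch.randn(100, 784), torch.randn(100, 1)
w1 = (torch.randn(784, 50) / 784**0.5).requires_grad_()
b1 = torch.zeros(50, requires_grad=True)
w2 = (torch.randn(50, 1) / 50**0.5).requires_grad_()
b2 = torch.zeros(1, requires_grad=True)
lr = 0.1

def forward(x):
    l1 = x @ w1 + b1        # linear layer: matrix multiply plus bias
    l2 = l1.clamp_min(0.)   # ReLU
    return l2 @ w2 + b2     # second linear layer

for epoch in range(10):
    loss = ((forward(x) - y) ** 2).mean()   # forward pass, with a stand-in MSE loss
    loss.backward()                          # backward pass: gradients w.r.t. the parameters
    with torch.no_grad():
        for p in (w1, b1, w2, b2):
            p -= lr * p.grad                 # subtract learning rate times gradient
            p.grad.zero_()
```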

So let's start there. So let's start with the 00_exports notebook. And what I'm showing you here is how I'm going to go about building up our library in Jupyter notebooks. A lot of very smart people have assured me that it is impossible to do effective library development in Jupyter notebooks, which is a shame because I've built a library in Jupyter notebooks.

So anyway, people will often tell you things are impossible, but I will tell you my point of view, which is that I've been programming for over 30 years and in the time I've been using Jupyter notebooks to do my development, I would guess I'm about two to three times more productive.

I've built a lot more useful stuff in the last two or three years than I did beforehand. I'm not saying you have to do things this way either, but this is how I develop and hopefully you find some of this useful as well. So I'll show you how. We need to do a couple of things.

We can't just create one giant notebook with our whole library. Somehow we have to be able to pull out those little gems, those bits of code where we think, oh, this is good. Let's keep this. We have to be able to pull that out into a package that we reuse.

So in order to tell our system that here is a cell that I want you to keep and reuse, I use this special comment, #export, at the top of the cell. And then I have a program called notebook2script, which goes through the notebook, finds those cells, and puts them into a Python module.

So let me show you. So if I run this cell, and then head over here, notice I don't have to type all of 'exports', because I have tab completion, even for file names, in Jupyter Notebook. So tab is enough, and I could either run it here or go back to my console and run it.

So let's run it here. Okay, so that says it converted 00_exports.ipynb to nb_00.py. And what I've done is I've made it so that these things go into a directory called exp, for exported modules. And here is that nb_00.py. And there it is, right? So you can see, other than a standard header, it's got the contents of that one cell.

So now I can import that at the top of my next notebook: from exp.nb_00 import *. And I can create a test that that variable equals that value. So let's see. It does. Okay. And notice there's a lot of test frameworks around, but it's not always helpful to use them.

Like here, we've created a test framework, or the start of one. I've created a function called test, which checks whether a and b return true or false based on this comparison function, by using assert. And then I've created something called test_eq, which calls test, passing in a and b and operator.eq.
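Here's a minimal sketch of those two helpers (the exact signatures in the course repo may differ slightly):

```python
import operator

def test(a, b, cmp, cname=None):
    # Assert that cmp(a, b) holds, with a readable message if it doesn't.
    if cname is None: cname = cmp.__name__
    assert cmp(a, b), f"{cname}:\n{a}\n{b}"

def test_eq(a, b):
    # Equality test, built on test() with operator.eq.
    test(a, b, operator.eq, '==')

test_eq(1 + 1, 2)   # passes silently; a wrong value raises an AssertionError
```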

Okay. So if they're wrong, we get an assertion error. Okay. So we've been able to write a test, which so far has basically tested that our little module exporter thing works correctly. We probably want to be able to run these tests somewhere other than just inside a notebook.

So we have a little program called run_notebook.py; you pass it the name of a notebook, and it runs it. So I should save this one with our failing test so you can see it fail. So the first time it passed, and then I make the failing test, and you can see here it is: an assertion error, and it tells you exactly where it happened.

Okay. So we now have an automatable unit testing framework in our Jupyter notebook. I'll point out how small the contents of these two Python scripts are; let's look at them. So the first one was run_notebook.py, which is our test runner. There is the entirety of it, right?

So there's a thing called nbformat; if you conda install nbformat, then it basically lets you execute a notebook, and it prints out any errors. So that's the entirety of that. You'll notice that I'm using a library called fire. Fire is a really neat library that lets you take any function, like this one, and automatically convert it into a command line interface.

So here I've got a function called run_notebook, and then it says fire.Fire(run_notebook). So if I now go python run_notebook.py, then it says, oh, this function received no value: path; usage: run_notebook.py PATH. So you can see that what it did was it converted my function into a command line interface, which is really great.

And it handles things like optional arguments and classes, and it's super useful, particularly for this kind of Jupyter-first development, because you can grab stuff that's in Jupyter and turn it into a script, often by just copying and pasting the function or exporting it, and then just adding this one line of code.
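So, as a sketch, the whole test runner is roughly this (using nbformat plus nbconvert's ExecutePreprocessor and fire; the real run_notebook.py may differ in details):

```python
#!/usr/bin/env python
# Execute a whole notebook and surface any errors (e.g. failing asserts).
import nbformat, fire
from nbconvert.preprocessors import ExecutePreprocessor

def run_notebook(path):
    "Execute the notebook at `path`; an exception in any cell is raised here."
    nb = nbformat.read(path, as_version=4)
    ExecutePreprocessor(timeout=600).preprocess(nb, {})
    print('done')

if __name__ == '__main__':
    fire.Fire(run_notebook)   # one line to turn the function into a CLI
```

From the shell, that would be something like `python run_notebook.py 01_matmul.ipynb` (the notebook name here is just an example).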

The other one, notebook2script, is not much more complicated. It's one screen of code; again, the main thing here is to call fire, which calls this one function, and you'll see it basically uses json.load, because notebooks are just JSON. The reason I mention this to you is that Jupyter Notebook comes with this whole kind of ecosystem of libraries and APIs and stuff like that.

And on the whole, I hate them. It's just JSON; I find that just doing json.load is the easiest way. And specifically, I build my Jupyter notebook infrastructure inside Jupyter notebooks. So here's how it looks, right: import json, json.load this file, and it gives you the cells as an array, and there's the contents of 'source' for my first cell, right?
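As a stripped-down sketch of that idea (the real notebook2script.py handles headers and module naming more carefully, so treat this as illustrative only):

```python
import json, re
from pathlib import Path

def notebook2script(fname):
    "Collect the #export cells of a notebook into exp/nb_XX.py."
    fname = Path(fname)
    cells = json.load(open(fname))['cells']          # a notebook is just JSON
    exports = []
    for cell in cells:
        if cell['cell_type'] != 'code': continue
        src = ''.join(cell['source'])                # source is stored as a list of lines
        if src.startswith('#export'): exports.append(src)
    num = re.match(r'\d+', fname.name).group(0)      # e.g. '00_exports.ipynb' -> '00'
    out = Path('exp') / f'nb_{num}.py'
    out.parent.mkdir(exist_ok=True)
    out.write_text(f'# AUTOGENERATED from {fname.name}\n\n' + '\n\n'.join(exports))
```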

So if you do want to play around with doing stuff to Jupyter notebooks, it's a really great environment for kind of automating stuff and running scripts on them and stuff like that. So there's that. All right. So that's the entire contents of our development infrastructure. We now have a test.

Let's make it pass again. One of the great things about having unit tests in notebooks is that when one does fail, you open up a notebook, which can have pros saying, this is what this test does. It's implementing this part of this paper. You can see all the stuff above it that's setting up all the context for it.

You can check in each input and output. It's a really great way to fix those failing tests because you've got the whole truly literate programming experience all around it. So I think that works great. Okay. So before we start doing matrix multiply, we need some matrices to multiply. So these are some of the things that are allowed by our rules.

We've got some stuff that's part of the standard library. This is the fastai datasets library, to let us grab the datasets we need. Some more standard library stuff. Torch, which we're only allowed to use for indexing and array creation. And matplotlib. There you go. So let's grab MNIST.

To grab MNIST, we can use fastai datasets to download it. Then we can use the standard library gzip to open it, and then we can pickle.load it. In Python, the kind of standard serialization format is called pickle, and this MNIST version on deeplearning.net is stored in that format.

It basically gives us a tuple of tuples of datasets, like so: x_train, y_train, x_valid, y_valid. It actually contains NumPy arrays, but NumPy arrays are not allowed in our foundations, so we have to convert them into tensors. We can just use the Python map to map the tensor function over each of these four arrays to get back four tensors.
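In code, that download-unzip-unpickle-convert sequence looks roughly like this (treat the exact fastai.datasets call and the URL constant as my best recollection of that API rather than gospel):

```python
import pickle, gzip
from fastai import datasets
from torch import tensor

MNIST_URL = 'http://deeplearning.net/data/mnist/mnist.pkl'
path = datasets.download_data(MNIST_URL, ext='.gz')   # cached download via fastai.datasets

with gzip.open(path, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')

# NumPy arrays aren't allowed by our rules, so map them to PyTorch tensors.
x_train, y_train, x_valid, y_valid = map(tensor, (x_train, y_train, x_valid, y_valid))
```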

A lot of you will be more familiar with NumPy arrays than PyTorch tensors. But you know, everything you can do with NumPy arrays, you can also do with PyTorch tensors; you can also do it on the GPU and have all this nice deep learning infrastructure. So it's a good idea to get used to using PyTorch tensors, in my opinion.

So we can now grab the number of rows and number of columns in the training set. And we can take a look. So here's MNIST, hopefully pretty familiar to you already. It's 50,000 rows by 784 columns, and the y data looks something like this. The y shape is just 50,000 rows, and the minimum and maximum of the dependent variable is zero to nine.

So hopefully that all looks pretty familiar. So let's add some tests. So n should be equal to the shape of y, should be equal to 50,000. The number of columns should be equal to 28 by 28, because that's how many pixels there are in MNIST, and so forth.
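For example (continuing from the loading sketch above, and using the test_eq helper sketched earlier):

```python
n, c = x_train.shape
test_eq(n, y_train.shape[0])   # same number of rows and labels
test_eq(n, 50000)              # 50,000 training rows
test_eq(c, 28 * 28)            # 784 pixels per flattened image
test_eq(y_train.min(), 0)      # labels run from 0...
test_eq(y_train.max(), 9)      # ...to 9
```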

And we're just using that test_eq function that we created just above. So now we can plot it. Okay, so we've got a float tensor, and we pass that to imshow after reshaping it to 28 by 28. .view is really important. I think we saw it a few times in part one, but get very familiar with it.

This is how we reshape our 784-long vector into a 28 by 28 matrix that's suitable for plotting. Okay, so there's our data. And let's start by creating a simple linear model. So for a linear model, we're going to need to basically have something where y equals a x plus b.

And so our a will be a bunch of weights. So it's going to be a 784 by 10 matrix, because we've got 784 coming in and 10 going out. So that's going to allow us to take in our independent variable and map it to something which we compare to our dependent variable.

And then for our bias, we'll just start with 10 zeros. So if we're going to do y equals a x plus b, then we're going to need a matrix multiplication. Almost everything we do in deep learning is basically matrix multiplication or a variant thereof, affine functions, as we call them.
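In code, those two parameters are just the following (plain random numbers for now; better initialization is a topic for later):

```python
weights = torch.randn(784, 10)   # a: 784 input pixels -> 10 output classes
bias = torch.zeros(10)           # b: one bias per output class
```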

So you want to be very comfortable with matrix multiplication. So this cool website, matrixmultiplication.xyz, shows us exactly what happens when we multiply these two matrices. So we take the first row of the first matrix and the first column of the second matrix, multiply them element-wise, and then add them up, and that gives us that one.

And now you can see we've got two sets going on at the same time, so that gives us two more, and then two more, and then the final one. And that's our matrix multiplication. So we have to do that. So we've got a few loops going on. We've got the loop of this thing scrolling down here.

We've got the loop of these two rows. They're really columns, so we flip them around. And then we've got the loop of the multiply and add. So we're going to need three loops. And so here's our three loops. And notice this is not going to work unless the number of columns here and the number of rows here are the same.

So let's grab the number of rows and columns of A, and the number of rows and columns of B, and make sure that AC equals BR, just to double check. And then let's create something of size AR by BC, because the size of this is going to be AR by BC with zeros in, and then have our three loops.

And then right in the middle, let's do that. OK, so right in the middle, the result in c[i, j] accumulates a[i, k] times b[k, j]. And this is the vast majority of what we're going to be doing in deep learning. So get very, very comfortable with that equation, because we're going to be seeing it in three or four different variants of notation and style in the next few weeks, in the next few minutes.
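Written out, the three-loop version looks like this (torch is assumed to be imported, as in the sketches above):

```python
def matmul(a, b):
    ar, ac = a.shape             # rows and columns of a
    br, bc = b.shape             # rows and columns of b
    assert ac == br              # inner dimensions must match
    c = torch.zeros(ar, bc)      # the result is ar rows by bc columns
    for i in range(ar):          # loop over rows of the output
        for j in range(bc):      # loop over columns of the output
            for k in range(ac):  # multiply and add along the shared dimension
                c[i, j] += a[i, k] * b[k, j]
    return c
```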

OK? And it's got kind of a few interesting things going on. This I here appears also over here. This J here appears also over here. And then the K in the loop appears twice. And look, it's got to be the same number in each place, because this is the bit where we're multiplying together the element-wise things.

So there it is. So let's create a nice small version: grab the first five rows of the validation set, and we'll call that m1. And grab our weight matrix, and call that m2. And then here's their sizes: five by 784, because we just grabbed the first five rows, multiplied by 784 by 10.

So these match, as they should. And so now we can go ahead and do that matrix multiplication. And it's done, OK? And t1.shape shows it's given us, as you would expect, a five row by 10 column output. And it took about a second.

So it took about a second for five rows. Our data set, MNIST, is 50,000 rows. So it's going to take about 50,000 seconds to do a single matrix multiplication in Python. So imagine doing MNIST where every layer for every pass took about 10 hours. Not going to work, right?

So that's why we don't really write things in Python. Like, when we say Python is too slow, we don't mean 20% too slow. We mean thousands of times too slow. So let's see if we can speed this up by 50,000 times. Because if we could do that, it might just be fast enough.

So the way we speed things up is we start in the innermost loop. And we make each bit faster. So the way to make Python faster is to remove Python. And the way we remove Python is by passing our computation down to something that's written in something other than Python, like PyTorch.

Because PyTorch behind the scenes is using a library called ATen. And so we want to get this going down to the ATen library. So the way we do that is to take advantage of something called element-wise operations. So you've seen them before. For example, if I have two tensors, a and b, both of length three, I can add them together.

And when I add them together, it simply adds together the corresponding items. So that's called element-wise addition. Or I could do less than, in which case it's going to do element-wise less than. So what percentage of A is less than the corresponding item of B, A less than B dot float dot mean.
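For example (the particular numbers are just for illustration):

```python
a = torch.tensor([10., 6., 4.])
b = torch.tensor([2., 8., 7.])

a + b                    # tensor([12., 14., 11.]): element-wise addition
a < b                    # tensor([False, True, True]): element-wise comparison
(a < b).float().mean()   # tensor(0.6667): the fraction of a that is less than b
```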

We can do element-wise operations on things not just of rank one, but we could do it on a rank two tensor, also known as a matrix. So here's our rank two tensor, M. Let's calculate the Frobenius norm. How many people know about the Frobenius norm? Right, almost nobody. And it looks kind of terrifying, right, but actually it's just this.

It's a matrix times itself dot sum dot square root. So here's the first time we're going to start trying to translate some equations into code to help us understand these equations. So this says, when you see something like A with two sets of double lines around it, and an F underneath, that means we are calculating the Frobenius norm.

So any time you see this, and you will, it actually pops up semi-regularly in deep learning literature, when you see this, what it actually means is this function. As you probably know, capital sigma means sum, and this says we're going to sum over two for loops. The first for loop will be called i, and we'll go from 1 to n.

And the second for loop will also be called j, and will also go from 1 to n. And in these nested for loops, we're going to grab something out of a matrix A, that position i, j. We're going to square it, and then we're going to add all of those together, and then we'll take the square root.
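Written out as an equation, matching that description (both indices running over the matrix, exactly like the two nested loops):

```latex
\|A\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}^{2}}
```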

Which is that. Now I have something to admit to you. I can't write LaTeX. And yet I did create this Jupyter notebook, so it looks a lot like I created some LaTeX, which is certainly the impression I like to give people sometimes. But the way I actually write LaTeX is I find somebody else who wrote it, and then I copy it.

So the way you do this most of the time is you Google for Frobenius norm, you find the wiki page for the Frobenius norm, you click edit next to the equation, and you copy and paste it. So that's a really good way to do it. And chuck dollar signs, or even two dollar signs, around it. Two dollar signs make it a bit bigger.

Two dollar signs make it a bit bigger. So that's way one to get equations. Method two is if it's in a paper on archive, did you know on archive you can click on download other formats in the top right, and then download source, and that will actually give you the original tech source, and then you can copy and paste their LaTeX.

So I'll be showing you a bunch of equations during these lessons, and I can promise you one thing: I wrote none of them by hand. So this one was stolen from Wikipedia. All right, so you now know how to implement the Frobenius norm from scratch in PyTorch. You could also have written it, of course, as m.pow(2), but that would be illegal under our rules. We're not allowed to use pow yet, so that's why we did it that way.

We're not allowed to use PAL yet, so that's why we did it that way. So that's just doing the element-wise multiplication of a rank two tensor with itself. One times one, two times two, three times three, etc. So that is enough information to replace this loop, because this loop is just going through the first row of A and the first column of B, and doing an element-wise multiplication and sum.

So our new version is going to have two loops, not three. Here it is. So this is all the same, but now we've replaced the inner loop, and you'll see that basically it looks exactly the same as before, but where it used to say k, it now says colon.

So in pytorch and numpy, colon means the entirety of that axis. So Rachel, help me remember the order of rows and columns when we talk about matrices, which is the song. Row by column, row by column. So that's the song. So i is the row number. So this is row number i, the whole row.

And this is column number j, the whole column. So multiply all of column j by all of row i, and that gives us back a rank one tensor, which we add up. That's exactly the same as what we had before. And so now that takes 1.45 milliseconds. We've removed one line of code, and it's 178 times faster.
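So the two-loop version looks like this:

```python
def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            # the inner k loop becomes an element-wise multiply and a sum,
            # which runs in C inside PyTorch rather than in Python
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c
```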

So we successfully got rid of that inner loop. And so now this is running in C. We didn't really write Python here. We wrote kind of a Python-ish thing that said, please call this C code for us. And that made it 178 times faster. Let's check that it's right.

We can't really check that it's equal, because floats are sometimes changed slightly, depending on how you calculate them. So instead, let's create something called near, which calls torch.allclose with some tolerance. And then we'll create a test_near function that calls our test function using our near comparison. And let's see.
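A sketch of those two helpers, building on the test function from earlier (the tolerances here are just plausible defaults):

```python
def near(a, b):
    # allclose rather than exact equality, since float results can differ slightly
    return torch.allclose(a, b, rtol=1e-3, atol=1e-5)

def test_near(a, b):
    test(a, b, near)

test_near(t1, matmul(m1, m2))   # compare against the result of the slower version
```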

Yep. Passes. OK. So we've now got our matrix multiplication at 65 microseconds. Now we need to get rid of this loop, because now this is our innermost loop. And to do that, we're going to have to use something called broadcasting. Who here is familiar with broadcasting? About half. OK.

That's what I figured. So broadcasting is about the most powerful tool we have in our toolbox for writing code in Python that runs at C speed, or in fact, with PyTorch, if you put it on the GPU, it's going to run at CUDA speed. It allows us to get rid of nearly all of our loops, as you'll see.

The term "broadcasting" comes from NumPy, but the idea actually goes all the way back to APL from 1962. And it's a really, really powerful technique. A lot of people consider it a different way of programming, where we get rid of all of our for loops and replace them with these implicit, broadcasted loops.

In fact, you've seen broadcasting before. Remember our tensor A, which contains 10, 6, 4? If you say A greater than 0, then on the left-hand side, you've got to rank one tensor. On the right-hand side, you've got a scalar. And yet somehow it works. And the reason why is that this value 0 is broadcast three times.

It becomes 0, 0, 0, and then it does an element-wise comparison. So every time, for example, you've normalized a dataset by subtracting the mean and divided by the standard deviation in a kind of one line like this, you've actually been broadcasting. You're broadcasting a scalar to a tensor. So A plus one also broadcasts a scalar to a tensor.

And the tensor doesn't have to be rank one. Here we can multiply our rank two tensor by two. So there's the simplest kind of broadcasting. And any time you do that, you're not operating at Python speed, you're operating at C or CUDA speed. So that's good. We can also broadcast a vector to a matrix.

So here's a rank one tensor C. And here's our previous rank two tensor M. So M's shape is 3, 3, C's shape is 3. And yet M plus C does something. What did it do? 10, 20, 30 plus 1, 2, 3, 10, 20, 30 plus 4, 5, 6, 10, 20, 30 plus 7, 8, 9.

It's broadcast this row across each row of the matrix. And it's doing that at C speed. So this, there's no loop, but it sure looks as if there was a loop. C plus M does exactly the same thing. So we can write C dot expand as M. And it shows us what C would look like when broadcast to M.

10, 20, 30, 10, 20, 30, 10, 20, 30. So you can see M plus T is the same as C plus M. So basically it's creating or acting as if it's creating this bigger rank two tensor. So this is pretty cool because it now means that any time we need to do something between a vector and a matrix, we can do it at C speed with no loop.

Now you might be worrying though that this looks pretty memory intensive if we're kind of turning all of our rows into big matrices, but fear not. Because you can look inside the actual memory used by PyTorch. So here T is a 3 by 3 matrix, but T dot storage tells us that actually it's only storing one copy of that data.

T dot shape tells us that T knows it's meant to be a 3 by 3 matrix. And T dot stride tells us that it knows that when it's going from column to column, it should take one step through the storage. But when it goes from row to row, it should take zero steps.

And so that's how come it repeats 10, 20, 30, 10, 20, 30, 10, 20, 30. So this is a really powerful thing that appears in pretty much every linear algebra library you'll come across: this idea that you can actually create tensors that behave like higher rank things than they're actually stored as.
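Putting those pieces together in code:

```python
c = torch.tensor([10., 20., 30.])
m = torch.tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

m + c                # c is broadcast across each row of m
t = c.expand_as(m)   # what c looks like when broadcast to m's shape
t.storage()          # only the original three numbers are actually stored
t.shape              # torch.Size([3, 3])
t.stride()           # (0, 1): step 0 between rows, 1 between columns
```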

So this is really neat. It basically means that this broadcasting functionality gives us C like speed with no additional memory overhead. Okay, what if we wanted to take a column instead of a row? So in other words, a rank 2 tensor of shape 3, 1. We can create a rank 2 tensor of shape 3, 1 from a rank 1 tensor by using the unsqueeze method.

Unsqueeze adds an additional dimension of size 1 to wherever we ask for it. So unsqueeze 0, let's check this out, unsqueeze 0 is of shape 1, 3, it puts the new dimension in position 1. Unsqueeze 1 is shape 3, 1, it creates the new axis in position 1. So unsqueeze 0 looks a lot like C, but now rather than being a rank 1 tensor, it's now a rank 2 tensor.

See how it's got two square brackets around it? See how its size is 1, 3? And more interestingly, c.unsqueeze(1) now looks like a column, right? It's also a rank 2 tensor, but it's 3 rows by one column. Why is this interesting? Well, actually, before we get to that, I'll just mention that writing .unsqueeze is kind of clunky.

So PyTorch and NumPy have a neat trick, which is that you can index into an array with the special value None, and None means squeeze a new axis in here, please. So you can see that c[None, :] is exactly the same shape, 1, 3, as c.unsqueeze(0). And c[:, None] is exactly the same shape as c.unsqueeze(1).
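In code:

```python
c = torch.tensor([10., 20., 30.])

c.unsqueeze(0).shape    # torch.Size([1, 3]): new leading axis, so it looks like a row
c.unsqueeze(1).shape    # torch.Size([3, 1]): new trailing axis, so it looks like a column
c[None, :].shape        # torch.Size([1, 3]): same as unsqueeze(0)
c[:, None].shape        # torch.Size([3, 1]): same as unsqueeze(1)
c[None, :, None].shape  # torch.Size([1, 3, 1]): None can add several axes at once
```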

So I hardly ever use unsqueeze unless I'm like particularly trying to demonstrate something for teaching purposes, I pretty much always use none. Apart from anything else, I can add additional axes this way, or else with unsqueeze you have to go to unsqueeze, unsqueeze, unsqueeze. So this is handy. So why did we do all that?

The reason we did all that is because if we go c[:, None], so in other words we turn it into a column, kind of a columnar shape, so it's now of shape 3, 1, then .expand_as doesn't now say 10, 20, 30, 10, 20, 30, 10, 20, 30, but 10, 10, 10, 20, 20, 20, 30, 30, 30.

So in other words it's getting broadcast along columns instead of rows. So as you might expect, if I take that and add it to M, then I get the result of broadcasting the column. So it's now not 11, 22, 33, but 11, 12, 13. So everything makes more sense in Excel.

Let's look. So here's broadcasting in Excel. Here is a 1, 3 shape rank 2 tensor. So we can use the rows and columns functions in Excel to get the rows and columns of this object. Here is a 3 by 1, rank 2 tensor, again rows and columns. And here is a 3 by 3, rank 2 tensor.

As you can see, rows by columns. So here's what happens if we broadcast this to be the shape of M. And here is the result of that C plus M. And here's what happens if we broadcast this to that shape. And here is the result of that addition. And there it is, 11, 12, 13, 24, 25, 26.

So basically what's happening is when we broadcast, it's taking the thing which has a unit axis and is kind of effectively copying that unit axis so it is as long as the larger tensor on that axis. But it doesn't really copy it, it just pretends as if it's been copied.

So we can use that to get rid of our loop. So this was the loop we were trying to get rid of, going through each of range BC. And so here it is. So now we are not anymore going through that loop. So now rather than setting CI comma J, we can set the entire row of CI.

This is the same as CI comma colon. Every time there's a trailing colon in NumPy or PyTorch, you can delete it optionally. You don't have to. So before, we had a few of those, right, let's see if we can find one. Here's one. Comma colon. So I'm claiming we could have got rid of that.

Let's see. Yep, still torch size 1 comma 3. And similar thing, any time you see any number of colon commas at the start, you can replace them with a single ellipsis. Which in this case doesn't save us anything because there's only one of these. But if you've got like a really high-rank tensor, that can be super convenient, especially if you want to do something where the rank of the tensor could vary.

You don't know how big it's going to be ahead of time. So we're going to set the whole of row I, and we don't need that colon, though it doesn't matter if it's there. And we're going to set it to the whole of row I of A. And then now that we've got row I of A, that is a rank 1 tensor.

So let's turn it into a rank 2 tensor, so it's now got a new axis. And see how this is minus 1? Minus 1 always means the last dimension. So how else could we have written that? We could also have written it like that, with the special value None.

So this is now of length whatever the size of a row of a is, which is ac. So it's of shape ac by 1. So that is a rank 2 tensor. And b is also a rank 2 tensor; that's the entirety of our matrix. And so this is going to get broadcast over this, which is exactly what we want.

We want it to get rid of that loop. And then, so that's going to return, because it broadcast, it's actually going to return a rank 2 tensor. And then that rank 2 tensor, we want to sum it up over the rows. And so sum, you can give it a dimension argument to say which axis to sum over.

So this one is kind of our most mind-bending broadcast of the lesson. So I'm going to leave this as a bit of homework for you to go back and convince yourself as to why this works. So maybe put it in Excel or do it on paper if it's not already clear to you why this works.

But this is sure handy, because before we were broadcasting that, we were at 1.39 milliseconds. After using that broadcasting, we're down to 250 microseconds. So at this point, we're now 3,200 times faster than Python. And it's not just speed. Once you get used to this style of coding, getting rid of these loops I find really reduces a lot of errors in my code.

It takes a while to get used to, but once you're used to it, it's a really comfortable way of programming. Once you get to kind of higher ranked tensors, this broadcasting can start getting a bit complicated. So what you need to do instead of trying to keep it all in your head is apply the simple broadcasting rules.

Here are the rules. They're the same in NumPy and PyTorch and TensorFlow: what we do is we compare the shapes element-wise. So let's look at a slightly interesting example. Here is our rank 1 tensor c, and let's insert a leading unit axis, so this is shape 1, 3. See how there's two square brackets? And here's the version with a trailing unit axis instead, so this is shape 3, 1. And we should take a look at that. So just to remind you, that looks like a column.

See how there's two square brackets? And here's the version with a, sorry, this one's a preceding axis. This one's a trailing axis. So this is shape 3,1. And we should take a look at that. So just to remind you, that looks like a column. What if we went C, none, colon, times C, colon, colon, what on earth is that?

And so let's go back to Excel. Here's our row version. Here's our column version. What happens is it says, okay, you want to multiply this by this, element-wise, right? This is not the at sign. This is asterisk, so element-wise multiplication. And it broadcasts this to be the same number of rows as that, like so.

And it broadcasts this to be the same number of columns as that, like so. And then it simply multiplies those together. That's it, right? So the rule that it's using, you can do the same thing with greater than, right? The rule that it's using is, let's look at the two shapes, 1, 3 and 3,1, and see if they're compatible.

They're compatible if, element-wise, they're either the same number or one of them is 1. So in this case, 1 is compatible with 3 because one of them is 1. And 3 is compatible with 1 because one of them is 1. And so what happens is, if it's 1, that dimension is broadcast to make it the same size as the bigger one, okay?

So 3,1 became 3,3. So this one was multiplied 3 times down the rows, and this one was multiplied 3 times down the columns. And then there's one more rule, which is that they don't even have to be the same rank, right? So something that we do a lot with image normalization is we normalize images by channel, right?

So you might have an image which is 256 by 256 by 3. And then you've got the per-channel mean, which is just a rank 1 tensor of size 3. They're actually compatible because what it does is, anywhere that there's a missing dimension, it inserts a 1 there at the start.

It inserts leading dimensions and inserts a 1. So that's why actually you can normalize by channel with no lines of code. Mind you, in PyTorch, it's actually channel by height by width, so it's slightly different. But this is the basic idea. So this is super cool. We're going to take a break, but we're getting pretty close.
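To make both of those rules concrete before the break, here's a small sketch (the image shapes and layout are just for illustration):

```python
c = torch.tensor([10., 20., 30.])

# Rule 1: shapes (1, 3) and (3, 1) are compatible, each 1 gets broadcast,
# and the result is (3, 3): an outer product.
(c[None, :] * c[:, None]).shape       # torch.Size([3, 3])

# Rule 2: a missing leading dimension is treated as 1, so a (3,) mean
# broadcasts against a (256, 256, 3) image with no loop.
img = torch.rand(256, 256, 3)         # height x width x channels (illustrative layout)
channel_mean = img.mean(dim=(0, 1))   # shape (3,)
normalized = img - channel_mean       # (256, 256, 3) minus (3,), per channel
```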

My goal was to make our Python code 50,000 times faster, we're up to 4,000 times faster. And the reason this is really important is because if we're going to be like doing our own stuff, like building things that people haven't built before, we need to know how to write code that we can write quickly and concisely, but operates fast enough that it's actually useful, right?

And so this broadcasting trick is perhaps the most important trick to know about. So let's have a six-minute break, and I'll see you back here at 8 o'clock. So, broadcasting. When I first started teaching deep learning here and I asked how many people are familiar with broadcasting, back when we used to do it in Theano, almost no hands went up, so I used to kind of say this is like my secret magic trick.

I think it's really cool, it's kind of really cool that now half of you have already heard of it, and it's kind of sad because it's now not my secret magic trick. It's like here's something half of you already knew, but the other half of you, there's a reason that people are learning this quickly and it's because it's super cool.

Here's another magic trick. How many people here know Einstein summation notation? Okay, good, good, almost nobody. So it's not as cool as broadcasting, but it is still very, very cool. Let me show you, right? And this is a technique which I don't think was invented by Einstein; I think it was popularized by Einstein as a way of dealing with the high-rank tensor reductions that he used in general relativity, I think.

Here's the trick. This is the innermost part of our original matrix multiplication for loop, remember? And here's the version when we removed the innermost loop and replaced it with an element-wise product. And you'll notice that what happened was that the repeated K got replaced with a colon. Okay, so what's this?

What if I move, okay, so first of all, let's get rid of the names of everything. And let's move this to the end and put it after an arrow. And let's keep getting rid of the names of everything. And get rid of the commas and replace spaces with commas.

Okay. And now I just created Einstein summation notation. So Einstein summation notation is like a mini language. You put it inside a string, right? And what it says is, however many, so there's an arrow, right, and on the left of the arrow is the input and on the right of the arrow is the output.

How many inputs do you have? Well they're delimited by comma, so in this case there's two inputs. The inputs, what's the rank of each input? It's however many letters there are. So this is a rank two input and this is another rank two input and this is a rank two output.

How big are the inputs? This one is of size i by k, this one is of size k by j, and the output is of size i by j. When you see the same letter appearing in different places, it's referring to the same size dimension. So this is of size i, and the output also has i rows.

This has j columns. The output also has j columns. Alright. So we know how to go from the input shape to the output shape. What about the k? You look for any place that a letter is repeated and you do a dot product over that dimension. In other words, it's just like the way we replaced k with colon.

So this is going to create something of size i by j by doing dot products over these shared k's, which is matrix multiplication. So that's how you write matrix multiplication with Einstein summation notation. And then all you do is call torch.einsum. If you go to the PyTorch einsum docs, or the docs of most of the major libraries, you can find all kinds of cool examples of einsum.
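A minimal sketch of that einsum matmul (variable names are just placeholders):

```python
import torch

a = torch.randn(5, 3)    # i x k
b = torch.randn(3, 4)    # k x j

def matmul_einsum(a, b):
    # the repeated k appears on the inputs but not the output, so it's dot-producted away
    return torch.einsum('ik,kj->ij', a, b)

assert torch.allclose(matmul_einsum(a, b), a.matmul(b), atol=1e-6)

# The same idea gives you a batch-wise matrix multiply almost for free:
x = torch.randn(8, 5, 3)
y = torch.randn(8, 3, 4)
assert torch.allclose(torch.einsum('bik,bkj->bij', x, y), torch.bmm(x, y), atol=1e-6)
```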

You can use it for transpose, diagonalization, tracing, all kinds of things, batch-wise versions of just about everything. So for example, if PyTorch didn't have batch-wise matrix multiplication, I could just create it; there's batch-wise matrix multiplication. So there's all kinds of things you can kind of invent.

And often it's quite handy if you need to put a transpose in somewhere or tweak things to be a little bit different; you can use this. So that's Einstein summation notation. This einsum matmul has now taken us down to 57 microseconds, so we're now 16,000 times faster than Python.

I will say something about einsum, though. It's a travesty that this exists, because we've got a little mini language inside Python in a string. I mean, that's horrendous. You shouldn't be writing programming languages inside a string. This is as bad as a regex; you know, regular expressions are also mini languages inside a string.

You want your languages to be typed and have IntelliSense and be things that you can extend. This mini language is amazing, but there are so few things that it actually does, right? What I actually want to be able to do is create any kind of arbitrary combination of any axes and any operations and any reductions I like, in any order, in the actual language I'm writing in, right?

So that's actually what APL does. That's actually what J and K do. J and K are the languages that kind of came out of APL. This is a series of languages that have been around for about 60 years, and everybody's pretty much failed to notice.

My hope is that things like Swift and Julia will give us this: the ability to actually write stuff in actual Swift and actual Julia that we can run in an actual debugger and use an actual profiler with, and do arbitrary stuff that's really fast. But actually, Swift seems like it might go even quite a bit faster than einsum, in an even more flexible way, thanks to this new compiler infrastructure called MLIR, which builds off some really exciting research in the compiler world that's been coming out over the last few years, particularly out of a system called Halide (that's H-A-L-I-D-E), a super cool language that basically showed it's possible to create a language that can produce totally optimized linear algebra computations in a really flexible, convenient way.

And since that came along, there's been all kinds of cool research using these techniques like something called polyhedral compilation, which kind of have the promise that we're going to be able to hopefully, within the next couple of years, write Swift code that runs as fast as the next thing I'm about to show you, because the next thing I'm about to show you is the PyTorch operation called matmul.

And matmul takes 18 microseconds, which is 50,000 times faster than Python. Why is it so fast? Well, if you think about what you're doing when you do a matrix multiply of something that's like 50,000 by 784 by 784 by 10, these are things that aren't going to fit in the cache in your CPU.

So if you do the kind of standard thing of going down all the rows and across all the columns, by the time you've got to the end and you go back to exactly the same column again, it forgot the contents and has to go back to RAM and pull it in again.

So if you're smart, what you do is you break your matrix up into little smaller matrices and you do a little bit at a time. And that way, everything is kind of in cache and it goes super fast. Now, normally, to do that, you have to write kind of assembly language code, particularly if you want to kind of get it all running in your vector processor.

And that's how you get these 18 microseconds. So currently, to get a fast matrix multiply, things like PyTorch, they don't even write it themselves, they basically push that off to something called a BLAS, B-L-A-S, a BLAS is a Basic Linear Algebra Subprograms Library, where companies like Intel and AMD and NVIDIA write these things for you.

So you can look up cuBLAS, for example; this is NVIDIA's version of BLAS. Or you could look up MKL, which is Intel's version of BLAS, and so forth. And this is kind of awful because, you know, the programmer is limited to this subset of things that your BLAS can handle.

And to use it, you don't really get to write it in Python; you kind of have to write the one thing that happens to be turned into that pre-existing BLAS call. So this is kind of why we need to do better, right? And there are people working on this, and there are people actually on Chris Lattner's team working on this.

You know, there's some really cool stuff, like something called Tensor Comprehensions, which originally came out of the PyTorch world, and I think that work is now inside Chris's team at Google, where people are basically saying, hey, here are ways to compile these much more general things. And this is what we want as more advanced practitioners.

Anyway, for now, in the PyTorch world, we're stuck at this level, which is to recognize that this is, you know, about three times faster than the best we can do in an even vaguely flexible way (einsum). And if we compare it to the actually flexible way, which is broadcasting, we had 254 microseconds, so it's still over 10 times better, right?

So wherever possible today, we want to use operations that are predefined in our library, particularly for things that operate over lots of rows and columns, where this memory caching stuff is going to be complicated. So keep an eye out for that.

Matrix multiplication is so common and useful that it's actually got its own operator, which is @. matmul and @ are actually calling the exact same code, so they're the exact same speed. And matmul is not actually just matrix multiplication; it covers a much broader array of tensor reductions across different numbers of axes.

So it's worth checking out what matmul can do, because often it'll be able to handle things like batch-wise multiplication or matrix-times-vector; don't think of it as only something that can do rank two by rank two, because it's a little bit more flexible. OK, so now we have matrix multiplication, and so now we're allowed to use it.
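For example (shapes here are just illustrative), the same `@`/`matmul` call handles more than the rank-2 by rank-2 case:

```python
import torch

a = torch.randn(5, 784)
b = torch.randn(784, 10)
assert torch.allclose(a @ b, a.matmul(b))   # @ and matmul are the same code path

v = torch.randn(784)
print((a @ v).shape)                        # matrix @ vector: torch.Size([5])

x = torch.randn(8, 5, 784)
print((x @ b).shape)                        # batch-wise: torch.Size([8, 5, 10])
```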

And so we're going to use it to try to create a forward pass, which means we first need ReLU and matrix initialization, because remember, a model contains parameters which start out randomly initialized, and then we use the gradients to gradually update them with SGD. So let's do that. So here is notebook 02.

So let's start by importing nb_01, and I just copied and pasted the three lines we used to grab the data; I'm just going to pop them into a function so we can use it to grab MNIST when we need it. And now that we know about broadcasting, let's create a normalization function that takes our tensor and subtracts the mean and divides by the standard deviation.
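Thanks to broadcasting, that normalization helper is a one-liner; a sketch (the real notebook's version may differ in name or detail):

```python
def normalize(x, m, s):
    # subtract the mean and divide by the standard deviation, broadcast over every element
    return (x - m) / s
```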

So now let's grab our data, OK, and pop it into x_train, y_train, x_valid, y_valid. Let's grab the mean and standard deviation, and notice that they're not 0 and 1. Why would they be? But we want them to be 0 and 1, and we're going to be seeing a lot of why we want them to be 0 and 1 over the next couple of lessons.

But for now, let's just take my word for it: we want them to be 0 and 1. So that means we need to subtract the mean and divide by the standard deviation. But for the validation set, we don't subtract the validation set's own mean and divide by the validation set's own standard deviation; we normalize it with the training set's mean and standard deviation.

Because if we did, those two data sets would be on totally different scales, right? So if the training set was mainly green frogs, and the validation set was mainly red frogs, then if we normalized with the validation set's own mean and variance, we would end up with them both having the same average coloration, and we wouldn't be able to tell the two apart, right?

So that's an important thing to remember when normalizing: always make sure your validation and training sets are normalized in the same way. So after doing that (we call the normalize function twice, once for each set, using the training statistics both times), our mean is pretty close to 0 and our standard deviation is very close to 1, and it would be nice to have something to easily check that these are true.

So let's create a test_near_zero function, and then test that the mean is near 0 and that 1 minus the standard deviation is near 0, and that's all good. Let's define n and m and c the same as before, so n and m are the size of the training set (rows and columns), and c is the number of activations we're going to eventually need in our model; and let's try to create our model.
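A sketch of that check and those definitions, assuming the tensors are called x_train and y_train and picking an arbitrary tolerance:

```python
def test_near_zero(a, tol=1e-3):
    assert a.abs() < tol, f"Near zero: {a}"

test_near_zero(x_train.mean())
test_near_zero(1 - x_train.std())

n, m = x_train.shape     # number of training examples, number of columns (784)
c = y_train.max() + 1    # number of classes, i.e. the activations we'll eventually need
```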

Okay, so the model is going to have one hidden layer, and normally we would want the final output to have 10 activations, because we would use cross-entropy against those 10 activations, but to simplify things for now, we're going to not use cross-entropy, we're going to use mean squared error, which means we're going to have one activation, okay, which makes no sense from our modeling point of view, we'll fix that later, but just to simplify things for now.

So let's create a simple neural net with a single hidden layer and a single output activation, which we're going to use mean squared error. So let's pick a hidden size, so the number of hidden will make 50, okay, so our two layers, we're going to need two weight matrices and two bias vectors.

So here are our two weight matrices, w1 and w2. They're random numbers, normal random numbers, of size m, which is the number of columns, 784, by nh, the number of hidden units; and this one is nh by 1. Now our inputs are mean zero, standard deviation 1; that is, the inputs to the first layer are.

We want the inputs to the second layer to also be mean zero, standard deviation 1. Well, how are we going to do that? Because if we just grab some normal random numbers, and then we define a function called linear, this is our linear layer, which is x @ w + b, and then create t, which is the activation of that linear layer with our validation set and our weights and biases.

We have a mean of minus 5 and a standard deviation of 27, which is terrible. So I'm going to let you work through this at home, but once you actually look at what happens when you multiply those things together and add them up, as you do in matrix multiplication, you'll see that you're not going to end up with 0, 1.

But if instead you divide by square root m, so root 784, then it's actually damn good. So this is a simplified version of something which PyTorch calls Kaiming initialization, named after Kaiming He, who wrote a paper, or was the lead author of a paper, that we're going to look at in a moment.

So the weights: randn gives you random numbers with a mean of 0 and a standard deviation of 1. So if you divide by root m, they will have a mean of 0 and a standard deviation of 1 over root m. So we can test this. So in general, normal random numbers with mean 0 and standard deviation 1 over the root of whatever this is (here it's m, and here it's nh) will give you an output that's roughly 0, 1.
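A rough sketch of that experiment (m = 784, nh = 50, and x_valid being the normalized validation set; the function and variable names are just for illustration):

```python
import math
import torch

w1 = torch.randn(m, nh) / math.sqrt(m)   # scaled init: std becomes roughly 1/sqrt(m)
b1 = torch.zeros(nh)

def lin(x, w, b):
    return x @ w + b

t = lin(x_valid, w1, b1)
print(t.mean(), t.std())                 # roughly 0 and 1, instead of -5 and 27
```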

Now this may seem like a pretty minor issue, but as we're going to see in the next couple of lessons, it's like the thing that matters when it comes to training neural nets. It's actually, in the last few months, people have really been noticing how important this is. There are things like fix-up initialization, where these folks actually trained a 10,000-layer deep neural network with no normalization layers, just by basically doing careful initialization.

So it's really, people are really spending a lot of time now thinking like, okay, how we initialize things is really important. And you know, we've had a lot of success with things like one cycle training and super convergence, which is all about what happens in those first few iterations, and it really turns out that it's all about initializations.

So we're going to be spending a lot of time studying this in depth. So the first thing I'm going to point out is that this is actually not how our first layer is defined. Our first layer is actually defined like this. It's got a ReLU on it. So first let's define ReLU.

So ReLU is: just grab our data and replace any negatives with zeros. That's all clamp_min does. Now there's lots of ways I could have written this, but if you can do it with something that's a single function in PyTorch, it's almost always faster, because that thing's generally written in C for you.
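A sketch of that ReLU, with clamp_min doing all the work:

```python
def relu(x):
    # replace every negative value with zero; clamp_min is a single C-implemented PyTorch call
    return x.clamp_min(0.)
```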

So try to find the thing that's as close to what you want as possible. There's a lot of functions in PyTorch. So that's a good way of implementing ReLU. And unfortunately, that does not have a mean zero and standard deviation of one. Why not? Well, where's my stylus? Okay, so we had some data that had a mean of zero and a standard deviation of one.

And then we took everything that was smaller than zero and removed it. So that obviously does not have a mean of zero and it obviously now has about half the standard deviation that it used to have. So this was one of the fantastic insights and one of the most extraordinary papers of the last few years.

It was the paper from the 2015 ImageNet winners led by the person we've mentioned, Kaiming He. Kaiming at that time was at Microsoft Research. And this is full of great ideas. Reading papers from competition winners is a very, very good idea because they tend to be, you know, normal papers will have like one tiny tweak that they spend pages and pages trying to justify why they should be accepted into NeurIPS, whereas competition winners have 20 good ideas and only time to mention them in passing.

This paper introduced us to ResNets, PReLU layers, and Kaiming initialization, amongst other things. So here is section 2.2: initialization of filter weights for rectifiers. What's a rectifier? A rectifier is a rectified linear unit, and a rectifier network is any neural network with rectified linear units in it. This is only 2015, but it already reads like something from another age in so many ways.

Like even the words "rectifier units" and "traditional sigmoid activation networks"; no one uses sigmoid activations anymore, you know. So a lot's changed since 2015. So when you read these papers, you kind of have to keep these things in mind. They describe what happens if you train very deep models with more than eight layers.

So things have changed, right? But anyway, they said that in the old days, people used to initialize these with random Gaussian distributions. So this is a Gaussian distribution. It's just a fancy word for normal or bell shaped. And when you do that, they tend to not train very well.

And the reason why, they point out, or actually Glorot and Bengio pointed out; let's look at that paper. So you'll see two initializations come up all the time. One is Kaiming or He initialization, which is this one, and the other you'll see a lot is Glorot or Xavier initialization, again named after Xavier Glorot.

This is a really interesting paper to read. It's a slightly older one; it's from 2010. It's been massively influential. And one of the things you'll notice if you read it is that it's very readable, it's very practical, and the actual final result they come up with is incredibly simple.

And we're actually going to be re-implementing much of the stuff in this paper over the next couple of lessons. But basically, they describe one suggestion for how to initialize neural nets. And they suggest this particular approach, which is root six over the root of the number of input filters plus the number of output filters.
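Written out, that Glorot/Xavier suggestion (with $n_{in}$ and $n_{out}$ being the number of input and output filters) is the uniform initialization:

$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}},\ \frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}\right]$$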

And so what happened was that Kaiming He and that team pointed out that this does not account for the impact of a ReLU, the thing that we just noticed. So this is a big problem. If your variance halves each layer and you have a massively deep network with, like, eight layers, then you've got a 1 over 2 to the 8 squish.

Like by the end, it's all gone. And if you want to be fancy like the fix up people with 10,000 layers, forget it, right? Your gradients have totally disappeared. So this is totally unacceptable. So they do something super genius smart. They replace the one on the top with a two on the top.

So this, which is not to take anything away from this, it's a fantastic paper, right? But in the end, the thing they do is to stick a two on the top. So we can do that by taking that exact equation we just used and sticking a two on the top.
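In code, the tweak is just this (same variables as the sketch above):

```python
import math
import torch

# Kaiming/He init: the 2 on top undoes the ReLU halving the variance
w1 = torch.randn(m, nh) * math.sqrt(2 / m)
```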

And if we do, then the result is much closer. It's not perfect, right, and it actually varies quite a lot; it's really random. Sometimes it's quite close, sometimes it's further away, but it's certainly a lot better than it was. So that's good, and it's really worth reading. So your homework for this week is to read section 2.2 of the ResNet paper.

And what you'll see is that they describe what happens in the forward pass of a neural net. And they point out that for the conv layer, this is the response, Y equals WX plus B. Now if you're concentrating, that might be confusing because a conv layer isn't quite Y equals WX plus B.

A conv layer has a convolution. But you remember in part one, I pointed out this neat article from Matt Kleinsmith where he showed that the convolutions in CNNs are actually just matrix multiplications with a bunch of zeros and some tied weights. So that's basically all they're saying here. So sometimes there are these kind of throwaway lines in papers that are actually quite deep and worth thinking about.

So they point out that you can just think of this as a linear layer. And then they basically take you through step by step what happens to the variance of your network depending on the initialization. And so just try to get to this point here, get as far as backward propagation case.

So you've got about, I don't know, six paragraphs to read. None of the math notation is weird. Maybe this one is, if you haven't seen it before: it's exactly the same as a capital sigma, but instead of doing a sum, you do a product. So this is a great way to warm up your paper-reading muscles: try to read this section.

And then if that's going well, you can keep going with the backward propagation case because the forward pass does a matrix multiply. And as we'll see in a moment, the backward pass does a matrix multiply with a transpose of the matrix. So the backward pass is slightly different, but it's nearly the same.

And so then at the end of that, they will eventually come up with their suggestion. Let's see if we can find it. Oh yeah, here it is. They suggest root 2 over n_l, where n_l is the number of input activations. Okay. So that's what we're using. That is called Kaiming initialization, and it gives us a pretty nice variance.

It doesn't give us a very nice mean, though. And the reason it doesn't give us a very nice mean is because, as we saw, we deleted everything below the axis, so naturally our mean is now about half, not zero. I haven't seen anybody talk about this in the literature, but something I was just trying over the last week is something kind of obvious, which is to replace ReLU with not just x.clamp_min(0.), but x.clamp_min(0.) minus 0.5.

And in my brief experiments, that seems to help. So there's another thing that you could try out and see if it actually helps, or if I'm just imagining things. It certainly returns you to the correct mean. Okay, so now that we have this formula, we can replace it with init.kaiming_normal_ according to our rules, because it's the same thing.
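As a sketch, the built-in version plus the tweaked ReLU might look like this; note that the minus 0.5 shift is the experimental idea described above, not an established recipe:

```python
import torch
from torch.nn import init

w1 = torch.zeros(m, nh)
init.kaiming_normal_(w1, mode='fan_out')   # same effect as randn * sqrt(2/m) for this weight layout

def relu(x):
    return x.clamp_min(0.) - 0.5           # experimental: shift the mean back towards zero
```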

And let's check that it does the same thing, and it does, okay? So again, we've got this about half mean and bit under one standard deviation. You'll notice here I had to add something extra, which is mode equals fan out. What does that mean? What it means is explained here, fan in or fan out, fan in preserves the magnitude of variance in the forward pass, fan out preserves the magnitudes in the backward pass.

Basically, all it's saying is, are you dividing by root m or root nh? Because if you divide by root m, as you'll see in that part of the paper I was suggesting you read, that will keep the variance at one during the forward pass. But if you use nh, it will give you the right unit variance in the backward pass.

So it's weird that I had to say fan out, because according to the documentation, that's for the backward pass to keep the unit variance. So why did I need that? Well, it's because our weight shape is 784 by 50, but if you actually create a linear layer with PyTorch of the same dimensions, it creates it of 50 by 784.

It's the opposite. So how can that possibly work? And these are the kind of things that it's useful to know how to dig into. So how is this working? So to find out how it's working, you have to look in the source code. So you can either set up Visual Studio code or something like that and kind of set it up so you can jump between things.

It's a nice way to do it. Or you can just do it here with two question marks. And you can see that this is the forward function, and it calls something called F.linear. In PyTorch, a capital F always refers to the torch.nn.functional module; it's used everywhere, so they decided it's worth a single letter.

So torch.nn.functional.linear is what it calls, and let's look at how that's defined: input.matmul(weight.t()), where t() means transpose. So now we know that in PyTorch, a linear layer doesn't just do a matrix product; it does a matrix product with a transpose. So in other words, it's actually going to turn this into 784 by 50 and then do it.

And so that's why we kind of had to give it the opposite information when we were trying to do it with our linear layer, which doesn't have transpose. So the main reason I show you that is to kind of show you how you can dig in to the PyTorch source code, see exactly what's going on.
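You can check the transpose business yourself; a quick sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Linear(784, 50)
print(layer.weight.shape)                     # torch.Size([50, 784]): the "opposite" layout

x = torch.randn(3, 784)
out = F.linear(x, layer.weight, layer.bias)   # internally: x.matmul(weight.t()) + bias
assert torch.allclose(out, layer(x))
```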

And when you come across these kinds of questions, you want to be able to answer them yourself. Which also then leads to the question: if this is how linear layers get initialized, what about convolutional layers? What does PyTorch do for convolutional layers? So we could look inside torch.nn.Conv2d, and when I looked at it, I noticed that it basically doesn't have any code.

It just has documentation. All of the code actually gets passed down to something called _ConvNd. And so you need to know how to find these things. If you go to the very bottom, you can find the file name it's in, and you see this is actually torch.nn.modules.conv.

So we can find _ConvNd in torch.nn.modules.conv. And so here it is, and here's how it initializes things. And it calls kaiming_uniform_, which is basically the same as kaiming_normal_, but uniform instead. But it has a special multiplier of math.sqrt(5), and that is not documented anywhere. I have no idea where it comes from.

And in my experiments, this seems to work pretty badly, as you'll see. So it's kind of useful to look inside the code. And when you're writing your own code, presumably somebody put this here for a reason. Wouldn't it have been nice if they had a URL above it with a link to the paper that they're implementing so we could see what's going on?

So it's always a good idea, always, to put some comments in your code to let the next person know what the hell you're doing. So that particular thing, I have a strong feeling, isn't great, as you'll see. So we're going to try this thing: it's subtracting 0.5 from our ReLU.

So this is pretty cool, right? We've already designed our own new activation function. Is it great? Is it terrible? I don't know. But it's this kind of level of tweak, which is kind of-- when people write papers, this is the level of-- it's like a minor change to one line of code.

It'll be interesting to see how much it helps. But if I use it, then you can see here, yep, now I have a mean of 0 thereabouts. And interestingly, I've also noticed it helps my variance a lot. All of my variance, remember, was generally around 0.7 to 0.8. But now it's generally above 0.8.

So it helps both, which makes sense as to why I think I'm seeing these better results. So now we have ReLU. We have a linear. We have init. So we can do a forward pass. So we're now up to here. And so here it is. And remember, in PyTorch, a model can just be a function.

And so here's our model. It's just a function that does one linear layer, one ReLU layer, and one more linear layer. And let's try running it. And OK, it takes eight milliseconds to run the model on the validation set. So it's plenty fast enough to train. It's looking good.
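A sketch of that model-as-a-function, with w1, b1, w2, b2 being the initialized parameters from above:

```python
def model(xb):
    l1 = lin(xb, w1, b1)   # first linear layer
    l2 = relu(l1)          # non-linearity
    l3 = lin(l2, w2, b2)   # second linear layer, down to a single activation
    return l3
```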

Add an assert to make sure the shape seems sensible. So the next thing we need for our forward pass is a loss function. And as I said, we're going to simplify things for now by using mean squared error, even though that's obviously a dumb idea. Our model is returning something of size 10,000 by 1.

But mean squared error, you would expect it just to be a single vector of size 10,000. So I want to get rid of this unit axis. In PyTorch, the thing to add a unit axis we've learned is called squeeze-- sorry, unsqueeze. The thing to get rid of a unit axis, therefore, is called squeeze.

So we just go output.squeeze() to get rid of that unit axis. But actually, now I think about it, this is lazy, because output.squeeze() gets rid of all unit axes. And we very commonly see on the fast.ai forums people saying that their code's broken, and it's when they've got squeeze.

And it's that one case where maybe they had a batch size of size 1. And so that 1,1 would get squeezed down to a scalar. And things would break. So rather than just calling squeeze, it's actually better to say which dimension you want to squeeze, which we could write either 1 or minus 1, it would be the same thing.
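A sketch of that loss, with the squeeze done explicitly on the last dimension:

```python
def mse(output, targ):
    # squeeze only the trailing unit axis, so a batch size of 1 still behaves correctly
    return (output.squeeze(-1) - targ).pow(2).mean()
```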

And this is going to be more resilient now to that weird edge case of a batch size of 1. OK, so output minus target, squared, mean: that's mean squared error. So remember, in PyTorch, loss functions can just be functions. For mean squared error, we're going to have to make sure these are floats.

So let's convert them. So now we can calculate some predictions; that's the shape of our predictions, and we can calculate our mean squared error. So there we go. So we've done a forward pass. So we're up to here. A forward pass is useless. What we need is a backward pass, because that's the thing that tells us how to update our parameters.

So we need gradients. OK, how much do you want to know about matrix calculus? I don't know; it's up to you. But if you want to know everything about matrix calculus, I can point you to this excellent paper by Terence Parr and Jeremy Howard, which tells you everything about matrix calculus from scratch.

So this is a few weeks' work to get through, but it absolutely assumes nothing at all. So basically, Terence and I both felt like, oh, we don't know any of this stuff; let's learn all of it and tell other people. And so we wrote it with that in mind.

And so this will take you all the way up to knowing everything that you need for deep learning. You can actually get away with a lot less. But if you're here, maybe it's worth it. But I'll tell you what you do need to know. What you need to know is the chain rule.

Because let me point something out. We start with some input, and we stick it through the first linear layer, and then we stick it through ReLU, and then we stick it through the second linear layer, which gives us our predictions, and then we stick it through MSE.

Or to put it another way, we start with x. And we put it through the function lin1. And then we take the output of that, and we put it through the function ReLU. And then we take the output of that, and we put it through the function lin2. And then we take the output of that, and we put it through the function MSE.

And strictly speaking, MSE has a second argument, which is the actual target value. And we want the gradient of the output with respect to the input. So it's a function of a function of a function of a function of a function. So if we simplify that down a bit, we could just say, what if it's just y = f(u) and u = g(x)?

So that's like a function of a function. Simplify it a little bit. And the derivative is that. That's the chain rule. If that doesn't look familiar to you, or you've forgotten it, go to Khan Academy. Khan Academy has some great tutorials on the chain rule. But this is actually the thing we need to know.
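Written out, with $y = f(u)$ and $u = g(x)$, the chain rule is just:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$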

Because once you know that, then all you need to know is the derivative of each bit on its own, and you just multiply them all together. And if you ever forget the chain rule, just cross-multiply. So that would be dy/du times du/dx; cross out the du's and you get dy/dx. And if you went to a fancy school, they would have told you not to do that.

They said you can't treat calculus like this, because they're special magic small things. Actually you can. There's actually a different way of treating calculus called the calculus of infinitesimals, where all of this just makes sense. And you suddenly realize you actually can do this exact thing. So any time you see a derivative, just remember that all it's actually doing is it's taking some function, and it's saying, as you go across a little bit, how much do you go up?

And that it's dividing that change in y divided by that change in x. That's literally what it is, where y and x, you must make them small numbers. So they behave very sensibly when you just think of them as a small change in y over a small change in x, as I just did, showing you the chain rule.

So to do the chain rule, we're going to have to start with the very last function. The very last function on the outside was the loss function, mean squared error. So we just do each bit separately. So the gradient of the loss with respect to output of previous layer.

So for the output of the previous layer, the MSE is just input minus target, squared. And so the derivative of that is just 2 times input minus target, because the derivative of blah squared is 2 times blah. So that's it. Now I need to store that gradient somewhere. Now the thing is, for the chain rule I'm going to need to multiply all these things together.

So if I store it inside the .g attribute of the previous layer (because remember, the input of MSE is the same as the output of the previous layer), then I can quite comfortably refer to it later.

So here, look, ReLU; let's do ReLU. So ReLU is this, okay: what's the gradient there? 0. What's the gradient there? 1. So therefore, that's the gradient of the ReLU: it's just inp > 0. But we need the chain rule, so we need to multiply this by the gradient of the next layer, which remember we stored away.

So we can just grab it. So this is really cool. So same thing for the linear layer, the gradient is simply, and this is where the matrix calculus comes in, the gradient of a matrix product is simply the matrix product with the transpose. So you can either read all that stuff I showed you, or you can take my word for it.
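A sketch of those three gradient functions, each stashing its result in a .g attribute on its input, as described above (names here are illustrative; note that the mean in MSE also contributes a 1/n factor):

```python
def mse_grad(inp, targ):
    # derivative of mean((inp - targ)^2) with respect to inp
    inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / inp.shape[0]

def relu_grad(inp, out):
    # gradient of relu is 1 where the input was positive, times the gradient from the next layer
    inp.g = (inp > 0).float() * out.g

def lin_grad(inp, out, w, b):
    # the gradient of a matrix product is a matrix product with the transpose
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)
```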

So here's the cool thing, right? Here's the function which does the forward pass that we've already seen, and then it goes backwards. It calls each of the gradients backwards, right, in reverse order, because we know we need that for the chain rule. And you can notice that every time we're passing in the result of the forward pass, and it also has access, as we discussed, to the gradient of the next layer.

This is called backpropagation, right? So when people say, as they love to do, backpropagation is not just the chain rule, they're basically lying to you. Backpropagation is the chain rule, where we just save away all the intermediate calculations so we don't have to calculate them again. So this is a full forward and backward pass.
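Putting those pieces together, a sketch of the full forward-and-backward function (using the names from the sketches above):

```python
def forward_and_backward(inp, targ):
    # forward pass, keeping every intermediate result around
    l1 = lin(inp, w1, b1)
    l2 = relu(l1)
    out = lin(l2, w2, b2)
    loss = mse(out, targ)     # computed for reporting, but not needed for the gradients

    # backward pass, in reverse order, chaining the gradients via the .g attributes
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)
```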

One interesting thing here is this value, loss: we never actually use it, because the loss never actually appears in the gradients. I mean, just by the way, you still probably want it so you can print it out or whatever, but it's not something that appears in the gradients.

So that's it. So w1.g, w2.g, et cetera now contain all of our gradients, which we're going to use for the optimizer. And so let's cheat and use PyTorch autograd to check our results, because PyTorch can do this for us. So let's clone all of our weights and biases and input, and then turn on requires_grad for all of them.

So requires_grad_ is how you take a PyTorch tensor and turn it into a magical autograd-ified PyTorch tensor. So what it's now going to do is, everything that gets calculated with that tensor, it's basically going to keep track of what happened. So it basically keeps track of these steps, so that then it can do these things.

It's not actually that magical; you could totally write it yourself. You just need to make sure that each time you do an operation, you remember what it is, so that you can then go back through them in reverse order. Okay, so now that we've done requires_grad_, we can just do the forward pass like so; that gives us the loss. In PyTorch, you say loss.backward(), and now we can test it. Remember PyTorch doesn't store things in .g, it stores them in .grad, and we can compare them, and all of our gradients were correct, or at least they're the same as PyTorch's.
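A sketch of that autograd check, assuming forward_and_backward has already been run on x_train and the float targets y_train:

```python
import torch

# clone the parameters and input, and ask PyTorch to track operations on them
w1a, b1a = w1.clone().requires_grad_(True), b1.clone().requires_grad_(True)
w2a, b2a = w2.clone().requires_grad_(True), b2.clone().requires_grad_(True)
xa = x_train.clone().requires_grad_(True)

loss = mse(lin(relu(lin(xa, w1a, b1a)), w2a, b2a), y_train)
loss.backward()

# PyTorch puts its gradients in .grad; compare them with our hand-computed .g
assert torch.allclose(w1a.grad, w1.g, rtol=1e-3, atol=1e-5)
assert torch.allclose(b2a.grad, b2.g, rtol=1e-3, atol=1e-5)
```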

So that's pretty interesting, right, I mean that's an actual neural network that kind of contains all the main pieces that we're going to need, and we've written all these pieces from scratch, so there's nothing magical here, but let's do some cool refactoring. I really love this refactoring, and this is massively inspired by and very closely stolen from the PyTorch API, but it's kind of interesting, I didn't have the PyTorch API in mind as I did this, but as I kept refactoring, I kind of noticed like, oh, I just recreated the PyTorch API, that makes perfect sense.

So let's take each of our layers, ReLU and linear, and create classes, right? And for the forward, let's use __call__ (dunder call). Now, do you remember that dunder call means we can treat this as if it were a function? So if you call this class instance just with parentheses, it calls this function.

And let's save the input, let's save the output, and let's return the output, right? And then backward: do you remember, this was our backward pass, okay, so it's exactly the same as we had before, but we're going to save the result inside the input's .g attribute. So this is exactly the same code as we had here, but I've just moved the forward and backward into the same class, right?

So here's linear, forward, exactly the same, but each time I'm saving the input, I'm saving the output, I'm returning the output, and then here's our backward. One thing to notice, the backward pass here, for linear, we don't just want the gradient of the outputs with respect to the inputs, we also need the gradient of the outputs with respect to the weights and the output with respect to the biases, right, so that's why we've got three lots of dot g's going on here, okay, so there's our linear layers forward and backward, and then we've got our mean squared error, okay, so there's our forward, and we'll save away both the input and the target for using later, and there's our gradient, again, same as before, two times input minus target.

So with this refactoring, we can now create our model. We can just say, let's create a Model class and create something called .layers with a list of all of our layers. Notice I'm not using any PyTorch machinery; this is all from scratch. Let's define loss, and then let's define __call__, and it's going to go through each layer and say x = l(x). So this is how I do that function composition: we're just calling each function on the result of the previous thing, and then at the very end call self.loss on that. And then for backward we do the exact opposite: we go self.loss.backward(), and then we go through the reversed layers and call backward on each one, and remember the backward passes are going to save the gradients away inside .g. So with that, let's just set all of our gradients to None so that we know we're not cheating. We can then create our model, this Model class, and call it, and we can call it as if it were a function because we have __call__; and then we can call backward, and then we can check that our gradients are correct. So that's nice.

One thing that's not nice is: holy crap, that took a long time. Let's run it. There we go, 3.4 seconds. So that was really, really slow, so we'll come back to that. I don't like duplicate code, and there's a lot of duplicate code here: self.inp = inp, return self.out. That's messy, so let's get rid of it. What we could do is create a new class called Module, which basically does the self.inp = inp and return self.out for us. And so now we're not going to use __call__ to implement our forward; we're going to have it call something called self.forward, which we will initially set to raise an exception, not implemented, and backward is going to call self.bwd, passing in the thing that we just saved. And so now Relu has something called forward, which just has that, so we're basically back to where we were, and backward just has that. So now look how neat that is.

And we also realized that this thing we were doing to calculate the derivative of the output of the linear layer with respect to the weights, where we were doing an unsqueeze and an unsqueeze, which is basically a big outer product and a sum, we could actually re-express with einsum. And when we do that, our code is now neater, and our 3.4 seconds is down to 143 milliseconds. So thank you again to einsum. So you'll see this now: look, model = Model(...), loss = model(...), then backward, and now the gradients are all there. That looks almost exactly like PyTorch, and so we can see why it's done this way: why do we have to inherit from nn.Module, why do we have to define forward? This is why. It lets PyTorch factor out all this duplicate stuff, so all we have to do is write the implementation. So I think that's pretty fun.

And then once we thought more about it, more like, what are we doing with this einsum, we actually realized that it's exactly the same as just doing input transpose times output. So we replaced the einsum with a matrix product, and that's 140 milliseconds. And so now we've basically implemented nn.Linear and nn.Module, so let's now use nn.Linear and nn.Module, because we're allowed to; that's the rules. And their forward pass is almost exactly the same speed as our forward pass, and their backward pass is about twice as fast. I'm guessing that's because we're calculating all of the gradients, and they're not calculating all of them, only the ones they need, but it's basically the same thing.
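A compact sketch of that refactoring; the class and attribute names mirror the description above, though the real notebook may differ in details:

```python
class Module():
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)   # save the output for the backward pass
        return self.out
    def forward(self):
        raise Exception('not implemented')
    def backward(self):
        self.bwd(self.out, *self.args)

class Relu(Module):
    def forward(self, inp):
        return inp.clamp_min(0.) - 0.5   # keeping the experimental -0.5 shift from earlier
    def bwd(self, out, inp):
        inp.g = (inp > 0).float() * out.g

class Lin(Module):
    def __init__(self, w, b):
        self.w, self.b = w, b
    def forward(self, inp):
        return inp @ self.w + self.b
    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()
        self.w.g = inp.t() @ out.g       # the einsum version simplifies to a plain matrix product
        self.b.g = out.g.sum(0)

class Mse(Module):
    def forward(self, inp, targ):
        return (inp.squeeze(-1) - targ).pow(2).mean()
    def bwd(self, out, inp, targ):
        inp.g = 2. * (inp.squeeze(-1) - targ).unsqueeze(-1) / targ.shape[0]

class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
        self.loss = Mse()
    def __call__(self, x, targ):
        for l in self.layers:
            x = l(x)                      # function composition: each layer feeds the next
        return self.loss(x, targ)
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers):
            l.backward()

# usage: forward pass, then backward pass; gradients end up in w1.g, b1.g, w2.g, b2.g
model = Model(w1, b1, w2, b2)
loss = model(x_train, y_train)
model.backward()
```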

So at this point, we're ready in the next lesson to do a training loop. We have a multi-layer, fully connected neural network, what the He paper would call a rectifier network; we have matrix multiply organized; we have our forward and backward passes organized, all nicely refactored out into classes and a Module class. So in the next lesson, we will see how far we can get; hopefully we will build a high-quality, fast ResNet, and we're also going to take a very deep dive into optimizers and callbacks and training loops and normalization methods.

Any questions before we go? No? That's great. Okay, thanks everybody, see you on the forums. (audience applauds)