Lesson 2: Deep Learning 2019 - Data cleaning and production; SGD from scratch

Welcome to Lesson 2, where we're going to be taking a deeper dive into computer vision applications and taking some of the amazing stuff that you've all been doing during the week and going even further. So let's take a look. Before we do, a reminder that we have these two really important topics on the forums.

They're pinned at the top of the forum category. One is FAQ, resources, and official course updates. This is where, if there's something useful for you to know during the course, we will post there. Nobody else can reply to that thread. So if you set that thread to Watching and to Notifications, you're not going to be bugged by anybody else except stuff that we think you need to know for the course.

And it's got all the official information about how to get set up on each platform. Please note, a lot of people post all kinds of other tidbits about how they've set up things on previous solutions or previous courses or other places. I don't recommend you use those because these are the ones that we're testing every day and that the folks involved in these platforms are testing every day and they definitely work.

So I would strongly suggest you follow those tips. And if you do have a question about using one of these platforms, please use these discussions, not some other topic that you create. Because this way, people that are involved in these platforms will be able to see it and things won't get messy.

And then secondly, for every lesson there will be an official updates thread for that lesson. So Lesson 1, official updates, and the same thing. Only FastAI people will be posting to that. So you can watch it safely and we'll have all the things like the videos, the notebooks, and so forth.

And they're all wiki threads, so you can help us to make them better as well. So I mentioned the idea of watching a thread. So this is a really good idea, is that you can go to a thread, like particularly those official update ones, and click at the bottom, "Watching".

And if you do that, that's going to enable notifications for any updates to that thread. Particularly if you go into, click on your little user name in the top right, "Preferences", and turn this on. That will give you an email as well. So any of you that have missed some of the updates so far, go back and have a look through because we're really trying to make sure that we keep you updated with anything that we think is important.

One thing which can be more than a little overwhelming is even now, after just one week, the most popular thread has 1.1 thousand replies. So that's an intimidatingly large number. I've actually read every single one of them, and I know Rachel has, and I know Sylvain has, and I think Francisco has.

But you shouldn't need to. What you should do is click "Summarize this topic" and it'll appear like this, which is all of the most liked ones will appear, and then there'll be "View 31 hidden replies" or whatever in between. So that's how you navigate these giant topics. That's also why it's important you click the "Like" button, because that's the thing that's going to cause people to see it in this recommended view.

So when you come back to work, hopefully you've realized by now that on the official course website, course-v3.fast.ai, you will click "Returning to Work". You will click the name of the platform you're using, and you will then follow the two steps. Step one will be how to make sure that you've got the latest notebooks, and step two will be how to make sure you've got the latest Python library software.

They all look pretty much like this, but they're slightly different from platform to platform, so please don't use some different set of commands you read somewhere else. Only use the commands that you read about here, and that will make everything very smooth. If things aren't working for you, if you get into some kind of messy situation, which we all do, just delete your instance and start again. Unless you've got mission-critical stuff there, it's the easiest way to get out of a sticky situation.

And if you follow the instructions here, you really should find it works fine. So, this is what I really wanted to talk about, most of all, is what people have been doing this week. If you've noticed, and a lot of you have, there's been 167 people sharing their work.

And this is really cool, because it's pretty intimidating to put yourself out there and say like, "I'm new to all this, but here's what I've done." And so one example of the things I thought were really interesting was figuring out who's talking. Is it Ben Affleck or Joe Rogan? I thought this is really interesting.

This is actually very practical. I wanted to clean up my WhatsApp downloaded images to get rid of memes, so I actually built a little neural network. I mean, how cool is that to say like, "Oh yeah, I've got something that cleans up my WhatsApp. It's a deep learning application I wrote last week." Why not?

Like, it's so easy now. You can do stuff like this. And then there's been some really interesting projects. One was looking at the sound data that was used in this paper. And in this paper, they were trying to figure out what kind of sound things were, and they got, as you would expect since they published a paper, they got a state of the art of nearly 80% accuracy.

Ethan Sutin then tried using the Lesson 1 techniques and got 80.5% accuracy. So I think this is pretty awesome. Best as we know, it's a new state of the art for this problem. Maybe somebody since then has published something. We haven't found it yet. So take all of these with a slight grain of salt.

But I've mentioned them on Twitter, and lots of people on Twitter follow me, so if anybody knew that there was a much better approach, I'm sure somebody would have said so. This one is pretty cool. Suvash has a new state of the art accuracy for Devanagari text recognition. I think he's got it even higher than this now.

And this is actually confirmed by the person on Twitter who created the dataset. I don't think he had any idea. He just posted, "Hey, here's a nice thing I did." And this guy on Twitter was like, "Oh, I made that dataset. Congratulations. You've got a new record." So that was pretty cool.

I really like this post from Alena Harley. She describes in quite a bit of detail the issue of metastasizing cancers and the use of point mutations and why that's a challenging, important problem. And she's got some nice pictures describing what she wants to do with this and how she can go about turning this into pictures.

Hey, see, this is the cool trick, right? It's the same with this sounds one, turning sounds into pictures and then using the Lesson 1 approach. And here it's turning point mutations into pictures and then using the Lesson 1 approach. And what did she find? It seems that she's got a new state of the art result by more than 30%, beating the previous best.

Somebody on Twitter who's a VP at a genomics analysis company looked at this as well and thought it looked to be a state of the art in this particular point mutation one as well. So that's pretty exciting. So you can see when we talked about last week this idea that this simple process is something which can take you a long way.

It really can. I will mention that something like this one in particular is using a lot of domain expertise, like it's figuring out what picture to create. I wouldn't know how to do that because I don't even really know what a point mutation is, let alone how to create something that visually is meaningful that a CNN could recognize.

But the actual deep learning side is actually pretty straightforward. Another very cool result from Simon Willison and Natalie Downe, they created a "Cougar or Not" web application over the weekend and won the Science Hack Day award in San Francisco. And so I think that's pretty fantastic. So lots of examples of people doing really interesting work.

Hopefully this will be inspiring to you to think, wow, this is cool that I can do this with what I've learned. It can also be intimidating to think like, wow, these people are doing amazing things. But it's important to realize out of the thousands of people doing this course, you know, I'm just picking out a few of the really amazing ones.

And in fact, Simon is one of these very annoying people like Christine Payne, who we talked about last week, who seems to be good at everything he does. He created Django, one of the world's most popular web frameworks. He founded a very successful startup and blah, blah, blah, blah, blah.

So, you know, one of these really annoying people who tends to keep being good at things, now it turns out he's good at deep learning as well. So, you know, that's fine. Simon can go and win a hackathon on his first week of playing with deep learning. Maybe it'll take you two weeks to win your first hackathon.

That's okay. And I think like it's important to mention this because there was this really inspiring blog post this week from James Dellinger, who talked about how he created a bird classifier using the techniques from lesson one. But what I really found interesting was at the end, he said he nearly didn't start on deep learning at all because he went to the scikit-learn website, which is one of the most important libraries of Python, and he saw this.

And he described in this blog post how he was just like, "That's not something I can do. That's not something I understand." And then this kind of realization of like, "Oh, I can do useful things without reading the Greek." So I thought that was a really cool message. And I really want to highlight, actually, Daniel Armstrong on the forum, who I think is a great role model here. He was here saying, "I want to contribute to the library, and I looked at the docs, and I just found it overwhelming."

And his next message one day later was, "I don't know what any of this is. I didn't know how much there was to it. It caught me off guard. My brain shut down. But I love the way it forces me to learn so much." And then one day later, "I just submitted my first pull request."

So I think that's awesome. It's okay to feel intimidated. There's a lot. But just pick one piece and dig into it. Try and push a piece of code or a documentation update, or create a classifier, or whatever. So here's lots of cool classifiers people have built. It's been really, really inspiring.

A Trinidad and Tobago islanders versus masqueraders classifier, a zucchini versus cucumber classifier. This one was really nice. This was taking the dog breeds, dog and cat breeds thing from last week, and actually doing some exploratory work to see what the main features were, and discovering that they could now create a hairiness classifier.

And so here we have the most hairy dogs and the most bald cats. So there are interesting things you can do with interpretation. Somebody else in the forum took that and did the same thing for anime to find that they had accidentally discovered an anime hair color classifier. We can now detect the new versus the old Panamanian buses correctly.

Apparently these are the new ones. I much prefer the old ones, but maybe that's just me. This was really interesting. Henri Palacci discovered that he can recognize with 85% accuracy which of 110 countries a satellite image is of, which has definitely got to be beyond human performance of just about anybody.

Like I can't imagine anybody who can do that in practice. So that was fascinating. Batik cloth classification with 100% accuracy. David Ward did this interesting one. He actually went a little bit further, using some techniques we'll be discussing in the next couple of courses, to build something that can recognize complete, incomplete, or foundation buildings and actually plot them on an aerial satellite view.

So lots and lots of fascinating projects. So don't worry, it's only been one week. It doesn't mean everybody has to have had a project out yet. A lot of the folks who already have a project out have done a previous course, so they've got a bit of a head start.

But we'll see today how you can definitely create your own classifier this week. So from today after we dig a bit deeper into really how to make these computer vision classifiers in particular work well, we're then going to look at the same thing for text. We're then going to look at the same thing for tabular data, so they're kind of like more like spreadsheets and databases.

Then we're going to look at collaborative filtering, so kind of recommendation systems. That's going to take us into a topic called embeddings, which is basically a key underlying platform behind these applications. That will take us back into more computer vision and then back into more NLP. So the idea here is that it turns out that it's much better for learning if you kind of see things multiple times.

So rather than being like, okay, that's computer vision, you won't see it again for the rest of the course, we're actually going to come back to the two key applications, NLP and computer vision, a few weeks apart, and that's going to force your brain to realize like, oh, I have to remember this.

It's not just something I can throw away. So people who have more of a hard sciences kind of background in particular, a lot of folks find this "hey, here's some code, type it in, start running it" approach, rather than a "here's lots of theory" approach, confusing and surprising and odd at first.

And so for those of you, I just wanted to remind you this basic tip, which is keep going. You're not expected to remember everything yet. You're not expected to understand everything yet. You're not expected to know why everything works yet. You just want to be in a situation where you can enter the code and you can run it and you can get something happening and then you can start to experiment and you kind of get a feel for what's going on and then push on, right?

Most of the people who have done the course and have gone on to be really successful watch the videos at least three times. So they kind of go through the whole lot and then go through it slowly the second time, then they go through it really slowly the third time and I consistently hear them say I get a lot more out of it each time I go through.

So don't pause at lesson one and stop until you can continue. So this approach is based on a lot of research, academic research into learning theory and one guy in particular, David Perkins from Harvard has this really great analogy. He's a researcher into learning theory. He describes this approach of the whole game, which is basically if you're teaching a kid to play soccer, you don't first of all teach them about how the friction between a ball and grass works and then teach them how to sew a soccer ball with their bare hands and then teach them the mathematics of parabolas when you kick something in the air.

No, you say here's a ball, let's watch some people playing soccer. Okay, now we'll play soccer and then you, you know, gradually over the following years learn more and more so that you can get better and better at it. So this is kind of what we're trying to get you to do is to play soccer, which in our case is to type code and look at the inputs and look at the outputs.

Okay, so let's dig into our first notebook, which is called lesson2-download. And what we're going to do is we're going to see how to create your own classifier with your own images. So it's going to be a lot like last week's pet detector, but it'll detect whatever you like.

So it'll be like some of those examples we just saw. How would you create your own Panama bus detector from scratch? So this is inspired, the approach is inspired by Adrian Rosebrock, who has a terrific website called PyImageSearch, and he has this nice explanation of how to create a dataset using Google Images.

So that was definitely an inspiration for some of the techniques we use here. So thank you to Adrian. And you should definitely check out his site. It's full of lots of good resources. So here we are. So we are going to try to create a teddy bear detector. Thanks.

We're going to try and make a teddy bear detector and we're going to try and separate teddy bears from black bears from grizzly bears. Now this is very important. I have a three-year-old daughter and she needs to know what she's dealing with. In our house you would be surprised at the number of monsters, lions and other terrifying threats that are around, particularly around Halloween.

And so we always need to be on the lookout to make sure that the thing we're about to cuddle is in fact a genuine teddy bear. So let's deal with that situation as best as we can. So our starting point is to find some pictures of teddy bears so we can learn what they look like.

So I go to images.google.com and I type in "teddy bear" and I just scroll through until I kind of find a goodly bunch of them. And it's like okay, that looks like plenty of teddy bears to me. So then I'll go back to here. So you can see it says search and scroll.

Go to Google Images and search. And the next thing we need to do is to get a list of all of the URLs there. And so to do that, back in your Google Images you hit Ctrl+Shift+J or Command+Option+J and you paste this into the window that appears. So I've got Windows, so I go Ctrl+Shift+J and I paste in that code.

So this is a JavaScript console. For those of you who haven't done any JavaScript before, I hit Enter and it downloads my file for me. So I would call this "teddies.txt" and press Save. Okay, so I now have a file of teddies or URLs of teddies. So then I would repeat that process for black bears and for brown bears since that's a classifier I would want.

And I'd put each one in a file with an appropriate name. So that's step one. So step two is we now need to download those URLs to our server. Because remember when we're using Jupyter Notebook it's not running on our computer. It's running on SageMaker or Crestle or Google Cloud or whatever.

So to do that we start running some Jupyter cells. So let's grab the FastAI library and let's start with black bears. I've already got my black bears URL. So I click on this cell for black bears and I run it. See here how I've got three different cells doing the same thing with different information?

This is one way I like to work with Jupyter Notebook. It's something that a lot of people with a more strict scientific background are horrified by. This is not reproducible research. So I actually click here and I run this cell to create a folder called "black" and a file called "URL black" for my black bears.

I skip the next two cells. And then I run this cell to create that folder. And then I go down to the next section and I run the next cell which is "download images" for black bears. So that's just going to download my black bears to that folder. And then I'll go back and I'll click on "teddies" and I run that cell and then scroll back down and I'll run this cell.

And so that way I'm just going backwards and forwards to download each of the classes that I want. Very manual, but for me I'm very iterative and very experimental. That works well for me. If you're better at kind of planning ahead than I am, you can write a proper loop or whatever and do it that way.
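For anyone who prefers the "proper loop" mentioned above, here is a minimal sketch of what that could look like, using only the Python standard library. The class names mirror the three bear classes; the URL file naming (urls_<class>.txt) and the helper name prepare_class_folders are hypothetical, and the actual download call from the notebook is deliberately left out.

```python
from pathlib import Path
import tempfile

# Class names mirror the three bear classes; the URL file names are hypothetical.
classes = ["black", "teddies", "grizzly"]

def prepare_class_folders(root, class_names):
    """Create one folder per class and return (folder, url_file) pairs."""
    root = Path(root)
    jobs = []
    for name in class_names:
        dest = root / name
        dest.mkdir(parents=True, exist_ok=True)
        url_file = root / f"urls_{name}.txt"  # hypothetical naming scheme
        jobs.append((dest, url_file))
    return jobs

# A temporary directory stands in for the data folder on the server.
root = Path(tempfile.mkdtemp()) / "bears"
jobs = prepare_class_folders(root, classes)
for dest, url_file in jobs:
    # In the notebook, the download step for this class would run here.
    print(f"download URLs listed in {url_file.name} into {dest.name}/")
```

The point is only that one loop replaces the three hand-run configuration cells; whether that is better than clicking back and forth is a matter of taste.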

But when you see my notebooks and see things where there's these kind of like configuration cells doing the same thing in different places, this is a strong sign that I didn't run this in order. I clicked one place, went to another, ran that, went back, went back, went back.

And for me, I'm an experimentalist. I really like to experiment in my notebook. I treat it like a lab journal. I try things out and I see what happens. And so this is how my notebooks end up looking. It's a really controversial topic. Like for a lot of people they feel this is like wrong, that you should only ever run things top to bottom.

Everything you should do should be reproducible. For me, I don't think that's the best way of using human creativity. I think human creativity is best inspired by trying things out, seeing what happens and fiddling around. So you can see how you go, see what works for you. So that will download the images to your server.

It's going to use multiple processes to do so. And one problem there is if something goes wrong, it's a bit hard to see what went wrong. So you can see in the next section there's a commented out section that says max_workers=0. That will do it without spinning up a bunch of processes and will tell you the errors better.

So if things aren't downloading, try using the second version. Okay, so I grabbed a small number of each. And then the next thing that I found I needed to do was to remove the images that aren't actually images at all. And this happens all the time. There's always a few images in every batch that are corrupted for whatever reason.

Google Images tried to tell us that this URL had an image, but actually it doesn't anymore. So we've got this thing in the library called verify_images, which will check all of the images in a path and will tell you if there's a problem. If you say delete=True, it will actually delete it for you.

So that's a really nice easy way to end up with a clean dataset. So at this point I now have a bears folder containing a grizzly folder and a teddies folder and a black folder. In other words, I have the basic structure we need to create an ImageDataBunch to start doing some deep learning.
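To make the idea of weeding out files that aren't really images concrete, here is a rough standard-library sketch. It is not the library's verify_images; it is a simplified stand-in that only looks at the first few bytes of each file, and the names looks_like_image and remove_non_images are made up for this example. A fuller check would actually try to decode each file.

```python
from pathlib import Path
import tempfile

# Leading bytes of a few common image formats (not an exhaustive list).
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF (1987 spec)
    b"GIF89a",             # GIF (1989 spec)
)

def looks_like_image(path):
    """Return True if the file starts with a known image signature."""
    with open(path, "rb") as f:
        header = f.read(8)
    return header.startswith(IMAGE_SIGNATURES)

def remove_non_images(folder, delete=False):
    """List files that fail the signature check; delete them if asked."""
    bad = [p for p in sorted(Path(folder).iterdir())
           if p.is_file() and not looks_like_image(p)]
    if delete:
        for p in bad:
            p.unlink()
    return bad

# Demo on a throwaway folder: one fake PNG and one saved error page.
demo = Path(tempfile.mkdtemp())
(demo / "ok.png").write_bytes(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
(demo / "broken.jpg").write_bytes(b"<html>404 Not Found</html>")
bad_files = remove_non_images(demo, delete=True)
print([p.name for p in bad_files])
```

The second file is the typical failure case: the URL used to point at an image, and what got saved is an error page with an image file extension.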

So let's go ahead and do that. Now, very often when you download a dataset from like Kaggle or from some academic dataset, there will often be a folder called train and a folder called valid and a folder called test, right, containing the different datasets. In this case, we don't have a separate validation set because we just grab these images from Google search, right?

But you still need a validation set, otherwise you don't know how well your model is going. We'll talk more about this in a moment. So whenever you create a data bunch, if you don't have a separate training and validation set, then you can just say, okay, well the training set is in the current folder, because by default it looks in a folder called train.

And I want you to set aside 20% of the data, please. So this is going to create a validation set for you automatically and randomly. You'll see that whenever I create a validation set randomly, I always set my random seed to something fixed beforehand. This means that every time I run this code, I'll get the same validation set.

So in general, I'm not a fan of making my machine learning experiments reproducible, i.e. ensuring I get exactly the same result every time. The randomness is to me really important, a really important part of finding out is your solution stable? Is it going to work like each time you run it?

But what is important is that you always have the same validation set. Otherwise, when you're trying to decide, has this hyperparameter change improved my model, but you've got a different set of data you're testing it on, then you don't know maybe that set of data just happens to be a bit easier.

So that's why I always set the random seed here. So we've now got, let's run that cell, so we've now got a data bunch. And so you can look inside at the data.classes and you'll see these are the folders that we created. So it knows that the classes, so by classes we mean all the possible labels, black bear, grizzly bear or teddy bear.
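Here is a small sketch of why fixing the seed gives the same validation set every time, written with the standard library only. The helper split_train_valid, the seed value 42, and the file names are all assumptions for illustration; this is not how the library does its split internally.

```python
import random

def split_train_valid(items, valid_pct=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out valid_pct of the items."""
    rng = random.Random(seed)      # fixed seed gives the same shuffle each call
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]

# Made-up file names standing in for the downloaded images.
files = [f"img_{i:03d}.jpg" for i in range(100)]

train_a, valid_a = split_train_valid(files)
train_b, valid_b = split_train_valid(files)

print(len(train_a), len(valid_a))  # 80 20
print(valid_a == valid_b)          # True: same validation set on every run
```

The model training itself can stay random; only the held-out set is pinned down, so a change in the metric reflects the change you made rather than an easier or harder sample.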

We can run show_batch and we can take a little look. And it tells us straight away that some of these are going to be a little bit tricky. So this is not a photo, for instance. Some of them are cropped kind of funny. Some of them might be tricky, like if you ended up with a black bear standing on top of a grizzly bear, that might be tough.

Anyway, so you can kind of double check here data.classes, there they are. Remember, c is the attribute which, for classifiers, tells us how many possible labels there are. We'll learn about some other more specific meanings of c later. We can see how many things are in our training set.

We can see how many things are in our validation set. So we've got 473 training set, 141 validation set. So at that point we can go ahead. You'll see all these commands are identical to the pet classifier from last week. We can create our CNN, our convolutional neural network, using that data.

I tend to default to using a ResNet-34. And let's print out the error rate each time, and run fit_one_cycle for four epochs and see how we go. And we have a 2% error rate. So that's pretty good. Sometimes it's easy for me to recognize a black bear from a grizzly bear, but sometimes it's a bit tricky.

This one seems to be doing pretty well. Okay, so after I kind of make some progress with my model and things looking good, I always like to save where I'm up to, to save me the 54 seconds of going back and doing it again. And as per usual, we unfreeze the rest of our model.

We're going to be learning more about what that means during the course. And then we run the learning rate finder and plot it. It tells you exactly what to type, and we take a look. Now, we're going to be learning about learning rates today, actually. But for now, here's what you need to know.

On the learning rate finder, what you're looking for is the strongest downward slope that's kind of sticking around for quite a while. So this one here looks more like a bump, but this looks like an actual downward slope to me. So it's kind of like it's something you're going to have to practice with and get a feel for, like which bit works.

So if you're not sure is it this bit or this bit, try both learning rates and see which one works better. But I've been doing this for a while, and I'm pretty sure this looks like where it's really learning properly. So I would pick something. Okay, here it's not so steep.

So I would probably pick something back here for my learning rate. So you can see I picked 3e-5. So, you know, somewhere around here. That sounds pretty good. So that's my bottom learning rate. So my top learning rate, I normally pick 1e-4 or 3e-4.
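Picking the spot on the learning rate finder plot is done by eye here, but as a rough illustration of "strongest downward slope", the sketch below finds the steepest drop in loss per step of log10(learning rate). The function name steepest_descent_lr and the recorded values are invented for the example, and a single steepest interval is a cruder criterion than a slope that "sticks around for a while", so treat it as a starting point rather than a rule.

```python
import math

def steepest_descent_lr(lrs, losses):
    """Return a learning rate inside the interval where loss falls fastest.

    Slope is measured per unit of log10(lr), since the finder plot uses a
    log-scaled x axis. Returns None if the loss never decreases.
    """
    best_i, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        step = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / step
        if slope < best_slope:
            best_i, best_slope = i, slope
    if best_i is None:
        return None
    # Geometric midpoint of the steepest interval.
    return math.sqrt(lrs[best_i - 1] * lrs[best_i])

# Made-up finder output: flat, then a real descent, then the loss blows up.
lrs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
losses = [0.30, 0.29, 0.20, 0.15, 0.40, 2.00]

lr = steepest_descent_lr(lrs, losses)
print(f"{lr:.2e}")
```

With these invented numbers the steepest drop lies between 1e-5 and 1e-4, which is the same neighborhood as the bottom learning rate picked above.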

It's kind of like I don't really think about it too much. That's a rule of thumb. It always works pretty well. One of the things you'll realize is that most of these parameters don't actually matter that much in detail. If you just copy the numbers that I use each time, the vast majority of the time it will just work fine.

And we'll see places where it doesn't today. Okay, so we've got a 1.4% error rate after doing another couple of epochs. So that's looking great. So we've downloaded some images from Google Image search, created a classifier. We've got a 1.4% error rate. Let's save it. And then as per usual, we can use the classification interpretation class to have a look at what's going on.

And in this case, we made one mistake. There was one black bear classified as grizzly bear. So that's a really good step. We've come a long way. But possibly you could do even better if your data set was less noisy. Like maybe Google Image search didn't give you exactly the right images all the time.

So how do we fix that? And so we want to clean it up. And so combining a human expert with a computer learner is a really good idea. Almost, not nobody, but very, very few people publish on this. Very, very few people teach this. But to me, it's like the most useful skill.

Particularly for you, most of the people watching this are domain experts, not computer science experts. And so this is where you can use your knowledge of point mutations in genomics or Panamanian buses or whatever. So let's see how that would work. What I'm going to do is, do you remember plot_top_losses from last time, where we saw the images which it was either the most wrong about or the least confident about?

We're going to look at those and decide which of those are noisy. Like if you think about it, it's very unlikely that if there is some mislabeled data that it's going to be predicted correctly and with high confidence. That's really unlikely to happen. So we're going to focus on the ones which the model is saying either it's not confident of or it was confident of and it was wrong about.

They are the things which might be mislabeled. So big shout out to the San Francisco fast.ai study group who created this new widget this week called the FileDeleter. So that's Zach and Jason and Francisco, who built this thing where we basically can take the top losses from that interpretation object we just created.

And then what we're going to do is we're going to say, okay, that returns top losses. There's not just plot_top_losses, but there's also just top_losses. And top_losses returns two things, the losses of the things that were the worst and the indexes into the data set of the things that were the worst.

And if you don't pass anything at all, it's going to actually return the entire data set but sorted. So the first things will be the highest losses. As we'll keep seeing during the course, every data set in fastai has an x and a y.

And the x contains the things that are used to, in this case, get the images. So this is the image file names and the y's will be the labels. So if we grab the indexes and pass them into the data set x, this is going to give us the file names of the data set ordered by which ones had the highest loss.

So which ones it was either confident and wrong about or not confident about. And so we can pass that to this new widget that they've created called the FileDeleter widget. So just to clarify, this top_loss_paths contains all of the file names in our data set. When I say in our data set, this particular one is in our validation data set.

So what this is going to do is it's going to clean up mislabeled images or images that shouldn't be there. And we're going to remove them from the validation set so that our metrics will be more correct. So you then need to rerun these two steps replacing valid_ds with train_ds to clean up your training set to get the noise out of that as well.

So it's a good practice to do both. We'll talk about test sets later as well. If you also have a test set, you would then repeat the same thing. So we run FileDeleter passing in that sorted list of paths. And so what pops up is basically the same thing as plot_top_losses.

So in other words, these are the ones which it is either wrong about or least confident about. And so not surprisingly, this one here does not appear to be a teddy bear or a black bear or a brown bear. So this shouldn't be in our data set. So what I do is I click on the Delete button.
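The indexing step described above, sorting by loss and then using the indexes to look up file names, can be sketched in plain Python. This mimics the described behavior rather than calling the library: the function top_losses below is a hand-written stand-in, and the file names and loss values are invented.

```python
def top_losses(losses, k=None):
    """Return (sorted_losses, indexes), highest loss first.

    With k=None everything is returned, just sorted, matching the
    behavior described for calling it with no arguments.
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    if k is not None:
        order = order[:k]
    return [losses[i] for i in order], order

# Invented validation file names (the "x") and their per-image losses.
file_names = ["black/01.jpg", "teddies/07.jpg", "grizzly/03.jpg", "teddies/02.jpg"]
losses = [0.05, 2.30, 0.40, 0.01]

sorted_losses, idxs = top_losses(losses)
top_loss_paths = [file_names[i] for i in idxs]
print(top_loss_paths)
# ['teddies/07.jpg', 'grizzly/03.jpg', 'black/01.jpg', 'teddies/02.jpg']
```

The images most worth a human look come first, so a reviewer can stop once a screenful or two looks fine.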

And all the rest do look indeed like bears. And then so I can click Confirm, and it'll bring up another 5. What's that? That's not a bear, is it? Does anybody know what that is? I'm going to say that's not a bear. Delete. Confirm. Well, not bear. Well, that's a teddy bear.

I'll leave that. That's not really. I'll get rid of that one. Confirm. OK, so what I tend to do when I do this is I'll keep going Confirm until I get to a couple of screen fulls of things that all look OK. And that suggests to me that I've kind of got past the worst bits of the data.

OK, and that's it. And so now you can go back once you do it for the training set as well and retrain your model. So I'll just note here that what our San Francisco study group did here was that they actually built a little app inside Jupyter Notebook, which you might not have realized is possible.

But not only is it possible, it's actually surprisingly straightforward. And just like everything else, you can hit double question mark to find out their secrets. So here is the source code. OK, and really, if you've done any GUI programming before, it'll look incredibly normal. You know, there's basically callbacks for what happens when you click on a button, where you just do standard Python things.

And to actually render it, you just use widgets and you can lay it out using standard boxes and whatever. So this idea of creating applications inside notebooks is really underused, but it's super neat because it lets you create tools for your fellow practitioners, for your fellow experimenters.

Right. And you could definitely envisage taking this a lot further. In fact, by the time you're watching this on the MOOC, you'll probably find that there's a whole lot more buttons here because we've already got a long list of to-dos that we're going to add to this particular thing.

So that's it. I'd love for you to have a think about, now that you know it's possible to write applications in your notebook, what are you going to write? And if you Google for ipywidgets, you can learn about the little GUI framework to find out what kind of widgets you can create and what they look like and how they work and so forth.

And you'll find it's actually a pretty complete GUI programming environment you can play with. And this will all work nicely with your models and so forth. It's not a great way to productionize an application because it is sitting inside a notebook. This is really for things which are going to help other practitioners, other experimentalists and so forth.

For productionizing things, you need to actually build a production web app, which we'll look at next. OK, so after you have cleaned up your noisy images, you can then retrain your model and hopefully you'll find it's a little bit more accurate. One thing you might be interested to discover when you do this is it actually doesn't matter most of the time very much.

Now on the whole, these models are pretty good at dealing with moderate amounts of noisy data. The problem would occur if your data was not randomly noisy, but biased noisy. So I guess the main thing I'm saying is if you go through this process of cleaning up your data and then you rerun your model and find it's like 0.001% better, that's normal.

OK, that's fine, but it's still a good idea just to make sure that you don't have too much noise in your data in case it is biased. So at this point, we're ready to put our model in production. And this is where I hear a lot of people ask me about which mega Google Facebook highly distributed serving system they should use and how do they use a thousand GPUs at the same time and whatever else.

For the vast, vast, vast majority of things that you all do, you will want to actually run in production on a CPU, not a GPU. Why is that? Because a GPU is good at doing lots of things at the same time. But unless you have a very busy website, it's pretty unlikely that you're going to have 64 images to classify at the same time to put into a batch into a GPU.

And if you did, you've got to deal with all that queuing and running it all together. All of your users have to wait until that batch has got filled up and run. It's a whole lot of hassle, right? And then if you want to scale that, there's another whole lot of hassle.

It's much easier if you just grab one thing, throw it at a CPU to get it done, and it comes back again. So yes, it's going to take, you know, maybe 10 or 20 times longer, right? So maybe it'll take 0.2 seconds rather than 0.01 seconds. That's about the kind of times we're talking about.

But it's so easy to scale, right? You can chuck it on any standard serving infrastructure. It's going to be cheap as hell. You can horizontally scale it really easily, right? So most people I know who are running apps that aren't kind of at Google scale based on deep learning are using CPUs.

And the term we use is inference, right? So when you're not training a model, but you've got a trained model and you're getting it to predict things, we call that inference. That's why we say here you probably want to use CPU for inference. So at inference time, you've got your pre-trained model, you saved those weights, and how are you going to use them to create something like Simon Willison's Cougar Detector?

Well, first thing you're going to need to know is what were the classes that you trained with, right? You need to know not just what they are, but what order they were in, okay? So you will actually need to, like, serialize that or just type them in or in some way make sure you've got exactly the same classes that you trained with.
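One way to "serialize that" is sketched below using only the Python standard library. The class names, their order, and the file name classes.json are illustrative assumptions, not something the lesson prescribes; the point is only that the list is written once at training time and read back unchanged at inference time, so that index 0 still means the same class.

```python
import json
import os
import tempfile

# Class names in the exact order used during training.
# (Illustrative values; use whatever your own training run produced.)
classes = ["black", "grizzly", "teddys"]

# Hypothetical location; any file your web app can read at start-up will do.
classes_path = os.path.join(tempfile.mkdtemp(), "classes.json")

# At training time: write the list out. A JSON array preserves order.
with open(classes_path, "w") as f:
    json.dump(classes, f)

# At inference time: read it back and map a predicted index to its label.
with open(classes_path) as f:
    loaded_classes = json.load(f)

predicted_index = 0  # stand-in for the index your model predicts
predicted_label = loaded_classes[predicted_index]
print(predicted_label)
```

Storing the list next to the saved weights keeps the two from drifting apart if you retrain with a different set of folders.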

If you don't have a GPU on your server, it will use the CPU automatically. If you want to test, if you have a GPU machine and you want to test using a CPU, you can just uncomment this line. And that tells FastAI that you want to use CPU by passing it back to PyTorch.

So here's an example. We don't have a Cougar Detector, we have a Teddy Bear Detector, and my daughter Claire is about to decide whether to cuddle this friend, okay? So what she does is she takes Daddy's Deep Learning model and she gets a picture of this, and here is a picture that she's uploaded to the web app, okay?

And here is a picture of the potentially cuddlesome object, and so we're going to store that in a variable called img. So open_image is how you open an image in FastAI, funnily enough. Here is that list of classes that we saved earlier. And so as per usual, we create a data bunch, but this time we're not going to create a data bunch from a folder full of images.

We're going to create a special kind of data bunch, which is one that's going to grab one single image at a time. So we're not actually passing it any data. The only reason we pass it a path is so that it knows where to load our model from, right?

That's just the path, that's the folder that the model is going to be in. But what we do need to do is that we need to pass it the same information that we trained with, so the same transforms, the same size, the same normalization. This is all stuff we'll learn more about, but just make sure it's the same stuff that you used before.

And so now you've got a data bunch that actually doesn't have any data in it at all. It's just something that knows how to transform a new image in the same way that you trained with, so that you can now do inference. So you can now create a CNN with this kind of fake data bunch.

And again, you would use exactly the same model that you trained with. You can now load in those saved weights, okay? And so this is the stuff that you would do once, just once when your web app's starting up, okay? And it takes, you know, 0.1 of a second to run this code.

And then you just go learn.predict(img). And it's lucky we did that because it is not a teddy bear. This is actually a black bear. So thankfully, due to this excellent deep learning model, my daughter will avoid having a very embarrassing black bear cuddle incident. So what does this look like in production?

Well, I took Simon Willison's code and shamelessly stole it, made it probably a little bit worse. But basically it's going to look something like this. So now Simon used a really cool web app toolkit called Starlette. If you've ever used Flask, this will look extremely similar, but it's kind of a more modern approach.

And by modern, what I really mean is that you can use await. This basically means that you can wait for something that takes a while, such as grabbing some data, without using up a process. So for things like I want to get a prediction or I want to load up some data or whatever, it's really great to be able to use this modern Python 3 asynchronous stuff.
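To see what "wait without using up a process" means, here is a minimal sketch using only Python's built-in asyncio, not Starlette itself. The function names and the 0.2 second delay are made up for illustration; asyncio.sleep stands in for slow I/O such as fetching an uploaded image.

```python
import asyncio
import time

async def fetch_image_bytes(name, delay=0.2):
    # Stand-in for slow I/O; while this awaits, other requests can run.
    await asyncio.sleep(delay)
    return f"bytes-of-{name}"

async def handle_request(name):
    # Hypothetical request handler: wait for the data, then respond.
    data = await fetch_image_bytes(name)
    return {"image": name, "size": len(data)}

async def main():
    start = time.perf_counter()
    # Three requests are in flight at once inside a single process.
    results = await asyncio.gather(
        handle_request("a.jpg"),
        handle_request("b.jpg"),
        handle_request("c.jpg"),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Because the three waits overlap, the total time is close to one delay rather than three. An async web framework applies the same idea to route handlers.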

So Starlette would come highly recommended for creating your web app. And so, yeah, you just create a route as per usual in a web app. And in that, you say this is async to ensure that it doesn't steal the process while it's waiting for things. You open your image, you call .predict, and you return that response.

And then you can use whatever, JavaScript client or whatever to show it. And that's it. That's basically the main contents of your web app. So give it a go, right? This week, even if you've never created a web application before, there's a lot of nice little tutorials online and kind of starter code.

If in doubt, why don't you try Starlette? There's free hosting that you can use. There's one called PythonAnywhere, for example. The one that Simon's used, we'll mention that on the forum. It's something where you can basically package it up as a Docker thing and shoot it off, and it'll serve it up for you.

So it doesn't even need to cost you any money. And so all these classifiers that you're creating, you can turn them into web applications. So I'll be really interested to see what you're able to make of that. That'll be really fun. Okay, so let's take a break. We'll come back at 7:35.

See you then. Okay. So let's move on. So I mentioned that most of the time, the kind of rules of thumb I've shown you will probably work. And if you look at the Share Your Work thread, you'll find most of the time, people are posting things saying, "I downloaded these images.

I tried this thing. It works much better than I expected." Well, that's cool. And then like 1 out of 20 says like, "Ah, I had a problem." So let's have a talk about what happens when you have a problem. And this is where we're going to start getting into a little bit of theory, because in order to understand why we have these problems and how we fix them, it really helps to know a little bit about what's going on.

So first of all, let's look at examples of some problems. The problems basically will be either your learning rate is too high or low, or your number of epochs is too high or low. So we're going to learn about what those mean and why they matter. But first of all, because we're experimentalists, let's try them.

So let's go with our teddy bear detector and let's make our learning rate really high. The default learning rate is 0.003. It works most of the time. So what if we try a learning rate of 0.5? That's huge. So what happens? Our validation loss gets pretty damn high. Remember, this is normally something that's underneath 1.

So if you see your validation loss do that, before we even learn what validation loss is, just know this: if it does that, your learning rate is too high. That's all you need to know. Make it lower. It doesn't matter how many epochs you do. And if this happens, there's no way to undo this.

So you have to go back and create your neural net again and fit from scratch with a lower learning rate. So that's learning rate too high. Learning rate too low: what if we use a learning rate not of 0.003, but 1e-5, so 0.00001, right? So this is just, I've just copied and pasted what happened when we trained before with our default learning rate.

And within one epoch we were down to a 2 or 3% error rate. With this really low learning rate, our error rate does get better, but very, very slowly. And you can plot it. So learn.recorder is an object which is going to keep track of lots of things happening while you train.

You can call plot_losses to plot out the validation and training loss. And you can just see them just gradually going down so slow. So if you see that happening, then you have a learning rate which is too small. So bump it up by 10 or bump it up by 100 and try again.
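The two failure modes can be reproduced on a toy problem. The sketch below is not the bear model: it runs plain gradient descent on the one-parameter loss w**2, and the three learning rates are chosen only to make the behaviour visible on that toy loss, so they do not correspond to the 0.5 and 1e-5 used in the lesson.

```python
def gradient_descent(lr, steps=40, w=1.0):
    """Minimise loss(w) = w**2 and return the loss after every step."""
    losses = [w * w]
    for _ in range(steps):
        grad = 2 * w        # derivative of w**2 with respect to w
        w = w - lr * grad   # the gradient descent update
        losses.append(w * w)
    return losses

too_high = gradient_descent(lr=1.1)    # every step overshoots, loss grows
too_low = gradient_descent(lr=1e-4)    # loss falls, but barely
reasonable = gradient_descent(lr=0.3)  # loss drops quickly

for name, losses in [("too high", too_high),
                     ("too low", too_low),
                     ("reasonable", reasonable)]:
    print(f"{name:>10}: start {losses[0]:.4f} -> end {losses[-1]:.4g}")
```

With the rate too high the loss ends far above where it started, and with the rate too low it has hardly moved after 40 steps, which mirrors the two pictures described above.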

The other thing you'll see if your learning rate is too small is that your training loss will be higher than your validation loss. You never want a model where your training loss is higher than your validation loss. That always means you haven't fitted enough, which means either your learning rate is too low or your number of epochs is too low.

So if you have a model like that, train it some more or train it with a higher learning rate. OK? Too few epochs. So what if we train for just one epoch? Our error rate is certainly better than random, 5%, but look at this: the difference between training loss and validation loss. The training loss is much higher than the validation loss.

So too few epochs and too low a learning rate look very similar. And so you can just try running more epochs. And if it's taking forever, you can try a higher learning rate. Or if you try a higher learning rate and the loss goes off to 100,000 million, then put it back to where it was and try a few more epochs.

That's the balance. That's basically all you care about 99% of the time. And this is only the 1 in 20 times that the defaults don't work for you. OK. Too many epochs. We're going to be talking more about this. Too many epochs creates something called overfitting. If you train for too long, as we're going to learn about, it will learn to recognize your particular teddy bears, but not teddy bears in general.

Here's the thing. Despite what you may have heard, it's very hard to overfit with deep learning. So we were trying today to show you an example of overfitting and I turned off everything. We're going to learn all about these terms soon. I turned off all the data augmentation. I turned off dropout.

I turned off weight decay. I tried to make it overfit as much as I can. I trained it on a small-ish learning rate. I trained it for a really long time. And like, maybe I started to get it to overfit. Maybe. So the only thing that tells you that you're overfitting is that the error rate improves for a while and then starts getting worse again.

You will see a lot of people, even people that claim to understand machine learning, tell you that if your training loss is lower than your validation loss, then you are overfitting. As you will learn today in more detail and during the rest of the course, that is absolutely not true.

Any model that is trained correctly will always have train loss lower than validation loss. That is not a sign of overfitting. That is not a sign you've done something wrong. That is a sign you have done something right. The sign that you are overfitting is that your error starts getting worse.

Because that's what you care about, right? You want your model to have a low error. So as long as you're training and your model error is improving, you are not overfitting. How could you be? Okay. So those are basically the main four things that can go wrong.
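The overfitting test described above (the error rate improves for a while and then starts getting worse) can be written down as a small check. This is only a sketch: the function names, the patience of two epochs, and the error values are all invented for illustration.

```python
def best_epoch(errors):
    """Index of the epoch with the lowest validation error."""
    return min(range(len(errors)), key=lambda i: errors[i])

def looks_overfit(errors, patience=2):
    """True if the error has not beaten its best value for `patience` epochs."""
    return (len(errors) - 1) - best_epoch(errors) >= patience

# Hypothetical validation error rates, one per epoch.
still_improving = [0.10, 0.06, 0.04, 0.03, 0.03]
got_worse = [0.10, 0.05, 0.03, 0.04, 0.05, 0.06]

print(looks_overfit(still_improving))  # error is still at its best level
print(looks_overfit(got_worse))        # error bottomed out at epoch 2
```

Note that the check looks only at the validation error. Whether training loss is above or below validation loss plays no part in it.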

There are some other details that we will learn about during the rest of this course. But honestly, if you stopped listening now, please don't. That would be embarrassing. And you're just like, okay, I'm going to go and download images. I'm going to create CNNs with ResNet 34 or ResNet 50.

I'm going to make sure that my learning rate and number of epochs is okay. And then I'm going to chuck them up in a Starlette web API. Most of the time you're done, okay, at least for computer vision. Hopefully you'll stick around because you want to learn about NLP and collaborative filtering and tabular data and segmentation and stuff like that as well.

Let's now understand what's actually going on. What does loss mean? What does an epoch mean? What does learning rate mean? Because for you to really understand these ideas, you need to know what's going on. And so we're going to go all the way to the other side.

Rather than creating a state-of-the-art Cougar detector, we're going to go back and create the simplest possible linear model. So we're going to actually start seeing a little bit of math. But don't be turned off. It's okay. We're going to do a little bit of math but it's going to be totally fine even if math is not your thing.

Because the first thing we're going to realize is that when we see a picture like this number 8, it's actually just a bunch of numbers. It's a matrix of numbers. For this grayscale one, it's a matrix of numbers. If it was a color image, it would have a third dimension.

So when you add an extra dimension, we call it a tensor. Rather than a matrix, it would be a 3D tensor of numbers, red, green and blue. So when we created that teddy bear detector, what we actually did was we created a mathematical function that took the numbers from the images of the teddy bears, and the mathematical function converted those numbers into, in our case, three numbers.

A number for the probability that it's a teddy, a probability that it's a grizzly, and a probability that it's a black bear. In this case, there's some hypothetical function that's taking the pixels representing a handwritten digit and returning 10 numbers, the probability for each possible outcome, the numbers from 0 to 9.

And so what you'll often see in our code, and in other deep learning code, is that you'll find this bunch of probabilities and then you'll find a function called max or argmax attached to it. And so what that function is doing is it's saying, find the highest number, the highest probability, and tell me what the index is.

So np.argmax or torch.argmax of this array would return this number here. Okay, we return index 8. Does that make sense? In fact, let's try it. So we know that the function to predict something is called learn.predict. Okay, so we can chuck two question marks before it or after it to get the source code.

And here it is, right? pred_max = res.argmax(). And then what is the class? We just pass that into the classes array. So you should find that the source code in the FastAI library can both strengthen your understanding of the concepts and make sure that you know what's going on, and really help you here.
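The argmax step and the lookup into the classes array can be tried on their own. The sketch below uses plain Python so it runs without NumPy or PyTorch; np.argmax and torch.argmax do the same job on arrays and tensors. The ten probabilities are invented to resemble the handwritten 8 example.

```python
def argmax(values):
    """Index of the largest value (what np.argmax or torch.argmax returns)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical model output: one probability per digit 0..9.
probs = [0.01, 0.01, 0.02, 0.01, 0.01, 0.02, 0.01, 0.03, 0.85, 0.03]

classes = [str(d) for d in range(10)]  # class names, in training order
pred_idx = argmax(probs)
pred_class = classes[pred_idx]
print(pred_idx, pred_class)
```

The highest probability sits at position 8, so the index is 8 and the class looked up from the list is the digit 8.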

You've got a question. Come on over. Alright, so can we have a definition of the error rate being discussed and how it is calculated? I assume it's cross-validation error. Sure. So one way to answer the question of how is error rate calculated would be to type error_rate?? and look at the source code.

And it is 1 - accuracy. Fair enough. And so then a question might be what is accuracy? So, accuracy??. It is argmax. So we now know that means find out which particular thing it is, and then look at how often that equals the target, in other words the actual value, and take the mean.
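Written out in plain Python, the definition just described looks like the sketch below. It mirrors the description (argmax, compare with the target, take the mean, subtract from 1) rather than reproducing the library's source, and the predictions and targets are invented.

```python
def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(preds, targets):
    """Fraction of rows whose highest-probability class equals the target."""
    hits = [argmax(p) == t for p, t in zip(preds, targets)]
    return sum(hits) / len(hits)

def error_rate(preds, targets):
    return 1 - accuracy(preds, targets)

# Hypothetical validation-set predictions for three classes.
preds = [
    [0.8, 0.1, 0.1],  # predicts class 0
    [0.2, 0.7, 0.1],  # predicts class 1
    [0.3, 0.3, 0.4],  # predicts class 2
    [0.6, 0.3, 0.1],  # predicts class 0, but the target is 1
]
targets = [0, 1, 2, 1]

print(accuracy(preds, targets), error_rate(preds, targets))
```

Three of the four rows are right, so the accuracy is 0.75 and the error rate is 0.25.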

So that's basically what it is. And so then the question is, OK, what is that being applied to? And always in FastAI, metrics, so these things that we pass in, we call them metrics, are always going to be applied to the validation set. OK? So any time you put a metric here, it'll be applied to the validation set because that's your best practice, right?

That's what you always want to do, is make sure that you're checking your performance on data that your model hasn't seen. And we'll be learning more about the validation set shortly. Remember, you can also type doc, if the source code is not what you want, which it often won't be, and you actually want the documentation.

That will both give you a summary of the types in and out of the function and a link to the full documentation where you can find out all about how metrics work and what other metrics there are and so forth. And generally speaking, you'll also find links to more information where, for example, you will find complete runs through and sample code and so forth showing you how to use all these things.

So don't forget that the doc function is your friend, OK? And also in the documentation, both in the doc function and in the documentation, you will see a source link. This is like question mark, question mark. But what the source link does is it takes you into the exact line of code in GitHub so you can see exactly how that's implemented and what else is around it.

So lots of good stuff there. Why were you using 3s for your learning rates earlier, with 3e-5 and 3e-4? We found that 3e-3 is just a really good default learning rate. It works most of the time for your initial fine-tuning before you unfreeze. And then I tend to kind of just multiply from there.

So I generally find then that for the next stage I will pick 10 times lower than that for the second part of the slice, and whatever the LR finder found for the first part of the slice. The second part of the slice doesn't come from the LR finder. It's just a rule of thumb, which is 10 times less than your first stage, which defaults to 3e-3.

And then the first part of the slice is what comes out of the LR finder. And we'll be learning a lot more about these learning rate details both today and in the coming lessons. But yeah, for now all you need to remember is that your basic approach looked like this.

It was learn.fit_one_cycle, some number of epochs, I often pick 4, and some learning rate which defaults to 3e-3. I'll just type it out fully so you can see. And then we do that for a bit and then we unfreeze it. And then we learn some more.

And so this is the bit where I just take whatever I did last time and divide it by 10. And then I also write like that and then I have to put one more number in here and that's the number that I get from the LRFinder. The bit where it's got the strongest slope.
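The arithmetic of that rule of thumb is small enough to write down. The sketch below only builds the slice object using Python's built-in slice; the helper name stage2_lrs and the LR finder reading of 1e-5 are made up for illustration, and the fastai calls appear only as comments because they need a trained learner to run.

```python
STAGE1_LR = 3e-3  # the default learning rate mentioned in the lesson

def stage2_lrs(lr_from_finder, stage1_lr=STAGE1_LR):
    """Learning-rate slice for the unfrozen stage.

    First part: the value read off the LR finder (strongest downward slope).
    Second part: the stage-1 learning rate divided by 10.
    """
    return slice(lr_from_finder, stage1_lr / 10)

lrs = stage2_lrs(1e-5)  # 1e-5 is a hypothetical LR finder reading
print(lrs)

# How this would be used, following the steps described in the lesson:
#   learn.fit_one_cycle(4, 3e-3)
#   learn.unfreeze()
#   learn.fit_one_cycle(4, lrs)
```

Only the first number changes from project to project; the second follows mechanically from whatever was used in the first stage.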

So that's kind of the don't have to think about it, don't really have to know what's going on, rule of thumb that works most of the time. But let's now dig in and actually understand it more completely. So we're going to create this mathematical function that takes the numbers that represent the pixels and spits out probabilities for each possible class.

And by the way, a lot of the stuff that we're using here, we are stealing from other people who are awesome and so we are putting their details here. So like please check out their work because they've got great work that we are highlighting in our course. I really like this idea of this little animated gif of the numbers.

So thank you to Adam Geitgey for creating that. And I guess that was probably on Quora by the looks of this. Medium? Oh yes, it was that terrific Medium post, I remember. In fact a whole series of Medium posts. So let's look and see how we create one of these functions.

And let's start with the simplest function I know, y = ax + b. That's a line, right? That's a line and the gradient of the line is here and the intercept of the line is here. So hopefully when we said that you need to know high school math to do this course, these are the things we're assuming that you remember.

If we do kind of mention some math thing which I'm assuming you remember and you don't remember it, don't freak out, right? Happens to all of us. Khan Academy is actually terrific, it's not just for school kids. Go to Khan Academy, find the concept that you need a refresher on and he explains things really well.

So I strongly recommend checking that out. Now remember I'm just a philosophy student, right? All the time I'm trying to either remind myself about something or I never learnt something and so we have the whole internet to teach us these things. So I'm going to rewrite this slightly. y = a1 x + a2.

So let's just replace b with a2 and just give it a different name. So there's another way of saying the same thing. And another way of saying that would be if I could multiply a2 by the number 1. This still is the same thing. And so now at this point I'm actually going to say let's not put the number 1 there but let's put an x1 here and an x2 here and I'll say x2 = 1.

So far this is pretty early high school math. This is multiplying by 1 which I think we can handle. So these two are equivalent with a bit of renaming. Now in machine learning we don't just have one equation. We've got lots, right? So if we've got some data that represents the temperature versus the number of ice creams sold then we kind of have lots of dots.

And so each one of those dots we might hypothesize is based on this formula. y = a1 x1 + a2 x2. And so basically there's lots of, so this is our y, this is our x. There's lots of values of y so we can stick a little i here.

And there's lots of values of x so we can stick a little i here. So the way we kind of do that is a lot like numpy indexing, or PyTorch indexing. But rather than putting things in square brackets, we kind of put them down here in the subscript of our equation.

So this is now saying there's actually lots of these different yi's based on lots of different xi1 and xi2. But notice there's still only one of each of these. So these things here are called the coefficients or the parameters. So this is our linear equation and we're going to say that every xi2 is equal to 1.
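Written out, the chain of renamings described above is the following (same symbols as in the lecture, with the subscript i indexing the data points):

```latex
\begin{align*}
y   &= a x + b \\
y   &= a_1 x + a_2 \\
y   &= a_1 x_1 + a_2 x_2, && x_2 = 1 \\
y_i &= a_1 x_{i,1} + a_2 x_{i,2}, && x_{i,2} = 1 \text{ for every } i
\end{align*}
```

There is one $y_i$, one $x_{i,1}$ and one $x_{i,2}$ per data point, but only a single pair of coefficients $a_1, a_2$ shared by all of them.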

Why did I do it that way? Because I want to do linear algebra. Why do I want to do it in linear algebra? Well, one reason is because Rachel teaches the world's best linear algebra course. So if you're interested, check out Computational Linear Algebra for Coders. So it's a good opportunity for me to throw in a pitch for this free course, from which we make no money, but never mind.

But more to the point right now it's going to make life much easier because I hate writing loops. I hate writing code. I just want the computer to do everything for me. And anytime you see like these little i subscripts that sounds like you're going to have to do loops and all kinds of stuff.

But what you might remember from school is that when you've got like two things being multiplied together, two things being multiplied together and then they get added up, that's called a dot product. And then if you do that for lots and lots of different numbers i, then that's called a matrix product.

So in fact this whole thing can be written like this. Rather than lots of different yi's we can say there's one vector called y which is equal to one matrix called x times one vector called a. Now at this point I know a lot of you don't remember that.

So that's fine. We have a picture to show you. I don't know who created this. So now I do. Somebody called André Staltz created this fantastic thing called matrixmultiplication.xyz. And here we have a matrix by a vector and we're going to do a matrix vector product.

Go. That times that times that plus plus plus plus. That times that times that plus plus plus plus. That times that times that plus plus plus plus. Finished. That is what matrix vector multiplication does. In other words, it's just that. Except his version is much less messy. OK.
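The animation's "times, times, plus, plus" can be spelled out in a few lines of plain Python. This sketch writes the loops explicitly to show what a matrix-vector product computes; the values in X and a are arbitrary, with the second column of X set to 1 as in the equation above.

```python
def dot(row, a):
    """Multiply matching entries and add them up: one dot product."""
    return sum(x * w for x, w in zip(row, a))

def matvec(X, a):
    """Matrix-vector product: one dot product per row of X."""
    return [dot(row, a) for row in X]

# Each row is one data point (x_i1, x_i2), with x_i2 fixed at 1.
X = [
    [1.0, 1.0],
    [2.0, 1.0],
    [3.0, 1.0],
]
a = [3.0, 2.0]  # arbitrary coefficients (a_1, a_2)

y = matvec(X, a)
print(y)  # [5.0, 8.0, 11.0]
```

Each output is a_1 * x_i1 + a_2 * x_i2 for one row, which is exactly the per-point equation, just computed for all points in one call.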

This is actually an excellent spot to have a little break and find out what questions we have coming through from our students. What are they asking, Rachel? OK. When generating new image data sets, how do you know how many images are enough? What are ways to measure enough? Yeah. That's a great question.

So another possible problem you have is you don't have enough data. How do you know if you don't have enough data? Because you found a good learning rate, because if you make it higher then it goes off into massive losses, and if you make it lower, it goes really slowly.

So you've got a good learning rate. And then you train for such a long time that your error starts getting worse. OK. So you know that you've trained for long enough and you're still not happy with the accuracy. It's not good enough for the teddy bear cuddling level of safety you want.

So if that happens, there's a number of things you can do and we'll learn about some of them during or learn pretty much all of them during this course. But one of the easiest ones is get more data. If you get more data, then you can train for longer, get a higher accuracy, lower error rate without overfitting.

Unfortunately, there's no shortcut. I wish there was. I wish there was some way to know ahead of time how much data you need. But I will say this, most of the time you need less data than you think. So organizations very commonly spend too much time gathering data, getting more data than it turned out they actually needed.

So get a small amount first and see how you go. What do you do if you have unbalanced classes such as 200 grizzlies and 50 teddies? Nothing. Try it. It works. A lot of people ask this question about how do I deal with unbalanced data. I've done lots of analysis with unbalanced data over the last couple of years and I just can't make it not work.

It always works. So there's actually a paper that said if you want to get it slightly better, then the best thing to do is to take that uncommon class and just make a few copies of it. That's called oversampling. But I haven't found a situation in practice where I needed to do that.
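For reference, oversampling by copying the uncommon class can be sketched as below. The transcript does not say how many copies the paper recommends, so this version simply repeats the smaller class until it matches the largest one; that choice, the function name, and the file names are all assumptions made for illustration.

```python
def oversample(items_by_class):
    """Repeat items of smaller classes until every class matches the largest."""
    target = max(len(items) for items in items_by_class.values())
    balanced = {}
    for label, items in items_by_class.items():
        copies, remainder = divmod(target, len(items))
        balanced[label] = items * copies + items[:remainder]
    return balanced

# Hypothetical file lists matching the 200 grizzlies / 50 teddies question.
dataset = {
    "grizzly": [f"grizzly_{i}.jpg" for i in range(200)],
    "teddy": [f"teddy_{i}.jpg" for i in range(50)],
}

balanced = oversample(dataset)
print({label: len(items) for label, items in balanced.items()})
```

Only the training set would be treated this way; the copies add no new information, they just make the rare class show up more often in each epoch.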

I've found it always just works fine for me. Once you unfreeze and retrain with one cycle again, if your training loss is still higher than your validation loss, likely underfitting, do you retrain it unfrozen again, which will technically be more than one cycle, or do you redo everything with a longer epoch for the cycle?

Hey, you guys asked me that last week. My answer is still the same. I don't know. Either is fine. If you do another cycle, then it'll kind of maybe generalize a little bit better. If you start again, do twice as long. It's kind of annoying. Depends how patient you are.

It won't make much difference. For me personally, I normally just train a few more cycles. But yeah, it doesn't make much difference most of the time. So showing the code sample where you were creating a CNN with ResNet 34 for the Grizzly/Teddy classifier, it says this requires ResNet 34, which I find surprising.

I had assumed that the model created by .save, which is about 85 megabytes on disk, would be able to run without also needing a copy of ResNet 34. Yeah, and I understand. We're going to be learning all about this shortly. There's no copy of ResNet 34. ResNet 34 is actually what we call an architecture.

We're going to be learning a lot about this. It's a functional form. Just like this is a linear functional form, it doesn't take up any room, it doesn't contain anything, it's just a function. ResNet 34 is just a function. It doesn't contain anything, it doesn't store anything. I think the confusion here is that we often use a pre-trained neural net that's been learnt on ImageNet.

In this case, we don't need to use a pre-trained neural net. And actually, to entirely avoid that even getting created, you can actually pass pretrained=False, and that'll ensure that nothing even gets loaded, which will save you another 0.2 seconds, I guess. But we'll be learning a lot more about this, so don't worry if this is a bit unclear.

But the basic idea is, this thing here is basically the equivalent of saying, is it a line, or is it a quadratic, or is it a reciprocal? This is just a function. This is the ResNet 34 function. It's a mathematical function. It doesn't take any storage, it doesn't have any numbers, it doesn't have to be loaded, as opposed to a pre-trained model.

And so that's why when we did it at inference time, the thing that took space is this bit, which is where we load our parameters, which is basically saying, as we were able to find out, what are the values of A and B. We have to store those numbers.

But for ResNet 34, you don't just store two numbers, you store a few millions, or a few tens of millions of numbers. So why did we do all this? Well, it's because I wanted to be able to write it out like this. And the nice thing of being able to write it out like this is that we can now do that in PyTorch with no loops, single line of code, and it's also going to run faster.

PyTorch really doesn't like loops. It really wants you to send it a whole equation to do all at once, which means you really want to try and specify things in these kind of linear algebra ways. So let's go and take a look, because what we're going to try and do then is we're going to try and take this, we're going to call it an architecture.

This is like the world's tiniest neural network. It's got two parameters, you know, A1 and A2. We're going to try and fit this architecture to some data. So let's jump into a notebook and generate some dots, right, and see if we can get it to fit a line somehow.

And the somehow is going to be using something called SGD. What is SGD? Well, there's two types of SGD. The first one is where I said in Lesson 1, "Hey, you should all try building these models and try and come up with something cool." And you guys all experimented and found really good stuff.

So that's where the S would be student. That would be student gradient descent. So that's version 1 of SGD. Version 2 of SGD, which is what we're going to talk about today, is where we're going to have a computer try lots of things and try and come up with a really good function.

And that will be called "Stochastic Gradient Descent". So the other one that you hear a lot on Twitter is "Stochastic Grad Student Descent". So that's the other one that you hear. So we're going to jump into Lesson 2, SGD. And so we're going to kind of go bottom up rather than top down.

We're going to create the simplest possible model we can, which is going to be a linear model. And the first thing that we need is we need some data. And so we're going to generate some data. The data we're going to generate looks like this. So this might represent temperature and this might represent number of ice creams we sell or something like that.

But we're just going to create some synthetic data that we know is following a line. And so as we build this, we're actually going to learn a little bit about PyTorch as well. So basically the way we're going to generate this data is by creating some coefficients. A1 will be 3 and A2 will be 2.

And we're going to create some, like we just looked at before, basically a column of numbers for our x's and a whole bunch of 1's. And then we're going to do this, x@a. What is x@a? x@a in Python means a matrix product between x and a.

It actually is even more general than that. It can be a vector-vector product, a matrix-vector product, a vector-matrix product, or a matrix-matrix product. And then actually in PyTorch specifically, it can mean even more general things where we get into higher-rank tensors, which we will learn all about very soon.
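Here's a minimal sketch of those four cases of the @ operator, assuming PyTorch is installed. The tensors v, w and m are made-up values just for illustration; they're not from the lesson notebook.

```python
import torch

# Hypothetical example values, chosen so the results are easy to check by hand.
v = torch.tensor([1., 2.])
w = torch.tensor([3., 4.])
m = torch.tensor([[1., 2.],
                  [3., 4.]])

print(v @ w)  # vector-vector: one number (1*3 + 2*4)
print(m @ v)  # matrix-vector: one number per row of m
print(v @ m)  # vector-matrix: one number per column of m
print(m @ m)  # matrix-matrix: a 2 by 2 matrix
```

In every case it's the same thing going on: multiply numbers together and add them up.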

But this is basically the key thing that's going to go on in all of our deep learning. The vast majority of the time, our computers are going to be basically doing this. Multiplying numbers together and adding them up, which is a surprisingly useful thing to do. OK. So we basically are going to generate some data by creating a line and then we're going to add some random numbers to it.

But let's go back and see how we created x and a. So I mentioned that we've basically got these two coefficients, 3 and 2. And you'll see that we've wrapped it in this function called Tensor. You might have heard this word "tensor" before. Who's heard the word "tensor" before?

About two-thirds of you. OK. So it's one of these words that sounds scary. And apparently if you're a physicist, it actually is scary. But in the world of deep learning, it's actually not scary at all. Tensor means array. OK? It means array. Now specifically, it's an array of a regular shape.

Right? So it's not an array where row one has two things and row three has three things and row four has one thing, what you call a jagged array. That's not a tensor. A tensor is any array which has a rectangular or cube or whatever, you know, a shape where every element, every row is the same length and then every column is the same length.

A 4 by 3 matrix would be a tensor. A vector of length 4 would be a tensor. A 3D array of length 3 by 4 by 6 would be a tensor. That's all a tensor is. OK? And so we have these all the time. For example, an image is a three-dimensional tensor.

It's got number of rows by number of columns by number of channels, normally red, green, blue. So for example, kind of a VGA picture would be 640 by 480 by 3. Or actually, we do things backwards. So when people talk about images, they normally go width by height. But when we talk mathematically, we always go number of rows by number of columns.

So it'd actually be 480 by 640 by 3. That will catch you out. We don't say dimensions, though, with tensors. We use one of two words. We either say "rank" or "axis". "Rank" specifically means how many axes are there, how many dimensions are there. So an image is generally a rank 3 tensor.

So what we've created here is a rank 1 tensor, or also known as a vector. But in math, people come up with slightly different words, or actually, no, they come up with very different words for slightly different concepts. Why is a one-dimensional array a vector and a two-dimensional array is a matrix, and then a three-dimensional array doesn't even have a name?

Not really. It doesn't have a name. It doesn't make any sense. With computers, we try to have some simple, consistent naming conventions. They're all called tensors. Rank 1 tensor, rank 2 tensor, rank 3 tensor. You can certainly have a rank 4 tensor. If you've got 64 images, then that would be a rank 4 tensor of 64 by 480 by 640 by 3, for example.
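To make the idea of rank concrete, here's a small sketch, assuming PyTorch. The variable names are mine, and the batch uses 4 images rather than 64 only to keep the memory use small.

```python
import torch

vector = torch.zeros(4)             # rank 1
matrix = torch.zeros(4, 3)          # rank 2
cube = torch.zeros(3, 4, 6)         # rank 3
image = torch.zeros(480, 640, 3)    # rows by columns by channels
batch = torch.zeros(4, 480, 640, 3) # a small batch of images (assumption: 4, not 64)

for name, t in [("vector", vector), ("matrix", matrix), ("cube", cube),
                ("image", image), ("batch", batch)]:
    # ndim is the rank: how many axes the tensor has.
    print(name, "rank", t.ndim, "shape", tuple(t.shape))
```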

So tensors are very simple. They just mean arrays. And so in PyTorch, you say tensor, and you pass in some numbers, and you get back, in this case since it's just a list, a vector. So this then represents our coefficients, the slope and the intercept of our line.

And so because remember, we're not actually going to have a special case of Ax + b. Instead, we're going to say there's always this second x value, which is always 1. You can see it here, always 1, which allows us just to do a simple matrix vector product. OK, so that's a.

And then we wanted to generate this x array of data, which is going to have, we're going to put random numbers in the first column and a whole bunch of ones in the second column. So to do that, we basically say to PyTorch, create a rank 2 tensor. Actually, no, sorry, let's start that again.

We say to PyTorch that we want to create a tensor of n by 2. So since we passed in a total of two things, we get a rank 2 tensor, the number of rows will be n, and the number of columns will be 2. And in there, every single thing in it will be a 1.

That's what torch.ones means. And then, this is really important, you can index into that, just like you can index into a list in Python, but you can put a colon anywhere, and a colon means every single value on that axis, or every single value on that dimension. So this here means every single row.

And then this here means column 0. So this is every row of column 0. I want you to grab a uniform random number. And here's another very important concept. In PyTorch, any time you've got a function that ends in an underscore, that means don't return to me that uniform random number, but replace whatever this is being called on with the result of this function.

So this takes column 0, and replaces it with a uniform random number between -1 and 1. So there's a lot to unpack there, right? But the good news is, those two lines of code, plus this one, which we're coming to, cover 95% of what you need to know about PyTorch.

How to create an array, how to change things in an array, and how to do matrix operations on an array. So there's a lot to unpack, but these small number of concepts are incredibly powerful. So I can now print out the first five rows. OK, so colon 5 is standard Python slicing syntax to say the first five rows.
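The steps just described can be sketched like this, assuming PyTorch. The fixed seed is my addition so the numbers repeat from run to run; n = 100 matches the number of points mentioned later in the lesson.

```python
import torch

torch.manual_seed(0)  # assumption: seeded only so the output is repeatable

n = 100
a = torch.tensor([3., 2.])   # the coefficients: slope 3, intercept 2

x = torch.ones(n, 2)         # rank 2 tensor, n rows by 2 columns, all ones
x[:, 0].uniform_(-1., 1.)    # every row of column 0, replaced in place

print(x[:5])                 # first five rows: random numbers, then the 1s
```

Note the trailing underscore on uniform_: it changes x itself rather than handing back a new tensor.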

So here are the first five rows, two columns, looking like my random numbers, and my 1. So now I can do a matrix product of that x by my a, add in some random numbers to add a bit of noise, and then I can do a scatterplot. And I'm not really interested in my scatterplot in this column of 1's, right?

They're just there to make my linear function more convenient. So I'm just going to plot my 0 index column against my y's, and there it is. plt is what we universally use to refer to the plotting library matplotlib. And that's what most people use for most of their plotting in Python. In scientific Python, we use matplotlib.

It's certainly a library you'll want to get familiar with, because being able to plot things is really important. There are lots of other plotting packages. Lots of them, the other packages are better at certain things than matplotlib, but like matplotlib can do everything reasonably well. Sometimes it's a little awkward, but for me, I do pretty much everything in matplotlib because there's really nothing it can't do, even though some libraries can do other things a little bit better or a little bit prettier.

But it's really powerful, so once you know matplotlib, you can do everything. So here I'm asking matplotlib to give me a scatterplot with my x's against my y's, and there it is. So this is my dummy data representing like temperature and ice cream sales. So now what we're going to do is we're going to pretend we were given this data, and we don't know that the values of our coefficients are 3 and 2.

So we're going to pretend that we never knew that we have to figure them out. So how would we figure them out? How would we draw a line to fit to this data? And why would that even be interesting? Well, we're going to look at more about why it's interesting in just a moment, but the basic idea is this.

If we can find, this is going to be kind of perhaps really surprising, but if we can find a way to find those two parameters to fit that line to those, how many points were there, n was 100. If we can find a way to fit that line to those 100 points, we can also fit these arbitrary functions that convert from pixel values to probability.

It'll turn out that these techniques that we're going to learn to find these two numbers works equally well for the 50 million numbers in ResNet 34. So we're actually going to use an almost identical approach. And this is the bit that I found in previous classes, people have the most trouble digesting.

Like I often find, even after week four or week five, people will come up to me and say, "I don't get it. How do we actually train these models?" And I'll say, "It's SGD. It's that thing we saw in the notebook with the two numbers." It's like, "Yeah, but we're fitting a neural network." It's like, "I know, and we can't print the 50 million numbers anymore, but it is literally doing the same thing." And the reason this is hard to digest is that the human brain has a lot of trouble conceptualizing what an equation with 50 million numbers looks like and can do.

So you just kind of, for now, will have to take my word for it. It can do things like recognize teddy bears. You know, these functions turn out to be very powerful. And we're going to learn a little bit more in just a moment about how to make them extra powerful.

But for now, this thing we're going to learn to fit these two numbers is the same thing that we've just been using to fit 50 million numbers. Okay. So we want to find what PyTorch calls parameters, or in statistics you'll often hear called coefficients, these values A1 and A2.

We want to find these parameters such that the line that they create minimizes the error between that line and the points. So in other words, you know, if we created, you know, if the A1 and A2 we came up with resulted in this line, then we'd look and we'd see, like, how far away is that line from each point?

And we'd say, oh, that's quite a long way. And so maybe there was some other A1 or A2 which resulted in this line. And then we'd say, like, oh, how far away is each of those points? And then eventually we come up with this line and it's like, oh, in this case, each of those is actually very close.

So you can see how in each case we can say, how far away is the line at each spot away from its point? And then we can take the average of all those and that's called the loss. That is the value of our loss. So you need some mathematical function that can basically say, how far away is this line from those points?

For this kind of problem, which is called a regression problem, a problem where your dependent variable is continuous. So rather than being grizzlies or teddies, it's like some number between -1 and 6. This is called a regression problem. And for regression, the most common loss function is called mean squared error, which pretty much everybody calls MSE.

You may also see RMSE, which is root mean squared error. And so the mean squared error is a loss. It's the difference between some prediction that you've made, which is the value of the line, and the actual number of ice cream sales. And so in the mathematics of this, people normally refer to the actual, they normally call it Y, and the prediction they normally call it Y hat, as in they write it.

Like that. And so what I try to do, like when we're writing something like a mean squared error equation, there's no point writing ice cream here and temperature here, because we want it to apply to anything. So we tend to use these mathematical placeholders. So the value of mean squared error is simply the difference between those two squared.

And then we can take the mean. Because remember, that is actually a vector, or what we now call it, a rank one tensor. And that is actually a rank one tensor. So it's the value of the number of ice cream sales at each place. So when we subtract one vector from another vector, and we're going to be learning a lot more about this, but it does something called element-wise arithmetic.

In other words, it subtracts each one from each other. And so we end up with a vector of differences. And then if we take the square of that, it squares everything in that vector. And so then we can take the mean of that to find the average square of the differences between the actuals and the predicted.

So if you're more comfortable with mathematical notation, what we just wrote there was the sum of (which way round do we do it?) y hat minus y, squared, over n. So that equation is the same as that code. So one of the things I'll note here is I don't think the code is any more complicated or unwieldy than the math notation.
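Here's a small worked example of mean squared error, assuming PyTorch. The function is just the description above written as code, and the three actuals and predictions are made-up numbers so you can check each step by hand.

```python
import torch

def mse(y_hat, y):
    # Same thing as the math: sum of (y_hat - y) squared, over n.
    return ((y_hat - y) ** 2).mean()

# Hypothetical values, not from the lesson's data.
y = torch.tensor([1., 4., 6.])      # actuals
y_hat = torch.tensor([1., 2., 3.])  # predictions

diff = y_hat - y        # element-wise subtraction: a vector of differences
print(diff)             # 0, -2, -3
print(diff ** 2)        # squares everything in the vector: 0, 4, 9
print(mse(y_hat, y))    # the mean of those squares: 13/3
```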

But the benefit of this is you can experiment with it. Once you've defined it, you can use it. You can send things into it and get stuff out of it and see how it works. So for me, most of the time I prefer to explain things with code rather than with math, because they're the same.

In this case, at least, in all the cases we'll look at, they're exactly the same. They're just different notations for the same thing. But one of the notations is executable. It's something that you can experiment with. And one of them is abstract. So that's why I'm generally going to show code.

So the good news is if you're a coder with not much of a math background, actually you do have a math background, because code is math. If you've got more of a math background and less of a code background, then actually a lot of the stuff that you learn from math is going to translate very directly into code.

And now you can start to experiment really with your math. OK, so this is the loss function. This is something that tells us how good our line is. So now we have to kind of come up with what is the line that fits through here. Remember, we don't know.

We're going to pretend we don't know. So what you actually have to do is you have to guess. You actually have to come up with a guess. What are the values of a1 and a2? So let's say we guessed that a1 and a2 are both 1. So this is our tensor.

a is 1, 1. Right? So here is how we create that tensor. And I wanted to write it this way because you'll see this all the time. Like, written out, it should be 1.0. Oh, sorry, it looks like we're starting with -1. -1. Written out fully, it would be -1.0, 1.0.

Like, that's written out fully. We can't write it without the point because that's now an int, not a floating point. So that's going to spit the dummy if you try to do calculations with that in neural nets. Right? I'm lazy. I'm far too lazy to type .0 every time.

Python knows perfectly well that if you add a dot next to any of these numbers, then the whole thing is now floats. Right? So that's why you'll often see it written this way, particularly by lazy people like me. Okay. So a is a tensor. You can see it's floating point.

You see, like, even PyTorch is lazy. They just put a dot. They don't bother with a zero. Right? But if you want to actually see exactly what it is, you can write .type. And you can see it's a float tensor. Okay? And so now we can calculate our predictions with this, like, random guess x at a, matrix product of x and a.
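A quick sketch of the difference that one dot makes, assuming PyTorch with its default settings (where floats become 32-bit float tensors). The names a_float and a_int are mine.

```python
import torch

a_float = torch.tensor([-1., 1])  # one dot is enough: the whole tensor is floats
a_int = torch.tensor([-1, 1])     # no dot anywhere: a tensor of ints

print(a_float, a_float.type())
print(a_int, a_int.type())
```

Strictly it's torch.tensor that looks at the list and picks a floating point type for the whole thing, which is why typing a single dot does the job.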

And we can now calculate the mean squared error of our predictions and our actuals. And that's our loss. Okay? So for this regression, our loss is 8.9. And so we can now plot a scatter plot of x against y, and we can plot the scatter plot of x against y hat, our predictions, and there they are.

Okay? So this is the 1 comma minus 1 line. Sorry, minus 1 comma 1 line, and here's our actuals. So that's not great. It's not surprising. It's just a guess. So SGD, or gradient descent more generally, and anybody who's done any engineering or probably computer science at school will have done plenty of this, like, Newton's method, whatever.

It's all the stuff that you did at university. If you didn't, don't worry. We're going to learn it now. It's basically about taking this guess and trying to make it a little bit better. So how do we make it a little bit better? Well, there's only two numbers, right?

And the two numbers are the intercept of that orange line and the gradient of that orange line. So what we're going to do with gradient descent is we're going to simply say, what if we change those two numbers a little bit? What if we made the intercept a little bit higher or a little bit lower?

What if we made the gradient a little bit more positive or a little bit more negative? So there's like four possibilities. And then we can just calculate the loss for each of those four possibilities and see what works. Did lifting it up or down make it better? Did tilting it more positive or more negative make it better?

And then all we do is we say, OK, well, whichever one of those made it better, that's what we're going to do. And that's it. But here's the cool thing for those of you that remember calculus. You don't actually have to move it up and down and round about.

You can actually calculate the derivative. The derivative is the thing that tells you would moving it up or down make it better or would rotating it this way or that way make it better. So the good news is if you didn't do calculus or you don't remember calculus, I just told you everything you need to know about it, right?

Which is that it tells you how changing one thing changes the function. That's what the derivative is. Kind of not quite strictly speaking, but close enough, also called the gradient. So the gradient or the derivative tells you how changing A1 up or down would change our MSE, how changing A2 up or down would change our MSE.

And it does it more quickly. It does it more quickly than actually moving it up and down, right? So in school, unfortunately, they force us to sit there and calculate these derivatives by hand. We have computers. Computers can do that for us. We are not going to calculate them by hand.

Instead, we're going to call .grad. On our computer, that will calculate the gradient for us. So here's what we're going to do. We're going to create a loop. We're going to loop through 100 times and we're going to call a function called update. That function is going to calculate y hat, our prediction.

It is going to calculate loss, our mean squared error. From time to time, it will print that out so we can see how we're going. It will then calculate the gradient. And in PyTorch, calculating the gradient is done by using a method called backward. So you'll see something really interesting, which is mean squared error was just a simple standard mathematical function.

PyTorch, for us, keeps track of how it was calculated and lets us calculate the derivative. So if you do a mathematical operation on a tensor in PyTorch, you can call backward to calculate the derivative. What happens to that derivative? It gets stuck inside an attribute called .grad. So I'm going to take my coefficients a, and I'm going to subtract from them my gradient.
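To see that the derivative really is the same as nudging each number a little and watching the loss, here's a sketch that does both, assuming PyTorch. The seed, the eps value and the names nudged and estimates are my own choices, and the noise is uniform random numbers, which is an assumption about how the data was generated.

```python
import torch

torch.manual_seed(0)  # assumption: seeded for repeatability
n = 100
x = torch.ones(n, 2)
x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + torch.rand(n)  # line plus a bit of noise

def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

# Our guess, with gradient tracking switched on.
a = torch.tensor([-1., 1.], requires_grad=True)
loss = mse(x @ a, y)
loss.backward()          # PyTorch fills in a.grad for us
print("autograd:", a.grad)

# The slow way: move each coefficient up a tiny bit and see how the loss changes.
eps = 1e-3
estimates = []
with torch.no_grad():
    for i in range(2):
        nudged = a.clone()
        nudged[i] += eps
        estimates.append(((mse(x @ nudged, y) - loss) / eps).item())
print("nudging: ", estimates)
```

The two sets of numbers should agree closely; backward just gets there without having to try every nudge.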

And there's an underscore here. Why? Because that's going to do it in place. So it's going to actually update those coefficients a to subtract the gradients from them. So why do we subtract? Because the gradient tells us if I move the whole thing downwards, the loss goes up. If I move the whole thing upwards, the loss goes down.

So I want to do the opposite of the thing that makes it go up. Because we want our loss to be small. So that's why we have to subtract. And then there's something here called lr. lr is our learning rate. And so literally all it is, is the thing that we multiply by the gradient.

Why is there any Lr at all? Let me show you why. Let's take a really simple example. A quadratic. And let's say your algorithm's job was to find where that quadratic was at its lowest point. And so how could it do this? Well, just like what we're doing now, the starting point would just be to pick some x value at random.

And then pop up here to find out what the value of y is. And that's the starting point. And so then it can calculate the gradient. And the gradient is simply the slope. It tells you moving in which direction is going to make you go down. And so the gradient tells you you have to go this way.

So if the gradient was really big, you might jump this way a very long way. So you might jump all the way over to here. Maybe even here. Right? And so if you jumped over to there, then that's actually not going to be very helpful. Because then you see, well, where does that take us to?

Oh, it's now worse. Right? We jumped too far. So we don't want to jump too far. So maybe we should just jump a little bit. Maybe to here. And the good news is that is actually a little bit closer. And so then we'll just do another little jump. See what the gradient is and do another little jump.

That takes us to here. And another little jump. That takes us to here. Here. Here. Right? So in other words, we find our gradient to tell us kind of what direction to go and like, do we have to go a long way or not too far? But then we multiply it by some number less than one so we don't jump too far.

And so hopefully at this point, this might be reminding you of something, which is what happened when our learning rate was too high. So do you see why that happened now? Our learning rate was too high, meant that we jumped all the way past the right answer further than we started with, and it got worse and worse and worse.

So that's what a learning rate too high does. On the other hand, if our learning rate is too low, then you just take tiny little steps. And so eventually you're going to get there, but you're doing lots and lots of calculations along the way. So you really want to find something where it's kind of either like this, or maybe it's kind of a little bit backwards and forwards.

Maybe it's kind of like this. Something like that. You know, you want something that kind of gets in there quickly, but not so quickly it jumps out and diverges, not so slowly that it takes lots of steps. So that's why we need a good learning rate. And so that's all it does.
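You can try the quadratic example yourself without any libraries. This is a sketch in plain Python, assuming the quadratic is f(x) = x squared (so the gradient is 2x) and a starting point of 3; the function name descend and the three learning rates are mine.

```python
def descend(lr, start=3.0, steps=20):
    """Minimise f(x) = x**2 by gradient descent and return the final x."""
    x = start
    for _ in range(steps):
        grad = 2 * x       # the slope of x**2 at the current point
        x = x - lr * grad  # jump against the slope, scaled by the learning rate
    return x

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: x after 20 steps = {descend(lr):.4f}")
```

With the tiny learning rate it has barely moved from 3, with 0.1 it gets close to the minimum at 0, and with 1.1 each jump overshoots further than the last so it diverges.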

So if you look inside the source code of any deep learning library, you'll find this. You'll find something that just says coefficients dot subtract learning rate times gradient. And we'll learn about some easy but important optimizations we can do to make this go faster. But that's basically it. There's a couple of other little minor issues that we don't need to talk about now.

One involving zeroing out the gradients, another involving making sure that you turn gradient calculation off when you do the SGD update. If you're interested, we can discuss them on the forum. Or you can do our introduction to machine learning course, which covers all the mechanics of this in more detail.

But this is the basic idea. So if we run update 100 times, printing out the loss from time to time, you can see it starts at 8.9 and it goes down, down, down, down, down, down, down. And so we can then print out scatter plots. And there it is.
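Putting the pieces together, the whole loop can be sketched as below, assuming PyTorch. The seed, the uniform noise and the losses list are my additions; the zeroing of the gradients and the torch.no_grad() block are the two minor issues mentioned above.

```python
import torch
from torch import nn

torch.manual_seed(0)  # assumption: seeded for repeatability
n = 100
x = torch.ones(n, 2)
x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + torch.rand(n)  # line plus uniform noise

def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

a = nn.Parameter(torch.tensor([-1., 1.]))  # our initial guess
lr = 1e-1
losses = []  # kept only so we can look at the history afterwards

def update(t):
    y_hat = x @ a              # predictions
    loss = mse(y_hat, y)       # how good is the line?
    losses.append(loss.item())
    if t % 10 == 0:
        print(t, loss.item())
    loss.backward()            # gradient ends up in a.grad
    with torch.no_grad():      # don't track the update step itself
        a.sub_(lr * a.grad)    # coefficients minus learning rate times gradient
        a.grad.zero_()         # otherwise gradients add up across steps

for t in range(100):
    update(t)

print(a.detach())
```

Because the noise here is uniform between 0 and 1, the fitted intercept should land near 2.5 rather than exactly 2.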

That's it. Believe it or not, that's gradient descent. So we just need to start with a function that's a bit more complex than x at a. But as long as we have a function that can represent things like is this a teddy bear, we now have a way to fit it.

Okay? And so let's now take a look at this as a picture, as an animation. And this is one of the nice things that you can do with... This is one of the nice things that you can do with Matplotlib is you can take any plot and turn it into an animation.

And so you can now actually see it updating each step. So let's see what we did here. We simply said, as before, create a scatter plot. But then rather than having a loop, we used Matplotlib's FuncAnimation. So call 100 times this function. And this function just called that update that we created earlier and then updated the y data in our line.

And so did that 100 times, waiting 20 milliseconds after each one. And there it is. So you might think that visualizing your algorithms with animations is some amazing and complex thing to do. But actually now you know it's 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 lines of code.

Okay? So I think that is pretty damn cool. So that is SGD visualized. And so we can't visualize as conveniently what updating 50 million parameters in a ResNet 34 looks like. But it's basically doing the same thing. Okay? And so studying these simple versions is actually a great way to get an intuition.

So you should try running this notebook with a really big learning rate, with a really small learning rate, and see what this animation looks like. Try to get a feel for it. Maybe you can even try a 3D plot. I haven't tried that yet, but I'm sure it would work fine too.

So the only difference between stochastic gradient descent and this is something called mini-batches. You'll see what we did here was we calculated the value of the loss on the whole data set on every iteration. But if your data set is 1.5 million images in ImageNet, that's going to be really slow, right?

Just to do a single update of your parameters, you've got to calculate the loss on 1.5 million images. You wouldn't want to do that. So what we do is we grab 64 images or so at a time, at random, and we calculate the loss on those 64 images, and we update our weights.

And then we grab another 64 random images, and we update the weights. So in other words, the loop basically looks exactly the same, but at this point here, it would basically be y[...] with some random indexes here, and would basically do the same thing. Well, actually, sorry, it would be there, right?

So some random indexes on our x and some random indexes on our y to do a mini-batch at a time. That would be the basic difference. And so once you add those, grab a random few points each time, those random few points are called your mini-batch, and that approach is called SGD, or Stochastic Gradient Descent.
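Here's what that change might look like, as a sketch assuming PyTorch. The batch size of 16, the 200 steps and the use of torch.randperm to pick the random indexes are all my choices, not something from the lesson notebook.

```python
import torch

torch.manual_seed(0)  # assumption: seeded for repeatability
n, bs, lr = 100, 16, 1e-1
x = torch.ones(n, 2)
x[:, 0].uniform_(-1., 1.)
y = x @ torch.tensor([3., 2.]) + torch.rand(n)

def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

a = torch.tensor([-1., 1.], requires_grad=True)
loss_before = mse(x @ a, y).item()

for t in range(200):
    idx = torch.randperm(n)[:bs]    # a random few points: the mini-batch
    loss = mse(x[idx] @ a, y[idx])  # loss on just those points
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
        a.grad.zero_()

loss_after = mse(x @ a, y).item()   # check on the whole data set
print(loss_before, loss_after)
```

The loop is the same as before; the only new thing is indexing x and y with idx.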

So there's quite a bit of vocab we've just covered, right? So let's just remind ourselves. The learning rate is a thing that we multiply our gradient by to decide how much to update the weights by. An epoch is one complete run through all of our data points, all of our images.

So for the non-stochastic gradient descent we just did, every single loop we did the entire data set. But if you've got a data set with a thousand images, and your mini-batch size is 100, then it would take you 10 iterations to see every image once. So that would be one epoch.
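The arithmetic is simple enough to check in a couple of lines of plain Python. The helper name iterations_per_epoch is hypothetical, and rounding up for a final partial batch is my assumption.

```python
import math

def iterations_per_epoch(n_items, batch_size):
    # Assumption: a final, smaller batch still counts as one iteration.
    return math.ceil(n_items / batch_size)

print(iterations_per_epoch(1000, 100))      # the example above: 10
print(iterations_per_epoch(1_500_000, 64))  # roughly ImageNet with batches of 64
```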

Epochs are important because if you do lots of epochs, then you're looking at your images lots of times. And so every time you see an image, there's a bigger chance of overfitting. So we generally don't want to do too many epochs. A mini-batch is just a random bunch of points that you use to update your weights.

SGD is just gradient descent using mini-batches. Architecture and model kind of mean the same thing. In this case, architecture is y = x@a. The architecture is the mathematical function that you're fitting the parameters to. And we're going to learn today or next week what the mathematical function of things like ResNet 34 actually is.

But it's basically pretty much what you've just seen. It's a bunch of matrix products. Parameters, also known as coefficients, also known as weights, are the numbers that you're updating. And then loss function is the thing that's telling you how far away or how close you are to the correct answer.

Any questions? Alright, so these models, these predictors, these teddy bear classifiers are functions that take pixel values and return probability. They start with some functional form like y = x@a and they fit the parameters a using SGD to try and do the best to calculate your predictions. So far we've learned how to do regression, which is a single number.

Next week we'll learn how to do the same thing for classification where we have multiple numbers. But it's basically the same. In the process we had to do some math. We had to do some linear algebra and we had to do some calculus. And a lot of people get a bit scared at that point and tell us, "I am not a math person." If that is you, that's totally okay, but you're wrong.

You are a math person. In fact, it turns out that in the actual academic research around this, there are not math people and non-math people. It turns out to be entirely a result of culture and expectations. So you should check out Rachel's talk. There's No Such Thing as Not a Math Person, where she will introduce you to some of that academic research.

And so if you think of yourself as not a math person, you should watch this so that you learn that you're wrong, that your thoughts are actually there because somebody has told you you're not a math person. But there's actually no academic research to suggest that there is such a thing.

In fact, there are some cultures like Romania and China where the Not a Math Person concept never even appeared. It's almost unheard of in some cultures for somebody to say, "I'm not a math person," because they just never entered that cultural identity. So don't freak out if words like derivative and gradient and matrix product are things that you're kind of scared of.

It's something you can learn. It's something you'll be okay with, okay? So the last thing that we're going to close with today... Oh, I just got a message from Simon Willison. Ah, Simon's telling me he's actually not that special. Lots of people won medals. That's the worst part about Simon, is not only is he really smart, he's also really modest, which I think is just awful.

I mean, if you're going to be that smart, at least be a horrible human being and make it okay. Okay, so the last thing I want to close with is the idea of, and we're going to look at this more next week, underfitting and overfitting. We just fit a line to our data, but imagine that our data wasn't actually line-shaped, right?

And so if we tried to fit something which was like constant plus constant times x, i.e. a line to it, then it's never going to fit very well, right? No matter how much we change these two coefficients, it's never going to get really close. On the other hand, we could fit some much bigger equation, so in this case it's a higher degree polynomial, with lots and lots of wiggly bits, like so, right?

But if we did that, it's very unlikely we go and look at some other place to find out the temperature that it is and how much ice cream they're selling, and that we'll get a good result, because like the wiggles are far too wiggly. So this is called overfitting.

We're looking for some mathematical function that fits just right, to stay with the teddy bear analogy. So you might think if you have a statistics background, the way to make things fit just right is to have exactly the right number of parameters. To use a mathematical function that doesn't have too many parameters in it.

It turns out that's actually completely not the right way to think about it. There are other ways to make sure that we don't overfit, and in general this is called regularization. Regularization are all the techniques to make sure that when we train our model, that it's going to work not only well on the data it's seen, but on the data it hasn't seen yet.

So the most important thing to know when you've trained a model is actually how well does it work on data that it hasn't been trained with. And so as we're going to learn a lot about next week, that's why we have this thing called a validation set. So what happens with a validation set is that we do our mini-batch SGD training loop with one set of data, with one set of teddy bears, grizzlies, black bears.

And then when we're done, we check the loss function and the accuracy to see how good is it on a bunch of images which were not included in the training. And so if we do that, then if we have something which is too wiggly, it'll tell us, "Oh, your loss function and your error is really bad." Because on the bears that it hasn't been trained with, the wiggly bits are in the wrong spot.

Or if it was underfitting, it would also tell us that your validation set's really bad. So even for people that don't go through this course and don't learn about the details of deep learning, like if you've got managers or colleagues or whatever at work who are kind of wanting to learn about AI, the only thing that you really need to be teaching them is about the idea of a validation set.

Because that's the thing they can then use to figure out if somebody's selling them snake oil or not: hold back some data, and then when they get told, "Oh, here's a model that we're going to roll out," they can say, "Okay, fine. I'm just going to check it on this held-out data to see whether it generalizes."

There's a lot of details to get right when you design your validation set. We will talk about them briefly next week, but a fuller version is in Rachel's piece on the fast.ai blog called "How and Why to Create a Good Validation Set." And this is also one of the things we go into in a lot of detail in the Intro to Machine Learning course.
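To make the hold-out idea concrete, here is a minimal sketch in plain NumPy rather than the fastai library. Everything in it is an assumption for illustration: the synthetic line data, the 80/20 random split, the learning rate, and the batch size. It sets aside a validation set, runs a mini-batch SGD loop on the remaining data only, and then reports the loss on both sets.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: points scattered around the line y = 3x + 2.
n = 500
x = rng.uniform(-1.0, 1.0, n)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, n)

# Hold back 20% of the examples; the training loop never sees them.
idx = rng.permutation(n)
n_valid = n // 5
valid_idx, train_idx = idx[:n_valid], idx[n_valid:]

def mse(a, b, xs, ys):
    # Mean squared error of the line a*x + b on the given points.
    return float(np.mean((a * xs + b - ys) ** 2))

a, b = 0.0, 0.0              # the two coefficients we are learning
lr, batch_size = 0.1, 32     # arbitrary choices for this sketch

for epoch in range(50):
    order = rng.permutation(train_idx)          # reshuffle the training examples each epoch
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        err = a * x[batch] + b - y[batch]
        # Gradients of the mean squared error with respect to a and b.
        a -= lr * 2.0 * np.mean(err * x[batch])
        b -= lr * 2.0 * np.mean(err)

train_loss = mse(a, b, x[train_idx], y[train_idx])
valid_loss = mse(a, b, x[valid_idx], y[valid_idx])
print(f"a={a:.2f}, b={b:.2f}, train loss={train_loss:.4f}, validation loss={valid_loss:.4f}")
```

If the validation loss comes out close to the training loss, the model is generalizing to data it was not trained on. A validation loss that is much worse than the training loss is the warning sign described above. A random split is only the simplest option; as the lecture notes, designing a good validation set has more details to get right.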

So we're going to try and give you enough to get by for this course, but it is certainly something that's worth deeper study as well. Any questions or comments before we wrap up? Okay, good. All right. Well, thanks, everybody. I hope you have a great time building your web applications.

See you next week. (audience applauds)
