Lesson 3 - Deep Learning for Coders (2020)
Chapters
0:00 Recap of Lesson 2 + What's next
1:08 Resizing Images with DataBlock
8:46 Data Augmentation and item_tfms vs batch_tfms
12:28 Training your model, and using it to clean your data
18:07 Turning your model into an online application
36:12 Deploying to a mobile phone
38:13 How to avoid disaster
50:59 Unforeseen consequences and feedback loops
57:20 End of Chapter 2 Recap + Blogging
64:09 Starting MNIST from scratch
66:58 untar_data and path explained
70:57 Exploring the MNIST data
72:05 NumPy Array vs PyTorch Tensor
76:00 Creating a simple baseline model
88:38 Working with arrays and tensors
90:50 Computing metrics with Broadcasting
99:46 Stochastic Gradient Descent (SGD)
114:40 End-to-end Gradient Descent example
121:56 MNIST loss function
124:40 Lesson 3 review
00:00:00.000 |
So hello and welcome to lesson three of practical deep learning for coders 00:00:09.480 |
We're looking at getting our model into production last week 00:00:14.380 |
And so we're going to finish off that today and then we're going to start to look behind the scenes at what actually goes 00:00:20.360 |
On when we train a neural network, we're going to look at 00:00:26.760 |
And we're going to learn about SGD and important stuff like that 00:00:31.820 |
The order is slightly different to the book. In the book, there's a part which says like, hey, 00:00:38.640 |
You can either go to lesson four or lesson three now 00:00:41.880 |
And then go back to the other one afterwards. So we're doing lesson four and then lesson three 00:00:46.440 |
Chapter four and then chapter three I should say 00:00:49.280 |
You can do them in whichever order you're interested in. Chapter four is the more 00:00:55.320 |
technical chapter, about the foundations of how deep learning really works, while chapter three is all about ethics 00:01:02.120 |
And so with the lessons we'll do that next week 00:01:13.720 |
We're going to look at the fastbook version; in fact, everything I'm looking at today will be in the fastbook version 00:01:25.680 |
our bears example, where we created this DataLoaders object using 00:01:34.240 |
the data block API, which I hope everybody's had a chance to experiment with this week. If you haven't, 00:01:42.480 |
We kind of skipped over one of the lines a little bit 00:01:48.840 |
So what this is doing here when we said resize 00:01:53.440 |
The the images we downloaded from the internet are lots of different sizes and lots of different aspect ratios 00:01:58.320 |
Some are tall and some are wide, some are square, and some are big, some are small 00:02:01.920 |
When you say Resize for an item transform, it means each item (an item in this case is one image) 00:02:09.080 |
It's going to be resized to 128 by 128 by squishing it or stretching it 00:02:14.960 |
And so we had a look at you can always say show batch to see a few examples and this is what they look like 00:02:25.000 |
Squishing and stretching isn't the only way that we can resize. Remember, we have to make everything into a square 00:02:31.280 |
before we kind of get it into our model; by the time it gets to our model, everything has to be the same size in 00:02:37.520 |
each mini-batch, and that's why. Making it a square is not the only way to do that, 00:02:41.640 |
but it's the easiest way, and it's by far the most common way 00:02:56.720 |
Data block object and we can make a data block object 00:03:01.980 |
That's an identical copy of an existing data block object where we can then change just some pieces 00:03:07.940 |
And we can do that by calling the new method, which is super handy. And so let's create another data block 00:03:13.840 |
object, and this time with a different item transform, where we resize using the 00:03:23.480 |
We have a question. What are the advantages of having square images versus rectangular ones? 00:03:40.800 |
If you know all of your images are rectangular of a particular aspect ratio to start with you may as well 00:03:47.680 |
Just keep them that way. But if you've got some which are tall and some which are wide 00:03:51.840 |
Making them all square is kind of the easiest 00:03:55.440 |
Otherwise you would have to kind of organize them such that all of the tall ones ended up in a mini-batch 00:04:01.760 |
and all the wide ones ended up in a mini-batch, and then you'd have to figure out 00:04:05.960 |
what the best aspect ratio for each mini-batch is, and we actually have some research that does that in fastai 00:04:19.000 |
I just lied to you. The default is not actually to squish or stretch; sorry, I should have said the default is to 00:04:28.640 |
grab the center. So actually, all we're doing is we're grabbing the center of each image 00:04:34.880 |
So if we want to squish or stretch, we can add the ResizeMethod.Squish 00:04:38.840 |
argument to Resize, and you can now see that this black bear is now looking much thinner 00:04:44.680 |
but we have got the kind of leaves that are around on each side instead 00:04:49.000 |
Another question when you use the dls dot new method what can and cannot be changed is it just the transforms 00:04:59.720 |
So it's not dls.new, it's bears.new, right? So we're not creating a new DataLoaders object 00:05:04.680 |
We're creating a new data block object. I don't remember off the top of my head 00:05:08.920 |
so check the documentation and I'm sure somebody can pop the answer into the 00:05:16.520 |
So you can see when we use dot squish that this grizzly bear has got 00:05:24.280 |
Wide and weird looking and this black bear has got pretty weird and thin looking and it's easiest kind of to see what's going on 00:05:32.680 |
ResizeMethod.Pad, and what pad does, as you can see, is it just adds some 00:05:37.200 |
Black bars around each side so you can see the grizzly bear was tall 00:05:41.640 |
So then, squishing and stretching being opposites of each other, 00:05:45.680 |
when we stretched it, it ended up wide; and the black bear was 00:05:49.560 |
originally a wide rectangle, so it ended up looking kind of thin 00:05:54.240 |
You don't have to use zeros; zeros means pad it with black. You can also say reflect, so that 00:06:02.580 |
the pixels will kind of look a bit better if you use reflection padding 00:06:08.880 |
All of these different methods have their own problems 00:06:11.440 |
The pad method is kind of the cleanest: you end up with the correct size and you end up with all of the pixels 00:06:18.400 |
But you also end up with wasted pixels so you kind of end up with wasted computation 00:06:22.840 |
The squish method is the most efficient because you get all of the information 00:06:28.440 |
you know, and nothing's kind of wasted. But on the downside, your neural net is going to have to learn to kind of 00:06:38.200 |
recognize when something's been squished or stretched, and in some cases it wouldn't even know 00:06:42.440 |
So if there's two objects you're trying to recognize one of which tends to be thin and one of which tends to be thick 00:06:47.480 |
And otherwise they're the same they could actually be impossible to distinguish 00:06:51.520 |
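These tradeoffs can be quantified with a bit of plain-Python arithmetic (a sketch, not fastai code): for a 640 by 480 image, center-cropping throws away a quarter of the pixels, padding wastes a quarter of the square on black bars, and squishing stretches the short axis by a third.

```python
def resize_tradeoffs(w, h):
    """Quantify the cost of each way to make a (w, h) image square."""
    short, long = min(w, h), max(w, h)
    crop_lost = 1 - short / long       # crop: fraction of pixels thrown away
    pad_wasted = 1 - short / long      # pad: fraction of the square that is bars
    squish_stretch = long / short      # squish: stretch factor on the short axis
    return crop_lost, pad_wasted, squish_stretch

crop_lost, pad_wasted, squish_stretch = resize_tradeoffs(640, 480)
print(crop_lost, pad_wasted, squish_stretch)  # 0.25 0.25 1.333...
```

Interestingly, the crop and pad costs are the same fraction for any aspect ratio; they just pay it in different currencies (lost information vs. wasted computation).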
And then the default cropping approach actually 00:07:07.440 |
So if figuring out what kind of bear it was 00:07:10.400 |
Required looking at its feet. Well. We don't have its feet anymore 00:07:16.920 |
So there's something else that you can do, a different approach: instead of saying Resize, you can say 00:07:25.280 |
RandomResizedCrop. And actually, this is the most common approach. What RandomResizedCrop does is, each time, it grabs a 00:07:35.920 |
different part of the image and kind of zooms into it 00:07:40.000 |
Right. So these this is all the same image and we're just grabbing a batch of 00:07:45.080 |
four different versions of it and you can see some are kind of 00:07:48.920 |
You know, they're all squished in different ways, and we've kind of selected different subsets and so forth. Now this 00:07:56.160 |
kind of seems worse than any of the previous approaches, because I'm losing information; like this one here, 00:08:02.880 |
I've actually lost a whole lot of its back, right? 00:08:07.000 |
But the cool thing about this is that remember we want to avoid overfitting and 00:08:12.720 |
When you see a different part of the animal each time 00:08:17.660 |
It's much less likely to overfit because you're not seeing the same image on each epoch that you go around 00:08:28.920 |
This RandomResizedCrop approach is actually super popular, and so min_scale=0.3 means 00:08:35.240 |
We're going to pick at least 30% of the pixels of kind of the original size each time 00:08:41.280 |
And then we kind of zoom into that square 00:08:52.280 |
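A rough plain-Python sketch of the geometry of min_scale=0.3 (the real RandomResizedCrop also varies the aspect ratio, and this helper name is made up): pick a scale of at least 0.3, take a crop covering that fraction of the pixels, then zoom it to the target size.

```python
import random

def random_resized_crop(w, h, size=128, min_scale=0.3):
    """Pick a random square crop covering at least min_scale of the pixels,
    then 'zoom' it to size x size (the real transform also varies aspect)."""
    scale = random.uniform(min_scale, 1.0)     # fraction of pixels to keep
    side = int((scale * w * h) ** 0.5)         # side of a square with that area
    side = min(side, w, h)                     # can't crop beyond the image
    left = random.randint(0, w - side)
    top = random.randint(0, h - side)
    return (left, top, side), (size, size)     # crop box and output size

(left, top, side), out = random_resized_crop(640, 480)
print((left, top, side), out)
```

Because the crop position and scale are random, each epoch sees a different view of the same photo, which is exactly the anti-overfitting effect described below.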
this idea of doing something so that each time the 00:08:56.400 |
model sees the image, it looks a bit different to last time, is called data augmentation; and this is one type of data augmentation 00:09:09.400 |
One of the best ways to do data augmentation is to use 00:09:13.960 |
this aug_transforms function, and what aug_transforms does is it actually returns a list of 00:09:27.040 |
augmentations which change contrast which change brightness 00:09:30.600 |
Which warp the perspective so you can see in this one here 00:09:34.120 |
It looks like this bit's much closer to you and this is further away from you, because it's been perspective warped 00:09:38.520 |
It rotates them. See this one's actually been rotated. This one's been made really dark, right? 00:09:43.860 |
These are batch transforms not item transforms 00:09:48.120 |
The difference is that item transforms happen one image at a time, and so the thing that resizes them all to the same size 00:09:56.000 |
is an item transform; then we pop it all into a mini-batch, put it on the GPU, and then a batch transform happens to a whole mini-batch at a time 00:10:03.160 |
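A rough plain-Python sketch of that split (fastai's real pipeline works on image tensors and runs the batch step on the GPU; the toy transforms here are made up): the item transform runs once per item so everything ends up the same size, then the batch transform runs once over the assembled mini-batch.

```python
def make_batch(items, item_tfm, batch_tfm):
    """Item transform per item (e.g. resize), then one batch transform
    over the whole assembled mini-batch (e.g. GPU-side augmentation)."""
    batch = [item_tfm(x) for x in items]   # one item at a time
    return batch_tfm(batch)                # whole mini-batch at once

# Toy stand-ins: 'resize' doubles a number, 'augment' adds one to each
out = make_batch([3, 5, 7], item_tfm=lambda x: x * 2,
                 batch_tfm=lambda b: [v + 1 for v in b])
print(out)  # [7, 11, 15]
```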
And by putting these as batch transforms, the augmentation happens super fast, because it happens on the GPU, and I don't know if there's any other 00:10:12.120 |
libraries, as we speak, which allow you to write your own GPU-accelerated transformations that run on the GPU in this way 00:10:19.760 |
So this is a super handy thing in fast AI too 00:10:32.040 |
aug_transforms, and when you do, you'll find the documentation for all of the underlying transforms that it basically wraps, right? 00:10:43.320 |
I don't remember if I've shown you this trick before if you go inside the parentheses of a function and hit shift tab a few 00:10:48.900 |
times it'll pop open a list of all of the arguments and so you can basically see you can say like oh 00:10:55.620 |
Can I sometimes flip it left-right? Can I sometimes flip it up-down? What's the maximum amount I can rotate or zoom? 00:11:10.680 |
How can we add different augmentations for train and validation sets? 00:11:18.960 |
Automatically fast AI will avoid doing data augmentation on the validation set 00:11:28.200 |
so all of these aug_transforms will only be applied to the training set 00:11:37.720 |
With the exception of random resized crop random resized crop has a different behavior for each 00:11:43.120 |
the behavior for the training set is what we just saw which is to randomly pick a subset and kind of zoom into it and 00:11:51.080 |
Behavior for the validation set is just to grab the center the largest center square that it can 00:12:02.440 |
Transformations that they're just Python. They're the standard pytorch code 00:12:06.120 |
By default, it will only be applied to the training set. If you want to do something fancy like RandomResizedCrop, 00:12:13.440 |
where you actually have different things being applied to each set, 00:12:15.960 |
you should come back to the next course to find out how to do that, or read the documentation. It's not rocket science, but it's 00:12:31.720 |
Last time, we did bears.new with a RandomResizedCrop min_scale of 0.5, and we added some transforms 00:12:39.140 |
We went ahead and trained actually since last week. I've rerun this notebook 00:12:44.080 |
It's on a different computer and I've got different images, so it's not all exactly the same 00:12:47.920 |
but I still got a good confusion matrix, so of the 00:12:52.840 |
black bears, 37 were classified correctly, two were classified as grizzlies, and one as a teddy 00:13:02.360 |
And we talked about plot_top_losses, and it's interesting; you can see in this case 00:13:06.800 |
There's some clearly kind of odd things going on. This is not a bear at all 00:13:10.960 |
This looks like it's a drawing of a bear, which it has 00:13:16.000 |
predicted as a teddy; but this thing's meant to be a drawing of a black bear. I can certainly see the confusion 00:13:22.880 |
You can see how some parts of it are being cut off. We'll talk about how to deal with that later 00:13:30.280 |
Now one of the interesting things is that we didn't really do 00:13:33.460 |
Much data cleaning at all before we built this model 00:13:37.840 |
The only data cleaning we did was just to validate that each image can be opened. There was that verify images call 00:13:44.360 |
And the reason for that is it's actually much easier normally to clean your data after you create a model and I'll show you how 00:13:52.220 |
We've got this thing called ImageClassifierCleaner, 00:13:56.840 |
Where you can pick a category right and training set or validation set 00:14:04.060 |
And then what it will do is it will list all of the images in that set, and it will pick the ones 00:14:16.880 |
which it is the least confident about, which are the most likely to be wrong, 00:14:23.280 |
where the loss is the worst, to be more precise. And so this 00:14:29.200 |
This is a great way to look through your data and find problems. So in this case, the first one is 00:14:37.520 |
Not a teddy or a brown bear or a black bear. It's a puppy dog 00:14:41.800 |
Right. So this is a great cleaner because what I can do is I can now click delete here 00:14:47.600 |
This one here looks a bit like an ewok rather than a teddy. I'm not sure 00:14:51.760 |
What do you think, Rachel, is it an ewok? I'm gonna call it an ewok 00:14:56.920 |
Okay, that's definitely not a teddy and so you can either say like oh that's wrong 00:15:05.400 |
It's a black bear or I should delete it or by default just keep it right and you can kind of keep going through until 00:15:09.960 |
You think like okay, they're all seem to be fine 00:15:17.280 |
And kind of once you get to the point where they all seem to be fine, you can kind of say, okay 00:15:21.880 |
Probably all the rest are fine too because they all have lower losses. So they all fit the kind of the mold of a teddy 00:15:31.400 |
where I just go through cleaner.delete (that's all the things which I selected delete for) and unlink them. Unlink 00:15:41.440 |
is just another way of saying delete a file; that's the Python name 00:15:46.760 |
And then go through all the ones that we said change and we can actually move them to the correct directory 00:15:53.340 |
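That delete-and-move loop looks roughly like this in plain Python; the directory tree, file names, and the to_delete/to_move lists below are made-up stand-ins for what ImageClassifierCleaner actually reports.

```python
import shutil, tempfile
from pathlib import Path

# Build a throwaway directory tree standing in for the downloaded images
root = Path(tempfile.mkdtemp())
for cat in ('teddy', 'black'):
    (root / cat).mkdir()
puppy = root / 'teddy' / 'puppy.jpg'; puppy.touch()                   # not a bear: delete
mislabelled = root / 'teddy' / 'black_bear.jpg'; mislabelled.touch()  # wrong label: move

to_delete = [puppy]                 # stand-in for the cleaner's delete selections
to_move = [(mislabelled, 'black')]  # stand-in for the cleaner's re-label selections

for fn in to_delete:
    fn.unlink()                     # unlink = delete the file
for fn, cat in to_move:
    shutil.move(str(fn), str(root / cat / fn.name))  # move to the correct directory
```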
If you haven't seen this before, you might be surprised that we've kind of created our own little GUI inside Jupyter notebook 00:16:04.800 |
Yeah, you can do this and we built this with less than a screen of code you can check out the source code in the 00:16:11.960 |
Fast AI notebooks. So this is a great time to remind you that 00:16:26.480 |
And so if you go to the fastai repo and clone it, and then go to nbs, you'll find the whole library 00:16:38.000 |
written as notebooks, and they've got a lot of prose and examples and tests and so forth 00:16:43.660 |
so the best place to learn about how this is implemented is to look at the notebooks rather than looking at the 00:16:58.280 |
By the way, sometimes you'll see like weird little comments like this 00:17:03.080 |
These weird little comments are part of a development environment for Jupyter notebooks that we use, called nbdev, which we built. 00:17:08.800 |
Sylvain and I built this thing to make it much easier for us to kind of create books 00:17:14.800 |
and websites and libraries in Jupyter notebooks. So this particular one here, hide, means: 00:17:24.280 |
when this is turned into a book or into documentation, don't show this cell 00:17:28.720 |
And the reason for that is because you can see I've actually got it in the text, right? 00:17:32.880 |
But I thought when you're actually running it it would be nice to have it sitting here waiting for you to run directly 00:17:38.960 |
So that's why it's shown in the notebook, but not in the in the book. It's shown differently 00:17:43.800 |
And you'll also see these things like s colon with a quote in the book that would end up saying 00:17:53.560 |
so there's kind of little bits and pieces in the in the notebooks that just look a little bit odd and that's because it's 00:17:59.920 |
designed that way in order to show in order to create stuff in the 00:18:04.600 |
Right, so then last week we saw how you can export that to a pickle file that contains all the information for the model 00:18:13.860 |
And then on the server where you're going to actually do your inference 00:18:18.560 |
You can then load that saved file and you'll get back a learner that you can call predict on so predict 00:18:25.080 |
Perhaps the most interesting part of predict is the third thing that it returns 00:18:34.160 |
Which is a tensor in this case containing three numbers 00:18:38.600 |
The three numbers there's three of them because we have three classes teddy bear grizzly bear and black bear right and so 00:18:48.360 |
This doesn't make any sense until you know what the order of the classes is in 00:18:55.100 |
your DataLoaders, and you can ask the data loaders what the order is by asking for its vocab 00:19:02.240 |
So a vocab in fast AI is a really common concept 00:19:06.640 |
it's basically any time that you've got like a mapping from numbers to strings or 00:19:14.920 |
The mapping is always stored in the vocab. So here, this shows us that the activation for black bear is about 00:19:28.560 |
1e-6, the activation for grizzly is 1, and the activation for teddy is 00:19:42.200 |
So it's very, very confident that this particular one was a grizzly; not surprisingly, this was something called grizzly.jpg 00:19:54.040 |
This mapping in order to display the correct thing 00:19:57.400 |
But of course the DataLoaders object already knows that mapping; it's called the vocab, and it's stored with the loader 00:20:03.280 |
So that's how it knows to say grizzly automatically 00:20:06.640 |
So the first thing it gives you is the human readable string that you'd want to display 00:20:13.080 |
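In plain Python terms, with made-up numbers (the real return values are fastai/PyTorch tensors from predict), the three return values line up like this:

```python
# Vocab order and probabilities are illustrative stand-ins
vocab = ['black', 'grizzly', 'teddy']   # dls.vocab order
probs = [1e-6, 0.999998, 1e-6]          # third return value: one number per class

idx = max(range(len(probs)), key=probs.__getitem__)  # second return value: argmax
label = vocab[idx]                                   # first return value: readable string
print(label, idx)  # grizzly 1
```

The point is that the probabilities only become meaningful once you index them through the vocab, which is exactly why the exported learner carries the vocab along with it.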
With fastai2, you save this object, which has everything you need for inference; it's got, you know, all the information about 00:20:20.840 |
Normalization about any kind of transformation steps about what the vocab is so it can display everything correctly 00:20:37.000 |
Now if you've done some web programming before then all you need to know is that this line of code and this line of code 00:20:46.480 |
So this is the line of code you would call once when your application starts up 00:20:51.040 |
And then this is the line of code you would call 00:20:53.400 |
Every time you want to do an inference and there's also a batch version of it which you can look up if you're interested 00:21:03.200 |
So there's nothing special if you're already a web programmer or have access to a web programmer 00:21:08.880 |
These are you know you just have to stick these two lines of code somewhere and the three things you get back are the 00:21:13.680 |
human-readable string if you're doing categorization, 00:21:17.240 |
The index of that which in this case is one is grizzly and the probability of each class 00:21:23.440 |
One of the things we really wanted to do in this course though is not assume that everybody is a web developer 00:21:31.840 |
Most data scientists aren't but gee wouldn't it be great if all data scientists could at least like prototype an application to show off 00:21:43.640 |
tried to kind of curate an approach; none of it is stuff we've built, it really is curated, 00:21:48.680 |
Which shows how you can create a GUI and create a complete application in Jupyter Notebook? 00:21:59.440 |
Key pieces of technology we use to do this are IPython widgets 00:22:07.220 |
ipywidgets, which we import by default as widgets, and that's also what they use in their own documentation 00:22:19.240 |
so if I create this file upload button and then display it I 00:22:25.880 |
see (and we saw this in the last lesson as well, maybe it was lesson one) an actual clickable button 00:22:33.960 |
Click it and it says now, okay, you've selected one thing 00:22:43.640 |
Well these widgets have all kinds of methods and properties and the upload button has a data property 00:23:02.440 |
PILImage.create, and so .create is kind of the standard 00:23:07.520 |
factory method we use in fastai to create items 00:23:14.000 |
And PILImage.create is smart enough to be able to create an item from all kinds of different things 00:23:19.160 |
And one of the things it can create it from is a binary blob, which is what a file upload contains 00:23:27.520 |
There's our teddy, right? So you can see how cells of a Jupyter notebook can refer to things other cells created 00:23:41.320 |
so let's hide that teddy away for a moment and 00:23:45.080 |
the next thing to know about is that there's a kind of widget called output, and an output widget is a placeholder 00:23:55.720 |
you can fill in later, right? So if I delete, actually, 00:24:01.040 |
This part here, so I've now got an output widget 00:24:13.360 |
You can't see the output widget even though I said please display it because nothing is output 00:24:17.920 |
So then in the next cell I can say with that output placeholder display a thumbnail of the image 00:24:25.240 |
And you'll see that the display will not appear here, 00:24:31.800 |
right, because that's where the placeholder is 00:24:36.800 |
So let's run that again to clear out that placeholder 00:24:40.240 |
So we can create another kind of placeholder, which is a label. A label is kind of 00:24:47.480 |
something where you can put text, and you can give it a value, like I have here 00:24:55.660 |
Okay, so we've now got a label containing. Please choose an image. Let's create another button to do a classification 00:25:03.440 |
Now this is not a file upload button. It's just a general button. So this button doesn't do anything 00:25:08.920 |
All right, it doesn't do anything until we attach an event handler to it an event handler is 00:25:17.600 |
A callback we'll be learning all about callbacks in this course 00:25:20.440 |
If you've ever done any GUI programming before or even web programming you'll be familiar with the idea that you 00:25:27.400 |
Write a function which is the thing you want to be called when the button is clicked on and then somehow you tell your framework 00:25:35.320 |
that this is the on-click event. So here I go: here's my button, btn_run. I say the on-click event of btn_run is 00:25:45.520 |
to call this code, and this code is going to do all the stuff we just saw: it's going to create an image from the upload, 00:25:51.600 |
It's going to clear the output display the image 00:25:54.960 |
Call predict and then replace the label with a prediction 00:26:00.680 |
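That register-a-function-for-later pattern is the classic callback idea; here's a minimal pure-Python stand-in for a button with an on-click handler (not the real ipywidgets Button, though ipywidgets exposes a similar on_click interface):

```python
class Button:
    """Toy stand-in for a GUI button that supports click callbacks."""
    def __init__(self, description=''):
        self.description = description
        self._handlers = []

    def on_click(self, fn):
        """Register fn to be called whenever the button is clicked."""
        self._handlers.append(fn)

    def click(self):
        """The framework calls this on a click; it fires every handler."""
        for fn in self._handlers:
            fn(self)

events = []
btn_run = Button('Classify')
btn_run.on_click(lambda b: events.append(b.description + ' clicked'))
btn_run.click()
print(events)  # ['Classify clicked']
```

The key design point is inversion of control: you never call your handler yourself; the framework calls it for you when the event happens.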
So there it all is now so that hasn't done anything but I can now go back to this classify button 00:26:07.320 |
which now has an event handler attached to it. So watch this: 00:26:12.440 |
boom! And look, that's been filled in. That's been filled in, right, in case you missed it 00:26:18.680 |
Let's run these again to clear everything out 00:26:23.800 |
This says "please choose an image", there's nothing here, I click classify... oh! 00:26:34.840 |
Right. So it's kind of amazing how our notebook has suddenly turned into this 00:26:42.080 |
interactive prototyping playground for building applications. And so once all this works, 00:26:51.000 |
The easiest way to dump things together is to create a V box 00:26:57.400 |
A VBox is a vertical box, and it's just something that you put widgets in. And so in this case, 00:27:04.200 |
we're going to have a label that says select your bear then an upload button a run button an output placeholder and a 00:27:13.120 |
But let's run these again just to clear everything out 00:27:18.440 |
And let's create our V box so as you can see it's just got all the 00:27:35.240 |
Oh, I accidentally ran the thing that displayed the bear let's get rid of that 00:27:42.840 |
Okay, so there it is, so now I can click upload I can choose my bear 00:27:56.840 |
Right, and notice: these are exactly the same buttons as these buttons; 00:28:04.160 |
there are two places where we're viewing the same button, which is kind of a wild idea 00:28:09.000 |
so if I click classify, it's going to change this label and 00:28:12.680 |
this label, because they're actually both references to the same label. Look: 00:28:22.480 |
This is our app right and so this is actually how I built that 00:28:28.600 |
that image cleaner GUI; it's just using these exact things. And I built that image cleaner GUI 00:28:38.360 |
cell by cell in a notebook, just like this. And so you get this kind of interactive 00:28:47.280 |
So if you're a data scientist who's never done GUI stuff before, 00:28:50.760 |
this is a great time to get started, because now you can make actual programs 00:28:59.400 |
Running inside a notebook is kind of cool. But what we really want is this program to run 00:29:11.560 |
needs to be installed, so you can just run these lines to install it 00:29:23.240 |
and what voila does is it takes a notebook and 00:29:28.440 |
doesn't display anything except for the markdown and the widgets 00:29:37.440 |
Right. So all the code cells disappear and it doesn't give the person looking at that page the ability to run their own code. They can only 00:29:45.360 |
interact with the widgets, right? So what I did was copy the code 00:29:53.440 |
from the notebook into a separate notebook, which only has 00:30:05.560 |
So these are just the same lines of code that we saw before 00:30:12.800 |
So this is a notebook. It's just a normal notebook 00:30:15.200 |
And then I installed voila and then when you do that if you navigate to this notebook 00:30:38.440 |
just as I said, you see only the markdown and the widgets. So here I've got 00:30:45.480 |
My bear classifier and I can click upload. Let's do a grizzly bear this time 00:30:51.680 |
And this is a slightly different version; I actually made this so there's no classify button 00:31:00.160 |
I thought it would be a bit more fancy to make it so when you click upload it just runs everything 00:31:11.560 |
simplest prototype, but it's a proof of concept, right? So you can add 00:31:18.800 |
dropdowns and sliders and charts and you know everything that you can have in a 00:31:25.040 |
you know, an Angular app or a React app or whatever. And in fact, there's even 00:31:30.320 |
stuff which lets you use, for example, the whole Vue.js framework (if you know it, it's a very popular 00:31:36.720 |
JavaScript framework); the whole Vue.js framework, you can actually use it in 00:31:51.480 |
someone out there in the world. So the voila documentation shows a few ways to do that, but perhaps the easiest one is Binder 00:32:07.360 |
All you do is you paste in your github repository name here, right? And this is all in the book, right? 00:32:25.280 |
You can see and then you put in the path which we were just experimenting with 00:32:36.880 |
So pop that here and then you say launch and what that does is it then gives you a URL 00:32:51.880 |
interactive running application. So Binder's free, and so, you know, 00:32:57.680 |
Anybody can now use this to take their voila app and make it a publicly available web application 00:33:04.920 |
So try it as it mentions here the first time you do this binder takes about five minutes to build your site 00:33:14.000 |
Because it actually uses something called Docker to deploy the whole fast AI framework and Python and blah blah blah 00:33:23.320 |
That virtual machine will keep running, you know, as long as people are using it; it'll keep running for a while 00:33:27.800 |
It being a free service, you won't be surprised to hear this is not using a GPU; it's using a CPU 00:33:48.040 |
And so that might be surprising but we're deploying to something which runs on a CPU 00:33:55.600 |
When you think about it though, this makes much more sense to deploy to a CPU than a GPU 00:34:13.200 |
Um, the thing that's happening here is that I am 00:34:17.200 |
passing along... let's go back to my app. In my app, I'm passing along a single image at a time 00:34:24.040 |
So when I pass along that single image, I don't have a huge amount of parallel work for a GPU to do 00:34:30.640 |
This is actually something that a CPU is going to be doing more efficiently 00:34:34.560 |
So we found that for folks coming through this course 00:34:41.000 |
The vast majority of the time they wanted to deploy 00:34:44.200 |
Inference on a CPU not a GPU because they're normally just doing one 00:34:51.800 |
It's way cheaper and easier to deploy to a CPU 00:34:55.600 |
And the reason for that is that you can just use any hosting service you like because just remember this is just a 00:35:07.920 |
And you can use all the usual horizontal scaling vertical scaling, you know, you can use Heroku you can use AWS 00:35:21.080 |
Having said that there are times you might need to deploy to a GPU 00:35:25.200 |
For example, maybe you're processing videos, and a single video might take all day to process on a CPU 00:35:37.760 |
You might be so successful that you have a thousand requests per second 00:35:41.960 |
In which case you could like take 128 at a time 00:35:45.240 |
Batch them together and put the whole batch on the GPU and get the results back and pass them back around 00:35:50.680 |
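That batch-them-up idea can be sketched in plain Python; the queue contents, the 128, and the helper name here are illustrative, not from a real serving framework:

```python
def into_batches(requests, batch_size=128):
    """Slice a queue of pending requests into GPU-sized mini-batches."""
    return [requests[i:i + batch_size]
            for i in range(0, len(requests), batch_size)]

queued = list(range(300))                # 300 pending inference requests
sizes = [len(b) for b in into_batches(queued)]
print(sizes)  # [128, 128, 44]
```

Note the last batch is partial: in a real serving system you'd trade off waiting for it to fill (better GPU utilization) against dispatching it early (lower latency), which is exactly the caveat raised next.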
I mean, you've got to be careful of that, right? Because 00:35:53.840 |
if your requests aren't coming in fast enough, your user has to wait for a whole batch of people to be ready to 00:36:06.400 |
As long as your site is popular enough that could work 00:36:10.400 |
The other thing to talk about is you might want to deploy to a mobile phone 00:36:19.440 |
Deploying to a mobile phone our recommendation is wherever possible 00:36:23.320 |
Do that by actually deploying to a server and then have a mobile phone talk to the server over a network 00:36:32.920 |
Again, you can just use a normal pytorch program on a normal server and normal network calls. It makes life super easy 00:36:44.480 |
You're suddenly now in an environment where PyTorch won't run natively, and so you'll have to convert 00:36:54.600 |
There are other forms, and the main form that you convert it to is something called ONNX, 00:37:10.160 |
an approach that can run on both servers or on mobile phones 00:37:27.000 |
Not using it. It's harder to debug, it's harder to set up, and it's harder to maintain. So 00:37:37.280 |
If you're lucky enough that you're so successful that you need to scale it up to GPUs or and stuff like that 00:37:43.520 |
then great, you know, hopefully you've got the 00:37:46.640 |
finances at that point to justify, you know, spending money on an ONNX expert or serving expert or whatever 00:37:58.040 |
There are systems you can use, like ONNX Runtime and AWS SageMaker, where you can kind of say, here's my ONNX 00:38:03.540 |
bundle, and it'll serve it for you or whatever. PyTorch also has a mobile framework; same idea 00:38:15.120 |
Alright, so you've got I mean, it's kind of funny. We're talking about two different kinds of deployment here 00:38:21.560 |
One is a hobby application, you know, that you're prototyping, showing off to your friends, explaining to your colleagues how something might work 00:38:27.960 |
You know, a little interactive analysis — that's one thing. But maybe you're actually prototyping something that you're 00:38:44.320 |
Going to roll out for real. You know, in real life there's all kinds of things you've got to be careful of 00:38:51.360 |
One example is something to be careful of is let's say you did exactly what we just did 00:38:56.760 |
Which — actually, this is your homework — is to create your own 00:39:01.000 |
Application, right? I want you to create your own image search application. You can use 00:39:06.200 |
My exact set of widgets and whatever if you want to but better still go to the IPY widgets website and see what other widgets 00:39:14.320 |
They have and try and come up with something cool 00:39:17.520 |
Try and come at you know, try and show off as best as you can then show us on the forum 00:39:24.360 |
That you want to create an app that would help 00:39:29.080 |
The users of your app decide if they have healthy skin or unhealthy skin 00:39:34.360 |
So if you did the exact thing we just did rather than searching for grizzly bear and teddy bear and so forth on 00:39:40.000 |
Bing you would search for healthy skin and unhealthy skin, right? So here's what happens, right? 00:39:47.040 |
If I — and remember, in our version we never actually looked at Bing; we just used the Bing API, the image search API 00:39:54.480 |
But behind the scenes, it's just using the website 00:39:57.240 |
Right — so if I type 'healthy skin' and say search, I 00:40:02.400 |
Actually discover that the definition of healthy skin is, apparently, young white women touching their faces. So 00:40:16.080 |
Your healthy skin classifier would learn to detect young white women touching their faces 00:40:22.560 |
This is a great example from Deb Raji, and you should check out her paper 'Actionable Auditing' 00:40:28.880 |
For lots of cool insights about model bias. But I mean, here's like a 00:40:35.440 |
Fascinating example of how, if you weren't looking at your data carefully, you could end up 00:40:43.840 |
With something that doesn't at all actually solve the problem you want to solve 00:40:55.000 |
The data that you train your algorithm on if you're building like a new product that didn't exist before by definition 00:41:03.240 |
You don't have examples of the kind of data that's going to be used in real life 00:41:07.400 |
Right — so you kind of try to find some from somewhere, and if you do that through, like, a Google search 00:41:14.740 |
Pretty likely you're not going to end up with a 00:41:17.800 |
Set of data that actually reflects the kind of mix you would see in real life 00:41:27.560 |
You know, the main thing here is to say: be careful, right? And in particular, for your test set, 00:41:36.840 |
Really try hard to gather data that reflects 00:41:40.280 |
The real world so like just you know for example for the healthy skin example 00:41:45.000 |
You might go and actually talk to a dermatologist and try and find like 10 examples of healthy and unhealthy skin or something 00:41:51.620 |
And that would be your kind of gold standard test 00:41:54.880 |
There's all kinds of issues you have to think about in deployment I can't cover all of them 00:42:03.840 |
I can tell you that this O'Reilly book called 'Building Machine Learning Powered Applications' 00:42:14.560 |
And this is one of the reasons we don't go into detail about 00:42:19.600 |
A/B testing, and when should we refresh our data, and how do we monitor things and so forth — it's because 00:42:26.160 |
That book's already been written. So we don't want to 00:42:33.680 |
I do want to mention a particular area that I care a lot about, though 00:42:43.800 |
Let's say you're rolling out this bear detection system and it's going to be attached to video cameras around a campsite 00:42:49.500 |
It's going to warn campers of incoming bears. So if we used a model 00:42:54.120 |
That was trained with that data that we just looked at 00:43:00.360 |
Very nicely taken pictures of pretty perfect bears, right? 00:43:04.560 |
There's really no relationship to the kinds of pictures 00:43:08.960 |
You're actually going to have to be dealing with in your campsite bear detector — which is going to have video, not images 00:43:15.040 |
It's going to be nighttime. It's going to be probably low resolution 00:43:21.080 |
You need to make sure that the performance of the system is fast enough to tell you about it before the bear kills you 00:43:29.040 |
Know there will be bears that are partially obscured by bushes or in lots of shadow or whatever 00:43:34.240 |
None of which are the kinds of things you would see normally in like internet pictures 00:43:38.240 |
So what we call this — we call this out-of-domain data. Out-of-domain data refers to a situation where 00:43:45.600 |
The data that you are trying to do inference on is in some way different to the kind of data that you trained on 00:43:58.360 |
There's no perfect way to answer this question, and when we look at ethics we'll talk about some ways to 00:44:07.520 |
Minimize how much this happens. For example, it turns out that having a diverse team is a great way to 00:44:14.880 |
Kind of avoid being surprised by the kinds of data that people end up coming up with 00:44:20.720 |
But really it's just something you've got to be careful of 00:44:30.680 |
Then there's something called domain shift, and domain shift is where maybe you start out with all of your data being in-domain data 00:44:36.880 |
But over time the kinds of data that you're seeing change. Maybe 00:44:43.760 |
Raccoons start invading your campsite and you 00:44:48.000 |
Weren't training on raccoons before — it was just a bear detector — and so that's called domain shift 00:44:53.280 |
And that's another thing that you have to be very careful of Rachel. What's your question? 00:44:57.960 |
No, I was just going to add to that in saying that 00:45:00.300 |
All data is biased, so there's not kind of a, you know, form of debiased data or 00:45:06.880 |
perfectly representative in all cases data and that a lot of the 00:45:11.200 |
proposals around addressing this have kind of been converging to this idea and that you see in papers like Timnit Gebru's 00:45:17.840 |
'Datasheets for Datasets' of just writing down a lot of the 00:45:23.720 |
Details about your data set and how it was gathered and in which situations it's appropriate to use and how it was maintained 00:45:30.000 |
And so there that's not that you've totally eliminated bias 00:45:34.440 |
But that you're just very aware of the attributes of your data set so that you won't be blindsided by them later 00:45:42.280 |
Proposals in that school of thought — which I really like — around this idea of just kind of 00:45:47.800 |
Understanding how your data was gathered and what its limitations are 00:45:54.800 |
So a key problem here is that you can't know the entire behavior of your neural network 00:46:03.680 |
With normal programming you typed in the if statements and the loops and whatever so in theory 00:46:11.240 |
You know what the hell it does — although it's still sometimes surprising. In this case, you didn't tell it anything; you just gave it 00:46:18.000 |
Examples to learn from and hope that it learns something useful 00:46:21.560 |
There are hundreds of millions of parameters in a lot of these neural networks 00:46:25.940 |
And so there's no way you can understand how they all combine with each other to create complex behavior 00:46:31.280 |
So really there's a natural compromise here: we're trying to 00:46:35.560 |
Get sophisticated behavior — like recognizing pictures — and it's 00:46:42.920 |
Sophisticated enough behavior that we can't describe it 00:46:46.400 |
And so the natural downside is you can't expect the process that the thing is using to do that to be 00:46:52.300 |
Describable, for you to be able to understand it. So 00:46:55.720 |
Our recommendation for kind of dealing with these issues is a very careful 00:47:00.800 |
Deployment strategy, which I've summarized in this little chart here 00:47:10.360 |
Whatever it is that you're going to use the model for start out by doing it manually 00:47:19.360 |
Have the model running next to them, and each time the park ranger sees a bear 00:47:24.320 |
They can check the model and see: did it seem to have picked it up? 00:47:28.000 |
So the model is not doing anything. There's just a person who's like running it and seeing would it have made sensible choices 00:47:35.280 |
And once you're confident that what it's doing seems reasonable 00:47:41.480 |
You know it's been as close to the real life situation as possible 00:47:45.640 |
Then deploy it in a time and geography limited way 00:47:52.160 |
so pick like one campsite not the entirety of California and do it for you know one day and 00:48:00.000 |
Have somebody watching it super carefully, right? 00:48:03.960 |
So now the basic bear detection is being done by the bear detector 00:48:08.320 |
But there's still somebody watching it pretty closely and it's only happening in one campsite for one day 00:48:21.920 |
And then let's do, you know, the entirety of Marin for a month, and so forth. So this is actually what we did when I 00:48:21.920 |
Ran Optimal Decisions. Optimal Decisions was a company that I founded to do insurance pricing, and if you 00:48:34.400 |
Change insurance prices by, you know, a percent or two in the wrong direction, in the wrong way 00:48:46.360 |
You can basically destroy the whole company. This has happened many times, you know. Insurers are companies 00:48:53.320 |
That set prices that's basically the the product that they provide 00:48:58.320 |
So when we deployed new prices for Optimal Decisions, we always did it by saying, like, okay: 00:49:04.940 |
We're going to do it for like five minutes, or for everybody whose name ends with a D — you know, 00:49:13.360 |
Some group which hopefully would be fairly, you know, 00:49:16.960 |
Random — it'll be different, but not too many of them — and we'd gradually scale it up. And you've got to make sure that when you're 00:49:25.680 |
Doing that, you've got really good reporting systems in place so that you can recognize: 00:49:28.920 |
Are your customers yelling at you? Are your computers burning up? 00:49:37.200 |
Are your costs spiraling out of control? And so forth. So it really requires great 00:49:52.080 |
The question is: does fast.ai have methods built in that provide for incremental learning, i.e. improving the model slowly over time with a single data point each time? 00:50:03.480 |
This is a little bit different — this is really about 00:50:07.120 |
Dealing with domain shift and similar issues by continuing to train your model as you do inference. And so the good news is 00:50:18.760 |
It's basically just a transfer learning problem. So you can do this in many different ways 00:50:24.360 |
Probably the easiest is just to say, okay, each night 00:50:31.080 |
You know at midnight we're going to set off a task which 00:50:36.560 |
Grabs all of the previous day's transactions as mini batches and trains another epoch 00:50:44.000 |
And so yeah that that actually works fine. You can basically think of this as a 00:50:50.280 |
Fine tuning approach where your pre-trained model is yesterday's model and your fine-tuning data is today's data 00:51:03.160 |
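A toy sketch of that nightly fine-tuning loop (everything here is made up for illustration — the one-parameter "model" stands in for whatever learner you actually deploy): yesterday's weights are the starting point, and the nightly job runs one extra epoch on today's data only.

```python
# Hypothetical sketch of nightly incremental training: warm-start from
# yesterday's model, then one extra epoch on today's transactions.

def train_one_epoch(w, data, lr=0.1):
    "One SGD pass over (x, y) pairs for the model y_hat = w * x."
    for x, y in data:
        grad = 2 * (w * x - y) * x    # derivative of (w*x - y)**2 w.r.t. w
        w -= lr * grad
    return w

# "Yesterday's model": trained from scratch on yesterday's data (y = 2x)
w = 0.0
yesterdays_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
for _ in range(50):
    w = train_one_epoch(w, yesterdays_data)

# Nightly job: warm-start from yesterday's weights, one epoch on today's data
todays_data = [(4.0, 8.0), (5.0, 10.0)]
w = train_one_epoch(w, todays_data, lr=0.01)
print(round(w, 2))                    # close to 2.0
```

The key point is the warm start: the nightly job never retrains from scratch, it only nudges yesterday's weights with the newest data.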
One thing to be thinking about super carefully is that it might change the behavior of the system that it's a part of 00:51:12.200 |
And this can create something called a feedback loop and feedback loops are one of the most challenging things for 00:51:18.360 |
For real-world model deployment particularly of machine learning models 00:51:22.840 |
Because they can take a very minor issue and explode it into a really big issue 00:51:38.040 |
For example, predictive policing: it's an algorithm that was trained to recognize — well, 00:51:44.040 |
Basically trained on data that says whereabouts arrests are being made 00:51:48.120 |
And then as you train that algorithm based on where arrests are being made 00:51:58.820 |
It sends police officers to places that the model says are likely to have crime — which in this case is where arrests were made — and of course, that's where the police will 00:52:12.560 |
Find more crime, because the more police that are there, the more they'll see, and they'll arrest more people 00:52:19.540 |
Causing, you know, and then if you do this incremental learning like we're just talking about then it's going to say 00:52:24.400 |
Oh, there's actually even more crime here. And so tomorrow it sends even more police 00:52:28.600 |
And so in that situation you end up like the predictive policing algorithm ends up kind of sending all of your police 00:52:36.800 |
For one street block because at that point all of the arrests are happening there because that's the only place you have policemen 00:52:48.200 |
There's a paper about this issue called 'To Protect and Serve', and in 'To Protect and Serve' the authors write this really nice phrase: 00:52:57.000 |
Predictive policing is aptly named. It is predicting policing, not predicting crime 00:53:07.800 |
So maybe your model is perfect — whatever the hell that even means — but, like, it somehow sent police to exactly 00:53:14.080 |
The best places to find crime, based on the probability of crimes actually taking place. I 00:53:24.920 |
But as soon as there's any amount of bias — right, so for example in the US there are a lot more arrests 00:53:35.680 |
Of black people than of white people, even for crimes where black people and white people are known to do them the same amount — then 00:53:49.480 |
You're kind of setting off this domino chain of feedback loops, where that bias will be amplified more and more over time 00:54:03.760 |
You know, one thing I like to think about is to think, like: well, what would happen if this — 00:54:08.320 |
If this model was just really really really good 00:54:12.640 |
So like who would be impacted, you know, what would this extreme result look like? 00:54:18.320 |
How would you know what was really happening with this incredibly predictive algorithm that was, like, 00:54:23.380 |
Changing the behavior of your police officers or whatever, you know — what would that look like? What would actually happen? 00:54:34.560 |
What could go wrong and then what kind of rollout plan what kind of monitoring systems what kind of oversight? 00:54:40.260 |
Could provide the circuit breaker? Because that's what we really need here 00:54:45.560 |
Right, is we need — like, nothing's going to be perfect. You can't 00:54:51.680 |
Anticipate everything, but what you can do is try to be sure that you see when your system is behaving in a way you didn't intend 00:55:10.240 |
Anytime your model is kind of controlling what your next round of data looks like — and I think that's true for pretty much all 00:55:17.240 |
Products — and that can be, I think, a hard jump for people coming from kind of a science background, where you may be thinking of 00:55:24.720 |
Data as 'I have just observed some sort of experiment', whereas kind of whenever you're, you know, 00:55:30.440 |
Building something that interacts with the real world 00:55:32.440 |
You are now also controlling what your future data looks like, based on kind of the behavior of your algorithm for the current round of 00:55:43.360 |
So given that you probably can't avoid feedback loops 00:55:48.880 |
That you know the the thing you need to then really invest in is the human in the loop 00:55:54.280 |
And so a lot of people like to focus on automating things, which I find weird 00:55:59.480 |
You know if you can decrease the amount of human involvement by like 90% 00:56:03.280 |
You've got almost all of the economic upside of automating it completely 00:56:07.440 |
But you still have the room to put human circuit breakers in place. You need these appeals processes 00:56:12.560 |
You need the monitoring you need, you know humans involved to kind of go 00:56:17.720 |
Hey, that's that's weird. I don't think that's what we want 00:56:23.600 |
Yes, Rachel? — And I just want one more note about that: those humans, though, do need to be integrated well with 00:56:30.000 |
Kind of product and engineering. And so one issue that comes up is that in many companies, I think, this 00:56:36.880 |
Ends up kind of being underneath trust and safety — trust and safety handles a lot of the issues with how things can go wrong, or how your 00:56:43.360 |
Platform can be abused and often trust and safety is pretty siloed away from 00:56:48.720 |
Product and eng, which actually kind of has the control over, you know, 00:56:52.920 |
These decisions that really end up influencing them. And the engineers probably consider them to be pretty annoying a lot 00:56:59.880 |
Of the time — how they get in the way of getting software out the door 00:57:04.440 |
But like the kind of the more integration you can have between those I think it's helpful for the kind of the people 00:57:09.080 |
Building the product to see what is going wrong and what can go wrong if the engineers are actually on top of that 00:57:15.240 |
They're actually seeing these these things happening that it's not some kind of abstract problem anymore 00:57:20.600 |
So, you know at this point now that we've got to the end of chapter two 00:57:24.600 |
You actually know a lot more than most people about 00:57:32.000 |
About deep learning and actually about some pretty important foundations of machine learning more generally and of data products more generally 00:57:50.560 |
By the way, there's this formatted text that doesn't quite format correctly in a Jupyter notebook 00:57:54.360 |
It only formats correctly in the actual book. So that's what it means when you see this kind of pre-formatted text 00:58:08.380 |
Starting writing at this point before you go too much further Rachel 00:58:16.560 |
There's a question. Okay, let's get the question 00:58:20.920 |
The question is: I assume there are fast.ai-type ways of keeping a nightly updated transfer learning setup 00:58:29.040 |
Well — could one of the fast.ai course notebooks have an example of the nightly transfer learning training? 00:58:35.280 |
Like the previous person asked, I would be interested in knowing how to do that most effectively with fast.ai 00:58:41.360 |
Sure. So I guess my view is there's nothing fastai-specific about that at all 00:58:46.880 |
So I actually suggest you read Emmanuel's book that book I showed you to understand the kind of the ideas 00:58:53.120 |
And if people are interested in this I can also point to it some academic research about this as well 00:58:58.520 |
And there's not as much as that there should be 00:59:00.680 |
But there is some there is some good work in this area 00:59:03.820 |
Okay, so the reason we mentioned writing at this point in our journey is because 00:59:13.600 |
You know things are going to start to get more and more heavy more and more complicated and 00:59:20.280 |
A really good way to make sure that you're on top of it is to try to write down what you've learned 00:59:27.560 |
So, sorry, I wasn't sharing the right part of the screen before but this is what I was describing in terms of the 00:59:32.040 |
Pre-formatted text which doesn't look correct 00:59:42.760 |
Rachel actually has this great article that you should check out which is why you should blog and 00:59:50.760 |
I'll say it sort of as her saying it, because I have it in front of me and she doesn't 00:59:56.960 |
That the top advice she would give her younger self is to start blogging sooner 01:00:02.120 |
So Rachel has a math PhD, and this kind of idea of, like, blogging was not exactly something 01:00:11.060 |
But actually, it's like — it's a really great way of 01:00:15.240 |
Finding jobs. In fact, most of my students who have got the best jobs are students that have blogged 01:00:25.480 |
The thing I really love is that it helps you learn: 01:00:27.600 |
By writing things down, it kind of synthesizes your ideas. And 01:00:32.700 |
Yeah, you know, there's lots of reasons to blog. So there's actually — 01:00:44.240 |
I was also just going to note, I have a second post called 'Advice for Better Blog Posts' 01:00:49.560 |
That's a little bit more advanced which I'll post a link to as well 01:00:53.560 |
And that talks about some common pitfalls that I've seen in many blog posts, and kind of the importance of putting 01:01:00.560 |
The time in to do it well, and some things to think about. So I'll share that post as well. Thanks, Rachel 01:01:08.560 |
One reason people don't blog is because it's kind of annoying to figure out how to 01:01:12.840 |
particularly because I think the thing that a lot of you will want to blog about is 01:01:18.160 |
Cool stuff that you're building in Jupyter notebooks. So we've actually teamed up with a guy called Hamel Husain 01:01:32.120 |
To create something — as usual with fast.ai, no ads, no anything — called fastpages, where you can actually blog with Jupyter notebooks 01:01:42.480 |
You can go to fast pages and see for yourself how to do it 01:01:46.200 |
But the basic idea is that like you literally click one button 01:01:50.960 |
It sets up a blog for you, and then you dump your notebooks into a 01:01:59.160 |
Folder called underscore notebooks, and they get turned into 01:02:02.440 |
Blog posts. It's basically like magic, and Hamel's done this amazing job of this. And so 01:02:11.240 |
This means that you can create blog posts where you've got charts and tables and images 01:02:17.280 |
You know, where they're all actually the output of a Jupyter notebook, along with all the markdown- 01:02:24.280 |
Formatted text, headings and so forth, and hyperlinks, and the whole thing 01:02:29.960 |
So this is a great way to start writing about what you're learning about here 01:02:37.480 |
So something that Rachel and I both feel strongly about when it comes to blogging is this which is 01:02:44.120 |
Don't try to think about the absolute most advanced thing 01:02:51.560 |
You know and try to write a blog post that would impress 01:02:54.320 |
Geoff Hinton, right? Because most people are not Geoff Hinton 01:02:59.000 |
So, like, (a) you probably won't do a good job, because you're trying to 01:03:03.400 |
Blog for somebody who's got more expertise than you, and (b) 01:03:11.280 |
Actually, there's far more people that are not very familiar with deep learning than people who are 01:03:16.200 |
So try to think — you know, and you really understand what it's like — 01:03:19.800 |
What it was like six months ago to be you because you were there six months ago 01:03:24.560 |
So try and write something which the six months ago version of you 01:03:28.200 |
Would have been like super interesting full of little tidbits. You would have loved 01:03:33.320 |
You know that you would have that would have delighted you 01:03:42.640 |
Don't move on until you've had a go at the questionnaire 01:03:48.760 |
You know understand the key things we think that you need to understand 01:03:53.600 |
And yeah, have a think about these further research questions as well because they might 01:03:58.760 |
Help you to engage more closely with material 01:04:02.440 |
So let's have a break, and we'll come back in five minutes' time 01:04:15.600 |
This is an interesting moment in the course, because we're kind of jumping from 01:04:19.960 |
the part of the course, which is you know, very heavily around kind of 01:04:27.040 |
The kind of — the structure of, like, what are we trying to do with machine learning? 01:04:33.240 |
And what are the kind of the pieces and what do we need to know? 01:04:38.920 |
There was a bit of code but not masses. There was basically no math 01:04:47.640 |
We kind of wanted to put that at the start for everybody — 01:04:51.000 |
You know, who's kind of wanting an understanding of these issues without necessarily 01:04:59.960 |
Wanting to kind of dive deep into the code and the math themselves. And now we're getting into the diving deep part 01:05:09.280 |
If you're not interested in that diving deep yourself, you might want to skip to the next lesson about ethics 01:05:14.280 |
Where we, you know, kind of round out the 01:05:25.440 |
So what we're going to look at here is we're going to look at something that 01:05:34.000 |
Just a few years ago was considered a pretty challenging problem 01:05:37.240 |
The problem is recognizing handwritten digits 01:05:46.000 |
Right. I'm going to try and look at a number of different ways to do it 01:05:53.520 |
We're going to use a data set called MNIST. And so if you've done any machine learning before, you may well have come across MNIST. It contains handwritten digits 01:06:00.760 |
And it was collated into a machine learning data set by a guy called Yann LeCun 01:06:05.800 |
And some colleagues, and they used that to demonstrate one of the, 01:06:10.280 |
You know, probably first computer systems to provide really practically useful, scalable recognition of handwritten digits 01:06:16.840 |
LeNet-5 — that system was actually used to 01:06:21.280 |
Automatically process like 10% of the checks in the US 01:06:31.680 |
A good approach, I think, when building a new model is to kind of start with something simple and gradually scale it up. So 01:06:38.080 |
We've created an even simpler version of MNIST which we call MNIST sample which only has threes and sevens 01:06:45.160 |
Okay, so this is a good starting point to make sure that we can kind of do something easy 01:06:50.120 |
I picked threes and sevens for MNIST sample because they're very different. So I feel like if we can't do this 01:06:55.240 |
We're going to have trouble recognizing every digit 01:06:57.400 |
So step one is to call untar_data. untar_data is the fast.ai function which 01:07:09.760 |
Checks whether you've already downloaded it if you haven't it downloads it checks whether you've already 01:07:16.680 |
Uncompressed it if you haven't it uncompresses it and then it finally returns the path of where that ended up 01:07:29.160 |
So you could just hit tab to get autocomplete 01:07:34.160 |
Is just some location — doesn't really matter where it is. And so then when we 01:07:45.240 |
Call that, I've already downloaded it and already uncompressed it, because I've already run this once before, so it happens straight away. And 01:07:54.240 |
Where it is. Now, in this case path is dot, and the reason path is dot is because I've set a special base path (Path.BASE_PATH) to say 01:08:06.880 |
Where's my starting point, you know, and that's used to print 01:08:11.160 |
So when I go here LS which prints a list of files, these are all relative to 01:08:16.520 |
Where I actually untarred this — this just makes it a lot easier, not to have to see the whole 01:08:41.900 |
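As a rough sketch of the idea in code — the function, names, and archive below are made up for illustration, and the real fastai untar_data also handles URLs and checksums, which are elided here:

```python
# A simplified, hypothetical sketch of what untar_data does: download the
# archive if it isn't already cached, extract it if it isn't already
# extracted, and return the Path where the data ended up. This version
# works on a local archive only.
import tarfile
import tempfile
from pathlib import Path

def untar_data_sketch(archive: Path, dest: Path) -> Path:
    "Extract `archive` into `dest` once, returning the extracted folder."
    name = archive.name.replace(".tar.gz", "").replace(".tgz", "")
    extracted = dest / name
    if not extracted.exists():              # skip the work on repeated calls
        with tarfile.open(archive) as f:
            f.extractall(dest)
    return extracted

# Demo: build a tiny fake archive, then "untar" it twice
tmp = Path(tempfile.mkdtemp())
src = tmp / "mnist_sample"
(src / "train" / "3").mkdir(parents=True)
(src / "train" / "3" / "0001.png").write_bytes(b"fake image")
archive = tmp / "mnist_sample.tgz"
with tarfile.open(archive, "w:gz") as f:
    f.add(src, arcname="mnist_sample")

path = untar_data_sketch(archive, tmp / "data")   # extracts
path = untar_data_sketch(archive, tmp / "data")   # second call: cached, instant
print((path / "train" / "3" / "0001.png").exists())  # True
```

The caching check is why re-running the notebook cell "happens straight away" the second time.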
Pathlib is part of the Python standard library. It's a really very nice library, but it doesn't actually have ls 01:08:50.820 |
Where there are libraries that we find super helpful, but they don't have exactly the things we want 01:08:56.440 |
We liberally add the things we want to them. So we add 01:09:08.760 |
You know, there's as we've mentioned, there's a few ways you can do it 01:09:11.180 |
You can pop a question mark there and that will show you where it comes from 01:09:15.280 |
So there's actually a library called fastcore, which is a lot of the foundational stuff in fast.ai that is not dependent on PyTorch 01:09:28.000 |
So this is part of fastcore, and if you want to see exactly what it does — you, of course, remember you can put in two question marks to see 01:09:38.560 |
The source code, and as you can see there's not much source code to it. And 01:09:49.640 |
Because, really importantly, that gives you this 'show in docs' link, which you can click on to get to the documentation, to see 01:09:59.200 |
Examples and, if relevant, tutorials, tests and so forth 01:10:07.960 |
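A minimal sketch of that kind of patch — the real fastcore version of Path.ls also supports filtering by file extension and returns fastcore's L list type rather than a plain sorted list:

```python
# Monkey-patching an .ls() method onto pathlib.Path, in the spirit of
# fastcore's patch (simplified for illustration).
import tempfile
from pathlib import Path

def ls(self):
    "List the contents of this directory, like `ls` in a shell."
    return sorted(self.iterdir())

Path.ls = ls   # attach the method to the stdlib class

# Usage: every Path object now has .ls()
d = Path(tempfile.mkdtemp())
(d / "train").mkdir()
(d / "valid").mkdir()
print([p.name for p in d.ls()])   # ['train', 'valid']
```

This is the "liberally add the things we want" idea in practice: the stdlib class gains a convenience method without subclassing.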
So when you're looking at a new data set, I always start with just ls to see what's in it 01:10:14.160 |
And I can see here. There's a train folder and there's a valid folder. That's pretty normal 01:10:25.480 |
it's got a folder called 7 and a folder called 3 and 01:10:28.620 |
So this is looking quite a lot like our bear classifier data set. We downloaded each set of images into 01:10:41.440 |
Well — the first level of the folder hierarchy is: is it train or valid, and the second level is: what's the label? 01:10:48.480 |
And this is the most common way for image data sets to be distributed 01:11:00.040 |
Let's just create something called threes that contains all of the contents of the 3 directory in train, and 01:11:06.720 |
Let's just sort them so that this is consistent 01:11:09.520 |
Do the same for sevens and let's just look at the threes and you can see there's just they're just numbered 01:11:19.080 |
Open it and take a look. Okay, so there's the picture of a three and so what is that really? 01:11:35.240 |
So PIL is the Python Imaging Library. It's the most popular library by far for working with images in Python. Jupyter 01:11:53.760 |
Knows how to display many different types, and you can actually tell it — if you create a new type 01:11:58.440 |
You can tell it how to display your type — and so PIL comes with something that will automatically display the image 01:12:05.360 |
What I want to do here though is to look at like how are we going to treat this as numbers? 01:12:10.960 |
Right, and so one easy way to treat things as numbers is to turn it into an array 01:12:16.720 |
So array is part of numpy, which is the most popular array programming library for Python. It 01:12:31.680 |
Just converts the image into a bunch of numbers. And the truth is, it was a bunch of numbers the whole time 01:12:37.880 |
It was actually stored as a bunch of numbers on disk 01:12:40.760 |
It's just that there's this magic thing in Jupyter that knows how to display those numbers on the screen 01:12:48.720 |
Turning it back into a numpy array. We're kind of removing this ability for Jupyter notebook to know how to display it like a picture 01:12:56.000 |
So once I do this, we can then index into that array and grab all the rows from 4 01:13:03.520 |
Up to but not including 10, and all the columns from 4 up to but not including 10 — and here are some numbers. And 01:13:12.440 |
8 bit unsigned integers, so they are between 0 and 255 01:13:16.960 |
So an image just like everything on a computer is just a bunch of numbers and therefore we can compute 01:13:26.320 |
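To make the image-as-numbers idea concrete, here's the same slicing on a made-up array (not the real MNIST three): a 28x28 grid of 8-bit unsigned integers, where 0 is the white background and values up to 255 are the ink.

```python
# An image is just a grid of numbers: 28x28 uint8 values in [0, 255].
import numpy as np

im3 = np.zeros((28, 28), dtype=np.uint8)
im3[4:10, 4:10] = 150                 # paint some fake "ink" pixels

print(im3.dtype)                      # uint8
print(im3[4:10, 4:10].shape)          # (6, 6): rows 4-9 and columns 4-9
print(int(im3.max()))                 # 150
```

Note the slice convention: `4:10` takes indices 4 up to but not including 10, which is why a 6x6 block comes back.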
We could do the same thing but instead of saying array we could say tensor now a tensor is 01:13:31.880 |
basically the PyTorch version of a numpy array and 01:13:37.160 |
So you can see it looks it's exactly the same code as above 01:13:41.640 |
But I've just replaced array with tensor and the output looks almost exactly the same except it replaces array with tensor 01:13:47.920 |
And so you'll see that basically a PyTorch tensor and a numpy array behave almost identically 01:13:56.840 |
Much if not most of the time, but the key thing is that a PyTorch tensor can also be computed on a GPU 01:14:07.280 |
So in our work and in the book and in the notebooks in our code 01:14:12.000 |
We tend to use tensors PyTorch tensors much more often than numpy arrays 01:14:17.240 |
Because they kind of have nearly all the benefits of numpy arrays plus all the benefits of GPU computation 01:14:23.160 |
And they've got a whole lot of extra functionality as well 01:14:31.640 |
People who have used Python for a long time always jump into numpy, because that's what they're used to. If that's you 01:14:37.740 |
You might want to start considering jumping into 01:14:40.000 |
Tensor — like, wherever you used to write array, start writing tensor 01:14:43.520 |
And just see what happens because you might be surprised at how many things you can speed up or do more easily 01:14:52.120 |
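A small sketch of that array/tensor parallel — the values here are made up, and the .cuda() call is guarded since a GPU may not be available:

```python
# The same indexing and arithmetic works on a numpy array and a PyTorch
# tensor; only the tensor can move to the GPU.
import numpy as np
import torch

a = np.array([[1, 2], [3, 4]])
t = torch.tensor([[1, 2], [3, 4]])

print(a[1, :])             # second row: [3 4]
print(t[1, :])             # second row: tensor([3, 4])
print(int((a * 2).sum()))  # 20
print(int((t * 2).sum()))  # 20

if torch.cuda.is_available():
    t = t.cuda()           # identical code from here on, computed on the GPU
```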
So let's take that three image and turn it into a tensor — and so that's going to be the three image as a tensor; 01:14:58.360 |
That's why I've called it im3_t. Okay, and let's grab a bit of it 01:15:06.200 |
And the only reason I'm turning it into a pandas DataFrame is that pandas has a very convenient thing called background gradient 01:15:11.560 |
That turns the background into a gradient, as you can see 01:15:15.800 |
So here is the top bit of the three. You can see that the zeros are the whites, and the numbers near 255 01:15:24.240 |
Are the blacks, and there are some bits in the middle which are grays 01:15:27.800 |
So here we can see what's going on when our 01:15:32.920 |
Images — which are numbers — actually get displayed on the screen. It's just doing this 01:15:38.020 |
And so I'm just showing a subset here the actual full number in MNIST is a 28 by 28 pixel square 01:15:52.920 |
My mobile phone, I don't know how many megapixels it is, but it's millions of pixels 01:15:57.440 |
So it's nice to start with something simple and small 01:16:02.680 |
So our goal is to create a model (by model I mean some kind of computer program learnt from data) 01:16:10.520 |
that can recognize threes versus sevens. You could think of it as a three detector: is it a three? 01:16:20.120 |
So let's stop here: pause the video and have a think. 01:16:26.440 |
You don't need to know anything about neural networks or anything else; how might you, just with common sense, build a three detector? 01:16:37.080 |
Okay, so I hope you grabbed a piece of paper and a pen and jotted some notes down 01:16:41.080 |
I'll tell you the first idea that came into my head 01:16:47.080 |
Was what if we grab every single three in the data set and take the average of the pixels? 01:16:56.960 |
So the average of this pixel, the average of this pixel, the average of this pixel, and so on, right? 01:17:07.160 |
And we'll end up with an image which is the average of all of the threes, and that would be like the ideal three; and then we'll do the same for sevens, and 01:17:15.400 |
Then so when we then grab something from the validation set to classify, we'll say like oh 01:17:20.600 |
Is this image closer to the ideal three, the mean of the threes, or the ideal seven, the mean of the sevens? 01:17:27.960 |
This is my idea. And so I'm going to call this the pixel similarity approach 01:17:33.140 |
I'm describing this as a baseline a baseline is like a super simple model 01:17:39.120 |
That should be pretty easy to program from scratch with very little magic, you know 01:17:43.080 |
maybe it's just a bunch of kind of simple averages simple arithmetic, which 01:17:46.720 |
you're super confident is going to be better than a random model 01:17:53.280 |
one of the biggest mistakes I see in even experienced practitioners is that they fail to create a baseline and 01:18:00.200 |
so then they build some fancy model: 01:18:08.400 |
some fancy Bayesian model or some fancy neural network, and they go, wow Jeremy, 01:18:13.440 |
Look at my amazingly great model and I'll say like how do you know it's amazingly great? 01:18:17.760 |
And they say oh look the accuracy is 80% and then I'll say okay 01:18:21.380 |
Let's see what happens if we create a model where we always predict the mean; oh, it turns out that does about as well. 01:18:30.360 |
People get pretty disheartened when they discover this right and so make sure you start with a reasonable baseline and then gradually build on top of it 01:18:45.040 |
We're going to learn some nice Python programming tricks to do this 01:18:48.320 |
So the first thing we need to do is we need a list of all of the sevens, 01:18:58.280 |
which is just a list of file names, right, and 01:19:05.000 |
So for each of those file names in the sevens 01:19:07.600 |
let's Image.open that file just like we did before to get a PIL object, and let's convert that into a tensor 01:19:14.960 |
So this thing here is called a list comprehension. So if you haven't seen this before 01:19:19.840 |
This is one of the most powerful and useful tools in Python. If you've done something with C sharp 01:19:26.080 |
it's a little bit like LINQ. It's not as powerful as LINQ, but it's a similar idea 01:19:30.640 |
If you've done some functional programming in in JavaScript, it's a bit like some of the things you can do with that, too 01:19:40.400 |
Each item will be called o, and then it will be passed to this function, 01:19:45.420 |
Which opens it up and turns it into a tensor and then it will be collated all back into a list 01:19:58.320 |
So Sylvain and I use list and dictionary comprehensions every day 01:19:58.320 |
And so you should definitely spend some time checking it out if you haven't already 01:20:19.600 |
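For reference, a tiny self-contained sketch of the patterns being praised here (the list and values are illustrative):

```python
# List comprehensions: transform, filter, and a dict comprehension.
# Each is equivalent to a for-loop that appends to a collection.
nums = [1, 2, 3, 4, 5]

squares = [n * n for n in nums]            # transform each item
evens = [n for n in nums if n % 2 == 0]    # keep only some items
names = {n: str(n) for n in nums}          # dictionary comprehension
```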
So remember, this is a tensor, not a PIL image object, so we need 01:20:19.600 |
some command to display it, and show_image is a fastai command that displays a tensor; and so here is our three 01:20:29.960 |
So we need to get the average of all of those threes 01:20:47.040 |
The first thing we need to do is to change this so it's not a list. Each of these tensors has a shape, 01:21:04.960 |
which is 28 by 28; so this is the rows by columns, the size of this thing, right? But three_tensors itself 01:21:13.560 |
is just a list, so I can't really easily do mathematical computations on that 01:21:22.880 |
so what we could do is we could stack all of these 28 by 28 images on top of each other to create a 01:21:28.800 |
like a 3D cube of images, and that's still a tensor 01:21:35.120 |
So a tensor can have as many of these axes or dimensions as you like, and to stack them up you use, funnily enough, 01:21:41.960 |
torch.stack, right? So this is going to turn the list 01:21:45.640 |
Into a tensor and as you can see the shape of it is now 01:21:51.760 |
6131 by 28 by 28. So it's kind of like a cube of height 6131, by 28, by 28 01:22:02.840 |
The other thing we want to do is if we're going to take the mean 01:22:09.520 |
We want to turn them into floating point values 01:22:13.160 |
because we don't want to have integers rounding things off 01:22:18.000 |
The other thing to know is that it's just as kind of a standard in 01:22:22.080 |
Computer vision that when you're working with floats that you you expect them to be between 0 and 1 01:22:29.320 |
so we just divide by 255, because they were between 0 and 255 before; so this is a pretty standard way to normalize image data 01:22:41.960 |
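The stacking and scaling steps just described, sketched with synthetic 28 by 28 "images" standing in for the real MNIST threes:

```python
# torch.stack turns a list of rank-2 tensors into one rank-3 tensor;
# we then convert to float and scale pixel values into the 0-1 range.
import torch

three_tensors = [torch.randint(0, 256, (28, 28)) for _ in range(10)]
stacked_threes = torch.stack(three_tensors).float() / 255

print(stacked_threes.shape)  # torch.Size([10, 28, 28])
```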
So these three things here are called the axes 01:22:51.680 |
Overall we would say that this is a rank 3 tensor. It has three axes. So the 01:23:00.100 |
This one here was a rank 2 tensor. It has two axes 01:23:06.120 |
So you can get the rank from a tensor by just taking the length of its shape 1 2 3 01:23:24.280 |
You can also use the word dimension. I think numpy tends to call it axis PyTorch tends to call it dimension 01:23:40.120 |
So you need to make sure that you remember: rank is the number of axes or dimensions in a tensor, and the shape is the size of each of those axes. 01:23:40.120 |
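A quick sketch of the rank/shape distinction in code (the tensor here is illustrative):

```python
# Rank is just the length of the shape; PyTorch also exposes it as .ndim.
import torch

t = torch.zeros(10, 28, 28)
print(t.shape)       # torch.Size([10, 28, 28])
print(len(t.shape))  # 3 -- the rank
print(t.ndim)        # 3 -- same thing
```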
So we can now say stacked_threes.mean(). Now if we just say stacked_threes.mean(), 01:23:54.360 |
That returns a single number that's the average pixel across that whole cube that whole rank 3 tensor 01:24:17.080 |
But if we say stacked_threes.mean(0), that is: take the mean over the first axis. So that's the mean across the images 01:24:17.080 |
That's now 28 by 28 again because we kind of like 01:24:32.840 |
reduced over this 6131 axis; we took the mean across that axis 01:24:39.880 |
And so we can show that image and here is our ideal 3 01:24:44.440 |
So here's the ideal 7 using the same approach 01:24:48.920 |
All right, so now let's just grab a 3 there's just any old 3 here it is 01:24:55.840 |
And what I'm going to do is I'm going to say well 01:24:58.820 |
Is this 3 more similar to the perfect 3 or is it more similar to the perfect 7 and whichever one? 01:25:04.800 |
It's more similar to I'm going to assume that's that's the answer 01:25:07.840 |
So we could look at each pixel and take the difference between, 01:25:19.000 |
you know, 0 0 here and 0 0 here, and then 0 1 here and 0 1 here, and take the average of the differences. 01:25:24.600 |
The reason we can't just take the average is that there's positives and negatives and they're going to average out 01:25:29.120 |
To nothing, right? So I actually need them all to be positive numbers 01:25:34.220 |
So there's two ways to make them all positive numbers. I could take the absolute value which simply means remove the minus signs 01:25:42.600 |
Okay, and then I could take the average of those 01:25:46.620 |
That's called the mean absolute difference or L1 norm 01:25:52.800 |
or I could take the square of each difference and 01:25:58.160 |
Then take the mean of that and then at the end I could take the square root 01:26:02.740 |
Kind of undoes the squaring and that's called the root mean squared error 01:26:15.080 |
So let's take our three, subtract from it the mean of the threes, take the absolute value, and take the mean. 01:26:22.600 |
Okay, and call that the distance using absolute value of the three to a three 01:26:28.160 |
And that there is the number point one, right? So this is the mean absolute difference or L1 norm 01:26:34.980 |
So when you see a word like L1 norm, if you haven't seen it before it may sound pretty fancy 01:26:39.720 |
But all these math terms that we see, you know, often map to just a line or two of code. 01:26:48.600 |
Don't let the mathy bits put you off; they're often, like, in code 01:26:54.860 |
just very obvious what they mean, whereas with math notation you just have to learn it 01:27:02.880 |
So here's the same version with squaring: take the difference, square it, take the mean, and then take the square root 01:27:09.280 |
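Both distance measures just described, sketched with random tensors standing in for a_3 and the mean three from the lesson:

```python
# L1 (mean absolute difference) and RMSE (root mean squared error)
# between a sample image and an "ideal" mean image.
import torch

a_3 = torch.rand(28, 28)
mean3 = torch.rand(28, 28)

dist_3_abs = (a_3 - mean3).abs().mean()          # mean absolute difference / L1 norm
dist_3_sqr = ((a_3 - mean3) ** 2).mean().sqrt()  # root mean squared error / L2 norm
```

Note that RMSE is always at least as large as the mean absolute difference for the same inputs.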
So then we'll do the same thing for our three this time we'll compare it to the mean of the sevens 01:27:16.800 |
Right. So the distance from a three to the mean of the threes were in terms of absolute was point one 01:27:23.040 |
And the distance from a three to the mean of the sevens was point one five 01:27:28.800 |
So it's closer to the mean of the threes than it is to the mean of the sevens, so we guess, therefore, that this is a three. 01:27:41.200 |
Same thing with RMSE root mean squared error would be to compare this value 01:27:46.080 |
With this value and again root mean squared error. It's closer to the mean three than to the mean seven. So this is like a 01:27:54.920 |
machine learning model, kind of: it's a data-driven model which attempts to recognize threes versus sevens. 01:27:54.920 |
I mean, it's a reasonable baseline. It's going to be better than random 01:28:04.840 |
Actually, PyTorch provides these already: we can just use F.l1_loss, which does exactly that, 01:28:16.320 |
and we can just write F.mse_loss; that doesn't do the square root by default, so we have to pop that in 01:28:25.960 |
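The same two measures via torch.nn.functional, as described (tensors are synthetic stand-ins):

```python
# F.l1_loss is the mean absolute difference; F.mse_loss needs a sqrt
# on top to give RMSE, since it doesn't take the square root itself.
import torch
import torch.nn.functional as F

a_3 = torch.rand(28, 28)
mean7 = torch.rand(28, 28)

l1 = F.l1_loss(a_3, mean7)
rmse = F.mse_loss(a_3, mean7).sqrt()
```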
It's very important before we kind of go too much further to make sure we're very comfortable 01:28:45.240 |
Working with arrays and tensors and you know, they're they're so similar 01:28:49.240 |
So we could start with a list of lists, right? Which is kind of a matrix 01:29:01.040 |
We can display it and they look almost the same 01:29:08.480 |
You can index into a single row or a single column, and so it's important to know 01:29:08.480 |
that a colon means every row, because I put it in the first spot, right? So if I put it in the second spot, it means every column. 01:29:16.680 |
A trailing comma colon is exactly the same as removing it: you never have to include 01:29:27.800 |
colons that are at the end, because they're just implied, right? You never have to, but I often put 01:29:45.840 |
them in anyway, because it just makes it a bit more obvious how these things match up 01:29:54.120 |
You can combine them together so give me the first row and everything from the first up to but not including the third column 01:30:07.200 |
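The indexing rules just described, in a minimal sketch (the tensor values are illustrative):

```python
# Tensor indexing: the position of the index picks the axis, and a
# bare colon means "everything along this axis".
import torch

tns = torch.tensor([[1, 2, 3], [4, 5, 6]])

row = tns[1]        # second row -> tensor([4, 5, 6])
col = tns[:, 1]     # second column -> tensor([2, 5])
sub = tns[1, 1:3]   # row 1, columns 1 up to (not including) 3 -> tensor([5, 6])
```

Note that `tns[1]` and `tns[1, :]` are exactly the same: trailing colons are implied.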
You can add stuff to them. You can check their type. Notice that this is different to the Python type function; the Python type function 01:30:18.680 |
just tells you it's a tensor. If you want to know what kind of tensor, you have to use .type() as a method 01:30:29.840 |
Multiplying them by a float turns it into a float. You know, have a fiddle around if you haven't done much stuff with NumPy or 01:30:37.120 |
Pytorch before this is a good opportunity to just 01:30:40.680 |
Go crazy try things out try try things that you think might not work and see if you actually get an error message, you know 01:30:57.680 |
our model, the one that involves just comparing something to the ideal image. 01:31:11.400 |
You should not check how good our model is on the training set as we've discussed 01:31:17.080 |
We should check it on a validation set and we already have a validation set. It's everything inside the valid directory 01:31:24.360 |
So let's go ahead and, like, combine all those steps from before: let's go through everything in the validation set's 3 directory listing, 01:31:31.040 |
open them, turn them into tensors, stack them all up 01:31:43.740 |
So we're just putting all the steps we did before into a couple of lines 01:31:47.740 |
Yeah, I always try to print out shapes like all the time 01:31:53.100 |
Because if a shape is not what you expected then you can you know get weird things going on 01:32:01.180 |
So the idea is we want some function is_3 that will return true if we think something is a 3, 01:32:12.540 |
by deciding whether the digit that we're testing on is closer to the ideal 3 or the ideal 7 01:32:26.580 |
Remember, our distance measure takes the difference between two things, takes the absolute value, and then takes the mean. 01:32:31.420 |
So we're going to create this function mnist_distance that takes the difference between two images, 01:32:40.580 |
takes their absolute value, and then takes the mean; and look at this: 01:32:46.280 |
we've got minus one, minus two this time: it takes the mean over the last, 01:32:56.580 |
Sorry, the last and second last dimensions. So this is going to take 01:33:02.260 |
The mean across the kind of x and y axes and so here you can see it's returning a 01:33:10.820 |
single number which is the distance of a 3 from the mean 3 01:33:16.980 |
So that's the same as the value that we got earlier point 1 1 1 4 01:33:24.340 |
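A sketch of mnist_distance as described, using random tensors in place of the real images so the shapes can be checked:

```python
# Mean absolute difference over the last two axes (the pixel rows and
# columns), so it works on a single image or, via broadcasting, on a
# whole stack of them at once.
import torch

def mnist_distance(a, b):
    return (a - b).abs().mean((-1, -2))

mean3 = torch.rand(28, 28)
a_3 = torch.rand(28, 28)
valid_3_tens = torch.rand(1010, 28, 28)

print(mnist_distance(a_3, mean3).shape)           # torch.Size([]) -- one number
print(mnist_distance(valid_3_tens, mean3).shape)  # torch.Size([1010]) -- one per image
```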
So we need to do this for every image in the validation set because we're trying to find the overall metric 01:33:30.560 |
Remember the metric is the thing we look at to say how good is our model? 01:33:34.260 |
So here's something crazy. We can call mnist_distance 01:33:39.260 |
not just on our 3, but on the entire validation set 01:33:49.300 |
So that's wild like there's no normal programming that we would do where we could somehow pass in 01:33:55.860 |
either a matrix or a rank 3 tensor and somehow it works both times and 01:34:03.660 |
What actually happened here is that instead of returning a single number, it returned a distance for every single image: a rank 1 tensor of a thousand and ten distances. 01:34:03.660 |
And it did this because it used something called 01:34:18.180 |
Broadcasting and broadcasting is like the super special magic trick 01:34:23.980 |
That lets you make Python into a very very high performance language. And in fact, if you do this broadcasting on 01:34:31.900 |
GPU tenses and pytorch it actually does this operation on the GPU even though you wrote it in Python 01:34:45.180 |
So we're doing a - b on two things. We've got, first of all, valid_3_tens; the valid 3 tensor 01:34:45.180 |
is a thousand or so images, right? And remember that mean3 is just a single 28 by 28 image. So we're doing 01:34:53.940 |
something of this shape minus something of this shape 01:35:08.100 |
Well broadcasting means that if this shape doesn't match this shape 01:35:20.100 |
Like if they did match it would just subtract every corresponding item, but because they don't match 01:35:26.560 |
it actually acts as if there were a thousand and ten versions of the smaller tensor. 01:35:26.560 |
So it's actually going to subtract this from every single one of these 01:35:45.380 |
So broadcasting requires us to first of all understand the idea of element wise operations 01:35:52.200 |
This is an element-wise operation. Here is a rank 1 tensor of three things, plus another rank 1 tensor of three things. 01:36:00.680 |
So we would say these sizes match they're the same and so when I add 1 2 3 to 1 1 1 I get back 01:36:08.200 |
2 3 4 it just takes the corresponding items and adds them together. So that's called element wise operations 01:36:27.440 |
What it ends up doing is it basically copies 01:36:32.680 |
this mean3 a thousand and ten times, and it acts as if we had said valid_3_tens minus a tensor containing a thousand and ten copies of mean3. 01:36:45.520 |
As it says here it doesn't actually copy mean 3 a thousand and ten times it just pretends that it did right? 01:36:53.600 |
It just acts as if it did so basically kind of loops back around to the start again and again 01:36:57.400 |
And it does the whole thing in C or in CUDA on the GPU 01:37:04.320 |
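Broadcasting in miniature, with synthetic tensors of the same shapes as in the lesson:

```python
# Subtracting a (28, 28) tensor from a (1010, 28, 28) tensor acts as
# if the smaller one were copied 1010 times, without actually
# allocating those copies.
import torch

stack = torch.rand(1010, 28, 28)
mean_img = torch.rand(28, 28)

diff = stack - mean_img
print(diff.shape)  # torch.Size([1010, 28, 28])
```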
Then we see absolute value, right? So let's go back up here 01:37:07.960 |
After we do the minus we go absolute value, but what happens when we call absolute value on 01:37:19.120 |
a 1010 by 28 by 28 tensor? It just calls absolute value on each underlying element 01:37:32.960 |
Minus one is the last element, always, in Python; minus two is the second last. 01:37:37.480 |
so this is taking the mean over the last two axes and 01:37:41.520 |
So then it's going to return just the first axis. So we're going to end up with a thousand and ten 01:37:51.320 |
distances, which is exactly what we want: we want to know how far away 01:37:54.920 |
each of our validation items is from the ideal three. 01:38:05.160 |
We can create our is three function, which is hey is the distance 01:38:09.640 |
between the number in question and the perfect three 01:38:14.360 |
less than the distance between the number in question and the perfect seven? If it is, it's a three. 01:38:22.400 |
So for our 3, which was an actual three: is it a three? Yes. 01:38:27.640 |
Okay, and then we can turn that into a float and yes becomes one 01:38:32.320 |
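The decision rule just described, as a sketch; the "ideal" images here are synthetic stand-ins chosen so the answer is easy to verify:

```python
# is_3: an image counts as a 3 if it is closer to the mean three than
# to the mean seven. Works on one image or a whole stack via broadcasting.
import torch

def mnist_distance(a, b):
    return (a - b).abs().mean((-1, -2))

mean3 = torch.zeros(28, 28)   # fake "ideal three"
mean7 = torch.ones(28, 28)    # fake "ideal seven"

def is_3(x):
    return mnist_distance(x, mean3) < mnist_distance(x, mean7)

x = torch.full((28, 28), 0.1)  # much closer to the all-zeros "three"
print(is_3(x))          # tensor(True)
print(is_3(x).float())  # tensor(1.)
```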
Thanks to broadcasting, we can do it for that entire 01:38:37.640 |
validation set, right? So this is so cool. We basically get rid of loops 01:38:42.440 |
In this kind of programming you should have very, very few loops; loops make things 01:38:52.400 |
hundreds or thousands of times slower, and on the GPU potentially tens of millions of times slower 01:39:02.360 |
So we can call is_3 on valid_3_tens, and then turn that into a float, and then take the mean 01:39:07.080 |
So that's going to be the accuracy of the threes on average and here's the accuracy of the sevens. It's just one minus that 01:39:14.080 |
So the accuracy across threes is about 91 and a bit percent the accuracy on sevens is about 98 percent and 01:39:22.040 |
the average of those two is about 95 percent. So here we have a 01:39:27.320 |
Model that's 95 percent accurate at recognizing threes from sevens 01:39:41.560 |
So that's what I mean by getting a good baseline 01:39:51.040 |
It's not obvious how we kind of improve this right. I mean the thing is it doesn't match 01:39:57.840 |
Arthur Samuel's description of machine learning, right? 01:40:03.200 |
This is not something where there's a function which has some parameters which we're testing 01:40:09.320 |
Against some kind of measure of fitness and then using that to like improve the parameters iteratively. We kind of we just did one 01:40:21.320 |
So we want to try and do it in this way, where we arrange for some automatic means of testing the effectiveness of, he called it, 01:40:28.680 |
a weight assignment, in terms of performance, and a mechanism for 01:40:32.800 |
altering the weight assignment to maximize the performance. We want to do it that way, 01:40:38.280 |
right, because we know from chapter one, from lesson one, that if we do it that way we have this, like, 01:40:45.880 |
magic box called machine learning that, particularly combined with neural nets, should be able to solve any problem, 01:40:55.920 |
if you can at least find the right set of weights 01:40:59.880 |
So we need something that we can get better and better 01:41:13.800 |
So instead of finding an ideal image and seeing how far away something is from the ideal image 01:41:21.200 |
that is, instead of having something where we test how far away we are from an ideal image, 01:41:31.680 |
What we could instead do is come up with a set of weights 01:41:36.840 |
for each pixel. So we're trying to find out if something is the number three, and we know that, like, there are places 01:41:47.840 |
where threes tend to have ink, so you could give those high weights: you can say, hey, if there's a dot in those places 01:41:52.840 |
we give it, like, a high score, and if there's dots in other places we give it a low score. 01:41:59.640 |
But we could actually come up with a function where the probability of something being, say, a three is each pixel 01:42:12.560 |
multiplied by some set of weights, and then we sum them up; 01:42:29.120 |
and if we pick good weights, a three is going to end up with a high probability. So here x is the image that we're interested in, 01:42:35.960 |
And we're just going to represent it as a vector. So let's just have all the rows stacked up 01:42:44.560 |
So we're going to use an approach where we're going to start with a 01:42:55.320 |
vector w that's going to contain random weights, or random parameters, 01:43:04.320 |
depending on whether you use the Arthur Samuel version of the terminology or not. 01:43:11.080 |
We'll then predict whether a number appears to be a 3 or a 7 01:43:21.600 |
And then we will figure out how good the model is 01:43:26.080 |
So we will calculate like how accurate it is or something like that 01:43:33.360 |
Then the key step is we're then going to calculate the gradient now 01:43:37.840 |
The gradient is something that measures, for each weight: if I made it a little bit bigger, 01:43:47.800 |
Would the loss get better or worse? And so if we do that for every weight 01:43:51.480 |
We can decide for every weight whether we should make that weight a bit bigger or a bit smaller 01:43:56.320 |
So that's called the gradient, right? So once we have the gradient, we then step (that's the word we use) the weights: 01:44:06.920 |
up a little bit for the ones where the gradient said we should make them a bit higher, and 01:44:12.560 |
Down a little bit for all the ones where the gradient said they should be a bit lower 01:44:16.200 |
So now it should be a tiny bit better and then we go back to step 2 and 01:44:22.240 |
Calculate a new set of predictions using this formula 01:44:32.040 |
So this is basically the flowchart, and then at some point, when we're sick of waiting or when the loss gets good enough, we stop. 01:44:49.040 |
These seven steps are the key to training all deep learning models this technique is called stochastic gradient descent 01:44:56.480 |
Well, it's called gradient descent. We'll see the stochastic bit very soon 01:45:03.040 |
Seven steps, there's lots of choices around exactly how to do it, right? 01:45:08.760 |
We've just kind of hand-waved a lot like what kind of random initialization and how do you calculate the gradient and exactly? 01:45:15.600 |
What step do you take based on the gradient and how do you decide when to stop blah blah blah, right? 01:45:19.440 |
So in this course, we're going to be learning about how to make those choices. 01:45:28.080 |
You know, that's kind of part one; then the other big part is, like, well, what's the actual function, the neural network? 01:45:35.440 |
So how do we train the thing and what is the thing that we train? 01:45:38.960 |
So we initialize parameters with random values 01:45:41.880 |
We need some function that's going to be the loss function, that will return a number that's small if the performance of the model is good. We need 01:45:51.640 |
some way to figure out whether each weight should be increased a bit or decreased a bit. 01:45:58.800 |
Then we need to decide like when to stop which we'll just say let's just do a certain number of epochs 01:46:08.520 |
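The seven steps just described can be sketched on a toy problem; the data, variable names, and learning rate here are illustrative, not from the lesson:

```python
# Fit a single weight w so that w * x approximates y = 2 * x,
# using the seven-step gradient descent recipe.
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = 2 * x

w = torch.randn(1, requires_grad=True)    # 1. initialize randomly
lr = 0.01                                 #    a small learning rate
for _ in range(200):
    pred = w * x                          # 2. predict
    loss = ((pred - y) ** 2).mean()       # 3. measure the loss
    loss.backward()                       # 4. calculate the gradient
    w.data -= lr * w.grad.data            # 5. step the weight
    w.grad = None                         #    reset gradient for next pass
    # 6. repeat ...
# 7. stop after a fixed number of steps
print(w)  # w is now close to 2.0
```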
Go even simpler, right? We're not even going to do MNIST. We're going to start with this function x squared 01:46:14.360 |
Okay, and in fast AI we've created a tiny little thing called plot function that plots the function 01:46:26.880 |
What we're going to do is we're going to try to find this is our loss function 01:46:33.320 |
So we're going to try and find the bottom point 01:46:36.600 |
Right, so we're going to try and figure out what is the x value which is at the bottom? 01:46:41.440 |
So our seven-step procedure requires us to start out by initializing 01:46:49.720 |
Some value right? So the value we pick was just say oh, let's just randomly pick minus one and a half 01:46:55.480 |
Great. So now we need to know if I increase x a bit 01:47:01.120 |
Does my remember this is my loss does my loss get a bit better? Remember better is smaller or a bit worse 01:47:10.080 |
We can just try a slightly higher x and a slightly lower x and see what happens 01:47:14.600 |
Right, and you can see it's just the slope, right, the slope at this point. If I increase x a little bit here, 01:47:14.600 |
then my loss will decrease, because that is the slope at this point 01:47:23.760 |
So we shift our value just a little bit in the direction of the slope. 01:47:38.180 |
Right. So here is the direction of the slope and so here's the new value at that point 01:47:46.920 |
Eventually, we'll get to the bottom of this curve 01:47:49.880 |
So this idea goes all the way back to Isaac Newton at the very least, and this basic idea is called gradient descent. 01:48:02.040 |
So a key thing we need to be able to do is to calculate the derivative, 01:48:13.280 |
which is bad news for me, because I've never been a fan of calculus. But here's the good news: 01:48:22.160 |
Maybe you spent ages in school learning how to calculate derivatives 01:48:27.920 |
You don't have to anymore the computer does it for you and the computer does it fast. It uses all of those 01:48:34.880 |
Methods that you learned at school and it had a whole lot more 01:48:38.880 |
Like clever tricks for speeding them up and it just does it all 01:48:43.160 |
Automatically. So for example, it knows I don't know if you remember this from high school that the derivative of x squared is 2x 01:48:51.640 |
It is just something it knows. It's part of its kind of bag of tricks, right? So 01:48:57.980 |
So PyTorch knows that PyTorch has an engine built in that can take derivatives and find the gradient of functions 01:49:11.620 |
So here's how we do it: we take a tensor, let's say containing the value 3, and in this case we're going to modify this tensor with this special 01:49:17.660 |
method called requires_grad_, and what this does is it tells PyTorch that any time I do a calculation with this 01:49:26.140 |
xt, it should remember what calculation it does, so that I can take the derivative later 01:49:35.700 |
An underscore at the end of a method in PyTorch means that this is called an in place operation 01:49:42.360 |
It actually modifies this; so requires_grad_ 01:49:47.140 |
modifies this tensor to tell PyTorch that we want to be calculating gradients on it 01:49:52.940 |
So that means it's just going to have to keep track of all of the computations we do so that it can calculate the derivative later 01:50:06.220 |
Let's say we then call f on it. Remember f is just squaring it, so 3 squared is 9, 01:50:13.660 |
but the value is not just 9; it's 9 accompanied by a grad function, 01:50:18.860 |
which is how it knows that a power operation has been taken 01:50:30.400 |
To get the derivative we call backward, which refers to backpropagation, which we'll learn about, 01:50:33.320 |
which basically means take the derivative and 01:50:36.280 |
So once it does that, we can now look inside xt, because we said requires_grad_, and 01:51:00.360 |
We just call backward and then get the grad attribute to get the derivative 01:51:10.120 |
What you need to know about calculus is not how to take a derivative, but what a derivative means. 01:51:24.840 |
Now here's something interesting: let's not just take 3, but let's take a rank 1 tensor, a vector, and we'll also add sum 01:51:39.760 |
to our f function. So it's going to do x squared dot sum, returning a single number, and we can get a gradient for every element of the vector 01:52:18.240 |
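A minimal sketch of what was just shown, for both the scalar and the vector case (values chosen to make the derivatives easy to check):

```python
# requires_grad_ makes PyTorch record the computation so backward can
# fill in .grad. The derivative of x**2 is 2x, element by element.
import torch

def f(x):
    return (x ** 2).sum()

xt = torch.tensor(3.0).requires_grad_()
yt = f(xt)            # 9, carrying a grad_fn
yt.backward()
print(xt.grad)        # tensor(6.) -- that is, 2 * 3

xv = torch.tensor([3.0, 4.0, 10.0]).requires_grad_()
f(xv).backward()
print(xv.grad)        # tensor([ 6.,  8., 20.]) -- 2x for each element
```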
So that's kind of all you need to know about calculus right and if this is 01:52:23.300 |
If this idea that a derivative or gradient is a slope is unfamiliar, check out Khan Academy; 01:52:32.200 |
they have some great introductory calculus, and don't forget you can skip all the bits where they teach you how to calculate the derivatives yourself 01:52:40.320 |
So now that we know how to calculate the gradient that is the slope of the function 01:52:46.160 |
that tells us, if we change our input a little bit, how our output will change 01:52:56.320 |
Right and so that tells us that every one of our parameters if we know their gradients 01:53:02.200 |
Then we know if we change that parameter up a bit or down a bit. How will it change our loss? 01:53:07.520 |
So therefore we then know how to change our parameters 01:53:12.080 |
So what we do is, let's say all of our weights are called w: we say w minus equals the gradient times some 01:53:24.320 |
small number, and that small number is often a number between about 0.001 and 0.1; 01:53:30.920 |
it's called the learning rate, right? And this here is the essence of gradient descent. 01:53:40.800 |
So if you pick a learning rate, that's very small 01:53:43.680 |
Then you take the slope and you take a really small step in that direction 01:53:47.680 |
and another small step, another small step, another small step; note it's going to take forever to get to the end. 01:53:59.400 |
Or if your learning rate is too big, you jump too far each time, and again it's going to take forever, or worse. 01:54:07.160 |
We're assuming we're starting here and it's actually so big that it got worse and worse 01:54:11.120 |
Or here's one where we start here and it's like it's not 01:54:16.680 |
So big it gets worse and worse, but it just takes a long time to bounce in and out 01:54:23.680 |
Picking a good learning rate is really important both to making sure that it's even possible 01:54:28.220 |
To solve the problem and that it's possible to solve it in a reasonable amount of time 01:54:33.440 |
So we'll be learning about picking how to pick learning rates in this course 01:54:43.080 |
Let's try this. Let's try using gradient descent. I said SGD. That's not quite accurate. It's just going to be gradient descent 01:54:53.920 |
So the problem we're going to solve is let's imagine you were watching a roller coaster 01:55:03.040 |
So as it comes out of the previous hill, it's going super fast and it's going up the hill 01:55:10.420 |
And it's going slower and slower and slower until it gets to the top of the hump 01:55:14.640 |
And then it goes down the other side it goes faster and faster and faster 01:55:17.880 |
So if you, like, had a stopwatch or some kind of speedometer, and you were measuring the speed by hand 01:55:26.360 |
At kind of equal time points. You might end up with something that looks a bit like this 01:55:31.240 |
Right. And so the way I did this was I just grabbed a range just grabs 01:55:36.080 |
The numbers from naught up to but not including 20. Alright, so these are the time periods at which I'm taking my speed measurement 01:55:46.400 |
quadratic function here, whatever, right? And then I take the time, 01:55:59.280 |
square it, times 0.75, add 1, and then I add a random number to every observation 01:56:07.080 |
So I end up with a quadratic function, which is a bit bumpy 01:56:10.580 |
So this is kind of like what it might look like in real life, because my speedometer measurements aren't perfectly accurate. 01:56:22.800 |
We want to create a function that estimates at any time. What is the speed of the roller coaster? 01:56:30.480 |
We do that by guessing what function it might be: so we guess that it's a function 01:56:36.200 |
a times time squared, plus b times time, plus c, which you might remember from school is called a quadratic 01:56:49.000 |
Let's create it using kind of the Arthur Samuel's technique the machine learning technique. This function is going to take two things 01:56:56.440 |
which in this case is a time and it's going to take some parameters and 01:57:00.720 |
The parameters are a B and C. So in in Python you can split out a 01:57:07.520 |
List or a collection into its components like so and then here's that function. Okay 01:57:14.640 |
So we're not just trying to find any function in the world. We're just trying to find some function 01:57:18.840 |
Which is a quadratic by finding an A and a B and a C 01:57:22.760 |
So the the Arthur Samuel technique for doing this is to next up come up with a loss function 01:57:30.280 |
Come up with a measurement of how good we are. So if we've got some predictions 01:57:35.000 |
that come out of our function, and the targets, which are these, you know, actual values, then a totally reasonable loss function is 01:57:44.720 |
mean squared error. Okay, so here's that mean squared error we saw before: the difference, squared, and take the mean 01:57:52.880 |
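The guessed functional form and its loss, as just described (the random parameters are placeholders for the ones we'll learn):

```python
# A quadratic in t with parameters a, b, c, scored by mean squared error.
import torch

def f(t, params):
    a, b, c = params          # split the collection into its components
    return a * (t ** 2) + b * t + c

def mse(preds, targets):
    return ((preds - targets) ** 2).mean()

t = torch.arange(0, 20).float()
params = torch.randn(3)
preds = f(t, params)
```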
So now we need to go through our seven-step process 01:57:57.680 |
We want to come up with a set of three parameters A B and C 01:58:01.960 |
Which are as good as possible. The step one is to initialize A B and C to random values 01:58:07.920 |
So this is how you get random values three of them in PyTorch and remember we're going to be adjusting them 01:58:13.560 |
So we have to tell PyTorch that we want the gradients 01:58:15.920 |
I'm just going to save those away so I can check them later and then I calculate the predictions using that function f 01:58:29.320 |
And then let's create a little function which just plots how good our predictions are at this point: so here in red are 01:58:40.600 |
our predictions, and in blue our targets. So that looks pretty terrible 01:58:52.720 |
Okay, so now we want to improve this. So we calculate the gradients using the two steps we saw: 01:58:58.960 |
call backward and then get grad, and this gives us the gradient for each of our parameters. 01:59:06.880 |
Let's pick a learning rate of 10 to the minus 5, so we multiply that by 10 to the minus 5 01:59:15.880 |
and step the weights. And remember, stepping the weights means minus equals learning rate times 01:59:25.280 |
the gradient. There's a wonderful trick here, which is that I've used dot data. 01:59:32.040 |
The reason I've used dot data is that dot data is a special attribute in PyTorch which, if you use it, means the 01:59:40.000 |
gradient is not calculated, and we certainly wouldn't want the gradient to be calculated for 01:59:47.040 |
the actual step we're doing; we only want the gradient to be calculated for our function. 01:59:54.320 |
Right, so when we step the weights, we have to use this special dot data attribute. 02:00:04.320 |
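The update step with the dot data trick might look like this (a sketch; the loss here is a stand-in just so backward has something to differentiate):

```python
import torch

lr = 1e-5
params = torch.randn(3).requires_grad_()

loss = (params**2).sum()   # stand-in loss, just to produce gradients
loss.backward()            # populates params.grad

# .data sidesteps gradient tracking, so the update itself is not
# part of the computation being differentiated
params.data -= lr * params.grad.data
params.grad = None         # clear the gradients for the next step
```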
And let's see if the loss improved. So the loss before was 02:00:12.200 |
higher; now it's 5,400, and the plot has gone from something that goes down to minus 300. 02:00:26.920 |
So I've just grabbed those previous lines of code and pasted them all into a single cell. 02:00:30.720 |
Okay, so: preds, loss, backward, the dot data step, set grad to None, 02:00:35.120 |
and then from time to time print the loss out, and 02:00:38.360 |
repeat that ten times, and look, it's getting better and better. 02:00:42.920 |
And so we can actually watch it getting better and better. 02:00:52.200 |
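Gathering those steps into one cell might look roughly like this. Note the data here is made up (a noisy quadratic standing in for the roller-coaster speeds), so this is a sketch rather than the notebook's exact code:

```python
import torch

torch.manual_seed(42)  # for reproducibility of this sketch

def f(t, params):
    a, b, c = params
    return a*(t**2) + (b*t) + c

def mse(preds, targets):
    return ((preds - targets)**2).mean()

# Made-up data: 20 timestamps, with noisy quadratic "speeds"
time = torch.arange(0, 20).float()
speed = 0.75*(time - 9.5)**2 + 1 + torch.randn(20)*3

params = torch.randn(3).requires_grad_()
lr = 1e-5

losses = []
for i in range(10):
    preds = f(time, params)          # predict
    loss = mse(preds, speed)         # measure
    loss.backward()                  # gradients
    params.data -= lr * params.grad.data  # step the weights
    params.grad = None               # reset for the next iteration
    losses.append(loss.item())
    print(loss.item())               # print the loss out each step
```

With a small enough learning rate on this quadratic fit, the printed loss shrinks step by step.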
So this is pretty cool, right? We have a technique, this Arthur Samuel technique, for something that 02:01:03.240 |
continuously improves by getting feedback from the result of measuring some loss function. 02:01:18.760 |
So you should make sure that you go back and feel super comfortable with what's happened here. 02:01:25.800 |
You know, if you're not feeling comfortable, that's fine, 02:01:28.280 |
right, particularly if it's been a while, or if you've never done this kind of gradient descent before. 02:01:36.600 |
So try to find the first cell in this notebook where you don't fully understand what it's doing, 02:01:42.760 |
and then stop and figure it out: look at everything that's going on, do some experiments, do some reading, until you understand 02:01:53.280 |
that cell where you were stuck, before you move forwards. 02:02:09.280 |
Now we want to use this exact same technique for MNIST, and there's basically nothing extra we have to do. 02:02:18.120 |
The metric that we've been using is the error rate, or the accuracy: 02:02:25.160 |
it's like, how often are we correct? Right, and that's the thing that we're actually trying to make 02:02:31.380 |
good, our metric. But we've got a very serious problem, 02:02:36.440 |
which is, remember, we need to calculate the gradient 02:02:39.360 |
to figure out how we should change our parameters, and the gradient is the slope, or the steepness, 02:02:46.480 |
which you might remember from school is defined as rise over run: 02:02:50.320 |
it's y new minus y old, divided by x new minus x old. 02:02:58.440 |
The gradient's actually defined when x new is very, very close to x old. But think about 02:03:09.520 |
accuracy: if I change a parameter by a tiny, tiny, tiny amount, chances are there won't be any 02:03:19.760 |
three that we now predict as a seven, or any seven that we now predict as a three, 02:03:25.400 |
because we changed the parameter by such a small amount. 02:03:28.160 |
So it's possible, in fact it's certain, that the gradient is zero in 02:03:35.840 |
many places, and that means that our parameters 02:03:39.080 |
aren't going to change at all, because learning rate times gradient is still zero when the gradient is zero, for any learning rate. 02:03:52.960 |
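Here's a tiny made-up example of that flatness: treat a single parameter as a decision threshold and nudge it slightly; the accuracy doesn't move at all, so its gradient with respect to that parameter is zero.

```python
import torch

scores = torch.tensor([0.2, 0.6, 0.9])   # made-up model outputs
targets = torch.tensor([0., 1., 1.])     # made-up labels

def accuracy(threshold):
    # Accuracy only changes when a prediction crosses the threshold,
    # so it is flat (zero gradient) almost everywhere
    preds = (scores > threshold).float()
    return (preds == targets).float().mean()

# A tiny nudge to the parameter leaves accuracy unchanged
print(accuracy(0.5).item(), accuracy(0.5001).item())
```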
So this is why the loss function and the metric are not always the same thing: a metric can't be used as 02:04:01.240 |
our loss if that metric has a gradient of zero. 02:04:04.960 |
So we need something different. We want to find something that 02:04:12.680 |
is pretty similar to the accuracy, in that as the accuracy gets better, this ideal function we want gets better as well, but that doesn't have a gradient of zero. 02:04:39.800 |
This is actually probably a good time to stop, because, you know, 02:04:45.040 |
we've kind of got to the point here where we understand gradient descent, 02:04:49.640 |
we kind of know how to do it with a simple loss function, and 02:04:55.720 |
I actually think, before we start looking at the MNIST loss function, 02:04:59.680 |
we shouldn't move on, because we've got so many assignments to do for this week already. 02:05:06.400 |
We've got: build your web application, and: go step through this notebook to make sure you fully understand it. 02:05:17.640 |
So let's stop right here before we make things too crazy. 02:05:26.760 |
Okay, great. All right, well, thanks everybody. I'm sorry for that last-minute change of tack there, but I think this is going to make sense. 02:05:35.360 |
So I hope you have a lot of fun with your web applications. Try and think of something that's really fun, really interesting. 02:05:43.200 |
It doesn't have to be, like, important; it could just be some, you know, cute thing. 02:05:49.320 |
We've had a student before who, I think he said, had 16 different cousins, and he created something that would classify 02:05:57.280 |
a photo based on which of his cousins it was, for, like, his fiancée meeting his family. 02:06:08.000 |
But, you know, show off your application, and 02:06:11.120 |
maybe have a look around at what ipywidgets can do, and try and come up with something that you think is pretty cool. 02:06:19.120 |
All right. Thanks, everybody. I will see you next week.