
Lesson 3 - Deep Learning for Coders (2020)


Chapters

0:00 Recap of Lesson 2 + What's next
1:08 Resizing Images with DataBlock
8:46 Data Augmentation and item_tfms vs batch_tfms
12:28 Training your model, and using it to clean your data
18:07 Turning your model into an online application
36:12 Deploying to a mobile phone
38:13 How to avoid disaster
50:59 Unforeseen consequences and feedback loops
57:20 End of Chapter 2 Recap + Blogging
64:09 Starting MNIST from scratch
66:58 untar_data and path explained
70:57 Exploring the MNIST data
72:05 NumPy Array vs PyTorch Tensor
76:00 Creating a simple baseline model
88:38 Working with arrays and tensors
90:50 Computing metrics with Broadcasting
99:46 Stochastic Gradient Descent (SGD)
114:40 End-to-end Gradient Descent example
121:56 MNIST loss function
124:40 Lesson 3 review

Whisper Transcript

00:00:00.000 | So hello and welcome to lesson three of practical deep learning for coders
00:00:09.480 | Last week we were looking at getting our model into production,
00:00:14.380 | and so we're going to finish that off today, and then we're going to start to look behind the scenes at what actually goes
00:00:20.360 | on when we train a neural network. We're going to look at
00:00:22.980 | kind of the math of what's going on,
00:00:26.760 | and we're going to learn about SGD and important stuff like that.
00:00:31.820 | The order is slightly different to the book. In the book there's a part which says like, hey,
00:00:38.640 | you can either go to lesson four or lesson three now
00:00:41.880 | and then go back to the other one afterwards. So we're doing lesson four and then lesson three.
00:00:46.440 | Chapter four and then chapter three, I should say.
00:00:49.280 | You can choose whichever way you're interested. Chapter four is the more
00:00:55.320 | technical chapter about the foundations of how deep learning really works, whereas chapter three is all about ethics.
00:01:02.120 | And so with the lessons we'll do that next week
00:01:06.420 | So we're looking at
00:01:10.760 | the 02_production notebook, and
00:01:13.720 | we're going to look at the fastbook version; in fact, everything I'm looking at today will be in the fastbook version.
00:01:23.680 | Remember last week we had a look at
00:01:25.680 | our bears, and we created this DataLoaders object
00:01:31.520 | by using
00:01:34.240 | The data block API, which I hope everybody's had a chance to experiment with this week if you haven't
00:01:39.420 | Now's a good time to do it
00:01:42.480 | We kind of skipped over one of the lines a little bit,
00:01:45.880 | which is this item transform, item_tfms.
00:01:48.840 | So what this is doing here when we said Resize:
00:01:53.440 | the images we downloaded from the internet are lots of different sizes and lots of different aspect ratios.
00:01:58.320 | Some are tall and some are wide, some are square, some are big, some are small.
00:02:01.920 | When you say Resize for an item transform (an item in this case is one image), it means each item
00:02:09.080 | is going to be resized to 128 by 128 by squishing or stretching it.
00:02:14.960 | And so we had a look at you can always say show batch to see a few examples and this is what they look like
00:02:25.000 | Squishing and stretching isn't the only way that we can resize. Remember, we have to make everything into a square
00:02:31.280 | before we kind of get it into our model; by the time it gets to our model, everything has to be the same size in
00:02:37.520 | each mini-batch, and that's why. Making it a square is not the only way to do that,
00:02:41.640 | but it's the easiest way, and it's by far the most common way.
00:02:48.680 | Another way to do this
00:02:54.240 | We can create another
00:02:56.720 | DataBlock object, and we can make a DataBlock object
00:03:01.980 | that's an identical copy of an existing DataBlock object where we can then change just some pieces.
00:03:07.940 | And we can do that by calling the new method, which is super handy. And so let's create another DataBlock
00:03:13.840 | object, and this time with a different item transform, where we resize using the
00:03:21.480 | squish method.
00:03:23.480 | We have a question. What are the advantages of having square images versus rectangular ones?
00:03:29.800 | That's a great question
00:03:36.920 | Really its simplicity
00:03:40.800 | If you know all of your images are rectangular of a particular aspect ratio to start with you may as well
00:03:47.680 | Just keep them that way. But if you've got some which are tall and some which are wide
00:03:51.840 | Making them all square is kind of the easiest
00:03:55.440 | Otherwise you would have to kind of organize them such that all of the tall ones ended up in a mini-batch
00:04:01.760 | and all the wide ones ended up in a mini-batch, and then you'd have to kind of figure out
00:04:05.960 | what the best aspect ratio for each mini-batch is, and we actually have some research that does that in fastai,
00:04:13.280 | But it's still a bit clunky
00:04:17.520 | I should mention okay
00:04:19.000 | I just lied to you: the default is not actually to squish or stretch. The default, I should have said, sorry, the default
00:04:24.160 | when we say Resize is actually just to grab
00:04:28.640 | the center. So actually all we're doing is we're grabbing the center of each image.
00:04:34.880 | So if we want to squish or stretch, we can add the ResizeMethod.Squish
00:04:38.840 | argument to Resize, and you can now see that this black bear is now looking much thinner,
00:04:44.680 | but we have got the kind of leaves that are around on each side instead.
00:04:49.000 | Another question when you use the dls dot new method what can and cannot be changed is it just the transforms
00:04:59.720 | So it's not dls.new, it's bears.new, right? So we're not creating a new DataLoaders object,
00:05:04.680 | we're creating a new DataBlock object. I don't remember off the top of my head,
00:05:08.920 | so check the documentation and I'm sure somebody can pop the answer into the
00:05:14.520 | into the forum
00:05:16.520 | So you can see when we use .squish that this grizzly bear has got
00:05:22.000 | pretty kind of
00:05:24.280 | wide and weird looking, and this black bear has got pretty weird and thin looking, and it's easiest kind of to see what's going on
00:05:31.120 | if we use
00:05:32.680 | ResizeMethod.Pad, and what .pad does, as you can see, is it just adds some
00:05:37.200 | black bars around each side. So you can see the grizzly bear was tall,
00:05:41.640 | so then (squishing and stretching are opposites of each other)
00:05:45.680 | when we stretched it, it ended up wide, and the black bear was
00:05:49.560 | originally a wide rectangle, so it ended up looking kind of thin.
00:05:54.240 | You don't have to use zeros; zeros means pad it with black. You can also say reflect to kind of have
00:06:02.580 | the pixels reflected; they'll kind of look a bit better that way if you use reflect.
00:06:08.880 | All of these different methods have their own problems
00:06:11.440 | The pad method is kind of the cleanest: you end up with the correct size, you end up with all of the pixels,
00:06:18.400 | but you also end up with wasted pixels, so you kind of end up with wasted computation.
00:06:22.840 | The squish method is the most efficient, because you get all of the information,
00:06:28.440 | you know, and nothing's kind of wasted, but on the downside your neural net is going to have to learn to kind of
00:06:38.200 | recognize when something's been squished or stretched, and in some cases it wouldn't even know.
00:06:42.440 | So if there's two objects you're trying to recognize, one of which tends to be thin and one of which tends to be thick,
00:06:47.480 | and otherwise they're the same, they could actually be impossible to distinguish.
00:06:51.520 | And then the default cropping approach actually
00:06:56.200 | Removes some information
00:06:59.000 | So in this case
00:07:01.400 | You know this this grizzly bear here
00:07:04.840 | We actually lost a lot of its legs
00:07:07.440 | So if figuring out what kind of bear it was
00:07:10.400 | required looking at its feet, well, we don't have its feet anymore.
00:07:14.000 | So they all have downsides
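The tradeoffs just described can be sketched with a bit of arithmetic. This is a toy illustration in plain Python, not fastai's implementation; all the helper names here are made up for the sketch.

```python
# A toy sketch of how the three resize strategies map a w x h image
# onto an s x s square (illustrative only, not fastai's code).

def crop_box(w, h, s):
    """Center-crop: take the largest centered square, losing edge pixels."""
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    return left, top, left + side, top + side  # region kept, then scaled to s x s

def pad_amounts(w, h):
    """Pad: keep every pixel, add black (or reflected) bars to make a square."""
    side = max(w, h)
    return (side - w) // 2, (side - h) // 2  # horizontal, vertical padding

def squish_scale(w, h, s):
    """Squish: scale each axis independently, distorting the aspect ratio."""
    return s / w, s / h

# A tall 200x400 grizzly photo resized to a 128x128 square:
print(crop_box(200, 400, 128))      # (0, 100, 200, 300) -> legs outside the box are lost
print(pad_amounts(200, 400))        # (100, 0) -> 100px bars on left and right, wasted pixels
print(squish_scale(200, 400, 128))  # (0.64, 0.32) -> twice as much vertical squashing
```

The three printouts correspond to the three downsides: cropping discards the rows outside the box, padding wastes computation on the bars, and squishing applies a different scale per axis so shapes distort.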
00:07:16.920 | So there's something else that you can do, a different approach, which is instead of saying Resize you can say
00:07:25.280 | RandomResizedCrop, and actually this is the most common approach. And what RandomResizedCrop does is each time
00:07:31.480 | it actually grabs a
00:07:35.920 | Different part of the image and kind of zooms into it
00:07:40.000 | Right. So these this is all the same image and we're just grabbing a batch of
00:07:45.080 | four different versions of it and you can see some are kind of
00:07:48.920 | you know, they're all squished in different ways, and we've kind of selected different subsets and so forth. Now this
00:07:56.160 | kind of seems worse than any of the previous approaches, because I'm losing information. Like this one here,
00:08:02.880 | I've actually lost a whole lot of its back, right?
00:08:07.000 | But the cool thing about this is that remember we want to avoid overfitting and
00:08:12.720 | When you see a different part of the animal each time
00:08:17.660 | It's much less likely to overfit because you're not seeing the same image on each epoch that you go around
00:08:24.680 | that makes sense, so
00:08:28.920 | This RandomResizedCrop approach is actually super popular, and so min_scale=0.3 means
00:08:35.240 | we're going to pick at least 30% of the pixels of kind of the original size each time,
00:08:41.280 | and then we're kind of going to zoom into that square.
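A rough sketch of that idea in plain Python (not fastai's actual code; the function names are invented): at training time pick a random square covering at least min_scale of the image's area, and at validation time just take the largest centered square.

```python
import math
import random

def random_square_crop(w, h, min_scale=0.3, rng=random):
    """Training behavior: a random square covering >= min_scale of the area."""
    max_side = min(w, h)
    # smallest side length whose square still covers min_scale of the image's area
    min_side = min(math.ceil(math.sqrt(min_scale * w * h)), max_side)
    side = rng.randint(min_side, max_side)
    left = rng.randint(0, w - side)
    top = rng.randint(0, h - side)
    return left, top, side  # this region is then zoomed to the target size

def center_square_crop(w, h):
    """Validation behavior: the largest centered square, deterministically."""
    side = min(w, h)
    return (w - side) // 2, (h - side) // 2, side

rng = random.Random(42)
print(random_square_crop(640, 480, min_scale=0.3, rng=rng))  # a different crop each call
print(center_square_crop(640, 480))  # (80, 0, 480)
```

Because the training crop differs on every call, each epoch sees a slightly different part of the animal, which is the anti-overfitting point made above.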
00:08:52.280 | this idea of doing something so that each time the
00:08:56.400 | Model sees the image it looks a bit different to last time. It's called data augmentation and this is one type of data augmentation
00:09:04.040 | It's probably the most common
00:09:06.040 | but there are others and
00:09:09.400 | One of the best ways to do data augmentation is to use
00:09:13.960 | this aug_transforms function, and what aug_transforms does is it actually returns a list of
00:09:23.000 | different augmentations and so there are
00:09:27.040 | augmentations which change contrast which change brightness
00:09:30.600 | Which warp the perspective so you can see in this one here
00:09:34.120 | It looks like this bit's much closer to you and this bit's further away from you, because it's been perspective warped.
00:09:38.520 | It rotates them: see, this one's actually been rotated. This one's been made really dark, right?
00:09:43.860 | These are batch transforms not item transforms
00:09:48.120 | The difference is that item transforms happen one image at a time, and so the thing that resizes them all to the same size,
00:09:54.000 | that has to be an item transform.
00:09:56.000 | Then we pop it all into a mini-batch, put it on the GPU, and a batch transform happens to a whole mini-batch at a time.
00:10:03.160 | And by putting these as batch transforms, the augmentation happens super fast, because it happens on the GPU, and I don't know if there are any other
00:10:12.120 | libraries, as we speak, which allow you to write your own GPU-accelerated transformations that run on the GPU in this way.
00:10:19.760 | So this is a super handy thing in fastai v2.
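The item-vs-batch distinction can be shown with a deliberately tiny toy (plain Python lists standing in for images, no GPU involved; every name here is made up): item transforms run once per image so everything reaches the same size, then the images are stacked and batch transforms run on the whole stack in one call.

```python
# Toy illustration of item_tfms vs batch_tfms (not fastai's code).

def item_resize(img, size):
    """Item step: runs once per image. Crude 'resize' by cropping to size x size."""
    return [row[:size] for row in img[:size]]

def batch_brighten(batch, amount):
    """Batch step: one call transforms every image in the mini-batch at once."""
    return [[[px + amount for px in row] for row in img] for img in batch]

images = [[[1] * 5] * 4, [[2] * 3] * 6]          # different shapes, like raw downloads
batch = [item_resize(img, 3) for img in images]  # item transforms: now all 3x3
batch = batch_brighten(batch, 10)                # batch transform: applied in one go
print([len(img) for img in batch])               # [3, 3]
print(batch[0][0])                               # [11, 11, 11]
```

The item step is what makes stacking possible at all (every image must be the same size); the batch step is where GPU acceleration pays off, because one operation touches the whole mini-batch.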
00:10:25.400 | So you can check out the documentation for
00:10:32.040 | aug_transforms, and when you do, you'll find the documentation for all of the underlying transforms that it basically wraps, right?
00:10:41.200 | So you can see if I shift tab
00:10:43.320 | I don't remember if I've shown you this trick before if you go inside the parentheses of a function and hit shift tab a few
00:10:48.900 | times it'll pop open a list of all of the arguments and so you can basically see you can say like oh
00:10:55.620 | Can I sometimes flip it left right? Can I sometimes flip it up down? What's the maximum and I can rotate zoom?
00:11:03.340 | change the lighting
00:11:05.960 | Warp the perspective
00:11:07.960 | and so forth
00:11:10.680 | How can we add different augmentations for train and validation sets?
00:11:14.800 | So the cool thing is that
00:11:18.960 | Automatically fast AI will avoid doing data augmentation on the validation set
00:11:28.200 | so all of these org transforms will only be applied to the
00:11:33.800 | Training set
00:11:37.720 | With the exception of random resized crop random resized crop has a different behavior for each
00:11:43.120 | the behavior for the training set is what we just saw which is to randomly pick a subset and kind of zoom into it and
00:11:51.080 | Behavior for the validation set is just to grab the center the largest center square that it can
00:11:56.760 | You can write your own
00:12:02.440 | transformations; they're just Python, just standard PyTorch code,
00:12:06.120 | and by default they will only be applied to the training set. If you want to do something fancy like
00:12:13.440 | RandomResizedCrop, where you actually have different things being applied to each set,
00:12:15.960 | you should come back to the next course to find out how to do that, or read the documentation. It's not rocket science, but it's
00:12:22.480 | not something most people need to do.
00:12:25.600 | Okay, so
00:12:31.720 | Last time, we did bears.new with a RandomResizedCrop min_scale of 0.5, and we added some transforms.
00:12:39.140 | We went ahead and trained actually since last week. I've rerun this notebook
00:12:44.080 | I've got it on a different computer and I've got different images, so it's not all exactly the same,
00:12:47.920 | but I still got a good confusion matrix. So of the
00:12:52.840 | black bears, 37 were classified correctly, two were classified as grizzlies, and one as a teddy.
00:13:02.360 | And we talked about plot_top_losses, and it's interesting. You can see in this case
00:13:06.800 | there's some clearly kind of odd things going on. This is not a bear at all.
00:13:10.960 | This looks like it's a drawing of a bear, which it has decided is
00:13:16.000 | predicted as a teddy; but this thing's meant to be a drawing of a black bear. I can certainly see the confusion.
00:13:22.880 | You can see how some parts of it are being cut off. We'll talk about how to deal with that later
00:13:30.280 | Now one of the interesting things is that we didn't really do
00:13:33.460 | Much data cleaning at all before we built this model
00:13:37.840 | The only data cleaning we did was just to validate that each image can be opened. There was that verify images call
00:13:44.360 | And the reason for that is it's actually much easier normally to clean your data after you create a model and I'll show you how
00:13:52.220 | We've got this thing called image classifier cleaner
00:13:56.840 | Where you can pick a category right and training set or validation set
00:14:04.060 | And then what it will do is it will then list all of the images in that set and it will pick the ones
00:14:14.320 | which are
00:14:16.880 | which it is the least confident about, which are the most likely to be wrong,
00:14:23.280 | where the loss is the worst, to be more precise. And so this,
00:14:29.200 | This is a great way to look through your data and find problems. So in this case, the first one is
00:14:37.520 | Not a teddy or a brown bear or a black bear. It's a puppy dog
00:14:41.800 | Right. So this is a great cleaner because what I can do is I can now click delete here
00:14:47.600 | This one here looks a bit like an Ewok rather than a teddy. I'm not sure.
00:14:51.760 | What do you think, Rachel, is it an Ewok? I'm gonna call it an Ewok.
00:14:54.060 | Right. And so you can kind of go through
00:14:56.920 | Okay, that's definitely not a teddy and so you can either say like oh that's wrong
00:15:03.120 | It's actually a grizzly bear or it's wrong
00:15:05.400 | It's a black bear or I should delete it or by default just keep it right and you can kind of keep going through until
00:15:09.960 | You think like okay, they're all seem to be fine
00:15:12.520 | Maybe that one's not
00:15:17.280 | And kind of once you get to the point where they all seem to be fine, you can kind of say, okay
00:15:21.880 | Probably all the rest are fine too because they all have lower losses. So they all fit the kind of the mold of a teddy
00:15:28.240 | And so then I can run this code here,
00:15:31.400 | where I just go through cleaner.delete, so that's all the things which I selected delete for, and unlink them.
00:15:38.840 | so unlink
00:15:41.440 | Is just another way of saying delete a file that's the Python name
00:15:46.760 | And then go through all the ones that we said change and we can actually move them to the correct directory
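That two-part cleanup loop can be sketched with the standard library. This is not the cleaner widget's code: `to_delete` and `to_move` are stand-ins for what `cleaner.delete()` and `cleaner.change()` yield from the GUI selections, and the files here are fakes created in a temp directory.

```python
import shutil
import tempfile
from pathlib import Path

# Fake dataset: one folder per category, two mislabeled files in teddy/.
root = Path(tempfile.mkdtemp())
for cat in ("teddy", "grizzly"):
    (root / cat).mkdir()
(root / "teddy" / "puppy.jpg").write_text("not a bear at all")
(root / "teddy" / "actually_grizzly.jpg").write_text("wrong label")

# Stand-ins for the widget's selections:
to_delete = [root / "teddy" / "puppy.jpg"]
to_move = [(root / "teddy" / "actually_grizzly.jpg", "grizzly")]

for path in to_delete:
    path.unlink()  # unlink is just Python's name for deleting a file
for path, new_cat in to_move:
    shutil.move(str(path), str(root / new_cat / path.name))  # re-file under the right label

print(sorted(p.name for p in (root / "grizzly").iterdir()))  # ['actually_grizzly.jpg']
```

The real notebook does exactly this shape of loop, just driven by the indices the cleaner widget recorded.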
00:15:53.340 | If you haven't seen this before you might be surprised that we've kind of created our own little GUI inside
00:16:01.320 | Jupyter notebook
00:16:04.800 | Yeah, you can do this and we built this with less than a screen of code you can check out the source code in the
00:16:11.960 | Fast AI notebooks. So this is a great time to remind you that
00:16:22.080 | Fast AI is built with notebooks
00:16:26.480 | And so if you go to the fast AI repo and clone it and then go to nbs you'll find
00:16:31.920 | all of the code of fast AI
00:16:38.000 | Written as notebooks and they've got a lot of pros and examples and tests and so forth
00:16:43.660 | so the best place to learn about how this is implemented is to look at the notebooks rather than looking at the
00:16:52.080 | module
00:16:58.280 | By the way, sometimes you'll see like weird little comments like this
00:17:03.080 | These weird little comments are part of a development environment for Jupyter notebook. We use called nbdev which we built
00:17:08.800 | So so far and I built this thing to make it much easier for us to kind of create books
00:17:14.800 | and websites and libraries in Jupyter notebooks. So this particular one here hide
00:17:21.640 | means
00:17:24.280 | When this is turned into a book or into documentation don't show this cell
00:17:28.720 | And the reason for that is because you can see I've actually got it in the text, right?
00:17:32.880 | But I thought when you're actually running it it would be nice to have it sitting here waiting for you to run directly
00:17:38.960 | So that's why it's shown in the notebook, but not in the book; it's shown differently.
00:17:43.800 | And you'll also see these things like "s:" with a quote; in the book that would end up saying
00:17:51.160 | "Sylvain says" and then what he says.
00:17:53.560 | So there's kind of little bits and pieces in the notebooks that just look a little bit odd, and that's because it's
00:17:59.920 | designed that way in order to create stuff in the book.
00:18:04.600 | Right, so then last week we saw how you can export that to a pickle file that contains all the information for the model
00:18:13.860 | And then on the server where you're going to actually do your inference
00:18:18.560 | You can then load that saved file and you'll get back a learner that you can call predict on so predict
00:18:25.080 | Perhaps the most interesting part of predict is the third thing that it returns
00:18:34.160 | Which is a tensor in this case containing three numbers
00:18:38.600 | The three numbers there's three of them because we have three classes teddy bear grizzly bear and black bear right and so
00:18:48.360 | This doesn't make any sense until you know what the order of the classes is
00:18:55.100 | in your DataLoaders, and you can ask the DataLoaders what the order is by asking for its vocab.
00:19:02.240 | So a vocab in fast AI is a really common concept
00:19:06.640 | it's basically any time that you've got like a mapping from numbers to strings or
00:19:12.140 | discrete levels
00:19:14.920 | The mapping is always stored in the vocab. So here this shows us that
00:19:19.520 | the activation for
00:19:26.560 | black bear is
00:19:28.560 | 1e-6, the activation for grizzly is 1, and the activation for teddy is
00:19:35.480 | 1e-6.
00:19:42.200 | It's very, very confident that this particular one was a grizzly; not surprisingly, this was something called grizzly.jpg.
00:19:48.120 | So you need to kind of know this
00:19:54.040 | This mapping in order to display the correct thing
00:19:57.400 | But of course the DataLoaders object already knows that mapping, it's called the vocab, and it's stored with the DataLoaders.
00:20:03.280 | So that's how it knows to say grizzly automatically
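The vocab lookup the learner performs can be shown in a few lines of plain Python (the numbers are invented to match the lecture's example; fastai does this internally when you call predict):

```python
# The vocab maps class indices to human-readable labels; predict's third
# return value is one activation per class in vocab order.

vocab = ["black", "grizzly", "teddy"]   # class order, as stored in the DataLoaders
probs = [1e-6, 0.999998, 1e-6]          # one probability per class, made-up numbers

# argmax over the activations, then look the index up in the vocab
idx = max(range(len(probs)), key=probs.__getitem__)
print(vocab[idx], idx)  # grizzly 1
```

That index-plus-vocab pair is why predict can hand you back the string "grizzly" directly, rather than just a bare tensor of numbers.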
00:20:06.640 | So the first thing it gives you is the human readable string that you'd want to display
00:20:10.320 | So this is kind of nice that
00:20:13.080 | with fastai v2 you save this object which has everything you need for inference. It's got all the, you know, information about
00:20:20.840 | Normalization about any kind of transformation steps about what the vocab is so it can display everything correctly
00:20:28.520 | right, so
00:20:32.000 | now we want to
00:20:34.880 | Deploy this as an app
00:20:37.000 | Now if you've done some web programming before then all you need to know is that this line of code and this line of code
00:20:46.480 | So this is the line of code you would call once when your application starts up
00:20:51.040 | And then this is the line of code you would call
00:20:53.400 | Every time you want to do an inference and there's also a batch version of it which you can look up if you're interested
00:20:58.800 | This is just a one at a time
00:21:03.200 | So there's nothing special if you're already a web programmer or have access to a web programmer
00:21:08.880 | These are you know you just have to stick these two lines of code somewhere and the three things you get back are the
00:21:13.680 | The human readable string if you're doing categorization
00:21:17.240 | The index of that which in this case is one is grizzly and the probability of each class
00:21:23.440 | One of the things we really wanted to do in this course though is not assume that everybody is a web developer
00:21:31.840 | Most data scientists aren't but gee wouldn't it be great if all data scientists could at least like prototype an application to show off
00:21:39.300 | the thing they're working on and
00:21:41.300 | so we've
00:21:43.640 | tried to kind of curate an approach; none of it is stuff we've built, it really is curated,
00:21:48.680 | Which shows how you can create a GUI and create a complete application in Jupyter Notebook?
00:21:56.360 | so the
00:22:04.240 | key pieces of technology we use to do this are IPython widgets,
00:22:07.220 | which is always called ipywidgets, and Voila.
00:22:14.760 | ipywidgets, which we import by default as widgets (and that's also what they use in their own documentation), has
00:22:19.240 | GUI widgets, for example a file upload button.
00:22:19.240 | so if I create this file upload button and then display it I
00:22:25.880 | see, and we saw this in the last lesson as well, maybe it was lesson one, an actual clickable button.
00:22:30.520 | So I can go ahead and
00:22:33.960 | Click it and it says now, okay, you've selected one thing
00:22:39.860 | So how do I use that?
00:22:43.640 | Well, these widgets have all kinds of methods and properties, and the upload button has a data property,
00:22:55.920 | which is an array
00:22:57.920 | containing all of the images you uploaded,
00:22:59.920 | so you can pass that to
00:23:02.440 | PILImage.create, and so .create is kind of the standard
00:23:07.520 | factory method we use in fastai to create items.
00:23:14.000 | And PILImage.create is smart enough to be able to create an item from all kinds of different things,
00:23:19.160 | And one of the things it can create it from is a binary blob, which is what a file upload contains
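That "create from all kinds of things" factory idea can be sketched in plain Python. This is a toy dispatcher, not PILImage.create itself (which lives in fastai); the return values are just tags so the dispatch is visible.

```python
import io
from pathlib import Path

# A toy `.create`-style factory: one constructor that inspects what it was
# given (a path, raw bytes from an upload widget, or an open file-like
# object) and handles each appropriately.

def image_create(src):
    if isinstance(src, (str, Path)):
        return ("from-path", str(src))                  # would open the file on disk
    if isinstance(src, (bytes, bytearray)):
        return ("from-bytes", io.BytesIO(src).read(4))  # wrap the blob like a file
    if hasattr(src, "read"):
        return ("from-file", src.read(4))               # already file-like, just read it
    raise TypeError(f"don't know how to make an image from {type(src)}")

print(image_create("bear.jpg"))                   # ('from-path', 'bear.jpg')
print(image_create(b"\x89PNG...."))               # ('from-bytes', b'\x89PNG')
print(image_create(io.BytesIO(b"\xff\xd8data")))  # ('from-file', b'\xff\xd8da')
```

The upload widget's data property hands you the bytes case, which is why you can feed it straight into the factory without saving a file first.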
00:23:25.520 | so then we can display it and
00:23:27.520 | There's our teddy, right? So you can see how, you know, cells of a Jupyter notebook can refer to other cells
00:23:36.120 | that kind of have
00:23:38.760 | GUI-created data in them.
00:23:41.320 | so let's hide that teddy away for a moment and
00:23:45.080 | the next thing to know about is that there's a kind of widget called output and an output widget is
00:23:53.720 | It's basically something that
00:23:55.720 | You can fill in later, right? So if I delete actually
00:24:01.040 | This part here, so I've now got an output widget
00:24:05.980 | Yeah, actually let's do it this way around
00:24:13.360 | You can't see the output widget even though I said please display it because nothing is output
00:24:17.920 | So then in the next cell I can say with that output placeholder display a thumbnail of the image
00:24:25.240 | And you'll see that the the display will not appear here
00:24:28.760 | It appears back here
00:24:31.800 | right, because that's where the placeholder is.
00:24:36.800 | So let's run that again to clear out that placeholder
00:24:40.240 | So we can create another kind of placeholder, which is a label. A label is kind of
00:24:47.480 | something where you can put text in it. You can give it a value like, I
00:24:51.400 | don't know, "Please choose an image".
00:24:55.660 | Okay, so we've now got a label containing. Please choose an image. Let's create another button to do a classification
00:25:03.440 | Now this is not a file upload button. It's just a general button. So this button doesn't do anything
00:25:08.920 | All right, it doesn't do anything until we attach an event handler to it an event handler is
00:25:17.600 | A callback we'll be learning all about callbacks in this course
00:25:20.440 | If you've ever done any GUI programming before or even web programming you'll be familiar with the idea that you
00:25:27.400 | Write a function which is the thing you want to be called when the button is clicked on and then somehow you tell your framework
00:25:35.320 | that this is the on_click event. So here I go, here's my btn_run; I say btn_run's on_click event is
00:25:45.520 | to call this code, and this code is going to do all the stuff we just saw: it's going to create an image from the upload,
00:25:51.600 | It's going to clear the output display the image
00:25:54.960 | Call predict and then replace the label with a prediction
00:26:00.680 | So there it all is now so that hasn't done anything but I can now go back to this classify button
00:26:07.320 | which now has an event handler attached to it. So watch this:
00:26:09.840 | click,
00:26:12.440 | boom! And look, that's been filled in, that's been filled in, right? In case you missed it,
00:26:18.680 | Let's run these again to clear everything out
00:26:20.680 | Okay, everything's gone
00:26:23.800 | This is please choose an image there's nothing here I click classify oh
00:26:31.360 | pop pop
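ipywidgets does the real work in the notebook; as a self-contained illustration of the callback pattern being described, here is a stripped-down toy Button (everything in it is invented for the sketch): you write a handler function, register it via on_click, and the framework calls it when the click happens.

```python
# Toy illustration of the event-handler/callback pattern (not ipywidgets).

class ToyButton:
    def __init__(self, description):
        self.description = description
        self._handlers = []

    def on_click(self, fn):
        """Register a callback, like ipywidgets' Button.on_click."""
        self._handlers.append(fn)

    def click(self):
        """The framework calls this when the user actually clicks."""
        for fn in self._handlers:
            fn(self)

label_text = "Please choose an image"

def on_click_classify(btn):
    """Our event handler: in the real app this runs predict and updates the label."""
    global label_text
    label_text = "Prediction: grizzly; Probability: 1.0000"

btn_run = ToyButton("Classify")
btn_run.on_click(on_click_classify)  # until this line, clicking does nothing
btn_run.click()
print(label_text)  # Prediction: grizzly; Probability: 1.0000
```

The key point is the inversion of control: you never call on_click_classify yourself; you hand it over, and the GUI framework invokes it in response to the user.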
00:26:34.840 | Right. So it's kind of amazing how our notebook has suddenly turned into this
00:26:42.080 | interactive prototyping playground building applications and so once all this works
00:26:47.200 | We can dump it all together. And so
00:26:51.000 | The easiest way to dump things together is to create a VBox.
00:26:57.400 | A VBox is a vertical box, and it's just something that you put widgets in. And so in this case
00:27:02.840 | We're going to put the following widgets
00:27:04.200 | we're going to have a label that says select your bear then an upload button a run button an output placeholder and a
00:27:11.280 | label for predictions
00:27:13.120 | But let's run these again just to clear everything out
00:27:15.400 | So that we're not cheating
00:27:18.440 | And let's create our V box so as you can see it's just got all the
00:27:26.120 | All the pieces
00:27:29.520 | right
00:27:31.760 | Now we've got
00:27:35.240 | Oh, I accidentally ran the thing that displayed the bear let's get rid of that
00:27:42.840 | Okay, so there it is, so now I can click upload I can choose my bear
00:27:52.040 | Okay, and then I can click classify
00:27:56.840 | Right, and notice these are exactly the same buttons as these buttons;
00:28:04.160 | they're like two places where we're viewing the same button, which is kind of a wild idea.
00:28:09.000 | So if I click classify, it's going to change this label and
00:28:12.680 | this label, because they're actually both references to the same label. Look,
00:28:17.840 | There we are, okay, so
00:28:22.480 | This is our app, right? And so this is actually how I built
00:28:28.600 | that image cleaner GUI: it's just using these exact things. And I built that image cleaner GUI
00:28:38.360 | cell by cell in a notebook just like this. And so you get this kind of interactive,
00:28:43.760 | experimental framework for building a GUI.
00:28:47.280 | So if you're a data scientist who's never done GUI stuff before,
00:28:50.760 | This is a great time to get started because now you can you can make actual programs
00:28:56.840 | Now of course an actual program
00:28:59.400 | Running inside a notebook is kind of cool. But what we really want is this program to run
00:29:05.320 | In a place anybody can run it
00:29:08.840 | That's where voila comes in. So voila
00:29:11.560 | needs to be installed, so you can just run these lines to install it;
00:29:18.440 | it's listed in the prose.
00:29:23.240 | and what voila does is it takes a notebook and
00:29:28.440 | Doesn't display anything except for the markdown
00:29:33.720 | The IPython widgets and the outputs
00:29:37.440 | Right. So all the code cells disappear and it doesn't give the person looking at that page the ability to run their own code. They can only
00:29:45.360 | Interact with the widgets, right? So what I did
00:29:49.680 | was I copied and pasted that code
00:29:53.440 | From the notebook into a separate notebook, which only has
00:29:57.400 | Those lines of code
00:30:01.200 | right, so
00:30:05.560 | So these are just the same lines of code that we saw before
00:30:12.800 | So this is a notebook. It's just a normal notebook
00:30:15.200 | And then I installed voila and then when you do that if you navigate to this notebook
00:30:24.320 | But you replace
00:30:26.920 | Notebooks up here with
00:30:33.720 | voila
00:30:35.560 | it actually displays not the notebook, but
00:30:38.440 | Just as I said the markdown and the widgets though here I've got
00:30:45.480 | My bear classifier and I can click upload. Let's do a grizzly bear this time
00:30:51.680 | And this is a slightly different version I actually made this so there's no classified button
00:31:00.160 | I thought it would be a bit more fancy to make it so when you click upload it just runs everything
00:31:04.040 | But as you can see there it all is
00:31:06.440 | Right. It's all working. So
00:31:08.440 | This is the world's
00:31:11.560 | Simplest prototype, but it's it's a proof of concept, right? So you can add
00:31:16.520 | widgets with
00:31:18.800 | dropdowns and sliders and charts and you know everything that you can have in a
00:31:25.040 | you know, an Angular app or a React app or whatever. And in fact, there's even
00:31:30.320 | Stuff which lets you use for example the whole Vue.js framework if you know that it's a very popular
00:31:36.720 | JavaScript framework the whole Vue.js framework you can actually use it in
00:31:41.040 | widgets and voila
00:31:44.280 | So now we want to get it so that this
00:31:47.000 | this app
00:31:49.560 | can be run by
00:31:51.480 | Someone out there in the world. So the voila documentation shows a few ways to do that, but perhaps the easiest one
00:31:58.000 | is to use a system called binder
00:32:01.680 | And so binder is at mybinder.org and
00:32:07.360 | All you do is you paste in your github repository name here, right? And this is all in the book, right?
00:32:13.640 | So you
00:32:17.160 | paste in your github repo name
00:32:19.720 | You change where it says
00:32:21.720 | 'File', you change that to 'URL'
00:32:25.280 | You can see and then you put in the path which we were just experimenting with
00:32:33.480 | Right
00:32:36.880 | So pop that here and then you say launch and what that does is it then gives you a URL
00:32:43.280 | So then this URL
00:32:46.520 | You can pass on
00:32:48.760 | to people and this is actually your
00:32:51.880 | Interactive running application. So binders free and so this isn't you know
00:32:57.680 | Anybody can now use this to take their voila app and make it a publicly available web application
00:33:04.920 | So try it. As it mentions here, the first time you do this, binder takes about five minutes to build your site
00:33:14.000 | Because it actually uses something called Docker to deploy the whole fast AI framework and Python and blah blah blah
00:33:20.440 | But once you've done that
00:33:23.320 | That virtual machine will keep running for a while, as long as people are using it
00:33:27.800 | and you know, it's
00:33:37.880 | reasonably fast
00:33:40.480 | So a few things to note here
00:33:43.000 | Being a free service. You won't be surprised to hear this is not using a GPU. It's using a CPU
00:33:48.040 | And so that might be surprising but we're deploying to something which runs on a CPU
00:33:55.600 | When you think about it though, this makes much more sense to deploy to a CPU than a GPU
00:34:09.480 | Just a moment
00:34:13.200 | Um, the thing that's happening here is that I am
00:34:17.200 | passing along (let's go back to my app): in my app, I'm passing along a single image at a time
00:34:24.040 | So when I pass along that single image, I don't have a huge amount of parallel work for a GPU to do
00:34:30.640 | This is actually something that a CPU is going to be doing more efficiently
00:34:34.560 | So we found that for folks coming through this course
00:34:41.000 | The vast majority of the time they wanted to deploy
00:34:44.200 | Inference on a CPU not a GPU because they're normally just doing one
00:34:49.000 | item at a time
00:34:51.800 | It's way cheaper and easier to deploy to a CPU
00:34:55.600 | And the reason for that is that you can just use any hosting service you like because, just remember, this is
00:35:03.320 | just a program at this point, right?
00:35:07.920 | And you can use all the usual horizontal scaling vertical scaling, you know, you can use Heroku you can use AWS
00:35:14.520 | You can use inexpensive instances
00:35:17.160 | Super cheap and super easy
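As a concrete sketch, single-item CPU inference in plain PyTorch might look like this; the tiny model here is a hypothetical stand-in for your exported classifier, not the course's actual model.

```python
import torch
from torch import nn

# Hypothetical stand-in for the trained classifier you exported
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 3))
model.eval()                          # inference mode: no dropout/batchnorm updates

img = torch.rand(3, 64, 64)           # one image, as a single web request would carry
with torch.no_grad():                 # no gradients needed, so CPU inference stays cheap
    logits = model(img.unsqueeze(0))  # batch of size 1: too little parallel work for a GPU
probs = logits.softmax(dim=1)         # class probabilities for this one image
```

Because each request is a batch of one, there is no parallel work to amortize a GPU's overhead, which is why CPU serving is usually the cheaper choice here.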
00:35:21.080 | Having said that there are times you might need to deploy to a GPU
00:35:25.200 | For example, maybe you're processing videos, and a single video might take all day to process on a CPU
00:35:37.760 | You might be so successful that you have a thousand requests per second
00:35:41.960 | In which case you could like take 128 at a time
00:35:45.240 | Batch them together and put the whole batch on the GPU and get the results back and pass them back around
00:35:50.680 | I mean you've got to be careful of that right because
00:35:53.840 | because if your requests aren't coming fast enough, your user has to wait for a whole batch of people to be ready
00:36:00.880 | to be processed
00:36:04.400 | But you know conceptually
00:36:06.400 | As long as your site is popular enough that could work
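The batching idea described here could be sketched like this; all the names and the toy model are assumptions for illustration, not a production serving system.

```python
import torch
from torch import nn

def make_batches(requests, batch_size=128):
    # group queued request tensors into fixed-size batches; the last may be smaller
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2))   # toy stand-in model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

queued = [torch.rand(3, 8, 8) for _ in range(300)]   # pretend 300 requests arrived
results = []
for batch in make_batches(queued, batch_size=128):
    xb = torch.stack(batch).to(device)               # one big tensor gives the GPU parallel work
    with torch.no_grad():
        preds = model(xb).argmax(dim=1)
    results.extend(preds.cpu().tolist())             # fan results back out to the requesters
```

The trade-off Jeremy mentions shows up in the waiting: if traffic is slow, the first request in a batch sits in the queue until the batch fills (or a timeout fires).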
00:36:10.400 | The other thing to talk about is you might want to deploy to a mobile phone
00:36:19.440 | Deploying to a mobile phone our recommendation is wherever possible
00:36:23.320 | Do that by actually deploying to a server and then have a mobile phone talk to the server over a network
00:36:30.280 | And because if you do that
00:36:32.920 | Again, you can just use a normal pytorch program on a normal server and normal network calls. It makes life super easy
00:36:39.580 | When you try to run a pytorch app on a phone
00:36:44.480 | You're suddenly now not in an environment where pytorch will run natively, and so you'll have to convert
00:36:51.560 | your program into some other form and
00:36:54.600 | There are other forms, and the main form that you convert it to is something called
00:36:59.920 | ONNX, which is specifically designed for
00:37:02.360 | kind of
00:37:05.040 | super high-speed, high-performance
00:37:07.280 | you know, an
00:37:10.160 | approach that can run on both servers or on mobile phones
00:37:13.440 | It does not require the whole
00:37:16.320 | Python and pytorch kind of
00:37:19.280 | runtime in place
00:37:22.400 | but it's much more complex, and
00:37:27.000 | it's harder to debug, it's harder to set up, and it's harder to maintain. So
00:37:33.200 | if possible keep things simple and
00:37:37.280 | If you're lucky enough that you're so successful that you need to scale it up to GPUs or and stuff like that
00:37:43.520 | then great, you know, hopefully you've got the
00:37:46.640 | finances at that point to justify, you know, spending money on an ONNX expert or serving expert or whatever
00:37:56.440 | and there are various
00:37:58.040 | systems you can use, like ONNX Runtime and AWS SageMaker, where you can kind of say, here's my ONNX
00:38:03.540 | bundle, and it'll serve it for you or whatever. PyTorch also has a mobile framework, same idea
00:38:15.120 | Alright, so you've got I mean, it's kind of funny. We're talking about two different kinds of deployment here
00:38:19.640 | one is deploying like a
00:38:21.560 | Hobby application, you know, that you're prototyping, showing off to your friends, explaining to your colleagues how something might work
00:38:27.960 | You know a little interactive analysis. That's one thing. Well, but maybe you're actually prototyping something that you're
00:38:34.000 | Want to turn into a real product
00:38:36.960 | Or an actual real part of your company's
00:38:40.200 | operations
00:38:42.320 | when you're deploying
00:38:44.320 | You know something in in real life, there's all kinds of things you got to be careful of
00:38:51.360 | One example is something to be careful of is let's say you did exactly what we just did
00:38:56.760 | Which actually, this is your homework: to create your own
00:39:01.000 | application, right? I want you to create your own image search application. You can use
00:39:06.200 | My exact set of widgets and whatever if you want to but better still go to the IPY widgets website and see what other widgets
00:39:14.320 | They have and try and come up with something cool
00:39:17.520 | Try and come at you know, try and show off as best as you can then show us on the forum
00:39:21.400 | Now let's say you decided
00:39:24.360 | That you want to create an app that would help
00:39:29.080 | The users of your app decide if they have healthy skin or unhealthy skin
00:39:34.360 | So if you did the exact thing we just did rather than searching for grizzly bear and teddy bear and so forth on
00:39:40.000 | Bing you would search for healthy skin and unhealthy skin, right? So here's what happens, right?
00:39:47.040 | And remember, in our version we never actually looked at Bing, we just used the Bing API, the image search API
00:39:54.480 | But behind the scenes, it's just using the website
00:39:57.240 | right, so if I type healthy skin and say search, I
00:40:02.400 | Actually discover that the definition of healthy skin is
00:40:06.960 | Young white women
00:40:10.720 | touching their face lovingly
00:40:13.640 | so that's what your
00:40:16.080 | Your healthy skin classifier would learn to detect
00:40:19.860 | right, and so
00:40:22.560 | So this is a great example from Deb Raji, and you should check out her paper Actionable Auditing
00:40:28.880 | for lots of cool insights about model bias, but I mean here's here's like a
00:40:35.440 | Fascinating example of how if you weren't looking at your data carefully
00:40:43.840 | you end up
00:40:43.840 | With something that doesn't at all actually solve the problem you want to solve
00:40:47.840 | This is
00:40:52.160 | This is tricky right because
00:40:55.000 | The data that you train your algorithm on if you're building like a new product that didn't exist before by definition
00:41:03.240 | You don't have examples of the kind of data that's going to be used in real life
00:41:07.400 | Right, so you kind of try to find some from somewhere, and if you do that through like a Google search
00:41:14.740 | Pretty likely you're not going to end up with a
00:41:17.800 | Set of data that actually reflects the kind of mix you would see in real life
00:41:27.560 | You know, the main thing here is to say: be careful. And in particular, for your test set
00:41:33.880 | You know that final set that you check on
00:41:36.840 | Really try hard to gather data that that reflects
00:41:40.280 | The real world so like just you know for example for the healthy skin example
00:41:45.000 | You might go and actually talk to a dermatologist and try and find like 10 examples of healthy and unhealthy skin or something
00:41:51.620 | And that would be your kind of gold standard test
00:41:54.880 | There's all kinds of issues you have to think about in deployment I can't cover all of them
00:42:03.840 | I can tell you that this O'Reilly book called building machine learning powered applications
00:42:10.120 | Is is a great?
00:42:12.800 | resource
00:42:14.560 | And this is one of the reasons we don't go into detail
00:42:17.880 | about
00:42:19.600 | A/B testing, and when should we refresh our data, and how do we monitor things, and so forth, is because
00:42:26.160 | That book's already been written. So we don't want to
00:42:28.480 | Rewrite it
00:42:33.680 | I do want to mention a particular area that I care a lot about, though
00:42:37.960 | Which is
00:42:42.000 | Let's take this example
00:42:43.800 | Let's say you're rolling out this bear detection system and it's going to be attached to video cameras around a campsite
00:42:49.500 | It's going to warn campers of incoming bears. So if we used a model
00:42:54.120 | That was trained with that data that we just looked at
00:42:57.160 | You know those are all
00:43:00.360 | Very nicely taken pictures of pretty perfect bears, right?
00:43:04.560 | There's really no relationship to the kinds of pictures
00:43:08.960 | You're actually going to be dealing with in your campsite bear detector, which is going to have video and not images
00:43:15.040 | It's going to be nighttime. It's going to be probably low resolution
00:43:18.440 | security cameras
00:43:21.080 | You need to make sure that the performance of the system is fast enough to tell you about it before the bear kills you
00:43:29.040 | You know, there will be bears that are partially obscured by bushes or in lots of shadow or whatever
00:43:34.240 | None of which are the kinds of things you would see normally in like internet pictures
00:43:38.240 | So we call this out-of-domain data. Out-of-domain data refers to a situation where
00:43:45.600 | The data that you are trying to do inference on is in some way different to the kind of data
00:43:52.680 | That you trained with
00:43:55.080 | and this is actually
00:43:58.360 | There's no perfect way to answer this question and when we look at ethics we'll talk about some
00:44:04.200 | really helpful ways to
00:44:07.520 | Minimize how much this happens for example it turns out that having a diverse team is a great way to
00:44:14.880 | Kind of avoid being surprised by the kinds of data that people end up coming up with
00:44:20.720 | But really it's just something you've got to be
00:44:24.320 | super thoughtful about
00:44:28.680 | Very similar to that is something called
00:44:30.680 | Domain shift and domain shift is where maybe you start out with all of your data is in domain data
00:44:36.880 | But over time the kinds of data that you're seeing
00:44:40.480 | Changes and so over time maybe
00:44:43.760 | raccoons start invading your campsite and you
00:44:48.000 | Weren't training on raccoons before it was just a bear detector and so that's called domain shift
00:44:53.280 | And that's another thing that you have to be very careful of Rachel. What's your question?
00:44:57.960 | No, I was just going to add to that in saying that
00:45:00.300 | all data is biased, so there's not, you know, some form of debiased data or
00:45:06.880 | perfectly representative in all cases data and that a lot of the
00:45:11.200 | proposals around addressing this have kind of been converging to this idea and that you see in papers like Timnit Gebru's
00:45:17.840 | Datasheets for Datasets, of just writing down a lot of the
00:45:23.720 | Details about your data set and how it was gathered and in which situations it's appropriate to use and how it was maintained
00:45:30.000 | And so there that's not that you've totally eliminated bias
00:45:34.440 | But that you're just very aware of the attributes of your data set so that you won't be blindsided by them later
00:45:39.700 | And there have been kind of several
00:45:42.280 | proposals in that school of thought which I which I really like around this idea of just kind of
00:45:47.800 | Understanding how your data was gathered and what its limitations are
00:45:51.960 | Thanks Rachel
00:45:54.800 | So a key problem here is that you can't know the entire behavior of your neural network
00:46:03.680 | With normal programming you typed in the if statements and the loops and whatever so in theory
00:46:11.240 | You know what the hell it does, although it's still sometimes surprising. In this case, you didn't tell it anything, you just gave it
00:46:18.000 | Examples to learn from and hope that it learns something useful
00:46:21.560 | There are hundreds of millions of parameters in a lot of these neural networks
00:46:25.940 | And so there's no way you can understand how they all combine with each other to create complex behavior
00:46:31.280 | so really like there's a natural compromise here is that we're trying to
00:46:35.560 | Get sophisticated behavior so it's like like recognizing pictures
00:46:42.920 | Behavior sophisticated enough that we can't describe it
00:46:46.400 | And so the natural downside is you can't expect the process that the thing is using to do that to be
00:46:52.300 | describable, for you to be able to understand it. So
00:46:55.720 | Our recommendation for kind of dealing with these issues is a very careful
00:47:00.800 | Deployment strategy which I've summarized in this little graph this little chart here
00:47:06.040 | the idea would be
00:47:08.880 | first of all
00:47:10.360 | Whatever it is that you're going to use the model for start out by doing it manually
00:47:14.880 | So have a have a park ranger
00:47:16.880 | watching for bears
00:47:19.360 | Have the model running next to them and each time the park ranger sees a bear
00:47:24.320 | They can check the model and see, like, did it seem to have picked it up?
00:47:28.000 | So the model is not doing anything. There's just a person who's like running it and seeing would it have made sensible choices
00:47:35.280 | And once you're confident that it makes sense that what it's doing seems reasonable
00:47:41.480 | You know it's been as close to the real life situation as possible
00:47:45.640 | Then deploy it in a time and geography limited way
00:47:52.160 | so pick like one campsite not the entirety of California and do it for you know one day and
00:48:00.000 | Have somebody watching it super carefully, right?
00:48:03.960 | So now the basic bear detection is being done by the bear detector
00:48:08.320 | But there's still somebody watching it pretty closely and it's only happening in one campsite for one day
00:48:13.400 | And so then as you say like okay
00:48:15.400 | We haven't
00:48:18.240 | destroyed our company yet
00:48:19.920 | But let's do two campsites for a week
00:48:21.920 | And then let's do you know the entirety of Marin for a month and so forth. So this is actually what we did when I
00:48:29.160 | used to
00:48:31.280 | Be at this company called optimal decisions
00:48:34.400 | optimal decisions was a company that I founded to do insurance pricing and if you
00:48:39.240 | If you change insurance prices by you know a percent or two in the wrong direction in the wrong way
00:48:46.360 | You can basically destroy the whole company. This has happened many times, you know insurers are companies
00:48:53.320 | That set prices that's basically the the product that they provide
00:48:58.320 | So when we deployed new prices for optimal decisions, we always did it by like saying like, okay
00:49:04.940 | We're going to do it for like five minutes or everybody whose name ends with a D, you know
00:49:11.360 | So we'd kind of try to find some
00:49:13.360 | Group which hopefully would be fairly, you know
00:49:16.960 | different, but not too many of them, and we'd gradually scale it up. And you've got to make sure that when you're
00:49:22.800 | doing this that you have a lot of
00:49:25.680 | Really good reporting systems in place that you can recognize
00:49:28.920 | Are your customers yelling at you? Are your computers burning up?
00:49:33.920 | You know, are your
00:49:37.200 | costs spiraling out of control, and so forth? So it really requires great
00:49:47.480 | Reporting systems
00:49:52.080 | Does fast AI have methods built in that provide for incremental learning, i.e. improving the model slowly over time with a single data point each time?
00:50:00.360 | Yeah, that's a great question. So
00:50:03.480 | This is a little bit different which is this is really about
00:50:07.120 | Dealing with domain shift and similar issues by continuing to train your model as you do inference. And so the good news is
00:50:15.040 | You don't need anything special for that
00:50:18.760 | It's basically just a transfer learning problem. So you can do this in many different ways
00:50:24.360 | Probably the easiest is just to say, like, okay, each night
00:50:31.080 | You know at midnight we're going to set off a task which
00:50:36.560 | Grabs all of the previous day's transactions as mini batches and trains another epoch
00:50:44.000 | And so yeah that that actually works fine. You can basically think of this as a
00:50:50.280 | Fine tuning approach where your pre-trained model is yesterday's model and your fine-tuning data is today's data
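That nightly job could be sketched in plain PyTorch like this; all the names and the toy data are assumptions, and with fastai you would do the equivalent with a Learner, but the structure is the same: yesterday's model is the pretrained starting point, yesterday's transactions are the fine-tuning data.

```python
import torch
from torch import nn

def nightly_update(model, day_batches, lr=1e-3):
    # one pass over the previous day's data = one extra epoch of fine-tuning
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for xb, yb in day_batches:
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.eval()          # back to inference mode for tomorrow's serving
    return model

# toy stand-ins: a tiny model and two mini-batches of "yesterday's" data
model = nn.Linear(4, 2)
batches = [(torch.rand(8, 4), torch.randint(0, 2, (8,))) for _ in range(2)]
model = nightly_update(model, batches)
```

In practice you would load yesterday's saved weights before the loop and save them again after, so each night's model becomes the next night's starting point.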
00:50:58.120 | So as you roll out your model
00:51:03.160 | One thing to be thinking about super carefully is that it might change the behavior of the system that it's a part of
00:51:12.200 | And this can create something called a feedback loop and feedback loops are one of the most challenging things for
00:51:18.360 | For real-world model deployment particularly of machine learning models
00:51:22.840 | Because they can take a very minor issue and explode it into a really big issue
00:51:31.840 | so for example think about a
00:51:35.120 | predictive policing algorithm
00:51:38.040 | It's an algorithm that was trained to recognize
00:51:42.040 | you know
00:51:48.120 | Basically trained on data that says whereabouts arrests are being made
00:51:48.120 | And then as you train that algorithm based on where arrests are being made
00:51:54.200 | Then you put in place a system that
00:51:58.820 | sends police officers to places that the model says are likely to have crime, which in this case
00:52:06.920 | is where there were arrests
00:52:10.560 | So then more police go to that place
00:52:12.560 | Find more crime because the more police that are there the more they'll see they arrest more people
00:52:19.540 | Causing, you know, and then if you do this incremental learning like we're just talking about then it's going to say
00:52:24.400 | Oh, there's actually even more crime here. And so tomorrow it sends even more police
00:52:28.600 | And so in that situation you end up like the predictive policing algorithm ends up kind of sending all of your police
00:52:36.800 | For one street block because at that point all of the arrests are happening there because that's the only place you have policemen
00:52:42.940 | Right. I should say police officers
00:52:45.040 | So there's actually a paper about
00:52:48.200 | This issue called To Predict and Serve?, and in it the authors write this really nice phrase
00:52:57.000 | Predictive policing is aptly named. It is predicting policing not predicting
00:53:02.880 | crime, so
00:53:05.800 | if the initial model was
00:53:07.800 | Perfect, whatever the hell that even means but like it somehow sent police to exactly
00:53:14.080 | the best places to find crime, based on the probability of crime actually being in a place. I
00:53:21.320 | Guess there's no problem, right?
00:53:24.920 | But as soon as there's any amount of bias right so for example in the US
00:53:33.680 | There's a lot more arrests
00:53:35.680 | Of black people than of white people even for crimes where black people and white people are known to do them the same amount
00:53:43.820 | So in the presence of this bias
00:53:46.720 | Or any kind of bias
00:53:49.480 | You're kind of like setting off this this domino chain of feedback loops where that bias will be
00:53:57.160 | exploded
00:53:59.720 | over time
00:54:03.760 | You know one thing I like to think about is to think like well, what would happen if this?
00:54:08.320 | If this model was just really really really good
00:54:12.640 | So like who would be impacted, you know, what would this extreme result look like?
00:54:18.320 | How would you know what was really happening, if this incredibly predictive algorithm was
00:54:23.380 | changing the behavior of your police officers or whatever, you know? What would that look like? What would actually happen?
00:54:32.160 | And then like think about like, okay
00:54:34.560 | What could go wrong and then what kind of rollout plan what kind of monitoring systems what kind of oversight?
00:54:40.260 | Could provide the circuit breaker, because that's what we really need here
00:54:45.560 | Right is we need like nothing's going to be perfect. You can't
00:54:48.800 | Be sure that there's no feedback loops
00:54:51.680 | But what you can do is try to be sure that you see when the behavior of your system is behaving in a way
00:54:58.640 | That's not what you want
00:55:01.640 | Did you have anything to add to that Rachel?
00:55:03.640 | All I would add to that is that you're at
00:55:07.840 | risk of potentially having a feedback loop
00:55:10.240 | anytime that your model is kind of controlling what your next round of data looks like and I think that's true for pretty much all
00:55:17.240 | products, and that can be, I think, a hard jump for people coming from kind of a science background, where you may be thinking of
00:55:24.720 | Data as I have just observed some sort of experiment whereas kind of whenever you're, you know
00:55:30.440 | Building something that interacts with the real world
00:55:32.440 | You are now also controlling what your future data looks like, based on the behavior of your algorithm for the current round of
00:55:41.360 | right, so
00:55:43.360 | So given that you probably can't avoid feedback loops
00:55:48.880 | That you know the the thing you need to then really invest in is the human in the loop
00:55:54.280 | And so a lot of people like to focus on automating things, which I find weird
00:55:59.480 | You know if you can decrease the amount of human involvement by like 90%
00:56:03.280 | You've got almost all of the economic upside of automating it completely
00:56:07.440 | But you still have the room to put human circuit breakers in place. You need these appeals processes
00:56:12.560 | You need the monitoring you need, you know humans involved to kind of go
00:56:17.720 | Hey, that's that's weird. I don't think that's what we want
00:56:23.600 | Yes, Rachel and I just want more note about that those humans though do need to be integrated well with
00:56:30.000 | kind of product and engineering and so one issue that comes up is that in many companies I think that
00:56:36.880 | ends up kind of being underneath trust and safety; trust and safety handles a lot of sort of issues with how things can go wrong or how your
00:56:43.360 | Platform can be abused and often trust and safety is pretty siloed away from
00:56:48.720 | product and eng, which actually kind of has the control over, you know,
00:56:52.920 | these decisions that really end up influencing them. And the engineers probably consider them to be pretty annoying a lot
00:56:59.880 | of the time, how they get in the way of them getting software out the door
00:57:04.440 | But like the kind of the more integration you can have between those I think it's helpful for the kind of the people
00:57:09.080 | Building the product to see what is going wrong and what can go wrong if the engineers are actually on top of that
00:57:15.240 | They're actually seeing these these things happening that it's not some kind of abstract problem anymore
00:57:20.600 | So, you know at this point now that we've got to the end of chapter two
00:57:24.600 | You actually know a lot more than most people about
00:57:32.000 | About deep learning and actually about some pretty important foundations of machine learning more generally and of data products more generally
00:57:39.720 | So now's a great time to think about
00:57:42.560 | writing
00:57:47.640 | Sometimes we have a
00:57:50.560 | formatted text that doesn't quite format correctly in Jupyter notebook, by the way
00:57:54.360 | It only formats correctly in the book. So that's what it means when you see this kind of pre-formatted text
00:58:03.760 | The the idea here is to think about
00:58:08.380 | Starting writing at this point, before you go too much further. Rachel?
00:58:16.560 | There's a question. Okay, let's get the question
00:58:20.920 | The question is: I assume there are fast AI-type ways of keeping a nightly-updated transfer learning setup
00:58:29.040 | Could one of the fast AI version 4 notebooks have an example of the nightly transfer learning training?
00:58:35.280 | Like the previous person asked I would be interested in knowing how to do that most effectively with fast AI
00:58:41.360 | Sure. So I guess my view is there's nothing fastai-specific about that at all
00:58:46.880 | So I actually suggest you read Emmanuel's book that book I showed you to understand the kind of the ideas
00:58:53.120 | And if people are interested in this I can also point to it some academic research about this as well
00:59:00.680 | And there's not as much as there should be
00:59:00.680 | But there is some there is some good work in this area
00:59:03.820 | Okay, so the reason we mentioned writing at this point in our journey is because
00:59:13.600 | You know things are going to start to get more and more heavy more and more complicated and
00:59:20.280 | A really good way to make sure that you're on top of it is to try to write down what you've learned
00:59:27.560 | So, sorry, I wasn't sharing the right part of the screen before but this is what I was describing in terms of the
00:59:32.040 | Pre-formatted text which doesn't look correct
00:59:42.760 | Rachel actually has this great article that you should check out which is why you should blog and
00:59:54.800 | And I'll say it sort of for her, because I have it in front of me and she doesn't
00:59:54.800 | Weird as it is. So Rachel says
00:59:56.960 | That the top advice she would give her younger self is to start blogging sooner
01:00:02.120 | So Rachel has a math PhD, and this kind of idea of blogging was not exactly something
01:00:07.720 | I think they had a lot of in the PhD program
01:00:11.060 | but actually it's like it's a really great way of
01:00:15.240 | Finding jobs. In fact, most of my students who have got the best jobs are students that have
01:00:23.240 | good blog posts
01:00:25.480 | The thing I really love is that it helps you learn
01:00:27.600 | by writing it down, it kind of synthesizes your ideas and
01:00:32.700 | Yeah, you know, there's lots of reasons to blog. So there's actually
01:00:39.400 | Something really cool. I want to show you
01:00:44.240 | I was also just going to note I have a second post called Advice for Better Blog Posts
01:00:49.560 | That's a little bit more advanced which I'll post a link to as well
01:00:53.560 | And that talks about some common pitfalls that I've seen in many in many blog posts and kind of the importance of putting
01:01:00.560 | Putting the time in to do it. Well and some things to think about so I'll share that post as well. Thanks Rachel
01:01:06.160 | so one reason that sometimes people
01:01:08.560 | Blog is because it's kind of annoying to figure out how to
01:01:12.840 | particularly because I think the thing that a lot of you will want to blog about is
01:01:18.160 | Cool stuff that you're building in Jupyter notebooks. So we've actually teamed up with a guy called Hamel Husain
01:01:25.320 | and with github to create this
01:01:29.720 | free product
01:01:32.120 | As usual with fast AI no ads. No anything called fast pages where you can actually blog
01:01:38.760 | with Jupyter notebooks and so
01:01:42.480 | You can go to fast pages and see for yourself how to do it
01:01:46.200 | But the basic idea is that like you literally click one button
01:01:50.960 | It sets up a blog for you and then you dump your notebooks
01:01:57.080 | into a
01:01:59.160 | Folder called underscore notebooks and they get turned into
01:02:02.440 | blog posts. It's basically like magic, and Hamel's done this amazing job of this, and so
01:02:11.240 | This means that you can create blog posts where you've got charts and tables and images
01:02:17.280 | You know where they're all actually the output of Jupyter notebook along with all the the markdown
01:02:24.280 | Formatted text headings and so forth and Piper links and the whole thing
01:02:29.960 | So this is a great way to start writing about what you're learning about here
01:02:37.480 | So something that Rachel and I both feel strongly about when it comes to blogging is this which is
01:02:44.120 | Don't try to think about the absolute most advanced thing
01:02:51.560 | You know and try to write a blog post that would impress
01:02:54.320 | Geoff Hinton, right? Because most people are not Geoff Hinton
01:02:59.000 | So, like, (a) you probably won't do a good job, because you're trying to
01:03:03.400 | blog for somebody who's got more expertise than you, and (b)
01:03:07.800 | You've got a small audience now, right?
01:03:11.280 | Actually, there's far more people that are not very familiar with deep learning than people who are
01:03:16.200 | Instead, try to think, you know, you really understand
01:03:19.800 | what it was like six months ago to be you, because you were there six months ago
01:03:24.560 | So try and write something which the six months ago version of you
01:03:28.200 | would have been, like, super interesting, full of little tidbits you would have loved
01:03:33.320 | You know, that would have delighted
01:03:35.760 | that six-months-ago version of you
01:03:39.240 | Okay, so once again
01:03:42.640 | Don't move on until you've had a go at the questionnaire
01:03:45.800 | to make sure that you
01:03:48.760 | You know understand the key things we think that you need to understand
01:03:53.600 | And yeah, have a think about these further research questions as well because they might
01:03:58.760 | Help you to engage more closely with the material
01:04:02.440 | So, let's have a break and we'll come back in five minutes time
01:04:06.960 | So welcome back everybody
01:04:12.520 | This is a
01:04:15.600 | Interesting moment in the course because we're kind of jumping from
01:04:19.960 | the part of the course, which is you know, very heavily around kind of
01:04:27.040 | The kind of this the the the structure of like what are we trying to do with machine learning?
01:04:33.240 | And what are the kind of the pieces and what do we need to know?
01:04:35.680 | To make everything kind of work together
01:04:38.920 | There was a bit of code but not masses. There was basically no math
01:04:47.640 | We kind of wanted to put that at the start for everybody who's not
01:04:51.000 | You know who's kind of wanting to an understanding of these issues
01:04:56.840 | without
01:04:58.360 | necessarily
01:04:59.960 | Wanting to kind of dive deep into the code and the math themselves and now we're getting into the diving deep part
01:05:06.000 | if you're not
01:05:09.280 | Interested in that diving deep yourself. You might want to skip to the next lesson about ethics
01:05:14.280 | which, you know, kind of rounds out the
01:05:18.200 | slightly less technical material
01:05:25.440 | So what we're going to look at here is we're going to look at
01:05:27.840 | What we think of as kind of a toy problem,
01:05:34.000 | Though just a few years ago it was considered a pretty challenging problem
01:05:37.240 | The problem is recognizing handwritten digits
01:05:41.000 | And we're going to try and do it
01:05:44.320 | from scratch
01:05:46.000 | Right. I'm going to try and look at a number of different ways to do it
01:05:48.960 | So we're going to have a look at a data set
01:05:53.520 | Called MNIST. And so if you've done any machine learning before you may well have come across MNIST it contains handwritten digits
01:06:00.760 | And it was collated into a machine learning data set by a guy called Yann LeCun
01:06:05.800 | and some colleagues, and they used that to demonstrate one of the,
01:06:10.280 | You know, probably the first computer systems to provide really practically useful scalable recognition of handwritten digits:
01:06:16.840 | LeNet-5. The system was actually used to
01:06:21.280 | Automatically process like 10% of the checks in the US
01:06:25.760 | So one of the things that really helps
01:06:31.680 | I think when building a new model is to kind of start with something simple and gradually scale it up. So
01:06:38.080 | We've created an even simpler version of MNIST which we call MNIST sample which only has threes and sevens
01:06:45.160 | Okay, so this is a good starting point to make sure that we can kind of do something easy
01:06:50.120 | I picked threes and sevens for MNIST sample because they're very different. So I feel like if we can't do this
01:06:55.240 | We're going to have trouble recognizing every digit
01:06:57.400 | So step one is to call untar_data. untar_data is the fastai
01:07:06.000 | Function which takes a URL
01:07:09.760 | Checks whether you've already downloaded it if you haven't it downloads it checks whether you've already
01:07:16.680 | Uncompressed it if you haven't it uncompresses it and then it finally returns the path of where that ended up
01:07:22.040 | So you can see here
01:07:24.600 | is he
01:07:26.880 | URLs dot MNIST sample
01:07:29.160 | So you could just hit tab to get autocomplete
01:07:34.160 | It's just some location, right? Doesn't really matter where it is. And so then when we
01:07:45.240 | Call that I've already downloaded it and already uncompressed it because I've already run this once before so it happens straight away and
01:07:51.560 | so path tells me
01:07:54.240 | Where it is. Now in this case path is dot, and the reason path is dot is because I've used a special BASE_PATH
01:08:02.600 | attribute on Path
01:08:04.880 | to tell it kind of like where's my
01:08:06.880 | Where's my starting point, you know, and and that's used to print
01:08:11.160 | So when I go here LS which prints a list of files, these are all relative to
01:08:16.520 | Where I actually untarred this. This just makes it a lot easier not to have to see the whole
01:08:22.180 | Set of parent path folders
01:08:24.920 | ls is actually... so, path is a... let's
01:08:34.520 | See what kind of type it is. Oh,
01:08:38.920 | It's a pathlib Path object
01:08:41.900 | pathlib is part of the Python standard library. It's a really very, very nice library, but it doesn't actually have ls
01:08:50.820 | Where there are libraries that we find super helpful, but they don't have exactly the things we want
01:08:56.440 | We liberally add the things we want to them. So we added ls
01:09:02.040 | So if you want to find out what LS is
01:09:08.760 | You know, there's as we've mentioned, there's a few ways you can do it
01:09:11.180 | You can pop a question mark there and that will show you where it comes from
01:09:15.280 | so there's actually a library called fast core which is a lot of the foundational stuff in fast AI that is not dependent on
01:09:22.960 | Pytorch or
01:09:25.640 | Pandas or any of these big heavy libraries
01:09:28.000 | So this is part of fast core and if you want to see exactly what it does you of course remember you can put in
01:09:34.540 | a second question mark
01:09:37.000 | to get
01:09:38.560 | the source code, and as you can see there's not much source code to it. And
01:09:43.480 | You know, maybe most importantly
01:09:46.320 | Please don't forget about doc
01:09:49.640 | Because really importantly that gives you this show in docs link which you can click on to get to the documentation to see
01:09:57.040 | examples
01:09:59.200 | Pictures if relevant, tutorials, tests and so forth
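The patching trick described a moment ago can be sketched in plain Python. This is a simplified stand-in, not fastcore's actual implementation (the real Path.ls is more featureful):

```python
from pathlib import Path

# simplified sketch of the kind of patch fastcore applies;
# fastcore's real Path.ls takes extra options
def ls(self):
    return sorted(self.iterdir())

Path.ls = ls  # "liberally add the things we want" to a stdlib class

print(Path(".").ls())  # every Path object now has .ls()
```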
01:10:07.960 | So when you're looking at a new data set, I always start with just ls to see what's in it
01:10:14.160 | And I can see here. There's a train folder and there's a valid folder. That's pretty normal
01:10:20.200 | so let's look at LS on the
01:10:23.040 | train folder and
01:10:25.480 | it's got a folder called 7 and a folder called 3 and
01:10:28.620 | So this is looking quite a lot like our bear classifier data set. We downloaded each set of images into
01:10:35.880 | a folder based on what its label was
01:10:38.560 | This is doing it at another level though
01:10:41.440 | Well, the first level of the folder hierarchy is is it training or valid and the second level is what's the label?
01:10:48.480 | And this is the most common way for image data sets to be distributed
01:10:53.920 | So let's have a look
01:11:00.040 | let's just create something called threes that contains all of the contents of the three directory training and
01:11:06.720 | Let's just sort them so that this is consistent
01:11:09.520 | Do the same for sevens and let's just look at the threes and you can see there's just they're just numbered
01:11:15.360 | Alright, so let's grab one of those
01:11:19.080 | Open it and take a look. Okay, so there's the picture of a three and so what is that really?
01:11:28.120 | Well, let's check the type of
01:11:31.880 | im3
01:11:35.240 | So PIL is the Python imaging library. It's the most popular library by far for working with images
01:11:42.360 | On Python and it's a PNG
01:11:45.520 | not surprisingly
01:11:50.800 | Jupyter notebook
01:11:53.760 | Knows how to display many different types and you can actually tell if you create a new type
01:11:58.440 | You can tell it how to display your type and so PIL comes with something that will automatically
01:12:02.680 | display the image like so
01:12:05.360 | What I want to do here though is to look at like how are we going to treat this as numbers?
01:12:10.960 | Right, and so one easy way to treat things as numbers is to turn it into an array
01:12:16.720 | So array is part of numpy, which is the most popular
01:12:20.560 | array
01:12:22.520 | programming library
01:12:24.280 | for Python and so if we pass our
01:12:27.080 | PIL image object to array it
01:12:31.680 | Just converts the image into a bunch of numbers and the truth is it was a bunch of numbers the whole time
01:12:37.880 | It was actually stored as a bunch of numbers on disk
01:12:40.760 | It's just that there's this magic thing in Jupyter that knows how to display those numbers on the screen
01:12:46.320 | So when we say array
01:12:48.720 | Turning it back into a numpy array. We're kind of removing this ability for Jupyter notebook to know how to display it like a picture
01:12:56.000 | So once I do this we can then index into that array and grab everything: all the rows from 4
01:13:03.520 | up to but not including 10 and all the columns from 4 up to and not including 10 and here are some numbers and
01:13:10.360 | they are
01:13:12.440 | 8 bit unsigned integers, so they are between 0 and 255
01:13:16.960 | So an image just like everything on a computer is just a bunch of numbers and therefore we can compute
01:13:23.760 | with it
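The slicing just shown can be sketched with a made-up array; the real values would of course come from the MNIST image:

```python
import numpy as np

# hypothetical stand-in for array(im3): a 28x28 8-bit grayscale image
im = np.zeros((28, 28), dtype=np.uint8)
im[4:10, 4:10] = 128   # paint a block into rows and columns 4 up to (not including) 10

print(im[4:10, 4:10])  # a 6x6 block of numbers
print(im.dtype)        # uint8: unsigned 8-bit ints, so values run 0 to 255
```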
01:13:26.320 | We could do the same thing but instead of saying array we could say tensor now a tensor is
01:13:31.880 | basically the PyTorch version of a numpy array and
01:13:37.160 | So you can see it looks it's exactly the same code as above
01:13:41.640 | But I've just replaced array with tensor and the output looks almost exactly the same except it replaces array with tensor
01:13:47.920 | And so you'll see this that basically a PyTorch tensor and a numpy array behave
01:13:53.880 | nearly identically
01:13:56.840 | Much if not most of the time, but the key thing is that a PyTorch tensor
01:14:02.280 | Can also be computed on a GPU not just a CPU
01:14:07.280 | So in our work and in the book and in the notebooks in our code
01:14:12.000 | We tend to use tensors PyTorch tensors much more often than numpy arrays
01:14:17.240 | Because they kind of have nearly all the benefits of numpy arrays plus all the benefits of GPU computation
01:14:23.160 | And they've got a whole lot of extra functionality as well
01:14:26.500 | a lot of people who have used
01:14:31.640 | Python for a long time always jump into numpy because that's what they're used to if that's you
01:14:37.740 | You might want to start considering jumping into
01:14:40.000 | Tensor like wherever you used to write a ray start writing tensor
01:14:43.520 | And just see what happens because you might be surprised at how many things you can speed up or do more easily
01:14:48.520 | So let's grab
01:14:52.120 | That three image and turn it into a tensor. And so that's going to be the three image tensor.
01:14:58.360 | That's why I've called it im3_t. Okay, and let's grab a bit of it
01:15:02.680 | Okay, and turn it into a pandas data frame
01:15:06.200 | And the only reason I'm turning it into a pandas data frame is that pandas has a very convenient thing called background_gradient
01:15:11.560 | That turns a background into a gradient as you can see
01:15:15.800 | So here is the top bit of the three. You can see that the zeros are the whites and the numbers near 255
01:15:24.240 | Are the blacks, and there's some bits in the middle which are gray
01:15:27.800 | so here we have we can see what's going on when our
01:15:32.920 | Images which are numbers actually get displayed on the screen. It's just it's just doing this
01:15:38.020 | And so I'm just showing a subset here. The actual full number in MNIST is a 28 by 28 pixel square
01:15:46.760 | So that's 784
01:15:48.920 | pixels
01:15:50.560 | So that's super tiny, right?
01:15:52.920 | My mobile phone, I don't know how many megapixels it is, but it's millions of pixels
01:15:57.440 | So it's nice to start with something simple and small
01:16:02.680 | So our goal is to create a model, and by model I mean some kind of computer program learnt from data,
01:16:10.520 | That can recognize threes versus sevens. You could think of it as a three detector. Is it a three?
01:16:17.360 | Because if it's not a three, it's a seven
01:16:20.120 | So how about you stop here, pause the video and have a think
01:16:24.140 | How would you do it?
01:16:26.440 | Like, you don't need to know anything about neural networks or anything else. How might you, just with common sense, build a
01:16:34.680 | three detector?
01:16:37.080 | Okay, so I hope you grabbed a piece of paper and a pen and jotted some notes down
01:16:41.080 | I'll tell you the first idea that came into my head
01:16:47.080 | Was what if we grab every single three in the data set and take the average of the pixels?
01:16:54.960 | so what's the average of
01:16:56.960 | This pixel the average of this pixel the average of this pixel the average of this pixel, right?
01:17:02.120 | And so there'll be a 28 by 28
01:17:04.640 | picture
01:17:07.160 | Which is the average of all of the threes and that would be like the ideal three and then we'll do the same for sevens and
01:17:15.400 | Then so when we then grab something from the validation set to classify, we'll say like oh
01:17:20.600 | Is this image closer to the ideal three, the mean of the threes, or the ideal seven?
01:17:27.960 | This is my idea. And so I'm going to call this the pixel similarity approach
01:17:33.140 | I'm describing this as a baseline a baseline is like a super simple model
01:17:39.120 | That should be pretty easy to program from scratch with very little magic, you know
01:17:43.080 | maybe it's just a bunch of kind of simple averages simple arithmetic, which
01:17:46.720 | You're super confident is going to be better than a random model
01:17:51.340 | right and
01:17:53.280 | one of the biggest mistakes I see in even experienced practitioners is that they fail to create a baseline and
01:18:00.200 | so then they build some fancy Bayesian model or
01:18:03.880 | or some fancy
01:18:08.400 | They create some fancy Bayesian model or some fancy neural network and they go Wow Jeremy
01:18:13.440 | Look at my amazingly great model and I'll say like how do you know it's amazingly great?
01:18:17.760 | And they say oh look the accuracy is 80% and then I'll say okay
01:18:21.380 | Let's see what happens if we create a model where we always predict the mean. Oh
01:18:25.520 | Look, that's 85%
01:18:30.360 | People get pretty disheartened when they discover this right and so make sure you start with a reasonable baseline and then gradually build on top of it
01:18:38.640 | So we need to get
01:18:40.640 | the average of the pixels
01:18:45.040 | We're going to learn some nice Python programming tricks to do this
01:18:48.320 | so the first thing we need to do is we need a list of
01:18:52.160 | all of the
01:18:54.760 | Sevens, so remember we've got the sevens
01:18:58.280 | Which is just a list of file names, right? And
01:19:05.000 | So for each of those file names in the sevens
01:19:07.600 | Let's Image.open that file just like we did before to get a PIL object, and let's convert that into a tensor
01:19:14.960 | So this thing here is called a list comprehension. So if you haven't seen this before
01:19:19.840 | This is one of the most powerful and useful tools in Python. If you've done something with C#
01:19:26.080 | It's a little bit like LINQ. It's not as powerful as LINQ, but it's a similar idea
01:19:30.640 | If you've done some functional programming in in JavaScript, it's a bit like some of the things you can do with that, too
01:19:35.920 | But basically we're just going to go through
01:19:38.200 | this collection
01:19:40.400 | Each item will become called O and then it will be passed to this function
01:19:45.420 | Which opens it up and turns it into a tensor and then it will be collated all back into a list
01:19:50.440 | And so this will be all of the
01:19:52.440 | sevens as tensors
01:19:58.320 | So Silva and I use list and dictionary comprehensions every day
01:20:02.840 | And so you should definitely spend some time checking it out if you haven't already
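The comprehension pattern looks like this; the file names and the loader function here are made up so the sketch runs without any image files:

```python
# stand-in for: seven_tensors = [tensor(Image.open(o)) for o in sevens]
paths = ["7_01.png", "7_02.png", "7_03.png"]

def fake_open(p):
    # hypothetical "loader": pretend to open a file
    return p.upper()

opened = [fake_open(o) for o in paths]  # each item becomes o and is passed to the function
print(opened)
```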
01:20:07.360 | so now that we've got a list of
01:20:10.680 | all of the
01:20:12.720 | threes as tensors
01:20:14.720 | Let's just grab one of them
01:20:17.040 | And display it
01:20:19.600 | So remember, this is a tensor not a PIL image object
01:20:23.720 | So Jupiter doesn't know how to display it
01:20:27.960 | So we have to use
01:20:29.960 | some command to display it and show image is a fast AI command that displays a tensor and so here is our three
01:20:37.600 | So we need to get the average of all of those threes
01:20:44.240 | so to get the average
01:20:47.040 | The first thing we need to do is to turn change this so it's not a list
01:20:51.160 | But it's a tensor itself
01:20:54.080 | currently
01:20:56.800 | three
01:20:58.600 | tensors
01:21:00.600 | One as a shape
01:21:04.960 | Which is 28 by 28. So this is this is the rows by columns the size of this thing, right? But three tensors itself
01:21:13.560 | It's just a list so I can't really easily do mathematical computations on that
01:21:22.880 | so what we could do is we could stack all of these 28 by 28 images on top of each other to create a
01:21:28.800 | Like a 3d cube of images, and that's still a tensor
01:21:35.120 | So a tensor can have as many of these axes or dimensions as you like and to stack them up you use funnily enough
01:21:41.960 | Stack, right? So this is going to turn the list
01:21:45.640 | Into a tensor and as you can see the shape of it is now
01:21:51.760 | 6131 by 28 by 28. So it's kind of like a cube of height 6131 by
01:21:59.480 | 28 by 28
01:22:02.840 | The other thing we want to do is if we're going to take the mean
01:22:09.520 | We want to turn them into floating point values
01:22:13.160 | Because we don't want to kind of have integers rounding off
01:22:18.000 | The other thing to know is that it's just as kind of a standard in
01:22:22.080 | Computer vision that when you're working with floats, you expect them to be between 0 and 1
01:22:29.320 | so we just divide by 255 because they were between 0 and 255 before so this is a pretty standard way to kind of
01:22:37.760 | Represent a bunch of images in PyTorch
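Those steps, stacking then casting to float then dividing by 255, can be sketched with random stand-in tensors rather than the real digits:

```python
import torch

# hypothetical stand-ins for the list of 28x28 digit tensors
three_tensors = [torch.randint(0, 256, (28, 28)) for _ in range(5)]

stacked_threes = torch.stack(three_tensors).float() / 255
print(stacked_threes.shape)  # torch.Size([5, 28, 28]), a rank-3 "cube"
print(stacked_threes.dtype)  # torch.float32, values now between 0 and 1
```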
01:22:41.960 | So these three things here are called the axes
01:22:47.920 | first axis second axis third axis and
01:22:51.680 | Overall we would say that this is a rank 3 tensor. It has three axes. So the
01:23:00.100 | This one here was a rank 2 tensor. It has two axes
01:23:06.120 | So you can get the rank from a tensor by just taking the length of its shape 1 2 3
01:23:13.760 | 3 okay
01:23:17.760 | You can also get that from ndim
01:23:19.760 | So, I've been using the word axis;
01:23:24.280 | You can also use the word dimension. I think numpy tends to call it axis; PyTorch tends to call it dimension
01:23:32.040 | so the rank is also
01:23:34.640 | the number of dimensions, and ndim gives you that
01:23:40.120 | So you need to make sure that you remember this word rank is the number of axes or dimensions in a tensor and the shape
01:23:48.000 | Is a list containing the size of each axis?
01:23:51.040 | in a tensor
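In code, with an arbitrary example tensor, the relationship between shape, rank, and ndim looks like this:

```python
import torch

t = torch.zeros(6, 28, 28)  # an arbitrary rank-3 tensor
print(t.shape)              # the size of each axis
print(len(t.shape))         # the rank: the number of axes
print(t.ndim)               # the same number, via PyTorch's "dimension" wording
```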
01:23:54.360 | So we can now say stack threes dot mean now if we just say stack threes dot mean
01:24:08.240 | That returns a single number that's the average pixel across that whole cube that whole rank 3 tensor
01:24:14.700 | But if we say mean 0
01:24:17.080 | That is take the mean over this axis. So that's the mean across the images
01:24:23.460 | right and so
01:24:25.960 | That's now 28 by 28 again because we kind of like
01:24:32.840 | Reduced over this 6131 axis. We took the mean across that axis
01:24:39.880 | And so we can show that image and here is our ideal 3
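That axis-0 mean can be sketched with random data standing in for the stacked threes:

```python
import torch

stacked = torch.rand(6131, 28, 28)  # stand-in for the cube of threes
mean3 = stacked.mean(0)             # reduce over the first (image) axis
print(mean3.shape)                  # torch.Size([28, 28]), the "ideal 3"
```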
01:24:44.440 | So here's the ideal 7 using the same approach
01:24:48.920 | All right, so now let's just grab a 3 there's just any old 3 here it is
01:24:55.840 | And what I'm going to do is I'm going to say well
01:24:58.820 | Is this 3 more similar to the perfect 3 or is it more similar to the perfect 7 and whichever one?
01:25:04.800 | It's more similar to I'm going to assume that's that's the answer
01:25:07.840 | So we can't just say look at each pixel and say
01:25:15.280 | What's the difference between this pixel?
01:25:19.000 | You know 0 0 here and 0 0 here and then 0 1 here and then 0 1 here and take the average
01:25:24.600 | The reason we can't just take the average is that there's positives and negatives and they're going to average out
01:25:29.120 | To nothing, right? So I actually need them all to be positive numbers
01:25:34.220 | So there's two ways to make them all positive numbers. I could take the absolute value which simply means remove the minus signs
01:25:42.600 | Okay, and then I could take the average of those
01:25:46.620 | That's called the mean absolute difference or L1 norm
01:25:52.800 | or I could take the square of each difference and
01:25:58.160 | Then take the mean of that, and then at the end I could take the square root
01:26:02.740 | Which kind of undoes the squaring, and that's called the root mean squared error
01:26:07.280 | or L2 norm
01:26:10.000 | So let's have a look let's take a 3 and
01:26:15.080 | Subtract from it the mean of the threes and take the absolute value and take the mean
01:26:22.600 | Okay, and call that the distance using absolute value of the three to a three
01:26:28.160 | And that there is the number point one, right? So this is the mean absolute difference or L1 norm
01:26:34.980 | So when you see a word like L1 norm, if you haven't seen it before it may sound pretty fancy
01:26:39.720 | But all these math terms that we see, you know, you can
01:26:44.060 | Turn them into a tiny bit of code, right?
01:26:48.600 | It's, you know... don't let the mathy bits fool you; they're often, like, in code
01:26:54.860 | It's just very obvious what they mean, whereas with math you just have to learn it or
01:27:00.080 | Learn how to Google it
01:27:02.880 | So here's the same version for squaring: take the difference, square it, take the mean and then take the square root
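Both distances can be written out directly; random tensors stand in for the digit and the ideal three here:

```python
import torch

a_3 = torch.rand(28, 28)    # stand-in for a single digit
mean3 = torch.rand(28, 28)  # stand-in for the ideal 3

dist_l1 = (a_3 - mean3).abs().mean()            # mean absolute difference (L1 norm)
dist_rmse = ((a_3 - mean3) ** 2).mean().sqrt()  # root mean squared error
print(dist_l1, dist_rmse)
```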
01:27:09.280 | So then we'll do the same thing for our three this time we'll compare it to the mean of the sevens
01:27:16.800 | Right. So the distance from a three to the mean of the threes were in terms of absolute was point one
01:27:23.040 | And the distance from a three to the mean of the sevens was point one five
01:27:28.800 | So it's closer to the mean of the threes than it is to the mean of the sevens. So we guess therefore that this is a
01:27:36.000 | three based on the
01:27:38.760 | mean absolute difference
01:27:41.200 | Same thing with RMSE root mean squared error would be to compare this value
01:27:46.080 | With this value and again root mean squared error. It's closer to the mean three than to the mean seven. So this is like a
01:27:54.920 | Machine learning model kind of it's a data driven model which attempts to recognize threes versus sevens
01:28:02.160 | And so this is a good baseline. I
01:28:04.840 | Mean, it's it's a reasonable baseline. It's going to be better than random
01:28:09.280 | We don't actually have to write out
01:28:12.960 | minus abs mean
01:28:16.320 | We can just actually use L1 loss. L1 loss does exactly that
01:28:21.680 | We don't have to write minus squared
01:28:25.960 | We can just write MSE loss that doesn't do the square root by default. So we have to pop that in
01:28:31.680 | Okay, and as you can see they're exactly
01:28:33.840 | the same numbers
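That equivalence is easy to check directly, again with random stand-in tensors:

```python
import torch
import torch.nn.functional as F

a, b = torch.rand(28, 28), torch.rand(28, 28)
l1 = F.l1_loss(a, b)            # same as (a - b).abs().mean()
rmse = F.mse_loss(a, b).sqrt()  # mse_loss doesn't take the square root by default
print(l1, rmse)
```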
01:28:36.480 | It's very important before we kind of go too much further to make sure we're very comfortable
01:28:45.240 | Working with arrays and tensors and you know, they're they're so similar
01:28:49.240 | So we could start with a list of lists, right? Which is kind of a matrix
01:28:55.240 | We can convert it into an array
01:28:58.080 | or into a tensor
01:29:01.040 | We can display it and they look almost the same
01:29:05.360 | You can index into a single row
01:29:08.480 | You can index into a single column and so it's important to know
01:29:13.800 | This is very important colon means
01:29:16.680 | Every row because I put it in the first spot, right? So if I put it in the second spot
01:29:24.240 | It would mean every column and so therefore
01:29:27.800 | Comma colon is exactly the same as removing it
01:29:34.520 | So it just turns out you can always remove
01:29:39.360 | Colons that are at the end because they're kind of they're just implied right you never have to and I often kind of put
01:29:45.840 | Them in any way because just kind of makes it a bit more obvious how these things kind of
01:29:51.480 | Match up or how they differ
01:29:54.120 | You can combine them together so give me the first row and everything from the first up to but not including the third column
01:30:02.520 | Right. So there's that five six
01:30:07.200 | You can add stuff to them. You can check their type. Notice that this is different to the Python
01:30:15.180 | type. type as a function
01:30:18.680 | Just tells you it's a tensor. If you want to know what kind of tensor you have to use type as a method
01:30:24.760 | So it's a long tensor
01:30:26.760 | You can
01:30:29.840 | Multiply them by a float, which turns it into a float. You know, have a fiddle around: if you haven't done much stuff with numpy or
01:30:37.120 | PyTorch before, this is a good opportunity to just
01:30:40.680 | Go crazy, try things out, try things that you think might not work and see if you actually get an error message, you know
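The indexing and typing rules just described look like this in practice:

```python
import torch

tns = torch.tensor([[1, 2, 3], [4, 5, 6]])

print(tns[1])              # one row
print(tns[:, 1])           # one column; the colon means "every row"
print(tns[1, 1:3])         # row 1, columns 1 up to (but not including) 3
print(tns.type())          # type as a method gives the kind of tensor
print((tns * 1.5).type())  # multiplying by a float gives a FloatTensor
```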
01:30:52.520 | We now want to find out
01:30:54.520 | How good is our model?
01:31:02.580 | our model that involves just comparing something to the
01:31:08.200 | mean
01:31:08.200 | We should not compare
01:31:11.400 | You should not check how good our model is on the training set as we've discussed
01:31:17.080 | We should check it on a validation set and we already have a validation set. It's everything inside the valid directory
01:31:24.360 | So let's go ahead and, like, combine all those steps from before. Let's go through everything in the validation set 3 directory with ls,
01:31:31.040 | Open them turn them into a tensor stack them all up
01:31:36.060 | Turn them into floats divide by 255
01:31:41.740 | Let's do the same for sevens
01:31:43.740 | So we're just putting all the steps we did before into a couple of lines
01:31:47.740 | Yeah, I always try to print out shapes like all the time
01:31:53.100 | Because if a shape is not what you expected then you can you know get weird things going on
01:32:01.180 | So the idea is we want some function is 3 that will return true if we think something is a 3
01:32:07.380 | So to do that we have to decide whether our
01:32:12.540 | Digit that we're testing on is closer to the ideal 3 or the ideal 7
01:32:18.160 | So let's create a little function that
01:32:22.500 | returns
01:32:26.580 | Takes the difference between two things takes the absolute value and then takes the mean
01:32:31.420 | So we're going to create this function mnist_distance that takes the difference between two tensors,
01:32:40.580 | Takes their absolute value and then takes the mean, and look at this
01:32:46.280 | We've got minus this time. It takes the mean over the last
01:32:50.060 | over the
01:32:54.580 | second last and third last
01:32:56.580 | Sorry, the last and second last dimensions. So this is going to take
01:33:02.260 | The mean across the kind of x and y axes and so here you can see it's returning a
01:33:10.820 | single number which is the distance of a 3 from the mean 3
01:33:16.980 | So that's the same as the value that we got earlier point 1 1 1 4
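A sketch of such a distance function, with random stand-ins for the real images:

```python
import torch

def mnist_distance(a, b):
    # mean absolute difference over the last two (pixel) axes
    return (a - b).abs().mean((-1, -2))

mean3 = torch.rand(28, 28)  # stand-in for the ideal 3
a_3 = torch.rand(28, 28)    # stand-in for a single digit
print(mnist_distance(a_3, mean3))  # a single scalar tensor
```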
01:33:24.340 | So we need to do this for every image in the validation set because we're trying to find the overall metric
01:33:30.560 | Remember the metric is the thing we look at to say how good is our model?
01:33:34.260 | So here's something crazy. We can call MNIST distance
01:33:39.260 | Not just on our 3 but are on the entire validation set
01:33:45.120 | Against the mean 3
01:33:49.300 | So that's wild like there's no normal programming that we would do where we could somehow pass in
01:33:55.860 | either a matrix or a rank 3 tensor and somehow it works both times and
01:34:03.660 | What actually happened here is that instead of returning a single number?
01:34:09.220 | It returned
01:34:12.780 | 1010 numbers
01:34:15.500 | And it did this because it used something called
01:34:18.180 | Broadcasting and broadcasting is like the super special magic trick
01:34:23.980 | That lets you make Python into a very very high performance language. And in fact, if you do this broadcasting on
01:34:31.900 | GPU tenses and pytorch it actually does this operation on the GPU even though you wrote it in Python
01:34:38.860 | Here's what happens
01:34:41.620 | Look here this a - B
01:34:45.180 | So we're doing a - b on two things. We've got, first of all, valid_3_tens, the valid 3 tensor, which
01:34:53.940 | Is a thousand or so images right and remember that mean 3
01:35:01.220 | Is just our single ideal 3 so what is
01:35:08.100 | Something of this shape minus something of this shape
01:35:14.640 | Well broadcasting means that if this shape doesn't match this shape
01:35:20.100 | Like if they did match it would just subtract every corresponding item, but because they don't match
01:35:26.560 | It actually acts as if there's a thousand and ten versions
01:35:32.040 | of this
01:35:34.360 | So it's actually going to subtract this from every single one of these
01:35:42.960 | So broadcasting let's look at some examples
01:35:45.380 | So broadcasting requires us to first of all understand the idea of element wise operations
01:35:52.200 | This is an element wise operation. Here is a rank 1 tensor of
01:35:56.240 | Size 3 and another rank 1 tensor of size 3
01:36:00.680 | So we would say these sizes match they're the same and so when I add 1 2 3 to 1 1 1 I get back
01:36:08.200 | 2 3 4 it just takes the corresponding items and adds them together. So that's called element wise operations
01:36:15.800 | So when I have different
01:36:23.040 | Shapes as we described before
01:36:27.440 | What it ends up doing is it basically copies
01:36:32.680 | this tensor a thousand and ten times, and it acts as if we had said valid_3_tens minus a
01:36:40.720 | thousand and ten copies of
01:36:43.400 | mean 3
01:36:45.520 | As it says here it doesn't actually copy mean 3 a thousand and ten times it just pretends that it did right?
01:36:53.600 | It just acts as if it did so basically kind of loops back around to the start again and again
01:36:57.400 | And it does the whole thing in C or in CUDA on the GPU
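Element-wise operations and broadcasting can be demonstrated in a few lines:

```python
import torch

# element-wise: matching shapes combine corresponding items
print(torch.tensor([1, 2, 3]) + torch.tensor([1, 1, 1]))  # tensor([2, 3, 4])

# broadcasting: the smaller tensor acts as if copied along the missing axis
batch = torch.zeros(1010, 28, 28)
ideal = torch.ones(28, 28)
diff = batch - ideal  # no real copies are made; the loop happens in C or CUDA
print(diff.shape)     # torch.Size([1010, 28, 28])
```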
01:37:04.320 | Then we see absolute value, right? So let's go back up here
01:37:07.960 | After we do the minus we go absolute value, but what happens when we call absolute value on
01:37:14.880 | Something of size
01:37:19.120 | 1010 by 28 by 28? It just calls absolute value on each underlying thing
01:37:26.080 | And then finally we call mean
01:37:32.960 | Minus one is the last element always in Python minus two is a second last
01:37:37.480 | so this is taking the mean over the last two axes and
01:37:41.520 | So then it's going to return just the first axis. So we're going to end up with a thousand and ten
01:37:47.560 | means
01:37:51.320 | Distances, which is exactly what we want. We want to know how far away
01:37:54.920 | each of our validation items is from
01:38:00.480 | the ideal three
01:38:02.480 | So then
01:38:05.160 | We can create our is three function, which is hey is the distance
01:38:09.640 | between the number in question and the perfect three
01:38:14.360 | Less than the distance between the number in question and the perfect seven if it is
01:38:19.320 | It's a three. All right, so
01:38:22.400 | Our three that was an actual three we had is it a three. Yes
01:38:27.640 | Okay, and then we can turn that into a float and yes becomes one
01:38:32.320 | Thanks to broadcasting we can do it for that entire
01:38:37.640 | Set right. So this is so cool. We basically get rid of loops
01:38:42.440 | In this kind of programming you should have very, very few loops. Loops make things
01:38:49.600 | much harder to read
01:38:52.400 | And hundreds of thousands of times slower on the GPU, potentially tens of millions of times slower
01:38:58.320 | So we can just say is_3 on our whole
01:39:02.360 | valid_3_tens and then turn that into a float and then take the mean
01:39:07.080 | So that's going to be the accuracy of the threes on average and here's the accuracy of the sevens. It's just one minus that
01:39:14.080 | So the accuracy across threes is about 91 and a bit percent the accuracy on sevens is about 98 percent and
01:39:22.040 | the average of those two is about 95 percent. So here we have a
01:39:27.320 | Model that's 95 percent accurate at recognizing threes from sevens
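The whole baseline can be sketched end to end; here mean3, mean7 and valid_3_tens are random stand-ins for the notebook's tensors (so the printed accuracy is meaningless), but the shapes and the loop-free broadcasting are the same:

```python
import torch

# Toy stand-ins for the notebook's tensors (random data, not real MNIST):
# mean3/mean7 are the "ideal" digits, valid_3_tens is a stack of validation images.
torch.manual_seed(0)
mean3 = torch.rand(28, 28)
mean7 = torch.rand(28, 28)
valid_3_tens = torch.rand(100, 28, 28)

def mnist_distance(a, b):
    # Mean absolute difference over the last two axes (the pixel axes),
    # leaving one distance per image thanks to broadcasting
    return (a - b).abs().mean((-1, -2))

def is_3(x):
    # An image "is a 3" if it is closer to the ideal 3 than to the ideal 7
    return mnist_distance(x, mean3) < mnist_distance(x, mean7)

preds = is_3(valid_3_tens)          # one boolean per image, no loops needed
accuracy_3s = preds.float().mean()  # fraction of the stack classified as a 3
print(preds.shape, accuracy_3s.item())
```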
01:39:32.440 | It might surprise you
01:39:35.520 | That we can do that using nothing but
01:39:38.400 | Arithmetic, right?
01:39:41.560 | So that's what I mean by getting a good baseline
01:39:44.760 | Now the thing is
01:39:51.040 | It's not obvious how we kind of improve this right. I mean the thing is it doesn't match
01:39:57.840 | Arthur Samuel's description of machine learning, right?
01:40:03.200 | This is not something where there's a function which has some parameters which we're testing
01:40:09.320 | Against some kind of measure of fitness and then using that to like improve the parameters iteratively. We kind of we just did one
01:40:16.400 | step and
01:40:17.960 | That's that right
01:40:21.320 | So we want to try and do it in this way, where we arrange for some automatic means of testing the effectiveness of what he called
01:40:27.840 | a weight assignment
01:40:28.680 | (we'd call it a parameter assignment) in terms of performance, and a mechanism for
01:40:32.800 | altering the weight assignment to maximize the performance. We want to do it that way
01:40:38.280 | Right, because we know from chapter one, from lesson one, that if we do it that way we have this like
01:40:45.880 | Magic box, right called machine learning that can do, you know, particularly combined with neural nets should be able to solve
01:40:53.400 | any problem in theory
01:40:55.920 | If you can at least find the right set of weights
01:40:59.880 | So we need something that can get better and better
01:41:04.480 | on its own
01:41:07.440 | So let's think about a
01:41:10.080 | function which has parameters
01:41:13.800 | So instead of finding an ideal image and seeing how far away something is from the ideal image
01:41:31.680 | What we could instead do is come up with a set of weights
01:41:36.840 | For each pixel. So we're trying to find out if something is the number three and so we know that like in the places
01:41:44.560 | That you would expect to find three pixels
01:41:47.840 | You could give those like high weights so you can say hey if there's a dot in those places
01:41:52.840 | We give it like a high score and if there's dots in other places
01:41:57.360 | We'll give it like a low score
01:41:59.640 | but we could actually come up with a function where the probability of something being an
01:42:05.760 | Well in this case, let's say an 8
01:42:07.760 | is equal to
01:42:10.200 | the pixels in the image
01:42:12.560 | Multiplied by some set of weights and then we sum them up
01:42:16.920 | right, so then anywhere where
01:42:22.720 | The image we're looking at, you know
01:42:24.720 | Has pixels where there are high weights
01:42:29.120 | It's going to end up with a high probability. So here X is the image that we're interested in
01:42:35.960 | And we're just going to represent it as a vector. So let's just have all the rows stacked up
01:42:41.840 | end-to-end into a single long line
01:42:44.560 | so we're going to use an approach where we're going to start with a
01:42:51.440 | Vector W. So a vector is a rank one tensor
01:42:55.320 | Okay, we're going to start with a vector W. That's going to contain
01:42:59.320 | random weights
01:43:02.680 | random parameters
01:43:04.320 | Depending on whether you use the Arthur Samuel version of the terminology or not
01:43:08.360 | and so
01:43:11.080 | We'll then predict whether a number appears to be a 3 or a 7
01:43:15.800 | by using this
01:43:18.720 | tiny little function
01:43:21.600 | And then we will figure out how good the model is
01:43:26.080 | So we will calculate like how accurate it is or something like that
01:43:30.240 | that is, the loss, and
01:43:33.360 | Then the key step is we're then going to calculate the gradient now
01:43:37.840 | The gradient is something that measures for each weight. If I made it a little bit bigger
01:43:42.520 | Would the loss get better or worse?
01:43:45.440 | If I made it a little bit smaller
01:43:47.800 | Would the loss get better or worse? And so if we do that for every weight
01:43:51.480 | We can decide for every weight whether we should make that weight a bit bigger or a bit smaller
01:43:56.320 | So that's called the gradient, right? So once we have the gradient we then step is the word we used to step
01:44:04.080 | change all the weights
01:44:06.920 | Up a little bit for the ones where the gradient said we should make them a bit higher and
01:44:12.560 | Down a little bit for all the ones where the gradient said they should be a bit lower
01:44:16.200 | So now it should be a tiny bit better and then we go back to step 2 and
01:44:22.240 | Calculate a new set of predictions using this formula
01:44:25.640 | Calculate the gradient again
01:44:29.400 | step the weights
01:44:31.200 | Keep doing that
01:44:32.040 | So this is basically the flow chart and then at some point when we're sick of waiting or when the loss gets good enough
01:44:37.640 | We'll stop
01:44:39.800 | So these seven steps
01:44:47.880 | One, two, three, four, five, six
01:44:47.880 | seven
01:44:49.040 | These seven steps are the key to training all deep learning models this technique is called stochastic gradient descent
01:44:56.480 | Well, it's called gradient descent. We'll see the stochastic bit very soon
01:45:00.160 | and for each of these
01:45:03.040 | Seven steps, there's lots of choices around exactly how to do it, right?
01:45:08.760 | We've just kind of hand-waved a lot like what kind of random initialization and how do you calculate the gradient and exactly?
01:45:15.600 | What step do you take based on the gradient and how do you decide when to stop blah blah blah, right?
01:45:19.440 | So in this in this course, we're going to be like learning
01:45:24.840 | About you know these steps
01:51:28.080 | You know, that's kind of part one. Then the other big part is like, well, what's the actual function? The neural network
01:45:35.440 | So how do we train the thing and what is the thing that we train?
01:45:38.960 | So we initialize parameters with random values
01:45:41.880 | We need some function that's going to be the loss function, that will return a number that's small if the performance of the model is good
01:45:51.640 | Then we need some way to figure out whether the weights should be increased a bit or decreased a bit
01:45:58.800 | Then we need to decide like when to stop which we'll just say let's just do a certain number of epochs
01:46:06.720 | So let's like
01:46:08.520 | Go even simpler, right? We're not even going to do MNIST. We're going to start with this function x squared
01:46:14.360 | Okay, and in fast AI we've created a tiny little thing called plot function that plots the function
01:46:20.800 | Alright so there's our function f and
01:46:26.880 | What we're going to do is we're going to try to find this is our loss function
01:46:33.320 | So we're going to try and find the bottom point
01:46:36.600 | Right, so we're going to try and figure out what is the x value which is at the bottom?
01:46:41.440 | So our seven-step procedure requires us to start out by initializing
01:46:46.480 | So we need to pick
01:46:49.720 | Some value right? So the value we pick was just say oh, let's just randomly pick minus one and a half
01:46:55.480 | Great. So now we need to know if I increase x a bit
01:47:01.120 | Does my loss (remember, this is my loss) get a bit better? Remember, better is smaller. Or a bit worse?
01:47:07.040 | So we can do that easily enough
01:47:10.080 | We can just try a slightly higher x and a slightly lower x and see what happens
01:47:14.600 | Right and you can see it's just the slope right the slope at this point
01:47:19.360 | Tells you that if I increase x by a bit
01:47:23.760 | Then my loss will decrease because that is the slope at this point
01:47:30.480 | So if we change our weight, our parameter
01:47:35.000 | Just a little bit in the direction of the slope
01:47:38.180 | Right. So here is the direction of the slope and so here's the new value at that point
01:47:43.720 | And then do it again and then do it again
01:47:46.920 | Eventually, we'll get to the bottom of this curve
01:47:49.880 | So this idea goes all the way back to Isaac Newton at the very least and this basic idea is called
01:48:00.040 | Newton's method
01:48:02.040 | So a key thing we need to be able to do is to calculate
01:48:04.680 | this slope and
01:48:08.080 | the bad news is
01:48:11.240 | To do that we need calculus
01:48:13.280 | At least that's bad news for me because I've never been a fan of calculus we have to calculate the derivative
01:48:19.560 | Here's the good news though
01:48:22.160 | Maybe you spent ages in school learning how to calculate derivatives
01:48:27.920 | You don't have to anymore the computer does it for you and the computer does it fast. It uses all of those
01:48:34.880 | Methods that you learned at school and it has a whole lot more
01:48:38.880 | Like clever tricks for speeding them up and it just does it all
01:48:43.160 | Automatically. So for example, it knows I don't know if you remember this from high school that the derivative of x squared is 2x
01:48:51.640 | It is just something it knows. It's part of its kind of bag of tricks, right? So
01:48:57.980 | So PyTorch knows that PyTorch has an engine built in that can take derivatives and find the gradient of functions
01:49:06.460 | so to do that
01:49:09.300 | we start with a
01:49:11.620 | Tensor, let's say and in this case, we're going to modify this tensor with this special
01:49:17.660 | method called requires_grad_ and what this does is it tells PyTorch that any time I do a calculation with this
01:49:26.140 | xt it should remember what calculation it does so that I can take the derivative later
01:49:31.880 | You see the underscore at the end
01:49:35.700 | An underscore at the end of a method in PyTorch means that this is called an in place operation
01:49:42.360 | It actually modifies this. So requires_grad_
01:49:47.140 | Modifies this tensor to tell PyTorch that we want to be calculating gradients on it
01:49:52.940 | So that means it's just going to have to keep track of all of the computations we do so that it can calculate the derivative later
01:49:59.140 | Okay, so we've got the number 3 and
01:50:06.220 | Let's say we then call f on it. Remember f is just squaring it though 3 squared is 9
01:50:13.660 | But the value is not just 9 it's 9 accompanied with a grad function
01:50:18.860 | Which is that it's it knows that a power operation has been taken
01:50:23.280 | So we can now call a special method
01:50:25.540 | backward
01:50:28.400 | and backward
01:50:30.400 | Which refers to back propagation which we'll learn about
01:50:33.320 | which basically means take the derivative and
01:50:36.280 | so once it does that we can now look inside xt because we said requires_grad_ and
01:50:42.840 | Find out its gradient and
01:50:45.480 | Remember the derivative of X squared is 2x
01:50:48.940 | In this case that was 3
01:50:53.120 | 2 times 3 is 6
01:50:55.120 | right, so
01:50:57.240 | We didn't have to figure out the derivative
01:51:00.360 | We just call backward and then get the grad attribute to get the derivative
01:51:05.180 | So that's how easy it is to do
01:51:07.640 | calculus in PyTorch, so
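The scalar example just walked through looks like this, mirroring the notebook's f(x) = x squared:

```python
import torch

def f(x):
    return x ** 2

xt = torch.tensor(3.).requires_grad_()  # track computations on this tensor
yt = f(xt)                              # tensor(9., grad_fn=<PowBackward0>)
yt.backward()                           # backpropagation: compute the derivative
print(xt.grad)                          # derivative of x^2 is 2x, so 2 * 3 = 6
```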
01:51:10.120 | What you need to know about calculus is not how to take a derivative
01:51:17.320 | But what it means and what it means is
01:51:21.680 | It's the slope at some point
01:51:24.840 | Now here's something interesting let's not just take three, but let's take a rank one tensor
01:51:32.660 | also known as a vector three four ten and
01:51:36.120 | Let's add sum
01:51:39.760 | To our f function. So it's going to go x squared dot sum
01:51:42.920 | And now we can take f of
01:51:46.160 | This vector get back 125
01:51:51.880 | Then we can say backward and grad and look
01:51:55.240 | 2x 2x
01:51:59.800 | Right so we can calculate
01:52:01.800 | This is
01:52:05.640 | vector calculus right we're getting the
01:52:08.520 | gradient for every element of a vector
01:52:13.000 | With the same two lines of code
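A sketch of the vector version just described, with sum added so the output is a scalar we can backpropagate from:

```python
import torch

def f(x):
    return (x ** 2).sum()  # sum so the output is a single number

xt = torch.tensor([3., 4., 10.]).requires_grad_()
yt = f(xt)                 # 9 + 16 + 100 = 125
yt.backward()
print(xt.grad)             # 2x for every element: tensor([ 6.,  8., 20.])
```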
01:52:18.240 | So that's kind of all you need to know about calculus right and if this is
01:52:23.300 | If this idea that a derivative or gradient is a slope is unfamiliar
01:52:30.360 | Then check out Khan Academy
01:52:32.200 | They have some great introductory calculus, and don't forget you can skip all the bits where they teach you how to calculate
01:52:37.560 | the gradients yourself
01:52:40.320 | So now that we know how to calculate the gradient that is the slope of the function
01:52:46.160 | That tells us if we change our input a little bit
01:52:50.480 | How will our output change?
01:52:53.360 | Correspondingly, that's what a slope is
01:52:56.320 | Right and so that tells us that every one of our parameters if we know their gradients
01:53:02.200 | Then we know if we change that parameter up a bit or down a bit. How will it change our loss?
01:53:07.520 | So therefore we then know how to change our parameters
01:53:12.080 | So what we do is let's say all of our weights called W
01:53:17.280 | We just subtract off them the gradients
01:53:21.520 | multiplied by some
01:53:24.320 | Small number and that small number is often a number between about 0.001 and 0.1
01:53:30.920 | It's called the learning rate right and this year is the essence of
01:53:37.480 | gradient descent
01:53:40.800 | So if you pick a learning rate, that's very small
01:53:43.680 | Then you take the slope and you take a really small step in that direction
01:53:47.680 | And another small step another small step another small steps note. It's going to take forever to get to the end
01:53:53.840 | If you pick a learning rate, that's too big
01:53:56.400 | You jump way too far
01:53:59.400 | Each time and again, it's going to take forever
01:54:02.880 | And in fact in this case
01:54:07.160 | We're assuming we're starting here and it's actually so big that it gets worse and worse
01:54:11.120 | Or here's one where we start here and it's not
01:54:16.680 | So big that it gets worse and worse, but it just takes a long time to bounce in and out
01:54:20.920 | right, so
01:54:23.680 | Picking a good learning rate is really important both to making sure that it's even possible
01:54:28.220 | To solve the problem and that it's possible to solve it in a reasonable amount of time
01:54:33.440 | So we'll be learning about picking how to pick learning rates in this course
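To see the learning-rate effect concretely, here is a small sketch running the same step rule on f(x) = x squared with two learning rates; the specific values are made up for illustration:

```python
import torch

def run(lr, steps=20):
    # Gradient descent on f(x) = x^2, starting from x = -1.5
    x = torch.tensor(-1.5, requires_grad=True)
    for _ in range(steps):
        loss = x ** 2
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad   # the step rule: x minus learning rate times gradient
        x.grad.zero_()
    return x.item()

print(run(0.1))  # small enough: converges toward the minimum at 0
print(run(1.1))  # too big: every step overshoots and x gets further away
```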
01:54:43.080 | Let's try this. Let's try using gradient descent. I said SGD. That's not quite accurate. It's just going to be gradient descent
01:54:51.240 | to solve an actual problem
01:54:53.920 | So the problem we're going to solve is let's imagine you were watching a roller coaster
01:55:00.680 | Go over the top of a hump, right?
01:55:03.040 | So as it comes out of the previous hill, it's going super fast and it's going up the hill
01:55:10.420 | And it's going slower and slower and slower until it gets to the top of the hump
01:55:14.640 | And then it goes down the other side it goes faster and faster and faster
01:55:17.880 | So if you had a stopwatch or some kind of speedometer and you were measuring it just by hand
01:55:26.360 | At kind of equal time points. You might end up with something that looks a bit like this
01:55:31.240 | Right. And so the way I did this was I just grabbed a range just grabs
01:55:36.080 | The numbers from naught up to but not including 20. Alright, so these are the time periods at which I'm taking my speed measurement
01:55:43.560 | and then I've just got some
01:55:46.400 | Quadratic function here and multiply it by 3 and then square it and then add 1 whatever right and then I
01:55:54.400 | Also, actually sorry I take my time
01:55:56.400 | minus 9.5
01:55:59.280 | Square it times 0.75 add 1 and then I add a random number to that or add a random number to every observation
01:56:07.080 | So I end up with a quadratic function, which is a bit bumpy
01:56:10.580 | So this is kind of like what it might look like in real life because my speedometer
01:56:15.100 | Kind of testing is not perfect
01:56:17.680 | All right, so
01:56:22.800 | We want to create a function that estimates at any time. What is the speed of the roller coaster?
01:56:28.040 | so we start by
01:56:30.480 | Guessing what function it might be so we guess that it's a function
01:56:36.200 | A times time squared plus B times time plus C, which you might remember from school is called a quadratic
01:56:43.400 | so let's create a function right and so
01:56:49.000 | Let's create it using kind of the Arthur Samuel's technique the machine learning technique. This function is going to take two things
01:56:54.320 | It's going to take an input
01:56:56.440 | which in this case is a time and it's going to take some parameters and
01:57:00.720 | The parameters are a, b and c. So in Python you can split out a
01:57:07.520 | List or a collection into its components like so and then here's that function. Okay
01:57:14.640 | So we're not just trying to find any function in the world. We're just trying to find some function
01:57:18.840 | Which is a quadratic by finding an A and a B and a C
01:57:22.760 | So the the Arthur Samuel technique for doing this is to next up come up with a loss function
01:57:30.280 | Come up with a measurement of how good we are. So if we've got some predictions
01:57:35.000 | That come out of our function and the targets which are these you know actual values
01:57:42.720 | Then we could just do the
01:57:44.720 | Mean squared error. Okay. So here's that mean squared error we saw before the difference squared and take the mean
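This setup can be sketched as follows; the constants follow the description above, and the noise scale is an assumption chosen just to make the data a bit bumpy:

```python
import torch

# Noisy quadratic "speed" measurements at 20 equally spaced time points
torch.manual_seed(42)
time = torch.arange(0, 20).float()
speed = torch.randn(20) * 3 + 0.75 * (time - 9.5) ** 2 + 1

def f(t, params):
    # The quadratic we are trying to fit: a*t^2 + b*t + c
    a, b, c = params          # split the collection into its components
    return a * (t ** 2) + b * t + c

def mse(preds, targets):
    # Mean squared error: difference squared, then take the mean
    return ((preds - targets) ** 2).mean()
```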
01:57:52.880 | So now we need to go through our seven-step process
01:57:57.680 | We want to come up with a set of three parameters A B and C
01:58:01.960 | Which are as good as possible. The step one is to initialize A B and C to random values
01:58:07.920 | So this is how you get random values three of them in PyTorch and remember we're going to be adjusting them
01:58:13.560 | So we have to tell PyTorch that we want the gradients
01:58:15.920 | I'm just going to save those away so I can check them later and then I calculate the predictions using that function f
01:58:26.480 | Which was this?
01:58:29.320 | And then let's create a little function which just plots how good at this point are our predictions
01:58:38.040 | So here is a function that prints in red
01:58:40.600 | our predictions and in blue our targets, so that looks pretty terrible
01:58:45.920 | But let's calculate the loss
01:58:49.840 | Using that MSE function we wrote
01:58:52.720 | Okay, so now we want to improve this so calculate the gradients using the two steps we saw
01:58:58.960 | Call backward and then get grad and this says that each of our
01:59:04.640 | Parameters has a gradient that's negative
01:59:06.880 | Let's pick a learning rate of 10 to the minus 5 so we multiply that by 10 to the minus 5
01:59:15.880 | And step the weights and remember step the weights means minus equals learning rate times
01:59:25.280 | The gradient there's a wonderful trick here which I've called dot data
01:59:32.040 | the reason I've called dot data is dot data is a special attribute in PyTorch which if you use it then the
01:59:40.000 | gradient is not calculated and we certainly wouldn't want the gradient to be calculated of
01:59:47.040 | The actual step we're doing we only want the gradient to be calculated of our
01:59:51.800 | function f
01:59:54.320 | Right. So when we step the weights we have to use this special dot data attribute
02:00:00.480 | After we do that
02:00:02.320 | Delete the gradients that we already had
02:00:04.320 | And let's see if loss improved. So the loss before was
02:00:08.280 | 25,800
02:00:12.200 | Now it's 5,400 and the plot has gone from something that goes down to minus 300
02:00:19.660 | Oh to something that looks much better
02:00:23.840 | So let's do that a few times
02:00:26.920 | So I've just grabbed those previous lines of code and pasted them all into a single cell
02:00:30.720 | Okay, so preds, loss, backward, .data, grad is None
02:00:35.120 | and then from time to time print the loss out and
02:00:38.360 | Repeat that ten times and look getting better and better
02:00:42.920 | And so we can actually look at it getting better and better
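That single cell can be sketched end to end like this; it is a self-contained approximation of the notebook's loop, with the same .data trick for stepping the weights:

```python
import torch

torch.manual_seed(42)
time = torch.arange(0, 20).float()
speed = torch.randn(20) * 3 + 0.75 * (time - 9.5) ** 2 + 1

def f(t, params):
    a, b, c = params
    return a * (t ** 2) + b * t + c

def mse(preds, targets):
    return ((preds - targets) ** 2).mean()

params = torch.randn(3).requires_grad_()     # step 1: random initialization
lr = 1e-5                                    # the learning rate

def apply_step(params):
    preds = f(time, params)                  # step 2: predict
    loss = mse(preds, speed)                 # step 3: measure the loss
    loss.backward()                          # step 4: calculate the gradients
    params.data -= lr * params.grad.data     # step 5: step the weights (.data skips autograd)
    params.grad = None                       # delete the gradients we already had
    return loss.item()

losses = [apply_step(params) for _ in range(10)]
print(losses[0], losses[-1])                 # the loss shrinks as the fit improves
```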
02:00:52.200 | So this is pretty cool, right? We have a technique. This is the Arthur Samuel technique
02:01:00.480 | Finding a set of parameters that
02:01:03.240 | Continuously improves by getting feedback from the result of measuring some loss function
02:01:10.460 | So that was kind of the key step, right?
02:01:14.280 | This this is the gradient descent method
02:01:18.760 | So you should make sure that you kind of go back and feel super comfortable
02:01:23.160 | with what's happened and
02:01:25.800 | You know, if you're not feeling comfortable, that's fine
02:01:28.280 | Right if it's been a while or if you've never done this kind of gradient descent before
02:01:33.160 | This might feel super unfamiliar
02:01:36.600 | So try to find the first cell in this notebook where you don't fully understand what it's doing
02:01:42.760 | And then like stop and figure it out like look at everything that's going on do some experiments do some reading
02:01:50.320 | until you understand
02:01:53.280 | That cell where you were stuck before you move forwards
02:01:55.980 | So let's now apply this to MNIST
02:02:00.880 | So for MNIST
02:02:09.280 | We want to use this exact technique and there's basically nothing extra we have to do
02:02:13.560 | except one thing
02:02:16.120 | we need a loss function and
02:02:18.120 | The metric that we've been using is the error rate or the accuracy
02:02:25.160 | It's like how often are we correct? Right, and that's the thing that we're actually trying to make
02:02:31.380 | good, our metric. But we've got a very serious problem
02:02:36.440 | Which is remember we need to calculate the gradient
02:02:39.360 | To figure out how we should change our parameters, and the gradient is the slope, or the steepness
02:02:46.480 | Which you might remember from school is defined as rise over run
02:02:50.320 | It's y new minus y old divided by x new minus x old
02:02:58.440 | The gradient's actually defined when x new is very, very close to x old
02:03:04.680 | Meaning that difference is very small
02:03:06.680 | But think about it
02:03:09.520 | Accuracy if I change a parameter by a tiny tiny tiny amount
02:03:14.280 | The accuracy might not change at all
02:03:17.120 | Because there might not be any
02:03:19.760 | Three that we now predict as a seven or any seven that we now predict as a three
02:03:25.400 | Because we change the parameter by such a small amount
02:03:28.160 | So it's possible. In fact, it's certain that the gradient is
02:03:34.480 | zero at
02:03:35.840 | Many places and that means that our parameters
02:03:39.080 | Aren't going to change at all, because learning rate times gradient is still zero when the gradient's zero, for any learning rate
02:03:46.680 | So this is why
02:03:50.400 | the loss
02:03:52.960 | Function and the metric are not always the same thing
02:03:57.400 | we can't use a metric as
02:04:01.240 | Our loss if that metric has a gradient of zero
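A tiny numeric sketch of this point: a small change to the outputs usually flips no predictions, so accuracy gives no signal, while a smooth loss like mean squared error does. The scores and targets here are made up:

```python
import torch

scores = torch.tensor([0.3, 0.6, 0.8])   # hypothetical model outputs
targets = torch.tensor([0., 1., 1.])

def accuracy(s):
    # Threshold at 0.5 and count how often we match the targets
    return ((s > 0.5).float() == targets).float().mean()

def mse(s):
    return ((s - targets) ** 2).mean()

nudged = scores + 1e-4                       # a tiny change to the outputs
print(accuracy(nudged) - accuracy(scores))   # zero: no prediction flipped
print(mse(nudged) - mse(scores))             # nonzero: a usable gradient signal
```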
02:04:04.960 | So we need something different so we want to find something that kind of
02:04:12.680 | Is pretty similar to the accuracy in that like as the accuracy gets better this ideal function we want gets better as well
02:04:22.880 | But it should not have a gradient of zero
02:04:25.760 | So let's think about that function
02:04:29.640 | Suppose we had three images
02:04:35.840 | Actually, you know what
02:04:39.800 | This is actually probably a good time to stop because actually, you know
02:04:45.040 | We've kind of we've got to the point here where we understand gradient descent
02:04:49.640 | We kind of know how to do it with a simple loss function and
02:04:55.720 | I actually think before we start looking at the MNIST loss function
02:04:59.680 | We shouldn't move on, because we've got so many assignments to do for this week already
02:05:06.400 | So we've got: build your web application, and we've got: go step through this notebook to make sure you fully understand it
02:05:15.040 | So I actually think we should probably
02:05:17.640 | Stop right here before we make things too crazy. So before I do
02:05:24.680 | Rachel are there any questions?
02:05:26.760 | Okay, great. All right. Well, thanks everybody. I'm sorry for that last-minute change of tack there, but I think this is going to make sense
02:05:35.360 | So I hope you have a lot of fun with your web applications try and think of something. That's really fun really interesting
02:05:43.200 | It doesn't have to be like important. It could just be some, you know cute thing
02:05:49.320 | We've had students before a student that I think he said he had 16 different cousins and he created something that would classify
02:05:57.280 | A photo based on which of his cousins it is, for like his fiancée meeting his family
02:06:03.440 | you know
02:06:06.000 | You can come up with anything you like
02:06:08.000 | but you know, yeah show off your application and
02:06:11.120 | Maybe have a look around at what IPY widgets can do and try and come up with something that you think is pretty cool
02:06:19.120 | All right. Thanks, everybody. I will see you next week.