Back to Index

Lesson 3 - Deep Learning for Coders (2020)


Chapters

0:0 Recap of Lesson 2 + What's next
1:8 Resizing Images with DataBlock
8:46 Data Augmentation and item_tfms vs batch_tfms
12:28 Training your model, and using it to clean your data
18:7 Turning your model into an online application
36:12 Deploying to a mobile phone
38:13 How to avoid disaster
50:59 Unforeseen consequences and feedback loops
57:20 End of Chapter 2 Recap + Blogging
64:9 Starting MNIST from scratch
66:58 untar_data and path explained
70:57 Exploring at the MNIST data
72:5 NumPy Array vs PyTorch Tensor
76:0 Creating a simple baseline model
88:38 Working with arrays and tensors
90:50 Computing metrics with Broadcasting
99:46 Stochastic Gradient Descent (SGD)
114:40 End-to-end Gradient Descent example
121:56 MNIST loss function
124:40 Lesson 3 review

Transcript

So hello and welcome to lesson three of practical deep learning for coders We We're looking at getting our model into production last week And so we're going to finish off that today and then we're going to start to look behind the scenes at what actually goes On when we train a neural network, we're going to look at Kind of the math of what's going on And we're going to learn about SGD and that's important stuff like that The the order is slightly different to the book in the book.

There's a part in the book which says like hey You can either go to lesson four or lesson three now And then go back to the other one afterwards. So we're doing lesson four and then lesson three Chapter four and then chapter three I should say You can choose it whichever way you're interested in chapter four is the more Technical chapter about the foundations of how deep learning really works or else chapter three is all about ethics And so with the lessons we'll do that next week So we're looking at 0 to production notebook and We've got to look at the fast book version the one with in fact everything I'm looking at today will be in the fast book version and Remember last week we had a look at Our bears and we created this data loaders object by using The data block API, which I hope everybody's had a chance to experiment with this week if you haven't Now's a good time to do it We kind of skipped over one of the lines a little bit Which is this item transforms?

So what this is doing here when we said resize The the images we downloaded from the internet are lots of different sizes and lots of different aspect ratios Some are tall and some are wide. I'm a square and some are big some are small When you say resize for an item transform, it means each item to an item in this case is one image It's going to be resized to 128 by 128 by squishing it or stretching it And so we had a look at you can always say show batch to see a few examples and this is what they look like um Squishing and stretching isn't the only way that we can resize remember we have everything we have to make everything into a square Before we kind of get it into our model by the time it gets to our model everything has to be the same size in Each mini-batch, but that's why and they're making it a square is not the only way to do that But it's the easiest way and it's the by far the most common way So Another way to do this Is We can create a another Data block object and we can make a data block object That's an identical copy of an existing data block object where we can then change just some pieces And we can do that by calling the new method, which is super handy.

And so let's create another data block Object and this time with different item transform where we resize using the squish method We have a question. What are the advantages of having square images versus rectangular ones? That's a great question so Really its simplicity If you know all of your images are rectangular of a particular aspect ratio to start with you may as well Just keep them that way.

But if you've got some which are tall and some which are wide Making them all square is kind of the easiest Otherwise you would have to kind of organize them such as all of the tall ones kind of ended up in a mini-batch Nor the wide ones ended up in a mini-batch and then you'd have to kind of then figure out What the best aspect ratio for each mini-batch is and we actually have some research that does that in fast AI - But it's still a bit clunky I should mention okay I just lied to you the default is not actually to squish or stretch the default I should have said sorry the default When we say resize is actually just to grab Grab the center so actually all we're doing is we're grabbing the center of each image So if we want to squish or stretch you can add the resize method dot squish Argument to resize and you can now see that this black bear is now looking much thinner But we have got the kind of leaves that are around on each side instance Another question when you use the dls dot new method what can and cannot be changed is it just the transforms So it's not dls dot new it's bears dot new right?

So we're not creating a new data lotus object We're creating a new data block object. I don't remember off the top of my head so check the documentation and I'm sure somebody can pop the answer into the into the forum So you can see when we use dot squish that this grizzly bear has got pretty kind of Wide and weird looking and this black bear has got pretty weird and thin looking and it's easiest kind of to see what's going on if we use resize method dot pad and what dot pad does as you can see is it just adds some Black bars around each side so you can see the grizzly bear was tall So then when we stretch squishing and stretching opposites of each other so when we stretched it it ended up wide and the black bear was Originally a wide rectangle so it ended up looking kind of thin You don't have to use zeros zeros means pad it with black you can also say like reflect to kind of have The the pixels will kind of look a bit better that way if you use reflect All of these different methods have their own problems The the pad method is kind of the cleanest you end up with the correct size you end up with all of the pixels But you also end up with wasted pixels so you kind of end up with wasted computation The squish method is the most efficient because you get all of the information You know and and nothing's kind of wasted but on the downside your neural net is going to have to learn to kind of like Recognize when something's been squished or stretched and in some cases it might it wouldn't even know So if there's two objects you're trying to recognize one of which tends to be thin and one of which tends to be thick And otherwise they're the same they could actually be impossible to distinguish And then the default cropping approach actually Removes some information So in this case You know this this grizzly bear here We actually lost a lot of its legs So if figuring it out what kind of bear it was Required looking at its feet.

Well. We don't have its feet anymore So they all have downsides So there's something else that you can do a different approach which is instead us to say resize you can say Random resized crop and actually this is the most common approach and what random resize crop does is each time it actually grabs a Different part of the image and kind of zooms into it Right.

So these this is all the same image and we're just grabbing a batch of four different versions of it and you can see some are kind of You know, they're all squished in different ways and we've kind of selected different subsets and so forth now this Kind of seems worse than any of the previous approaches because I'm losing information like this one here I've actually lost a whole lot of its of its back, right?

But the cool thing about this is that remember we want to avoid overfitting and When you see a different part of the animal each time It's much less likely to overfit because you're not seeing the same image on each epoch that you go around that makes sense, so This random random resized crop approach is actually super popular and so min scale 0.3 means We're going to pick at least 30% of the pixels of kind of the original size each time And then we're kind of like zoom into that that square So this idea of doing something so that each time the Model sees the image it looks a bit different to last time.

It's called data augmentation and this is one type of data augmentation It's probably the most common but there are others and One of the best ways to do data augmentation is to use this org transforms function and what org transforms does is it actually returns a list of different augmentations and so there are augmentations which change contrast which change brightness Which warp the perspective so you can see in this one here It looks like this bits much closer to you and this is much away from you because it's going to be in perspective warped It rotates them.

See this one's actually been rotated. This one's been made really dark, right? These are batch transforms not item transforms The difference is that item transforms happen one image at a time and so the thing that resizes them all to the same size That has to be an item transform Pop it all into a mini batch put it on the GPU and then a batch transform happens to a whole mini batch at a time And by putting these as batch transforms that the augmentation happens super fast because it happens on the GPU and I don't know if there's any other Libraries as as we speak which allow you to write your own GPU accelerated transformations that run on the GPU in this way So this is a super handy thing in fast AI too So you can check out the documentation for Orc transforms and when you do you'll find the documentation for all of the underlying transforms that it basically wraps, right?

So you can see if I shift tab I don't remember if I've shown you this trick before if you go inside the parentheses of a function and hit shift tab a few times it'll pop open a list of all of the arguments and so you can basically see you can say like oh Can I sometimes flip it left right?

Can I sometimes flip it up down? What's the maximum and I can rotate zoom? change the lighting Walk the perspective and so forth How can we add different augmentations for train and validation sets? So the cool thing is that Automatically fast AI will avoid doing data augmentation on the validation set so all of these org transforms will only be applied to the Training set With the exception of random resized crop random resized crop has a different behavior for each the behavior for the training set is what we just saw which is to randomly pick a subset and kind of zoom into it and the Behavior for the validation set is just to grab the center the largest center square that it can You can write your own Transformations that they're just Python.

They're the standard pytorch code The way if you and by default it will only be applied to the training set if you want to do something fancy like random Resize crop where you actually have different things being applied to each You should come back to the next course to find out how to do that or read the documentation.

It's not rocket science, but it's Not something most people need to do Okay, so Last time we we here bit did best not new with a random resized crop min scale of 0.5. We added some transforms We went ahead and trained actually since last week. I've rerun this notebook I've got it's on a different computer and I've got different images.

So it's not all exactly the same but I still got a good confusion matrix, so of the Black bears 37 were classified correctly two were grizzlies of one was a teddy now And we talked about plot top plot top losses and it's interesting. You can see in this case There's some clearly kind of odd things going on.

This is not a bear at all This looks like it's a drawing of a bear which has decided is is Predicted as a teddy, but this thing's meant to be a drawing of a black bear. I can certainly see the confusion You can see how some parts of it are being cut off.

We'll talk about how to deal with that later Now one of the interesting things is that we didn't really do Much data cleaning at all before we built this model The only data cleaning we did was just to validate that each image can be opened. There was that verify images call And the reason for that is it's actually much easier normally to clean your data after you create a model and I'll show you how We've got this thing called image classifier cleaner Where you can pick a category right and training set or validation set And then what it will do is it will then list all of the images in that set and it will pick the ones which are Which is the least confident about which the most likely to be wrong where the where the loss is the worst to be more precise and so this This is a great way to look through your data and find problems.

So in this case, the first one is Not a teddy or a brown bear or a black bear. It's a puppy dog Right. So this is a great cleaner because what I can do is I can now click delete here This one here looks a bit like an ewok rather than a teddy.

I'm not sure What do you think Rachel is an ewok? I'm gonna call it an ewok Right. And so you can kind of go through Okay, that's definitely not a teddy and so you can either say like oh that's wrong It's actually a grizzly bear or it's wrong It's a black bear or I should delete it or by default just keep it right and you can kind of keep going through until You think like okay, they're all seem to be fine Maybe that one's not And kind of once you get to the point where they all seem to be fine, you can kind of say, okay Probably all the rest are fine too because they all have lower losses.

So they all fit the kind of the mold of a teddy And so then I can write run this code here Where I just go through planar.delete. So that's all the things which I selected delete for and unlink them so unlink Is just another way of saying delete a file that's the Python name And then go through all the ones that we said change and we can actually move them to the correct directory If you haven't seen this before you might be surprised that we've kind of created our own little GUI inside Jupyter notebook Yeah, you can do this and we built this with less than a screen of code you can check out the source code in the Fast AI notebooks.

So this is a great time to remind you that This is a great time to remind you that Fast AI is built with notebooks And so if you go to the fast AI repo and clone it and then go to nbs you'll find all of the code of fast AI Written as notebooks and they've got a lot of pros and examples and tests and so forth so the best place to learn about how this is implemented is to look at the notebooks rather than looking at the module Okay By the way, sometimes you'll see like weird little comments like this These weird little comments are part of a development environment for Jupyter notebook.

We use called nbdev which we built So so far and I built this thing to make it much easier for us to kind of create books and websites and libraries in Jupyter notebooks. So this particular one here hide means When this is turned into a book or into documentation don't show this cell And the reason for that is because you can see I've actually got it in the text, right?

But I thought when you're actually running it it would be nice to have it sitting here waiting for you to run directly So that's why it's shown in the notebook, but not in the in the book. It's shown differently And you'll also see these things like s colon with a quote in the book that would end up saying Sylvain says and then what he says so there's kind of little bits and pieces in the in the notebooks that just look a little bit odd and that's because it's designed that way in order to show in order to create stuff in the Right, so then last week we saw how you can export that to a pickle file that contains all the information for the model And then on the server where you're going to actually do your inference You can then load that saved file and you'll get back a learner that you can call predict on so predict Perhaps the most interesting part of predict is the third thing that it returns Which is a tensor in this case containing three numbers The three numbers there's three of them because we have three classes teddy bear grizzly bear and black bear right and so This doesn't make any sense until you know what the order of the classes is kind of in in In your data loaders and you can ask the data loaders what the order is by asking for its vocab So a vocab in fast AI is a really common concept it's basically any time that you've got like a mapping from numbers to strings or discrete levels The mapping is always stored in the vocab.

So here this shows us that The the activation for black bear is 10 a neg 6 the activation for grizzly is 1 and the activation for teddy is 10 a neg 6 so Very very confident that this particular one. It was a grizzly not surprisingly. This was something called grizzly JPEG So you need to kind of know this This mapping in order to display the correct thing But of course the data loaders object already knows that mapping and it's all the vocab and it's stored in with the loader So that's how it knows to say grizzly automatically So the first thing it gives you is the human readable string that you'd want to display So this is kind of nice that With fast AI 2 you you save this object which has everything you need for inference.

It's got all the you know information about Normalization about any kind of transformation steps about what the vocab is so it can display everything correctly right, so now we want to Deploy this as an app Now if you've done some web programming before then all you need to know is that this line of code and this line of code So this is the line of code you would call once when your application starts up And then this is the line of code you would call Every time you want to do an inference and there's also a batch version of it which you can look up if you're interested This is just a one at a time So there's nothing special if you're already a web programmer or have access to a web programmer These are you know you just have to stick these two lines of code somewhere and the three things you get back are the The human readable string if you're doing categorization The index of that which in this case is one is grizzly and the probability of each class One of the things we really wanted to do in this course though is not assume that everybody is a web developer Most data scientists aren't but gee wouldn't it be great if all data scientists could at least like prototype an application to show off the thing they're working on and so we've Tried to kind of curate an approach which none of its stuff.

We've built. It's really is curated Which shows how you can create a GUI and create a complete application in Jupyter Notebook? so the Key pieces of technology we use to do this are IPython widgets Which is always called IPy widgets and voila IPy widgets which we import by default as widgets, and that's also what they use in their own documentation as GUI widgets for example a file upload button so if I create this file upload button and then display it I See and we saw this in the last lesson as well.

Maybe it was less than one an actual clickable button So I can go ahead and Click it and it says now, okay, you've selected one thing So how do I use that? Well these? Well these widgets have all kinds of methods and properties and the upload button has a data property Which is an array?

containing all of the images you uploaded so you can pass that to Pio image dot create and so dot create is kind of the standard Factory method we use in fast AI to create items And Pio image dot create is smart enough to be able to create an item from all kinds of different things And one of the things it can create it from is a binary blob, which is what a file upload contains so then we can display it and There's our teddy, right?

So you can see how you know cells of Jupiter notebook can refer to other cells that were created that were kind of have GUI created data in them so let's hide that teddy away for a moment and the next thing to know about is that there's a kind of widget called output and an output widget is It's basically something that You can fill in later, right?

So if I delete actually This part here, so I've now got an output widget Yeah, actually let's do it this way around And You can't see the output widget even though I said please display it because nothing is output So then in the next cell I can say with that output placeholder display a thumbnail of the image And you'll see that the the display will not appear here It appears back here Right because that's how that's where the placeholder was So let's run that again to clear out that placeholder So we can create another kind of placeholder, which is a label the labels kind of a Something where you can put text in it.

They can give it a value like I Don't know. Please choose an image Okay, so we've now got a label containing. Please choose an image. Let's create another button to do a classification Now this is not a file upload button. It's just a general button. So this button doesn't do anything All right, it doesn't do anything until we attach an event handler to it an event handler is A callback we'll be learning all about callbacks in this course If you've ever done any GUI programming before or even web programming you'll be familiar with the idea that you Write a function which is the thing you want to be called when the button is clicked on and then somehow you tell your framework That this is the on click event.

So here I go. Here's my button run. I say the on click event the button run is Recall this code and this code is going to do all the stuff. We just saw going to create an image from the upload It's going to clear the output display the image Call predict and then replace the label with a prediction So there it all is now so that hasn't done anything but I can now go back to this classify button Which now has an event handler attached to it.

So what's this? click Boom and look that's been filled in that's been filled in right in case you missed it Let's run these again to clear everything out Okay, everything's gone This is please choose an image there's nothing here I click classify oh pop pop Right. So it's kind of amazing how our notebook has suddenly turned into this interactive prototyping playground building applications and so once all this works We can dump it all together.

And so The easiest way to dump things together is to create a V box A V box is a vertical box and it's just it's just something that you put widgets in and so in this case We're going to put the following widgets we're going to have a label that says select your bear then an upload button a run button an output placeholder and a label for predictions But let's run these again just to clear everything out So that we're not cheating And let's create our V box so as you can see it's just got all the All the pieces right Now we've got Oh, I accidentally ran the thing that displayed the bear let's get rid of that You Okay, so there it is, so now I can click upload I can choose my bear Okay, and then I can click classify Right and notice I've this is exactly that this is this is like the same buttons as as these buttons They're like two places with we're viewing the same button.

Which is kind of a wild idea so if I click classify, it's going to change this label and This label because they're actually both references to the same label look There we are, okay, so This is our app right and so this is actually how I built that That image cleaner gooey is is just using these exact things and I built that image cleaner gooey Cell by cell in a notebook just like this.

And so you get this kind of interactive Experimental framework for building a gooey So if you're a data scientist who's never done gooey stuff before This is a great time to get started because now you can you can make actual programs Now of course an actual program Running inside a notebook is kind of cool.

But what we really want is this program to run In a place anybody can run it That's where voila comes in. So voila Needs to be installed. So you can just run these lines or install it It's listed in the pros and what voila does is it takes a notebook and Doesn't display anything except for the markdown The IPython widgets and the outputs Right.

So all the code cells disappear and it doesn't give the person looking at that page the ability to run their own code. They can only Interact with the widgets, right? So what I did Was a copied and pasted that code From the notebook into a separate notebook, which only has Those lines of code right, so You So these are just the same lines of code that we saw before And So this is a notebook.

It's just a normal notebook And then I installed voila and then when you do that if you navigate to this notebook But you replace Notebooks up here with voila it actually displays not the notebook, but Just as I said the markdown and the widgets though here I've got My bear classifier and I can click upload.

Let's do a grizzly bear this time And this is a slightly different version I actually made this so there's no classified button I thought it would be a bit more fancy to make it so when you click upload it just runs everything But as you can see there it all is Right.

It's all working. So This is the world's Simplest prototype, but it's it's a proof of concept, right? So you can add widgets with dropdowns and sliders and charts and you know everything that you can have in a You know an angular app or a react app or whatever and in fact, there's there's even Stuff which lets you use for example the whole Vue.js framework if you know that it's a very popular JavaScript framework the whole Vue.js framework you can actually use it in widgets and voila So now we want to get it so that this this app can be run by Someone out there in the world.

So the voila documentation shows a few ways to do that, but perhaps the easiest one is to use a system called binder And so binder is at mybinder.org and All you do is you paste in your github repository name here, right? And this is all in the book, right?

So you paste in your github repo name You change where it says File you change that to URL You can see and then you put in the path which we were just experimenting with Right So pop that here and then you say launch and what that does is it then gives you a URL So then this URL You can pass on to people and this is actually your Interactive running application.

So binders free and so this isn't you know Anybody can now use this to take their voila app and make it a publicly available web application So try it as it mentions here the first time you do this binder takes about five minutes to build your site Because it actually uses something called Docker to deploy the whole fast AI framework and Python and blah blah blah But once you've done that That virtual machine will keep running for you know, as long as people are using it.

It'll keep running for a while That virtual machine will keep running for a while as long as people are using it and you know, it's it's reasonably fast So a few things to note here Being a free service. You won't be surprised to hear this is not using a GPU.

It's using a CPU And so that might be surprising but we're deploying to something which runs on a CPU When you think about it though, this makes much more sense to deploy to a CPU than a GPU The Just a moment Um, the thing that's happening here is that I am Passing along let's go back to my app in my app.

I'm passing along a single image at a time So when I pass along that single image, I don't have a huge amount of parallel work for a GPU to do This is actually something that a CPU is going to be doing more efficiently So we found that for folks coming through this course The vast majority of the time they wanted to deploy Inference on a CPU not a GPU because they're normally just doing one item at a time It's way cheaper and easier to deploy to a CPU And the reason for that is that you can just use any hosting service you like because just remember this is just a This is just a program at this point, right?

And you can use all the usual horizontal scaling vertical scaling, you know, you can use Heroku you can use AWS You can use inexpensive instances Super cheap and super easy Having said that there are times you might need to deploy to a GPU For example, maybe you're processing videos and so like a single video on on a CPU to process it.

It might take all day or You might be so successful that you have a thousand requests per second In which case you could like take 128 at a time Batch them together and put the whole batch on the GPU and get the results back and pass them back around I mean you've got to be careful of that right because as if your requests aren't coming fast enough your user has to wait for a whole batch of people to be ready to to be processed But you know conceptually As long as your site is popular enough that could work The other thing to talk about is you might want to deploy to a mobile phone and Deploying to a mobile phone our recommendation is wherever possible Do that by actually deploying to a server and then have a mobile phone talk to the server over a network And because if you do that Again, you can just use a normal pytorch program on a normal server and normal network calls.

It makes life super easy When you try to run a pytorch app on a phone You're suddenly now not an environment where not an environment where like pytorch will run natively and so you'll have to like convert your program into some other form and There are other forms and the the main form that you convert it to is something called ONNX which is specifically designed for kind of super high speed the high performance you know a Approach that can run on both servers or on mobile phones It does not require the whole Python and pytorch kind of runtime in place but it's it's much more complex and Not using it.

It's harder to debug. It's harder to set it up, but it's harder to maintain it. So if possible keep things simple and If you're lucky enough that you're so successful that you need to scale it up to GPUs or and stuff like that then great, you know, hopefully you've got the Finances at that point to justify, you know spending money on a I went an ex expert or serving expert or whatever and there are various Systems you can use to like ONNX runtime and AWS SageMaker where you can kind of say here's my ONNX Bundle and it'll serve it for you or whatever.

Pytorch also has a mobile framework same idea So Alright, so you've got I mean, it's kind of funny. We're talking about two different kinds of deployment here one is deploying like a Hobby application, you know that you're prototyping showing off to your friends to explaining to your colleagues how something might work You know a little interactive analysis.

That's one thing. Well, but maybe you're actually prototyping something that you're Want to turn into a real product Or an actual real part of your company's operations when you're deploying You know something in in real life, there's all kinds of things you got to be careful of One example is something to be careful of is let's say you did exactly what we just did Which actually this is your homework is to create your own?

Application right? I want you to create your own image search application. You can use My exact set of widgets and whatever if you want to but better still go to the IPY widgets website and see what other widgets They have and try and come up with something cool Try and come at you know, try and show off as best as you can then show us on the forum Now let's say you decided That you want to create an app that would help The users of your app decide if they have healthy skin or unhealthy skin So if you did the exact thing we just did rather than searching for grizzly bear and teddy bear and so forth on Bing you would search for healthy skin and unhealthy skin, right?

So here's what happens, right? If I and and remember in our version, we never actually looked at being we just used the Bing API the image search API But behind the scenes, it's just using the website right, so if I click healthy if I type healthy skin and say search I Actually discover that the definition of healthy skin is Young white women touching their face lovingly so that's what your Your healthy skin classifier would learn to detect right, and so This is so this is a great example from Debra G and you should check out her paper actionable auditing for lots of cool insights about model bias, but I mean here's here's like a Fascinating example of how if you weren't looking at your data carefully You you end up With something that doesn't at all actually solve the problem you want to solve This is This is tricky right because The data that you train your algorithm on if you're building like a new product that didn't exist before by definition You don't have examples of the kind of data that's going to be used in real life Right, so you kind of try to find some from somewhere and if they and if you do that through like a Google search Pretty likely you're not going to end up with a Set of data that actually reflects the kind of mix you would see in real life So You know the main thing here is to say be careful right and and in particular for your test set You know that final set that you check on Really try hard to gather data that that reflects The real world so like just you know for example for the healthy skin example You might go and actually talk to a dermatologist and try and find like 10 examples of healthy and unhealthy skin or something And that would be your kind of gold standard test There's all kinds of issues you have to think about in deployment I can't cover all of them I can tell you that this O'Reilly book called building machine learning powered applications Is is a great?

resource And this is one of the reasons we don't go into a detail about AP to a B testing and when should we refresh our data and how do we money monitor things and so forth is because That book's already been written. So we don't want to Rewrite it I Do want to mention a particular area that I care a lot about though Which is Let's take this example Let's say you're rolling out this bear detection system and it's going to be attached to video cameras around a campsite It's going to warn campers of incoming bears.

So if we used a model That was trained with that data that we just looked at You know those are all Very nicely taken pictures of pretty perfect bears, right? There's really no relationship to the kinds of pictures You're actually going to have to be dealing with in your in your campsite bear detector, which has it's going to have video and not images It's going to be nighttime.

It's going to be probably low resolution security cameras You need to make sure that the performance of the system is fast enough to tell you about it before the bear kills you Know there will be bears that are partially obscured by bushes or in lots of shadow or whatever None of which are the kinds of things you would see normally in like internet pictures So what we call this we call this out of domain data out of domain data refers to a situation where?

The data that you are trying to do inference on is in some way different to the kind of data That you trained with and this is actually There's no perfect way to answer this question and when we look at ethics we'll talk about some really helpful ways to to Minimize how much this happens for example it turns out that having a diverse team is a great way to Kind of avoid being surprised by the kinds of data that people end up coming up with But really it's just something you've got to be super thoughtful about Very similar to that is something called Domain shift and domain shift is where maybe you start out with all of your data is in domain data But over time the kinds of data that you're seeing Changes and so over time maybe raccoons start invading your campsite and you Weren't training on raccoons before it was just a bear detector and so that's called domain shift And that's another thing that you have to be very careful of Rachel.

What's your question? No, I was just going to add to that in saying that all data is biased so there's not kind of a you know form of a debias data or perfectly representative in all cases data and that a lot of the proposals around addressing this have kind of been converging to this idea and that you see in papers like Timnit Gebru's data sheets for data sets of just writing down a lot of the Details about your data set and how it was gathered and in which situations it's appropriate to use and how it was maintained And so there that's not that you've totally eliminated bias But that you're just very aware of the attributes of your data set so that you won't be blindsided by them later And there have been kind of several proposals in that school of thought which I which I really like around this idea of just kind of Understanding how your data was gathered and what its limitations are Thanks Rachel So a key problem here is that you can't know the entire behavior of your neural network With normal programming you typed in the if statements and the loops and whatever so in theory You know what the hell it does although it's still sometimes surprising in this case you you didn't tell it anything you just gave it Examples to learn from and hope that it learns something useful There are hundreds of millions of parameters in a lot of these neural networks And so there's no way you can understand how they all combine with each other to create complex behavior so really like there's a natural compromise here is that we're trying to Get sophisticated behavior so it's like like recognizing pictures Sophisticated enough behavior we can't describe it And so the natural downside is you can't expect the process that the thing is using to do that to be Describable for you for you to be able to understand it.

So Our recommendation for kind of dealing with these issues is a very careful Deployment strategy which I've summarized in this little graph this little chart here the idea would be first of all Whatever it is that you're going to use the model for start out by doing it manually So have a have a park ranger watching for bears Have the model running next to them and each time the park ranger sees a bear They can check the model and see like did it seem to have pick it up?

So the model is not doing anything. There's just a person who's like running it and seeing would it have made sensible choices And once you're confident that it makes sense that what it's doing seems reasonable You know it's been as close to the real life situation as possible Then deploy it in a time and geography limited way so pick like one campsite not the entirety of California and do it for you know one day and Have somebody watching it super carefully, right?

So now the basic bear detection is being done by the bed at bear detector But there's still somebody watching it pretty closely and it's only happening in one campsite for one day And so then as you say like okay We haven't destroyed our company yet But let's do two campsites for a week And then let's do you know the entirety of Marin for a month and so forth.

So this is actually what we did when I used to Be at this company called optimal decisions optimal decisions was a company that I founded to do insurance pricing and if you If you change insurance prices by you know a percent or two in the wrong direction in the wrong way You can basically destroy the whole company.

This has happened many times, you know insurers are companies That set prices that's basically the the product that they provide So when we deployed new prices for optimal decisions, we always did it by like saying like, okay We're going to do it for like five minutes or everybody whose name ends with a D, you know So we'd kind of try to find some Group which hopefully would be fairly, you know It'll be different but not too many of them and we'd gradually scale it up and you've got to make sure that when you're doing this that you have a lot of Really good reporting systems in place that you can recognize Are your customers yelling at you?

Are your computers burning up? You know are your Are your computers burning up are your costs spiraling out of control and so forth so it really requires great Reporting systems This fast AI have methods built-in that provide for incremental learning ie improving the model slowly over time with a single data point each time Yeah, that's a great question.

So This is a little bit different which is this is really about Dealing with domain shift and similar issues by continuing to train your model as you do inference. And so the good news is You don't need anything special for that It's basically just a transfer learning problem. So you can do this in many different ways Probably the easiest is just to say like okay each night Probably the easiest is just to say okay each night You know at midnight we're going to set off a task which Grabs all of the previous day's transactions as mini batches and trains another epoch And so yeah that that actually works fine.

You can basically think of this as a Fine tuning approach where your pre-trained model is yesterday's model and your fine-tuning data is today's data So as you roll out your model One thing to be thinking about super carefully is that it might change the behavior of the system that it's a part of And this can create something called a feedback loop and feedback loops are one of the most challenging things for For real-world model deployment particularly of machine learning models Because they can take a very minor issue and explode it into a really big issue so for example think about a predictive policing algorithm It's an algorithm that was trained to recognize you know Basically trained on data that says whereabouts or arrests being made And then as you train that algorithm based on where arrests are being made Then you put in place a system that sends police officers to places that the model says are likely to have crime which in this case where we're Were there where were arrests?

So then more police go to that place Find more crime because the more police that are there the more they'll see they arrest more people Causing, you know, and then if you do this incremental learning like we're just talking about then it's going to say Oh, there's actually even more crime here.

And so tomorrow it sends even more police And so in that situation you end up like the predictive policing algorithm ends up kind of sending all of your police For one street block because at that point all of the arrests are happening there because that's the only place you have policemen Right.

I should say police officers So there's actually a paper about This issue called to protect and serve and in to protect and serve the authors write this really nice phrase Predictive policing is aptly named. It is predicting policing not predicting crime, so if the initial model was Perfect, whatever the hell that even means but like it somehow sent police to exactly The best places to find crime based on the probability of crimes actually being in place.

I Guess there's no problem, right? But as soon as there's any amount of bias right so for example in the US There's a lot more arrests Of black people than of white people even for crimes where black people and white people are known to do them the same amount So in the presence of this bias Or any kind of bias You're kind of like setting off this this domino chain of feedback loops where that bias will be exploded over time so You know one thing I like to think about is to think like well, what would happen if this?

If this model was just really really really good So like who would be impacted, you know, what would this extreme result look like? How would you know what was really happening this incredibly predictive algorithm that was like? Changing the behavior of yours if your police officers or whatever, you know, what would that look like?

What would actually happen? And then like think about like, okay What could go wrong and then what kind of rollout plan what kind of monitoring systems what kind of oversight? Could provide the the circuit breaker because that's what we really need here Right is we need like nothing's going to be perfect.

You can't Be sure that there's no feedback loops But what you can do is try to be sure that you see when the behavior of your system is behaving in a way That's not what you want Did you have anything to add to that Rachel? All I would add to that is that you're at risk of potentially having a feedback loop anytime that your model is kind of controlling what your next round of data looks like and I think that's true for pretty much all products and that can be I think a hard jump from people people coming from kind of a science background where you may be thinking of Data as I have just observed some sort of experiment whereas kind of whenever you're, you know Building something that interacts with the real world You are now also controlling what your future data looks like based on kind of behavior of your your algorithm for the current current round of data right, so So given that you probably can't avoid feedback loops That you know the the thing you need to then really invest in is the human in the loop And so a lot of people like to focus on automating things, which I find weird You know if you can decrease the amount of human involvement by like 90% You've got almost all of the economic upside of automating it completely But you still have the room to put human circuit breakers in place.

You need these appeals processes You need the monitoring you need, you know humans involved to kind of go Hey, that's that's weird. I don't think that's what we want Okay Yes, Rachel and I just want more note about that those humans though do need to be integrated well with kind of product and engineering and so one issue that comes up is that in many companies I think that Ends up kind of being underneath trust and safety handles a lot of sort of issues with how things can go wrong or how your Platform can be abused and often trust and safety is pretty siloed away from Product and edge which actually kind of has the the control over you know These decisions that really end up influencing them and so having they the engineers probably consider them to be pretty pretty annoying a lot Of the time how they get in the way and get in the way of them getting software out the door Yeah But like the kind of the more integration you can have between those I think it's helpful for the kind of the people Building the product to see what is going wrong and what can go wrong if the engineers are actually on top of that They're actually seeing these these things happening that it's not some kind of abstract problem anymore So, you know at this point now that we've got to the end of chapter two You actually know a lot more than most people about About deep learning and actually about some pretty important foundations of machine learning more generally and of data products more generally So now's a great time to think about writing So Sometimes we have a Formatted text that doesn't quite format correctly in Jupiter notebook, by the way It only formats correctly in the book book.

So that's what it means when you see this kind of pre-formatted text so The the idea here is to think about Starting writing at this point before you go too much further Rachel There's a question. Okay, let's get the question Question is I am I assume there are fast AI type ways of keeping a nightly updated transfer learning setup Well, could there be one of the fast AI version for notebooks have an example of the nightly transfer learning training?

Like the previous person asked I would be interested in knowing how to do that most effectively with fast AI Sure. So I guess my view is there's nothing faster. Yeah, I specific about that at all So I actually suggest you read Emmanuel's book that book I showed you to understand the kind of the ideas And if people are interested in this I can also point to it some academic research about this as well And there's not as much as that there should be But there is some there is some good work in this area Okay, so the reason we mentioned writing at this point in our journey is because You know things are going to start to get more and more heavy more and more complicated and A really good way to make sure that you're on top of it is to try to write down what you've learned So, sorry, I wasn't sharing the right part of the screen before but this is what I was describing in terms of the Pre-formatted text which doesn't look correct So When so Rachel actually has this great article that you should check out which is why you should blog and I Will say it sort of her saying because I have it in front of me and she doesn't Weird as it is.

So Rachel says That the top advice she would give her younger self is to start blogging sooner So Rachel has a math PhD in this kind of idea of like blogging was not exactly something I think they had a lot of in the PhD program but actually it's like it's a really great way of Finding jobs.

In fact, most of my students who have got the best jobs are students that have good blog posts The thing I really love is that it helps you learn by by writing down it's kind of synthesizes your ideas and Yeah, you know, there's lots of reasons to look so there's actually Something really cool.

I want to show you Yeah As also just going to note I have a second post called advice for better blog post That's a little bit more advanced which I'll post a link to as well And that talks about some common pitfalls that I've seen in many in many blog posts and kind of the importance of putting Putting the time in to do it.

Well and some things to think about so I'll share that post as well. Thanks Rachel so one reason that sometimes people Blog is because it's kind of annoying to figure out how to particularly because I think the thing that a lot of you will want to blog about is Cool stuff that you're building in Jupyter notebooks.

So we've actually teamed up with a guy called Hamel Sane And and with github to create this free product As usual with fast AI no ads. No anything called fast pages where you can actually blog with Jupyter notebooks and so You can go to fast pages and see for yourself how to do it But the basic idea is that like you literally click one button It sets up a blog for you and then you dump your notebooks into a Folder called underscore notebooks and they get turned into blog posts, it's it's basically like magic and Hamels done this amazing job of this and so This means that you can create blog posts where you've got charts and tables and images You know where they're all actually the output of Jupyter notebook along with all the the markdown Formatted text headings and so forth and Piper links and the whole thing So this is a great way to start writing about what you're learning about here So something that Rachel and I both feel strongly about when it comes to blogging is this which is Don't try to think about the absolute most advanced thing You know and try to write a blog post that would impress Jeff Hinton right because most people are not Jeff Hinton so like a you probably won't do a good job because you're trying to like blog for somebody who's more got more expertise than you and be You've got a small audience now, right?

Actually, there's far more people that are not very familiar with deep learning than people who are They try to think you know, and and you really understand what it's like What it was like six months ago to be you because you were there six months ago So try and write something which the six months ago version of you Would have been like super interesting full of little tidbits.

You would have loved You know that you would have that would have delighted you That six months ago version of you Okay, so once again Don't move on until you've had a go at the questionnaire to make sure that you You know understand the key things we think that you need to understand And yeah, have a think about these further research questions as well because they might Help you to engage more closely with material So, let's have a break and we'll come back in five minutes time So welcome back everybody This is a Interesting moment in the course because we're kind of jumping from the part of the course, which is you know, very heavily around kind of The kind of this the the the structure of like what are we trying to do with machine learning?

And what are the kind of the pieces and what do we need to know? To make everything kind of work together There was a bit of code but not masses. There was basically no math and We kind of wanted to put that at the start for everybody who's not You know who's kind of wanting to an understanding of these issues without necessarily Wanting to kind of dive deep into the code and the math themselves and now we're getting into the diving deep part if you're not Interested in that diving deep yourself.

You might want to skip to the next lesson about ethics where we you know is kind of a rule that rounds out the kind of You know slightly less technical material So what we're going to look at here is we're going to look at What we think of is kind of a toy problem but Just a few years ago is considered a pretty challenging problem The problem is recognizing handwritten digits And we're going to try and do it from scratch Right.

I'm going to try and look at a number of different ways to do it So we're going to have a look at a data set Called MNIST. And so if you've done any machine learning before you may well have come across MNIST it contains handwritten digits And it was collated into a machine learning data set by a guy called John LeCun and some colleagues and they use that to demonstrate I'm one of the You know probably the first computer system to provide really practically useful scalable recognition of handwritten digits Lynette 5 with the system was actually used to Automatically process like 10% of the checks in in the US So one of the things that really helps I think when building a new model is to kind of start with something simple and gradually scale it up.

So We've created an even simpler version of MNIST which we call MNIST sample which only has threes and sevens Okay, so this is a good starting point to make sure that we can kind of do something easy I picked threes and sevens for MNIST sample because they're very different.

So I feel like if we can't do this We're going to have trouble recognizing every digit So step one is to call untard data untard data is the fast AI Function which takes a URL Checks whether you've already downloaded it if you haven't it downloads it checks whether you've already Uncompressed it if you haven't it uncompresses it and then it finally returns the path of where that ended up So you can see here is he URLs dot MNIST sample So you could just hit tab to get autocomplete Is just some some location right doesn't really matter where it is and so then when we Call that I've already downloaded it and already uncompressed it because I've already run this once before so it happens straight away and so path goes me Where it is now in this case path is dot and the reason path is dot is because I've used a special base path attribute to path to tell it kind of like where's my Where's my starting point, you know, and and that's used to print So when I go here LS which prints a list of files, these are all relative to Where I actually untard this - this just makes it a lot easier not to have to see the whole Set of parent path folders LS is actually so so path is a See what kind of type it is.

Oh It's a path lib path object Path lib is part of the Python standard library. It's a really very very very nice library, but it doesn't actually have LS Where there are libraries that we find super helpful, but they don't have exactly the things we want We liberally add the things we want to them.

So we add LS So if you want to find out what LS is You know, there's as we've mentioned, there's a few ways you can do it You can pop a question mark there and that will show you where it comes from so there's actually a library called fast core which is a lot of the foundational stuff in fast AI that is not dependent on Pytorch or Pandas or any of these big heavy libraries So this is part of fast core and if you want to see exactly what it does you of course remember you can put in a second question mark to get the source code and as you can see there's not much source code do it and You know, maybe most importantly Please don't forget about doc Because really importantly that gives you this show in docs link which you can click on to get to the documentation to see examples Textures if relevant tutorials tests and so forth so What's so when you're looking at a new data set you kind of just use I always start with just LS see what's in it And I can see here.

There's a train folder and there's a valid folder. That's pretty normal. So let's look at ls on the train folder, and it's got a folder called 7 and a folder called 3. So this is looking quite a lot like our bear classifier dataset, where we downloaded each set of images into a folder based on what its label was. This is doing it at one more level, though: the first level of the folder hierarchy is whether it's training or valid, and the second level is what the label is.

And this is the most common way for image datasets to be distributed. So let's have a look. Let's just create something called threes that contains all of the contents of the 3 directory in train, and let's just sort them so that this is consistent. Do the same for sevens, and let's just look at the threes; you can see they're just numbered. Alright, so let's grab one of those, open it, and take a look.
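A sketch of those steps, continuing from the earlier setup cell (variable names follow the notebook):

```python
threes = (path/'train'/'3').ls().sorted()   # all the training 3s, in a consistent order
sevens = (path/'train'/'7').ls().sorted()   # all the training 7s

im3_path = threes[1]        # grab one of them...
im3 = Image.open(im3_path)  # ...and open it with PIL
im3                         # Jupyter displays the PIL image
```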

Okay, so there's the picture of a three. And so what is that, really? Well, not 'three': im3. PIL is the Python Imaging Library. It's the most popular library by far for working with images in Python, and it's a PNG, not surprisingly. Jupyter notebook knows how to display many different types, and if you create a new type, you can actually tell it how to display your type; PIL comes with something that will automatically display the image, like so. What I want to do here, though, is to look at how we're going to treat this as numbers.

Right, and so one easy way to treat things as numbers is to turn it into an array. array is part of NumPy, which is the most popular array programming library for Python, and if we pass our PIL image object to array, it just converts the image into a bunch of numbers. And the truth is, it was a bunch of numbers the whole time; it was actually stored as a bunch of numbers on disk. It's just that there's this magic thing in Jupyter that knows how to display those numbers on the screen. So when we say array, we're turning it back into a NumPy array.

We're kind of removing this ability for the Jupyter notebook to know how to display it like a picture. So once I do this, we can then index into that array and grab everything from row 4 up to but not including row 10, and all the columns from 4 up to but not including 10, and here are some numbers. They are 8-bit unsigned integers, so they are between 0 and 255. So an image, just like everything on a computer, is just a bunch of numbers, and therefore we can compute with it. We could do the same thing, but instead of saying array we could say tensor. Now, a tensor is basically the PyTorch version of a NumPy array, and you can see it's exactly the same code as above, but I've just replaced array with tensor, and the output looks almost exactly the same except it says 'tensor' instead of 'array'. You'll see that a PyTorch tensor and a NumPy array behave nearly identically much, if not most, of the time, but the key thing is that a PyTorch tensor can also be computed on a GPU, not just a CPU. So in our work, and in the book, and in the notebooks, and in our code, we tend to use PyTorch tensors much more often than NumPy arrays, because they have nearly all the benefits of NumPy arrays, plus all the benefits of GPU computation, and they've got a whole lot of extra functionality as well. A lot of people who have used Python for a long time always jump straight to NumPy because that's what they're used to. If that's you, you might want to start considering jumping to tensors: wherever you used to write array, start writing tensor, and just see what happens, because you might be surprised at how many things you can speed up or do more easily. So let's grab that three image and turn it into a tensor.
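Continuing from the sketch above, the array and tensor versions look like this; only the type of the result changes:

```python
import numpy as np

np.array(im3)[4:10, 4:10]   # a 6x6 patch of 8-bit pixel values as a NumPy array
tensor(im3)[4:10, 4:10]     # the same patch as a PyTorch tensor
```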

And so that's going to be the three image as a tensor, which is why I've called it im3_t. Okay, and let's grab a bit of it and turn it into a pandas DataFrame. The only reason I'm turning it into a pandas DataFrame is that pandas has a very convenient thing called background_gradient, which turns the background into a gradient, as you can see. So here is the top bit of the three: you can see that the zeros are the whites, the numbers near 255 are the blacks, and there are some bits in the middle which are greys. So here we can see what's going on when our images, which are numbers, actually get displayed on the screen.
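The pandas colouring trick looks roughly like this (im3_t is just the tensor version of the image from the sketches above):

```python
import pandas as pd

im3_t = tensor(im3)
df = pd.DataFrame(im3_t[4:15, 4:22])   # a slice of the image as a DataFrame
df.style.set_properties(**{'font-size': '6pt'}).background_gradient('Greys')
```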

It's just doing this. And I'm just showing a subset here; the actual full image in MNIST is a 28 by 28 pixel square, so that's 784 pixels. That's super tiny, right? My mobile phone, I don't know how many megapixels it is, but it's millions of pixels. So it's nice to start with something simple and small. Okay, so our goal is to create a model (by model I mean some kind of computer program learnt from data) that can recognize threes versus sevens.

You could think of it as a three detector: is it a three? Because if it's not a three, it's a seven. So stop here, pause the video, and have a think: how would you do it? You don't need to know anything about neural networks or anything else.

How might you, just with common sense, build a three detector? Okay, so I hope you grabbed a piece of paper and a pen and jotted some notes down. I'll tell you the first idea that came into my head: what if we grab every single three in the dataset and take the average of the pixels?

So, what's the average of this pixel, the average of this pixel, the average of this pixel, and so on? There'll be a 28 by 28 picture which is the average of all of the threes, and that would be like the ideal three. Then we'll do the same for sevens, and when we grab something from the validation set to classify, we'll say: is this image closer to the ideal three (the mean of the threes) or to the ideal seven?

That's my idea, and I'm going to call it the pixel similarity approach. I'm describing this as a baseline. A baseline is a super simple model that should be pretty easy to program from scratch with very little magic; maybe it's just a bunch of simple averages, simple arithmetic, which you're super confident is going to be better than a random model. One of the biggest mistakes I see, even in experienced practitioners, is that they fail to create a baseline, and so they create some fancy Bayesian model or some fancy neural network and they go, 'Wow Jeremy, look at my amazingly great model,' and I'll say, 'How do you know it's amazingly great?'

And they say, 'Oh look, the accuracy is 80%,' and then I'll say, 'Okay, let's see what happens if we create a model where we always predict the mean. Oh look, that's 85%.' People get pretty disheartened when they discover this, right? So make sure you start with a reasonable baseline and then gradually build on top of it. So we need to get the average of the pixels, and we're going to learn some nice Python programming tricks to do this. The first thing we need is a list of all of the sevens. Remember we've got sevens, which is just a list of file names, right? So for each of those file names in sevens, let's Image.open that file, just like we did before, to get a PIL object, and let's convert that into a tensor. This thing here is called a list comprehension.
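That list comprehension, plus displaying one of the resulting tensors, looks roughly like this (continuing from the earlier sketches):

```python
seven_tensors = [tensor(Image.open(o)) for o in sevens]   # open every 7 and turn it into a tensor
three_tensors = [tensor(Image.open(o)) for o in threes]   # same for the 3s
show_image(three_tensors[1]);                             # show_image is the fastai helper discussed below
```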

If you haven't seen this before, it's one of the most powerful and useful tools in Python. If you've done something with C#, it's a little bit like LINQ; it's not as powerful as LINQ, but it's a similar idea. If you've done some functional programming in JavaScript, it's a bit like some of the things you can do with that, too. Basically, we're just going to go through this collection; each item will be called o, and then it will be passed to this function, which opens it up and turns it into a tensor, and then it will all be collated back into a list. And so this will be all of the sevens as tensors. Sylvain and I use list and dictionary comprehensions every day, so you should definitely spend some time checking them out if you haven't already. Now that we've got a list of all of the threes as tensors, let's just grab one of them and display it. Remember, this is a tensor, not a PIL image object, so Jupyter doesn't know how to display it; we have to use some command to display it, and show_image is a fastai command that displays a tensor. And so here is our three. Now we need to get the average of all of those threes, and to get the average, the first thing we need to do is change this so it's not a list but a tensor itself. Currently one element of three_tensors has a shape which is 28 by 28.

So that's the rows by columns, the size of this thing, right? But three_tensors itself is just a list, so I can't really easily do mathematical computations on that. What we could do is stack all of these 28 by 28 images on top of each other to create a kind of 3D cube of images, and that's still a tensor; a tensor can have as many of these axes or dimensions as you like. And to stack them up you use, funnily enough, stack.

So this is going to turn the list into a tensor, and as you can see, the shape of it is now 6131 by 28 by 28, so it's kind of like a cube of height 6131 with 28 by 28 faces. The other thing we want to do, if we're going to take the mean, is turn them into floating point values, because we don't want to have integers rounding things off. And it's a standard in computer vision that when you're working with floats, you expect them to be between 0 and 1, so we just divide by 255, because they were between 0 and 255 before. This is a pretty standard way to represent a bunch of images in PyTorch. These three things here are called the axes: first axis, second axis, third axis. And overall we would say that this is a rank 3 tensor.
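A sketch of that stacking step:

```python
stacked_threes = torch.stack(three_tensors).float() / 255   # rank-3 tensor of floats between 0 and 1
stacked_sevens = torch.stack(seven_tensors).float() / 255
stacked_threes.shape                                        # torch.Size([6131, 28, 28])
```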

It has three axes. The one here earlier was a rank 2 tensor: it has two axes. You can get the rank from a tensor by just taking the length of its shape: one, two, three. Okay. I've been using the word axis; you can also use the word dimension.

I think NumPy tends to call it axis, and PyTorch tends to call it dimension, so the rank is also the number of dimensions, ndim. You need to make sure that you remember this: the rank is the number of axes or dimensions in a tensor, and the shape is a list containing the size of each axis.

So we can now say stacked_threes.mean(). Now, if we just say stacked_threes.mean(), that returns a single number: the average pixel across that whole cube, that whole rank 3 tensor. But if we say mean(0), that means take the mean over this first axis.
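Taking the mean over the first axis gives the "ideal" digits:

```python
mean3 = stacked_threes.mean(0)   # average over the 6131 images: a 28x28 "ideal" 3
mean7 = stacked_sevens.mean(0)
show_image(mean3);
```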

So that's the mean across the images, right? It's now 28 by 28 again, because we reduced over that axis of size 6131; we took the mean across that axis. And so we can show that image, and here is our ideal 3. Here's the ideal 7 using the same approach. All right, so now let's just grab a 3, any old 3, here it is. And what I'm going to do is say: is this 3 more similar to the perfect 3, or is it more similar to the perfect 7? Whichever one it's more similar to, I'm going to assume that's the answer.

Now, we can't just look at each pixel, ask what the difference is between this pixel, you know, (0, 0) here and (0, 0) here, then (0, 1) here and (0, 1) here, and take the average. The reason we can't just take the average is that there are positives and negatives, and they're going to average out to nothing, right?

So I actually need them all to be positive numbers, and there are two ways to make them all positive. I could take the absolute value, which simply means removing the minus signs, and then take the average of those; that's called the mean absolute difference, or L1 norm. Or I could take the square of each difference, then take the mean of that, and at the end take the square root, which kind of undoes the squaring; that's called the root mean squared error, or L2. So let's have a look: let's take a 3, subtract from it the mean of the threes, take the absolute value, and take the mean. Call that the distance, using absolute value, from our three to the ideal three. And there's the number, about 0.1, right?

So this is the mean absolute difference, or L1 norm. When you see a term like L1 norm, if you haven't seen it before, it may sound pretty fancy, but all these math terms that we see can be turned into a tiny bit of code, right?

Don't let the mathy bits put you off: in code it's often just very obvious what they mean, whereas with math you just have to learn it, or learn how to Google it. So here's the same version with squaring: take the difference, square it, take the mean, and then take the square root. Then we'll do the same thing for our three, but this time we'll compare it to the mean of the sevens.

So the distance from our three to the mean of the threes, in terms of absolute value, was about 0.1, and the distance from the three to the mean of the sevens was about 0.15. So it's closer to the mean of the threes than it is to the mean of the sevens.
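Those two distances, written out (a_3 here is just one sample 3 pulled from the stacked tensor above):

```python
a_3 = stacked_threes[1]                           # any old 3

dist_3_abs = (a_3 - mean3).abs().mean()           # L1 / mean absolute difference, about 0.1
dist_3_sqr = ((a_3 - mean3) ** 2).mean().sqrt()   # L2 / root mean squared error

dist_7_abs = (a_3 - mean7).abs().mean()           # same again, against the ideal 7 (about 0.15)
dist_7_sqr = ((a_3 - mean7) ** 2).mean().sqrt()
```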

We therefore guess that this is a three, based on the mean absolute difference. The same thing with RMSE, root mean squared error, would be to compare this value with this value, and again, by root mean squared error it's closer to the mean three than to the mean seven. So this is, kind of, a machine learning model: it's a data-driven model which attempts to recognize threes versus sevens. And so this is a good baseline.

I mean, it's a reasonable baseline; it's going to be better than random. We don't actually have to write out the subtract, abs, mean ourselves: we can just use l1_loss, which does exactly that. And we don't have to write out the subtract and square: we can just write mse_loss. That doesn't do the square root by default, so we have to pop that in.
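Using the built-in loss functions instead (F is torch.nn.functional, which fastai also exports):

```python
import torch.nn.functional as F

F.l1_loss(a_3.float(), mean7), F.mse_loss(a_3, mean7).sqrt()
```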

So we have to pop that in Okay, and as you can see they're exactly the same numbers It's very important before we kind of go too much further to make sure we're very comfortable Working with arrays and tensors and you know, they're they're so similar So we could start with a list of lists, right?

That's kind of a matrix. We can convert it into an array or into a tensor, we can display it, and they look almost the same. You can index into a single row, and you can index into a single column, and it's important to know (this is very important) that a colon means every row when I put it in the first spot, right?

If I put it in the second spot, it would mean every column, and so a trailing comma-colon is exactly the same as removing it. It just turns out you can always remove colons that are at the end, because they're just implied; you never have to include them, and I often put them in anyway because it makes it a bit more obvious how these things match up or how they differ. You can combine them together: give me the first row, and everything from the first up to but not including the third column.
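A small playground covering the indexing and arithmetic described here (made-up numbers, just for fiddling):

```python
data = [[1, 2, 3], [4, 5, 6]]
tns = tensor(data)

tns[1]              # a row: tensor([4, 5, 6])
tns[:, 1]           # a column: tensor([2, 5])
tns[1, 1:3]         # row 1, columns 1 up to (not including) 3: tensor([5, 6])
tns + 1             # element-wise arithmetic
tns.type()          # 'torch.LongTensor'
(tns * 1.5).type()  # multiplying by a float gives 'torch.FloatTensor'
```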

So there's that 5, 6. You can add stuff to them. You can check their type; notice that this is different from the Python type. type as a function just tells you it's a tensor; if you want to know what kind of tensor, you have to use type as a method. So it's a LongTensor. You can multiply them by a float, which turns them into floats. Have a fiddle around: if you haven't done much with NumPy or PyTorch before, this is a good opportunity to just go crazy, try things out, try things that you think might not work, and see if you actually get an error message. So, we now want to find out: how good is our model?

Our model just involves comparing something to the mean. We should not check how good our model is on the training set, as we've discussed; we should check it on a validation set, and we already have a validation set: it's everything inside the valid directory. So let's go ahead and combine all those steps from before. Let's go through everything in the validation set's 3 directory with ls, open them, turn them into tensors, stack them all up, turn them into floats, and divide by 255. Okay, and let's do the same for the sevens. So we're just putting all the steps we did before into a couple of lines. I always try to print out shapes, all the time, because if a shape is not what you expected, you can get weird things going on. So the idea is we want some function, is_3, that will return true if we think something is a 3. To do that, we have to decide whether the digit we're testing is closer to the ideal 3 or the ideal 7. So let's create a little function, mnist_distance, that takes the difference between two things, takes the absolute value, and then takes the mean. And look at this: we've got minus numbers in the mean this time.

It takes the mean over the last and second-last dimensions. So this is going to take the mean across the x and y axes, and here you can see it's returning a single number, which is the distance of a 3 from the mean 3; that's the same as the value that we got earlier, 0.1114. We need to do this for every image in the validation set, because we're trying to find the overall metric. Remember, the metric is the thing we look at to say: how good is our model?
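Put together, the validation tensors and the distance function look roughly like this (a_3 and mean3 come from the earlier sketches):

```python
valid_3_tens = torch.stack([tensor(Image.open(o))
                            for o in (path/'valid'/'3').ls()]).float() / 255
valid_7_tens = torch.stack([tensor(Image.open(o))
                            for o in (path/'valid'/'7').ls()]).float() / 255
valid_3_tens.shape, valid_7_tens.shape     # always print shapes!

def mnist_distance(a, b):
    # mean absolute difference over the last two axes (the 28x28 pixel grid)
    return (a - b).abs().mean((-1, -2))

mnist_distance(a_3, mean3)                 # a single number, about 0.1114
```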

So here's something crazy: we can call mnist_distance not just on our 3, but on the entire validation set, against the mean 3. That's wild. There's no normal programming we would do where we could somehow pass in either a matrix or a rank 3 tensor and it somehow works both times. What actually happened here is that, instead of returning a single number,

it returned 1,010 numbers. And it did this because it used something called broadcasting. Broadcasting is like the super special magic trick that lets you make Python into a very, very high-performance language. In fact, if you do this broadcasting on GPU tensors in PyTorch, it actually does the operation on the GPU, even though you wrote it in Python. Here's what happens. Look here at this a - b: we're doing a - b on two things.

First of all, valid_3_tens, the validation 3s tensor, is a thousand or so images, right? And remember that mean3 is just our single ideal 3. So what is something of this shape minus something of this shape? Well, broadcasting means that if this shape doesn't match this shape (if they did match, it would just subtract every corresponding item), then because they don't match, it actually acts as if there are 1,010 versions of this. So it's actually going to subtract this from every single one of these. So, broadcasting: let's look at some examples. Broadcasting requires us first of all to understand the idea of element-wise operations. This is an element-wise operation.

Here is a rank 1 tensor of size 3 and another rank 1 tensor of size 3. We would say these sizes match; they're the same. And so when I add 1, 2, 3 to 1, 1, 1, I get back 2, 3, 4. It just takes the corresponding items and adds them together.
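A sketch of both ideas, using the tensors from the earlier cells: an element-wise operation on matching shapes, and the broadcast subtraction being described:

```python
tensor([1, 2, 3]) + tensor([1, 1, 1])   # shapes match: tensor([2, 3, 4])

(valid_3_tens - mean3).shape            # shapes don't match: mean3 is broadcast across all 1010 images
                                        # -> torch.Size([1010, 28, 28])
```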

So those are called element-wise operations. When I have different shapes, as we described before, what it ends up doing is basically copying this smaller tensor 1,010 times, and it acts as if we had said valid_3_tens minus 1,010 copies of mean3. As it says here, it doesn't actually copy mean3 1,010 times; it just pretends that it did, right?

It just acts as if it did, basically looping back around to the start again and again, and it does the whole thing in C, or in CUDA on the GPU. Then we take the absolute value, right? So let's go back up here: after we do the minus, we take the absolute value. But what happens when we call absolute value on something of size 1010 by 28 by 28? It just calls absolute value on each underlying element. And then finally we call mean. Minus one is always the last element in Python, and minus two is the second last, so this is taking the mean over the last two axes, and then it's going to return just the first axis.

So we're going to end up with 1,010 means, 1,010 distances, which is exactly what we want: we want to know how far away each of our validation items is from the ideal three. Then we can create our is_3 function, which asks: is the distance between the number in question and the perfect three less than the distance between the number in question and the perfect seven? If it is, it's a three.

All right, so for our three (that was an actual three we had), is it a three? Yes. Okay, and then we can turn that into a float, and 'yes' becomes 1. Thanks to broadcasting, we can do it for that entire set. So this is so cool: we basically get rid of loops. In this kind of programming you should have very, very few loops; loops make things much harder to read, and hundreds of thousands of times slower on the GPU, potentially tens of millions of times slower. So we can just call is_3 on our whole valid_3_tens, turn that into floats, and then take the mean.

You should have very few very very few loops loops make things much harder to read And and hundreds of thousands of times slower on the GPU potentially tens of millions of times slower So we can just say is three on our whole Valid three tens and then turn that into float and then take the mean So that's going to be the accuracy of the threes on average and here's the accuracy of the sevens.

It's just one minus that So the accuracy across threes is about 91 and a bit percent the accuracy on sevens is about 98 percent and the average of those two is about 95 percent. So here we have a Model that's 95 percent accurate at recognizing threes from sevens It might surprise you That we can do that using nothing but Arithmetic, right?

So that's what I mean by getting a good baseline. Now, the thing is, it's not obvious how we improve this, right? It doesn't match Arthur Samuel's description of machine learning. This is not something where there's a function which has some parameters which we're testing against some kind of measure of fitness, and then using that to improve the parameters iteratively. We just did one step, and that's that.

We kind of we just did one step and That's that right So we want to try and do it in this way where we arrange for some automatic means of testing the effectiveness of he called It a weight assignment We'd call it a parameter assignment in terms of performance and a mechanism for Alternating altering the weight assignment to maximize the performance that we want to do it that way Right because we know from from chapter one from lesson one, but if we do it that way we have this like Magic box, right called machine learning that can do, you know, particularly combined with neural nets should be able to solve any problem in theory If you can at least find the right set of weights So we need something that we can get better and better alone So let's think about a function which has parameters So instead of finding an ideal image and seeing how far away something is from the ideal image So instead of like having something where we test how far away we are from an ideal image What we could instead do is come up with a set of weights For each pixel.

We're trying to find out if something is the number three, and we know that in the places where you would expect to find 'three' pixels, you could give those high weights. So you can say: hey, if there's a dot in those places, we give it a high score, and if there are dots in other places, we'll give it a low score. We could actually come up with a function where the probability of something being, well, in this case let's say an 8, is equal to the pixels in the image multiplied by some set of weights, and then we sum them up. So then anywhere the image we're looking at has pixels in places where there are high weights, it's going to end up with a high probability.
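As a sketch, that scoring function is just a weighted sum of the pixels (the name pr_eight follows the book's example of looking for an 8):

```python
def pr_eight(x, w):
    # x: the image flattened into a vector; w: one weight per pixel
    return (x * w).sum()
```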

Here x is the image that we're interested in, and we're just going to represent it as a vector: let's just have all the rows stacked up end-to-end into a single long line. So we're going to use an approach where we start with a vector w.

A vector is a rank 1 tensor. We're going to start with a vector w that's going to contain random weights, or random parameters, depending on whether you use the Arthur Samuel version of the terminology or not. We'll then predict whether a number appears to be a 3 or a 7 by using this tiny little function, and then we will figure out how good the model is; we will calculate how accurate it is, or something like that: the idea is the loss. Then the key step is that we're going to calculate the gradient. The gradient is something that measures, for each weight: if I made it a little bit bigger, would the loss get better or worse?

If I made it a little bit bigger Would the loss get better or worse? If I made it a little bit smaller Would the loss get better or worse? And so if we do that for every weight We can decide for every weight whether we should make that weight a bit bigger or a bit smaller So that's called the gradient, right?

Once we have the gradient, we then step (that's the word we use): we change all the weights up a little bit for the ones where the gradient said we should make them a bit higher, and down a little bit for the ones where the gradient said they should be a bit lower. So now the model should be a tiny bit better, and then we go back to step 2: calculate a new set of predictions using this formula, calculate the gradient again, step the weights, and keep doing that. This is basically the flow chart, and then at some point, when we're sick of waiting or when the loss gets good enough, we'll stop. So these seven steps, one, two, three, four, five, six, seven, are the key to training all deep learning models. This technique is called stochastic gradient descent. Well, really it's just called gradient descent.

We'll see the stochastic bit very soon. For each of these seven steps there are lots of choices around exactly how to do it, right? We've just hand-waved a lot: what kind of random initialization, how do you calculate the gradient, exactly what step do you take based on the gradient, how do you decide when to stop, blah blah blah, right?

So in this course we're going to be learning about these steps. That's kind of part one, and then the other big part is: well, what's the actual function, the neural network? So, how do we train the thing, and what is the thing that we train?

So: we initialize parameters with random values; we need some function that's going to be the loss function, which will return a number that's small if the performance of the model is good; we need some way to figure out whether each weight should be increased a bit or decreased a bit; and then we need to decide when to stop, which we'll just say is after a certain number of epochs. So let's go even simpler, right?

We're not even going to do MNIST; we're going to start with this function, x squared. In fastai we've created a tiny little thing called plot_function that plots the function. All right, so there's our function f, and what we're going to do (this is our loss function) is try to find the bottom point. We're going to try and figure out: what is the x value which is at the bottom?
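A sketch of that toy setup; plot_function here is assumed to be the small plotting helper shipped with the course notebooks:

```python
def f(x):
    return x ** 2          # our stand-in "loss function"

plot_function(f, 'x', 'x**2')
```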

Our seven-step procedure requires us to start out by initializing, so we need to pick some value, right? The value we pick, let's just say, oh, let's randomly pick minus one and a half. Great. So now we need to know: if I increase x a bit, does my loss (remember, this is my loss) get a bit better,

remember, better means smaller, or a bit worse? We can do that easily enough: we can just try a slightly higher x and a slightly lower x and see what happens. And you can see it's just the slope, right? The slope at this point tells you that if I increase x by a bit, then my loss will decrease, because that is the slope at this point. So we change our weight, our parameter, just a little bit in the direction of the slope.

So here is the direction of the slope, and here's the new value at that point. Then do it again, and again, and eventually we'll get to the bottom of this curve. This basic idea of repeatedly following the slope downhill goes back a very long way, at least to the time of Isaac Newton. A key thing we need to be able to do is calculate this slope, and the bad news is that to do that we need calculus; at least, that's bad news for me, because I've never been a fan of calculus. We have to calculate the derivative. Here's the good news, though: maybe you spent ages in school learning how to calculate derivatives, but you don't have to anymore. The computer does it for you, and the computer does it fast. It uses all of those

methods that you learned at school, and it has a whole lot more clever tricks for speeding them up, and it just does it all automatically. So for example, it knows (I don't know if you remember this from high school) that the derivative of x squared is 2x. It's just something it knows.

It's part of its bag of tricks, right? So PyTorch knows that; PyTorch has an engine built in that can take derivatives and find the gradient of functions. To do that, we start with a tensor, let's say, and in this case we're going to modify this tensor with this special method called requires_grad_. What this does is tell PyTorch that any time I do a calculation with this xt, it should remember what calculation it did, so that I can take the derivative later. You see the underscore at the end? An underscore at the end of a method in PyTorch means it's an in-place operation: it actually modifies the tensor. So requires_grad_ modifies this tensor to tell PyTorch that we want to be calculating gradients on it, which means it's going to have to keep track of all of the computations we do, so that it can calculate the derivative later. Okay, so we've got the number 3, and let's say we then call f on it.
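Continuing with the f defined above, the whole thing is only a few lines:

```python
xt = tensor(3.).requires_grad_()   # remember calculations done with xt, so we can take derivatives later
yt = f(xt)                         # 9, with a grad_fn attached recording the power operation
yt.backward()                      # "backward" = backpropagation = take the derivative
xt.grad                            # tensor(6.): the derivative of x**2 at x=3 is 2x = 6
```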

Remember, f is just squaring it, so 3 squared is 9. But the value is not just 9: it's 9 accompanied by a grad function, which means it knows that a power operation has been taken. We can now call a special method, backward, which refers to backpropagation (which we'll learn about), and which basically means 'take the derivative'. Once it does that, we can look inside xt (because we said requires_grad_) and find out its gradient. Remember, the derivative of x squared is 2x; in this case x was 3, and 2 times 3 is 6. So we didn't have to figure out the derivative: we just call backward and then get the grad attribute to get the derivative. That's how easy it is to do calculus in PyTorch. What you need to know about calculus is not how to take a derivative, but what it means, and what it means is: it's the slope at some point. Now here's something interesting: let's not just take 3, but let's take a rank 1 tensor, also known as a vector, 3, 4, 10, and let's add sum to our f function.
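The vector version, as a sketch (the summed function gets a hypothetical name here so it doesn't clobber the f above):

```python
def f_sum(x):
    return (x ** 2).sum()              # sum() gives a single number we can call backward() on

xt = tensor([3., 4., 10.]).requires_grad_()
yt = f_sum(xt)                         # tensor(125.)
yt.backward()
xt.grad                                # tensor([ 6.,  8., 20.]): 2x for every element
```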

So it's going to go x squared dot sum, and now we can take f of this vector, get back 125, and then we can say backward and grad, and look: 2x, 2x, 2x. So we can calculate (this is vector calculus, right) the gradient for every element of a vector with the same two lines of code. That's kind of all you need to know about calculus, and if this idea that a derivative or gradient is a slope is unfamiliar, check out Khan Academy; they have some great introductory calculus, and don't forget you can skip all the bits where they teach you how to calculate the gradients yourself. So now we know how to calculate the gradient, that is, the slope of the function, which tells us: if we change our input a little bit, how will our output change?

That's what a slope is, right? And so that tells us, for every one of our parameters, if we know their gradients, then we know how changing that parameter up a bit or down a bit will change our loss. Therefore we know how to change our parameters. What we do is this: say all of our weights are called w; we just subtract from them the gradients multiplied by some small number, and that small number is often a number between about 0.001 and 0.1. It's called the learning rate, and this here is the essence of gradient descent. If you pick a learning rate that's very small, then you take the slope and you take a really small step in that direction, and another small step, and another small step, and another small step, and it's going to take forever to get to the end.
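The essence of that update is one line; here it is in a tiny self-contained toy (the numbers are arbitrary, just to show the mechanics):

```python
lr = 1e-1                              # learning rate: usually somewhere between about 0.001 and 0.1
w = tensor(2.).requires_grad_()        # a single toy parameter
loss = w ** 2                          # a pretend loss
loss.backward()
w.data -= lr * w.grad                  # the gradient-descent step: w -= gradient * learning rate
w.grad = None                          # clear the gradient before the next step
```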

It's going to take forever to get to the end If you pick a learning rate, that's too big You jump way too far Each time and again, it's going to take forever And in fact in this case, sorry at this case We're assuming we're starting here and it's actually so big that it got worse and worse Or here's one where we start here and it's like it's not So big it gets worse and worse, but it just takes a long time to bounce in and out right, so Picking a good learning rate is really important both to making sure that it's even possible To solve the problem and that it's possible to solve it in a reasonable amount of time So we'll be learning about picking how to pick learning rates in this course So Let's try this.

Let's try using gradient descent (I said SGD; that's not quite accurate, it's just going to be plain gradient descent) to solve an actual problem. The problem we're going to solve is this: let's imagine you were watching a roller coaster go over the top of a hump, right? As it comes out of the previous hill it's going super fast, and as it goes up the hill it goes slower and slower and slower until it gets to the top of the hump, and then as it goes down the other side it goes faster and faster and faster. So imagine you had a stopwatch or some kind of speedometer and you were measuring its speed by hand at roughly equal time points.

You might end up with something that looks a bit like this, right? And the way I did this was: I just grabbed a range, which just gives the numbers from nought up to but not including 20; these are the time points at which I'm taking my speed measurements. Then I've got a quadratic function: I take my time minus 9.5, square it, multiply by 0.75, add 1, and then I add a bit of random noise to every observation. So I end up with a quadratic function which is a bit bumpy, and that's kind of what it might look like in real life, because my speedometer testing is not perfect. All right, so we want to create a function that estimates, at any time, what the speed of the roller coaster is.
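The data-generating cell looks roughly like this (it matches the noisy quadratic just described):

```python
import matplotlib.pyplot as plt

time = torch.arange(0, 20).float()                         # measurement times 0..19
speed = torch.randn(20) * 3 + 0.75 * (time - 9.5)**2 + 1   # noisy quadratic "speed" readings
plt.scatter(time, speed);
```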

What is the speed of the roller coaster? so we start by Guessing what function it might be so we guess that it's a function A times time squared plus B times time plus C. You might remember from school is called a quadratic so let's create a function right and so Let's create it using kind of the Arthur Samuel's technique the machine learning technique.

This function is going to take two things It's going to take an input which in this case is a time and it's going to take some parameters and The parameters are a B and C. So in in Python you can split out a List or a collection into its components like so and then here's that function.

Okay So we're not just trying to find any function in the world. We're just trying to find some function Which is a quadratic by finding an A and a B and a C So the the Arthur Samuel technique for doing this is to next up come up with a loss function Come up with a measurement of how good we are.

So if we've got some predictions That come out of our function and the targets which are these you know actual values Then we could just do the Mean squared error. Okay. So here's that mean squared error we saw before the difference squared and take the mean So now we need to go through our seven-step process We want to come up with a set of three parameters A B and C Which are as good as possible.

The step one is to initialize A B and C to random values So this is how you get random values three of them in PyTorch and remember we're going to be adjusting them So we have to tell PyTorch that we want the gradients I'm just going to save those away so I can check them later and then I calculate the predictions using that function f Which was this?

And then let's create a little function which just plots how good at this point are our predictions So here is a function that prints in red our predictions and in blue our targets, so that looks pretty terrible But let's calculate the loss Using that MSE function we wrote Okay, so now we want to improve this so calculate the gradients using the two steps we saw Call backward and then get grad and this says that each of our Parameters has a gradient that's negative Let's pick a learning rate of 10 to the minus 5 so we multiply that by 10 to the minus 5 And step the weights and remember step the weights means minus equals learning rate times The gradient there's a wonderful trick here which I've called dot data the reason I've called dot data is dot data is a special attribute in PyTorch which if you use it then the gradient is not calculated and we certainly wouldn't want the gradient to be calculated of The actual step we're doing we only want the gradient to be calculated of our function f Right.

So when we step the weights we have to use this special dot data attribute After we do that Delete the gradients that we already had And let's see if loss improved. So the loss before was 25,800 Now it's 5,400 and the plot has gone from something that goes down to minus 300 Oh to something that looks much better So let's do that a few times So I've just grabbed those previous lines of code and pasted them all into a single cell Okay, so preds loss backward data grad is none and then from time to time print the loss out and Repeat that ten times and look getting better and better And so we can actually look at it getting better and better So this is pretty cool, right?

We have a technique. This is the Arthur Samuel technique for Finding a set of parameters that Continuously improves by getting feedback from the result of measuring some loss function So that was kind of the key step, right? This this is the gradient descent method So you should make sure that you kind of go back and feel super comfortable with what's happened and You know if you're not feeling comfortable that that's fine Right if it's been a while or if you've never done this kind of gradient descent before This might feel super unfamiliar So kind of trying to find the first cell in this notebook where you don't fully understand what it's doing And then like stop and figure it out like look at everything that's going on do some experiments do some reading until you understand That cell where you were stuck before you move forwards So let's now apply this to MNIST So for MNIST We want to use this exact technique and there's basically nothing extra we have to do except one thing we need a loss function and The metric that we've been using is the error rate or the accuracy It's like how often are we correct?

Right and and that's the thing that we're actually trying to make Good now metric, but we've got a very serious problem Which is remember we need to calculate the gradient To figure out how we should change our parameters and the gradient is the slope or this deepness Which you might remember from school is defined as rise over run It's y new minus y old divided by x new minus x old so The gradients actually defined when x new is is very very close to x old Meaning that difference is very small But think about it Accuracy if I change a parameter by a tiny tiny tiny amount The accuracy might not change at all Because there might not be any Three that we now predict as a seven or any seven that we now predict as a three Because we change the parameter by such a small amount So it's it's it's possible.

In fact, it's certain that the gradient is zero at Many places and that means that our parameters Aren't going to change at all because learning rate times gradient is still zero when the gradients zero for any learning rate So this is why the loss Function and the metric are not always the same thing we can't use a metric as Our loss if that metric has a gradient of zero So we need something different so we want to find something that kind of Is pretty similar to the accuracy in that like as the accuracy gets better this ideal function we want gets better as well But it should not have a gradient of zero So let's think about that function Suppose we had three images Actually, you know what This is actually probably a good time to stop because actually, you know We've kind of we've got to the point here where we understand gradient descent We kind of know how to do it with a simple loss function and I actually think before we start looking at the MNIST loss function We shouldn't move on because we've got so much so much assignments to do for this week already So we've got build your web application and we've got go step through step through this notebook to make sure you fully understand it So I actually think we should probably Stop right here before we make things too crazy.

So before I do, Rachel, are there any questions? Okay, great. All right, well, thanks everybody. I'm sorry for that last-minute change of tack there, but I think this is going to make sense. I hope you have a lot of fun with your web applications; try and think of something that's really fun, really interesting.

That's really fun really interesting It doesn't have to be like important. It could just be some, you know cute thing We've had students before a student that I think he said he had 16 different cousins and he created something that would classify A photo based on which of his cousins is for like his fiance meeting his family you know You can come up with anything you like but you know, yeah show off your application and Maybe have a look around at what IPY widgets can do and try and come up with something that you think is pretty cool All right.

Thanks, everybody. I will see you next week.