Lesson 2 - Deep Learning for Coders (2020)
Chapters
0:00 Lesson 1 recap
2:10 Classification vs Regression
4:50 Validation data set
6:42 Epoch, metrics, error rate and accuracy
9:07 Overfitting, training, validation and testing data set
12:10 How to choose your training set
15:55 Transfer learning
21:50 Fine tuning
22:23 Why transfer learning works so well
28:26 Vision techniques used for sound
29:30 Using pictures to create fraud detection at Splunk
30:38 Detecting viruses using CNN
31:20 List of most important terms used in this course
31:50 Arthur Samuel’s overall approach to neural networks
32:35 End of Chapter 1 of the Book
40:04 Where to find pretrained models
41:20 The state of deep learning
44:30 Recommendation vs Prediction
45:50 Interpreting Models - P value
57:20 Null Hypothesis Significance Testing
62:48 Turn predictive model into something useful in production
74:06 Practical exercise with Bing Image Search
76:25 Bing Image Sign up
81:38 Data Block API
88:48 Lesson Summary
00:00:06.620 |
Practical Deep Learning for Coders. This is lesson 2, and in the last lesson we started 00:00:14.060 |
training our first models. We didn't really have any idea how that training 00:00:18.760 |
was really working, but we were looking at a high level at what was going on and 00:00:24.040 |
we learned about what is machine learning and how does that work and we 00:00:35.320 |
realized that based on how machine learning worked that there are some 00:00:40.400 |
fundamental limitations on what it can do and we talked about some of those 00:00:45.080 |
limitations and we also talked about how after you've trained a machine learning 00:00:48.680 |
model you end up with a program which behaves much like a normal program or 00:00:54.240 |
something with inputs and a thing in the middle and outputs. So today we're 00:00:59.520 |
gonna finish up talking about that and we're going to then look 00:01:05.040 |
at how we get those models into production and what some of the issues 00:01:08.400 |
with doing that might be. I wanted to remind you that there are two sets of 00:01:16.360 |
books, sorry two sets of notebooks available to you. One is the 00:01:22.320 |
fastbook repo, the full actual notebooks containing all the text of the O'Reilly 00:01:29.000 |
book and so this lets you see everything that I'm telling you in much more detail 00:01:35.920 |
and then as well as that there's the course v4 repo which contains exactly 00:01:42.520 |
the same notebooks but with all the prose stripped away to help you study. So 00:01:47.640 |
that's where you really want to be doing your experimenting and your practice, and so 00:01:51.640 |
maybe as you listen to the video you can kind of switch back and forth between 00:01:56.600 |
the video and reading or do one and then the other and then put it away and have 00:02:01.280 |
a look at the course v4 notebooks and try to remember like okay what was this 00:02:04.800 |
section about and run the code and see what happens and change it and so forth. 00:02:11.200 |
So we were looking at this line of code where we looked at how we created our 00:02:21.360 |
data by passing in information perhaps most importantly some way to label the 00:02:28.920 |
data and we talked about the importance of labeling, and in this case, for this 00:02:33.000 |
particular data set, whether it's a cat or a dog you can tell by whether it's an 00:02:37.000 |
uppercase or a lowercase letter in the first position; that's just how this data 00:02:41.520 |
set works, as they tell you in the README, and we also looked particularly at 00:02:46.360 |
this idea of valid percent equals 0.2 and like what does that mean, it creates a 00:02:51.320 |
validation set and that was something I wanted to talk more about. The first thing 00:02:57.560 |
I do want to do though is point out that this particular labeling function 00:03:05.320 |
returns something that's either true or false, and actually this data set as we'll 00:03:11.480 |
see later also contains the actual breed of 37 different cat and dog 00:03:17.160 |
breeds so you can also grab that from the file name. 00:03:23.120 |
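For reference, here is roughly what that data-loading line looks like; it follows the book's pets example (chapter 1), so treat the paths and arguments as a sketch rather than a verbatim copy of the lesson notebook:

    from fastai.vision.all import *

    # download the Oxford-IIIT Pets data set (cats and dogs, 37 breeds)
    path = untar_data(URLs.PETS)/'images'

    # labeling function: in this data set, cat images have an uppercase first letter
    def is_cat(x):
        return x[0].isupper()

    # valid_pct=0.2 holds out 20% of the images as the validation set
    dls = ImageDataLoaders.from_name_func(
        path,
        get_image_files(path),
        valid_pct=0.2,
        seed=42,
        label_func=is_cat,
        item_tfms=Resize(224),
    )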
In each of those two cases we're trying to predict a category: is it a cat or is it a dog, or is it a 00:03:29.680 |
German Shepherd or a beagle or rag doll cat or whatever when you're trying to 00:03:36.360 |
predict a category so when the label is a category we call that a classification 00:03:42.160 |
model. On the other hand you might try to predict how old is the animal or how 00:03:50.120 |
tall is it or something like that which is like a continuous number that could 00:03:55.680 |
be like 13.2 or 26.5 or whatever anytime you're trying to predict a number your 00:04:02.000 |
label is a number you call that regression okay so those are the two 00:04:06.720 |
main types of model, classification and regression; this is very important jargon 00:04:11.000 |
to know about so the regression model attempts to predict one or more numeric 00:04:16.840 |
quantities such as temperature or location or whatever this is a bit 00:04:21.240 |
confusing because sometimes people use the word regression as a shortcut, like 00:04:25.920 |
an abbreviation, for a particular kind of model called linear 00:04:30.720 |
regression that's super confusing because that's not what regression means 00:04:36.240 |
linear regression is just a particular kind of regression but I just wanted to 00:04:40.320 |
warn you of that when you start talking about regression a lot of people will 00:04:45.280 |
assume you're talking about linear regression even though that's not what 00:04:48.080 |
the word means. Alright so I wanted to talk about this valid percent 0.2 thing 00:04:54.840 |
so as we described, valid percent grabs in this case 20% of the data, since it's 0.2, and 00:05:02.040 |
puts it aside like in a separate bucket, and then when you train your model your 00:05:08.120 |
model doesn't get to look at that data at all; that data is only used 00:05:13.960 |
to show you how accurate your model is. So if you train for too long and or with 00:05:24.200 |
not enough data and or a model with too many parameters after a while the 00:05:28.920 |
accuracy of your model will actually get worse and this is called overfitting 00:05:34.040 |
right so we use the validation set to ensure that we're not overfitting the 00:05:42.200 |
next line of code that we looked at is this one where we created something 00:05:46.880 |
called a learner we'll be learning a lot more about that but a learner is 00:05:50.320 |
basically something which contains your data and your architecture, that is 00:05:57.320 |
the mathematical function that you're optimizing and so a learner is the thing 00:06:03.000 |
that tries to figure out what are the parameters which best cause this 00:06:06.920 |
function to match the labels in this data. So we'll be talking a lot more about 00:06:13.640 |
that but basically this particular function resnet 34 is the name of a 00:06:18.960 |
particular architecture which is just very good for computer vision problems 00:06:23.480 |
in fact the name really is resnet and then 34 tells you how many layers there 00:06:29.040 |
are so you can use ones with bigger numbers here to get more parameters that 00:06:33.040 |
will take longer to train, take more memory, and be more likely to overfit, but could 00:06:38.160 |
also create more complex models right now though I wanted to focus on this 00:06:44.320 |
part here which is metrics equals error rate; this is where you list the 00:06:49.920 |
functions that you want to be called with your validation data 00:06:54.080 |
and printed out after each epoch, and an epoch is what we 00:07:01.280 |
call it when you look at every single image in the data set once, and so after 00:07:06.960 |
you've looked at every image in the data set once we print out some information 00:07:11.160 |
about how you're doing, and the most important thing we print out is the 00:07:15.040 |
result of calling these metrics. So error rate is the name of a metric and it's a 00:07:20.200 |
function that just prints out what percent of the validation set are being 00:07:24.760 |
incorrectly classified by your model. So a metric is a function that measures the 00:07:32.280 |
quality of the predictions using the validation set; error rate is one metric, 00:07:36.720 |
another common metric is accuracy, which is just one minus error rate. 00:07:42.160 |
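For reference, the learner line being discussed looks roughly like this, using the dls from the earlier snippet; again it follows the book's pets example, so treat it as a sketch rather than the exact lesson cell:

    # cnn_learner combines the data, a resnet34 architecture pretrained on
    # ImageNet, and the metric(s) to report after each epoch
    learn = cnn_learner(dls, resnet34, metrics=error_rate)

    # fine_tune trains the pretrained model on our data (transfer learning)
    learn.fine_tune(1)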
Very important to remember: from last week we talked about loss. Arthur Samuel had this 00:07:48.840 |
important idea in machine learning that we need some way to figure out 00:07:54.800 |
how well our model is doing, so that when we change the parameters we can 00:08:00.040 |
figure out which set of parameters make that performance measurement get better 00:08:04.280 |
or worse that performance measurement is called the loss the loss is not 00:08:10.440 |
necessarily the same as your metric the reason why is a bit subtle and we'll be 00:08:17.140 |
seeing it in a lot of detail once we delve into the math in the coming 00:08:20.140 |
lessons but basically you need a function you need a loss function where 00:08:27.880 |
if you change the parameters by just a little bit up or just a little bit down 00:08:31.960 |
you can see if the loss gets a little bit better or a little bit worse and it 00:08:35.920 |
turns out that error rate and accuracy don't tell you that at all, because you 00:08:40.600 |
might change the parameters by such a small amount that none of your 00:08:45.040 |
dogs predictions start becoming cats and none of your cat predictions start 00:08:48.920 |
becoming dogs so like your predictions don't change and your error rate doesn't 00:08:52.640 |
change so loss and metric are closely related but the metric is the thing that 00:08:58.080 |
you care about the loss is the thing which your computer is using as the 00:09:03.640 |
measurement of performance to decide how to update your parameters so we measure 00:09:11.200 |
overfitting by looking at the metrics on the validation set so fast AI always 00:09:18.360 |
uses the validation set to print out your metrics and overfitting is like the 00:09:24.320 |
key thing that machine learning is about it's all about how do we find a model 00:09:30.160 |
which fits the data not just for the data that we're training with but for 00:09:35.120 |
data that the training algorithm hasn't seen before so overfitting results when 00:09:44.880 |
our model is basically cheating our model can cheat by saying oh I've seen 00:09:52.560 |
this exact picture before and I remember that that's a picture of a cat so it 00:09:58.040 |
might not have learned what cats look like in general it just remembers you 00:10:01.880 |
know that images one four and eight are cats and two and three and five are dogs 00:10:06.640 |
and learns nothing actually about what they really look like so that's the kind 00:10:11.600 |
of cheating that we're trying to avoid we don't want it to memorize our 00:10:15.360 |
particular data set. So we split off our validation data, and most of these 00:10:22.120 |
words you're seeing on the screen are from the book, okay, so I just copied and 00:10:25.080 |
pasted them. So if we split off our validation data and make sure that our 00:10:31.240 |
model never sees it during training, it's completely untainted by it, so we can't 00:10:35.080 |
possibly cheat. Well, not quite true, we can cheat; the way we could cheat is we could 00:10:41.960 |
fit a model, look at the result on the validation set, change 00:10:46.600 |
something a little bit fit another model look at the validation set change 00:10:50.280 |
something a little bit we could do that like a hundred times until we find 00:10:53.920 |
something where the validation set looks the best but now we might have fit to 00:10:57.900 |
the validation set right so if you want to be really rigorous about this you 00:11:03.120 |
should actually set aside a third bit of data called the test set that is not 00:11:08.640 |
used for training and it's not used for your metrics it's actually you don't 00:11:13.520 |
look at it until the whole project's finished and this is what's used on 00:11:17.280 |
competition platforms like Kaggle on Kaggle after the competition finishes 00:11:23.680 |
your performance will be measured against a data set that you have never seen and 00:11:30.920 |
so that's a really helpful approach and it's actually a great idea to do that 00:11:38.120 |
like even if you're not doing the modeling yourself. So if you're 00:11:43.600 |
looking at vendors and you're just trying to decide should I go with IBM or 00:11:48.040 |
Google or Microsoft and they're all showing you how great their models are 00:11:52.360 |
what you should do is you should say okay you go and build your models and I 00:11:57.680 |
am going to hang on to ten percent of my data and I'm not going to let you see it 00:12:01.460 |
at all and when you're all finished come back and then I'll run your model on the 00:12:06.240 |
ten percent of data you've never seen now pulling out your validation and test 00:12:15.400 |
sets is a bit subtle though here's an example of a simple little data set and 00:12:21.260 |
this comes from a fantastic blog post that Rachel wrote that we will link to 00:12:25.960 |
about creating effective validation sets and you can see basically you have some 00:12:31.120 |
kind of seasonal data set here now if you just say okay fast AI I want to 00:12:37.360 |
model that I want to create my data loader using a valid percent of 0.2 it 00:12:44.920 |
would do this it would delete randomly some of the dots right now this isn't 00:12:52.180 |
very helpful because we can still cheat, because these dots are right in 00:12:57.560 |
the middle of other dots and this isn't what would happen in practice what would 00:13:01.120 |
happen in practice is we would want to predict this is sales by date right we 00:13:05.480 |
want to predict the sales for next week not the sales for 14 days ago 18 days 00:13:10.720 |
ago and 29 days ago, right? So what you actually need to do to create an 00:13:15.120 |
effective validation set here is not do it randomly but instead chop off the end 00:13:21.640 |
right. And so this is what happens in pretty much all Kaggle competitions that 00:13:26.240 |
involve time; for instance, the thing that you have to predict is the next 00:13:30.640 |
like two weeks or so after the last data point that they give you, and this is 00:13:36.760 |
what you should do also for your test set. So again, if you've got vendors that 00:13:40.760 |
you're looking at, you should say to them okay, after you're all done modeling we're 00:13:45.000 |
going to check your model against data that is one week later than you've ever 00:13:49.920 |
seen before, and you won't be able to retrain or anything, because that's what happens in practice. 00:13:53.880 |
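If you want to build that kind of time-based split yourself rather than using valid_pct, here is a minimal sketch; it assumes a pandas DataFrame called df with a datetime 'date' column, and the two-week cutoff is arbitrary:

    import pandas as pd

    # everything before the cutoff is training data,
    # the final two weeks are held out as the validation set
    cutoff = df["date"].max() - pd.Timedelta(days=14)
    train_df = df[df["date"] <= cutoff]
    valid_df = df[df["date"] > cutoff]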
Okay, there's a question: I've heard people describe 00:14:00.480 |
overfitting as training error being below validation error does this rule of 00:14:05.120 |
thumb end up being roughly the same as yours okay so that's a great question so 00:14:09.560 |
I think what they mean there is training loss versus validation loss because we 00:14:15.840 |
don't print training error so we do print at the end of each epoch the value 00:14:21.680 |
of your loss function for the training set and the value of the loss function 00:14:25.080 |
for the validation set and if you train for long enough so if it's training 00:14:32.120 |
nicely your training loss will go down and your validation loss will go down 00:14:37.200 |
because by definition the loss function is defined such that a lower loss means 00:14:44.920 |
a better model if you start overfitting your training loss will keep going down 00:14:51.440 |
right, because like why wouldn't it, you know, you're getting better and better 00:14:55.480 |
parameters but your validation loss will start to go up because actually you 00:15:02.160 |
started fitting to the specific data points in the training set and so it's 00:15:05.920 |
not actually going to get better 00:15:08.640 |
for the validation set it'll start to get worse however that does not 00:15:14.880 |
necessarily mean that you're overfitting or at least not overfitting in a bad way 00:15:19.080 |
as we'll see it's actually possible to be at a point where the validation loss 00:15:24.760 |
is getting worse but the validation accuracy or error or metric is still 00:15:29.640 |
improving so I'm not going to describe how that would happen mathematically yet 00:15:35.240 |
because we need to learn more about loss functions, but we will; for now just 00:15:39.640 |
realize that the important thing to look at is your metric getting worse, not your 00:15:46.120 |
loss function getting worse. Thank you for that fantastic question. 00:15:56.520 |
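In fastai you can eyeball this directly after training; a minimal sketch, assuming a trained Learner called learn:

    # plot the recorded training and validation losses; validation loss starting
    # to climb while training loss keeps falling is the classic overfitting sign
    learn.recorder.plot_loss()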
The next important thing we need to learn about is called transfer learning. So the next 00:16:00.600 |
line of code said learn.fine_tune. Why does it say fine tune? Fine-tuning is 00:16:07.360 |
what we do when we are transfer learning so transfer learning is using a pre 00:16:12.580 |
trained model for a task that is different to what it was originally 00:16:16.060 |
trained for. So, more jargon; to understand our jargon let's look at that: what's a 00:16:20.980 |
pre trained model so what happens is remember I told you the architecture 00:16:25.240 |
we're using is called resnet 34 so when we take that resnet 34 that's just a 00:16:30.640 |
it's just a mathematical function okay with lots of parameters that we're going 00:16:35.320 |
to fit using machine learning there's a big data set called image net that 00:16:42.560 |
contains 1.3 million pictures of a thousand different types of thing 00:16:46.640 |
whether it be mushrooms or animals or airplanes or hammers or whatever there's 00:16:55.240 |
a competition, or there used to be a competition that ran every year, to see 00:16:58.240 |
who could get the best accuracy on the image net competition and the models 00:17:02.600 |
that did really well people would take those specific values of those 00:17:07.800 |
parameters and they would make them available on the internet for anybody to 00:17:11.920 |
download so if you download that you don't just have an architecture now you 00:17:16.280 |
have a trained model you have a model that can recognize a thousand categories 00:17:22.320 |
of thing in images which probably isn't very useful unless you happen to want 00:17:28.400 |
something that recognizes those exact thousand categories of thing but it turns 00:17:33.120 |
out you can instead start with those weights in your model and then 00:17:40.560 |
train some more epochs on your data and you'll end up with a far far more 00:17:47.120 |
accurate model than you would if you didn't start with that pre-trained model 00:17:51.540 |
and we'll see why in just a moment, right, but this idea of transfer learning, it 00:17:57.040 |
kind of makes intuitive sense, right, ImageNet already has some cats and some 00:18:03.520 |
dogs in it it's you know it can say this is a cat and this is a dog but you want 00:18:07.320 |
to maybe do something that recognizes lots of breeds that aren't in ImageNet; 00:18:11.000 |
well for it to be able to recognize cats versus dogs versus airplanes versus 00:18:16.160 |
hammers it has to understand things like what does metal look like what does fur 00:18:22.120 |
look like, and so on, you know, so it can say like oh this breed of 00:18:26.480 |
animal this breed of dog has pointy ears and oh this thing is metal so it can't 00:18:31.320 |
be a dog so all these kinds of concepts get implicitly learnt by a pre-trained 00:18:37.360 |
model. So if you start with a pre-trained model then you don't have 00:18:41.880 |
to learn all these features from scratch and so transfer learning is the single 00:18:48.840 |
most important thing for being able to use less data and less compute and get 00:18:54.960 |
better accuracy so that's a key focus for the fast AI library and a key focus 00:19:00.920 |
for this course there's a question I'm a bit confused on the differences between 00:19:12.600 |
loss, error, and metric. Sure, so error is just one kind of metric, so there's lots 00:19:23.600 |
of different possible labels you could have let's say you're trying to create a 00:19:27.360 |
model which could predict how old a cat or dog is so the metric you might use is 00:19:36.920 |
on average how many years were you off by so that would be a metric on the other 00:19:44.440 |
hand if you're trying to predict whether this is a cat or a dog your metric could 00:19:49.720 |
be what percentage of the time am I wrong so that latter metric is called the 00:19:55.240 |
error rate okay so error is one particular metric it's a thing that 00:20:00.320 |
measures how well you're doing and it's like it should be the thing that you 00:20:04.880 |
most care about so you write a function or use one of fast AI's pre-defined ones 00:20:10.520 |
which measures how well you're doing loss is the thing that we talked about in 00:20:19.820 |
lesson one so I'll give a quick summary but go back to lesson one if you don't 00:20:23.760 |
remember Arthur Samuel talked about how a machine learning model needs some 00:20:29.000 |
measure of performance which we can look at when we adjust our parameters up or 00:20:35.360 |
down does that measure of performance get better or worse and as I mentioned 00:20:40.500 |
earlier some metrics possibly won't change at all if you move the parameters 00:20:47.960 |
up and down just a little bit so they can't be used for this purpose of 00:20:52.920 |
adjusting the parameters to find a better measure of performance so quite 00:20:56.600 |
often we need to use a different function we call this the loss function 00:21:00.840 |
the loss function is the measure of performance that the algorithm uses to 00:21:05.480 |
try to make the parameters better and it's something which should kind of 00:21:09.600 |
track pretty closely to the metric you care about, but it's something which 00:21:15.120 |
as you change the parameters a bit the loss should always change a bit and so 00:21:20.880 |
there's a lot of hand waving there because we need to look at some of the 00:21:24.440 |
math of how that works and we'll be doing that in the next couple of lessons 00:21:30.920 |
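Here is a tiny illustration of that point, in plain PyTorch rather than anything from the lesson (the numbers are made up): nudging the model's outputs a little leaves the error rate metric unchanged, while a loss like cross-entropy still moves, which is what lets it guide the parameter updates.

    import torch
    import torch.nn.functional as F

    targets = torch.tensor([0, 1, 1])                 # 0 = cat, 1 = dog

    # two very slightly different sets of model outputs (logits)
    logits_a = torch.tensor([[2.0, 1.0], [0.5, 1.5], [0.4, 1.6]])
    logits_b = logits_a * 1.01                        # nudge the confidence a tiny bit

    def error_rate(logits, targets):
        return (logits.argmax(dim=1) != targets).float().mean()

    # the metric doesn't move at all...
    print(error_rate(logits_a, targets), error_rate(logits_b, targets))
    # ...but the loss does, so it can tell us which direction to adjust parameters
    print(F.cross_entropy(logits_a, targets), F.cross_entropy(logits_b, targets))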
thanks for the great questions okay so fine-tuning is a particular transfer 00:21:39.840 |
learning technique where the (oh, and you're still showing your picture and 00:21:44.960 |
not the slides) so fine-tuning is a transfer learning technique where the 00:21:54.720 |
weights this is not quite the right word we should say the parameters where the 00:21:58.400 |
parameters of a pre-trained model are updated by training for additional epochs 00:22:02.800 |
using a different task to that used for pre-training so pre-training the task 00:22:07.400 |
might have been image net classification and then our different task might be 00:22:12.040 |
recognizing cats versus dogs so the way by default fast AI does fine-tuning is 00:22:22.240 |
that we use one epoch (which remember is looking at every image in the data 00:22:27.420 |
set once) to fit just those parts of the model necessary to get the 00:22:34.240 |
part of the model that's specific to your data set working, and 00:22:40.480 |
then we use as many epochs as you ask for to fit the whole model, and this 00:22:45.640 |
is for those people who might be a bit more advanced; we'll see 00:22:48.680 |
exactly how this works later on in the lessons. 00:22:55.560 |
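Roughly, and only roughly (the real implementation also picks sensible learning rates for each phase), fastai's fine_tune does something like this:

    # learn.fine_tune(3) is approximately equivalent to:
    learn.freeze()            # train only the new final layers (the "head")
    learn.fit_one_cycle(1)    # one epoch to get the head working
    learn.unfreeze()          # now allow all the pretrained parameters to update
    learn.fit_one_cycle(3)    # the epochs you asked for, on the whole model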
So why does transfer learning work, and why does it work so well? The best way in my opinion to look at this 00:22:59.920 |
is to see this paper by Zeiler and Fergus, who were actually 2012 ImageNet winners, 00:23:07.200 |
and interestingly their key insights came from their ability to visualize 00:23:13.160 |
what's going on inside a model so visualization very often turns out to be 00:23:17.960 |
super important to getting great results what they were able to do was they looked 00:23:22.600 |
remember I told you like a resnet 34 has 34 layers they looked at something 00:23:28.880 |
called Alex net which was the previous winner of the competition which only had 00:23:32.520 |
seven layers at the time that was considered huge and so they took a seven 00:23:37.040 |
layer model and they said what does the first layer of parameters look like, and 00:23:42.720 |
they figured out how to draw a picture of them, right, and so the first 00:23:47.720 |
layer had lots and lots of features but here are nine of them one two three four 00:23:55.900 |
five six seven eight nine and here's what nine of those features look like one of 00:24:00.960 |
them was something that could recognize diagonal lines from top left to bottom 00:24:04.440 |
right one of them could find diagonal lines from bottom left to top right one 00:24:08.840 |
of them could find gradients that went from orange at the top to blue at the 00:24:12.400 |
bottom, and so on; one of them was specifically for finding 00:24:17.740 |
things that were green and so forth, right. So each of these nine, they're 00:24:25.040 |
called filters, or features. So then something really interesting they 00:24:30.400 |
did was they looked at, for each one of these filters, each one 00:24:34.840 |
of these features and we'll learn kind of mathematically about what these 00:24:38.360 |
actually mean in the coming lessons but for now let's just recognize them as 00:24:43.120 |
saying oh there's something that looks at diagonal lines and something that 00:24:45.520 |
looks at gradients and they found in the actual images in ImageNet specific 00:24:52.900 |
examples of parts of photos that match that filter so for this top left filter 00:24:58.400 |
here are nine actual patches of real photos that match that filter and as you 00:25:04.360 |
can see they're all diagonal lines and so here's the for the green one here's 00:25:08.560 |
parts of actual photos that match the green one so layer one is super super 00:25:14.560 |
simple and one of the interesting things to note here is that something that can 00:25:18.000 |
recognize gradients and patches of color and lines is likely to be useful for 00:25:22.360 |
lots of other tasks as well not just ImageNet so you can kind of see how 00:25:26.760 |
something that can do this might also be good at many many other computer vision 00:25:33.280 |
tasks as well this is layer two layer two takes the features of layer one and 00:25:40.680 |
combines them so it can not just find edges but can find corners or repeating 00:25:49.800 |
curving patterns or semicircles or full circles and so you can see for example 00:25:56.760 |
here, well, it's kind of hard to exactly visualize these layers after layer one, 00:26:06.120 |
you kind of have to show examples of what the filters look like, but here you 00:26:11.080 |
can see examples of parts of photos that this layer-two circular filter has 00:26:17.080 |
activated on and as you can see it's found things with circles so 00:26:23.880 |
interestingly this one which is this kind of blotchy gradient seems to be 00:26:28.040 |
very good at finding sunsets and this repeating vertical pattern is very good 00:26:33.320 |
at finding like curtains and wheat fields and stuff so the further we get 00:26:39.560 |
layer three then gets to combine all the kinds of features in layer two and 00:26:45.320 |
remember we're only seeing here 12 of the features, but 00:26:49.480 |
actually there's probably hundreds of them I don't remember exactly in Alex 00:26:52.520 |
Net but there's lots but by the time we get to layer three by combining features 00:26:57.440 |
from layer two it already has something which is finding text so this is a 00:27:03.480 |
feature which can find bits of image that contain text it's already got 00:27:08.120 |
something which can find repeating geometric patterns and you see this is 00:27:12.980 |
not just matching specific pixel patterns, this is like a semantic concept, 00:27:20.240 |
it can find repeating circles or repeating squares or repeating hexagons 00:27:24.400 |
right, so it's really computing something, not just matching a template, and 00:27:31.220 |
remember we know that neural networks can solve any possible computable 00:27:35.180 |
function so it can certainly do that so layer 4 gets to combine all the filters 00:27:43.400 |
from layer 3 any way it wants, and so by layer 4 we have something that can find 00:27:47.240 |
dog faces for instance. So you can kind of see how with each layer we get like 00:27:56.800 |
multiplicatively more sophisticated features and so that's why these deep 00:28:01.880 |
neural networks can be so incredibly powerful it's also why transfer learning 00:28:07.740 |
can work so well because like if we wanted something that can find books and 00:28:13.160 |
I don't think there's a book category in ImageNet well it's actually already got 00:28:17.020 |
something that can find text as an earlier filter which I guess it must be 00:28:20.840 |
using to find maybe there's a category for library or something or a bookshelf 00:28:25.880 |
so when you use transfer learning you can take advantage of all of these 00:28:30.840 |
pre-learned features to find things that are just combinations of these existing 00:28:37.680 |
features; that's why transfer learning can be done so much more quickly and with so 00:28:42.880 |
much less data than traditional approaches. One important thing to 00:28:48.360 |
realize then is that these techniques for computer vision are not just good at 00:28:53.160 |
recognizing photos there's all kinds of things you can turn into pictures for 00:28:58.680 |
example, these are sounds that have been turned into 00:29:04.440 |
pictures by representing their frequencies over time, and it turns out 00:29:10.080 |
that if you convert a sound into these kinds of pictures you can get basically 00:29:15.960 |
state-of-the-art results at sound detection just by using the exact same 00:29:21.660 |
ResNet learner that we've already seen. 00:29:28.520 |
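Just as an illustration of the idea (this uses librosa, which is not part of the lesson, and the file names are made up), here is roughly how a sound can be turned into a spectrogram picture that an ordinary image classifier can then be trained on:

    import librosa
    import librosa.display
    import matplotlib.pyplot as plt
    import numpy as np

    # load an audio clip (hypothetical file name) and compute a mel spectrogram:
    # frequency content on the y-axis, time on the x-axis
    y, sr = librosa.load("some_clip.wav")
    spec = librosa.feature.melspectrogram(y=y, sr=sr)
    spec_db = librosa.power_to_db(spec, ref=np.max)

    # save it as a picture; a folder of these can be fed to the same
    # ImageDataLoaders / cnn_learner pipeline used for cats vs dogs
    librosa.display.specshow(spec_db, sr=sr)
    plt.savefig("some_clip.png")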
I wanted to highlight that it's 9:45, if you want to take a break soon. A really cool example, from I think our 00:29:34.160 |
very first year of running fast AI one of our students created pictures they 00:29:40.080 |
worked at Splunk in anti-fraud and they created pictures of users moving their 00:29:45.240 |
mouse and if I remember correctly as they moved their mouse he basically drew 00:29:49.800 |
a picture of where the mouse moved and the color depended on how fast they 00:29:54.180 |
moved and these circular blobs is where they clicked the left or the right mouse 00:29:59.080 |
button. And what he did, actually as a project for the 00:30:04.740 |
course, is he tried to see whether he could use 00:30:09.480 |
these pictures with exactly the same approach we saw in lesson one to create 00:30:15.000 |
an anti-fraud model and it worked so well that Splunk ended up patenting a new 00:30:21.240 |
product based on this technique and you can actually check it out there's a blog 00:30:25.040 |
post about it on the internet where they describe this breakthrough anti-fraud 00:30:29.120 |
approach which literally came from one of our really amazing and brilliant and 00:30:34.440 |
creative students after lesson one of the course another cool example of this 00:30:40.640 |
is looking at different viruses and again turning them into pictures and you 00:30:48.800 |
can kind of see how they've got here this is from a paper check out the book 00:30:52.500 |
for the citation they've got three examples of a particular virus called 00:30:57.200 |
vb.at and another example of a particular virus called fakrian and you 00:31:02.240 |
can see each case the pictures all look kind of similar and that's why again 00:31:06.960 |
they can get state-of-the-art results in virus detection by turning the kind 00:31:12.760 |
of program signatures into pictures and putting it through image recognition so 00:31:20.520 |
in the book you'll find a list of all of the most important 00:31:25.560 |
terms we've seen so far and what they mean. I'm not going to read through them, 00:31:29.280 |
but I want you to, please, because these are the terms that we're 00:31:33.480 |
going to be using from now on and you've got to know what they mean because if 00:31:38.720 |
you don't you're going to be really confused because I'll be talking about 00:31:41.320 |
labels and architectures and models and parameters and they have very specific 00:31:46.080 |
exact meanings, and we'll be using those exact meanings, so please review this. So 00:31:52.520 |
to remind you, this is where we got to: we ended up with Arthur Samuel's overall 00:31:59.520 |
approach and we replaced his terms with our terms. So we have an architecture 00:32:05.520 |
which takes parameters and the data as 00:32:11.960 |
inputs; the architecture uses the parameters and the 00:32:18.560 |
inputs to calculate predictions, which are compared to the labels with a 00:32:23.480 |
loss function, and that loss function is used to update the parameters many many 00:32:28.320 |
times to make them better and better until the loss gets nice and low. 00:32:34.360 |
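In code, that loop looks roughly like this; a toy example in plain PyTorch (the tiny made-up regression problem is just for illustration, it is not from the book):

    import torch

    # a tiny made-up problem: learn y = 3x from data
    xs = torch.linspace(0, 1, 100).unsqueeze(1)
    ys = 3 * xs

    params = torch.randn(1, requires_grad=True)            # the parameters
    def architecture(x, w): return x * w                   # the (very simple) function
    def loss_function(preds, targets): return ((preds - targets) ** 2).mean()

    for step in range(100):
        preds = architecture(xs, params)                   # calculate predictions
        loss = loss_function(preds, ys)                    # compare to the labels
        loss.backward()                                    # figure out how to improve
        with torch.no_grad():
            params -= 0.1 * params.grad                    # update the parameters
            params.grad.zero_()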
so this is the end of chapter one of the book it's really important to look at 00:32:39.800 |
the questionnaire because the questionnaire is the thing where you can 00:32:43.080 |
check whether you have taken away from this chapter the stuff that 00:32:49.260 |
we hope you have. So go through it, and for anything that you're not sure about, 00:32:55.520 |
the answer is in the text, so just go back to earlier in the chapter and 00:33:00.080 |
you will find the answers. There's also a further 00:33:05.800 |
research section after each questionnaire for the first couple of 00:33:09.480 |
chapters they're actually pretty simple hopefully they're pretty fun and 00:33:12.240 |
interesting they're things where to answer the question it's not enough to 00:33:15.480 |
just look in the chapter you actually have to go and do your own thinking and 00:33:19.640 |
experimenting and googling and so forth in later chapters some of these further 00:33:25.880 |
research things are pretty significant projects that might take a few days or 00:33:30.320 |
even weeks and so yeah you know check them out because hopefully they'll be a 00:33:35.480 |
great way to expand your understanding of the material so something that Sylvain 00:33:42.560 |
points out in the book is that if you really want to make the most of this 00:33:46.000 |
then after each chapter please take the time to experiment with your own project 00:33:50.640 |
and with the notebooks we provide, and then see if you can redo 00:33:55.560 |
the notebooks on a new data set. Perhaps for chapter one that might be a 00:34:00.200 |
bit hard because we haven't really shown how to change things, but for 00:34:03.640 |
chapter two, which we're going to start next, you'll absolutely be able to do 00:34:07.240 |
that. Okay, so let's take a five minute break and we'll come back at 9:55 San 00:34:16.880 |
Francisco time okay so welcome back everybody and I think we've got a couple 00:34:22.360 |
of questions to start with, so Rachel please take it away. Sure. Are filters 00:34:27.560 |
independent? By that I mean if filters are pre-trained, might they become less 00:34:31.840 |
good in detecting features of previous images when fine-tuned oh that is a 00:34:37.000 |
great question. So, assuming I understand the question correctly, if you start 00:34:43.120 |
with say an ImageNet model and then you fine-tune it on dogs versus cats for 00:34:49.720 |
a few epochs and you get something that's very good at recognizing dogs 00:34:53.560 |
versus cats it's going to be much less good as an image net model after that 00:34:58.840 |
so it's not going to be very good at recognizing airplanes or hammers or 00:35:03.520 |
whatever this is called catastrophic forgetting in the literature the idea 00:35:10.040 |
that as you like see more images about different things to what you saw earlier 00:35:14.360 |
that you start to forget about the things you saw earlier so if you want to 00:35:20.180 |
fine-tune something which is good at a new task but also continues to be good 00:35:26.080 |
at the previous task you need to keep putting in examples of the previous task 00:35:30.000 |
as well. And what are the differences between parameters 00:35:37.800 |
and hyper parameters if I am feeding an image of a dog as an input and then 00:35:43.120 |
changing the hyper parameters of batch size in the model what would be an 00:35:47.160 |
example of a parameter? So the parameters are the things that we described in lesson 00:35:55.160 |
one that Arthur Samuel described as being the things which change what the 00:36:02.720 |
model does what the architecture does so we start with this infinitely flexible 00:36:08.540 |
function the thing called a neural network that can do anything at all and 00:36:14.080 |
the the way you get it to do one thing versus another thing is by changing its 00:36:19.880 |
parameters there they are the numbers that you pass into that function so 00:36:24.640 |
there's two types of numbers you pass into the function there's the numbers 00:36:27.640 |
that represent your input like the pixels of your dog and there's the 00:36:32.240 |
numbers that represent the learnt parameters so in the example of 00:36:39.520 |
something that's not a neural net but like a checkers playing program like 00:36:43.560 |
Arthur Samuel might have used back in the early 60s and late 50s those 00:36:47.640 |
parameters may have been things like: if there is an opportunity to take a piece 00:36:54.640 |
versus an opportunity to get to the end of the board, how much more value should I 00:36:59.920 |
give one versus the other, you know, is it twice as important or three 00:37:03.720 |
times as important, that two versus three, that would be an example of a parameter. 00:37:08.480 |
in a neural network parameters are a much more abstract concept and so a 00:37:14.960 |
detailed understanding of what they are will come in the next lesson or two but 00:37:20.080 |
it's the same basic idea they're the numbers which change what the model does 00:37:26.480 |
to be something that recognizes malignant tumors versus cats versus dogs 00:37:32.600 |
versus colorizes black and white pictures whereas the hyper parameter is 00:37:38.920 |
the choices about what numbers you pass to 00:37:46.040 |
the actual fitting function to decide how that fitting process happens. There's 00:37:52.520 |
a question I'm curious about the pacing of this course I'm concerned that all 00:37:55.960 |
the material may not be covered depends what you mean by all the material we 00:38:00.920 |
certainly won't cover everything in the world, so yeah, 00:38:08.400 |
we'll cover what we can in seven lessons; we're certainly not covering the 00:38:12.960 |
whole book if that's what you're wondering the whole book will be covered 00:38:16.200 |
in either two or three courses in the past it's generally been two courses to 00:38:22.200 |
cover about the amount of stuff in the book but we'll see how it goes because 00:38:26.480 |
the book's pretty big, 500 pages. When you say two courses, you mean 14 lessons? 00:38:32.520 |
Yeah, so it'd be like 14 or 21 lessons to get through the whole book, although 00:38:38.080 |
having said that by the end of the first lesson hopefully there'll be kind of 00:38:40.880 |
like enough momentum and understanding that the reading the book independently 00:38:44.800 |
will be more useful and you'll have also kind of gained a community of folks on 00:38:50.800 |
the forums that you can hang out with and ask questions of and so forth so in 00:38:57.380 |
the second part of the course we're going to be talking about putting stuff 00:39:02.000 |
in production, and to do that we need to understand like what are the 00:39:08.040 |
capabilities and limitations of deep learning, what are the kinds of projects 00:39:13.880 |
that even make sense to try to put in production and you know one of the key 00:39:18.600 |
things I should mention in the book and in this course is that the first two or 00:39:22.160 |
three lessons and chapters there's a lot of stuff which is designed not just for 00:39:27.640 |
coders but for everybody; there's lots of information about like what are 00:39:34.880 |
the practical things you need to know to make deep learning work and so one of 00:39:38.320 |
the things you need to know is like, well, what's deep learning actually good 00:39:41.360 |
at, at the moment. So I'll summarize what the book says about this, but there are 00:39:48.240 |
the kind of four key areas that we have as applications in fast AI computer 00:39:53.760 |
vision, text, tabular, and what I've called here recsys; this stands for recommendation 00:39:58.200 |
systems and specifically a technique called collaborative filtering which we 00:40:01.760 |
briefly saw last week sorry another question is are there any pre-trained 00:40:06.960 |
weights available other than the ones from image net that we can use if yes 00:40:11.240 |
when should we use others than the ImageNet one? Oh, that's a really great question. So 00:40:16.280 |
yes there are a lot of pre-trained models, and one way to find them, 00:40:23.320 |
(you're currently just showing, switching, okay great) one great way to find them is 00:40:29.120 |
you can look up model zoo which is a common name for like places that have 00:40:36.480 |
lots of different models and so here's lots of model zoos or you can look for 00:40:44.400 |
pre-trained models and so yeah there's quite a few unfortunately not as wide a 00:40:57.320 |
variety as I would like; most are still on ImageNet or similar kinds of 00:41:02.640 |
general photos; for medical imaging, for example, there's hardly any. There's a lot 00:41:09.800 |
of opportunities for people to create domain specific pre-trained models it's 00:41:13.320 |
still an area that's really underdone because not enough people are 00:41:16.320 |
working on transfer learning okay so as I was mentioning we've kind of got these 00:41:23.760 |
four applications that we've talked about a bit and deep learning is pretty you 00:41:32.560 |
know pretty good at all of those. Tabular data like spreadsheets and database 00:41:39.160 |
tables is an area where deep learning is not always the best choice but it's 00:41:44.160 |
particularly good for things involving high cardinality variables that means 00:41:48.280 |
variables that have like lots and lots of discrete levels like zip code or 00:41:52.520 |
product ID or something like that deep learning is really pretty great for 00:41:58.600 |
those in particular for text it's pretty great at things like classification and 00:42:06.760 |
translation; it's actually terrible for conversation, and so that's been 00:42:11.720 |
something that's been a huge disappointment for a lot of companies 00:42:14.120 |
they tried to create these like conversation bots but actually deep 00:42:18.480 |
learning isn't good at providing accurate information it's good at 00:42:23.240 |
providing things that sound accurate and sound compelling, but we don't really 00:42:27.280 |
have great ways yet of actually making sure it's correct one big issue for 00:42:34.840 |
recommendation systems collaborative filtering is that deep learning is 00:42:39.880 |
focused on making predictions which don't necessarily actually mean creating 00:42:44.760 |
useful recommendations we'll see what that means in a moment deep learning is 00:42:50.680 |
also good at multimodal that means things where you've got multiple 00:42:56.440 |
different types of data so you might have some tabular data including a text 00:43:00.360 |
column and an image and some collaborative filtering data and 00:43:06.880 |
combining that all together is something that deep learning is really good at so 00:43:11.040 |
for example putting captions on photos is something which deep learning is 00:43:17.920 |
pretty good at, although again it's not very good at being accurate, so, you 00:43:22.400 |
know, it might say this is a picture of two birds when it's actually a picture of 00:43:25.800 |
three birds and then this other category there's lots and lots of things that you 00:43:33.800 |
can do with deep learning by being creative about the use of these kinds of 00:43:38.240 |
other application-based approaches for example an approach that we developed 00:43:43.600 |
for natural language processing called ULMFiT, which you'll learn about in the course; 00:43:48.120 |
it turns out that it's also fantastic at doing protein analysis if you think of 00:43:53.040 |
the different proteins as being different words and they're in a 00:43:57.360 |
sequence which has some kind of state and meaning, it turns out that ULMFiT 00:44:02.240 |
works really well for protein analysis. So often it's about kind of being 00:44:06.880 |
creative so to decide like for the product that you're trying to build is 00:44:12.480 |
deep learning going to work well for it in the end you kind of just have to try 00:44:17.600 |
it and see, but if you do a search, you know, hopefully you can find 00:44:24.480 |
examples of people that have tried something similar, and even if you 00:44:27.760 |
can't that doesn't mean it's not going to work so for example I mentioned the 00:44:33.280 |
collaborative filtering issue where a recommendation and a prediction are not 00:44:37.840 |
necessarily the same thing you can see this on Amazon for example quite often 00:44:43.040 |
so I bought a Terry Pratchett book and then Amazon tried for months to get me to 00:44:48.880 |
buy more Terry Pratchett books now that must be because their predictive model 00:44:53.240 |
said that people who bought one particular Terry Pratchett book are 00:44:57.440 |
likely to also buy other Terry Pratchett books but from the point of view of like 00:45:01.880 |
well is this going to change my buying behavior probably not right like if I 00:45:07.040 |
liked that book I already know I like that author and I already know that like 00:45:10.440 |
they probably wrote other things so I'll go and buy it anyway so this would be an 00:45:14.360 |
example of like Amazon probably not being very smart up here they're 00:45:18.720 |
actually showing me collaborative filtering predictions rather than 00:45:23.280 |
actually figuring out how to optimize a recommendation so an optimized 00:45:27.520 |
recommendation would be something more like your local human bookseller might 00:45:32.360 |
do where they might say oh you like Terry Pratchett well let me tell you 00:45:36.840 |
about other kind of comedy fantasy sci-fi writers on the similar vein who 00:45:41.440 |
you might not have heard about before so the difference between recommendations 00:45:46.240 |
and predictions is super important so I wanted to talk about a really important 00:45:53.360 |
issue around interpreting models, and for a case study for this I thought let's 00:45:59.000 |
pick something that's actually super important right now which is a model in 00:46:03.240 |
this paper one of the things we're going to try and do in this course is learn 00:46:06.160 |
how to read papers. So here is a paper which I would love for everybody to 00:46:11.480 |
read called high temperature and high humidity reduce the transmission of 00:46:15.540 |
COVID-19 now this is a very important issue because if the claim of this paper 00:46:20.840 |
is true, that would mean that this is going to be a seasonal disease, and if 00:46:25.360 |
this is a seasonal disease, that's going to have massive policy implications, 00:46:30.360 |
so let's try and find out how this was modeled and understand how to interpret 00:46:35.240 |
this model so this is a key picture from the paper and what they've done here is 00:46:45.560 |
they've taken a hundred cities in China and they've plotted the temperature on 00:46:50.300 |
one axis in Celsius and R on the other axis, where R is a measure of 00:46:56.160 |
transmissibility it says for each person that has this disease how many people on 00:47:02.200 |
average will they infect so if R is under one then the disease will not 00:47:07.720 |
spread; if R is higher than like two it's going to spread incredibly quickly, 00:47:14.840 |
and basically, you know, any high R is going to create an 00:47:18.560 |
exponential transmission impact and you can see in this case they have plotted a 00:47:25.000 |
best fit line through here and then they've made a claim that there's some 00:47:30.440 |
particular relationship in terms of a formula that R is 1.99 minus 0.023 times 00:47:38.480 |
temperature so a very obvious concern I would have looking at this picture is 00:47:44.840 |
that this might just be random maybe there's no relationship at all but just 00:47:52.160 |
if you picked a hundred cities at random perhaps they would sometimes show this 00:47:57.680 |
level of relationship so one simple way to kind of see that would be to actually 00:48:04.840 |
do it in a spreadsheet. So here is a spreadsheet where what I did was I kind 00:48:12.960 |
of eyeballed this data and I guessed about what is the mean degrees centigrade 00:48:17.920 |
I think it's about five and what's about the standard deviation of centigrade I 00:48:22.440 |
think it's probably about five as well and then I did the same thing for R I 00:48:27.240 |
think the mean R looks like it's about 1.9 to me and it looks like the standard 00:48:32.040 |
deviation of R is probably about 0.5 so what I then did was I just jumped over 00:48:38.560 |
here and I created a random normal value so a random value from a normal 00:48:46.000 |
distribution, so a bell curve, with that particular 00:48:50.200 |
mean and standard deviation of temperature and that particular mean and 00:48:55.120 |
standard deviation of R and so this would be an example of a city that might 00:49:02.480 |
be in this data set of a hundred cities something with nine degrees Celsius and 00:49:06.800 |
an R of 1.1, so that would be 00:49:12.680 |
something about here and so then I just copied that formula down 100 times so 00:49:22.920 |
here are a hundred cities that could be in China right where this is assuming 00:49:30.160 |
that there is no relationship between temperature and R right they're just 00:49:34.320 |
random numbers and so each time I recalculate that so if I hit ctrl equals 00:49:42.000 |
it will just recalculate it right I get different numbers okay because they're 00:49:47.680 |
random and so you can see at the top here I've then got the average of all of 00:49:55.240 |
the temperatures and the average of all of the R's and the average of all the 00:49:58.880 |
temperatures varies and the average of all of the R's varies as well. So then what 00:50:09.560 |
I did was I copied those random numbers over here let's actually do it so I'll 00:50:18.600 |
go copy these 100 random numbers and paste them here here here here and so 00:50:32.760 |
now I've got one two three four five six I've got six kind of groups of 100 00:50:40.720 |
cities right and so let's stop those from randomly changing anymore by just 00:50:49.520 |
fixing them in stone there okay so now that I've paste them in I've got six 00:51:01.520 |
examples of what a hundred cities might look like if there was no relationship 00:51:06.440 |
at all between temperature and R, and I've got their mean temperature and R 00:51:11.560 |
in each of those six examples. And what I've done, as you can see here at least 00:51:16.980 |
for the first one, is I've plotted it, right, and you can see in this case there's 00:51:22.040 |
actually a slight positive slope and I've actually calculated the slope for 00:51:33.500 |
each just by using the slope function in Microsoft Excel and you can see that 00:51:37.840 |
actually, in this particular case, which is just random, five times it's been negative, and 00:51:46.200 |
it's even more negative than their 0.023, and so it's kind of 00:51:53.560 |
matching our intuition here, which is that the slope of the line that we 00:51:57.800 |
have here is something that absolutely can often happen totally by chance it 00:52:03.680 |
doesn't seem to be indicating any kind of real relationship at all if we wanted 00:52:09.240 |
that slope to be like more confident we would need to look at more cities so like 00:52:17.800 |
here I've got 3,000 randomly generated numbers and you can see here the slope 00:52:26.960 |
is 0.00002 right it's almost exactly zero which is what we'd expect right when 00:52:33.080 |
there's actually no relationship between C and R and in this case there isn't 00:52:37.440 |
they're all random then if we look at lots and lots of randomly generated 00:52:41.360 |
cities, then we can say oh yeah, there's no slope, but when you only look 00:52:45.840 |
at a hundred as we did here you're going to see relationships totally 00:52:51.520 |
coincidentally very very often right so that's something that we need to be able 00:52:57.360 |
to measure and so one way to measure that is we use something called a p-value so 00:53:03.080 |
a p-value here's how a p-value works we start out with something called a null 00:53:07.720 |
hypothesis, and the null hypothesis is basically what's our starting 00:53:13.760 |
point assumption so our starting point assumption might be oh there's no 00:53:17.280 |
relationship between temperature and R, and then we gather some data. (Have 00:53:22.280 |
you explained what R is? I have, yes; R is the transmissibility of the virus.) So 00:53:28.680 |
then we gather data of independent and dependent variables so in this case the 00:53:32.860 |
independent variable is the thing that we think might cause a dependent variable 00:53:38.000 |
so here the independent variable would be temperature the dependent variable 00:53:41.000 |
would be R so here we've gathered data there's the data that was gathered in 00:53:45.720 |
this example and then we say what percentage of the time would we see this 00:53:50.880 |
amount of relationship which is a slope of 0.023 by chance and as we've seen one 00:53:57.720 |
way to do that is by what we would call a simulation, which is by generating 00:54:02.080 |
a hundred pairs of random numbers a bunch of times and 00:54:06.440 |
seeing how often you see this relationship. 00:54:12.240 |
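You can run the same simulation in a few lines of Python instead of a spreadsheet; the means and standard deviations below are the eyeballed values from the talk, so treat the whole thing as an illustrative sketch:

    import numpy as np

    rng = np.random.default_rng(42)
    n_sims, n_cities = 10_000, 100

    slopes = []
    for _ in range(n_sims):
        # 100 fake cities with NO real relationship between temperature and R
        temp = rng.normal(5, 5, n_cities)      # mean 5 C, std 5 C (eyeballed)
        r = rng.normal(1.9, 0.5, n_cities)     # mean R 1.9, std 0.5 (eyeballed)
        slope = np.polyfit(temp, r, 1)[0]      # slope of the best-fit line
        slopes.append(slope)

    # how often does pure chance give a slope at least as negative as -0.023?
    print(np.mean(np.array(slopes) <= -0.023))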
We don't actually have to do it that way though; there's actually a simple equation we can use to jump 00:54:17.840 |
straight to this number which is what percent of the time would we see that 00:54:21.280 |
relationship by chance and this is basically what that looks like we have 00:54:31.040 |
the most likely observation which in this case would be if there is no 00:54:35.980 |
relationship between temperature and R then the most likely slope would be 0 00:54:40.040 |
and sometimes you get positive slopes by chance and sometimes you get pretty small 00:54:48.940 |
slopes and sometimes you get large negative slopes by chance and so the you 00:54:55.360 |
know the larger the number the less likely it is to happen whether it be on 00:54:58.360 |
the positive side or the negative side and so in our case our question was how 00:55:04.880 |
often are we going to get less than negative 0.023 so it would actually be 00:55:10.000 |
somewhere down here, and I actually copied this from Wikipedia, where they were 00:55:13.560 |
looking for positive numbers and so they've colored in this area above a 00:55:17.760 |
number, so this is the p-value. And we don't care about the math, but 00:55:22.480 |
there's a simple little equation you can use to directly figure out this number, 00:55:29.720 |
the p-value, from the data. 00:55:39.840 |
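For example, scipy will compute that number directly from the data; a minimal sketch, where the temp and r arrays are hypothetical stand-ins for the hundred cities:

    import numpy as np
    from scipy import stats

    # hypothetical data: temperature and R for 100 cities
    rng = np.random.default_rng(0)
    temp = rng.normal(5, 5, 100)
    r = rng.normal(1.9, 0.5, 100)

    # linregress returns the slope of the best-fit line and its p-value
    result = stats.linregress(temp, r)
    print(result.slope, result.pvalue)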
This is kind of how nearly all medical 00:55:45.480 |
research results tend to be shown, and folks really focus on this idea of p-values, and indeed in this particular study, as we'll see in a moment, they 00:55:49.640 |
reported p-values so probably a lot of you have seen p-values in your previous 00:55:55.840 |
lives they come up in a lot of different domains here's the thing they are 00:56:01.840 |
terrible you almost always shouldn't be using them don't just trust me trust the 00:56:07.800 |
American Statistical Association they point out six things about p-values and 00:56:14.240 |
those include p-values do not measure the probability that the hypothesis is 00:56:19.320 |
true, or the probability that the data were produced by random chance alone now 00:56:24.480 |
we know this because we just saw that if we use more data right so if we sample 00:56:32.040 |
3000 random cities rather than a hundred we get a much smaller value right so p 00:56:40.200 |
values don't just tell you about how big a relationship is but they actually tell 00:56:44.320 |
you about a combination of that and how much data did you collect right so so 00:56:49.560 |
they don't measure the probability that the hypothesis is true so therefore 00:56:53.960 |
conclusions and policy decisions should not be based on whether a p-value passes 00:56:58.920 |
some threshold p-value does not measure the importance of a result right because 00:57:08.000 |
again it could just tell you that you collected lots of data which doesn't 00:57:11.880 |
tell you that the results are actually of any practical importance and so by itself it 00:57:16.120 |
does not provide a good measure of evidence so Frank Harrell, somebody 00:57:23.600 |
whose book I read and which was a really important part of my learning, a 00:57:28.360 |
professor of biostatistics has a number of great articles about this he says 00:57:34.280 |
null hypothesis testing and p-values have done significant harm to science 00:57:39.160 |
and he wrote another piece called null hypothesis significance testing never 00:57:44.160 |
worked so I've shown you what p-values are so that you know why they don't work 00:57:52.320 |
not so that you can use them right but they're a super important part of 00:57:56.440 |
machine learning because they come up all the time, you know, when 00:58:01.320 |
people say this is how we decide whether your drug worked or whether 00:58:06.000 |
there is a epidemiological relationship or whatever and indeed p-values appear 00:58:13.160 |
in this paper so in the paper they show the results of a multiple linear 00:58:19.800 |
regression and they put three stars next to any relationship which has a p-value 00:58:27.400 |
of 0.01 or less so there is something useful to say about a small p-value like 00:58:38.240 |
0.01 or less which is that the thing that we're looking at probably did not 00:58:43.400 |
happen by chance right the biggest statistical error people make all the 00:58:48.200 |
time is that they see that a p-value is not less than 0.05 and then they make 00:58:54.400 |
the erroneous conclusion that no relationship exists right which doesn't 00:59:01.880 |
make any sense because like it let's say you only had like three data points then 00:59:06.480 |
you almost certainly won't have enough data to have a p-value of less than 0.05 00:59:11.400 |
for any hypothesis so like the way to check is to go back and say what if I 00:59:17.520 |
picked the exact opposite null hypothesis what if my null hypothesis was 00:59:21.880 |
there is a relationship between temperature and R then do I have enough 00:59:26.040 |
data to reject that null hypothesis right and if the answer is no then you 00:59:34.820 |
just don't have enough data to make any conclusions at all right so in this case 00:59:39.800 |
they do have enough data to be confident that there is a relationship between 00:59:46.160 |
temperature and R now that's weird because we just looked at the graph and 00:59:52.120 |
we did a little bit of a back of the envelope in Excel and we thought 00:59:55.000 |
this could well be random so here's where the issue is the graph 01:00:03.500 |
shows what we call a univariate relationship a univariate relationship 01:00:07.220 |
shows the relationship between one independent variable and one dependent 01:00:11.300 |
variable and that's what you can normally show on a graph but in this 01:00:14.880 |
case they did a multivariate model in which they looked at temperature and 01:00:19.680 |
humidity and GDP per capita and population density and when you put all 01:00:26.680 |
of those things into the model then you end up with statistically significant 01:00:30.560 |
results for temperature and humidity why does that happen well the reason that 01:00:36.040 |
happens is because all these variation in the blue dots is not random there's a 01:00:44.040 |
reason they're different right and the reasons include denser cities are going 01:00:49.160 |
to have higher transmission for instance and probably more humid will have less 01:00:55.000 |
transmission so when you do a multivariate model it actually allows you 01:01:02.360 |
to be more confident of your results right but the p-value as noted by the 01:01:11.760 |
American Statistical Association does not tell us whether this is of practical 01:01:15.640 |
importance the thing that tells us whether this is of practical importance is the 01:01:20.400 |
actual slope that's found and so in this case the equation they come up with is 01:01:28.120 |
that R equals 3.968 minus 0.038 times temperature minus 0.024 times relative 01:01:37.600 |
humidity so is this equation practically important well we can again 01:01:43.320 |
do a little back of the envelope here by just putting that into Excel let's say 01:01:52.160 |
there was one place that had a temperature of 10 centigrade and a 01:01:55.480 |
humidity of 40 then if this equation is correct R would be about 2.7 somewhere 01:02:02.320 |
with a temperature of 35 centigrade and a humidity of 80 R would be about 0.8. 01:02:08.880 |
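Here is roughly that same back-of-envelope in a few lines of Python rather than Excel, using the coefficients quoted above; the small differences from the 2.7 and 0.8 figures presumably come from rounding of the paper's coefficients.

    def estimated_R(temp_c, rel_humidity):
        # regression equation quoted from the paper in the lecture:
        # R = 3.968 - 0.038 * temperature - 0.024 * relative humidity
        return 3.968 - 0.038 * temp_c - 0.024 * rel_humidity

    print(estimated_R(10, 40))   # about 2.6: well above 1, so exponential spread
    print(estimated_R(35, 80))   # about 0.7: below 1, so the outbreak dies out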
So is this practically important oh my god yes right two different cities with 01:02:15.400 |
different climates, if they're the same in every other way and this model 01:02:19.920 |
is correct, then one city would have no spread of disease because R is less than 01:02:25.280 |
one and the other would have a massive exponential explosion so we can see from this model 01:02:33.120 |
that if the modeling is correct then this is a highly practically significant 01:02:38.100 |
result so this is how you determine practical significance of your models 01:02:41.960 |
it's not with p-values but with looking at kind of actual outcomes so how do you 01:02:49.880 |
think about the practical importance of a model and how do you turn a predictive 01:02:57.960 |
model into something useful in production so I spent many many years 01:03:03.080 |
thinking about this and I actually, with some other great folks, 01:03:09.640 |
created a paper about it, Designing Great Data Products, and this 01:03:19.680 |
is largely based on 10 years of work I did at a company I founded called optimal 01:03:26.060 |
decisions group and optimal decisions group was focused on the question of 01:03:30.940 |
helping insurance companies figure out what prices to set and insurance 01:03:36.440 |
companies up until that point had focused on predictive modeling 01:03:40.240 |
actuaries in particular spent their time trying to figure out how likely is it 01:03:47.320 |
that you're going to crash your car and if you do how much damage might you have 01:03:50.920 |
and then based on that try to figure out what price they should set for your 01:03:55.320 |
policy so for this company what we did was we decided to use a different 01:04:01.160 |
approach which I ended up calling the drivetrain approach just described here 01:04:06.280 |
to set insurance prices and indeed to do all kinds of other things and so for 01:04:12.780 |
the insurance example the objective for an insurance company would 01:04:17.520 |
be how do I maximize my, let's say, five year profit and then what inputs can we 01:04:25.800 |
control, which I call levers, so in this case it would be what 01:04:30.460 |
price can I set and then data is data which can tell you as you change your 01:04:37.560 |
levers how does that change your objective so if I start increasing my 01:04:41.600 |
price to people who are likely to crash their car then we'll get less of them 01:04:46.300 |
which means we'll have less costs but at the same time we'll also have less 01:04:50.240 |
revenue coming in for example so to link up the kind of the levers to the 01:04:55.720 |
objective via the data we collect we build models that describe how the 01:04:59.960 |
levers influence the objective. 01:05:05.640 |
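To make the objective / levers / data / models structure concrete, here is a deliberately toy sketch of the insurance pricing example; every number and functional form below is invented for illustration and is not Optimal Decisions' actual modeling.

    # Models (in reality these would be fit from data): how the lever (price)
    # feeds through to the objective (profit)
    def p_buy(price):
        # chance a customer accepts this price; made-up linear elasticity
        return max(0.0, 1.0 - price / 2000)

    def expected_claims():
        # expected annual claim cost for this customer; a predictive model in reality
        return 600.0

    def expected_profit(price):
        # the objective, linked to the lever via the models above
        return p_buy(price) * (price - expected_claims())

    # Sweep the lever and pick the setting that maximizes the objective
    best_price = max(range(0, 2001, 10), key=expected_profit)
    print(best_price, expected_profit(best_price))   # best price 1300, profit roughly 245

The point is that the predictive models are just components; what you actually optimize is the objective, via the levers.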
And this all seems pretty obvious when you say it like this but when we started work with Optimal Decisions in 1999 01:05:11.040 |
nobody was doing this in insurance everybody in insurance was simply 01:05:15.640 |
doing a predictive model to guess how likely people were to crash their car and 01:05:20.920 |
then pricing was set by like adding 20% or whatever it was just done in a very 01:05:27.000 |
kind of naive way so what I did is I you know over many years took this basic 01:05:35.040 |
process and tried to help lots of companies figure out how to use it to 01:05:39.080 |
turn predictive models into actions so the starting point in like actually 01:05:46.800 |
getting value in a predictive model is thinking about what is it you're trying 01:05:50.400 |
to do and you know what are the sources of value in that thing you're trying to 01:05:53.400 |
do the levers what are the things you can change like what's the point of a 01:05:58.280 |
predictive model if you can't do anything about it right figuring out 01:06:02.800 |
ways to find what data you don't have, which data is suitable, what's 01:06:06.040 |
available then think about what approaches to analytics you can then 01:06:09.160 |
take and then super important like well can you actually implement you know 01:06:15.960 |
those changes and super super important how do you actually change things as the 01:06:21.360 |
environment changes and you know interestingly a lot of these things are 01:06:24.920 |
areas where there's not very much academic research there's a little bit 01:06:28.680 |
and some of the papers that there have been, particularly around maintenance, like 01:06:34.440 |
how do you decide when your machine learning model is kind of still okay how 01:06:39.560 |
do you update it over time, have had like many many citations but they 01:06:45.240 |
don't pop up very often because a lot of folks are so focused on the math you 01:06:49.760 |
know and then there's the whole question of like what constraints are in place 01:06:54.000 |
across this whole thing so what you'll find in the book is there is a whole 01:06:58.120 |
appendix which actually goes through every one of these six things and has a 01:07:03.800 |
whole list of examples so this is an example of how to like think about value 01:07:11.520 |
and lots of questions that companies and organizations can use to try and think 01:07:17.800 |
about you know all of these different pieces of the actual puzzle of getting 01:07:25.200 |
stuff into production and actually into an effective product we have a question 01:07:29.120 |
sure just a moment so as I say do check out this appendix because it actually 01:07:33.680 |
originally appeared as a blog post and I think except for my COVID-19 posts that 01:07:39.560 |
I did with Rachel it's actually the most popular blog post I've ever written it's 01:07:43.880 |
at hundreds of thousands of views and it kind of represents like 20 years of hard 01:07:48.560 |
won insights about like how you actually get value from machine learning in 01:07:55.120 |
practice and what you actually have to ask so please check it out because 01:07:58.320 |
hopefully you'll find it helpful so when we think about this for 01:08:03.760 |
the question of how should people think about the relationship between seasonality 01:08:08.160 |
and transmissibility of COVID-19 you kind of need to dig really deeply into the 01:08:15.720 |
questions about, like, not just what are those numbers in 01:08:20.680 |
the data but what does it really look like right so one of the things in the 01:08:24.160 |
paper that they show is actual maps right of temperature and humidity and ah 01:08:31.360 |
right and you can see like not surprisingly that humidity and 01:08:37.680 |
temperature in China are what we would call autocorrelated which is to say that 01:08:44.160 |
places that are close to each other in this case geographically have similar 01:08:48.080 |
temperatures and similar humidities and so like this actually calls into 01:08:54.960 |
question a lot of the p-values that they have right because you can't 01:09:01.040 |
really think of these as a hundred totally separate cities because the ones 01:09:04.760 |
that are close to each other probably have very close behavior so maybe you 01:09:08.080 |
should think of them as like a small number of sets of cities you know of 01:09:12.920 |
kind of larger geographies so these are the kinds of things that when you look 01:09:18.280 |
actually into a model you need to like think about what are the what are the 01:09:23.000 |
limitations but then to decide like well what does that mean what do I what do I 01:09:26.880 |
do about that you you need to think of it from this kind of utility point of 01:09:34.360 |
view this kind of end-to-end what are the actions I can take what are the 01:09:39.040 |
results point of view not just null hypothesis testing so in this case for 01:09:44.440 |
example there are basically four possible key ways this could end up it 01:09:52.040 |
could end up that there really is a relationship between temperature and R 01:09:57.480 |
so that's what the right-hand side is or there is no real relationship between 01:10:03.800 |
temperature and R and we might act on the assumption that there is a 01:10:09.160 |
relationship or we might act on the assumption that there isn't a 01:10:12.720 |
relationship and so you kind of want to look at each of these four possibilities 01:10:16.760 |
and say like well what would be the economic and societal consequences and 01:10:22.560 |
you know there's going to be a huge difference in lives lost and you know 01:10:28.000 |
economies crashing and whatever else, you know, for each of these four. The 01:10:36.180 |
paper actually you know has shown if their model is correct what's the likely 01:10:42.000 |
R value in March for like every city in the world and the likely R value in July 01:10:48.440 |
for every city in the world and so for example if you look at kind of New 01:10:52.880 |
England and New York, and also the very coast of the 01:10:57.680 |
West Coast, the prediction here is that in July the disease will stop spreading now you 01:11:04.640 |
know if that happens, if they're right, then that's going to be a 01:11:08.880 |
disaster because I think it's very likely in America and also the UK that 01:11:14.320 |
people will say oh turns out this disease is not a problem you know it 01:11:19.300 |
didn't really take off at all the scientists were wrong people will go 01:11:23.000 |
back to their previous day-to-day life and we could see what happened in 1918 01:11:28.160 |
flu virus, where like the second go around when winter hits could be much worse 01:11:34.760 |
than the start right so like there's these kind of like huge potential 01:11:41.800 |
policy impacts depending on whether this is true or false and so to think about 01:11:47.880 |
it - yes I also just wanted to say that it would be very 01:11:53.160 |
irresponsible to think oh summer's gonna solve it we don't need to act now just 01:11:59.240 |
in that this is something growing exponentially and could do a huge huge 01:12:02.840 |
amount of damage yeah, so it could, or already has done. Either way, if you 01:12:08.040 |
assume that there will be seasonality and that summer will fix things then it 01:12:13.760 |
could lead you to be apathetic now if you assume there's no seasonality and 01:12:18.160 |
then there is then you could end up kind of creating a larger level of 01:12:24.720 |
expectation of destruction than actually happens and end up with your population 01:12:28.720 |
being even more apathetic you know so, you know, being wrong in any 01:12:33.000 |
direction is a problem so one of the ways we tend to deal with this with 01:12:37.800 |
this kind of modeling is we try to think about priors so priors are basically 01:12:42.820 |
things where we you know rather than just having a null hypothesis we try and 01:12:47.020 |
start with a guess as to like well what's what's more likely right so in 01:12:52.080 |
this case if memory serves correctly I think we know that like flu viruses 01:12:57.560 |
become inactive at 27 centigrade we know that like the cold coronaviruses 01:13:04.640 |
are seasonal, the 1918 flu epidemic was seasonal, in every country and city 01:13:14.880 |
that's been studied so far there's been quite a few studies like this they've 01:13:18.120 |
always found climate relationships so far so maybe we'd say well our prior belief 01:13:23.640 |
is that this thing is probably seasonal and so then we'd say well this 01:13:27.960 |
particular paper adds some evidence to that so like it shows like how 01:13:34.800 |
incredibly complex it is to use a model in practice, in this case for policy 01:13:42.800 |
discussions but also for like organizational decisions because you 01:13:47.880 |
know there's always complexities there's always uncertainties and so you actually 01:13:52.520 |
have to think about the utilities you know and your best guesses and try to 01:13:57.920 |
combine everything together as best as you can okay so with all that said it's 01:14:08.080 |
still nice to be able to get our models up and running because, you know, 01:14:14.560 |
even just a predictive model is sometimes useful on its own, sometimes 01:14:19.300 |
it's useful to prototype something and sometimes it's just it's going to be 01:14:23.960 |
part of some bigger picture so rather than try to create some huge end-to-end 01:14:28.180 |
model here we thought we would just show you how to get your pytorch fast AI 01:14:36.400 |
model up and running in as raw a form as possible so that from there you can kind 01:14:43.180 |
of build on top of it as you like so to do that we are going to download and 01:14:51.100 |
curate our own data set and you're going to do the same thing you've got to train 01:14:55.200 |
your own model on that data set and then you're going to get an application and 01:15:00.360 |
then you're going to host it okay now there's lots of ways to create an image 01:15:07.720 |
data set you might have some photos on your own computer there might be stuff 01:15:12.080 |
at work you can use one of the easiest though is just to download stuff off the 01:15:17.720 |
internet there's lots of services for downloading stuff off the internet we're 01:15:22.120 |
going to be using Bing image search here because they're super easy to use a lot 01:15:28.080 |
of the other kind of easy to use things require breaking the terms of service of 01:15:32.640 |
websites so like we're not going to show you how to do that but there's lots of 01:15:38.280 |
examples that do show you how to do that so you can check them out as well if you 01:15:42.560 |
if you want to Bing image search is actually pretty great at least at the 01:15:46.260 |
moment these things change a lot so keep an eye on our website to see if we've 01:15:52.160 |
changed our recommendation the biggest problem with Bing image search is that 01:15:57.480 |
the sign-up process is a nightmare at least at the moment like one of the 01:16:03.360 |
hardest parts of this book is just signing up to their damn API which 01:16:07.720 |
requires going through Azure it's called cognitive services Azure cognitive 01:16:11.400 |
services so we'll make sure that all that information is on the website for 01:16:15.820 |
you to follow through just how to sign up so we're going to start from the 01:16:19.160 |
assumption that you've already signed up but you can find it just go Bing Bing 01:16:29.040 |
image search API and at the moment they give you seven days with a pretty high 01:16:36.760 |
quota for free and then after that you can keep using it as long as you like 01:16:46.240 |
but they kind of limit it to like three transactions per second or something 01:16:50.580 |
which is still plenty you can still do thousands for free so it's it's at the 01:16:54.920 |
moment it's pretty great even for free so what will happen is when you sign up 01:17:02.240 |
for Bing image search or any of these kind of services they'll give you an API 01:17:05.840 |
key so just replace the xxx here with the API key that they give you okay so 01:17:12.740 |
that's now going to be called key in fact let's do it over here okay so you'll put 01:17:21.080 |
in your key and then there's a function we've created called search images Bing 01:17:27.800 |
which is just a super tiny little function as you can see it's just two 01:17:32.900 |
lines of code, I was just trying to save a little bit of time, which will 01:17:38.960 |
take your API key and some search term and return a list of URLs that match 01:17:44.200 |
that search term as you can see for using this particular service you have 01:17:52.600 |
to install a particular package so we show you how to do that on the site as 01:17:59.320 |
well so once you've done so you'll be able to run this and that will return by 01:18:05.500 |
default I think 150 URLs okay so fast AI comes with a download URL function so 01:18:13.200 |
let's just download one of those images just to check and open it up and so what 01:18:18.760 |
I did was I searched for grizzly bear and here I have a grizzly bear. 01:18:24.800 |
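For reference, these pieces fit together roughly like this, following the book's chapter 2 notebook; the search_images_bing helper lives in the course utils / fastbook module, and the exact attribute holding the URLs ('contentUrl' versus 'content_url') has changed between versions, so treat those details as assumptions.

    import os
    from PIL import Image
    from fastai.vision.all import *
    from fastbook import search_images_bing   # tiny helper from the course repo

    key = os.environ.get('AZURE_SEARCH_KEY', 'XXX')   # your Bing / Azure API key
    results = search_images_bing(key, 'grizzly bear')
    urls = results.attrgot('contentUrl')              # list of image URLs (~150 by default)

    dest = Path('images/grizzly.jpg')
    dest.parent.mkdir(exist_ok=True)
    download_url(urls[0], dest)                       # fastai helper: fetch one image
    im = Image.open(dest)
    im.to_thumb(128, 128)                             # quick visual check in a notebook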
So then what I did was I said okay let's try and create a model that can recognize 01:18:29.480 |
grizzly bears versus black bears versus teddy bears so that way I can find out I 01:18:35.280 |
could set up some video recognition system near our campsite when we're out 01:18:40.800 |
camping that gives me bear warnings but if it's a teddy bear coming then it 01:18:45.600 |
doesn't warn me and wake me up because that would not be scary at all so then I 01:18:50.200 |
just go through each of those three bear types create a directory with the name 01:18:55.760 |
of grizzly or black or teddy bear search being for that particular search term 01:19:02.640 |
along with bear and download and so download images is a fast AI function as 01:19:09.160 |
well so after that I can call get image files which is a fast AI function that 01:19:16.040 |
will just return recursively all of the image files inside this path and you can 01:19:21.080 |
see it's given me bears / black / and then lots of numbers. 01:19:29.480 |
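Spelled out, that download loop looks something like this, again following the book's notebook and continuing from the snippet above (same caveat about the URL attribute name):

    bear_types = 'grizzly', 'black', 'teddy'
    path = Path('bears')

    if not path.exists():
        path.mkdir()
        for o in bear_types:
            dest = path/o
            dest.mkdir(exist_ok=True)
            results = search_images_bing(key, f'{o} bear')
            # download_images is the fastai helper that fetches every URL into dest
            download_images(dest, urls=results.attrgot('contentUrl'))

    fns = get_image_files(path)   # recursively lists every image file under path
    fns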
So one of the things you have to be careful of is that a lot of the stuff you download will 01:19:32.360 |
turn out to be like not images at all and will break so you can call verify 01:19:36.800 |
images to check that all of these file names are actual images and in this case 01:19:44.180 |
I didn't have any failed so this it's empty but if you did have some then you 01:19:50.160 |
would call path dot unlink, and path dot unlink is part of the Python 01:19:56.000 |
standard library and it deletes a file and map is something that will call this 01:20:02.120 |
function for every element of this collection this is part of a special 01:20:10.080 |
fast AI class called L it's basically it's kind of a mix between the Python 01:20:16.160 |
standard library list class and a NumPy array class and we'll be learning more 01:20:21.840 |
about it later in this course but it basically tries to make it super easy to 01:20:26.040 |
do kind of more functional style programming and Python so in this case 01:20:31.720 |
it's going to unlink everything that's in the failed list which is probably what 01:20:37.040 |
we want because they're all the images that failed to verify. 01:20:42.280 |
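That cleanup step is just a couple of lines (continuing from the snippet above):

    failed = verify_images(fns)   # the files that can't actually be opened as images
    failed
    failed.map(Path.unlink)       # L.map calls Path.unlink on each broken file, deleting it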
Alright so we've now got a path that contains a whole bunch of images and they're classified 01:20:48.760 |
according to black grizzly or teddy based on what folder they're in and so to 01:20:55.320 |
create so we're going to create a model and so to create a model the first thing 01:20:59.920 |
we need to do is to tell fast AI what kind of data we have and how it's 01:21:07.120 |
structured now in part in lesson one of the course we did that by using what we 01:21:13.960 |
call a factory method which is we just said image data loaders dot from name 01:21:20.040 |
and it did it all for us those factory methods are fine for beginners but now 01:21:28.040 |
we're into lesson two we're not quite beginners anymore so we're going to show 01:21:31.120 |
you the super super flexible way to use data in whatever format you like and 01:21:36.040 |
it's called the data block API and so the data block API looks like this 01:21:46.080 |
here's the data block API you tell fast AI what your independent variable is and 01:21:54.040 |
what your dependent variable is so what your labels are and what your input data 01:21:57.800 |
is so in this case our input data are images and our labels are categories so 01:22:05.560 |
the category is going to be either grizzly or black or teddy so that's the 01:22:12.040 |
first thing you tell it that that's the block parameter and then you tell it how 01:22:16.160 |
do you get a list of all of the in this case file names right and we just saw 01:22:20.760 |
how to do that because we just called the function ourselves the function is 01:22:23.820 |
called get image files so we tell it what function to use to get that list of 01:22:27.560 |
items and then you tell it how do you split the data into a validation set and 01:22:34.280 |
a training set and so we're going to use something called a random splitter which 01:22:37.960 |
just splits it randomly and we're going to put 30% of it into the validation set 01:22:42.000 |
we're also going to set the random seed which ensures that every time we run 01:22:46.280 |
this the validation set will be the same and then you say okay how do you label 01:22:51.960 |
the data and this is the name of a function called parent label and so 01:22:56.520 |
that's going to look for each item at the name of the parent so this this 01:23:03.120 |
particular one would become a black bear and this is like the most common way for 01:23:08.960 |
image data sets to be represented is that they get put the different images 01:23:13.240 |
get the files get put into folder according to their label and then 01:23:19.200 |
finally here we've got something called item transforms we'll be learning a lot 01:23:22.960 |
more about transforms in a moment that these are basically functions that get 01:23:26.760 |
applied to each image and so each image is going to be resized to a 128 by 128 square. 01:23:34.160 |
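Put together, the data block being described looks roughly like this, following the book's notebook; the seed value here is an arbitrary choice.

    bears = DataBlock(
        blocks=(ImageBlock, CategoryBlock),               # inputs are images, labels are categories
        get_items=get_image_files,                        # how to get the list of items (file names)
        splitter=RandomSplitter(valid_pct=0.3, seed=42),  # 30% validation set, fixed random seed
        get_y=parent_label,                               # label each item by its parent folder name
        item_tfms=Resize(128))                            # item transform: resize to 128x128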
So we're going to be learning more about the data block API soon but 01:23:39.680 |
basically the process is going to be it's going to call whatever is get 01:23:42.240 |
items which is a list of image files it's then I'm going to call get X get Y 01:23:47.680 |
so in this case there's no get X but there is a get Y so it's just parent 01:23:51.240 |
label and then it's going to call the create method for each of these two 01:23:55.360 |
things it's going to create an image and it's going to create a category it's 01:23:59.080 |
then going to call the item transforms which is resize and then the next thing 01:24:04.040 |
it does is it puts it into something called a data loader a data loader is 01:24:07.760 |
something that grabs a few images at a time I think by default at 64 and puts 01:24:13.840 |
them all into a single it's got a batch it just grabs 64 images and sticks them 01:24:18.760 |
all together and the reason it does that is it then puts them all onto the GPU at 01:24:23.320 |
once so it can pass them all to the model through the GPU in one go and 01:24:30.360 |
that's going to let the GPU go much faster as we'll be learning about and 01:24:35.200 |
then finally we don't use any here we can have something called batch 01:24:38.680 |
transforms which we will talk about later and then somewhere in the middle 01:24:43.280 |
about here conceptually is the splitter which is the thing that splits into the 01:24:48.680 |
training set and the validation set so this is a super flexible way to tell 01:24:54.560 |
fast AI how to work with your data and so at the end of that it returns an 01:25:03.120 |
object of type data loaders that's why we always call these things DL's right so 01:25:08.880 |
data loaders has a validation and a training data loader and a data loader as 01:25:15.480 |
I just mentioned is something that grabs a batch of a few items at a time and 01:25:19.880 |
puts it on the GPU for you so this is basically the entire code of data loaders 01:25:26.920 |
so the details don't matter I just wanted to point out that like a lot of 01:25:31.120 |
these concepts in fast AI when you actually look at what they are there 01:25:34.800 |
they're incredibly simple little things it's literally something that you just 01:25:38.680 |
pass a few data loaders to, and it stores them in an attribute and 01:25:43.160 |
gives you the first one back as dot train and the second one back as dot valid. 01:25:47.000 |
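For reference, the simplified DataLoaders class shown in the book is essentially just this, reproduced from memory, so treat it as approximate; the real fastai class has a little more to it.

    from fastcore.basics import GetAttr, add_props   # both come from fastcore

    class DataLoaders(GetAttr):
        def __init__(self, *loaders): self.loaders = loaders
        def __getitem__(self, i): return self.loaders[i]
        train, valid = add_props(lambda i, self: self[i])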
So we can create our data loaders by first of all creating the data block 01:25:57.680 |
and then we call the data loaders passing in our path to create DL's and 01:26:02.400 |
then you can call show batch on that you can call show batch on pretty much 01:26:06.360 |
anything in fast AI to see your data and look we've got some grizzlies we've got 01:26:10.700 |
a teddy we've got a grizzly so you get the idea. 01:26:19.880 |
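Those two steps in code (the max_n value is just a display choice):

    dls = bears.dataloaders(path)   # build the train/valid DataLoaders from the DataBlock
    dls.show_batch(max_n=9)         # show_batch works on pretty much anything in fastai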
I'm going to look at data augmentation next week so I'm going to 01:26:23.360 |
skip over data augmentation and let's just jump straight into training your 01:26:27.200 |
model so once we've got DL's we can just like in lesson one call CNN learner to 01:26:38.600 |
create a resnet we're going to create a smaller resnet this time a resnet 18 01:26:43.080 |
again asking for error rate we can then call dot fine-tune again so you see it's 01:26:48.080 |
all the same lines of code we've already seen and you can see our error rate goes 01:26:52.800 |
down from 9% to 1% so you've got 1% error after training for about 25 seconds. 01:26:58.960 |
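The two lines being referred to are the same shape as in lesson one, just with the smaller architecture; the epoch count here is a guess at what was run in the lesson.

    learn = cnn_learner(dls, resnet18, metrics=error_rate)
    learn.fine_tune(4)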
so you can see you know we've only got 450 images we've trained for well less 01:27:05.320 |
than a minute. Let's look at the confusion matrix so we can 01:27:09.640 |
say I want to create a classification interpretation class I want to look at 01:27:14.840 |
the confusion matrix and the confusion matrix as you can see it's something 01:27:19.540 |
that says for things that are actually black bears how many are predicted to be 01:27:24.280 |
black bears versus grizzly bears versus teddy bears so the diagonal are the ones 01:27:31.280 |
that are all correct and so it looks like we've got two errors we've got one 01:27:34.580 |
grizzly that was predicted to be black one black that was predicted to be 01:27:37.760 |
grizzly. A super useful method is plot top losses that'll actually show me 01:27:48.280 |
what my errors actually look like so this one here was predicted to be a 01:27:53.420 |
grizzly bear but the label was black bear this one was the one that's 01:27:58.000 |
predicted to be a black bear and the label was grizzly bear these ones here 01:28:03.440 |
are not actually wrong though, this is predicted to be black and it's actually 01:28:06.360 |
black but the reason they appear in this is because these are the ones that the 01:28:12.160 |
model was the least confident about. 01:28:18.520 |
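The interpretation calls being described are below; the arguments to plot_top_losses are just a display choice.

    interp = ClassificationInterpretation.from_learner(learn)
    interp.plot_confusion_matrix()        # actual labels vs predictions; the diagonal is the correct ones
    interp.plot_top_losses(5, nrows=1)    # the images with the highest loss: wrong, or least confident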
Okay so we're going to look at the image classifier cleaner next week, let's focus on how we then get this into production 01:28:24.160 |
so to get it into production we need to export the model so what exporting the 01:28:32.680 |
model does is it creates a new file which by default is called export dot 01:28:38.200 |
pkl which contains the architecture and all of the parameters of the model 01:28:44.160 |
so that is now something that you can copy over to a server somewhere and 01:28:50.160 |
treat it as a predefined program right so then the process of using your 01:28:58.840 |
trained model on new data kind of in production is called inference so here 01:29:06.200 |
I've created an inference learner by loading that learner back again right and 01:29:11.280 |
so obviously it doesn't make sense to do it right after I've saved it 01:29:16.760 |
in a notebook but I'm just showing you how it would work right so this is 01:29:20.360 |
something that you would do on your server for inference and remember that once 01:29:26.320 |
you have trained a model you can just treat it as a program you can pass 01:29:30.660 |
inputs to it so this is now our program this is our bear predictor so I 01:29:35.800 |
can now call predict on it and I can pass it an image and it will tell me 01:29:42.680 |
that it is ninety nine point nine nine nine percent sure that this is a 01:29:47.760 |
grizzly. 01:29:53.200 |
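Roughly, the export and inference steps look like this, following the book's notebook; the image path is just the file downloaded earlier.

    learn.export()                        # writes 'export.pkl': architecture plus parameters
    path = Path()
    path.ls(file_exts='.pkl')             # confirm the exported file is there

    # On the server ("inference"): load the exported learner and use it like a program
    learn_inf = load_learner(path/'export.pkl')
    learn_inf.predict('images/grizzly.jpg')
    # returns (predicted class, class index, per-class probabilities)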
So I think what we're going to do here is we're going to wrap it up here and next week we'll finish off by creating an actual GUI for our bear 01:30:03.160 |
classifier we will show how to run it for free on a service called binder and 01:30:16.000 |
yeah and then I think we'll be ready to dive into some of the some of the 01:30:21.560 |
details of what's going on behind the scenes any questions or anything else 01:30:26.200 |
before we wrap up? Rachel? No? Okay great, all right, thanks everybody. So we 01:30:36.320 |
hopefully yeah I think from here on we've covered you know most of the key 01:30:44.040 |
kind of underlying foundational stuff from a machine learning point of view 01:30:48.240 |
that we're going to need to cover so we'll be able to ready to dive into 01:30:54.160 |
lower level details of how deep learning works behind the scenes and I think 01:31:01.440 |
that'll be starting from next week so see you then