
Lesson 2 - Deep Learning for Coders (2020)


Chapters

0:00 Lesson 1 recap
2:10 Classification vs Regression
4:50 Validation data set
6:42 Epoch, metrics, error rate and accuracy
9:07 Overfitting, training, validation and testing data set
12:10 How to choose your training set
15:55 Transfer learning
21:50 Fine tuning
22:23 Why transfer learning works so well
28:26 Vision techniques used for sound
29:30 Using pictures to create fraud detection at Splunk
30:38 Detecting viruses using CNN
31:20 List of most important terms used in this course
31:50 Arthur Samuel’s overall approach to neural networks
32:35 End of Chapter 1 of the Book
40:04 Where to find pretrained models
41:20 The state of deep learning
44:30 Recommendation vs Prediction
45:50 Interpreting Models - P value
57:20 Null Hypothesis Significance Testing
62:48 Turn predictive model into something useful in production
74:06 Practical exercise with Bing Image Search
76:25 Bing Image Sign up
81:38 Data Block API
88:48 Lesson Summary

Whisper Transcript

00:00:00.000 | So hello everybody and welcome back to
00:00:06.620 | Practical Deep Learning for Coders. This is lesson 2, and in the last lesson we started
00:00:14.060 | training our first models. We didn't really have any idea how that training
00:00:18.760 | was really working, but we were looking at a high level at what was going on and
00:00:24.040 | we learned about what is machine learning and how does that work and we
00:00:35.320 | realized that based on how machine learning worked that there are some
00:00:40.400 | fundamental limitations on what it can do and we talked about some of those
00:00:45.080 | limitations and we also talked about how after you've trained a machine learning
00:00:48.680 | model you end up with a program which behaves much like a normal program or
00:00:54.240 | something with inputs and a thing in the middle and outputs. So today we're
00:00:59.520 | gonna finish up talking about that and we're going to then look
00:01:05.040 | at how we get those models into production and what some of the issues
00:01:08.400 | with doing that might be. I wanted to remind you that there are two sets of
00:01:16.360 | books, sorry, two sets of notebooks available to you. One is the
00:01:22.320 | fastbook repo, the full actual notebooks containing all the text of the O'Reilly
00:01:29.000 | book and so this lets you see everything that I'm telling you in much more detail
00:01:35.920 | and then as well as that there's the course v4 repo which contains exactly
00:01:42.520 | the same notebooks but with all the prose stripped away to help you study. So
00:01:47.640 | that's where you really want to be doing your experimenting and your practice and so
00:01:51.640 | maybe as you listen to the video you can kind of switch back and forth between
00:01:56.600 | the video and reading or do one and then the other and then put it away and have
00:02:01.280 | a look at the course v4 notebooks and try to remember like okay what was this
00:02:04.800 | section about and run the code and see what happens and change it and so forth.
00:02:11.200 | So we were looking at this line of code where we looked at how we created our
00:02:21.360 | data by passing in information perhaps most importantly some way to label the
00:02:28.920 | data and we talked about the importance of labeling and in this case the this
00:02:33.000 | particular data set whether it's a cat or a dog you can tell by whether it's an
00:02:37.000 | uppercase or a lowercase letter in the first position that's just how this data
00:02:41.520 | set works, as they tell you in the readme, and we also looked particularly at
00:02:46.360 | this idea of valid percent equals 0.2 and like what does that mean it creates a
00:02:51.320 | validation set and that was something I wanted to talk more about. The first thing
00:02:57.560 | I do want to do though is point out that this particular labeling function
00:03:05.320 | returns something that's either true or false and actually this data set as we'll
00:03:11.480 | see later also contains the actual breed of 37 different cat and dog
00:03:17.160 | breeds so you can also grab that from the file name. In each of those two
00:03:23.120 | cases we're trying to predict a category is it a cat or is it a dog or is it a
00:03:29.680 | German Shepherd or a beagle or rag doll cat or whatever when you're trying to
00:03:36.360 | predict a category so when the label is a category we call that a classification
00:03:42.160 | model. On the other hand you might try to predict how old is the animal or how
00:03:50.120 | tall is it or something like that which is like a continuous number that could
00:03:55.680 | be like 13.2 or 26.5 or whatever anytime you're trying to predict a number your
00:04:02.000 | label is a number you call that regression okay so those are the two
00:04:06.720 | main types of model, classification and regression; this is very important jargon
00:04:11.000 | to know about so the regression model attempts to predict one or more numeric
00:04:16.840 | quantities such as temperature or location or whatever this is a bit
00:04:21.240 | confusing because sometimes people use the word regression as a shortcut, like
00:04:25.920 | an abbreviation, for a particular kind of model called linear
00:04:30.720 | regression that's super confusing because that's not what regression means
00:04:36.240 | linear regression is just a particular kind of regression but I just wanted to
00:04:40.320 | warn you of that when you start talking about regression a lot of people will
00:04:45.280 | assume you're talking about linear regression even though that's not what
00:04:48.080 | the word means. Alright so I wanted to talk about this valid percent 0.2 thing
00:04:54.840 | so as we described, valid percent grabs, in this case, 20% of the data (that's the 0.2) and
00:05:02.040 | puts it aside like in a separate bucket and then when you train your model your
00:05:08.120 | model doesn't get to look at that data at all; that data is only used
00:05:13.960 | to show you how accurate your model is so if you train for too long and or with
00:05:24.200 | not enough data and or a model with too many parameters after a while the
00:05:28.920 | accuracy of your model will actually get worse and this is called overfitting
00:05:34.040 | right so we use the validation set to ensure that we're not overfitting.
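To make that concrete, this is roughly what the data-loading call from lesson 1 looks like (a sketch based on the fastbook notebook; the path, the is_cat labelling function and the transforms are the ones used there, so treat the details as illustrative rather than the only way to do it):

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()   # in this dataset, cat filenames start with an uppercase letter

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path),
    valid_pct=0.2,           # hold out 20% of the images as the validation set
    seed=42,                 # fix the random split so it is reproducible
    label_func=is_cat,       # the labelling function: True for cat, False for dog
    item_tfms=Resize(224))   # resize every image to 224x224
```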
00:05:42.200 | The next line of code that we looked at is this one where we created something
00:05:46.880 | called a learner we'll be learning a lot more about that but a learner is
00:05:50.320 | basically something which contains your data and your architecture, that is,
00:05:57.320 | the mathematical function that you're optimizing and so a learner is the thing
00:06:03.000 | that tries to figure out what are the parameters which best cause this
00:06:06.920 | function to match the labels in this data so we're talking a lot more about
00:06:13.640 | that but basically this particular function resnet 34 is the name of a
00:06:18.960 | particular architecture which is just very good for computer vision problems
00:06:23.480 | in fact the name really is resnet and then 34 tells you how many layers there
00:06:29.040 | are so you can use ones with bigger numbers here to get more parameters that
00:06:33.040 | will take longer to train take more memory more likely to overfit but could
00:06:38.160 | also create more complex models right now though I wanted to focus on this
00:06:44.320 | part here which is metrics equals error rate this is where you list the
00:06:49.920 | functions that you want to be called with
00:06:54.080 | your validation data and printed out after each epoch. An epoch is what we
00:07:01.280 | call it when you look at every single image in the data set once and so after
00:07:06.960 | you've looked at every image in the data set once we print out some information
00:07:11.160 | about how you're doing and the most important thing we print out is the
00:07:15.040 | result of calling these metrics so error rate is the name of a metric and it's a
00:07:20.200 | function that just prints out what percent of the validation set are being
00:07:24.760 | incorrectly classified by your model. So a metric is a function that measures the
00:07:32.280 | quality of the predictions using the validation set, so error rate's one;
00:07:36.720 | another common metric is accuracy, which is just one minus error rate.
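For reference, the learner line being described looks roughly like this (a sketch following the lesson 1 notebook, where dls is the DataLoaders object created earlier):

```python
from fastai.vision.all import *

# The data, the architecture, and the metric(s) to report after each epoch
learn = cnn_learner(dls, resnet34, metrics=error_rate)

# You could ask for accuracy instead, or both; accuracy is just 1 - error_rate
# learn = cnn_learner(dls, resnet34, metrics=[error_rate, accuracy])
```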
00:07:42.160 | So, very important to remember from last week: we talked about loss. Arthur Samuel had this
00:07:48.840 | important idea in machine learning that we need some way to figure out how well
00:07:54.800 | our model is doing so that when we change the parameters we can
00:08:00.040 | figure out which set of parameters make that performance measurement get better
00:08:04.280 | or worse that performance measurement is called the loss the loss is not
00:08:10.440 | necessarily the same as your metric the reason why is a bit subtle and we'll be
00:08:17.140 | seeing it in a lot of detail once we delve into the math in the coming
00:08:20.140 | lessons, but basically you need a loss function where
00:08:27.880 | if you change the parameters by just a little bit up or just a little bit down
00:08:31.960 | you can see if the loss gets a little bit better or a little bit worse and it
00:08:35.920 | turns out that error rate and accuracy don't tell you that at all because you
00:08:40.600 | might change the parameters by such a small amount that none of your
00:08:45.040 | dog predictions start becoming cats and none of your cat predictions start
00:08:48.920 | becoming dogs so like your predictions don't change and your error rate doesn't
00:08:52.640 | change so loss and metric are closely related but the metric is the thing that
00:08:58.080 | you care about the loss is the thing which your computer is using as the
00:09:03.640 | measurement of performance to decide how to update your parameters.
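As a rough numerical illustration of that point (a toy example with made-up numbers, not anything from the lecture): nudging the model's outputs a tiny bit leaves the error rate unchanged, but it does move a loss like cross entropy, which is exactly the signal the optimizer needs:

```python
import torch
import torch.nn.functional as F

# Toy outputs for four images over two classes (cat=0, dog=1), plus the true labels
logits  = torch.tensor([[2.0, -1.0], [0.3, 0.1], [-1.5, 1.0], [0.2, 0.9]])
targets = torch.tensor([0, 1, 1, 1])

def error_rate(logits, targets):
    return (logits.argmax(dim=1) != targets).float().mean()

# Nudge the outputs slightly, as a tiny parameter change would
nudged = logits + torch.tensor([[0.001, 0.0]])

print(error_rate(logits, targets), error_rate(nudged, targets))            # identical: 0.25 and 0.25
print(F.cross_entropy(logits, targets), F.cross_entropy(nudged, targets))  # slightly different
```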
00:09:11.200 | So we measure overfitting by looking at the metrics on the validation set; fastai always
00:09:18.360 | uses the validation set to print out your metrics and overfitting is like the
00:09:24.320 | key thing that machine learning is about it's all about how do we find a model
00:09:30.160 | which fits the data not just for the data that we're training with but for
00:09:35.120 | data that the training algorithm hasn't seen before so overfitting results when
00:09:44.880 | our model is basically cheating our model can cheat by saying oh I've seen
00:09:52.560 | this exact picture before and I remember that that's a picture of a cat so it
00:09:58.040 | might not have learned what cats look like in general it just remembers you
00:10:01.880 | know that images one four and eight are cats and two and three and five are dogs
00:10:06.640 | and learns nothing actually about what they really look like so that's the kind
00:10:11.600 | of cheating that we're trying to avoid we don't want it to memorize our
00:10:15.360 | particular data set so we split off our validation data and most of these
00:10:22.120 | words you're seeing on the screen are from the book okay so I just copied and
00:10:25.080 | pasted them so if we split off our validation data and make sure that our
00:10:31.240 | model never sees it during training, it's completely untainted by it so we can't
00:10:35.080 | possibly cheat not quite true we can cheat the way we could cheat is we could
00:10:41.960 | fit a model, look at the result on the validation set, change
00:10:46.600 | something a little bit fit another model look at the validation set change
00:10:50.280 | something a little bit we could do that like a hundred times until we find
00:10:53.920 | something where the validation set looks the best but now we might have fit to
00:10:57.900 | the validation set right so if you want to be really rigorous about this you
00:11:03.120 | should actually set aside a third bit of data called the test set that is not
00:11:08.640 | used for training and it's not used for your metrics it's actually you don't
00:11:13.520 | look at it until the whole project's finished and this is what's used on
00:11:17.280 | competition platforms like Kaggle on Kaggle after the competition finishes
00:11:23.680 | your performance will be measured against a data set that you have never seen and
00:11:30.920 | so that's a really helpful approach and it's actually a great idea to do that
00:11:38.120 | like even if you're not doing the modeling yourself. So if you're
00:11:43.600 | looking at vendors and you're just trying to decide should I go with IBM or
00:11:48.040 | Google or Microsoft and they're all showing you how great their models are
00:11:52.360 | what you should do is you should say okay you go and build your models and I
00:11:57.680 | am going to hang on to ten percent of my data and I'm not going to let you see it
00:12:01.460 | at all and when you're all finished come back and then I'll run your model on the
00:12:06.240 | ten percent of data you've never seen now pulling out your validation and test
00:12:15.400 | sets is a bit subtle though here's an example of a simple little data set and
00:12:21.260 | this comes from a fantastic blog post that Rachel wrote that we will link to
00:12:25.960 | about creating effective validation sets and you can see basically you have some
00:12:31.120 | kind of seasonal data set here now if you just say okay fast AI I want to
00:12:37.360 | model that I want to create my data loader using a valid percent of 0.2 it
00:12:44.920 | would do this it would delete randomly some of the dots right now this isn't
00:12:52.180 | very helpful because we can still cheat: these dots are right in
00:12:57.560 | the middle of other dots and this isn't what would happen in practice what would
00:13:01.120 | happen in practice is we would want to predict this is sales by date right we
00:13:05.480 | want to predict the sales for next week not the sales for 14 days ago 18 days
00:13:10.720 | ago in 29 days ago right so what you actually need to do to create an
00:13:15.120 | effective validation set here is not do it randomly but instead chop off the end
00:13:21.640 | right and so this is what happens in all Kaggle competitions pretty much that
00:13:26.240 | involve time; for instance, the thing that you have to predict is the next
00:13:30.640 | like two weeks or so after the last data point that they give you and this is
00:13:36.760 | what you should do also for your test set so again if you've got vendors that
00:13:40.760 | you're looking at you should say to them okay after you're all done modeling we're
00:13:45.000 | going to check your model against data that is one week later than you've ever
00:13:49.920 | seen before and you won't be able to retrain or anything because that's what
00:13:53.880 | happens in practice.
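A rough sketch of what that kind of time-based split could look like in code (hypothetical file and column names, using plain pandas rather than anything fastai-specific):

```python
import pandas as pd

# Assume a table of daily sales with a 'date' column (hypothetical file and columns)
df = pd.read_csv('sales.csv', parse_dates=['date']).sort_values('date')

# A random split (what valid_pct=0.2 does) would leak future information here.
# Instead, hold out the most recent 20% of rows as the validation set:
cutoff   = int(len(df) * 0.8)
train_df = df.iloc[:cutoff]
valid_df = df.iloc[cutoff:]

# The validation period starts strictly after the training period ends
print(train_df['date'].max(), valid_df['date'].min())
```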
00:14:00.480 | Okay, there's a question: I've heard people describe overfitting as training error being below validation error. Does this rule of
00:14:05.120 | thumb end up being roughly the same as yours okay so that's a great question so
00:14:09.560 | I think what they mean there is training loss versus validation loss because we
00:14:15.840 | don't print training error so we do print at the end of each epoch the value
00:14:21.680 | of your loss function for the training set and the value of the loss function
00:14:25.080 | for the validation set and if you train for long enough so if it's training
00:14:32.120 | nicely your training loss will go down and your validation loss will go down
00:14:37.200 | because by definition the loss function is defined such that a lower loss means
00:14:44.920 | a better model if you start overfitting your training loss will keep going down
00:14:51.440 | right, because, like, why wouldn't it? You know, you're getting better and better
00:14:55.480 | parameters but your validation loss will start to go up because actually you
00:15:02.160 | started fitting to the specific data points in the training set and so it's
00:15:05.920 | not going to actually get better
00:15:08.640 | for the validation set, it'll start to get worse. However, that does not
00:15:14.880 | necessarily mean that you're overfitting or at least not overfitting in a bad way
00:15:19.080 | as we'll see it's actually possible to be at a point where the validation loss
00:15:24.760 | is getting worse but the validation accuracy or error or metric is still
00:15:29.640 | improving so I'm not going to describe how that would happen mathematically yet
00:15:35.240 | because we need to learn more about loss functions but we will but for now just
00:15:39.640 | realize that the important thing to look at is your metric getting worse not your
00:15:46.120 | loss function getting worse thank you for that fantastic question the next
00:15:56.520 | important thing we need to learn about is called transfer learning so the next
00:16:00.600 | line of code said learn fine tune why does it say learn fine tune fine tune is
00:16:07.360 | what we do when we are transfer learning so transfer learning is using a pre
00:16:12.580 | trained model for a task that is different to what it was originally
00:16:16.060 | trained for so more jargon to understand our jargon let's look at that what's a
00:16:20.980 | pre trained model so what happens is remember I told you the architecture
00:16:25.240 | we're using is called resnet 34 so when we take that resnet 34 that's just a
00:16:30.640 | it's just a mathematical function okay with lots of parameters that we're going
00:16:35.320 | to fit using machine learning there's a big data set called image net that
00:16:42.560 | contains 1.3 million pictures of a thousand different types of thing
00:16:46.640 | whether it be mushrooms or animals or airplanes or hammers or whatever there's
00:16:55.240 | a competition, or there used to be a competition that ran every year, to see
00:16:58.240 | who could get the best accuracy on the image net competition and the models
00:17:02.600 | that did really well people would take those specific values of those
00:17:07.800 | parameters and they would make them available on the internet for anybody to
00:17:11.920 | download so if you download that you don't just have an architecture now you
00:17:16.280 | have a trained model you have a model that can recognize a thousand categories
00:17:22.320 | of thing in images which probably isn't very useful unless you happen to want
00:17:28.400 | something that recognizes those exact thousand categories of thing but it turns
00:17:33.120 | out you can start with those weights in your model and then
00:17:40.560 | train some more epochs on your data and you'll end up with a far far more
00:17:47.120 | accurate model than you would if you didn't start with that pre-trained model
00:17:51.540 | and we'll see why in just a moment right but this idea of transfer learning it's
00:17:57.040 | kind of it makes intuitive sense right image net already has some cats and some
00:18:03.520 | dogs in it it's you know it can say this is a cat and this is a dog but you want
00:18:07.320 | to maybe do something that recognizes lots of breeds that aren't in ImageNet;
00:18:11.000 | well for it to be able to recognize cats versus dogs versus airplanes versus
00:18:16.160 | hammers it has to understand things like what does metal look like what does fur
00:18:22.120 | look like, and so on, you know, so it can say like oh this breed of
00:18:26.480 | animal this breed of dog has pointy ears and oh this thing is metal so it can't
00:18:31.320 | be a dog so all these kinds of concepts get implicitly learnt by a pre-trained
00:18:37.360 | model so if you start with a pre-trained model then you don't have
00:18:41.880 | to learn all these features from scratch and so transfer learning is the single
00:18:48.840 | most important thing for being able to use less data and less compute and get
00:18:54.960 | better accuracy so that's a key focus for the fast AI library and a key focus
00:19:00.920 | for this course there's a question I'm a bit confused on the differences between
00:19:08.840 | loss error and metric
00:19:12.600 | Loss, error and metric? Sure. So error is just one kind of metric; there's lots
00:19:23.600 | of different possible labels you could have let's say you're trying to create a
00:19:27.360 | model which could predict how old a cat or dog is so the metric you might use is
00:19:36.920 | on average how many years were you off by so that would be a metric on the other
00:19:44.440 | hand if you're trying to predict whether this is a cat or a dog your metric could
00:19:49.720 | be what percentage of the time am I wrong so that latter metric is called the
00:19:55.240 | error rate okay so error is one particular metric it's a thing that
00:20:00.320 | measures how well you're doing and it's like it should be the thing that you
00:20:04.880 | most care about so you write a function or use one of fast AI's pre-defined ones
00:20:10.520 | which measures how well you're doing loss is the thing that we talked about in
00:20:19.820 | lesson one so I'll give a quick summary but go back to lesson one if you don't
00:20:23.760 | remember Arthur Samuel talked about how a machine learning model needs some
00:20:29.000 | measure of performance which we can look at when we adjust our parameters up or
00:20:35.360 | down does that measure of performance get better or worse and as I mentioned
00:20:40.500 | earlier some metrics possibly won't change at all if you move the parameters
00:20:47.960 | up and down just a little bit so they can't be used for this purpose of
00:20:52.920 | adjusting the parameters to find a better measure of performance so quite
00:20:56.600 | often we need to use a different function we call this the loss function
00:21:00.840 | the loss function is the measure of performance that the algorithm uses to
00:21:05.480 | try to make the parameters better and it's something which should kind of
00:21:09.600 | track pretty closely to the the metric you care about but it's something which
00:21:15.120 | as you change the parameters a bit the loss should always change a bit and so
00:21:20.880 | there's a lot of hand waving there because we need to look at some of the
00:21:24.440 | math of how that works and we'll be doing that in the next couple of lessons
00:21:30.920 | thanks for the great questions okay so fine-tuning is a particular transfer
00:21:39.840 | learning technique where the... oh, and you're still showing your picture and
00:21:44.960 | not the slides. So fine-tuning is a transfer learning technique where the
00:21:54.720 | weights this is not quite the right word we should say the parameters where the
00:21:58.400 | parameters of a pre-trained model are updated by training for additional epochs
00:22:02.800 | using a different task to that used for pre-training so pre-training the task
00:22:07.400 | might have been image net classification and then our different task might be
00:22:12.040 | recognizing cats versus dogs so the way by default fast AI does fine-tuning is
00:22:22.240 | that we use one epoch, which remember is looking at every image in the data
00:22:27.420 | set once one epoch to fit just those parts of the model necessary to get the
00:22:34.240 | particular part of the model that's especially for your data set working and
00:22:40.480 | then we use as many epochs as you ask for to fit the whole model. And so, for
00:22:45.640 | those people who might be a bit more advanced, we'll see
00:22:48.680 | exactly how this works later on in the lessons.
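In code, that is just the learn.fine_tune call from lesson 1. Roughly speaking (this is a simplified sketch of the behaviour just described, not fastai's exact implementation, which also uses things like discriminative learning rates), it does something like this:

```python
# What you actually call:
learn.fine_tune(4)       # four epochs of fine-tuning after the initial frozen epoch

# Roughly what happens under the hood (simplified sketch):
learn.freeze()           # keep the pretrained layers fixed...
learn.fit_one_cycle(1)   # ...and train only the newly added final layers for one epoch
learn.unfreeze()         # then allow every parameter to be updated...
learn.fit_one_cycle(4)   # ...and train the whole model for the epochs you asked for
```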
00:22:55.560 | So why does transfer learning work, and why does it work so well? The best way in my opinion to look at this
00:22:59.920 | is to see this paper by Zeiler and Fergus who were actually 2012 ImageNet winners
00:23:07.200 | and interestingly their key insights came from their ability to visualize
00:23:13.160 | what's going on inside a model so visualization very often turns out to be
00:23:17.960 | super important to getting great results what they were able to do was they looked
00:23:22.600 | remember I told you like a resnet 34 has 34 layers they looked at something
00:23:28.880 | called AlexNet which was the previous winner of the competition, which only had
00:23:32.520 | seven layers at the time that was considered huge and so they took a seven
00:23:37.040 | layer model and they said what is the first layer of parameters look like and
00:23:42.720 | they figured it out how to draw a picture of them right and so the first
00:23:47.720 | layer had lots and lots of features but here are nine of them one two three four
00:23:55.900 | five six seven eight nine and here's what nine of those features look like one of
00:24:00.960 | them was something that could recognize diagonal lines from top left to bottom
00:24:04.440 | right one of them could find diagonal lines from bottom left to top right one
00:24:08.840 | of them could find gradients that went from the top of orange to the bottom of
00:24:12.400 | blue some of them were able you know one of them was specifically for finding
00:24:17.740 | things that were green and so forth right so for each of these nine they're
00:24:25.040 | called filters they're all features so then something really interesting they
00:24:30.400 | did was they looked at for each one of these each one of these filters each one
00:24:34.840 | of these features and we'll learn kind of mathematically about what these
00:24:38.360 | actually mean in the coming lessons but for now let's just recognize them as
00:24:43.120 | saying oh there's something that looks at diagonal lines and something that
00:24:45.520 | looks at gradients and they found in the actual images in ImageNet specific
00:24:52.900 | examples of parts of photos that match that filter so for this top left filter
00:24:58.400 | here are nine actual patches of real photos that match that filter and as you
00:25:04.360 | can see they're all diagonal lines and so here's the for the green one here's
00:25:08.560 | parts of actual photos that match the green one so layer one is super super
00:25:14.560 | simple and one of the interesting things to note here is that something that can
00:25:18.000 | recognize gradients and patches of color and lines is likely to be useful for
00:25:22.360 | lots of other tasks as well not just ImageNet so you can kind of see how
00:25:26.760 | something that can do this might also be good at many many other computer vision
00:25:33.280 | tasks as well this is layer two layer two takes the features of layer one and
00:25:40.680 | combines them so it can not just find edges but can find corners or repeating
00:25:49.800 | curving patterns or semicircles or full circles and so you can see for example
00:25:56.760 | here's a it's kind of hard to exactly visualize these layers after layer one
00:26:06.120 | you kind of have to show examples of what the filters look like but here you
00:26:11.080 | can see examples of parts of photos that this layer two circular filter has
00:26:17.080 | activated on and as you can see it's found things with circles so
00:26:23.880 | interestingly this one which is this kind of blotchy gradient seems to be
00:26:28.040 | very good at finding sunsets and this repeating vertical pattern is very good
00:26:33.320 | at finding like curtains and wheat fields and stuff so the further we get
00:26:39.560 | layer three then gets to combine all the kinds of features in layer two and
00:26:45.320 | remember we're only seeing so we're only seeing here 12 of the features but
00:26:49.480 | actually there's probably hundreds of them I don't remember exactly in Alex
00:26:52.520 | Net but there's lots but by the time we get to layer three by combining features
00:26:57.440 | from layer two it already has something which is finding text so this is a
00:27:03.480 | feature which can find bits of image that contain text it's already got
00:27:08.120 | something which can find repeating geometric patterns and you see this is
00:27:12.980 | not just like a matching specific pixel patterns this is like a semantic concept
00:27:20.240 | it can find repeating circles or repeating squares or repeating hexagons
00:27:24.400 | right so it's it's really like computing it's not just matching a template and
00:27:31.220 | remember we know that neural networks can solve any possible computable
00:27:35.180 | function so it can certainly do that so layer 4 gets to combine all the filters
00:27:43.400 | from layer 3 any way it wants, and so by layer 4 we have something that can find
00:27:47.240 | dog faces for instance so you can kind of see how each layer we get like
00:27:56.800 | multiplicatively more sophisticated features and so that's why these deep
00:28:01.880 | neural networks can be so incredibly powerful it's also why transfer learning
00:28:07.740 | can work so well because like if we wanted something that can find books and
00:28:13.160 | I don't think there's a book category in ImageNet well it's actually already got
00:28:17.020 | something that can find text as an earlier filter which I guess it must be
00:28:20.840 | using to find maybe there's a category for library or something or a bookshelf
00:28:25.880 | so when you use transfer learning you can take advantage of all of these
00:28:30.840 | pre-learned features to find things that are just combinations of these existing
00:28:37.680 | features; that's why transfer learning can be done so much more quickly and with so
00:28:42.880 | much less data than traditional approaches one important thing to
00:28:48.360 | realize then is that these techniques for computer vision are not just good at
00:28:53.160 | recognizing photos there's all kinds of things you can turn into pictures for
00:28:58.680 | example, these are sounds that have been turned into
00:29:04.440 | pictures by representing their frequencies over time and it turns out
00:29:10.080 | that if you convert a sound into these kinds of pictures you can get basically
00:29:15.960 | state-of-the-art results at sound detection just by using the exact same
00:29:21.660 | ResNet learner that we've already seen. I wanted to highlight that it's 9:45, so if
00:29:28.520 | you want to take a break soon a really cool example from I think this is our
00:29:34.160 | very first year of running fast AI one of our students created pictures they
00:29:40.080 | worked at Splunk in anti-fraud and they created pictures of users moving their
00:29:45.240 | mouse and if I remember correctly as they moved their mouse he basically drew
00:29:49.800 | a picture of where the mouse moved and the color depended on how fast they
00:29:54.180 | moved and these circular blobs is where they clicked the left or the right mouse
00:29:59.080 | button. And at Splunk, well, what he did actually, as a project for the
00:30:04.740 | course, is he tried to see whether he could use
00:30:09.480 | these pictures with exactly the same approach we saw in lesson one to create
00:30:15.000 | an anti-fraud model and it worked so well that Splunk ended up patenting a new
00:30:21.240 | product based on this technique and you can actually check it out there's a blog
00:30:25.040 | post about it on the internet where they describe this breakthrough anti-fraud
00:30:29.120 | approach which literally came from one of our really amazing and brilliant and
00:30:34.440 | creative students after lesson one of the course another cool example of this
00:30:40.640 | is looking at different viruses and again turning them into pictures and you
00:30:48.800 | can kind of see how they've got here this is from a paper check out the book
00:30:52.500 | for the citation they've got three examples of a particular virus called
00:30:57.200 | VB.AT and another example of a particular virus called Fakerean, and you
00:31:02.240 | can see each case the pictures all look kind of similar and that's why again
00:31:06.960 | they can get state-of-the-art results in virus detection by turning the kind
00:31:12.760 | of program signatures into pictures and putting it through image recognition so
00:31:20.520 | in the book you'll find a list of all of the terms all of the most important
00:31:25.560 | terms we've seen so far and what they mean. I'm not going to read through them
00:31:29.280 | but I want you to, please, because these are the terms that we're
00:31:33.480 | going to be using from now on and you've got to know what they mean because if
00:31:38.720 | you don't you're going to be really confused because I'll be talking about
00:31:41.320 | labels and architectures and models and parameters and they have very specific
00:31:46.080 | exact meanings and I'll be using those exact meanings so please review this so
00:31:52.520 | to remind you, this is where we got to: we ended up with Arthur Samuel's overall
00:31:59.520 | approach and we replaced his terms with our terms. So we have an architecture
00:32:05.520 | which takes parameters and the data as
00:32:11.960 | inputs, so that the architecture plus the parameters is the model, and the
00:32:18.560 | inputs are used to calculate predictions, which are compared to the labels with a
00:32:23.480 | loss function, and that loss function is used to update the parameters many many
00:32:28.320 | times to make them better and better until the loss gets nice and super low
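To tie that diagram back to code, here is a minimal, self-contained sketch of that loop on toy data (nothing to do with the pet dataset, and the "architecture" here is deliberately trivial; it just shows the predict, compare, update cycle):

```python
import torch

# Toy inputs and labels, just to make the loop runnable
xs = torch.randn(100, 3)
ys = (xs.sum(dim=1, keepdim=True) > 0).float()

params = torch.zeros(3, 1, requires_grad=True)           # the parameters the loop will update

def architecture(x, w): return torch.sigmoid(x @ w)      # a deliberately trivial "architecture"
loss_function = torch.nn.functional.binary_cross_entropy

for epoch in range(20):
    preds = architecture(xs, params)     # inputs + parameters -> predictions
    loss = loss_function(preds, ys)      # compare the predictions to the labels
    loss.backward()                      # how should each parameter change to reduce the loss?
    with torch.no_grad():
        params -= 0.1 * params.grad      # update the parameters...
        params.grad.zero_()              # ...and reset the gradients for the next step
```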
00:32:34.360 | so this is the end of chapter one of the book it's really important to look at
00:32:39.800 | the questionnaire because the questionnaire is the thing where you can
00:32:43.080 | check whether you have taken away from this chapter of the book the stuff that
00:32:49.260 | we hope you have, so go through it, and for anything that you're not sure about,
00:32:55.520 | the answer is in the text, so just go back to earlier in the chapter and
00:33:00.080 | you will find the answers. There's also a further
00:33:05.800 | research section after each questionnaire for the first couple of
00:33:09.480 | chapters they're actually pretty simple hopefully they're pretty fun and
00:33:12.240 | interesting they're things where to answer the question it's not enough to
00:33:15.480 | just look in the chapter you actually have to go and do your own thinking and
00:33:19.640 | experimenting and googling and so forth in later chapters some of these further
00:33:25.880 | research things are pretty significant projects that might take a few days or
00:33:30.320 | even weeks and so yeah you know check them out because hopefully they'll be a
00:33:35.480 | great way to expand your understanding of the material so something that Sylvain
00:33:42.560 | points out in the book is that if you really want to make the most of this
00:33:46.000 | then after each chapter please take the time to experiment with your own project
00:33:50.640 | and with the notebooks we provide, and then see if you can redo
00:33:55.560 | the notebooks on a new data set and perhaps for chapter one that might be a
00:34:00.200 | bit hard because we haven't really shown how to change things, but for
00:34:03.640 | chapter two which we're going to start next you'll absolutely be able to do
00:34:07.240 | that. Okay, so let's take a five minute break and we'll come back at 9:55 San
00:34:16.880 | Francisco time okay so welcome back everybody and I think we've got a couple
00:34:22.360 | of questions to start with so Rachel please take it away. Sure: are filters
00:34:27.560 | independent? By that I mean, if filters are pre-trained, might they become less
00:34:31.840 | good in detecting features of previous images when fine-tuned oh that is a
00:34:37.000 | great question so assuming I understand the question correctly, if you start
00:34:43.120 | with say an ImageNet model and then you fine-tune it on dogs versus cats for
00:34:49.720 | a few epochs and you get something that's very good at recognizing dogs
00:34:53.560 | versus cats it's going to be much less good as an image net model after that
00:34:58.840 | so it's not going to be very good at recognizing airplanes or or hammers or
00:35:03.520 | whatever this is called catastrophic forgetting in the literature the idea
00:35:10.040 | that as you like see more images about different things to what you saw earlier
00:35:14.360 | that you start to forget about the things you saw earlier so if you want to
00:35:20.180 | fine-tune something which is good at a new task but also continues to be good
00:35:26.080 | at the previous task you need to keep putting in examples of the previous task
00:35:30.000 | as well. And what are the differences between parameters
00:35:37.800 | and hyper parameters if I am feeding an image of a dog as an input and then
00:35:43.120 | changing the hyper parameters of batch size in the model what would be an
00:35:47.160 | example of a parameter so the parameters are the things that described in lesson
00:35:55.160 | one that Arthur Samuel described as being the things which change what the
00:36:02.720 | model does what the architecture does so we start with this infinitely flexible
00:36:08.540 | function the thing called a neural network that can do anything at all and
00:36:14.080 | the the way you get it to do one thing versus another thing is by changing its
00:36:19.880 | parameters there they are the numbers that you pass into that function so
00:36:24.640 | there's two types of numbers you pass into the function there's the numbers
00:36:27.640 | that represent your input like the pixels of your dog and there's the
00:36:32.240 | numbers that represent the learnt parameters so in the example of
00:36:39.520 | something that's not a neural net but like a checkers playing program like
00:36:43.560 | Arthur Samuel might have used back in the early 60s and late 50s those
00:36:47.640 | parameters may have been things like if there is an opportunity to take a piece
00:36:54.640 | versus an opportunity to get to the end of a board how much more value should I
00:36:59.920 | consider one versus the other you know it's twice as important or it's three
00:37:03.720 | times as important that two versus three that would be an example of a parameter
00:37:08.480 | in a neural network parameters are a much more abstract concept and so a
00:37:14.960 | detailed understanding of what they are will come in the next lesson or two but
00:37:20.080 | it's the same basic idea they're the numbers which change what the model does
00:37:26.480 | to be something that recognizes malignant tumors versus cats versus dogs
00:37:32.600 | versus colorizes black and white pictures whereas the hyper parameter is
00:37:38.920 | the choices about what numbers you pass to the function when you call
00:37:46.040 | the actual fitting function to decide how that fitting process happens there's
00:37:52.520 | a question I'm curious about the pacing of this course I'm concerned that all
00:37:55.960 | the material may not be covered depends what you mean by all the material we
00:38:00.920 | certainly won't cover everything in the world so yeah we'll cover what we can
00:38:08.400 | then we'll cover what we can in seven lessons we're certainly not covering the
00:38:12.960 | whole book if that's what you're wondering the whole book will be covered
00:38:16.200 | in either two or three courses in the past it's generally been two courses to
00:38:22.200 | cover about the amount of stuff in the book but we'll see how it goes because
00:38:26.480 | the books pretty big 500 pages when you say two courses you mean 14 lessons 14
00:38:32.520 | yeah so it'd be like 14 or 21 lessons to get through the whole book although
00:38:38.080 | having said that by the end of the first lesson hopefully there'll be kind of
00:38:40.880 | like enough momentum and understanding that reading the book independently
00:38:44.800 | will be more useful and you'll have also kind of gained a community of folks on
00:38:50.800 | the forums that you can hang out with and ask questions of and so forth so in
00:38:57.380 | the second part of the course we're going to be talking about putting stuff
00:39:02.000 | in production, and so to do that we need to understand like what are the
00:39:08.040 | capabilities and limitations of deep learning, what are the kinds of projects
00:39:13.880 | that even make sense to try to put in production and you know one of the key
00:39:18.600 | things I should mention, in the book and in this course, is that the first two or
00:39:22.160 | three lessons and chapters there's a lot of stuff which is designed not just for
00:39:27.640 | coders but for everybody; there's lots of information about like what are
00:39:34.880 | the practical things you need to know to make deep learning work and so one of
00:39:38.320 | the things you need to know is like well what's deep learning actually good
00:39:41.360 | at at the moment so I'll summarize what the book says about this but there are
00:39:48.240 | the kind of four key areas that we have as applications in fast AI computer
00:39:53.760 | vision, text, tabular, and what I've called here recsys; this stands for recommendation
00:39:58.200 | systems and specifically a technique called collaborative filtering which we
00:40:01.760 | briefly saw last week sorry another question is are there any pre-trained
00:40:06.960 | weights available other than the ones from image net that we can use if yes
00:40:11.240 | when should we use others over the ImageNet one? Oh, that's a really great question; so
00:40:16.280 | yes there are a lot of pre-trained models and one way to find them but also
00:40:23.320 | (you're currently just showing... switching, okay great) one great way to find them is
00:40:29.120 | you can look up model zoo which is a common name for like places that have
00:40:36.480 | lots of different models and so here's lots of model zoos or you can look for
00:40:44.400 | pre-trained models and so yeah there's quite a few unfortunately not as wide a
00:40:57.320 | variety as I would like; most are still on ImageNet or similar kinds of
00:41:02.640 | general photos for example medical imaging there's hardly any there's a lot
00:41:09.800 | of opportunities for people to create domain specific pre-trained models it's
00:41:13.320 | it's still an area that's really underdone because not enough people are
00:41:16.320 | working on transfer learning okay so as I was mentioning we've kind of got these
00:41:23.760 | four applications that we've talked about a bit and deep learning is pretty you
00:41:32.560 | know pretty good at all of those tabular data like spreadsheets and database
00:41:39.160 | tables is an area where deep learning is not always the best choice but it's
00:41:44.160 | particularly good for things involving high cardinality variables that means
00:41:48.280 | variables that have like lots and lots of discrete levels like zip code or
00:41:52.520 | product ID or something like that deep learning is really pretty great for
00:41:58.600 | those in particular for text it's pretty great at things like classification and
00:42:06.760 | translation it's actually terrible for conversation and so that's that's been
00:42:11.720 | something that's been a huge disappointment for a lot of companies
00:42:14.120 | they tried to create these like conversation bots but actually deep
00:42:18.480 | learning isn't good at providing accurate information it's good at
00:42:23.240 | providing things that sound accurate and sound compelling but it we don't really
00:42:27.280 | have great ways yet of actually making sure it's correct one big issue for
00:42:34.840 | recommendation systems collaborative filtering is that deep learning is
00:42:39.880 | focused on making predictions which don't necessarily actually mean creating
00:42:44.760 | useful recommendations we'll see what that means in a moment deep learning is
00:42:50.680 | also good at multimodal that means things where you've got multiple
00:42:56.440 | different types of data so you might have some tabular data including a text
00:43:00.360 | column and an image and some collaborative filtering data and
00:43:06.880 | combining that all together is something that deep learning is really good at so
00:43:11.040 | for example putting captions on photos is something which deep learning is
00:43:17.920 | pretty good at although again it's not very good at being accurate so what you
00:43:22.400 | know it might say this is a picture of two birds it's actually a picture of
00:43:25.800 | three birds and then this other category there's lots and lots of things that you
00:43:33.800 | can do with deep learning by being creative about the use of these kinds of
00:43:38.240 | other application-based approaches for example an approach that we developed
00:43:43.600 | for natural language processing called ULM fit or you're learning in the course
00:43:48.120 | it turns out that it's also fantastic at doing protein analysis if you think of
00:43:53.040 | the different proteins as being different words and they're in a
00:43:57.360 | sequence which has some kind of state and meaning it turns out that ULM fit
00:44:02.240 | works really well for protein analysis so often it's about kind of being being
00:44:06.880 | creative so to decide like for the product that you're trying to build is
00:44:12.480 | deep learning going to work well for it in the end you kind of just have to try
00:44:17.600 | it and see but if you if you do a search you know hopefully you can find
00:44:24.480 | examples about the people that have tried something similar even if you
00:44:27.760 | can't that doesn't mean it's not going to work so for example I mentioned the
00:44:33.280 | collaborative filtering issue where a recommendation and a prediction are not
00:44:37.840 | necessarily the same thing you can see this on Amazon for example quite often
00:44:43.040 | so I bought a Terry Pratchett book and then Amazon tried for months to get me to
00:44:48.880 | buy more Terry Pratchett books now that must be because their predictive model
00:44:53.240 | said that people who bought one particular Terry Pratchett book are
00:44:57.440 | likely to also buy other Terry Pratchett books but from the point of view of like
00:45:01.880 | well is this going to change my buying behavior probably not right like if I
00:45:07.040 | liked that book I already know I like that author and I already know that like
00:45:10.440 | they probably wrote other things so I'll go and buy it anyway so this would be an
00:45:14.360 | example of like Amazon probably not being very smart up here they're
00:45:18.720 | actually showing me collaborative filtering predictions rather than
00:45:23.280 | actually figuring out how to optimize a recommendation so an optimized
00:45:27.520 | recommendation would be something more like your local human bookseller might
00:45:32.360 | do where they might say oh you like Terry Pratchett well let me tell you
00:45:36.840 | about other kind of comedy fantasy sci-fi writers on the similar vein who
00:45:41.440 | you might not have heard about before so the difference between recommendations
00:45:46.240 | and predictions is super important so I wanted to talk about a really important
00:45:53.360 | issue around interpreting models, and for a case study for this I thought let's
00:45:59.000 | pick something that's actually super important right now which is a model in
00:46:03.240 | this paper one of the things we're going to try and do in this course is learn
00:46:06.160 | how to read papers. So here is a paper which I would love for everybody to
00:46:11.480 | read called high temperature and high humidity reduce the transmission of
00:46:15.540 | COVID-19 now this is a very important issue because if the claim of this paper
00:46:20.840 | is true, that would mean that this is going to be a seasonal disease, and if
00:46:25.360 | this is a seasonal disease, it's going to have massive policy implications
00:46:30.360 | so let's try and find out how this was modeled and understand how to interpret
00:46:35.240 | this model so this is a key picture from the paper and what they've done here is
00:46:45.560 | they've taken a hundred cities in China and they've plotted the temperature on
00:46:50.300 | one axis in Celsius and R on the other axis, where R is a measure of
00:46:56.160 | transmissibility it says for each person that has this disease how many people on
00:47:02.200 | average will they infect so if R is under one then the disease will not
00:47:07.720 | spread; if R is higher than like two it's going to spread incredibly quickly
00:47:14.840 | and basically, you know, any high R is going to create an
00:47:18.560 | exponential transmission impact and you can see in this case they have plotted a
00:47:25.000 | best fit line through here and then they've made a claim that there's some
00:47:30.440 | particular relationship in terms of a formula that R is 1.99 minus 0.023 times
00:47:38.480 | temperature so a very obvious concern I would have looking at this picture is
00:47:44.840 | that this might just be random maybe there's no relationship at all but just
00:47:52.160 | if you picked a hundred cities at random perhaps they would sometimes show this
00:47:57.680 | level of relationship so one simple way to kind of see that would be to actually
00:48:04.840 | do it in a spreadsheet. So here is a spreadsheet where what I did was I kind
00:48:12.960 | of eyeballed this data and I guessed about what is the mean degrees centigrade
00:48:17.920 | I think it's about five and what's about the standard deviation of centigrade I
00:48:22.440 | think it's probably about five as well and then I did the same thing for R I
00:48:27.240 | think the mean R looks like it's about 1.9 to me and it looks like the standard
00:48:32.040 | deviation of R is probably about 0.5 so what I then did was I just jumped over
00:48:38.560 | here and I created a random normal value so a random value from a normal
00:48:46.000 | distribution, so a bell curve with that particular
00:48:50.200 | mean and standard deviation of temperature and that particular mean and
00:48:55.120 | standard deviation of R and so this would be an example of a city that might
00:49:02.480 | be in this data set of a hundred cities something with nine degrees Celsius and
00:49:06.800 | an R of 1.1 so that would be nine degrees Celsius and an R of 1.1 so
00:49:12.680 | something about here and so then I just copied that formula down 100 times so
00:49:22.920 | here are a hundred cities that could be in China right where this is assuming
00:49:30.160 | that there is no relationship between temperature and R right they're just
00:49:34.320 | random numbers and so each time I recalculate that so if I hit ctrl equals
00:49:42.000 | it will just recalculate it right I get different numbers okay because they're
00:49:47.680 | random and so you can see at the top here I've then got the average of all of
00:49:55.240 | the temperatures and the average of all of the R's and the average of all the
00:49:58.880 | temperatures varies and the average of all of R's varies as well so then I what
00:50:09.560 | I did was I copied those random numbers over here let's actually do it so I'll
00:50:18.600 | go copy these 100 random numbers and paste them here here here here and so
00:50:32.760 | now I've got one two three four five six I've got six kind of groups of 100
00:50:40.720 | cities right and so let's stop those from randomly changing anymore by just
00:50:49.520 | fixing them in stone there okay so now that I've paste them in I've got six
00:51:01.520 | examples of what a hundred cities might look like if there was no relationship
00:51:06.440 | at all between temperature and R and I've got their mean temperature and R
00:51:11.560 | in each of those six examples and what I've done is you can see here at least
00:51:16.980 | for the first one is I've plotted it right and you can see in this case there's
00:51:22.040 | actually a slight positive slope and I've actually calculated the slope for
00:51:33.500 | each just by using the slope function in Microsoft Excel and you can see that
00:51:37.840 | actually in this particular case, which is just random, five times it's been negative and
00:51:46.200 | it's even more negative than their 0.023 and so you can, like, it's kind of
00:51:53.560 | matching our intuition here which is that the slope of the line that we
00:51:57.800 | have here is something that absolutely can often happen totally by chance it
00:52:03.680 | doesn't seem to be indicating any kind of real relationship at all if we wanted
00:52:09.240 | to be, like, more confident about that slope we would need to look at more cities so like
00:52:17.800 | here I've got 3,000 randomly generated numbers and you can see here the slope
00:52:26.960 | is 0.00002 right it's almost exactly zero which is what we'd expect right when
00:52:33.080 | there's actually no relationship between C and R and in this case there isn't
00:52:37.440 | they're all random then if we look at lots and lots of randomly generated
00:52:41.360 | cities then we can say oh yeah, there's no slope, but when you only look
00:52:45.840 | at a hundred as we did here you're going to see relationships totally
00:52:51.520 | coincidentally very very often right so that's something that we need to be able
00:52:57.360 | to measure and so one way to measure that is we use something called a p-value so
00:53:03.080 | a p-value here's how a p-value works we start out with something called a null
00:53:07.720 | hypothesis, and the null hypothesis is basically what's our starting
00:53:13.760 | point assumption so our starting point assumption might be oh there's no
00:53:17.280 | relationship between temperature and R and then we gather some data and have
00:53:22.280 | you explained what R is I have yes R is the transmissibility of the virus so
00:53:28.680 | then we gather data of independent and dependent variables so in this case the
00:53:32.860 | independent variable is the thing that we think might cause a dependent variable
00:53:38.000 | so here the independent variable would be temperature the dependent variable
00:53:41.000 | would be R so here we've gathered data there's the data that was gathered in
00:53:45.720 | this example and then we say what percentage of the time would we see this
00:53:50.880 | amount of relationship which is a slope of 0.023 by chance and as we've seen one
00:53:57.720 | way to do that is by what we would call a simulation which is by generating
00:54:02.080 | random numbers, a hundred pairs of random numbers, a bunch of times and
00:54:06.440 | seeing how often you see this relationship.
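Here is a rough recreation of that simulation in code rather than a spreadsheet (the means and standard deviations are the eyeballed guesses from above, not values taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
n_cities, n_trials = 100, 10_000
slopes = []
for _ in range(n_trials):
    temp = rng.normal(5.0, 5.0, n_cities)   # random temperatures, unrelated to R by construction
    r    = rng.normal(1.9, 0.5, n_cities)   # random R values, independent of temperature
    slope, _ = np.polyfit(temp, r, 1)       # slope of the best-fit line through the 100 cities
    slopes.append(slope)

slopes = np.array(slopes)
# How often does pure chance give a slope at least as negative as the paper's -0.023?
print((slopes <= -0.023).mean())
```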
00:54:12.240 | We don't actually have to do it that way, though; there's actually a simple equation we can use to jump
00:54:17.840 | straight to this number which is what percent of the time would we see that
00:54:21.280 | relationship by chance and this is basically what that looks like we have
00:54:31.040 | the most likely observation which in this case would be if there is no
00:54:35.980 | relationship between temperature and R then the most likely slope would be 0
00:54:40.040 | and sometimes you get positive slopes by chance and sometimes you get pretty small
00:54:48.940 | slopes and sometimes you get large negative slopes by chance and so the you
00:54:55.360 | know the larger the number the less likely it is to happen whether it be on
00:54:58.360 | the positive side or the negative side and so in our case our question was how
00:55:04.880 | often are we going to get less than negative 0.023 so it would actually be
00:55:10.000 | somewhere down here and I actually copy this from Wikipedia where they were
00:55:13.560 | looking for positive numbers and so they've colored in this area above a
00:55:17.760 | number so this is the p-value and so you can we don't care about the math but
00:55:22.480 | there's a simple little equation you can use to directly figure out this number
00:55:29.720 | the p-value from the data so this is kind of how nearly all kind of medical
00:55:39.840 | research results tend to be shown and folks really focus on this idea of p
00:55:45.480 | values and indeed in this particular study as we'll see in a moment they
00:55:49.640 | reported p-values so probably a lot of you have seen p-values in your previous
00:55:55.840 | lives they come up in a lot of different domains here's the thing they are
00:56:01.840 | terrible you almost always shouldn't be using them don't just trust me trust the
00:56:07.800 | American Statistical Association they point out six things about p-values and
00:56:14.240 | those include p-values do not measure the probability that the hypothesis is
00:56:19.320 | true or the probability that the data were produced by random chance alone now
00:56:24.480 | we know this because we just saw that if we use more data right so if we sample
00:56:32.040 | 3000 random cities rather than a hundred we get a much smaller value right so p
00:56:40.200 | values don't just tell you about how big a relationship is but they actually tell
00:56:44.320 | you about a combination of that and how much data did you collect right so so
00:56:49.560 | they don't measure the probability that the hypothesis is true so therefore
00:56:53.960 | conclusions and policy decisions should not be based on whether a p-value passes
00:56:58.920 | some threshold p-value does not measure the importance of a result right because
00:57:08.000 | again it could just tell you that you collected lots of data which doesn't
00:57:11.880 | tell you that the results are actually of any practical import and so by itself it
00:57:16.120 | does not provide a good measure of evidence
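To make that last point concrete, here is a minimal sketch (not from the lecture) of the two routes described above: simulating slopes fitted to pure noise, and the closed-form p-value that a library like scipy reports. The 0.023 slope and the 100-versus-3,000 sample sizes are the figures mentioned in the lecture; the random normal data and everything else is illustrative, so the exact fractions depend on the scale of the data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def chance_of_slope(n_points, observed_slope, n_trials=2000):
    "Fraction of pure-noise datasets whose fitted slope is at least as extreme as observed_slope."
    hits = 0
    for _ in range(n_trials):
        x = rng.normal(size=n_points)    # fake 'temperature' -- pure noise
        y = rng.normal(size=n_points)    # fake 'R' -- no real relationship to x
        slope = np.polyfit(x, y, 1)[0]   # fit a line, keep the slope
        if abs(slope) >= abs(observed_slope):
            hits += 1
    return hits / n_trials

print(chance_of_slope(100,  0.023))   # small sample: slopes this big show up by chance fairly often
print(chance_of_slope(3000, 0.023))   # bigger sample: chance slopes shrink, so this fraction drops

# The "simple equation" route: linregress reports this kind of p-value directly.
x, y = rng.normal(size=100), rng.normal(size=100)
print(stats.linregress(x, y).pvalue)
```

The same observed slope gets a very different p-value depending on how much data you collected, which is exactly why a p-value on its own mixes up effect size and sample size.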
00:57:23.600 | so Frank Harrell who is somebody whose book I read and it's a really important part of my learning he's a
00:57:28.360 | professor of biostatistics has a number of great articles about this he says
00:57:34.280 | null hypothesis testing and p-values have done significant harm to science
00:57:39.160 | and he wrote another piece called null hypothesis significance testing never
00:57:44.160 | worked so I've shown you what p-values are so that you know why they don't work
00:57:52.320 | not so that you can use them right but they're a super important part of
00:57:56.440 | machine learning because they come up all the time you know when
00:58:01.320 | people say this is how we decide whether your drug worked or whether
00:58:06.000 | there is an epidemiological relationship or whatever and indeed p-values appear
00:58:13.160 | in this paper so in the paper they show the results of a multiple linear
00:58:19.800 | regression and they put three stars next to any relationship which has a p-value
00:58:27.400 | of 0.01 or less so there is something useful to say about a small p-value like
00:58:38.240 | 0.01 or less which is that the thing that we're looking at probably did not
00:58:43.400 | happen by chance right the biggest statistical error people make all the
00:58:48.200 | time is that they see that a p-value is not less than 0.05 and then they make
00:58:54.400 | the erroneous conclusion that no relationship exists right which doesn't
00:59:01.880 | make any sense because like it let's say you only had like three data points then
00:59:06.480 | you almost certainly won't have enough data to have a p-value of less than 0.05
00:59:11.400 | for any hypothesis so like the way to check is to go back and say what if I
00:59:17.520 | picked the exact opposite null hypothesis what if my null hypothesis was
00:59:21.880 | there is a relationship between temperature and R then do I have enough
00:59:26.040 | data to reject that null hypothesis right and if the answer is no then you
00:59:34.820 | just don't have enough data to make any conclusions at all right so in this case
00:59:39.800 | they do have enough data to be confident that there is a relationship between
00:59:46.160 | temperature and R now that's weird because we just looked at the graph and
00:59:52.120 | we did a little bit of a back of the envelope in Excel and we thought
00:59:55.000 | this could well be random so here's where the issue is the graph
01:00:03.500 | shows what we call a univariate relationship a univariate relationship
01:00:07.220 | shows the relationship between one independent variable and one dependent
01:00:11.300 | variable and that's what you can normally show on a graph but in this
01:00:14.880 | case they did a multivariate model in which they looked at temperature and
01:00:19.680 | humidity and GDP per capita and population density and when you put all
01:00:26.680 | of those things into the model then you end up with statistically significant
01:00:30.560 | results for temperature and humidity why does that happen well the reason that
01:00:36.040 | happens is because all this variation in the blue dots is not random there's a
01:00:44.040 | reason they're different right and the reasons include denser cities are going
01:00:49.160 | to have higher transmission for instance and probably more humid will have less
01:00:55.000 | transmission so when you do a multivariate model it actually allows you
01:01:02.360 | to be more confident of your results right but the p-value as noted by the
01:01:11.760 | American Statistical Association does not tell us whether this is of practical
01:01:15.640 | importance the thing that tells us whether this is of practical importance is the
01:01:20.400 | actual slope that's found and so in this case the equation they come up with is
01:01:28.120 | that R equals 3.968 minus 0.038 times temperature minus 0.024 times relative
01:01:37.600 | humidity so is this equation practically important well we can again
01:01:43.320 | do a little back of the envelope here by just putting that into Excel let's say
01:01:52.160 | there was one place that had a temperature of 10 centigrade and a
01:01:55.480 | humidity of 40 then if this equation is correct R would be about 2.7 somewhere
01:02:02.320 | with a temperature of 35 centigrade and a humidity of 80 R would be about 0.8
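Here's that back-of-the-envelope check in Python rather than Excel, using the rounded coefficients as read out in the lecture (the paper reports them to more decimal places, so treat the outputs as approximate):

```python
def estimated_R(temp_c, humidity):
    "Fitted equation quoted in the lecture, with rounded coefficients."
    return 3.968 - 0.038 * temp_c - 0.024 * humidity

print(estimated_R(10, 40))   # cool, dry city  -> roughly 2.6-2.7: exponential spread
print(estimated_R(35, 80))   # hot, humid city -> roughly 0.7-0.8: below 1, so the disease dies out
```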
01:02:08.880 | is this practically important oh my god yes right two different cities with
01:02:15.400 | different climates can be if they're the same in every other way and this model
01:02:19.920 | is correct then one city would have no spread of disease because I was less than
01:02:25.280 | one one would have massive exponential explosion so we can see from this model
01:02:33.120 | that if the modeling is correct then this is a highly practically significant
01:02:38.100 | result so this is how you determine practical significance of your models
01:02:41.960 | it's not with p-values but with looking at kind of actual outcomes so how do you
01:02:49.880 | think about the practical importance of a model and how do you turn a predictive
01:02:57.960 | model into something useful in production so I spent many many years
01:03:03.080 | thinking about this and I actually with some other great folks
01:03:09.640 | created a paper about it called Designing Great Data Products and this
01:03:19.680 | is largely based on 10 years of work I did at a company I founded called optimal
01:03:26.060 | decisions group and optimal decisions group was focused on the question of
01:03:30.940 | helping insurance companies figure out what prices to set and insurance
01:03:36.440 | companies up until that point had focused on predictive modeling
01:03:40.240 | actuaries in particular spent their time trying to figure out how likely is it
01:03:47.320 | that you're going to crash your car and if you do how much damage might you have
01:03:50.920 | and then based on that try to figure out what price they should set for your
01:03:55.320 | policy so for this company what we did was we decided to use a different
01:04:01.160 | approach which I ended up calling the drivetrain approach just described here
01:04:06.280 | to set insurance prices and indeed to do all kinds of other things and so for
01:04:12.780 | the insurance example the objective for an insurance company would
01:04:17.520 | be how do I maximize my let's say five year profit and then what inputs can we
01:04:25.800 | control which I call levers so in this case it would be what
01:04:30.460 | price can I set and then data is data which can tell you as you change your
01:04:37.560 | levers how does that change your objective so if I start increasing my
01:04:41.600 | price to people who are likely to crash their car then we'll get less of them
01:04:46.300 | which means we'll have less costs but at the same time we'll also have less
01:04:50.240 | revenue coming in for example so to link up the kind of the levers to the
01:04:55.720 | objective via the data we collect we build models that described how the
01:04:59.960 | levers influenced the objective and this is all a it seems pretty obvious when
01:05:05.640 | you say it like this but when we started work with optimal decisions in 1999
01:05:11.040 | nobody was doing this in insurance everybody in insurance was simply
01:05:15.640 | doing a predictive model to guess how likely people were to crash their car and
01:05:20.920 | then pricing was set by like adding 20% or whatever it was just done in a very
01:05:27.000 | kind of naive way so what I did is I you know over many years took this basic
01:05:35.040 | process and tried to help lots of companies figure out how to use it to
01:05:39.080 | turn predictive models into actions so the starting point in like actually
01:05:46.800 | getting value in a predictive model is thinking about what is it you're trying
01:05:50.400 | to do and you know what are the sources of value in that thing you're trying to
01:05:53.400 | do the levers what are the things you can change like what's the point of a
01:05:58.280 | predictive model if you can't do anything about it right figuring out
01:06:02.800 | ways to find what data you don't have which one's suitable what's
01:06:06.040 | available then think about what approaches to analytics you can then
01:06:09.160 | take and then super important like well can you actually implement you know
01:06:15.960 | those changes and super super important how do you actually change things as the
01:06:21.360 | environment changes and you know interestingly a lot of these things are
01:06:24.920 | areas where there's not very much academic research there's a little bit
01:06:28.680 | and some of the papers that have been particularly around maintenance of like
01:06:34.440 | how do you decide when your machine learning model is kind of still okay how
01:06:39.560 | do you update it over time have had like many many citations but they
01:06:45.240 | don't pop up very often because a lot of folks are so focused on the math you
01:06:49.760 | know and then there's the whole question of like what constraints are in place
01:06:54.000 | across this whole thing so what you'll find in the book is there is a whole
01:06:58.120 | appendix which actually goes through every one of these six things and has a
01:07:03.800 | whole list of examples so this is an example of how to like think about value
01:07:11.520 | and lots of questions that companies and organizations can use to try and think
01:07:17.800 | about you know all of these different pieces of the actual puzzle of getting
01:07:25.200 | stuff into production and actually into an effective product we have a question
01:07:29.120 | sure just a moment so I say so do check out this appendix because it actually
01:07:33.680 | originally appeared as a blog post and I think except for my COVID-19 posts that
01:07:39.560 | I did with Rachel it's actually the most popular blog post I've ever written it's
01:07:43.880 | at hundreds of thousands of views and it kind of represents like 20 years of hard
01:07:48.560 | won insights about like how you actually get value from machine learning in
01:07:55.120 | practice and what you actually have to ask so please check it out because
01:07:58.320 | hopefully you'll find it helpful so when we think about like think about this for
01:08:03.760 | the question of how should people think about the relationship between seasonality
01:08:08.160 | and transmissibility of COVID-19 you kind of need to dig really deeply into the
01:08:15.720 | questions about like oh not just what are those numbers in
01:08:20.680 | the data but what does it really look like right so one of the things in the
01:08:24.160 | paper that they show is actual maps right of temperature and humidity and ah
01:08:31.360 | right and you can see like not surprisingly that humidity and
01:08:37.680 | temperature in China are what we would call autocorrelated which is to say that
01:08:44.160 | places that are close to each other in this case geographically have similar
01:08:48.080 | temperatures and similar humidities and so like this actually puts into
01:08:54.960 | question a lot of the p-values that they have right because you can't
01:09:01.040 | really think of these as a hundred totally separate cities because the ones
01:09:04.760 | that are close to each other probably have very close behavior so maybe you
01:09:08.080 | should think of them as like a small number of sets of cities you know of
01:09:12.920 | kind of larger geographies so these are the kinds of things that when you look
01:09:18.280 | actually into a model you need to like think about what are the
01:09:23.000 | limitations but then to decide like well what does that mean what do I
01:09:26.880 | do about that you need to think of it from this kind of utility point of
01:09:34.360 | view this kind of end-to-end what are the actions I can take what are the
01:09:39.040 | results point of view not just null hypothesis testing so in this case for
01:09:44.440 | example there are basically four possible key ways this could end up it
01:09:52.040 | could end up that there really is a relationship between temperature and R
01:09:57.480 | or so that's what the right-hand side is or there is no real relationship between
01:10:03.800 | temperature and R and we might act on the assumption that there is a
01:10:09.160 | relationship or we might act on the assumption that there isn't a
01:10:12.720 | relationship and so you kind of want to look at each of these four possibilities
01:10:16.760 | and say like well what would be the economic and societal consequences and
01:10:22.560 | you know there's going to be a huge difference in lives lost and you know
01:10:28.000 | economies crashing and whatever else too you know for each of these four the
01:10:36.180 | paper actually you know has shown if their model is correct what's the likely
01:10:42.000 | R value in March for like every city in the world and the likely R value in July
01:10:48.440 | for every city in the world and so for example if you look at kind of New
01:10:52.880 | England and New York the prediction here is and also the West the very
01:10:57.680 | coast of the West Coast is that in July the disease will stop spreading now you
01:11:04.640 | know if that happens if they're right then that's going to be a
01:11:08.880 | disaster because I think it's very likely in America and also the UK that
01:11:14.320 | people will say oh turns out this disease is not a problem you know it
01:11:19.300 | didn't really take off at all the scientists were wrong people will go
01:11:23.000 | back to their previous day-to-day life and we could see what happened in the 1918
01:11:28.160 | flu virus of like the second go around when winter hits could be much worse
01:11:34.760 | than the start right so like there's these kind of like huge potential
01:11:41.800 | policy impacts depending on whether this is true or false and so to think about
01:11:47.880 | it - yes I also just wanted to say that it would be very
01:11:53.160 | irresponsible to think oh summer's gonna solve it we don't need to act now just
01:11:59.240 | in that this is something growing exponentially and could do a huge huge
01:12:02.840 | amount of damage yeah yeah so it could already has done either way if you
01:12:08.040 | assume that there will be seasonality and that summer will fix things then it
01:12:13.760 | could lead you to be apathetic now if you assume there's no seasonality and
01:12:18.160 | then there is then you could end up kind of creating a larger level of
01:12:24.720 | expectation of destruction than actually happens and end up with your population
01:12:28.720 | being even more apathetic you know so being wrong in any
01:12:33.000 | direction is a problem so one of the ways we tend to deal with this with
01:12:37.800 | this kind of modeling is we try to think about priors so priors are basically
01:12:42.820 | things where we you know rather than just having a null hypothesis we try and
01:12:47.020 | start with a guess as to like well what's what's more likely right so in
01:12:52.080 | this case if memory serves correctly I think we know that like flu viruses
01:12:57.560 | become inactive at 27 centigrade we know that like the cold coronaviruses
01:13:04.640 | are seasonal the 1918 flu epidemic was seasonal in every country and city
01:13:14.880 | that's been studied so far there's been quite a few studies like this they've
01:13:18.120 | always found climate relationships so far so maybe we'd say well our prior belief
01:13:23.640 | is that this thing is probably seasonal and so then we'd say well this
01:13:27.960 | particular paper adds some evidence to that so like it shows like how
01:13:34.800 | incredibly complex it is to use a model in practice for in this case policy
01:13:42.800 | discussions but also for like organizational decisions because you
01:13:47.880 | know there's always complexities there's always uncertainties and so you actually
01:13:52.520 | have to think about the the utilities you know and your best guesses and try to
01:13:57.920 | combine everything together as best as you can okay so with all that said it's
01:14:08.080 | still nice to be able to get our our models up and running even if you know
01:14:14.560 | even just a predictive model is sometimes useful of its own sometimes
01:14:19.300 | it's useful to prototype something and sometimes it's just it's going to be
01:14:23.960 | part of some bigger picture so rather than try to create some huge end-to-end
01:14:28.180 | model here we thought we would just show you how to get your your pytorch fast AI
01:14:36.400 | model up and running in as raw a form as possible so that from there you can kind
01:14:43.180 | of build on top of it as you like so to do that we are going to download and
01:14:51.100 | curate our own data set and you're going to do the same thing you've got to train
01:14:55.200 | your own model on that data set and then you're going to get an application and
01:15:00.360 | then you're going to host it okay now there's lots of ways to create an image
01:15:07.720 | data set you might have some photos on your own computer there might be stuff
01:15:12.080 | at work you can use one of the easiest though is just to download stuff off the
01:15:17.720 | internet there's lots of services for downloading stuff off the internet we're
01:15:22.120 | going to be using Bing image search here because they're super easy to use a lot
01:15:28.080 | of the other kind of easy to use things require breaking the terms of service of
01:15:32.640 | websites so like we're not going to show you how to do that but there's lots of
01:15:38.280 | examples that do show you how to do that so you can check them out as well if you
01:15:42.560 | if you want to Bing image search is actually pretty great at least at the
01:15:46.260 | moment these things change a lot so keep an eye on our website to see if we've
01:15:52.160 | changed our recommendation the biggest problem with Bing image search is that
01:15:57.480 | the sign-up process is a nightmare at least at the moment like one of the
01:16:03.360 | hardest parts of this book is just signing up to their damn API which
01:16:07.720 | requires going through Azure it's called cognitive services Azure cognitive
01:16:11.400 | services so we'll make sure that all that information is on the website for
01:16:15.820 | you to follow through just how to sign up so we're going to start from the
01:16:19.160 | assumption that you've already signed up but you can find it just go Bing Bing
01:16:29.040 | image search API and at the moment they give you seven days with a pretty high
01:16:36.760 | quota for free and then after that you can keep using it as long as you like
01:16:46.240 | but they kind of limit it to like three transactions per second or something
01:16:50.580 | which is still plenty you can still do thousands for free so it's it's at the
01:16:54.920 | moment it's pretty great even for free so what will happen is when you sign up
01:17:02.240 | for Bing image search or any of these kind of services they'll give you an API
01:17:05.840 | key so just replace the xxx here with the API key that they give you okay so
01:17:12.740 | that's now going to be called key in fact let's do it over here okay so you'll put
01:17:21.080 | in your key and then there's a function we've created called search images Bing
01:17:27.800 | which is just a super tiny little function as you can see it's just two
01:17:32.900 | lines of code I was just trying to save a little bit of time which will
01:17:38.960 | take your API key and some search term and return a list of URLs that match
01:17:44.200 | that search term as you can see for using this particular service you have
01:17:52.600 | to install a particular package so we show you how to do that on the site as
01:17:59.320 | well so once you've done so you'll be able to run this and that will return by
01:18:05.500 | default I think 150 URLs okay so fast AI comes with a download URL function so
01:18:13.200 | let's just download one of those images just to check and open it up and so what
01:18:18.760 | I did was I searched for grizzly bear and here I have a grizzly bear
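Roughly what those first few cells look like (a sketch; `search_images_bing` is the small helper that ships with the course notebooks rather than part of fastai itself, and depending on the notebook version you may need to pull the URL field out of each result, e.g. with `results.attrgot('contentUrl')`):

```python
from fastai.vision.all import *
from PIL import Image

key = 'XXX'   # replace with the API key Azure Cognitive Services gives you

# per the lecture: takes your key and a search term, returns matching image URLs
results = search_images_bing(key, 'grizzly bear')

Path('images').mkdir(exist_ok=True)
dest = 'images/grizzly.jpg'
download_url(results[0], dest)   # fastai helper: download one image just to check it worked
im = Image.open(dest)
im.to_thumb(128, 128)            # show a small thumbnail
```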
01:18:24.800 | so then what I did was I said okay let's try and create a model that can recognize
01:18:29.480 | grizzly bears versus black bears versus teddy bears so that way I can find out I
01:18:35.280 | could set up some video recognition system near our campsite when we're out
01:18:40.800 | camping that gives me bear warnings but if it's a teddy bear coming then it
01:18:45.600 | doesn't warn me and wake me up because that would not be scary at all so then I
01:18:50.200 | just go through each of those three bear types create a directory with the name
01:18:55.760 | of grizzly or black or teddy bear search being for that particular search term
01:19:02.640 | along with bear and download and so download images is a fast AI function as
01:19:09.160 | well so after that I can call get image files which is a fast AI function that
01:19:16.040 | will just return recursively all of the image files inside this path and you can
01:19:21.080 | see it's given me bears / black / and then lots of numbers so one of the
01:19:29.480 | things you have to be careful of is that a lot of the stuff you download will
01:19:32.360 | turn out to be like not images at all and will break so you can call verify
01:19:36.800 | images to check that all of these file names are actual images and in this case
01:19:44.180 | I didn't have any failed so this it's empty but if you did have some then you
01:19:50.160 | would call path dot unlink path dot unlink is part of the Python
01:19:56.000 | standard library and it deletes a file and map is something that will call this
01:20:02.120 | function for every element of this collection this is part of a special
01:20:10.080 | fast AI class called L it's basically it's kind of a mix between the Python
01:20:16.160 | standard library list class and a NumPy array class and we'll be learning more
01:20:21.840 | about it later in this course but it basically tries to make it super easy to
01:20:26.040 | do kind of more functional style programming and Python so in this case
01:20:31.720 | it's going to unlink everything that's in the failed list which is probably what
01:20:37.040 | we want because they're all the images that failed to verify alright
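Putting those steps together, roughly as they appear in the notebook (bear types and folder layout as in the lecture; this continues from the key and helper in the sketch above):

```python
bear_types = 'grizzly', 'black', 'teddy'
path = Path('bears')

if not path.exists():
    path.mkdir()
    for o in bear_types:
        dest = path/o
        dest.mkdir(exist_ok=True)                      # one folder per label
        results = search_images_bing(key, f'{o} bear')
        download_images(dest, urls=results)            # fastai downloads them all for us

fns = get_image_files(path)    # recursively collect every image file under bears/
failed = verify_images(fns)    # the downloads that aren't actually valid images
failed.map(Path.unlink)        # delete them: map calls Path.unlink on each element of the L
```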
01:20:42.280 | so we've now got a path that contains a whole bunch of images and they're classified
01:20:48.760 | according to black grizzly or teddy based on what folder they're in and so to
01:20:55.320 | create so we're going to create a model and so to create a model the first thing
01:20:59.920 | we need to do is to tell fast AI what kind of data we have and how it's
01:21:07.120 | structured now in part in lesson one of the course we did that by using what we
01:21:13.960 | call a factory method which is we just said image data loaders dot from_name_func
01:21:20.040 | and it did it all for us those factory methods are fine for beginners but now
01:21:28.040 | we're into lesson two we're not quite beginners anymore so we're going to show
01:21:31.120 | you the super super flexible way to use data in whatever format you like and
01:21:36.040 | it's called the data block API and so the data block API looks like this
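Roughly, the block being described is (a sketch: the 30% validation split and 128-pixel resize follow the lecture, while the seed value itself isn't stated, so 42 here is just an example):

```python
bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock),               # independent variable: an image; dependent: a category
    get_items=get_image_files,                        # how to get the list of items (the image file names)
    splitter=RandomSplitter(valid_pct=0.3, seed=42),  # random 30% validation set, reproducible split
    get_y=parent_label,                               # label each item with the name of its parent folder
    item_tfms=Resize(128))                            # resize every image to a 128x128 square
```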
01:21:46.080 | here's the data block API you tell fast AI what your independent variable is and
01:21:54.040 | what your dependent variable is so what your labels are and what your input data
01:21:57.800 | is so in this case our input data are images and our labels are categories so
01:22:05.560 | the category is going to be either grizzly or black or teddy so that's the
01:22:12.040 | first thing you tell it that that's the block parameter and then you tell it how
01:22:16.160 | do you get a list of all of the in this case file names right and we just saw
01:22:20.760 | how to do that because we just called the function ourselves the function is
01:22:23.820 | called get image files so we tell it what function to use to get that list of
01:22:27.560 | items and then you tell it how do you split the data into a validation set and
01:22:34.280 | a training set and so we're going to use something called a random splitter which
01:22:37.960 | just splits it randomly and we're going to put 30% of it into the validation set
01:22:42.000 | we're also going to set the random seed which ensures that every time we run
01:22:46.280 | this the validation set will be the same and then you say okay how do you label
01:22:51.960 | the data and this is the name of a function called parent label and so
01:22:56.520 | that's going to look for each item at the name of the parent so this this
01:23:03.120 | particular one would become a black bear and this is like the most common way for
01:23:08.960 | image data sets to be represented is that the different image
01:23:13.240 | files get put into folders according to their label and then
01:23:19.200 | finally here we've got something called item transforms we'll be learning a lot
01:23:22.960 | more about transforms in a moment that these are basically functions that get
01:23:26.760 | applied to each image and so each image is going to be resized to 128 by 128
01:23:34.160 | square so we're going to be learning more about data block API soon but
01:23:39.680 | basically the process is going to be it's going to call whatever is get
01:23:42.240 | items which is a list of image files it's then going to call get X get Y
01:23:47.680 | so in this case there's no get X but there is a get Y so it's just parent
01:23:51.240 | label and then it's going to call the create method for each of these two
01:23:55.360 | things it's going to create an image and it's going to create a category it's
01:23:59.080 | then going to call the item transforms which is resize and then the next thing
01:24:04.040 | it does is it puts it into something called a data loader a data loader is
01:24:07.760 | something that grabs a few images at a time I think by default 64 and puts
01:24:13.840 | them all into a single batch it just grabs 64 images and sticks them
01:24:18.760 | all together and the reason it does that is it then puts them all onto the GPU at
01:24:23.320 | once so it can pass them all to the model through the GPU in one go and
01:24:30.360 | that's going to let the GPU go much faster as we'll be learning about and
01:24:35.200 | then finally we don't use any here we can have something called batch
01:24:38.680 | transforms which we will talk about later and then somewhere in the middle
01:24:43.280 | about here conceptually is the splitter which is the thing that splits into the
01:24:48.680 | training set and the validation set so this is a super flexible way to tell
01:24:54.560 | fast AI how to work with your data and so at the end of that it returns an
01:25:03.120 | object of type data loaders that's why we always call these things DL's right so
01:25:08.880 | data loaders has a validation and a training data loader and a data loader as
01:25:15.480 | I just mentioned is something that grabs a batch of a few items at a time and
01:25:19.880 | puts it on the GPU for you so this is basically the entire code of data loaders
01:25:26.920 | so the details don't matter I just wanted to point out that like a lot of
01:25:31.120 | these concepts in fast AI when you actually look at what they are
01:25:34.800 | they're incredibly simple little things it's literally something that you just
01:25:38.680 | pass in a few data loaders to and it stores them in an attribute and
01:25:43.160 | gives you the first one back as dot train and the second one back as dot
01:25:47.000 | valid so we can create our data loaders by first of all creating the data block
01:25:57.680 | and then we call the data loaders passing in our path to create DL's and
01:26:02.400 | then you can call show batch on that you can call show batch on pretty much
01:26:06.360 | anything in fast AI to see your data and look we've got some grizzlies we've got
01:26:10.700 | a teddy we've got a grizzly so you get the idea right
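In code, those two steps are just (continuing the sketch above):

```python
dls = bears.dataloaders(path)   # build the training and validation DataLoaders from the DataBlock
dls.show_batch(max_n=9)         # display a few images with their labels to eyeball the data
```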
01:26:19.880 | I'm going to look at data augmentation next week so I'm going to
01:26:23.360 | skip over data augmentation and let's just jump straight into training your
01:26:27.200 | model so once we've got DL's we can just like in lesson one call CNN learner to
01:26:38.600 | create a resnet we're going to create a smaller resnet this time a resnet 18
01:26:43.080 | again asking for error rate we can then call dot fine-tune again so you see it's
01:26:48.080 | all the same lines of code we've already seen and you can see our error rate goes
01:26:52.800 | down from 9% to 1% so you've got 1% error after training for about 25 seconds
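The training cell is the same shape as in lesson 1 (the number of epochs passed to fine_tune isn't stated here, so 4 is just illustrative):

```python
learn = cnn_learner(dls, resnet18, metrics=error_rate)  # smaller pretrained model this time
learn.fine_tune(4)                                       # transfer learning, exactly as before
```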
01:26:58.960 | so you can see you know we've only got 450 images we've trained for well less
01:27:05.320 | than a minute let's look at the confusion matrix so we can
01:27:09.640 | say I want to create a classification interpretation class I want to look at
01:27:14.840 | the confusion matrix and the confusion matrix as you can see it's something
01:27:19.540 | that says for things that are actually black bears how many are predicted to be
01:27:24.280 | black bears versus grizzly bears versus teddy bears so the diagonal are the ones
01:27:31.280 | that are all correct and so it looks like we've got two errors we've got one
01:27:34.580 | grizzly that was predicted to be black one black that was predicted to be
01:27:37.760 | grizzly super super useful method is plot top losses that'll actually show me
01:27:48.280 | what my errors actually look like so this one here was predicted to be a
01:27:53.420 | grizzly bear but the label was black bear this one was the one that's
01:27:58.000 | predicted to be a black bear and the label was grizzly bear these ones here
01:28:03.440 | are not actually wrong there this is predicted to be black and it's actually
01:28:06.360 | black but the reason they appear in this is because these are the ones that the
01:28:12.160 | model was the least confident about okay
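Those interpretation steps look roughly like this:

```python
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()       # actual labels vs predictions; the diagonal is the correct ones
interp.plot_top_losses(5, nrows=1)   # the images the model got wrong or was least confident about
```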
01:28:18.520 | so we're going to look at image classifier cleaner next week let's focus on how we then get this into production
01:28:24.160 | so to get it into production we need to export the model so what exporting the
01:28:32.680 | model does is it creates a new file which by default is called export dot
01:28:38.200 | pickle which contains the architecture and all of the parameters of the model
01:28:44.160 | so that is now something that you can copy over to a server somewhere and
01:28:50.160 | treat it as a predefined program right so then so the the process of using your
01:28:58.840 | trained model on new data kind of in production is called inference so here
01:29:06.200 | I've created an inference learner by loading that learner back again right and
01:29:11.280 | so obviously it doesn't make sense to do it right after I've saved it
01:29:16.760 | in a notebook but I'm just showing you how it would work right so this is
01:29:20.360 | something that you would do on your server inference and remember that once
01:29:26.320 | you have trained a model you can just treat it as a program you can pass
01:29:30.660 | inputs to it so this is now our program this is our bear predictor so I
01:29:35.800 | can now call predict on it and I can pass it an image and it will tell me
01:29:42.680 | here that it is ninety nine point nine nine nine percent sure that this is a grizzly
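The export-and-inference round trip, roughly as it appears in the notebook (continuing the sketch; the image path passed to predict is just an example):

```python
learn.export()                                # writes export.pkl: the architecture plus the trained parameters
learn_inf = load_learner(path/'export.pkl')   # on your server: load it back as an inference learner
learn_inf.predict('images/grizzly.jpg')       # -> (predicted class, class index, per-class probabilities)
```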
01:29:47.760 | so I think what we're going to do here is we're going to wrap it up
01:29:53.200 | here and next week we'll finish off by creating an actual GUI for our bear
01:30:03.160 | classifier we will show how to run it for free on a service called binder and
01:30:16.000 | yeah and then I think we'll be ready to dive into some of the some of the
01:30:21.560 | details of what's going on behind the scenes any questions or anything else
01:30:26.200 | before we wrap up Rachel now okay great all right thanks everybody so we
01:30:36.320 | hopefully yeah I think from here on we've covered you know most of the key
01:30:44.040 | kind of underlying foundational stuff from a machine learning point of view
01:30:48.240 | that we're going to need to cover so we'll be able to ready to dive into
01:30:54.160 | lower level details of how deep learning works behind the scenes and I think
01:31:01.440 | that'll be starting from next week so see you then