Back to Index

Lesson 2 - Deep Learning for Coders (2020)


Chapters

0:00 Lesson 1 recap
2:10 Classification vs Regression
4:50 Validation data set
6:42 Epoch, metrics, error rate and accuracy
9:07 Overfitting, training, validation and testing data set
12:10 How to choose your training set
15:55 Transfer learning
21:50 Fine tuning
22:23 Why transfer learning works so well
28:26 Vision techniques used for sound
29:30 Using pictures to create fraud detection at Splunk
30:38 Detecting viruses using CNN
31:20 List of most important terms used in this course
31:50 Arthur Samuel’s overall approach to neural networks
32:35 End of Chapter 1 of the Book
40:04 Where to find pretrained models
41:20 The state of deep learning
44:30 Recommendation vs Prediction
45:50 Interpreting Models - P value
57:20 Null Hypothesis Significance Testing
62:48 Turn predictive model into something useful in production
74:06 Practical exercise with Bing Image Search
76:25 Bing Image Sign up
81:38 Data Block API
88:48 Lesson Summary

Transcript

So hello everybody and welcome back to Practical Deep Learning for Coders. This is lesson 2, and in the last lesson we started training our first models. We didn't really have any idea how that training was working, but we were looking at a high level at what was going on: we learned what machine learning is and how it works, we realized that, based on how machine learning works, there are some fundamental limitations on what it can do, and we talked about some of those limitations. We also talked about how, after you've trained a machine learning model, you end up with a program which behaves much like a normal program: something with inputs, a thing in the middle, and outputs.

So today we're gonna finish up talking about talking about that and we're going to then look at how we get those models into production and what some of the issues with doing that might be. I wanted to remind you that there are two sets of books, sorry two sets of notebooks available to you.

One is the fastbook repo, the full notebooks containing all the text of the O'Reilly book, so this lets you see everything that I'm telling you in much more detail. And then as well as that there's the course-v4 repo, which contains exactly the same notebooks but with all the prose stripped away, to help you study.

So that's where you really want to be doing your experimenting and your practice. Maybe as you listen to the video you can switch back and forth between the video and the reading, or do one and then the other, and then put it away, have a look at the course-v4 notebooks and try to remember: okay, what was this section about? Run the code, see what happens, change it, and so forth.

So we were looking at this line of code, where we looked at how we created our data by passing in some information, perhaps most importantly some way to label the data, and we talked about the importance of labelling. In this particular data set, whether an image is a cat or a dog you can tell by whether the first letter of the filename is uppercase or lowercase; that's just how this data set works, and they tell you that in the readme. We also looked particularly at this idea of valid_pct=0.2, and what does that mean? It creates a validation set, and that was something I wanted to talk more about.
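For reference, here is roughly what that line of code looks like; this is a sketch based on the lesson 1 notebook, so the exact path, seed and transforms in your copy may differ slightly:

    from fastai.vision.all import *

    path = untar_data(URLs.PETS)/'images'

    def is_cat(x):
        # in this data set, cat filenames start with an uppercase letter
        return x[0].isupper()

    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(224))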

The first thing I do want to do, though, is point out that this particular labelling function returns something that's either true or false. Actually, as we'll see later, this data set also contains the actual breed, across 37 different cat and dog breeds, so you can also grab that from the file name.
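If you wanted those breed labels instead of just cat versus dog, one hedged sketch of a labelling function, assuming filenames like great_pyrenees_173.jpg, might be the following (the book later does much the same job with fastai's RegexLabeller):

    import re

    def label_breed(fname):
        # e.g. Path('great_pyrenees_173.jpg') -> 'great_pyrenees'
        #      Path('Ragdoll_12.jpg')         -> 'Ragdoll'
        return re.match(r'(.+)_\d+\.jpg$', fname.name).group(1)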

In each of those two cases we're trying to predict a category: is it a cat or a dog, or is it a German Shepherd or a Beagle or a Ragdoll cat or whatever. When you're trying to predict a category, so when the label is a category, we call that a classification model.

On the other hand, you might try to predict how old the animal is, or how tall it is, or something like that, which is a continuous number that could be 13.2 or 26.5 or whatever. Any time you're trying to predict a number, so your label is a number, we call that regression. So those are the two main types of model, classification and regression, and this is very important jargon to know. A regression model attempts to predict one or more numeric quantities, such as temperature or location or whatever. This is a bit confusing, because sometimes people use the word regression as an abbreviation for a particular kind of model called linear regression. That's super confusing, because that's not what regression means; linear regression is just one particular kind of regression. I just wanted to warn you of that: when you start talking about regression, a lot of people will assume you're talking about linear regression, even though that's not what the word means.
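In fastai that distinction mostly shows up in which block you use for the labels. Here is a minimal sketch, assuming the hypothetical label_breed function from above and a hypothetical get_age function that returns a float:

    from fastai.vision.all import *

    # classification: the label is a category (one of 37 breeds)
    pets = DataBlock(
        blocks=(ImageBlock, CategoryBlock),
        get_items=get_image_files, get_y=label_breed,
        splitter=RandomSplitter(valid_pct=0.2, seed=42),
        item_tfms=Resize(224))

    # regression: the label is a continuous number, e.g. the animal's age
    ages = DataBlock(
        blocks=(ImageBlock, RegressionBlock),
        get_items=get_image_files, get_y=get_age,
        splitter=RandomSplitter(valid_pct=0.2, seed=42),
        item_tfms=Resize(224))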

All right, so I wanted to talk about this valid_pct=0.2 thing. As we described, valid_pct grabs, in this case, 20% of the data and puts it aside in a separate bucket, and then when you train your model, your model doesn't get to look at that data at all. That data is only used to show you how accurate your model is. If you train for too long, and/or with not enough data, and/or with a model with too many parameters, after a while the accuracy of your model will actually get worse, and this is called overfitting. So we use the validation set to ensure that we're not overfitting. The next line of code that we looked at is this one, where we created something called a learner. We'll be learning a lot more about that, but a learner is basically something which contains your data and your architecture, that is, the mathematical function that you're optimizing, and so a learner is the thing that tries to figure out what are the parameters which best cause this function to match the labels in this data. We'll be talking a lot more about that, but this particular function, resnet34, is the name of a particular architecture which is just very good for computer vision problems. In fact the name really is ResNet, and the 34 tells you how many layers there are, so you can use ones with bigger numbers here to get more parameters; that will take longer to train, take more memory, and be more likely to overfit, but could also create more complex models. Right now, though, I wanted to focus on this part here, which is metrics=error_rate. This is where you list the functions that you want to be called with your validation data and printed out after each epoch. An epoch is what we call it when you look at every single image in the data set once, and after you've looked at every image in the data set once we print out some information about how you're doing, and the most important thing we print out is the result of calling these metrics. So error_rate is the name of a metric, and it's a function that just prints out what percent of the validation set is being incorrectly classified by your model. A metric is a function that measures the quality of the predictions using the validation set. Error rate is one; another common metric is accuracy, which is just one minus error rate. Very important to remember from last week: we talked about loss. Arthur Samuel had this important idea in machine learning that we need some way to figure out how well our model is doing, so that when we change the parameters we can figure out which set of parameters makes that performance measurement get better or worse. That performance measurement is called the loss. The loss is not necessarily the same as your metric. The reason why is a bit subtle, and we'll be seeing it in a lot of detail once we delve into the math in the coming lessons, but basically you need a loss function where, if you change the parameters by just a little bit up or just a little bit down, you can see whether the loss gets a little bit better or a little bit worse. It turns out that error rate and accuracy don't tell you that at all, because you might change the parameters by such a small amount that none of your dog predictions start becoming cats and none of your cat predictions start becoming dogs; your predictions don't change, so your error rate doesn't change. So loss and metric are closely related, but the metric is the thing that you care about, and the loss is the thing which your computer is using as the measurement of performance to decide how to update your parameters.
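Putting those pieces together, the learner lines from lesson 1 look roughly like this (a sketch; note that a metric is just a function of predictions and targets computed on the validation set, so you can pass your own):

    from fastai.vision.all import *

    learn = cnn_learner(dls, resnet34, metrics=error_rate)

    # accuracy is literally one minus error_rate; a home-made version might look like:
    def my_accuracy(preds, targs):
        return (preds.argmax(dim=-1) == targs).float().mean()

    learn.fine_tune(1)   # one epoch of fine-tuning; more on fine_tune below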
So we measure overfitting by looking at the metrics on the validation set; fastai always uses the validation set to print out your metrics. Overfitting is the key thing that machine learning is about: it's all about how we find a model which fits the data, not just for the data that we're training with, but for data that the training algorithm hasn't seen before. Overfitting results when our model is basically cheating. Our model can cheat by saying, oh, I've seen this exact picture before and I remember that that's a picture of a cat. So it might not have learned what cats look like in general; it just remembers that images one, four and eight are cats and two, three and five are dogs, and learns nothing about what they really look like. That's the kind of cheating that we're trying to avoid: we don't want it to memorize our particular data set. So we split off our validation data (most of the words you're seeing on the screen are from the book, I just copied and pasted them). If we split off our validation data and make sure that our model never sees it during training, then it's completely untainted by it, so we can't possibly cheat. Well, not quite true, we can cheat. The way we could cheat is we could fit a model, look at the result on the validation set, change something a little bit, fit another model, look at the validation set, change something a little bit, and do that a hundred times until we find something where the validation set looks the best, but now we might have fit to the validation set. So if you want to be really rigorous about this, you should actually set aside a third bit of data called the test set, that is not used for training and is not used for your metrics; you don't look at it until the whole project's finished. This is what's used on competition platforms like Kaggle: on Kaggle, after the competition finishes, your performance will be measured against a data set that you have never seen, and that's a really helpful approach. It's actually a great idea to do that even if you're not doing the modeling yourself. So if you're looking at vendors, and you're just trying to decide should I go with IBM or Google or Microsoft, and they're all showing you how great their models are, what you should do is say: okay, you go and build your models, and I am going to hang on to ten percent of my data and I'm not going to let you see it at all, and when you're all finished, come back, and then I'll run your model on the ten percent of data you've never seen. Now, pulling out your validation and test sets is a bit subtle though. Here's an example of a simple little data set, and this comes from a fantastic blog post that Rachel wrote, which we will link to, about creating effective validation sets. You can see you basically have some kind of seasonal data set here. Now if you just say, okay fastai, I want to model that, I want to create my data loader using a valid_pct of 0.2, it would do this: it would randomly delete some of the dots. This isn't very helpful, because we can still cheat: these dots are right in the middle of other dots, and this isn't what would happen in practice. What would happen in practice is, this is sales by date, and we want to predict the sales for next week, not the sales for 14 days ago, 18 days ago and 29 days ago. So what
you actually need to do to create an effective validation set here is not do it randomly but instead chop off the end right and so this is what happens in all Kaggle competitions pretty much that involve time for instance is the thing that you have to predict is the next like two weeks or so after the last data point that they give you and this is what you should do also for your test set so again if you've got vendors that you're looking at you should say to them okay after you're all done modeling we're going to check your model against a data that is one week later than you've ever seen before and you won't be able to retrain or anything because that's what happens in practice right okay there's a question I've heard people describe overfitting as training error being below validation error does this rule of thumb end up being roughly the same as yours okay so that's a great question so I think what they mean there is training loss versus validation loss because we don't print training error so we do print at the end of each epoch the value of your loss function for the training set and the value of the loss function for the validation set and if you train for long enough so if it's training nicely your training loss will go down and your validation loss will go down because by definition loss function is defined such as a lower loss function is a better model if you start overfitting your training loss will keep going down right because like why wouldn't that you know you're getting better and better parameters but your validation loss will start to go up because actually you started fitting to the specific data points in the training set and so it's not going to actually get better it's going to get it's not going to get better for the validation set it'll start to get worse however that does not necessarily mean that you're overfitting or at least not overfitting in a bad way as we'll see it's actually possible to be at a point where the validation loss is getting worse but the validation accuracy or error or metric is still improving so I'm not going to describe how that would happen mathematically yet because we need to learn more about loss functions but we will but for now just realize that the important thing to look at is your metric getting worse not your loss function getting worse thank you for that fantastic question the next important thing we need to learn about is called transfer learning so the next line of code said learn fine tune why does it say learn fine tune fine tune is what we do when we are transfer learning so transfer learning is using a pre trained model for a task that is different to what it was originally trained for so more jargon to understand our jargon let's look at that what's a pre trained model so what happens is remember I told you the architecture we're using is called resnet 34 so when we take that resnet 34 that's just a it's just a mathematical function okay with lots of parameters that we're going to fit using machine learning there's a big data set called image net that contains 1.3 million pictures of a thousand different types of thing whether it be mushrooms or animals or airplanes or hammers or whatever there's a competition there used to be a competition that runs every year to see who could get the best accuracy on the image net competition and the models that did really well people would take those specific values of those parameters and they would make them available on the internet for anybody to download so if you download that you 
don't just have an architecture now you have a trained model you have a model that can recognize a thousand categories of thing in images which probably isn't very useful unless you happen to want something that recognizes those exact thousand categories of thing but it turns out you can rather you can start with those weights in your model and then train some more epochs on your data and you'll end up with a far far more accurate model than you would if you didn't start with that pre-trained model and we'll see why in just a moment right but this idea of transfer learning it's kind of it makes intuitive sense right image net already has some cats and some dogs in it it's you know it can say this is a cat and this is a dog but you want to maybe do something that recognizes lots of breeds that aren't an image net well for it to be able to recognize cats versus dogs versus airplanes versus hammers it has to understand things like what does metal look like what does fur look like what it is look like you know so it can say like oh this breed of animal this breed of dog has pointy ears and oh this thing is metal so it can't be a dog so all these kinds of concepts get implicitly learnt by a pre-trained model so if you start with a pre-trained model then you don't it you don't have to learn all these features from scratch and so transfer learning is the single most important thing for being able to use less data and less compute and get better accuracy so that's a key focus for the fast AI library and a key focus for this course there's a question I'm a bit confused on the differences between loss error and metric last error and metric sure so error is just one kind of metric so there's lots of different possible labels you could have let's say you're trying to create a model which could predict how old a cat or dog is so the metric you might use is on average how many years were you off by so that would be a metric on the other hand if you're trying to predict whether this is a cat or a dog your metric could be what percentage of the time am I wrong so that latter metric is called the error rate okay so error is one particular metric it's a thing that measures how well you're doing and it's like it should be the thing that you most care about so you write a function or use one of fast AI's pre-defined ones which measures how well you're doing loss is the thing that we talked about in lesson one so I'll give a quick summary but go back to lesson one if you don't remember Arthur Samuel talked about how a machine learning model needs some measure of performance which we can look at when we adjust our parameters up or down does that measure of performance get better or worse and as I mentioned earlier some metrics possibly won't change at all if you move the parameters up and down just a little bit so they can't be used for this purpose of adjusting the parameters to find a better measure of performance so quite often we need to use a different function we call this the loss function the loss function is the measure of performance that the algorithm uses to try to make the parameters better and it's something which should kind of track pretty closely to the the metric you care about but it's something which as you change the parameters a bit the loss should always change a bit and so there's a lot of hand waving there because we need to look at some of the math of how that works and we'll be doing that in the next couple of lessons thanks for the great questions okay so fine-tuning is a particular 
transfer learning technique (oh, and you're still showing your picture and not the slides). So fine-tuning is a transfer learning technique where the weights, and that's not quite the right word, we should say the parameters, where the parameters of a pre-trained model are updated by training for additional epochs, using a different task to that used for pre-training. So for pre-training the task might have been ImageNet classification, and then our different task might be recognizing cats versus dogs. The way fastai does fine-tuning by default is that we use one epoch (which, remember, is looking at every image in the data set once) to fit just those parts of the model necessary to get the part that's especially for your data set working, and then we use as many epochs as you ask for to fit the whole model. This is more for those people who might be a bit more advanced; we'll see exactly how this works later on in the lessons. So why does transfer learning work, and why does it work so well? The best way, in my opinion, to look at this is this paper by Zeiler and Fergus, who were ImageNet competition winners in 2013, and interestingly their key insights came from their ability to visualize what's going on inside a model, so visualization very often turns out to be super important to getting great results. What they were able to do was, remember I told you a ResNet-34 has 34 layers, they looked at something called AlexNet, which was the previous winner of the competition and which only had seven layers; at the time that was considered huge. So they took a seven layer model and they said, what does the first layer of parameters look like, and they figured out how to draw a picture of them. The first layer had lots and lots of features, but here are nine of them, and here's what nine of those features look like: one of them was something that could recognize diagonal lines from top left to bottom right, one of them could find diagonal lines from bottom left to top right, one of them could find gradients that went from orange at the top to blue at the bottom, one of them was specifically for finding things that were green, and so forth. So for each of these nine, they're called filters, and they're all features. Then something really interesting they did was, for each one of these filters, each one of these features (and we'll learn mathematically what these actually mean in the coming lessons, but for now let's just recognize them as saying, oh, there's something that looks at diagonal lines and something that looks at gradients), they found, in the actual images in ImageNet, specific examples of parts of photos that match that filter. So for this top left filter, here are nine actual patches of real photos that match that filter, and as you can see they're all diagonal lines, and for the green one here are parts of actual photos that match the green one. So layer one is super, super simple, and one of the interesting things to note here is that something that can recognize gradients and patches of color and lines is likely to be useful for lots of other tasks as well, not just ImageNet, so you can see how something that can do this might also be good at many, many other computer vision tasks. This is layer two. Layer two takes the features of layer one and combines them, so it can not just find edges but can
find corners or repeating curving patterns or semicircles or full circles and so you can see for example here's a it's kind of hard to exactly visualize these layers after layer one you kind of have to show examples of what the filters look like but here you can see examples of parts of photos that these this layer to circular filter has activated on and as you can see it's found things with circles so interestingly this one which is this kind of blotchy gradient seems to be very good at finding sunsets and this repeating vertical pattern is very good at finding like curtains and wheat fields and stuff so the further we get layer three then gets to combine all the kinds of features in layer two and remember we're only seeing so we're only seeing here 12 of the features but actually there's probably hundreds of them I don't remember exactly in Alex Net but there's lots but by the time we get to layer three by combining features from layer two it already has something which is finding text so this is a feature which can find bits of image that contain text it's already got something which can find repeating geometric patterns and you see this is not just like a matching specific pixel patterns this is like a semantic concept it can find repeating circles or repeating squares or repeating hexagons right so it's it's really like computing it's not just matching a template and remember we know that neural networks can solve any possible computable function so it can certainly do that so layer 4 gets to combine all the filters from layer 3 anyway at once and so by layer 4 we have something that can find dog faces for instance so you can kind of see how each layer we get like multiplicatively more sophisticated features and so that's why these deep neural networks can be so incredibly powerful it's also why transfer learning can work so well because like if we wanted something that can find books and I don't think there's a book category in ImageNet well it's actually already got something that can find text as an earlier filter which I guess it must be using to find maybe there's a category for library or something or a bookshelf so when you use transfer learning you can take advantage of all of these pre-learn features to find things that are just combinations of these existing features that's why transfer learning can be done so much more quickly and so much less data than traditional approaches one important thing to realize then is that these techniques for computer vision are not just good at recognizing photos there's all kinds of things you can turn into pictures for example these are example these are sounds that have been turned into pictures by representing their frequencies over time and it turns out that if you convert a sound into these kinds of pictures you can get basically state-of-the-art results at sound detection just by using the exact same ResNet learner that we've already seen I wanted to highlight that it's 945 so if you want to take a break soon a really cool example from I think this is our very first year of running fast AI one of our students created pictures they worked at Splunk in anti-fraud and they created pictures of users moving their mouse and if I remember correctly as they moved their mouse he basically drew a picture of where the mouse moved and the color depended on how fast they moved and these circular blobs is where they clicked the left or the right mouse button and at Splunk they then well he what he did actually for the for the course as a project for 
the course was to try to see whether he could use these pictures, with exactly the same approach we saw in lesson one, to create an anti-fraud model, and it worked so well that Splunk ended up patenting a new product based on this technique. You can actually check it out; there's a blog post about it on the internet where they describe this breakthrough anti-fraud approach, which literally came from one of our really amazing and brilliant and creative students after lesson one of the course. Another cool example of this is looking at different viruses and, again, turning them into pictures. You can see how they've got here, and this is from a paper (check out the book for the citation), examples of a particular virus called VB.AT and examples of another particular virus called Fakerean, and you can see that in each case the pictures all look kind of similar, and that's why, again, they can get state-of-the-art results in virus detection by turning the program signatures into pictures and putting them through image recognition. So in the book you'll find a list of all of the most important terms we've seen so far and what they mean. I'm not going to read through them, but I want you to, please, because these are the terms that we're going to be using from now on and you've got to know what they mean; if you don't, you're going to be really confused, because I'll be talking about labels and architectures and models and parameters, and they have very specific, exact meanings, and I'll be using those exact meanings, so please review this. So, to remind you, this is where we got to: we ended up with Arthur Samuel's overall approach, and we replaced his terms with our terms. We have an architecture which takes the parameters and the data as inputs; the architecture uses the parameters and the inputs to calculate predictions; the predictions are compared to the labels with a loss function; and that loss function is used to update the parameters many, many times, to make them better and better, until the loss gets nice and super low.
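To make that loop concrete, here is a deliberately tiny, self-contained PyTorch sketch of the same idea, fitting just two parameters to some made-up data; fastai's Learner is doing this same predict / compare-to-labels / update-parameters dance for a real neural network:

    import torch

    x = torch.linspace(0, 1, 100)
    y = 3 * x + 0.5 + 0.05 * torch.randn(100)    # made-up data with a known relationship

    params = torch.randn(2, requires_grad=True)  # the parameters we are trying to learn

    def predict(x, p): return p[0] * x + p[1]                    # the "architecture"
    def mse(preds, targs): return ((preds - targs) ** 2).mean()  # the loss function

    lr = 0.5
    for epoch in range(300):
        loss = mse(predict(x, params), y)   # predictions compared to the labels
        loss.backward()                     # how should each parameter change?
        with torch.no_grad():
            params -= lr * params.grad      # update the parameters a little
            params.grad.zero_()

    print(params)   # should end up close to (3.0, 0.5)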
So this is the end of chapter 1 of the book. It's really important to look at the questionnaire, because the questionnaire is where you can check whether you have taken away from this chapter the stuff that we hope you have. So go through it, and for anything that you're not sure about, the answer is in the text: just go back to earlier in the chapter and you will find the answers there. There's also a further research section after each questionnaire. For the first couple of chapters they're actually pretty simple, and hopefully pretty fun and interesting; they're things where, to answer the question, it's not enough to just look in the chapter, you actually have to go and do your own thinking and experimenting and googling and so forth. In later chapters some of these further research things are pretty significant projects that might take a few days or even weeks, so do check them out, because hopefully they'll be a great way to expand your understanding of the material. Something that Sylvain points out in the book is that if you really want to make the most of this, then after each chapter please take the time to experiment with your own project and with the notebooks we provide, and then see if you can redo the notebooks on a new data set. Perhaps for chapter one that might be a bit hard, because we haven't really shown how to change things, but for chapter two, which we're going to start next, you'll absolutely be able to do that. Okay, so let's take a five minute break and we'll come back at 9:55 San Francisco time. Okay, so welcome back everybody, and I think we've got a couple of questions to start with, so Rachel, please take it away. Sure: are filters independent? By that I mean, if filters are pre-trained, might they become less good at detecting features of previous images when fine-tuned? Oh, that is a great question. So, assuming I understand the question correctly, if you start with, say, an ImageNet model, and then you fine-tune it on dogs versus cats for a few epochs, and you get something that's very good at recognizing dogs versus cats, it's going to be much less good as an ImageNet model after that; it's not going to be very good at recognizing airplanes or hammers or whatever. This is called catastrophic forgetting in the literature: the idea that as you see more images about different things to what you saw earlier, you start to forget about the things you saw earlier. So if you want to fine-tune something which is good at a new task but also continues to be good at the previous task, you need to keep putting in examples of the previous task as well. And what are the differences between parameters and hyperparameters? If I am feeding an image of a dog as an input, and then changing the hyperparameter of batch size in the model, what would be an example of a parameter? So the parameters are the things, described in lesson one, that Arthur Samuel described as being the things which change what the model, what the architecture, does. We start with this infinitely flexible function, the thing called a neural network, that can do anything at all, and the way you get it to do one thing versus another thing is by changing its parameters; they are the numbers that you pass into that function. So there's two types of numbers you pass into the function: there's the numbers that represent your input, like the pixels of your dog, and there's the numbers that represent the learnt parameters. In the example of something that's not a neural net, like a checkers-playing program such as Arthur Samuel might have used back in the late 50s and early 60s, those parameters may have been things like: if there is an opportunity to take a piece versus an opportunity to get to the end of the board, how much more value should I give one versus the other, is it twice as important or three times as important? That two versus three would be an example of a parameter. In a neural network, parameters are a much more abstract concept, and so a detailed understanding of what they are will come in the next lesson or two, but it's the same basic idea: they're the numbers which change what the model does, to be something that recognizes malignant tumors, versus cats versus dogs, versus colorizing black and white pictures. Whereas the hyperparameters are the choices about what numbers you pass to the fitting function to decide how that fitting process happens. There's a question: I'm curious about the pacing of this course; I'm concerned that all the material may not be covered. Well, it depends what you mean by all the material; we certainly won't cover everything in the world, so we'll cover what we can in seven lessons. We're certainly not covering the whole book, if that's what you're wondering.
The whole book will be covered in either two or three courses; in the past it's generally been two courses to cover about the amount of stuff in the book, but we'll see how it goes, because the book's pretty big, 500 pages. When you say two courses, do you mean 14 lessons? Yeah, so it'd be 14 or 21 lessons to get through the whole book, although having said that, by the end of this course hopefully there'll be enough momentum and understanding that reading the book independently will be more useful, and you'll have also gained a community of folks on the forums that you can hang out with and ask questions of. So in the second part of this course we're going to be talking about putting stuff into production, and to do that we need to understand the capabilities and limitations of deep learning: what are the kinds of projects that even make sense to try to put in production? One of the key things I should mention, in the book and in this course, is that in the first two or three lessons and chapters there's a lot of stuff which is designed not just for coders but for everybody; there's lots of information about the practical things you need to know to make deep learning work, and one of the things you need to know is, well, what is deep learning actually good at at the moment? So I'll summarize what the book says about this. There are four key areas that we have as applications in fastai: computer vision, text, tabular, and what I've called here recsys, which stands for recommendation systems, and specifically a technique called collaborative filtering, which we briefly saw last week. Sorry, another question: are there any pre-trained weights available other than the ones from ImageNet that we can use, and if yes, when should we use those rather than ImageNet? Oh, that's a really great question. So yes, there are a lot of pre-trained models, and one way to find them (you're currently just showing... switching... okay, great), one great way to find them, is you can look up "model zoo", which is a common name for places that have lots of different models, so here's lots of model zoos, or you can look for pre-trained models. Unfortunately there's not as wide a variety as I would like; most are still on ImageNet or similar kinds of general photos. For example, for medical imaging there's hardly any, so there's a lot of opportunity for people to create domain-specific pre-trained models. It's still an area that's really underdone, because not enough people are working on transfer learning. Okay, so as I was mentioning, we've got these four applications that we've talked about a bit, and deep learning is pretty good at all of those. Tabular data, like spreadsheets and database tables, is an area where deep learning is not always the best choice, but it's particularly good for things involving high cardinality variables; that means variables that have lots and lots of discrete levels, like zip code or product ID or something like that, and deep learning is really pretty great for those. For text, it's pretty great at things like classification and translation; it's actually terrible for conversation, and that's been a huge disappointment for a lot of companies. They tried to create these conversation bots, but actually deep learning isn't good at providing accurate information; it's good at providing things that sound
accurate and sound compelling but it we don't really have great ways yet of actually making sure it's correct one big issue for recommendation systems collaborative filtering is that deep learning is focused on making predictions which don't necessarily actually mean creating useful recommendations we'll see what that means in a moment deep learning is also good at multimodal that means things where you've got multiple different types of data so you might have some tabular data including a text column and an image and some collaborative filtering data and combining that all together is something that deep learning is really good at so for example putting captions on photos is something which deep learning is pretty good at although again it's not very good at being accurate so what you know it might say this is a picture of two birds it's actually a picture of three birds and then this other category there's lots and lots of things that you can do with deep learning by being creative about the use of these kinds of other application-based approaches for example an approach that we developed for natural language processing called ULM fit or you're learning in the course it turns out that it's also fantastic at doing protein analysis if you think of the different proteins as being different words and they're in a sequence which has some kind of state and meaning it turns out that ULM fit works really well for protein analysis so often it's about kind of being being creative so to decide like for the product that you're trying to build is deep learning going to work well for it in the end you kind of just have to try it and see but if you if you do a search you know hopefully you can find examples about the people that have tried something similar even if you can't that doesn't mean it's not going to work so for example I mentioned the collaborative filtering issue where a recommendation and a prediction are not necessarily the same thing you can see this on Amazon for example quite often so I bought a Terry Pratchett book and then Amazon tried for months to get me to buy more Terry Pratchett books now that must be because their predictive model said that people who bought one particular Terry Pratchett book are likely to also buy other Terry Pratchett books but from the point of view of like well is this going to change my buying behavior probably not right like if I liked that book I already know I like that author and I already know that like they probably wrote other things so I'll go and buy it anyway so this would be an example of like Amazon probably not being very smart up here they're actually showing me collaborative filtering predictions rather than actually figuring out how to optimize a recommendation so an optimized recommendation would be something more like your local human bookseller might do where they might say oh you like Terry Pratchett well let me tell you about other kind of comedy fantasy sci-fi writers on the similar vein who you might not have heard about before so the difference between recommendations and predictions is super important so I wanted to talk about a really important issue around interpreting models and for a case study for this I thought we let's pick something that's actually super important right now which is a model in this paper one of the things we're going to try and do in this course is learn how to read papers so here is a paper which you I would love for everybody to read called high temperature and high humidity reduce the transmission of 
COVID-19 now this is a very important issue because if the claim of this paper is true and that would mean that this is going to be a seasonal disease and if this is a seasonal disease and it's going to have massive policy implications so let's try and find out how this was modeled and understand how to interpret this model so this is a key picture from the paper and what they've done here is they've taken a hundred cities in China and they've plotted the temperature on one axis in Celsius and are on the other axis where R is a measure of transmissibility it says for each person that has this disease how many people on average will they infect so if R is under one then the disease will not spread is if R is higher than like two it's going to spread incredibly quickly and basically R is going to you know any high R is going to create an exponential transmission impact and you can see in this case they have plotted a best fit line through here and then they've made a claim that there's some particular relationship in terms of a formula that R is 1.99 minus 0.023 times temperature so a very obvious concern I would have looking at this picture is that this might just be random maybe there's no relationship at all but just if you picked a hundred cities at random perhaps they would sometimes show this level of relationship so one simple way to kind of see that would be to actually do it in a spreadsheet so here's here is a spreadsheet where what I did was I kind of eyeballed this data and I guessed about what is the mean degrees centigrade I think it's about five and what's about the standard deviation of centigrade I think it's probably about five as well and then I did the same thing for R I think the mean R looks like it's about 1.9 to me and it looks like the standard deviation of R is probably about 0.5 so what I then did was I just jumped over here and I created a random normal value so a random value from a normal distribution from a normal distribution so a bell curve with that particular mean and standard deviation of temperature and that particular mean and standard deviation of R and so this would be an example of a city that might be in this data set of a hundred cities something with nine degrees Celsius and an R of 1.1 so that would be nine degrees Celsius and an R of 1.1 so something about here and so then I just copied that formula down 100 times so here are a hundred cities that could be in China right where this is assuming that there is no relationship between temperature and R right they're just random numbers and so each time I recalculate that so if I hit ctrl equals it will just recalculate it right I get different numbers okay because they're random and so you can see at the top here I've then got the average of all of the temperatures and the average of all of the R's and the average of all the temperatures varies and the average of all of R's varies as well so then I what I did was I copied those random numbers over here let's actually do it so I'll go copy these 100 random numbers and paste them here here here here and so now I've got one two three four five six I've got six kind of groups of 100 cities right and so let's stop those from randomly changing anymore by just fixing them in stone there okay so now that I've paste them in I've got six examples of what a hundred cities might look like if there was no relationship at all between temperature and R and I've got their main temperature and R in each of those six examples and what I've done is you can see here at 
least for the first one, I've plotted it, and you can see in this case there's actually a slight positive slope. I've calculated the slope for each, just by using the SLOPE function in Microsoft Excel, and you can see that in this particular random run, five times it's been negative, and sometimes it's even more negative than their 0.023. So it's kind of matching our intuition here, which is that the slope of the line that we have here is something that can happen totally by chance; it doesn't seem to be indicating any kind of real relationship at all. If we wanted that slope to make us more confident, we would need to look at more cities. So here I've got 3,000 randomly generated numbers, and you can see here the slope is 0.00002, almost exactly zero, which is what we'd expect when there's actually no relationship between C and R, and in this case there isn't, they're all random. So if we look at lots and lots of randomly generated cities, then we can say, oh yeah, there's no slope; but when you only look at a hundred, as we did here, you're going to see relationships, totally coincidentally, very often. So that's something we need to be able to measure, and one way to measure it is with something called a p-value. Here's how a p-value works. We start out with something called a null hypothesis, and the null hypothesis is basically our starting point assumption. So our starting point assumption might be: oh, there's no relationship between temperature and R. Then we gather some data. (Have you explained what R is? I have, yes; R is the transmissibility of the virus.) So then we gather data of independent and dependent variables. The independent variable is the thing that we think might cause the dependent variable, so here the independent variable would be temperature and the dependent variable would be R. So here we've gathered data, there's the data that was gathered in this example, and then we say: what percentage of the time would we see this amount of relationship, which is a slope of minus 0.023, by chance? As we've seen, one way to do that is what we would call a simulation, which is generating a hundred pairs of random numbers a bunch of times and seeing how often you see this relationship. We don't actually have to do it that way though; there's a simple equation we can use to jump straight to this number, which is what percent of the time we would see that relationship by chance, and this is basically what that looks like. We have the most likely observation, which in this case, if there is no relationship between temperature and R, would be a slope of zero, and sometimes you get positive slopes by chance, and sometimes you get pretty small slopes, and sometimes you get large negative slopes by chance, and the larger the number, the less likely it is to happen, whether on the positive side or the negative side. In our case, our question was how often we are going to get less than negative 0.023, so it would actually be somewhere down here; I actually copied this picture from Wikipedia, where they were looking for positive numbers, so they've colored in the area above a number. So this is the p-value, and we don't care about the math, but there's a simple little equation you can use to directly figure out this number, the p-value, from the data.
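If you would rather do that simulation in Python than in a spreadsheet, a minimal sketch looks like this; the means and standard deviations are the same eyeballed guesses as in the Excel version, so treat the numbers as illustrative only:

    import numpy as np
    from scipy.stats import linregress

    rng = np.random.default_rng(42)
    n_cities, n_trials = 100, 10_000

    slopes = []
    for _ in range(n_trials):
        temp = rng.normal(5, 5, n_cities)      # temperature: no real relationship to R here
        R    = rng.normal(1.9, 0.5, n_cities)  # transmissibility, drawn independently
        slopes.append(linregress(temp, R).slope)

    slopes = np.array(slopes)
    # how often does pure chance give a slope at least as negative as -0.023?
    print((slopes <= -0.023).mean())
    # linregress also reports a p-value for a single sample directly: linregress(temp, R).pvalue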
This is how nearly all medical research results tend to be shown, and folks really focus on this idea of p-values, and indeed in this particular study, as we'll see in a moment, they reported p-values. So probably a lot of you have seen p-values in your previous lives; they come up in a lot of different domains. Here's the thing: they are terrible, and you almost always shouldn't be using them. Don't just trust me, trust the American Statistical Association. They point out six things about p-values, and those include that p-values do not measure the probability that the hypothesis is true, or the probability that the data were produced by random chance alone. We know this because we just saw that if we use more data, so if we sample 3,000 random cities rather than a hundred, we get a much smaller value. So p-values don't just tell you about how big a relationship is; they actually tell you about a combination of that and how much data you collected. So they don't measure the probability that the hypothesis is true, and therefore conclusions and policy decisions should not be based on whether a p-value passes some threshold. A p-value also does not measure the importance of a result, because again it could just be telling you that you collected lots of data, which doesn't tell you that the result is of any practical importance, and so by itself it does not provide a good measure of evidence. Frank Harrell, who is a professor of biostatistics, and whose book was a really important part of my learning, has a number of great articles about this. He says null hypothesis testing and p-values have done significant harm to science, and he wrote another piece called "Null Hypothesis Significance Testing Never Worked". So I've shown you what p-values are so that you know why they don't work, not so that you can use them, but they're a super important part of machine learning because they come up all the time when people are saying this is how we decide whether your drug worked, or whether there is an epidemiological relationship, or whatever. And indeed p-values appear in this paper: in the paper they show the results of a multiple linear regression, and they put three stars next to any relationship which has a p-value of 0.01 or less. So there is something useful to say about a small p-value like 0.01 or less, which is that the thing that we're looking at probably did not happen by chance. The biggest statistical error people make all the time is that they see that a p-value is not less than 0.05 and then they make the erroneous conclusion that no relationship exists, which doesn't make any sense, because, let's say you only had three data points, then you almost certainly won't have enough data to get a p-value of less than 0.05 for any hypothesis. So the way to check is to go back and say: what if I picked the exact opposite null hypothesis? What if my null hypothesis was that there is a relationship between temperature and R, then do I have enough data to reject that null hypothesis? If the answer is no, then you just don't have enough data to make any conclusions at all. In this case they do have enough data to be confident that there is a relationship between temperature and R. Now that's weird, because we just looked at the graph and we did a little back-of-the-envelope in Excel and we thought this could well be random. So here's where the issue is: the graph shows what we call a univariate relationship.
A univariate relationship shows the relationship between one independent variable and one dependent variable, and that's what you can normally show on a graph. But in this case they did a multivariate model, in which they looked at temperature and humidity and GDP per capita and population density, and when you put all of those things into the model, then you end up with statistically significant results for temperature and humidity. Why does that happen? Well, the reason is that all this variation in the blue dots is not random; there's a reason they're different, and the reasons include that denser cities are going to have higher transmission, for instance, and probably more humid cities will have less transmission. So when you do a multivariate model, it actually allows you to be more confident of your results. But the p-value, as noted by the American Statistical Association, does not tell us whether this is of practical importance. The thing that tells us whether it's of practical importance is the actual slope that's found, and in this case the equation they come up with is that R equals 3.968 minus 0.038 times temperature minus 0.024 times relative humidity. Is this equation practically important? Well, we can again do a little back of the envelope by putting it into Excel. Let's say there was one place that had a temperature of 10 centigrade and a humidity of 40: then, if this equation is correct, R would be about 2.7. Somewhere with a temperature of 35 centigrade and a humidity of 80, R would be about 0.8. So is this practically important? Oh my god, yes. Take two cities with different climates: if they're the same in every other way, and this model is correct, then one city would have no spread of the disease, because R is less than one, and the other would have a massive exponential explosion. So we can see from this model that, if the modelling is correct, then this is a highly practically significant result. This is how you determine the practical significance of your models: not with p-values, but by looking at actual outcomes.
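Here is the same back-of-the-envelope check in Python; the coefficients are the rounded values quoted above, so the exact outputs shift a little with rounding:

    # rough practical-significance check using the paper's multivariate equation
    def predicted_R(temp_c, rel_humidity):
        return 3.968 - 0.038 * temp_c - 0.024 * rel_humidity

    print(predicted_R(10, 40))   # a cool, dry city: well above 1, explosive spread
    print(predicted_R(35, 80))   # a hot, humid city: below 1, the disease dies out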
So how do you think about the practical importance of a model, and how do you turn a predictive model into something useful in production? I spent many, many years thinking about this, and with some other great folks I actually created a paper about it, Designing Great Data Products, and this is largely based on 10 years of work I did at a company I founded called Optimal Decisions Group. Optimal Decisions Group was focused on helping insurance companies figure out what prices to set, and insurance companies up until that point had focused on predictive modelling: actuaries in particular spent their time trying to figure out how likely it is that you're going to crash your car, and if you do, how much damage you might have, and then, based on that, trying to figure out what price they should set for your policy. For this company we decided to use a different approach, which I ended up calling the drivetrain approach, described here, to set insurance prices and indeed to do all kinds of other things. For the insurance example, the objective for an insurance company would be: how do I maximize my, let's say, five year profit? Then, what inputs can we control, which I call levers? In this case it would be: what price can I set? And then data is whatever can tell you, as you change your levers, how that changes your objective. So if I start increasing my price for people who are likely to crash their car, then we'll get fewer of them, which means we'll have less cost, but at the same time we'll also have less revenue coming in, for example. So to link up the levers to the objective, via the data we collect, we build models that describe how the levers influence the objective. This all seems pretty obvious when you say it like this, but when we started work at Optimal Decisions in 1999, nobody was doing this in insurance; everybody in insurance was simply building a predictive model to guess how likely people were to crash their car, and then pricing was set by adding 20% or whatever, in a very naive way. So what I did over many years was take this basic process and try to help lots of companies figure out how to use it to turn predictive models into actions. The starting point in actually getting value from a predictive model is thinking about what it is you're trying to do and what the sources of value are in that thing you're trying to do; then the levers, the things you can change, because what's the point of a predictive model if you can't do anything about it; then figuring out what data you don't have, what's suitable, what's available; then thinking about what approaches to analytics you can take; then, super important, whether you can actually implement those changes; and, super super important, how you actually change things as the environment changes. Interestingly, a lot of these are areas where there's not very much academic research. There's a little bit, and some of the papers, particularly around maintenance, like how you decide when your machine learning model is still okay and how you update it over time, have had many, many citations, but they don't pop up very often, because a lot of folks are so focused on the math. And then there's the whole question of what constraints are in place across this whole thing. So what you'll find in the book is a whole appendix which actually goes through every one of these six things and has a whole list of examples. This is an example of how to think about value, and lots of questions that companies and organizations can use to try to think about all of these different pieces of the actual puzzle of getting stuff into production and into an effective product. (We have a question. Sure, just a moment.) So do check out this appendix, because it actually originally appeared as a blog post, and I think, except for my COVID-19 posts that I did with Rachel, it's the most popular blog post I've ever written; it's had hundreds of thousands of views, and it represents 20 years of hard-won insights about how you actually get value from machine learning in practice and what you actually have to ask, so please check it out, because hopefully you'll find it helpful. So when we think about this for the question of how people should think about the relationship between seasonality and transmissibility of COVID-19, you need to dig really deeply into the questions about not just what those numbers in the data are, but what it really looks like. One of the things the paper shows is actual maps of temperature and humidity and R, and you can see, not surprisingly, that humidity and temperature in China are what we would call autocorrelated, which is to
And that actually calls into question a lot of the p-values they have, because you can't really think of these as a hundred totally separate cities: the ones that are close to each other probably behave very similarly, so maybe you should think of them as a smaller number of sets of cities, of larger geographies. These are the kinds of things that, when you actually look into a model, you need to think about: what are the limitations? But then, to decide what that means and what to do about it, you need to think of it from the utility point of view, the end-to-end "what are the actions I can take, what are the results" point of view, not just null hypothesis testing. In this case, for example, there are basically four possible ways this could end up. Either there really is a relationship between temperature and R (that's what the right-hand side shows), or there is no real relationship; and we might act on the assumption that there is a relationship, or act on the assumption that there isn't. So you want to look at each of these four possibilities and ask what the economic and societal consequences would be, because there's going to be a huge difference in lives lost, in economies crashing, and so on, for each of the four. The paper actually shows, if their model is correct, the likely R value in March and the likely R value in July for every city in the world. For example, if you look at New England and New York, and also the coastal part of the West Coast, the prediction is that in July the disease will stop spreading. Now if that happens, if they're right, I think it could be a disaster, because it's very likely in America, and also the UK, that people will say "oh, it turns out this disease is not a problem, it didn't really take off at all, the scientists were wrong", and people will go back to their previous day-to-day life. And we saw what happened with the 1918 flu: the second go-around, when winter hit, was much worse than the start. So there are huge potential policy impacts depending on whether this is true or false. (Rachel:) Yes, and I also just wanted to say that it would be very irresponsible to think "oh, summer's going to solve it, we don't need to act now", given that this is something growing exponentially that could do a huge, huge amount of damage. (Jeremy:) Yeah, either way. If you assume that there will be seasonality and that summer will fix things, it could lead you to be apathetic now. If you assume there's no seasonality and then there is, you could end up creating a larger expectation of destruction than actually happens, and end up with your population being even more apathetic later. So being wrong in either direction is a problem.
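To make that four-possibilities framing concrete, here is a minimal sketch of the kind of table you would reason over; every number in it, including the prior, is a hypothetical placeholder rather than anything from the lesson or the paper:

```python
# Hypothetical sketch of the four-outcome framing: is the seasonal relationship
# real, and do we act as if it is? The utilities and the prior are placeholder
# numbers purely for illustration; a real analysis would estimate the economic
# and societal cost of each cell.
utilities = {
    # (relationship is real, we act as if it is real): relative utility
    (True,  True):  -1,
    (True,  False): -5,
    (False, True):  -20,
    (False, False): -1,
}
p_real = 0.7  # prior belief that the relationship is real (placeholder)

for act_as_if_real in (True, False):
    eu = (p_real * utilities[(True, act_as_if_real)]
          + (1 - p_real) * utilities[(False, act_as_if_real)])
    print(f"act as if seasonal = {act_as_if_real}: expected utility {eu:.1f}")
```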
So one of the ways we tend to deal with this kind of modelling is to think about priors. Priors are basically where, rather than just having a null hypothesis, we try to start with a guess as to what's more likely. In this case, if memory serves correctly, we know that flu viruses become inactive at 27 centigrade; we know that the cold coronaviruses are seasonal; the 1918 flu epidemic was seasonal in every country and city that's been studied; and there have been quite a few studies like this, which so far have always found climate relationships. So maybe we'd say our prior belief is that this thing is probably seasonal, and then we'd say this particular paper adds some evidence to that. It shows how incredibly complex it is to use a model in practice, in this case for policy discussions but also for organizational decisions, because there are always complexities and uncertainties, and so you actually have to think about the utilities and your best guesses and try to combine everything together as best as you can.

Okay, so with all that said, it's still nice to be able to get our models up and running. Even just a predictive model is sometimes useful on its own, sometimes it's useful to prototype something, and sometimes it's going to be part of some bigger picture. So rather than try to create some huge end-to-end model here, we thought we would just show you how to get your PyTorch / fastai model up and running in as raw a form as possible, so that from there you can build on top of it as you like. To do that, we're going to download and curate our own data set (and you're going to do the same thing), we're going to train a model on that data set, then we're going to build an application, and then we're going to host it. Now there are lots of ways to create an image data set. You might have some photos on your own computer, or there might be stuff at work you can use, but one of the easiest ways is just to download images off the internet. There are lots of services for that; we're going to be using Bing Image Search here because it's super easy to use. A lot of the other easy-to-use options require breaking the terms of service of websites, so we're not going to show you how to do that, but there are lots of examples out there that do, so you can check them out if you want. Bing Image Search is actually pretty great, at least at the moment; these things change a lot, so keep an eye on our website to see if we've changed our recommendation. The biggest problem with Bing Image Search is that the sign-up process is a nightmare, at least at the moment. One of the hardest parts of this book is just signing up to their damn API, which requires going through Azure; it's called Azure Cognitive Services. We'll make sure all the information on how to sign up is on the website for you to follow, so we're going to start from the assumption that you've already signed up, but you can find it by just searching for "Bing image search API". At the moment they give you seven days with a pretty high quota for free, and after that you can keep using it as long as you like, but they limit it to something like three transactions per second, which is still plenty; you can still do thousands of searches for free, so at the moment it's pretty great even for free. When you sign up for Bing Image Search, or any of these kinds of services, they'll give you an API key; just replace the XXX here with the API key they give you, and that's what's going to be called key.
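That notebook cell is just one assignment. Reading the key from an environment variable, as sketched here, is one way to avoid pasting it straight into the notebook; the variable name AZURE_SEARCH_KEY is just an example:

```python
import os

# Use your own Bing Image Search (Azure Cognitive Services) key here.
# Reading it from an environment variable keeps the key out of the notebook;
# the variable name is only an example.
key = os.environ.get('AZURE_SEARCH_KEY', 'XXX')
```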
In fact, let's do it over here. You put in your key, and then there's a function we've created called search_images_bing, which is just a super tiny little function, as you can see, just two lines of code to save a little bit of time. It takes your API key and a search term and returns a list of URLs that match that search term. To use this particular service you have to install a particular package, and we show you how to do that on the site as well; once you've done so, you'll be able to run this, and it will return by default, I think, 150 URLs. fastai comes with a download_url function, so let's just download one of those images to check and open it up. I searched for "grizzly bear", and here I have a grizzly bear. So then I said, okay, let's try to create a model that can recognize grizzly bears versus black bears versus teddy bears; that way I could set up a video recognition system near our campsite when we're out camping that gives me bear warnings, but if it's a teddy bear coming it doesn't warn me and wake me up, because that would not be scary at all. So I just go through each of those three bear types, create a directory named grizzly, black, or teddy, search Bing for that particular term along with "bear", and download the results; download_images is a fastai function as well. After that I can call get_image_files, which is a fastai function that recursively returns all of the image files inside this path, and you can see it's given me bears/black/ and then lots of numbers. One of the things you have to be careful of is that a lot of the stuff you download will turn out not to be images at all and will break things, so you can call verify_images to check that all of these file names are actual images. In this case I didn't have any failures, so the result is empty, but if you did have some, you would call Path.unlink on them. unlink is part of the Python standard library and deletes a file, and map is something that calls a function on every element of a collection; it's part of a special fastai class called L, which is basically a mix between the Python standard library list class and a NumPy array, and which we'll be learning more about later in this course. It basically tries to make it super easy to do more functional-style programming in Python. So in this case it's going to unlink everything in the failed list, which is probably what we want, because those are the images that failed to verify.
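Put together, the download-and-clean step described here looks roughly like this. It assumes key is set as above and that search_images_bing (the two-line helper from the course notebooks) is available; the contentUrl attribute on the search results is how the Bing service exposes the image URLs:

```python
from fastai.vision.all import *

bear_types = 'grizzly', 'black', 'teddy'
path = Path('bears')

if not path.exists():
    path.mkdir()
    for o in bear_types:
        dest = path/o
        dest.mkdir(exist_ok=True)
        # search_images_bing is the tiny helper from the course notebooks
        results = search_images_bing(key, f'{o} bear')
        download_images(dest, urls=results.attrgot('contentUrl'))

fns = get_image_files(path)   # recursively collect every image file under path
failed = verify_images(fns)   # an L of files that can't actually be opened as images
failed.map(Path.unlink)       # delete each broken download
```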
All right, so we've now got a path containing a whole bunch of images, classified as black, grizzly, or teddy based on which folder they're in, and now we're going to create a model. To create a model, the first thing we need to do is tell fastai what kind of data we have and how it's structured. In lesson 1 of the course we did that by using what we call a factory method: we just said ImageDataLoaders.from_name_func and it did it all for us. Those factory methods are fine for beginners, but now we're into lesson 2 we're not quite beginners anymore, so we're going to show you the super flexible way to use data in whatever format you like, and it's called the data block API. The data block API looks like this. You tell fastai what your independent variable is and what your dependent variable is, so what your input data is and what your labels are; in this case our input data are images and our labels are categories, where the category is going to be either grizzly, black, or teddy. That's the first thing you tell it: the blocks parameter. Then you tell it how to get a list of all of the items, in this case file names; we just saw how to do that, because we just called the function ourselves, and the function is called get_image_files, so we tell it what function to use to get that list of items. Then you tell it how to split the data into a validation set and a training set. We're going to use something called a RandomSplitter, which just splits it randomly, and we're going to put 30% of it into the validation set; we're also going to set the random seed, which ensures that every time we run this the validation set will be the same. Then you say how to label the data, and this is the name of a function called parent_label, which, for each item, looks at the name of the parent folder; so this particular one would become a black bear. This is the most common way for image data sets to be represented: the files get put into folders according to their label. And finally we've got something called item transforms. We'll be learning a lot more about transforms in a moment, but these are basically functions that get applied to each item, so here each image is going to be resized to a 128 by 128 square. We'll be learning more about the data block API soon, but basically the process is: it's going to call whatever is in get_items, which gives a list of image files; it's then going to call get_x and get_y (in this case there's no get_x, but there is a get_y, which is just parent_label); then it's going to call the create method for each of the two blocks, creating an image and a category; it's then going to apply the item transforms, which here is the resize; and then it puts everything into something called a DataLoader. A DataLoader is something that grabs a few images at a time, by default 64, and sticks them together into a single batch, and the reason it does that is so it can put them all onto the GPU at once and pass them all to the model in one go, which lets the GPU go much faster, as we'll be learning about. Then finally, although we don't use any here, we can have something called batch transforms, which we'll talk about later; and somewhere in the middle, conceptually about here, is the splitter, which is the thing that splits the data into the training set and the validation set. So this is a super flexible way to tell fastai how to work with your data, and at the end of it it returns an object of type DataLoaders; that's why we always call these things dls. DataLoaders has a validation and a training DataLoader, and a DataLoader, as I just mentioned, is something that grabs a batch of a few items at a time and puts it on the GPU for you. This is basically the entire code of DataLoaders; the details don't matter, I just wanted to point out that a lot of these concepts in fastai, when you actually look at what they are, are incredibly simple little things. It's literally something that you pass a few data loaders to; it stores them in an attribute and gives you the first one back as .train and the second one back as .valid.
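Assembled from the pieces described here, the bear DataBlock looks roughly like this; the 30% validation split and the 128-pixel resize follow the numbers mentioned in the lesson, and the seed value is just an example:

```python
from fastai.vision.all import *

bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock),               # inputs are images, labels are categories
    get_items=get_image_files,                        # how to get the list of items (file names)
    splitter=RandomSplitter(valid_pct=0.3, seed=42),  # 30% validation set, reproducible split
    get_y=parent_label,                               # label each image by its parent folder name
    item_tfms=Resize(128))                            # resize every image to a 128x128 square

dls = bears.dataloaders(path)   # turn the template into actual train/valid DataLoaders
dls.show_batch(max_n=9)         # eyeball a few labelled images
```

And the DataLoaders object itself really is tiny; conceptually it's along these lines (a simplified sketch, not the exact library source):

```python
class DataLoadersSketch:
    "Simplified sketch: store some data loaders, expose the first two as train/valid."
    def __init__(self, *loaders): self.loaders = loaders
    @property
    def train(self): return self.loaders[0]
    @property
    def valid(self): return self.loaders[1]
```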
So we can create our DataLoaders by first creating the DataBlock and then calling dataloaders on it, passing in our path, to create dls. Then you can call show_batch on that (you can call show_batch on pretty much anything in fastai to see your data), and look: we've got some grizzlies, we've got a teddy, we've got a grizzly, so you get the idea. I'm going to look at data augmentation next week, so let's skip over that and jump straight into training your model. Once we've got dls, we can, just like in lesson 1, call cnn_learner to create a resnet. We're going to create a smaller resnet this time, a resnet18, again asking for error_rate, and we can then call fine_tune again, so it's all the same lines of code we've already seen. You can see our error rate goes down from 9 to 1, so we've got about 1% error after training for about 25 seconds, and we've only got about 450 images; we've trained for well under a minute. Let's look at the confusion matrix: we create a ClassificationInterpretation object and ask for the confusion matrix, which says, for things that are actually black bears, how many were predicted to be black bears versus grizzly bears versus teddy bears. The diagonal holds the ones that are correct, and it looks like we've got two errors: one grizzly that was predicted to be black, and one black that was predicted to be grizzly. A super useful method is plot_top_losses, which shows me what my errors actually look like. This one here was predicted to be a grizzly bear but the label was black bear; this one was predicted to be a black bear and the label was grizzly bear. These ones here are not actually wrong (this one is predicted to be black and it is black); the reason they appear is that they're the ones the model was least confident about. We're going to look at ImageClassifierCleaner next week, so let's focus on how we then get this into production. To get it into production, we need to export the model. What exporting the model does is create a new file, by default called export.pkl, which contains the architecture and all of the parameters of the model, so that's now something you can copy over to a server somewhere and treat as a predefined program. The process of using your trained model on new data in production is called inference, so here I've created an inference learner by loading that learner back again. Obviously it doesn't make much sense to do it right after saving it in the same notebook; I'm just showing you how it would work, and this is something you would do on your server for inference. Remember that once you have trained a model, you can just treat it as a program and pass inputs to it. So this is now our program, our bear predictor: I can call predict on it, pass it an image, and it tells me it is 99.999% sure that this is a grizzly.
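For reference, the steps just described (training, inspecting the mistakes, exporting, and running inference) look roughly like this; the epoch count and the image path passed to predict are just example values, and newer fastai versions rename cnn_learner to vision_learner:

```python
from fastai.vision.all import *

# Train a small pretrained resnet on the bears DataLoaders
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)                      # epoch count here is just an example

# Inspect the mistakes
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
interp.plot_top_losses(5, nrows=1)      # highest-loss (wrong or least confident) images

# Export architecture + parameters for use in production
learn.export()                          # writes 'export.pkl' by default

# ...later, e.g. on a server, load it back and run inference
learn_inf = load_learner(path/'export.pkl')
pred, pred_idx, probs = learn_inf.predict('grizzly_test.jpg')  # example image path
```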
So I think we're going to wrap it up here. Next week we'll finish off by creating an actual GUI for our bear classifier, and we'll show how to run it for free on a service called Binder, and then I think we'll be ready to dive into some of the details of what's going on behind the scenes. Any questions or anything else before we wrap up, Rachel? No? Okay, great. All right, thanks everybody. I think from here on we've covered most of the key foundational stuff, from a machine learning point of view, that we're going to need, so we'll be ready to dive into the lower-level details of how deep learning works behind the scenes, starting from next week. See you then.