Lesson 1: Deep Learning 2018

Hi everybody, welcome to practical deep learning for coders. This is part one of our two-part course I'm Presenting this from the Data Institute in San Francisco We'll be doing seven lessons in this part of the course Most of them will be about a couple of hours long this first one may be a little bit shorter Practical deep learning for coders is all about getting you up and running with deep learning in practice Getting world-class results and it's a really coding focused approach as the name suggests but we're not going to dumb it down by the end of the course you all have learned all of the Theory and details that are necessary to rebuild all of the world-class results.

We're learning about from scratch Now I should mention that our videos are hosted on YouTube But we strongly recommend watching them via our website at course dot fast AI Although they're exactly the same videos the important thing about watching them through our website Is that you'll get all of the information you need about kind of updates to libraries by all locations?

Further information frequently asked questions and so forth So if you're currently on YouTube watching this why don't you switch over to course dot fast at AI now and start watching through there? And make sure you read all of the material on the page before you start just to make sure that you've got everything you need The other thing to mention is that there is a really great strong community at forums dot fast AI From time to time you'll find that you get stuck You may get stuck very early on you may not get stuck for quite a while, but at some point you might get stuck with understanding Why something works the way it does or there may be some computer problem that you have or so forth On forums dot fast at AI there are thousands of other learners talking about every lesson and lots of other topics besides It's the most active deep learning community on the internet by far.

So Definitely register there and start getting involved. You'll get a lot more out of this course if you do that So we're going to start by doing some coding. This is an approach We're going to be talking about in a moment called the top-down approach to study But let's learn it by doing it.

So let's go ahead and try and actually train a neural network Now in order to train a neural network, you almost certainly want a GPU GPU is a graphics processing a graphics processing unit It's the things that companies use to help you play games better They let your computer render the game much more quickly than your CPU can We'll be talking about them more shortly.

But for now, I'm going to show you how you can get access to a GPU Specifically you're going to need an Nvidia GPU because only Nvidia GPUs support something called CUDA CUDA is the language and framework that nearly all deep-learning libraries and practitioners use to do their work Obviously, it's not ideal that we're stuck with one particular vendors cards and over time We hope to see more competition in this space.

But for now, we do need an Nvidia GPU Your laptop almost certainly doesn't have one unless you specifically went out of your way to buy like a gaming laptop So almost certainly you will need to rent one The good news is that renting access? Paying by the second for a GPU based computer is pretty easy and pretty cheap I'm going to show you a couple of options The first option I'll show you which is Probably the easiest is called cressel if you go to cressel.com and Click on sign up or if you've been there before sign in You will find yourself at this screen which has a big button that says start Jupiter and another switch called enable GPU So if we make sure that is set to true enable GPU is on and we click start Jupiter and We click start Jupiter It's going to launch us into something called Jupiter notebook Jupiter notebook in a recent survey of tens of thousands of data scientists was rated as the third most important tool In the data scientist toolbox.

It's really important that you get to learn it well and all of our courses will be run through Jupiter Yes, Rachel. You have a question or comment? Oh, I just wanted to point out that you get I believe 10 free hours So if you wanted to try cressel out Yeah, I he might have changed that recently to less hours, but you can check the fact or the pricing But you certainly get some free hours The pricing varies because this is actually runs on top of Amazon web services.

So at the moment, it's 60 cents an hour The nice thing is though that you can always turn it turn it on You know start your Jupiter without the CP without the GPU running and pay you a tenth of that price, which is pretty cool So Jupiter notebook is something we'll be doing all of this course in and so to get started here we're going to find our particular course, so we'd go to courses and We'd go to fast AI 2 and There they are Things have been moving around a little bit.

So it may be in a different spot for you When you look at this and we'll make sure all the information current information is on the website Now having said that that's you know, the cressel approach is you know, as you can see, it's basically instant and and easy But if you've got you know an extra hour or so to get going an even better option is Something called paper space Paper space unlike cressel doesn't run on top of Amazon.

They have their own machines and If I click on so here's here's paper space and so if I click on new machine I Can pick which one of their three data centers to use so pick the plot one closest to you. So I'll say West Coast and Then I'll say Linux and I'll say you bun to 16 And then it says choose machine and you can see there's various different machines I can choose from And pay by the hour So this is pretty cool for 40 cents an hour.

So it's cheaper than cressel I get a machine that's actually going to be much faster than cressel 60 cent now machine or for 65 cents an hour Way way way faster, right? So I'm going to actually show you how to get started with with the with the paper space approach Because that actually is going to do everything from scratch You may find if you try to do the 65 cents an hour one that it may require you to contact paper space to say Like why do you want it?

That's just an anti fraud thing. So if you say faster AI there then They'll quickly get you up and running. So I'm going to use the cheapest one here 40 cents an hour You can pick how much storage you want and Note that you pay for a month of storage as soon as you start the machine up Right, so don't start and stop lots of machines because each time you pay for that month of storage I think the 250 gig seven dollar a month option is pretty good But you really need 50 gig.

So if you're trying to minimize the price you can go there The only other thing you need to do is turn on public IP so that we can actually log into this and We can turn off auto snapshot to save the money of not having backups All right, so if you then click on create your paper space about a minute later you will find That your machine will pop up.

Here is my Ubuntu 1604 machine If you check your email You will find that they have emailed you a password so you can copy that And You can go to your machine and enter your password now to paste the password You would press ctrl shift V or on Mac.

I guess Apple shift V So it's slightly different to normal pasting or of course you can just type it in And here we are now we can make a little bit more room here by clicking on these little arrows I Can zoom in a little bit? And so as you can see we've got like a terminal that's sitting inside Our browser which is kind of quite a handy way to do it So now we need to configure this for the course and so the way you configure it for the course is you type?

curl HTTP colon slash slash files dot fast dot AI slash setup slash paper space pipe Bash Okay, and so that's then going to run a script which is going to set up all of the CUDA drivers the special Python Reaper pipe Python distribution we use called anaconda all of the libraries all of the courses And the data we use for the first part of the course Okay, so that takes an hour or so and when it's finished running you'll need to reboot your computer So to reboot not your own computer But your paper space computer and so to do that you can just click on this little circular restart machine button Okay, and when it comes back up you'll be ready to go.

So what you'll find Is that you've now got an anaconda 3 directory. That's where your Python is You've got a data directory which contains the data for the first part of this course first lesson, which is that dogs and cats? And you've got a fast AI directory and That contains everything for this course so what you should do is CD fast AI and from time to time you should go git pull and that will just make sure that all of your Fast AI stuff is up to date and also from time to time You might want to just check that your Python libraries up to date and so you can type Conda and update to do that Alright, so make sure that you've CD'd into fast AI and then you can type Jupiter notebook All right, there it is So we now have a Jupiter notebook serving it running and we want to connect that right and so you can see here It says copy paste this URL Into your browser when you connect so if you double click on it Then that will actually That will actually copy it for you Then you can go and paste it, but you need to change this local host To be the paper space IP address, so if you click on a little arrows to go smaller You can see the IP address is here so I'll just copy that and paste it Where it used to say local host okay?

So it's now HTTP and then my IP and then everything else I copied before and so there it is So this is the fast AI Get repo and our courses are all in courses and in there the deep learning part one is DL one and In there you will find Lesson one dot IPI and be I Python notebook So here we are ready to go Depending whether you're using Gressel or paper space or something else if you check courses to fast at AI We'll keep putting additional videos and links to information about how to set up other You know good Jupyter notebook providers as well So to run a cell in Jupyter notebook You select the cell and you hold down shift and press enter or if you've got the toolbar showing You can just click on the little run button, so you'll notice that some cells contain Code and some contain text and some contain pictures and some contain videos so this environment basically has You know it's it's a way that we can give you access to a way to run Experiments and to kind of tell you what's going on show pictures This is why it's like a super popular tool in data science the data science is kind of all about running experiments really So let's go ahead and click run And you'll see that cell turn into a star the one turn into a star for a moment, and then it finished running Okay, so let's try the next one this time instead of using the toolbar.

I'm going to hold down shift and press enter And you can see again It turned into a star and then it said to so if I'd hold down shift and keep pressing enter it just keeps running each Cell right so I can put anything I like for example one plus one is two so What we're going to do is we're going to?

Yes, Rachel. Oh, this is just a side note, but I wanted to point out that we're using Python 3 here Yes, thank you, Python 3 and so you get some errors if you're still using Python 2. Mm-hmm. Yeah And it is important to switch to Python 3 you know now well for fast AI you require it But you know increasingly a lot of libraries are removing support for Python 2 Thanks Rachel Now it mentions here that you can download the data set for this lesson from this location if you're using Cressel or The paper space script that we just used to set up and this will already be made available for you Okay, if you're not you'll need to W get it as soon now Cressel is Quite a bit slower than paper space and also it There are some particular things it doesn't support that we really need and so there there are a couple of extra steps if you're using Cressel you have to run two more cells right so you can see these are commented out They've got hashes at the start So if you remove the hashes from these and run these two additional cells that just runs the stuff that the stuff that you only Need for Cressel I'm using paper space, so I'm not going to run it okay, so Inside our Data so we set up this path to data dogs cats That's pre set up for you and so inside there.

You can see here. I can use an exclamation mark to Basically say I don't want to run Python, but I want to run bash I don't want to run shell so this runs a bash command and the bit inside the curly brackets Actually refers however to a Python variable so it inserts that Python variable into the bash command So here is the contents of our folder There's a training set and a validation set if you're not familiar with the idea of training sets and validation sets It would be a very good idea to check out our practical machine learning course Which tells you a lot about this kind of stuff of like that the basics of how to set up and run machine learning projects more generally Would you recommend that people take that course before this one?

Actually a lot of students who would you know as they went through these who said they look they've liked doing them together So you can kind of check it out and and see the machine learning course Yeah, they cover some similar stuff but all in different directions so people have done both since you know say they find it They each support each other.

I wouldn't say it's prerequisite But you know if I do if I say something like hey This is a training set and this is a validation set and you're going I don't know what that means At least Google it do a quick read you know because we're assuming That you know the very basics of kind of what machine learning is and does to some extent And I have a whole blog post on this topic as well Okay, and we'll make sure that you link to that from course.fast.ai And I also just wanted to say in general with fast.ai our philosophy is to Kind of learn things on an as-needed basis.

Yeah exactly don't try and learn everything that you think you might need first Otherwise you'll never get around to learning the stuff you actually want to learn Exactly and that shows up in deep learning. I think particularly a lot yes Okay, so in our validation folder There's a cats folder and a dogs folder and then inside the validation cats folder is a whole bunch of JPEGs The reason that it's set up like this is that this is kind of the most common standard approach for how?

image classification data sets are shared and provided and the idea is that each folder Tells you the label so there's each of these Images is labeled cats and each of the images in the dogs folder is labeled dogs. Okay? This is how Keras works as well for example So this is a pretty standard way to share image classification files So we can have a look So if you go plot.im show We can see an example of the first of the cats If you haven't seen This before this is a Python 3.6 format string so you can Google for that if you haven't seen it It's a very convenient way to do string formatting, and we use it a lot So there's our cat, but we're going to mainly be interested in the underlying data that makes up that cat so specifically It's an image whose shape that is the dimensions of the array is 198 by 179 by 3 So it's a three-dimensional array also called a rank 3 tensor And here are the first four rows and four columns of that image so as you can see each of those cells has three Items in it, and this is the red green and blue pixel values between 0 and 255 So here's a little subset of what a picture actually looks like inside your computer so that's that that's will be our idea is to take these kinds of numbers and Use them to predict whether those kinds of numbers represent a cat Or a dog based on looking at lots of pictures of cats and dogs so that's a pretty hard thing to do and at the point in time when this This data set actually comes from a Kaggle competition the dogs versus cats Kaggle competition and when it was released in I think it was 2012 The state-of-the-art was 80% accuracy so computers weren't really able to at all accurately recognize dogs versus cats So let's go ahead and train a model So Here are the three lines of code necessary to train a model And so let's go ahead and run it so I'll click on this on the cell.

I'll press shift enter and Then we'll wait a couple of seconds for it to pop up and there it goes Okay, and it's training and So I've asked it to do three epochs so that means it's going to look at every image Three times in total or look at the entire set of images three times That's what we mean by an epoch and as we do it's going to print out The accuracy is this last of the three numbers that prints out on the validation set, okay?

The first two numbers will talk about later In short they're the value of the loss function which is in this case the cross entropy loss For the training set and the validation set and then right at the start here is the epoch number So you can see it's getting about 90 percent accuracy And it took 17 seconds so you can see we've come a long way since 2012 and in fact even in the competition This actually would have won the Kaggle competition of that time the best in the Kaggle competition was 98.9 And we're getting about 99% so this may surprise you that we're getting a You know Kaggle winning as of 20 end of 2012 early 2013 Kaggle winning image classifier in 17 seconds but and three lines of code And I think that's because like a lot of people assume that deep learning takes a huge amount of time And lots of resources and lots of data and as you'll learn in this course That in general isn't true One of the ways we've made it much simpler is that this code is written on top of a library we built imaginatively called fast AI the fast AI library is basically a library which takes all of the Best practices approaches that we can find and so each time a paper comes out.

You know we that looks interesting We test it out if it works well for a variety of data sets and we can figure out how to tune it we implement it in fast AI and so fast AI kind of curates all this stuff and packages up for you and Much of the time or most the time kind of automatically figures out the best way to handle things So the fast AI library is why we were able to do this in just three lines of code And the reason that we were able to make the fast AI library work So well is because it in turn sits on top of something called pytorch which is a Really flexible deep learning and machine learning and GPU computation library written by Facebook Most people are more familiar with TensorFlow than pytorch because Google markets that pretty heavily But most of the top researchers I know nowadays at least the ones that aren't at Google have switched across to pytorch Yes, Rachel, and we'll be covering some pytorch later in the course.

Yeah, it's I mean one of the things that Hopefully you're really like about fast AI is that it's really flexible that you can use all these kind of curated best practices as Much as little as you want and so it's really easy to hook in at any point and write your own Data augmentation write your own loss function write your own network architecture, whatever and so we'll do all of those things in this course So what does this model look like?

well, what we can do is we can Take a look at so what are the what is the the validation set? Dependent variable the Y look like and it's just a bunch of zeros and ones, right? So the zeros if we look at data dot classes the zeros represent cats the ones represent dogs You'll see here.

There's basically two objects. I'm working with one is an object called data Which contains the validation and training data and another one is the object called learn which contains the model, right? So anytime you want to find something out about the data we can look inside data So we want to get predictions for a validation set and so to do that we can call learn dot predict and So you can see here the first ten predictions and what it's giving you is prediction for dog and a prediction for cat now the way pytorch generally works and therefore fast AI also works is that most models return the log Of the predictions rather than the probabilities themselves.

We'll learn why that is later in the course So for now recognize that to get your probabilities you have to get e to the power of You'll see here. We're using numpy NP is numpy if you're not familiar with numpy That is one of the things that we assume that you have some familiarity with So be sure to check out the material on course dot fast at AI to learn the basics of numpy it's the way that Python handles all of the Fast numerical programming array computation that kind of thing Okay, so we can get the probabilities using that using NP dot X There's a few functions here that you can look at yourself if you're interested, but just some plotting functions that we'll use And so we can now plot some random correct Images and so here are some images that it was correct about okay, and so remember one is a dog So anything greater than 0.5 is dog and 0 is a cat so this is what 10 to the negative 5 obviously a cat Here are some which are incorrect Right so you can see that some of these which it thinks are incorrect obviously are just the you know images.

It shouldn't be there at all But clearly this one which it called a a dog is not at all a dog so there are some obvious mistakes We can also take a look at Which cats is it the most confident are cats which dogs are the most dog like the most confident dogs Perhaps more interestingly we can also see which cats is it the most confident are actually dogs so which ones it is at the most wrong about and Same thing for the ones the dogs that it really thinks are cats and again some of these are just Pretty weird.

I guess there is a dog in there. Yes, Rachel I just say do you want to say more about why you would want to look at your data? Yeah, sure So yeah, so finally I just mentioned the last one we've got here is to see which ones have the probability closest to 0.5 So these are the ones that the the model knows it doesn't really know what to do with and some of these it's not surprising So yeah, I mean this is kind of like Always the first thing I do after I build a model is to try to find a way to like visualize what it's built Because if I want to make the model better Then I need to take advantage of the things it's doing well and fix the things it's doing badly.

So in this case And often this is the case. I've learned something about the data set itself Which is that there are some things that are in here that probably shouldn't be But I've also like it's also clear that this Model has room to improve like to me. That's pretty obviously a Dog, but one thing I'm suspicious about here is this image is very kind of fat and short and As we all learn The way these algorithms work is it kind of grabs a square piece at a time?

So this rather makes me suspicious that we're going to need to use something called data augmentation That will learn about learn about later to handle this properly Okay, so That's it right we've now built We've now built an image classifier and something that you should try now is to grab some data yourself some pictures of Two or more different types of thing put them in different folders and run the same three lines of code On them, okay, and you'll find that it will work for that as well as long as that they are pictures of things like the kinds of things that people normally take photos of right, so if they're microscope microscope pictures or pathology pictures or CT scans or something this won't work very well as we'll learn about later There are some other things we didn't need to do to make that work, but for things that look like normal photos These you can run exactly the same three lines of code and just point your path variable somewhere else To get your own image classifier so for example one student Took those three lines of code downloaded for Google images Ten examples of pictures of people playing cricket ten examples of people playing baseball and build a classifier Of those images which was nearly perfectly correct the same student actually also tried downloading seven pictures of Canadian currency seven pictures of American currency and again in that case the model was a hundred percent Accurate so you can just go to Google images if you like and download a few things of a few different classes and see See what works and tell us on the forum both your successes and your failures So what we just did was to Train a neural network, but we didn't first of all tell you what a neural network is or what training means or anything Why is that?

Well, this is the start of our top-down approach to learning And basically the idea is that unlike the way math and technical subjects are usually taught where you learn every little element piece by piece and you don't actually get to put them all together and Build your own image classifier until third year of graduate school.

Our approach is to say from the start Hey, let's show you how to train an image classifier and now you can start doing stuff And then gradually we dig deeper and deeper and deeper and so the idea is that Throughout the course you're going to see like new problems that we want to solve So for example in the next lesson, we'll look at well What if we're not looking at normal kinds of photos, but we're looking at satellite images And we'll see why it is that this approach that we're learning today doesn't quite work as well And what things do we have to change and so we'll learn enough about the theory to understand why that happens And then we'll learn about the libraries and how we can change change things with the libraries to make that work better And So during the course we're gradually going to learn to solve more and more problems as we do So we'll need to learn more and more parts of the library more and more bits of the theory until by the end We're actually going to learn how to create a world-class neural net architecture from scratch and our own training loop from scratch and so we're actually build everything ourselves So that's the general Approach.

Yes, Rachel and we sometimes also call this the whole game Which is inspired by Harvard professor David Perkins Yeah And so the idea with the whole game is like this is more like how you would learn baseball or music With baseball you would get taken to a ball game.

You would learn what baseball is You would start playing it and it would only be years later that you might learn about the physics of how curveball works For example or with music we put an instrument in your hand and you start Banging the drum or hitting the xylophone and it's not until years later that you learn about the circle of fifths and understand How to construct a cadence for example So yeah, so that's this is kind of the approach we're using it's very inspired by David Perkins and other writers of education So what that does mean is to take advantage of this as we peel back the layers We want you to keep like looking under the hood yourself as well like experiment a lot because this is a very code driven Approach so here's basically what happens right?

We start out looking today at convolutional neural networks for images and then in a couple of lessons We'll start to look at how to use neural nets to look at structured data and then to look at language data and then to look at recommendation system data And then we kind of then take all of those steps and we go backwards through them in reverse order So now you know by the end of that fourth piece you will know By the end of lesson four how to create a world-class image classifier a world-class Structured data analysis program world-class language classifier world-class recommendation system And then we're going to go back over all of them again and learn in depth about like well What exactly did it do and how did it work?

And how do we change things around and use it in different situations for for the recommendation systems structured data? Images and then finally back to language. So that's how it's going to work So what that kind of means is that most students find that they tend to watch the videos two or three times?

but not like Watch lesson one two or three times and lesson two two or three times and listen three three times But like they do the whole thing into end lessons one through seven and then go back and start lesson one again That's an approach which a lot of people find when they want to kind of go back and understand all the details That up that can work pretty well, so I would say you know aim to get through to the end of lesson seven You know as as quickly as you can rather than aiming to fully understand every detail from the start So basically the plan is that in today's lesson you learn In as few lines as code as possible with as few details as possible How do you actually build an image classifier with deep learning to do this to in this case say?

Hey, here are some pictures of dogs as opposed to pictures of cats Then we're going to learn How to look at different kinds of images and particularly we're going to look at images of from satellites I'm going to say for a satellite image What kinds of things might you be seeing in that image and there could be multiple things that we're looking at so a multi-label?

location problem From there, we'll move to something which is perhaps the most widely applicable for the most people Which is looking at what we call structured data so data about data that kind of comes from Databases or spreadsheets, so we're going to specifically look at this data set of predicting sales The number of things that are sold at different stores on different dates Based on different holidays and and so on and so forth and so we're going to be doing this sales forecasting exercise After that we're going to look at language, and we're going to figure out What this person?

thinks about the movie zombie Geddon And we'll be able to figure out how to create just like we create image classifiers for any kind of image We'll learn to create in NLP classifiers to classify any kind of language in lots of different ways Then we'll look at something called collaborative filtering which is used mainly for recommendation systems We're going to be looking at this data set that showed for different people for different movies.

What rating did they give it? Here are some of the movies and so This is maybe an easier way to think about it Is there are lots of different users and lots of different movies and then for each one we can look up for each user How much they like that movie and the goal will be of course to predict for user movie combinations?

We haven't seen before are they likely to enjoy that movie or not and that's the really common approach used for like Deciding what stuff to put on your home page when somebody's visiting You know what book might they want to read or what film might they want to see or so forth?

From there we're going to then dig back into language a bit more and we're going to look at Actually, we're going to look at the writings of Nietzsche the philosopher and learn how to create our own Nietzsche philosophy from scratch character by character So this here perhaps that every life of values of blood of intercourse when it senses there is unscrupulous his very rights and still impulse Love is not actually Nietzsche That's actually like some character by character generated text that we built with this recurrent neural network And then finally we're going to loop all the way back to computer vision again We're going to learn how not just to recognize cats from dogs But to actually find like where the cat is with this kind of heat map And we're also going to learn how to write our own architectures from scratch um, so this is an example of a resnet which is the kind of network that we Are using in today's lesson for computer vision?

And so we'll actually end up building the network and the training loop from scratch And so they're basically the the steps that we're going to be taking from here and at each step. We're going to be getting into Increasing amounts of detail about how to actually do these things yourself So we've actually heard back from our students of past courses about what they found and one of the things that we've heard a lot of students say is that they spend too much time on theory and research And not enough time running the code And even after we tell people about this warning where they still come to the end of the course and often say I wish I had taken more Seriously that advice which is to keep running code So these are actual quotes from our forum in retrospect I should have spent the majority of my time on the actual code and the notebooks See what goes in see what comes out now This idea that you can create World-class models in a code first approach learning what you need as you go It's very different to a lot of the advice you'll read out there such as this person on the forum Hacker News who claimed that the best way to become an ML engineer is to Learn all of math learn C and C++ learn parallel programming learn ML Algorithms implement them yourself using plain C and finally start doing ML So we would say if you want to become an effective practitioner do exactly the opposite of this Yes, Rachel.

Oh, yeah, I'm just highlighting that this is We think this is bad advice and this can be very discouraging for a lot of people to come across. Yeah it's it's it's you know, we now have thousands or tens of thousands of people that have done this course and have Lots and lots of examples of people who are now running research labs or Google brain residents or you know Have created patents based on deep learning and so forth who have done it by doing this course So the top-down approach works super well Now one thing to mention is like we've we've now already learned how you can actually train a world-class image classifier in 17 seconds, I should mention by the way the first time you run that code there are two things it has to do that take more than 17 seconds one is that it downloads a Pre-trained model from the internet.

So you'll see the first time you run it. It'll say downloading model So that takes a minute or two also The first time you run it it pre computes and caches Some of the intermediate information that it needs and that takes about a minute and a half as well So if the first time you run it it takes three or four minutes To download and pre-compute stuff.

That's normal if you run it again, you should find it takes 20 seconds or so so Image classifiers, you know, you may not feel like you need to recognize cats versus dogs very often on a computer You can probably do it yourself pretty well But what's interestingly interesting is that these image classification algorithms are really useful for lots and lots of things For example AlphaGo which became which beat the go world champion the way it worked was to use something At its heart that looked almost exactly like our dogs versus cats image classification algorithm It looked at thousands and thousands of go boards And for each one there was a label saying whether that go board ended up being the winning or the losing player and so it learnt Basically an image classification that was able to look at a go board and figure out whether it was a good go board or a bad Go board and that's really the key most important Step in playing go.

Well is to know which which move is better Another example is one of our earlier students who actually Got a couple of patterns for this work looked at anti-fraud He had lots of examples of his customers mouse movements because they they provided kind of these User tracking software to help avoid fraud and so he took the the mouse paths basically of the users on his customers websites Turned them into pictures of where the mouse moved and how quickly it moved And then built a image classifier that took those images As input and as output it was was that a fraudulent transaction or not?

And turned out to get you know really great results for his company so image classifiers Are like much more flexible than you might imagine? so So this is how you know some of the ways you can use deep learning specifically for image recognition and It's worth understanding that deep learning is not You know just a word that means the same thing as machine learning Like what is it that we're actually doing here when we're doing deep learning?

Instead deep learning is a kind of machine learning So machine learning was invented by this guy Arthur Samuels who was pretty amazing in the late 50s He got this IBM mainframe to play checkers better than he can and the way he did it was he invented machine learning he got the Mainframe to play against itself Lots of times and figure out which kinds of things led to victories and which kinds of things didn't And used that to kind of almost write its own program And after Samuels actually said in 1962 that he thought that one day the vast majority of computer software Would be written using this machine learning approach rather than written by hand by writing the loops and so forth by hand So I guess that hasn't happened yet, but it seems to be in the process of happening I think one of the reasons it didn't happen for a long time is because traditional machine learning actually was very difficult and very Knowledge and time intensive so for example here's something called the computational pathologist or CPath From guy called Andy Beck Andy Beck back when he was at Stanford He's now moved on to Somewhere on the East Coast Harvard, I think And what he did was he took these pathology slides of breast cancer biopsies, right and he worked with lots of pathologists to come up with ideas about what kinds of Patterns or features might be associated with sort of long-term survival versus Dining quickly basically and so he came up with these ideas like well They came up with these ideas like relationship between epithelial nuclear neighbors relationship between epithelial and stromal objects and so forth and so they came up with all of these ideas of features these are just a few of the hundreds that they thought of and then lots of smart computer programmers wrote specialist algorithms to to calculate all these different features and then those those Features were passed into a logistic regression To predict survival and it ended up working very well It had ended up that the survival predictions were more accurate than pathologists own survival predictions were and so machine learning can work really well, but the point here is that this was a An approach that took lots of domain experts and computer experts Many years of work to actually to build this thing, right?

so We really want something something better and so specifically I'm going to show you something which rather than being a very specific function with all this very domain specific feature engineering we're going to try and create an infinitely flexible function a function that could solve any problem Right it would solve any problem if only you set the parameters of that function correctly And so then we need some all-purpose way of setting the parameters of that function And we would need that to be fast and scalable Right now if we had something that had these three things Then you wouldn't need to do this Incredibly time and domain knowledge intensive approach anymore instead we can learn all of those things with this with this algorithm So as you might have guessed The algorithm in question which has these three properties is called deep learning Or if not an algorithm, then maybe we would call it a class of algorithms Let's look at each of these three things in turn So the underlying function that deep learning uses is something called the neural network Now the neural network we're going to learn all about it and implemented ourselves from scratch later on in the course But for now all you need to know about it is that it consists of a number of simple linear layers interspersed with a number of simple nonlinear layers And when you interspersed these layers in this way You get something called the universal approximation theorem and the universal approximation theorem says that this kind of function Can solve any given problem?

To arbitrarily close accuracy as long as you add enough parameters So it's actually provably shown to be an infinitely flexible function Right. So now we need some way to fit the parameters so that this infinitely flexible neural network solves some specific problem and so the way we do that is using a technique that probably most of you will have come across before at some stage called gradient descent and with gradient descent we basically say Okay, well for the different parameters we have How how good are they at solving my problem and let's figure out a slightly better set of parameters?

And a slightly better set of parameters and basically follow down The the surface of the loss function downwards. It's kind of like a marble going down to find the minimum and As you can see here depending on where you start you end up in different places These things are called local minima now interestingly it turns out that for neural networks particular in particular There aren't actually multiple different Local minima, there's basically just there's basically just one right or to think of it another way There are different parts of the space which are all equally good so Gradient descent therefore turns out to be actually an excellent way to Solve this problem of fitting parameters to neural networks The problem is though that we need to do it in a reasonable amount of time and It's really only thanks to GPUs that that's become possible So GPUs this shows over the last few years How many gigaflops per second can you get out of a?

GPU that's the red and green versus a CPU. That's the blue right and this is on a log scale So you can see that generally speaking the GPUs are about 10 times faster than the CPUs and What's really interesting is that nowadays not only is the Titan X about 10 times faster than the e5 2699 CPU but the Titan X Well actually better one to look at would be the GTX 1080i GPU costs about 700 bucks Whereas the CPU which is 10 times slower costs over $4,000 So GPUs turn out to be able to solve these Neural network parameter fitting problems incredibly quickly And also incredibly cheaply so they've been absolutely key in bringing these three pieces together Then there's one more piece Which is I mentioned that these neural networks you can intersperse multiple sets of linear and then nonlinear layers In the particular example that's drawn here there's actually only one what we call hidden layer one layer in the middle and Something that we learned in the last few years is that these kinds of neural networks although they do Support the universal approximation theorem they can solve any given problem arbitrarily closely They require an exponentially increasing number of parameters to do so So they don't actually solve the fast and scalable for even reasonable size problems But we've since discovered that if you create at multiple hidden layers Then you get super linear scaling so you can add a few more hidden layers to get multiplicatively more accuracy to multiplicatively more complex problems and That is where it becomes called deep learning.

So deep learning means a neural network with multiple hidden layers So when you put all this together, there's actually really amazing what happens Google started investing in deep learning in 2012 they Actually hired Jeffrey Hinton who's kind of the father of deep learning and his top student Alex Kudzewski And they started trying to build a team that team became known as Google brain and because Things with these three properties are so incredibly powerful and so incredibly flexible you can actually see over time How many projects at Google use deep learning?

My graph here only goes up through a bit over a year ago But it's I know it's been continuing to grow exponentially since then as well And so what you see now is around Google that deep learning is used in like every part of the business and so it's really interesting to see how This this kind of simple idea that we can solve machine learning problems using a an Algorithm that has these properties When a big company invests heavily in actually making that happen You see this incredible growth in how much it's used So for example if you use the inbox by Google software Then when you receive an email from somebody it will often Tell you here are some replies That I could send for you and so it's actually using deep learning here to read the original email and to generate some suggested replies and so like this is a really great example of the kind of stuff that Previously just wasn't possible Another great example would be Microsoft is also a little bit more recently invested heavily in deep learning and so now you can Use Skype you can speaking to it in English and ask it at the other end to Translate it in real time to Chinese or Spanish and then when they talk back to you in Chinese or Spanish Skype will in real time translate it the speech in in their language into English speech in real time And again, this is an example of stuff which we can only do thanks to deep learning I also think it's really interesting to think about how deep learning can be combined with human expertise So here's an example of like drawing something just sketching it out And then using a program called neural doodle This is from a couple of years ago to then say please take that sketch and render it in the style of an artist And so here's the picture that it then created Rendering it as you know impressionist painting, and I think this is a really great example of how You can use deep learning to help combine human expertise and what computers are good at So I a few years ago decided to try this myself like what would happen if I took Deep learning and tried to use it to solve a really important problem, and so the problem I picked was diagnosing lung cancer It turns out if you can find lung nodules earlier There's a 10 times higher probability of survival So it's a really important problem to solve so I got together with three other people none of us had any medical background And we grabbed a data set of CT scans We used a convolutional neural network Much like the dogs versus cats one we trained at the start of today's lesson to try and predict which CT scans had malignant tumors in them And we ended up after a couple of months with something with a much lower False negative rate and a much lower false positive rate than a panel of four radiologists And we went on to build this in a startup into into a company called analytic which has really become pretty successful and Since that time the idea of using deep learning for medical imaging has become Hugely popular and it's being used all around the world So what I've generally noticed is that you know the vast majority of Of kind of things that people do in the world currently aren't using deep learning And then each time somebody says oh, let's try using deep learning to improve performance at this thing They nearly always get fantastic results and then suddenly everybody in that industry starts using it as well So there's just lots and lots of opportunities here at this particular time to use deep learning to help with all kinds of different stuff So I've jotted down a few ideas here.

These are all things which I know you can use deep learning for right now to get good results from and You know are things which people spend a lot of money on or have a lot of you know important business opportunities There's lots more as well But these are some examples of things that maybe at your company you could think about applying deep learning for So let's talk about what's actually going on What actually happened when we trained that deep learning model earlier?

And so as I briefly mentioned the thing we created is something called a convolutional neural network or CNN and The key piece of a convolutional neural network is the convolution So here's a great example from a website I've got the URL up here explained visually It's called and the explained visually website has an example of a convolution kind of in practice over here in the bottom left is a very zoomed in picture of somebody's face and Over here on the right is an example of using a convolution on that image You can see here.

This particular thing is obviously finding Edges the edges of his head right top and bottom edges in particular Now how is it doing that well if we look at each of these little three by three areas that this is moving over It's taking each three by three area of pixels and here are the pixel values right each thing in that three by three area and It's multiplying each one of those three by three pixels by each one of these three by three Kernel values in a convolution this specific set of nine values is called a kernel It doesn't have to be nine it could be four by four or five by five or two by two or whatever, right?

In this case, it's a three by three kernel and in fact in deep learning nearly all of our kernels are three by three So in this case the kernel is one two one. Oh minus one minus two minus one. So we take each of the Black through white pixel values and we multiply as you can see each of them by the corresponding value in the kernel and Then we add them all together And so if you do that for every three by three area you end up with The values that you see over here on the right hand side Okay, so very low values become Black very high values become white and so you can see when we're at an edge where it's black at the bottom and white at the top We're obviously going to get higher numbers over here and vice versa.

Okay, so that's a convolution So as you can see it is a linear operation and so based on that definition of a neural net I described before this can be a layer in our neural network. It is a simple linear operation And we're going to look at lots more at convolutions later including building a little spreadsheet that implements them ourselves So the next thing we're going to do is we're going to add a nonlinear layer so a nonlinearity as it's called is something which takes an input value and Turns it into some different value in a nonlinear way and you can see this orange picture here is an example of a nonlinear function specifically this is something called a sigmoid and so a sigmoid is something that has this kind of s shape and This is what we used to use as our nonlinearities in neural networks a lot Actually nowadays we nearly entirely use something else called a relu or rectified linear unit a relu is simply take any negative numbers and replace them with zero and Leave any positive numbers as they are so in other words in code that would be Y equals max x comma 0 so max x comma 0 simply says replace the negatives with 0 Regardless of whether you use a sigmoid or a relu or something else The key point about taking this combination of a linear layer followed by a element wise nonlinear function is That it allows us to create arbitrarily complex shapes as you see in the bottom, right?

And the reason why is that this is all from Michael Nielsen's neural networks and deep learning comm really fantastic interactive book as You change the values of your linear functions It basically allows you to kind of like build these arbitrarily tall or thin blocks and then combine those blocks together And this is actually the essence of the universal approximation theorem this idea that when you have a linear layer Feeding into a nonlinearity you can actually create these arbitrarily complex shapes So this is the key idea behind why neural networks can solve any computable problem So then we need a way as we described to actually Set these parameters so it's all very well knowing that we can move the parameters around manually to try to Create different shapes, but we have some specific shape.

We want how do we get to that shape? And so as we discussed earlier the basic idea is to use something called gradient descent This is an extract from a notebook actually one of the fast AI lessons And it shows actually an example of using gradient descent to solve a simple linear regression problem But I can show you the basic idea.

Let's say you were just you had a simple Quadratic, right and So you were trying to find the minimum of this quadratic And so in order to find the minimum you start out by randomly picking some point, right? So we say okay, let's pick let's pick here And so you go up there and you calculate the value of your quadratic at that point So what you now want to do is try to find a slightly better point So what you could do is you can move a little bit to the left And a little bit to the right to find out which direction is down and what you'll find out Is that moving a little bit to the left decreases the value of the function so that looks good, right?

and so in other words, we're calculating the derivative of the function at that point All right, so that tells you which way is down It's the gradient. And so now that we know that going to the left is down we can take a small step in that direction To create a new point and then we can repeat the process and say okay Which way is down now and we can now take another step and another step and another step another step another step, okay?

And each time we're getting closer and closer So the basic approach here is to say okay. We start we're at some point. We've got some value X Which is our current guess right that at time step n So then our new guess at time step n plus 1 is just equal to our previous guess plus the derivative Right times some Small number because we want to take a small step We need to pick a small number because if we picked a big number right then we say okay We know we want to go to the left.

Let's jump a big long way to the left we could go all the way over here and We actually end up worse right and then we do it again now or even worse again, right, so if you have too high a Step size you can actually end up with divergence rather than convergence So this number here we're going to be talking about it a lot during this course And we're going to be writing all this stuff out and code from scratch ourselves But this number here is called the learning rate Okay, so You can see here This is an example of basically starting out with some random line and then using gradient descent to gradually make the line better and better and better So what happens when you combine these ideas right the convolution?

The non-linearity and gradient descent because they're all tiny small simple little things it doesn't sound that exciting But if you have enough of these kernels Right with enough layers something really interesting happens And we can actually draw them So here's the So this is a really interesting paper by Matt Ziler and Rob Fergus and what they did a few years ago Was they figured out how to basically draw a picture of what each layer in a deep learning net network learned?

And so they showed that layer one of the network here are nine examples of convolutional filters from layer one of a trained network and they found that some of the filters kind of learnt these diagonal lines or Simple little grid patterns some of them learnt these simple gradients right and so for each of these filters They show nine examples of little pieces of actual photos Which activate that filter quite highly right so you can see layer one These learn to remember these these are learnt using gradient descent these filters were not programmed They were learnt using gradient descent right so in other words we were learning These nine numbers so layer two then was going to take these as inputs and Combine them together and so layer two had you know This is like nine kind of attempts to draw one of the examples of the filters in layer two They're pretty hard to draw but what you can do is say for each filter What are examples of little bits of images that activated them and you can see by layer two we've got?

Basically something that's being activated nearly entirely by little bits of sunset something's that's being activated by circular objects something that's being activated by Repeating horizontal lines something that's being activated by corners right so you can see how we're basically combining layer one features together So if we combine those features together and again, these are all Institutional filters learnt through gradient descent by the third layer.

It's actually learned to recognize the presence of text Another filter has learned to recognize the presence of petals Another filter has learned to recognize the presence of human faces right so just three layers is enough to get some pretty Rich behavior so but by the time we get to layer five We've got something that can recognize the eyeballs of insects and birds And something that can recognize unicycle wheels Right so so this is kind of where we start with something Incredibly simple all right But if we use it as a bit a big enough scale Thanks to the universal approximation theorem and the use of multiple hidden layers and deep learning We actually get these very very rich capabilities So that is what we used when we actually trained Our little dog versus cat recognizer, okay So Let's talk more about this dog versus cat recognizer So we've learned the idea of like we can look at the pictures that come out of the other end to see what the models Classifying well or classifying badly or which ones it's unsure about But let's talk about like this key thing.

I mentioned which is the learning rate So I mentioned we have to set this thing I just called it L before the learning rate and you might have noticed there's a couple of numbers these kind of magic numbers Here the first one is the learning rate, right? So this number is how much do you want to multiply the gradient by when you're taking each step in your gradient descent?

We already talked about why you wouldn't want it to be too high Right, but probably also it's obvious to see why you wouldn't want it to be too low, right? If you had it too low You would take like a little step and you'd be a little bit closer and a little bit step a little step little step And it would take lots and lots and lots of steps and it would take too long so setting this number well is actually really important and For the longest time this was driving deep learning researchers crazy because they didn't really know a Good way to set this reliably So the good news is last year a researcher came up with an approach to quite reliably set the learning rate Unfortunately almost nobody noticed so almost no deep learning researchers.

I know about actually are aware of this approach But it's incredibly successful and it's incredibly simple and I'll show you the idea It's built into the fast AI library as something called LR find or the learning rate finder and it comes from this paper I was actually 2015 paper.

Sorry Cyclical learning rates for training neural networks by a terrific researcher called Leslie Smith And I'll show you Leslie's idea So Leslie's idea started out with the same Basic idea that we've seen before which is if we're going to optimize something pick some random point Take its gradient Right and then specifically he said take a tiny tiny step No tiny step so a learning rate of like 10 e next 7 Right and then do it again again, but each time increase the learning rate like double it So then we try like 2 e next 7 4 e next 7 8 e next 7 10 e next 6 right and so gradually your steps Are getting bigger and bigger?

Right and so you can see what's going to happen. It's going to like Start doing almost nothing right and it's going to then suddenly the loss function is going to improve very quickly Right, but then it's going to step even further again and Then even further again Right, let's draw the rest of that line to be clear Right and so suddenly it's then going to shoot off and get much worse right, so The idea then is to go back and say okay At what point did we see like the best improvement?

So here We've got our best improvement right and so we'd say okay. Let's use that Learning rate right so in other words if we were to plot the learning rate Over time It was increasing like so Right and so what we then want to do is we want to plot the learning rate Against the loss right so when I say the loss I basically mean like how accurate is the model how close in this case the loss Would be how far away is the predicted prediction?

from the from the goal Right and so if we plotted the learning rate against the loss we'd say like okay initially it didn't do very much Right for small learning rates, and then it suddenly improved a lot and then it suddenly got a lot worse So that's the basic idea and so we'd be looking for the point where this graph is Dropping quickly right we're not looking for its minimum point We're not saying like where was at the lowest because that could actually be the point where it's just jumped too far We want at what point was it dropping?

the fastest So if you go So if you create your learn objects in the same way that we did before we'll be learning more about this these details shortly If you then call LR find method on that you'll see that it'll start training a model Like it did before but it'll generally stop before it gets to a hundred percent because if it notices That the loss is getting a lot worse Then it'll stop automatically so that you can see here.

It stopped at 84% and so then you can call Learn dot shed that gets you the learning rate scheduler That's the object which actually does this learning rate finding and that object has a plot learning rate function And so you can see here by iteration you can see the learning rate All right, so you can see each step the learning rate is getting bigger and bigger You can do it this way you can see it's increasing exponentially Another way that Leslie Smith the researcher suggests is to do it linearly So I'm actually currently researching with both of these approaches to see which works best Recently I've been mainly using exponential, but I'm starting to look more at using linear at the moment And so if we then call shed dot plot that does the plot that I just described down here learning rate versus Loss all right, and so we're looking for the highest learning rate we can find Where the loss is still improving?

clearly well right and so in this case I would say 10 to the negative 2 max at 10 to the negative 1 is not improving All right 10 to the negative 3 it is also improving But I'm trying to find the highest learning rate I can where it's still clearly improving So I'd say 10 to the negative 2 right so you might have noticed that when we ran our model before we had 10 to the negative 2 0.01.

So that's why we picked that learning rate So there's really only one other number that we have to pick and That was this number 3 and so that number 3 controlled how many epochs that we run so an epoch means going through our entire data set of images and Using each each time we do a bunch of they called mini batches we grab like 64 images at a time and use them to try to improve the model a little bit using gradient descent Right and using all of the images once is called one epoch and so at the end of each epoch we print out the accuracy and Validation and training loss at the end of the epoch so question of how many epochs should we run is kind of the one other question that you need to answer to run these three lines of code and The answer really to me is like As many as you like What you might find happen is if you run it for too long the accuracy you'll start getting worse Right and we'll learn about that why later.

It's something called overfitting right so You can run it for a while run lots of epochs Once you see it getting worse You know how many epochs you can run and the other thing that might happen is if you've got like a really big model Or what lots and lots of data maybe it takes so long you don't have time and so you just run enough epochs that Fit into the time you have available so the number of epochs you run you know that's a pretty easy thing to set So they're the only two numbers you're going to have to set and so the goal This week will be to make sure that you can run Not only these three lines of code on the data that I provided But to run it on a set of images that you either have on your computer or that you Get from work or that you download from Google And I try to get a sense of like which kinds of images does it seem to work well for?

Which ones doesn't it work well for? What kind of learning rates do you need for different kinds of images how many epochs do you need? How does the number of the learning rate change the accuracy you get and so forth like really experiment and then? You know try to get a sense of like what's inside this data object?

You know what are the y values look like what are these classes mean? If you're not familiar with numpy you know really practice a lot with numpy so that by the time you come back for the next lesson You know we're going to be digging into a lot more detail, and so you'll really feel ready to do that now one thing that's really important to be able to do that is that you need to really know how to work with Numpy the faster I library and so forth and so I want to show you some tricks in Jupyter notebook to make that much easier So one trick to be aware of is if you can't quite remember how to spell something right so If you're not quite sure What the message you want is you can always hit tab?

And you'll get a list of Methods that start with that letter right and so that's a quick way to find things If you then can't remember what the arguments are to a method hit shift tab All right, so hitting shift tab tells you the arguments to the method so shift tab is like one of the most helpful things I know So let's take np.x Shift tab and so now you might be wondering like okay.

Well. What does this function do and how does it work? If you press shift tab twice Then it actually brings up the documentation Shows you what the parameters are and shows you what it returns and gives you examples Okay, if you press it three times Then it actually pops up a whole little separate window with that information Okay, so shift tab is super helpful One way to grab that window straight away is if you just put question mark at the start Then it just brings up that little documentation window Now the other thing to be aware of is increasingly during this course We're going to be looking at the actual source code of fast AI itself and learning how it's built and why it's built that way It's really helpful to look at source code in order to you know Understand what you can do and how you can do it So if you for example wanted to look at the source code for learn dot predict you can just put two question marks Okay, and you can see it's popped up the source code right and so it's just a single line of code You'll very often find that fast AI methods like they're they're designed to never be more than About half a screen full of code and they're often under six lines so you can see this case It's calling predict with tags so we could then get the source code for that in the same way Okay And then that's calling a function called predict with tags so we could get the documentation for that in the same way and Then so here we are and then finally that's what it does it iterates through a data loader gets the predictions and then passes them back and so forth, okay, so question mark question mark is how to get source code a single question mark is how to get documentation and Shift tab is how to bring up parameters or press it more times to get the docs So that's really helpful Another really helpful thing to know about is how to use Jupyter notebook well and the button that you want to know is H If you press H, it will bring up the keyboard shortcuts Palette and so now you can see exactly what Jupyter notebook can do and how to do it I personally find all of these functions useful So I generally tell students to try and learn four or five different keyboard shortcuts a day Try them out see what they do see how they work, and then you can try practicing in that session And one very important thing to remember when you're finished with your work for the day go back to paper space and click on that little button Which stops and starts the machine so after it stopped you'll see it says connection closed and you'll see it's off If you leave it running you'll be charged for it same thing with Cressel be sure to go to your Cressel Instance and stop it you can't just turn your computer off or close the browser You actually have to stop it in Cressel or in paper space and don't forget to do that Or you'll end up being charged until You finally do remember Okay, so I think that's all of the information that you need to get started please remember about the forums If you get stuck at any point check them out But before you do make sure you read the information on course.fast.ai for each lesson All right because that is going to tell you about like things that have changed okay, so if there's been some change to which Jupyter notebook provider we suggest using or how to set up paper space or anything like that That'll all be on course.fast.ai Okay, thanks very much for watching and look forward to seeing you in the next lesson

Lesson 1: Deep Learning 2018

Chapters

Transcript