Lesson 1: Deep Learning 2018
Chapters
0:00 Introduction
1:33 Community
2:18 Coding
4:02 Jupyter Notebook
6:00 Paperspace
12:57 Running a cell
15:07 Running more cells
20:21 Training a model
29:04 Training an image classifier
30:50 Top-down approach
33:48 Lesson plan
39:22 Advice from past students
41:26 Image classifiers
44:23 Deep learning vs machine learning
47:33 An infinitely flexible function
48:43 The neural network
49:38 Gradient descent
51:03 GPU vs CPU
52:27 Hidden Layers
53:38 Google Brain
55:03 Google Inbox
55:33 Microsoft Skype
56:10 Neural Doodle
56:52 My Personal Experience
58:45 Deep Learning Ideas
00:00:00.000 |
Hi everybody, welcome to practical deep learning for coders. This is part one of our two-part course 00:00:10.720 |
Presenting this from the Data Institute in San Francisco 00:00:14.020 |
We'll be doing seven lessons in this part of the course 00:00:19.620 |
Most of them will be about a couple of hours long this first one may be a little bit shorter 00:00:26.760 |
Practical deep learning for coders is all about getting you up and running with deep learning in practice 00:00:32.080 |
Getting world-class results and it's a really coding focused approach as the name suggests 00:00:38.700 |
but we're not going to dumb it down by the end of the course you all have learned all of the 00:00:43.220 |
Theory and details that are necessary to rebuild all of the world-class results. We're learning about from scratch 00:00:49.640 |
Now I should mention that our videos are hosted on YouTube 00:00:55.600 |
But we strongly recommend watching them via our website at course.fast.ai 00:01:00.900 |
Although they're exactly the same videos, the important thing about watching them through our website 00:01:07.520 |
Is that you'll get all of the information you need about updates to libraries, file locations, 00:01:14.160 |
Further information, frequently asked questions and so forth 00:01:17.980 |
So if you're currently on YouTube watching this, why don't you switch over to course.fast.ai now and start watching through there? 00:01:25.480 |
And make sure you read all of the material on the page before you start just to make sure that you've got everything you need 00:01:31.460 |
The other thing to mention is that there is a really great, strong community at forums.fast.ai 00:01:39.020 |
From time to time you'll find that you get stuck 00:01:44.560 |
You may get stuck very early on you may not get stuck for quite a while, but at some point you might get stuck with understanding 00:01:52.680 |
Why something works the way it does or there may be some computer problem that you have or so forth 00:01:58.200 |
On forums.fast.ai there are thousands of other learners talking about every lesson and lots of other topics besides 00:02:06.040 |
It's the most active deep learning community on the internet by far. So 00:02:10.220 |
Definitely register there and start getting involved. You'll get a lot more out of this course if you do that 00:02:19.680 |
So we're going to start by doing some coding. This is an approach 00:02:24.520 |
We're going to be talking about in a moment called the top-down approach to study 00:02:29.160 |
But let's learn it by doing it. So let's go ahead and try and actually train a neural network 00:02:36.320 |
Now in order to train a neural network, you almost certainly want a GPU 00:02:42.140 |
A GPU is a graphics processing unit 00:02:47.640 |
It's the things that companies use to help you play games better 00:02:53.240 |
They let your computer render the game much more quickly than your CPU can 00:02:59.760 |
We'll be talking about them more shortly. But for now, I'm going to show you how you can get access to a GPU 00:03:07.160 |
Specifically you're going to need an Nvidia GPU because only Nvidia GPUs support something called CUDA 00:03:16.800 |
CUDA is the language and framework that nearly all deep-learning 00:03:20.780 |
libraries and practitioners use to do their work 00:03:25.160 |
Obviously, it's not ideal that we're stuck with one particular vendor's cards, and over time 00:03:31.480 |
We hope to see more competition in this space. But for now, we do need an Nvidia GPU 00:03:35.780 |
Your laptop almost certainly doesn't have one unless you specifically went out of your way to buy like a gaming laptop 00:03:44.840 |
So almost certainly you will need to rent one 00:03:52.060 |
Paying by the second for a GPU based computer is pretty easy and pretty cheap 00:04:14.600 |
Click on sign up or if you've been there before sign in 00:04:17.240 |
You will find yourself at this screen, which has a big button that says Start Jupyter and another switch called Enable GPU 00:04:25.560 |
So we make sure Enable GPU is switched on, and we click Start Jupyter 00:04:35.880 |
It's going to launch us into something called Jupyter Notebook 00:04:41.040 |
Jupyter Notebook, in a recent survey of tens of thousands of data scientists, was rated as the third most important tool 00:04:48.920 |
In the data scientist's toolbox. It's really important that you get to learn it well, and all of our courses will be run through Jupyter 00:04:55.760 |
Yes, Rachel. You have a question or comment? Oh, I just wanted to point out that you get I believe 10 free hours 00:05:07.320 |
Yeah, they might have changed that recently to fewer hours, but you can check the pricing 00:05:14.680 |
The pricing varies because this actually runs on top of Amazon Web Services. So at the moment, it's 60 cents an hour 00:05:21.680 |
The nice thing is, though, that you can always 00:05:26.240 |
Start your Jupyter without the GPU running and pay a tenth of that price, which is pretty cool 00:05:34.160 |
So Jupyter Notebook is something we'll be doing all of this course in, and so to get started here 00:05:39.160 |
we're going to find our particular course, so we'd go to courses and 00:05:49.440 |
Things have been moving around a little bit. So it may be in a different spot for you 00:05:53.400 |
When you look at this, and we'll make sure all the current information is on the website 00:06:00.000 |
Now having said that, the Crestle approach is, as you can see, basically instant and easy 00:06:08.000 |
But if you've got, you know, an extra hour or so to get going, an even better option is Paperspace 00:06:19.440 |
Paperspace, unlike Crestle, doesn't run on top of Amazon; they have their own machines 00:06:29.800 |
So here's Paperspace, and if I click on New Machine I 00:06:38.400 |
Can pick which one of their three data centers to use, so pick the one closest to you. So I'll say West Coast and 00:06:45.160 |
Then I'll say Linux, and I'll say Ubuntu 16.04 00:06:50.560 |
And then it says choose machine and you can see there's various different machines I can choose from 00:06:59.760 |
So this is pretty cool: for 40 cents an hour, so it's cheaper than Crestle, 00:07:05.680 |
I get a machine that's actually going to be much faster than Crestle's 60-cents-an-hour machine; or for 65 cents an hour 00:07:15.040 |
So I'm going to actually show you how to get started with the Paperspace approach 00:07:20.000 |
Because that actually is going to do everything from scratch 00:07:25.400 |
You may find if you try to do the 65-cents-an-hour one that it may require you to contact Paperspace to say 00:07:32.520 |
Like, why do you want it? That's just an anti-fraud thing. So if you say fast.ai there 00:07:40.880 |
They'll quickly get you up and running. So I'm going to use the cheapest one here, 40 cents an hour 00:07:52.720 |
Note that you pay for a month of storage as soon as you start the machine up 00:07:56.880 |
Right, so don't start and stop lots of machines because each time you pay for that month of storage 00:08:01.160 |
I think the 250 gig seven dollar a month option is pretty good 00:08:05.680 |
But you really only need 50 gig, so if you're trying to minimize the price you can go there 00:08:09.560 |
The only other thing you need to do is turn on public IP so that we can actually log into this and 00:08:17.520 |
We can turn off auto snapshot to save the money of not having backups 00:08:21.700 |
All right, so if you then click on create your paper space about a minute later you will find 00:08:33.900 |
That your machine will pop up. Here is my Ubuntu 16.04 machine 00:08:43.780 |
You will find that they have emailed you a password so you can copy that 00:08:51.880 |
You can go to your machine and enter your password now to paste the password 00:08:56.760 |
You would press Ctrl-Shift-V, or on Mac I guess Cmd-Shift-V 00:09:01.920 |
So it's slightly different to normal pasting or of course you can just type it in 00:09:07.400 |
And here we are. Now we can make a little bit more room here by clicking on these little arrows 00:09:17.720 |
And so as you can see we've got like a terminal that's sitting inside 00:09:22.240 |
Our browser which is kind of quite a handy way to do it 00:09:26.000 |
So now we need to configure this for the course, and the way you configure it for the course is you type 00:09:36.760 |
http://files.fast.ai/setup/paperspace 00:09:49.640 |
Okay, and so that's then going to run a script which is going to set up all of the CUDA drivers, 00:09:59.280 |
The Python distribution we use called Anaconda, all of the libraries, all of the courses, 00:10:06.440 |
And the data we use for the first part of the course 00:10:10.280 |
Okay, so that takes an hour or so, and when it's finished running you'll need to reboot, not your own computer, 00:10:20.240 |
But your Paperspace computer. And so to do that you can just click on this little circular restart machine button 00:10:26.080 |
Okay, and when it comes back up you'll be ready to go. So what you'll find 00:10:31.040 |
Is that you've now got an anaconda3 directory. That's where your Python is 00:10:37.400 |
You've got a data directory which contains the data for the first lesson of this part of the course, which is the dogs and cats data set 00:10:55.320 |
cd fastai, and from time to time you should run git pull, and that will just make sure that all of your 00:11:04.040 |
fastai stuff is up to date, and also from time to time 00:11:07.960 |
You might want to just check that your Python libraries are up to date, and so you can type conda env update 00:11:15.960 |
Alright, so make sure that you've cd'd into fastai, and then you can type jupyter notebook 00:11:28.000 |
So we now have a Jupyter Notebook server running, and we want to connect to that, right? And so you can see here 00:11:36.720 |
Into your browser when you connect so if you double click on it 00:11:48.160 |
Then you can go and paste it, but you need to change this localhost 00:11:53.680 |
To be the Paperspace IP address. So if you click on the little arrows to go smaller 00:12:08.800 |
So it's now http and then my IP and then everything else I copied before, and so there it is 00:12:19.360 |
Git repo, and our courses are all in courses, and in there the deep learning part one is dl1, and 00:12:41.400 |
Depending whether you're using Crestle or Paperspace or something else, if you check course.fast.ai 00:12:46.560 |
We'll keep putting additional videos and links to information about how to set up other 00:13:01.500 |
You select the cell and you hold down shift and press enter or if you've got the toolbar showing 00:13:08.480 |
You can just click on the little run button, so you'll notice that some cells contain 00:13:15.040 |
Code and some contain text and some contain pictures and some contain videos so this environment basically has 00:13:22.840 |
You know, it's a way that we can give you access to run 00:13:29.260 |
Experiments and to kind of tell you what's going on, show pictures 00:13:33.660 |
This is why it's like a super popular tool in data science; data science is kind of all about running experiments 00:13:46.400 |
And you'll see the number next to that cell turn into a star for a moment, and then it finished running 00:13:52.400 |
Okay, so let's try the next one this time instead of using the toolbar. I'm going to hold down shift and press enter 00:14:00.080 |
It turned into a star and then it said 2. So if I hold down shift and keep pressing enter, it just keeps running each 00:14:06.080 |
Cell right so I can put anything I like for example one plus one 00:14:17.840 |
Yes, Rachel. Oh, this is just a side note, but I wanted to point out that we're using Python 3 here 00:14:24.400 |
Yes, thank you, Python 3, and so you'll get some errors if you're still using Python 2. Mm-hmm. Yeah 00:14:29.560 |
And it is important to switch to Python 3 now; well, for fast.ai it's required 00:14:37.480 |
But you know increasingly a lot of libraries are 00:14:47.400 |
Now it mentions here that you can download the data set for this lesson from this location 00:14:58.360 |
If you used the Paperspace script that we just set up with, this will already be made available for you 00:15:03.680 |
Okay, if you're not, you'll need to wget it 00:15:10.480 |
Crestle is quite a bit slower than Paperspace, and also it 00:15:14.040 |
There are some particular things it doesn't support that we really need, and so there are a couple of extra steps if you're using 00:15:21.600 |
Crestle. You have to run two more cells, right? So you can see these are commented out 00:15:27.320 |
So if you remove the hashes from these and run these two additional cells, that just runs the stuff that you only 00:15:34.320 |
Need for Crestle. I'm using Paperspace, so I'm not going to run it 00:15:43.600 |
Data. So we set up this path to data/dogscats 00:15:47.880 |
That's pre-set-up for you, and so inside there, you can see here, I can use an exclamation mark 00:15:55.800 |
To basically say I don't want to run Python, I want to run bash, 00:15:59.480 |
I want to run a shell. So this runs a bash command, and the bit inside the curly brackets 00:16:05.720 |
Actually refers to a Python variable, so it inserts that Python variable into the bash command 00:16:13.800 |
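In the notebook, a line like `!ls {PATH}` (the variable name here is illustrative) uses two Jupyter features: the `!` runs a bash command rather than Python, and the curly brackets splice a Python variable into that command. A rough plain-Python sketch of the same idea, using a throwaway temp directory:

```python
import os
import tempfile

# PATH is a hypothetical stand-in for the course's data path variable
PATH = tempfile.mkdtemp()
os.makedirs(os.path.join(PATH, "train"))
os.makedirs(os.path.join(PATH, "valid"))

# The notebook's `!ls {PATH}` does roughly this: list the directory
# named by the Python variable PATH
contents = sorted(os.listdir(PATH))
print(contents)  # → ['train', 'valid']
```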
There's a training set and a validation set if you're not familiar with the idea of training sets and validation sets 00:16:21.040 |
It would be a very good idea to check out our machine learning course 00:16:27.040 |
Which tells you a lot about this kind of stuff, like the basics of how to set up and run machine learning 00:16:36.360 |
Would you recommend that people take that course before this one? 00:16:40.340 |
Actually, a lot of students, as they went through these, said they liked doing them together 00:16:53.320 |
Yeah, they cover some similar stuff but all in different directions, so people who have done both say they find that 00:17:01.760 |
They each support each other. I wouldn't say it's a prerequisite 00:17:05.720 |
But you know if I do if I say something like hey 00:17:08.760 |
This is a training set and this is a validation set and you're going I don't know what that means 00:17:12.000 |
At least Google it do a quick read you know because we're assuming 00:17:15.480 |
That you know the very basics of kind of what machine learning is and does to some extent 00:17:23.260 |
And I have a whole blog post on this topic as well 00:17:26.320 |
Okay, and we'll make sure that you link to that from course.fast.ai 00:17:29.680 |
And I also just wanted to say in general with fast.ai our philosophy is to 00:17:34.080 |
Kind of learn things on an as-needed basis. Yeah exactly don't try and learn everything that you think you might need first 00:17:41.560 |
Otherwise you'll never get around to learning the stuff you actually want to learn 00:17:44.360 |
Exactly and that shows up in deep learning. I think 00:17:53.560 |
There's a cats folder and a dogs folder and then inside the validation cats folder is a whole bunch of JPEGs 00:18:00.400 |
The reason that it's set up like this is that this is kind of the most common standard approach for how 00:18:06.940 |
Image classification data sets are shared and provided, and the idea is that each of the images in the cats folder 00:18:17.640 |
Is labeled cats and each of the images in the dogs folder is labeled dogs. Okay? 00:18:26.560 |
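The layout being described, with illustrative file names (the exact directory name is whatever your data path is), looks something like:

```
dogscats/
├── train/
│   ├── cats/    cat.1.jpg, cat.2.jpg, ...   (every image here is labeled "cats")
│   └── dogs/    dog.1.jpg, dog.2.jpg, ...   (every image here is labeled "dogs")
└── valid/
    ├── cats/
    └── dogs/
```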
So this is a pretty standard way to share image classification 00:18:40.800 |
We can see an example of the first of the cats 00:18:49.920 |
This is a Python 3.6 format string, so you can Google for that if you haven't seen it 00:18:54.200 |
It's a very convenient way to do string formatting, and we use it a lot 00:18:57.080 |
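As a quick, generic illustration of Python 3.6 format strings (the folder and file names here are made up):

```python
# An f-string evaluates whatever is inside {} and splices it into the string
folder = "valid/cats"
fname = "cat.1.jpg"
print(f"{folder}/{fname}")   # → valid/cats/cat.1.jpg
print(f"2 + 2 = {2 + 2}")    # → 2 + 2 = 4
```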
So there's our cat, but we're going to mainly be interested in the underlying data that makes up that cat 00:19:07.760 |
It's an image whose shape, that is, the dimensions of the array, is 198 by 179 by 3 00:19:15.080 |
So it's a three-dimensional array also called a rank 3 tensor 00:19:18.520 |
And here are the first four rows and four columns of that image 00:19:30.640 |
Items in it, and this is the red green and blue pixel values between 0 and 255 00:19:37.160 |
So here's a little subset of what a picture actually looks like inside your computer 00:19:43.320 |
So that's that. Our idea will be to take these kinds of numbers and 00:19:48.520 |
Use them to predict whether those kinds of numbers represent a cat 00:19:52.340 |
Or a dog based on looking at lots of pictures of cats and dogs 00:19:56.640 |
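To make the point that an image is just an array of numbers, here is a synthetic stand-in built with NumPy (a real photo would be loaded from disk; the shape matches the cat image in the lesson):

```python
import numpy as np

# A fake 198 x 179 RGB image: rows x columns x (red, green, blue) channels,
# with pixel values between 0 and 255
img = np.random.randint(0, 256, size=(198, 179, 3), dtype=np.uint8)

print(img.shape)    # → (198, 179, 3), a rank 3 tensor
print(img[:4, :4])  # the first four rows and columns, as shown in the lesson
```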
So that's a pretty hard thing to do, and at the point in time when 00:20:02.480 |
This data set was released (it actually comes from a Kaggle competition, the Dogs vs. Cats Kaggle competition) 00:20:11.720 |
The state of the art was 80% accuracy, so computers weren't really able to accurately recognize dogs versus cats at all 00:20:29.960 |
Here are the three lines of code necessary to train a model 00:20:34.680 |
And so let's go ahead and run it so I'll click on this on the cell. I'll press shift enter 00:20:42.160 |
Then we'll wait a couple of seconds for it to pop up and there it goes 00:20:51.660 |
So I've asked it to do three epochs so that means it's going to look at every image 00:20:55.440 |
Three times in total or look at the entire set of images three times 00:20:59.560 |
That's what we mean by an epoch and as we do it's going to print out 00:21:05.880 |
The accuracy; it's the last of the three numbers that prints out, measured on the validation set, okay? 00:21:14.120 |
In short, they're the value of the loss function, which is in this case the cross-entropy loss, 00:21:18.520 |
For the training set and the validation set, and then right at the start here is the epoch number 00:21:29.480 |
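As a rough sketch of what an epoch is, here is a toy training loop: plain gradient descent on a one-parameter model, which has nothing to do with fastai's internals but shows "three passes over the data, printing the loss each time":

```python
import numpy as np

# Toy data generated from y = 3x; we try to recover the weight 3
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 3.0 * xs
w, lr = 0.0, 0.01

for epoch in range(3):              # 3 epochs = 3 full passes over the data
    for x, y in zip(xs, ys):        # each epoch looks at every example once
        grad = 2 * (w * x - y) * x  # derivative of squared error w.r.t. w
        w -= lr * grad
    loss = np.mean((w * xs - ys) ** 2)
    print(epoch, loss)              # the loss shrinks from one epoch to the next
```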
And it took 17 seconds so you can see we've come a long way since 00:21:38.460 |
This actually would have won the Kaggle competition of that time the best in the Kaggle competition was 98.9 00:21:48.160 |
So this may surprise you, that we're getting a 00:21:52.680 |
Kaggle-winning, as of end of 2012, early 2013, 00:21:57.480 |
Image classifier in 17 seconds 00:22:07.880 |
And I think that's because like a lot of people assume that deep learning takes a huge amount of time 00:22:14.800 |
And lots of resources and lots of data and as you'll learn in this course 00:22:23.400 |
One of the ways we've made it much simpler is that this code is written on top of a library we built 00:22:34.560 |
The fastai library is basically a library which takes all of the 00:22:39.960 |
Best-practice approaches that we can find, and so each time a paper comes out that looks interesting, 00:22:46.960 |
We test it out. If it works well for a variety of data sets and we can figure out how to tune it, 00:22:52.020 |
We implement it in fastai. And so fastai kind of curates all this stuff and packages it up for you and 00:22:58.480 |
Much of the time or most the time kind of automatically figures out the best way to handle things 00:23:03.420 |
So the fastai library is why we were able to do this in just three lines of code 00:23:07.560 |
And the reason that we were able to make the fastai library work 00:23:11.760 |
So well is because it in turn sits on top of something called PyTorch, 00:23:18.680 |
A really flexible deep learning and machine learning and GPU computation library written by Facebook 00:23:27.600 |
Most people are more familiar with TensorFlow than PyTorch, because Google markets it pretty heavily 00:23:33.960 |
But most of the top researchers I know nowadays, at least the ones that aren't at Google, have switched across to PyTorch 00:23:40.680 |
Yes, Rachel? And we'll be covering some PyTorch later in the course. Yeah, I mean, one of the things that 00:23:46.880 |
Hopefully you'll really like about fastai is that it's really flexible: you can use all these kind of curated best practices as 00:23:56.560 |
Much or as little as you want, and so it's really easy to hook in at any point and write your own 00:24:02.040 |
Data augmentation, write your own loss function, write your own network architecture, whatever. And so we'll do all of those things 00:24:17.640 |
Take a look at: so what does the validation set's 00:24:22.560 |
Dependent variable, the y, look like? And it's just a bunch of zeros and ones, right? 00:24:27.160 |
So if we look at data.classes, the zeros represent cats and the ones represent dogs 00:24:32.760 |
You'll see here there's basically two objects I'm working with: one is an object called data, 00:24:36.980 |
Which contains the validation and training data, and another one is the object called learn, which contains the model, right? 00:24:44.120 |
So anytime you want to find something out about the data, we can look inside data 00:24:49.320 |
So we want to get predictions for the validation set, and so to do that we can call learn.predict 00:24:57.760 |
So you can see here the first ten predictions, and what it's giving you is a prediction for dog and a prediction for cat 00:25:05.200 |
Now, the way PyTorch generally works, and therefore fastai also works, is that most models return the log 00:25:14.280 |
Of the predictions rather than the probabilities themselves. We'll learn why that is later in the course 00:25:19.900 |
So for now, recognize that to get your probabilities you have to exponentiate the log predictions 00:25:26.600 |
You'll see here we're using NumPy. np is numpy; if you're not familiar with numpy, 00:25:32.720 |
That is one of the things that we assume that you have some familiarity with 00:25:36.400 |
So be sure to check out the material on course.fast.ai to learn the basics of numpy 00:25:48.080 |
It's what we use for fast numerical programming, array computation, that kind of thing 00:25:54.860 |
Okay, so we can get the probabilities using that 00:26:02.120 |
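Since the model returns log probabilities, the pattern is to exponentiate and then threshold. This is a generic NumPy sketch with made-up numbers, not the library's exact output format:

```python
import numpy as np

# Pretend log probabilities for 3 validation images; columns are [cat, dog]
log_preds = np.log(np.array([[0.9, 0.1],
                             [0.2, 0.8],
                             [0.6, 0.4]]))

probs = np.exp(log_preds[:, 1])     # probability of "dog" (column 1)
is_dog = (probs > 0.5).astype(int)  # anything greater than 0.5 we call a dog

print(probs)   # roughly [0.1 0.8 0.4]
print(is_dog)  # → [0 1 0]
```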
There's a few functions here that you can look at yourself if you're interested, but just some plotting functions that we'll use 00:26:13.720 |
Images and so here are some images that it was correct about okay, and so remember one is a dog 00:26:22.360 |
So anything greater than 0.5 is a dog and 0 is a cat, so this one with 10 to the negative 5 is obviously a cat 00:26:32.320 |
Right, so you can see that some of these which it thinks are incorrect obviously are just, you know, images that shouldn't be there at all 00:26:41.320 |
But clearly this one, which it called a dog, is not at all a dog, so there are some obvious mistakes 00:26:53.160 |
Which cats is it the most confident are cats? Which dogs are the most dog-like, the most confident dogs? 00:27:02.320 |
Perhaps more interestingly we can also see which cats is it the most confident are actually dogs 00:27:09.000 |
so which ones it is at the most wrong about and 00:27:11.960 |
Same thing for the ones the dogs that it really thinks are cats and again some of these are just 00:27:18.640 |
Pretty weird. I guess there is a dog in there. Yes, Rachel 00:27:22.700 |
I just say do you want to say more about why you would want to look at your data? 00:27:29.920 |
So yeah, so finally I just mentioned the last one we've got here is to see which ones have the probability closest to 0.5 00:27:38.560 |
So these are the ones that the model knows it doesn't really know what to do with, and some of these it's not surprising 00:27:48.760 |
Always the first thing I do after I build a model is to try to find a way to like visualize what it's built 00:27:59.800 |
Then I need to take advantage of the things it's doing well and fix the things it's doing badly. So in this case 00:28:07.640 |
And often this is the case. I've learned something about the data set itself 00:28:11.240 |
Which is that there are some things that are in here that probably shouldn't be 00:28:20.800 |
Model has room to improve; like, to me that's pretty obviously a 00:28:25.840 |
Dog. But one thing I'm suspicious about here is that this image is very 00:28:39.160 |
The way these algorithms work is it kind of grabs a square piece at a time 00:28:44.320 |
So this rather makes me suspicious that we're going to need to use something called data augmentation 00:28:49.080 |
That we'll learn about later to handle this properly 00:29:03.000 |
We've now built an image classifier and something that you should try now is to grab some data 00:29:15.720 |
Two or more different types of thing put them in different folders and run the same three lines of code 00:29:26.960 |
That it will work for that as well, as long as they are pictures of things like 00:29:33.160 |
the kinds of things that people normally take photos of right, so if they're 00:29:37.800 |
Microscope pictures or pathology pictures or 00:29:41.840 |
CT scans or something this won't work very well as we'll learn about later 00:29:47.360 |
There are some other things we'd need to do to make that work, but for things that look like normal photos 00:29:54.760 |
These you can run exactly the same three lines of code and just point your 00:30:09.120 |
Took those three lines of code, downloaded from Google Images 00:30:12.840 |
Ten examples of pictures of people playing cricket and ten examples of people playing baseball, and built a classifier 00:30:19.800 |
Of those images, which was nearly perfectly correct 00:30:25.400 |
A student actually also tried downloading seven pictures of 00:30:29.360 |
Canadian currency and seven pictures of American currency, and again in that case the model was a hundred percent 00:30:37.280 |
Accurate so you can just go to Google images if you like and download a few things of a few different classes and see 00:30:43.440 |
See what works and tell us on the forum both your successes and your failures 00:30:54.280 |
Train a neural network, but we didn't first of all tell you what a neural network is or what training means or 00:31:04.480 |
Why is that? Well, this is the start of our top-down approach to learning 00:31:11.140 |
And basically the idea is that unlike the way math and technical subjects are usually taught 00:31:17.760 |
where you learn every little element piece by piece and you don't actually get to put them all together and 00:31:23.680 |
Build your own image classifier until third year of graduate school. Our approach is to say from the start 00:31:31.600 |
Hey, let's show you how to train an image classifier and now you can start doing stuff 00:31:36.700 |
And then gradually we dig deeper and deeper and deeper 00:31:46.160 |
Throughout the course you're going to see like new problems that we want to solve 00:31:50.640 |
So for example in the next lesson, we'll look at well 00:31:54.620 |
What if we're not looking at normal kinds of photos, but we're looking at satellite images 00:32:01.180 |
And we'll see why it is that this approach that we're learning today doesn't quite work as well 00:32:06.000 |
And what things do we have to change and so we'll learn enough about the theory to understand why that happens 00:32:12.100 |
And then we'll learn about the libraries and how we can change things with the libraries to make that work better 00:32:20.440 |
So during the course we're gradually going to learn to solve more and more problems as we do 00:32:25.480 |
So we'll need to learn more and more parts of the library more and more bits of the theory until by the end 00:32:31.960 |
We're actually going to learn how to create a 00:32:37.440 |
Neural net architecture from scratch and our own training loop from scratch, and so we'll actually build everything 00:32:47.760 |
Approach. Yes, Rachel and we sometimes also call this the whole game 00:32:52.240 |
Which is inspired by Harvard professor David Perkins 00:32:57.440 |
And so the idea with the whole game is like this is more like how you would learn baseball or music 00:33:02.280 |
With baseball you would get taken to a ball game. You would learn what baseball is 00:33:07.240 |
You would start playing it and it would only be years later that you might learn about the physics of how curveball works 00:33:14.720 |
For example or with music we put an instrument in your hand and you start 00:33:20.040 |
Banging the drum or hitting the xylophone and it's not until years later that you learn about the circle of fifths and understand 00:33:29.160 |
So yeah, so that's this is kind of the approach we're using it's very inspired by 00:33:37.680 |
So what that does mean is to take advantage of this as we peel back the layers 00:33:43.440 |
We want you to keep like looking under the hood yourself as well like experiment a lot because this is a very code driven 00:33:51.880 |
Approach so here's basically what happens right? We start out looking today at 00:33:57.280 |
convolutional neural networks for images and then in a couple of lessons 00:34:02.000 |
We'll start to look at how to use neural nets to look at structured data and then to look at language data and then to look 00:34:10.960 |
And then we kind of then take all of those steps and we go backwards through them in reverse order 00:34:18.040 |
So, by the end of that fourth piece, by the end of lesson four, you will know 00:34:22.120 |
How to create a world-class image classifier, a world-class 00:34:30.160 |
Structured data analysis program, a world-class language classifier, a world-class recommendation system 00:34:36.660 |
And then we're going to go back over all of them again and learn in depth about like well 00:34:43.360 |
And how do we change things around and use it in different situations, for the recommendation systems, structured data, 00:34:51.000 |
Images and then finally back to language. So that's how it's going to work 00:34:56.680 |
So what that kind of means is that most students find that they tend to watch the videos two or three times; 00:35:06.720 |
Watch lesson one two or three times, and lesson two two or three times, and lesson three three times 00:35:11.240 |
But like, they do the whole thing end to end, lessons one through seven, and then go back and start lesson one again 00:35:18.280 |
That's an approach which a lot of people find, when they want to kind of go back and understand all the details, 00:35:23.840 |
Can work pretty well. So I would say, you know, aim to get through to the end of lesson seven 00:35:30.220 |
You know, as quickly as you can, rather than aiming to fully understand every detail from the start 00:35:39.040 |
So basically the plan is that in today's lesson you learn 00:35:46.760 |
In as few lines of code as possible, with as few details as possible, 00:35:52.200 |
How you actually build an image classifier with deep learning, in this case to say: 00:35:57.800 |
Hey, here are some pictures of dogs as opposed to pictures of cats 00:36:05.200 |
How to look at different kinds of images, and particularly we're going to look at images from satellites 00:36:13.960 |
What kinds of things might you be seeing in that image? And there could be multiple things that we're looking at, so a multi-label problem 00:36:23.400 |
From there, we'll move to something which is perhaps the most widely applicable for the most people 00:36:29.540 |
which is looking at what we call structured data, meaning data from 00:36:37.360 |
databases or spreadsheets. So we're going to specifically look at this data set of predicting sales, 00:36:43.080 |
The number of things that are sold at different stores on different dates 00:36:48.840 |
Based on different holidays and and so on and so forth and so we're going to be doing this sales forecasting 00:36:56.520 |
After that we're going to look at language, 00:37:05.120 |
and just like we created image classifiers for any kind of image, 00:37:10.800 |
we'll learn to create NLP classifiers to classify any kind of language in lots of different ways. 00:37:18.720 |
Then we'll look at something called collaborative filtering which is used mainly for recommendation systems 00:37:23.840 |
We're going to be looking at this data set that showed for different people for different movies. What rating did they give it? 00:37:32.760 |
This is maybe an easier way to think about it 00:37:35.560 |
Is there are lots of different users and lots of different movies and then for each one we can look up for each user 00:37:41.480 |
how much they like that movie, and the goal will be, of course, to predict for user and movie combinations 00:37:47.840 |
we haven't seen before: are they likely to enjoy that movie or not? And that's the really common approach used for, like, 00:37:55.640 |
Deciding what stuff to put on your home page when somebody's visiting 00:37:59.400 |
You know what book might they want to read or what film might they want to see or so forth? 00:38:03.880 |
From there we're going to then dig back into language a bit more and we're going to look at 00:38:12.080 |
Actually, we're going to look at the writings of Nietzsche the philosopher and learn how to create our own Nietzsche philosophy from scratch 00:38:21.320 |
So this here perhaps that every life of values of blood of intercourse when it senses there is unscrupulous his very rights and still impulse 00:38:31.240 |
That's actually like some character by character generated text that we built with this recurrent neural network 00:38:41.280 |
And then finally we're going to loop all the way back to computer vision again 00:38:44.680 |
We're going to learn how not just to recognize cats from dogs 00:38:48.440 |
But to actually find like where the cat is with this kind of heat map 00:38:52.160 |
And we're also going to learn how to write our own architectures from scratch 00:38:56.800 |
um, so this is an example of a ResNet, which is the kind of network that we 00:39:01.280 |
are using in today's lesson for computer vision. 00:39:04.880 |
And so we'll actually end up building the network and the training loop from scratch 00:39:09.900 |
And so they're basically the the steps that we're going to be taking from here and at each step. We're going to be getting into 00:39:16.200 |
Increasing amounts of detail about how to actually do these things yourself 00:39:21.320 |
So we've actually heard back from our students of past courses about what they found and 00:39:30.020 |
one of the things that we've heard a lot of students say is that they spent too much time on theory and not enough time running code. 00:39:43.080 |
And even after we tell people about this warning, they still come to the end of the course and often say, I wish I had taken 00:39:52.160 |
seriously that advice, which is to keep running code. 00:39:55.080 |
So these are actual quotes from our forum: "In retrospect, 00:39:59.280 |
I should have spent the majority of my time on the actual code and the notebooks." 00:40:14.120 |
World-class models in a code first approach learning what you need as you go 00:40:19.520 |
It's very different to a lot of the advice you'll read out there, such as this 00:40:23.640 |
person on Hacker News who claimed that the best way to become an ML engineer is to 00:40:32.080 |
Learn all of math learn C and C++ learn parallel programming learn ML 00:40:38.920 |
Algorithms implement them yourself using plain C and finally start doing ML 00:40:43.840 |
So we would say if you want to become an effective practitioner do exactly the opposite of this 00:40:50.240 |
Yes, Rachel? "Oh yeah, I'm just highlighting that 00:40:53.920 |
we think this is bad advice, and this can be very discouraging for a lot of people to come across." Yeah, 00:41:00.760 |
you know, we now have thousands or tens of thousands of people that have done this course, and 00:41:09.160 |
lots and lots of examples of people who have now 00:41:17.580 |
created patents based on deep learning and so forth, and who have done it by doing this course. 00:41:27.560 |
Now one thing to mention: we've now already learned how you can actually train a world-class image classifier in 00:41:35.840 |
17 seconds. I should mention, by the way, that the first time you run that code 00:41:41.600 |
there are two things it has to do that take more than 17 seconds. One is that it downloads a 00:41:47.440 |
pre-trained model from the internet, so you'll see the first time you run it, it'll say downloading model. 00:41:57.360 |
Also, the first time you run it, it pre-computes and caches 00:42:00.200 |
some of the intermediate information that it needs, and that takes about a minute and a half as well. 00:42:10.920 |
So expect the first run to take a few minutes to download and pre-compute stuff; that's normal. If you run it again, you should find it takes about 17 seconds. 00:42:20.320 |
So, image classifiers: you know, you may not feel like you need to recognize cats versus dogs very often on a computer, 00:42:30.720 |
but what's interesting is that these image classification algorithms are really useful for lots and lots of things. 00:42:41.760 |
For example, AlphaGo, which beat the Go world champion: the way it worked was to use something 00:42:49.480 |
at its heart that looked almost exactly like our dogs versus cats image classification algorithm. 00:42:56.360 |
It looked at thousands and thousands of Go boards, 00:43:00.800 |
and for each one there was a label saying whether that Go board ended up being the winning or the losing board. 00:43:10.320 |
So it learned basically an image classifier that was able to look at a Go board and figure out whether it was a good Go board or a bad 00:43:17.000 |
Go board, and that's really the key, most important 00:43:20.800 |
step in playing Go well: to know which move is better. 00:43:25.720 |
Another example is one of our earlier students, who actually used this for fraud detection. 00:43:38.160 |
He had lots of examples of his customers' mouse movements, because they provided kind of this 00:43:46.400 |
user-tracking software to help avoid fraud. So he took the mouse paths, 00:43:52.540 |
basically, of the users on his customers' websites, 00:43:56.680 |
turned them into pictures of where the mouse moved and how quickly it moved, 00:44:01.800 |
and then built an image classifier that took those images 00:44:06.680 |
as input, and as output: was that a fraudulent transaction or not? 00:44:12.480 |
And it turned out to get, you know, really great results for his company. So image classifiers 00:44:18.440 |
are much more flexible than you might imagine. 00:44:26.240 |
So these are some of the ways you can use deep learning, specifically for image recognition. And 00:44:39.520 |
deep learning is not, you know, just a word that means the same thing as machine learning. 00:44:42.680 |
Like, what is it that we're actually doing here when we're doing deep learning? 00:44:46.400 |
Deep learning is a kind of machine learning. 00:44:50.400 |
So machine learning was invented by this guy, Arthur Samuel, who was pretty amazing, in the late 50s. 00:44:57.060 |
He got this IBM mainframe to play checkers better than he could, and the way he did it 00:45:09.520 |
was to get the computer to play against itself lots of times and figure out which kinds of things led to victories and which kinds of things didn't, 00:45:15.680 |
and it used that to kind of almost write its own program. 00:45:19.320 |
And Arthur Samuel actually said in 1962 that he thought that one day the vast majority of computer software 00:45:26.560 |
would be written using this machine learning approach, rather than written by hand, by writing the loops and so forth by hand. 00:45:35.400 |
So I guess that hasn't happened yet, but it seems to be in the process of happening 00:45:41.400 |
I think one of the reasons it didn't happen for a long time is because traditional machine learning actually was very difficult and very 00:45:49.820 |
knowledge- and time-intensive. So, for example, here's something called the computational pathologist, or C-Path, 00:45:57.560 |
from a guy called Andy Beck, back when he was at Stanford. 00:46:08.400 |
And what he did was he took these pathology slides of breast cancer 00:46:17.000 |
he worked with lots of pathologists to come up with ideas about what kinds of 00:46:23.280 |
patterns or features might be associated with patients surviving a long time versus 00:46:30.720 |
dying quickly, basically. And so 00:46:35.800 |
they came up with these ideas, like the relationship between epithelial nuclear neighbors, 00:46:39.320 |
the relationship between epithelial and stromal objects, and so forth. So they came up with all of these ideas of features. 00:46:45.880 |
these are just a few of the hundreds that they thought of and then lots of 00:46:52.840 |
specialist algorithms to calculate all these different features. And then those 00:47:00.360 |
features were passed into a logistic regression 00:47:02.580 |
to predict survival, and it ended up working very well. 00:47:06.920 |
It ended up that the survival predictions were more accurate than pathologists' own survival predictions. 00:47:15.080 |
and so machine learning can work really well, but the point here is that this was a 00:47:19.720 |
An approach that took lots of domain experts and computer experts 00:47:26.040 |
many years of work to actually build this thing, right? 00:47:40.000 |
So specifically, I'm going to show you something different: rather than building a very specific function with all this handcrafted 00:47:51.120 |
feature engineering, we're going to try and create an infinitely flexible function, a function that could solve any problem. 00:47:58.000 |
Right it would solve any problem if only you set the parameters of that function correctly 00:48:03.440 |
And so then we need some all-purpose way of setting the parameters of that function 00:48:08.760 |
And we would need that to be fast and scalable 00:48:11.220 |
Right, now if we had something that had these three things, we wouldn't need that 00:48:17.080 |
incredibly time- and domain-knowledge-intensive approach anymore; instead we could learn all of those things. 00:48:29.320 |
The algorithm in question which has these three properties is called deep learning 00:48:34.440 |
Or if not an algorithm, then maybe we would call it a class of algorithms 00:48:39.240 |
Let's look at each of these three things in turn 00:48:43.560 |
So the underlying function that deep learning uses is something called the neural network 00:48:49.240 |
Now, the neural network: we're going to learn all about it and implement it ourselves from scratch later on in the course. 00:48:56.360 |
But for now all you need to know about it is that it consists of a number of simple linear layers 00:49:03.200 |
interspersed with a number of simple nonlinear layers 00:49:07.040 |
And when you intersperse these layers in this way, 00:49:12.880 |
you get something called the universal approximation theorem, and the universal approximation theorem says that this kind of function can approximate any function 00:49:24.960 |
to arbitrarily close accuracy, as long as you add enough parameters. 00:49:31.880 |
So it's actually provably shown to be an infinitely flexible function 00:49:38.520 |
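The structure described here, simple linear layers interspersed with simple nonlinear layers, can be sketched in a few lines of NumPy. This is an illustrative toy, not the course's fastai code; the layer sizes are arbitrary and the bias terms a real network would have are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with a nonlinearity in between; the sizes
# (3 inputs, 4 hidden units, 1 output) are arbitrary choices.
w1 = rng.normal(size=(3, 4))    # weights of the first linear layer
w2 = rng.normal(size=(4, 1))    # weights of the second linear layer

def relu(x):
    # a simple nonlinear layer: negatives become zero
    return np.maximum(x, 0)

def net(x):
    # linear layer, then nonlinear layer, then linear layer
    return relu(x @ w1) @ w2

x = np.array([1.0, 2.0, 3.0])
print(net(x))                   # one output value, determined by the parameters
```

Fitting the weights `w1` and `w2` so this function does something useful is exactly the job of gradient descent, which comes next.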
Right. So now we need some way to fit the parameters so that this infinitely flexible neural network solves some specific problem and 00:49:46.240 |
so the way we do that is using a technique that 00:49:50.300 |
probably most of you will have come across before at some stage called gradient descent and with gradient descent we basically say 00:49:57.680 |
Okay, well for the different parameters we have 00:50:00.200 |
how good are they at solving my problem, and let's figure out a slightly better set of parameters, 00:50:08.440 |
And a slightly better set of parameters and basically follow down 00:50:11.720 |
The the surface of the loss function downwards. It's kind of like a marble going down to find the minimum and 00:50:19.440 |
As you can see here depending on where you start you end up in different places 00:50:25.160 |
These things are called local minima. Now, interestingly, it turns out that for neural networks in particular, there aren't really multiple different 00:50:39.080 |
local minima; there's basically just one. Or, to think of it another way, 00:50:46.960 |
there are different parts of the space which are all equally good. 00:50:53.880 |
Gradient descent therefore turns out to be actually an excellent way to 00:50:58.400 |
Solve this problem of fitting parameters to neural networks 00:51:04.840 |
The problem is though that we need to do it in a reasonable amount of time and 00:51:09.480 |
It's really only thanks to GPUs that that's become possible 00:51:17.520 |
How many gigaflops per second can you get out of a 00:51:23.920 |
GPU (that's the red and green) versus a CPU (that's the blue)? Right, and this is on a log scale, 00:51:31.760 |
so you can see that, generally speaking, the GPUs are about ten times faster. 00:51:40.720 |
What's really interesting is that nowadays not only is the Titan X about 10 times faster than the E5 Xeon; 00:51:53.600 |
well, actually, a better one to look at would be the GTX 1080 Ti, which is far cheaper, 00:52:01.240 |
whereas the CPU, which is 10 times slower, costs over $4,000. 00:52:18.520 |
And also incredibly cheaply so they've been absolutely key in bringing these three pieces together 00:52:29.640 |
One more thing to cover: as I mentioned, in these neural networks you can intersperse multiple sets of linear and then nonlinear layers. 00:52:36.960 |
In the particular example that's drawn here, there's actually only one 00:52:43.560 |
of what we call hidden layers, one layer in the middle. And 00:52:46.480 |
Something that we learned in the last few years is that these kinds of neural networks although they do 00:52:53.200 |
Support the universal approximation theorem they can solve any given problem arbitrarily closely 00:52:59.320 |
They require an exponentially increasing number of parameters to do so 00:53:05.000 |
So they don't actually solve the fast and scalable for even reasonable size problems 00:53:10.240 |
But we've since discovered that if you create multiple hidden layers, 00:53:16.840 |
then you get super-linear scaling, so you can add a few more hidden layers and get 00:53:25.600 |
more accuracy on multiplicatively more complex problems. And 00:53:29.240 |
That is where it becomes called deep learning. So deep learning means a neural network with multiple hidden layers 00:53:36.680 |
So when you put all this together, it's actually really amazing what happens. 00:53:45.120 |
Google started investing in deep learning in 2012 00:53:53.200 |
They actually hired Geoffrey Hinton, who's kind of the father of deep learning, and his top student Alex Krizhevsky, 00:54:00.040 |
And they started trying to build a team that team became known as Google brain 00:54:09.680 |
Things with these three properties are so incredibly powerful and so incredibly flexible you can actually see over time 00:54:18.320 |
How many projects at Google use deep learning? 00:54:22.420 |
My graph here only goes up through a bit over a year ago 00:54:26.560 |
But it's I know it's been continuing to grow exponentially since then as well 00:54:30.920 |
And so what you see now is around Google that deep learning is used in like every part of the business 00:54:43.960 |
This kind of simple idea, that we can solve machine learning problems using an infinitely flexible function: 00:54:53.520 |
when a big company invests heavily in actually making that happen, 00:54:57.720 |
You see this incredible growth in how much it's used 00:55:01.640 |
So, for example, if you use the Inbox by Google software, 00:55:07.920 |
then when you receive an email from somebody, it will often suggest some replies 00:55:15.920 |
that it could send for you. And so it's actually using deep learning here to read the original email and to generate 00:55:24.240 |
some suggested replies. And so this is a really great example of the kind of stuff that deep learning makes possible. 00:55:33.640 |
Another great example: Microsoft has also, a little bit more recently, invested heavily in deep learning, and so now you can 00:55:43.800 |
use Skype, speaking into it in English, and ask it at the other end to 00:55:49.880 |
translate it in real time to Chinese or Spanish. And then when they talk back to you in Chinese or Spanish, 00:55:55.720 |
Skype will translate the speech in their language into English speech, in real time. 00:56:03.520 |
And again, this is an example of stuff which we can only do thanks to deep learning 00:56:11.880 |
I also think it's really interesting to think about how deep learning can be combined with human expertise 00:56:18.080 |
So here's an example of like drawing something just sketching it out 00:56:22.960 |
And then using a program called neural doodle 00:56:26.080 |
This is from a couple of years ago to then say please take that sketch and render it in the style of an artist 00:56:33.280 |
And so here's the picture that it then created 00:56:37.440 |
rendering it as, you know, an impressionist painting. And I think this is a really great example of how you can combine 00:56:46.480 |
human expertise and what computers are good at. 00:56:50.480 |
So a few years ago I decided to try this myself: what would happen if I took 00:57:02.080 |
deep learning and tried to use it to solve a really important problem? And so the problem I picked was diagnosing lung cancer, because if you find lung cancer early, 00:57:15.640 |
there's a ten times higher probability of survival. 00:57:20.040 |
So it's a really important problem to solve. I got together with three other people; none of us had any medical background. 00:57:33.960 |
We built an algorithm much like the dogs versus cats one we trained at the start of today's lesson, 00:57:46.480 |
and we ended up, after a couple of months, with something with a much lower 00:57:50.720 |
false negative rate and a much lower false positive rate than a panel of four radiologists. 00:57:55.800 |
And we went on to build this, in a startup, into a company called Enlitic, 00:58:01.600 |
which has really become pretty successful and 00:58:03.800 |
Since that time the idea of using deep learning for medical imaging has become 00:58:09.440 |
Hugely popular and it's being used all around the world 00:58:12.760 |
So what I've generally noticed is that, you know, the vast majority of 00:58:18.720 |
the kinds of things that people do in the world currently aren't using deep learning. 00:58:25.040 |
And then each time somebody says oh, let's try using deep learning to improve performance at this thing 00:58:30.880 |
They nearly always get fantastic results and then suddenly everybody in that industry starts using it as well 00:58:37.260 |
So there's just lots and lots of opportunities here at this particular time to use deep learning to help with all kinds of different stuff 00:58:45.000 |
So I've jotted down a few ideas here. These are all things which I know you can use 00:58:51.360 |
deep learning for right now to get good results from 00:58:57.720 |
They're things which people spend a lot of money on, or which have, you know, important business opportunities. 00:59:06.160 |
But these are some examples of things that maybe at your company you could think about applying deep learning for 00:59:15.880 |
What actually happened when we trained that deep learning model earlier? 00:59:21.760 |
And so as I briefly mentioned the thing we created is something called a convolutional neural network or CNN and 00:59:29.520 |
The key piece of a convolutional neural network is the convolution, which is what it's named after. 00:59:44.240 |
The Explained Visually website has an example of a convolution 00:59:50.760 |
kind of in practice over here in the bottom left is a very zoomed in picture of somebody's face and 00:59:56.600 |
Over here on the right is an example of using a convolution on that image 01:00:03.440 |
You can see here. This particular thing is obviously finding 01:00:08.120 |
Edges the edges of his head right top and bottom edges in particular 01:00:17.440 |
Now how is it doing that well if we look at each of these little three by three areas that this is moving over 01:00:23.520 |
It's taking each three by three area of pixels and here are the pixel values 01:00:28.380 |
right each thing in that three by three area and 01:00:31.440 |
it's multiplying each one of those three by three pixels by each one of these 01:00:40.400 |
kernel values. In a convolution, this specific set of nine values is called a kernel. 01:00:47.400 |
It doesn't have to be nine it could be four by four or five by five or two by two or whatever, right? 01:00:52.760 |
In this case, it's a three by three kernel and in fact in deep learning nearly all of our kernels are three by three 01:00:58.760 |
So in this case the kernel is 1 2 1, 0 0 0, -1 -2 -1. So we take each of the 01:01:07.240 |
black-through-white pixel values and we multiply, as you can see, each of them by the corresponding value in the kernel. 01:01:20.400 |
And so if you do that for every three by three area you end up with 01:01:26.040 |
The values that you see over here on the right hand side 01:01:33.640 |
Very low values become black, very high values become white, and so you can see when we're at an edge 01:01:43.960 |
We're obviously going to get higher numbers over here and vice versa. Okay, so that's a convolution 01:01:50.780 |
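The computation just described, sliding a three by three kernel over every patch, multiplying element-wise and summing, can be sketched in NumPy. This is a deliberately slow, readable loop, not how deep learning libraries actually implement convolutions, and the tiny "image" is made up just to show a horizontal edge:

```python
import numpy as np

# The three-by-three edge-detecting kernel from the example above
kernel = np.array([[ 1,  2,  1],
                   [ 0,  0,  0],
                   [-1, -2, -1]])

def conv2d(image, kernel):
    # Slide the kernel over every patch, multiply element-wise, and sum
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = (patch * kernel).sum()
    return out

# A tiny made-up "image": dark rows (0) on top, bright rows (9) below,
# so there is one horizontal edge between rows 1 and 2
img = np.array([[0, 0, 0, 0],
                [0, 0, 0, 0],
                [9, 9, 9, 9],
                [9, 9, 9, 9],
                [9, 9, 9, 9]])

print(conv2d(img, kernel))   # large responses near the edge, zero elsewhere
```

Running this, the output rows that straddle the dark-to-bright boundary get large values, while the uniform region at the bottom gives zero, which is exactly the edge-detecting behavior shown in the lecture.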
So as you can see it is a linear operation and so based on that definition of a neural net 01:01:57.720 |
I described before this can be a layer in our neural network. It is a simple linear operation 01:02:04.220 |
And we're going to look lots more at convolutions later, including building a little spreadsheet. 01:02:11.520 |
So the next thing we're going to do is we're going to add a nonlinear layer 01:02:16.280 |
so a nonlinearity as it's called is something which takes an input value and 01:02:25.480 |
Turns it into some different value in a nonlinear way and you can see this orange picture here is an example of a nonlinear 01:02:32.520 |
function specifically this is something called a sigmoid and 01:02:36.120 |
so a sigmoid is something that has this kind of s shape and 01:02:40.440 |
This is what we used to use as our nonlinearities in neural networks a lot 01:02:45.500 |
Actually nowadays we nearly entirely use something else called a relu or rectified linear unit 01:02:52.200 |
a relu is simply take any negative numbers and replace them with zero and 01:02:58.360 |
Leave any positive numbers as they are so in other words in code that would be 01:03:04.020 |
y = max(x, 0). So max(x, 0) simply says replace the negatives with 0. 01:03:20.000 |
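As a minimal NumPy sketch, the two nonlinearities mentioned here look like this:

```python
import numpy as np

def relu(x):
    # rectified linear unit: replace negatives with zero
    return np.maximum(x, 0)

def sigmoid(x):
    # the older s-shaped nonlinearity mentioned above
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(x))       # negatives become 0, positives pass through unchanged
print(sigmoid(x))    # every value squashed into the range (0, 1)
```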
Regardless of whether you use a sigmoid or a relu or something else 01:03:24.000 |
The key point about taking this combination of a linear layer followed by a element wise nonlinear function is 01:03:32.860 |
That it allows us to create arbitrarily complex shapes as you see in the bottom, right? 01:03:38.080 |
And the reason why (this is all from Michael Nielsen's neuralnetworksanddeeplearning.com, which is really fantastic) is that as 01:03:48.880 |
you change the values of your linear functions, 01:03:53.280 |
it basically allows you to kind of build these arbitrarily tall or thin blocks, and then combine those blocks together. 01:04:02.240 |
And this is actually the essence of the universal approximation theorem this idea that when you have a linear layer 01:04:10.400 |
Feeding into a nonlinearity you can actually create these arbitrarily complex shapes 01:04:16.160 |
So this is the key idea behind why neural networks can solve any computable problem 01:04:22.600 |
So then we need a way as we described to actually 01:04:28.600 |
Set these parameters so it's all very well knowing that we can move the parameters around manually to try to 01:04:36.520 |
Create different shapes, but we have some specific shape. We want how do we get to that shape? 01:04:42.680 |
And so as we discussed earlier the basic idea is to use something called gradient descent 01:04:48.280 |
This is an extract from a notebook actually one of the fast AI lessons 01:04:53.640 |
And it shows actually an example of using gradient descent to solve a simple linear regression problem 01:05:01.560 |
But I can show you the basic idea. Let's say you had a simple quadratic function, 01:05:13.000 |
and you were trying to find the minimum of this quadratic. 01:05:18.040 |
And so in order to find the minimum you start out by randomly picking some point, right? 01:05:27.120 |
And so you go up there and you calculate the value of your quadratic at that point 01:05:31.640 |
So what you now want to do is try to find a slightly better point 01:05:35.960 |
So what you could do is you can move a little bit to the left 01:05:40.680 |
And a little bit to the right to find out which direction is down and what you'll find out 01:05:46.840 |
Is that moving a little bit to the left decreases the value of the function so that looks good, right? 01:05:59.140 |
All right, so that tells you which way is down 01:06:04.760 |
It's the gradient. And so now that we know that going to the left is down we can take a small step in 01:06:13.800 |
To create a new point and then we can repeat the process and say okay 01:06:18.680 |
Which way is down now and we can now take another step and another step and another step another step another step, okay? 01:06:26.240 |
And each time we're getting closer and closer 01:06:29.520 |
So the basic approach here is to say okay. We start we're at some point. We've got some value X 01:06:36.440 |
Which is our current guess right that at time step n 01:06:41.080 |
So then our new guess at time step n+1 is just equal to our previous guess, minus the derivative at that point times some 01:07:00.200 |
small number, because we want to take a small step. 01:07:02.880 |
We need to pick a small number because if we picked a big number right then we say okay 01:07:09.240 |
We know we want to go to the left. Let's jump a big long way to the left 01:07:14.880 |
we might actually end up worse, right? And then we do it again, and end up worse still. So with too big a 01:07:25.960 |
step size, you can actually end up with divergence rather than convergence. 01:07:31.000 |
So this number here we're going to be talking about it a lot during this course 01:07:35.000 |
And we're going to be writing all this stuff out and code from scratch ourselves 01:07:37.760 |
But this number here is called the learning rate 01:07:50.560 |
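The update rule and learning rate described above can be sketched on a simple quadratic, f(x) = x², whose minimum is at zero; the starting point and learning rate below are arbitrary illustrative choices:

```python
# A minimal sketch of gradient descent on f(x) = x**2, whose minimum is
# at x = 0. The update rule is the one described above:
# new guess = previous guess - learning_rate * gradient.

def grad(x):
    return 2 * x              # derivative of f(x) = x**2

x = 5.0                       # some random starting guess
lr = 0.1                      # the learning rate

for step in range(50):
    x = x - lr * grad(x)      # take a small step downhill

print(round(x, 4))            # very close to the true minimum at 0

# With too large a learning rate (for this function, anything above 1.0),
# each step overshoots the minimum and the iterates diverge.
```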
This is an example of basically starting out with some random line and then using gradient descent to gradually make the line 01:07:59.760 |
So what happens when you combine these ideas, right: the convolution, 01:08:04.920 |
the nonlinearity, and gradient descent? Because they're all tiny, simple little things, it doesn't sound that exciting. 01:08:17.640 |
But with enough layers, something really interesting happens. 01:08:26.920 |
So this is a really interesting paper by Matt Zeiler and Rob Fergus, and what they did a few years ago 01:08:36.300 |
was they figured out how to basically draw a picture of what each layer in a deep learning network learned. 01:08:43.840 |
And so they showed that layer one of the network here are nine examples of convolutional filters from layer one of a trained network 01:08:53.960 |
and they found that some of the filters kind of learnt these diagonal lines or 01:08:58.620 |
Simple little grid patterns some of them learnt these simple gradients right and so for each of these filters 01:09:05.800 |
They show nine examples of little pieces of actual photos 01:09:10.840 |
Which activate that filter quite highly right so you can see layer one 01:09:16.600 |
These, remember, are learnt using gradient descent; these filters were not programmed, 01:09:22.720 |
they were learnt using gradient descent. So, in other words, the network was learning these basic feature detectors for itself. 01:09:37.040 |
so layer two then was going to take these as inputs and 01:09:41.360 |
combine them together. And so for layer two, 01:09:46.860 |
these are like nine attempts to draw one example each of the filters in layer two. 01:09:52.700 |
They're pretty hard to draw but what you can do is say for each filter 01:09:57.160 |
What are examples of little bits of images that activated them and you can see by layer two we've got? 01:10:03.640 |
Basically something that's being activated nearly entirely by little bits of sunset 01:10:07.920 |
something that's being activated by circular objects, something activated by 01:10:15.300 |
repeating horizontal lines, something that's being activated by corners. Right, so you can see how we're basically combining layer one features together. 01:10:24.600 |
So if we combine those features together and again, these are all 01:10:29.960 |
convolutional filters learnt through gradient descent, by the third layer it's actually learned to recognize the presence of text. 01:10:38.360 |
Another filter has learned to recognize the presence of petals 01:10:42.160 |
Another filter has learned to recognize the presence of human faces right so just three layers is enough to get some pretty 01:10:50.440 |
Rich behavior so but by the time we get to layer five 01:10:54.760 |
We've got something that can recognize the eyeballs of insects and birds 01:11:03.960 |
Right, so this is kind of how we start with something very simple and end up with very rich behavior, 01:11:14.440 |
thanks to the universal approximation theorem and the use of multiple hidden layers in deep learning. 01:11:27.120 |
So that is what we used when we actually trained our model earlier. 01:11:41.240 |
Let's talk more about this dog versus cat recognizer 01:11:44.840 |
So we've learned the idea that we can look at the pictures that come out of the other end to see what the model is 01:11:50.880 |
classifying well or classifying badly, or which ones it's unsure about. 01:11:56.240 |
But let's talk about like this key thing. I mentioned which is the learning rate 01:12:03.560 |
I just called it L before the learning rate and you might have noticed there's a couple of numbers these kind of magic numbers 01:12:09.960 |
Here the first one is the learning rate, right? 01:12:14.480 |
So this number is how much do you want to multiply the gradient by when you're taking each step in your gradient descent? 01:12:23.400 |
We already talked about why you wouldn't want it to be too high 01:12:26.440 |
Right, but probably also it's obvious to see why you wouldn't want it to be too low, right? If you had it too low 01:12:33.520 |
You would take like a little step and you'd be a little bit closer and a little bit step a little step little step 01:12:40.560 |
And it would take lots and lots and lots of steps and it would take too long 01:12:44.480 |
So setting this number well is actually really important, and it's been driving 01:12:53.120 |
deep learning researchers crazy, because they didn't really know a good way to choose it. 01:13:00.480 |
So the good news is, last year a researcher came up 01:13:07.000 |
with an approach to quite reliably set the learning rate. 01:13:10.880 |
Unfortunately, almost nobody noticed, so almost no deep learning researchers I know of are actually aware of this approach. 01:13:20.640 |
But it's incredibly successful and it's incredibly simple and I'll show you the idea 01:13:25.320 |
It's built into the fast AI library as something called LR find or the learning rate finder and it comes from this paper 01:13:35.400 |
Cyclical learning rates for training neural networks by a terrific researcher called Leslie Smith 01:13:50.960 |
The basic idea starts with something we've seen before: if we're going to optimize something, pick some random point. 01:13:50.960 |
Right, and then specifically he said: take a tiny, tiny step, 01:14:00.480 |
so a learning rate of like 1e-7, 01:14:05.720 |
right, and then do it again and again, but each time increase the learning rate, like, double it. 01:14:12.280 |
So then we try like 2e-7, 4e-7, 8e-7, 01:14:18.240 |
Right and so you can see what's going to happen. It's going to like 01:14:37.440 |
Start doing almost nothing right and it's going to then suddenly the loss function is going to improve very quickly 01:14:43.680 |
Right, but then it's going to step even further again 01:14:51.320 |
Right, let's draw the rest of that line to be clear 01:14:56.280 |
Right and so suddenly it's then going to shoot off and get much worse 01:15:10.600 |
At what point did we see like the best improvement? 01:15:27.080 |
We've got our best improvement right and so we'd say okay. Let's use that 01:15:32.680 |
Learning rate right so in other words if we were to plot 01:15:47.040 |
Right and so what we then want to do is we want to plot 01:15:53.160 |
against the loss. So when I say the loss, I basically mean how accurate is the model; in this case the loss 01:16:01.080 |
would be how far away is the prediction from the right answer 01:16:07.560 |
Right and so if we plotted the learning rate against the loss we'd say like okay initially it didn't do very much 01:16:14.880 |
Right for small learning rates, and then it suddenly improved a lot and then it suddenly got a lot worse 01:16:22.360 |
So that's the basic idea and so we'd be looking for the point where this graph is 01:16:29.920 |
Dropping quickly right we're not looking for its minimum point 01:16:33.000 |
We're not saying like where was at the lowest because that could actually be the point where it's just jumped too far 01:16:46.280 |
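That procedure can be illustrated with a minimal sketch. This is a toy one-parameter model, not the fastai implementation: keep taking gradient steps while doubling the learning rate, record the loss before each step, then pick the rate at which the loss fell fastest.

```python
# Toy parameter; loss(x) = x**2, gradient = 2*x.
x = 10.0
lr = 1e-7
lrs, losses = [], []
while lr < 1.0:
    lrs.append(lr)
    losses.append(x * x)    # loss before taking the step
    x = x - lr * 2 * x      # one gradient-descent step at this rate
    lr = lr * 2             # double the rate for the next step

# The "best" rate is where the loss dropped most in a single step,
# not where the loss itself was lowest.
drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
best_lr = lrs[drops.index(max(drops))]
```

At tiny rates the loss barely moves, then it starts falling faster and faster; in this toy run the steepest single-step drop happens around a learning rate of 0.2.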
So if you create your learn objects in the same way that we did before we'll be learning more about this these details shortly 01:16:54.320 |
If you then call LR find method on that you'll see that it'll start training a model 01:17:01.360 |
like it did before, but it'll generally stop before it gets to a hundred percent, because if it notices the loss getting worse 01:17:12.640 |
then it'll stop automatically. So you can see here it stopped at 84%, and so then you can call 01:17:19.440 |
learn.sched; that gets you the learning rate scheduler 01:17:22.680 |
That's the object which actually does this learning rate finding, and that object has a plot_lr function 01:17:28.240 |
And so you can see here by iteration you can see the learning rate 01:17:32.680 |
All right, so you can see each step the learning rate is getting bigger and bigger 01:17:36.640 |
When you do it this way, you can see it's increasing exponentially 01:17:41.880 |
Another way that Leslie Smith the researcher suggests is to do it linearly 01:17:47.560 |
So I'm actually currently researching with both of these approaches to see which works best 01:17:51.720 |
Recently I've been mainly using exponential, but I'm starting to look more at using linear at the moment 01:17:57.200 |
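The two ramps being compared can be sketched like this. The function names are illustrative, not part of the fastai API:

```python
# Two ways to ramp the learning rate during the finder.
def exp_ramp(start, stop, n):
    """Multiply by a fixed factor each step (exponential spacing)."""
    mult = (stop / start) ** (1.0 / (n - 1))
    return [start * mult ** i for i in range(n)]

def lin_ramp(start, stop, n):
    """Add a fixed amount each step (linear spacing)."""
    step = (stop - start) / (n - 1)
    return [start + step * i for i in range(n)]

exp_lrs = exp_ramp(1e-5, 1e-1, 5)  # each value 10x the previous
lin_lrs = lin_ramp(1e-5, 1e-1, 5)  # equal-sized additive steps
```

Exponential spacing spends most of its steps at small rates, which is why it gives a readable plot over many orders of magnitude; linear spacing samples the large rates more densely.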
And so if we then call sched.plot, that does the plot that I just described: learning rate against 01:18:06.760 |
loss. All right, and so we're looking for the highest learning rate we can find 01:18:16.240 |
that's still improving clearly, right, and so in this case I would say 01:18:20.400 |
10 to the negative 2, because at 10 to the negative 1 it's not improving 01:18:25.200 |
anymore. All right, at 10 to the negative 3 it is also improving 01:18:28.920 |
but I'm trying to find the highest learning rate I can where it's still clearly improving 01:18:33.160 |
So I'd say 10 to the negative 2 right so you might have noticed that when we ran our model before we had 01:18:40.240 |
10 to the negative 2 0.01. So that's why we picked that learning rate 01:18:45.940 |
So there's really only one other number that we have to pick and 01:18:53.700 |
That was this number 3 and so that number 3 controlled how many 01:19:02.100 |
epochs that we run. So an epoch means going through our entire data set of images 01:19:11.820 |
once. Each time, we do what are called mini-batches: we grab like 01:19:17.340 |
64 images at a time and use them to try to improve the model a little bit using gradient descent 01:19:23.260 |
Right, and using all of the images once is called one epoch 01:19:27.420 |
and so at the end of each epoch we print out the accuracy and 01:19:32.900 |
Validation and training loss at the end of the epoch 01:19:41.780 |
how many epochs should we run is kind of the one other question that you need to answer to run these three lines of code and 01:19:55.340 |
What you might find happen is, if you run it for too long, the accuracy will start getting worse 01:20:01.640 |
Right and we'll learn about that why later. It's something called overfitting right so 01:20:06.900 |
You can run it for a while, run lots of epochs, until it starts getting worse, and then 01:20:12.060 |
you know how many epochs you can run. The other thing that might happen is, if you've got like a really big model 01:20:17.780 |
or lots and lots of data, maybe it takes so long you don't have time, and so you just run enough epochs to 01:20:23.500 |
fit into the time you have available. So the number of epochs you run, you know, that's a pretty easy thing to set 01:20:29.580 |
So they're the only two numbers you're going to have to set and so the goal 01:20:34.860 |
This week will be to make sure that you can run 01:20:39.580 |
Not only these three lines of code on the data that I provided 01:20:43.820 |
But to run it on a set of images that you either have on your computer or that you 01:20:50.360 |
Get from work or that you download from Google 01:20:53.420 |
And try to get a sense of which kinds of images it seems to work well for 01:21:02.860 |
What kind of learning rates do you need for different kinds of images how many epochs do you need? 01:21:09.540 |
How does the learning rate change the accuracy you get, and so forth? Like, really experiment, and then 01:21:16.420 |
You know try to get a sense of like what's inside this data object? 01:21:21.140 |
You know what are the y values look like what are these classes mean? 01:21:24.980 |
If you're not familiar with numpy, you know, really practice a lot with numpy, so that by the time you come back for the next lesson 01:21:33.660 |
You know we're going to be digging into a lot more detail, and so you'll really feel ready to do that 01:21:39.060 |
now one thing that's really important to be able to do that is that you need to really know how to use 01:21:47.780 |
numpy, the fastai library, and so forth, and so I want to show you some tricks in Jupyter notebook to make that much easier 01:21:55.580 |
So one trick to be aware of is, if you can't quite remember how to spell something 01:22:04.860 |
(what the method you want is called), you can always hit tab 01:22:10.220 |
and it will list the methods that start with those letters, right, and so that's a quick way to find things 01:22:14.900 |
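Under the hood, tab-completion is essentially prefix-matching on an object's attribute names, which you can reproduce in plain Python:

```python
# Mimic tab-completion: list the names on an object that start with
# the letters typed so far, using dir().
import math

prefix = "co"
matches = sorted(name for name in dir(math) if name.startswith(prefix))
print(matches)  # cos, cosh, copysign, and so on
```

The notebook does the same thing against whatever object sits before the dot, which is why completion works on your own variables too.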
If you then can't remember what the arguments are to a method hit shift tab 01:22:20.360 |
All right, so hitting shift-tab tells you the arguments to the method, so shift-tab is like one of the most helpful things 01:22:35.860 |
Press shift-tab twice, and now, if you're wondering, okay, what does this function do and how does it work, it 01:22:48.220 |
shows you what the parameters are, and shows you what it returns, and gives you examples 01:22:58.380 |
If you press it three times, then it actually pops up a whole little separate window with that information 01:23:05.860 |
One way to grab that window straight away is if you just put question mark at the start 01:23:12.540 |
Then it just brings up that little documentation window 01:23:16.660 |
Now the other thing to be aware of is increasingly during this course 01:23:22.500 |
We're going to be looking at the actual source code of fast AI itself and learning how it's built and why it's built that way 01:23:29.660 |
It's really helpful to look at source code in order to you know 01:23:33.740 |
Understand what you can do and how you can do it 01:23:36.540 |
So if you, for example, wanted to look at the source code for learn.predict, you can just put two question marks 01:23:42.400 |
Okay, and you can see it's popped up the source code right and so it's just a single line of code 01:23:50.300 |
You'll very often find that fast AI methods like they're they're designed to never be more than 01:23:57.420 |
About half a screen full of code and they're often under six lines so you can see this case 01:24:02.660 |
It's calling a function called predict_with_targs, so we could then get the source code for that in the same way 01:24:10.940 |
or get the documentation for it in the same way, and 01:24:16.340 |
so here we are, and finally, that's what it does: it iterates through a data loader, gets the predictions, and then passes them back 01:24:26.980 |
So: two question marks is how to get source code, a single question mark is how to get documentation, and 01:24:33.660 |
shift-tab is how to bring up parameters, or press it more times for more detail 01:24:43.020 |
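The notebook's shift-tab, `?`, and `??` are built on Python's own introspection, so outside Jupyter the standard `inspect` module gives you the same three kinds of information. The `predict` function here is a made-up stand-in, not the fastai one:

```python
import inspect

def predict(model, data):
    """Toy stand-in for a library function."""
    return [model(x) for x in data]

sig = str(inspect.signature(predict))    # like shift-tab: the parameters
doc = inspect.getdoc(predict)            # like '?': the documentation
src = inspect.getsource(inspect.getdoc)  # like '??': a function's source code
print(sig)
```

This is also handy in scripts and debuggers, where the notebook shortcuts aren't available.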
Another really helpful thing to know about is how to use Jupyter notebook well and the button that you want to know is H 01:24:50.180 |
If you press H, it will bring up the keyboard shortcuts 01:24:54.940 |
Palette and so now you can see exactly what Jupyter notebook can do and how to do it 01:25:00.500 |
I personally find all of these functions useful 01:25:03.680 |
So I generally tell students to try and learn four or five different keyboard shortcuts a day 01:25:08.960 |
Try them out see what they do see how they work, and then you can try practicing in that session 01:25:14.940 |
And one very important thing to remember: when you're finished with your work for the day, go back to Paperspace and click on the button 01:25:24.100 |
which stops and starts the machine. So after it's stopped, you'll see it says connection closed, and you'll see it's off 01:25:30.460 |
If you leave it running, you'll be charged for it. Same thing with Crestle: be sure to go to your Crestle 01:25:37.060 |
instance and stop it. You can't just turn your computer off or close the browser 01:25:43.020 |
You actually have to stop it in Crestle or in Paperspace, and don't forget to do that 01:25:53.220 |
Okay, so I think that's all of the information that you need to get started. Please remember about the forums 01:26:04.500 |
but before you ask there, make sure you read the information on course.fast.ai for each lesson 01:26:11.020 |
All right, because that is going to tell you about things that have changed. Okay, so if there's been some change to the 01:26:20.900 |
Jupyter notebook provider we suggest using, or how to set up Paperspace, or anything like that 01:26:28.780 |
Okay, thanks very much for watching, and I look forward to seeing you in the next lesson