Lesson 1: Deep Learning 2018
Chapters
0:00 Introduction
1:33 Community
2:18 Coding
4:02 Jupyter Notebook
6:00 Paperspace
12:57 Running a cell
15:07 Running more cells
20:21 Training a model
29:04 Training an image classifier
30:50 Top-down approach
33:48 Lesson plan
39:22 Advice from past students
41:26 Image classifiers
44:23 Deep learning vs machine learning
47:33 An infinitely flexible function
48:43 The neural network
49:38 Gradient descent
51:03 GPU vs CPU
52:27 Hidden Layers
53:38 Google Brain
55:03 Google Inbox
55:33 Microsoft Skype
56:10 Neural Doodle
56:52 My Personal Experience
58:45 Deep Learning Ideas
00:00:00.000 |
Hi everybody, welcome to practical deep learning for coders. This is part one of our two-part course 00:00:10.720 |
Presenting this from the Data Institute in San Francisco 00:00:14.020 |
We'll be doing seven lessons in this part of the course 00:00:19.620 |
Most of them will be about a couple of hours long this first one may be a little bit shorter 00:00:26.760 |
Practical deep learning for coders is all about getting you up and running with deep learning in practice 00:00:32.080 |
Getting world-class results and it's a really coding focused approach as the name suggests 00:00:38.700 |
but we're not going to dumb it down by the end of the course you all have learned all of the 00:00:43.220 |
Theory and details that are necessary to rebuild all of the world-class results. We're learning about from scratch 00:00:49.640 |
Now I should mention that our videos are hosted on YouTube 00:00:55.600 |
But we strongly recommend watching them via our website at course.fast.ai 00:01:00.900 |
Although they're exactly the same videos, the important thing about watching them through our website 00:01:07.520 |
Is that you'll get all of the information you need about updates to libraries, file locations, 00:01:14.160 |
Further information, frequently asked questions and so forth 00:01:17.980 |
So if you're currently on YouTube watching this, why don't you switch over to course.fast.ai now and start watching through there? 00:01:25.480 |
And make sure you read all of the material on the page before you start just to make sure that you've got everything you need 00:01:31.460 |
The other thing to mention is that there is a really great, strong community at forums.fast.ai 00:01:39.020 |
From time to time you'll find that you get stuck 00:01:44.560 |
You may get stuck very early on you may not get stuck for quite a while, but at some point you might get stuck with understanding 00:01:52.680 |
Why something works the way it does or there may be some computer problem that you have or so forth 00:01:58.200 |
On forums.fast.ai there are thousands of other learners talking about every lesson and lots of other topics besides 00:02:06.040 |
It's the most active deep learning community on the internet by far. So 00:02:10.220 |
Definitely register there and start getting involved. You'll get a lot more out of this course if you do that 00:02:19.680 |
So we're going to start by doing some coding. This is an approach 00:02:24.520 |
We're going to be talking about in a moment called the top-down approach to study 00:02:29.160 |
But let's learn it by doing it. So let's go ahead and try and actually train a neural network 00:02:36.320 |
Now in order to train a neural network, you almost certainly want a GPU 00:02:42.140 |
A GPU is a graphics processing unit 00:02:47.640 |
It's the things that companies use to help you play games better 00:02:53.240 |
They let your computer render the game much more quickly than your CPU can 00:02:59.760 |
We'll be talking about them more shortly. But for now, I'm going to show you how you can get access to a GPU 00:03:07.160 |
Specifically you're going to need an Nvidia GPU because only Nvidia GPUs support something called CUDA 00:03:16.800 |
CUDA is the language and framework that nearly all deep-learning 00:03:20.780 |
libraries and practitioners use to do their work 00:03:25.160 |
Obviously, it's not ideal that we're stuck with one particular vendor's cards, and over time 00:03:31.480 |
We hope to see more competition in this space. But for now, we do need an Nvidia GPU 00:03:35.780 |
Your laptop almost certainly doesn't have one unless you specifically went out of your way to buy like a gaming laptop 00:03:44.840 |
So almost certainly you will need to rent one 00:03:52.060 |
Paying by the second for a GPU based computer is pretty easy and pretty cheap 00:04:14.600 |
Click on sign up or if you've been there before sign in 00:04:17.240 |
You will find yourself at this screen, which has a big button that says Start Jupyter and another switch called Enable GPU 00:04:25.560 |
So we make sure Enable GPU is switched on, and we click Start Jupyter 00:04:35.880 |
It's going to launch us into something called Jupyter Notebook 00:04:41.040 |
Jupyter Notebook, in a recent survey of tens of thousands of data scientists, was rated as the third most important tool 00:04:48.920 |
In the data scientist's toolbox. It's really important that you get to learn it well, and all of our courses will be run through Jupyter 00:04:55.760 |
Yes, Rachel. You have a question or comment? Oh, I just wanted to point out that you get I believe 10 free hours 00:05:07.320 |
Yeah, they might have changed that recently to fewer hours, but you can check the pricing 00:05:14.680 |
The pricing varies because this actually runs on top of Amazon Web Services. So at the moment, it's 60 cents an hour 00:05:21.680 |
The nice thing is, though, that you can always 00:05:26.240 |
Start your Jupyter without the GPU running and pay a tenth of that price, which is pretty cool 00:05:34.160 |
So Jupyter Notebook is something we'll be doing all of this course in, and so to get started here 00:05:39.160 |
we're going to find our particular course, so we'd go to courses and 00:05:49.440 |
Things have been moving around a little bit. So it may be in a different spot for you 00:05:53.400 |
When you look at this, and we'll make sure all the current information is on the website 00:06:00.000 |
Now having said that, the Crestle approach is, as you can see, basically instant and easy 00:06:08.000 |
But if you've got, you know, an extra hour or so to get going, an even better option is Paperspace 00:06:19.440 |
Paperspace, unlike Crestle, doesn't run on top of Amazon; they have their own machines 00:06:29.800 |
So here's Paperspace, and if I click on New Machine I 00:06:38.400 |
Can pick which one of their three data centers to use, so pick the one closest to you. So I'll say West Coast and 00:06:45.160 |
Then I'll say Linux, and I'll say Ubuntu 16.04 00:06:50.560 |
And then it says choose machine and you can see there's various different machines I can choose from 00:06:59.760 |
So this is pretty cool: for 40 cents an hour, so it's cheaper than Crestle, 00:07:05.680 |
I get a machine that's actually going to be much faster than Crestle's 60-cents-an-hour machine; or for 65 cents an hour 00:07:15.040 |
So I'm going to actually show you how to get started with the Paperspace approach 00:07:20.000 |
Because that actually is going to do everything from scratch 00:07:25.400 |
You may find if you try to do the 65-cents-an-hour one that it may require you to contact Paperspace to say 00:07:32.520 |
Like, why do you want it? That's just an anti-fraud thing. So if you say fast.ai there 00:07:40.880 |
They'll quickly get you up and running. So I'm going to use the cheapest one here, 40 cents an hour 00:07:52.720 |
Note that you pay for a month of storage as soon as you start the machine up 00:07:56.880 |
Right, so don't start and stop lots of machines because each time you pay for that month of storage 00:08:01.160 |
I think the 250 gig seven dollar a month option is pretty good 00:08:05.680 |
But you really only need 50 gig, so if you're trying to minimize the price you can go there 00:08:09.560 |
The only other thing you need to do is turn on public IP so that we can actually log into this and 00:08:17.520 |
We can turn off auto snapshot to save the money of not having backups 00:08:21.700 |
All right, so if you then click on create your paper space about a minute later you will find 00:08:33.900 |
That your machine will pop up. Here is my Ubuntu 16.04 machine 00:08:43.780 |
You will find that they have emailed you a password so you can copy that 00:08:51.880 |
You can go to your machine and enter your password now to paste the password 00:08:56.760 |
You would press Ctrl-Shift-V, or on Mac I guess Cmd-Shift-V 00:09:01.920 |
So it's slightly different to normal pasting or of course you can just type it in 00:09:07.400 |
And here we are. Now we can make a little bit more room here by clicking on these little arrows 00:09:17.720 |
And so as you can see we've got like a terminal that's sitting inside 00:09:22.240 |
Our browser which is kind of quite a handy way to do it 00:09:26.000 |
So now we need to configure this for the course, and the way you configure it for the course is you type 00:09:36.760 |
http://files.fast.ai/setup/paperspace 00:09:49.640 |
Okay, and so that's then going to run a script which is going to set up all of the CUDA drivers, 00:09:59.280 |
The Python distribution we use called Anaconda, all of the libraries, all of the courses, 00:10:06.440 |
And the data we use for the first part of the course 00:10:10.280 |
Okay, so that takes an hour or so, and when it's finished running you'll need to reboot, not your own computer, 00:10:20.240 |
But your Paperspace computer. And so to do that you can just click on this little circular restart machine button 00:10:26.080 |
Okay, and when it comes back up you'll be ready to go. So what you'll find 00:10:31.040 |
Is that you've now got an anaconda3 directory. That's where your Python is 00:10:37.400 |
You've got a data directory which contains the data for the first lesson of this part of the course, which is the dogs and cats data set 00:10:55.320 |
cd fastai, and from time to time you should run git pull, and that will just make sure that all of your 00:11:04.040 |
fastai stuff is up to date, and also from time to time 00:11:07.960 |
You might want to just check that your Python libraries are up to date, and so you can type conda env update 00:11:15.960 |
Alright, so make sure that you've cd'd into fastai, and then you can type jupyter notebook 00:11:28.000 |
So we now have a Jupyter Notebook server running, and we want to connect to that, right? And so you can see here 00:11:36.720 |
Into your browser when you connect so if you double click on it 00:11:48.160 |
Then you can go and paste it, but you need to change this localhost 00:11:53.680 |
To be the Paperspace IP address. So if you click on the little arrows to go smaller 00:12:08.800 |
So it's now http and then my IP and then everything else I copied before, and so there it is 00:12:19.360 |
Git repo, and our courses are all in courses, and in there the deep learning part one is dl1, and 00:12:41.400 |
Depending whether you're using Crestle or Paperspace or something else, if you check course.fast.ai 00:12:46.560 |
We'll keep putting additional videos and links to information about how to set up other 00:13:01.500 |
You select the cell and you hold down shift and press enter or if you've got the toolbar showing 00:13:08.480 |
You can just click on the little run button, so you'll notice that some cells contain 00:13:15.040 |
Code and some contain text and some contain pictures and some contain videos so this environment basically has 00:13:22.840 |
You know, it's a way that we can give you access to run 00:13:29.260 |
Experiments and to kind of tell you what's going on, show pictures 00:13:33.660 |
This is why it's like a super popular tool in data science; data science is kind of all about running experiments 00:13:46.400 |
And you'll see the number next to that cell turn into a star for a moment, and then it finished running 00:13:52.400 |
Okay, so let's try the next one this time instead of using the toolbar. I'm going to hold down shift and press enter 00:14:00.080 |
It turned into a star and then it said 2. So if I hold down shift and keep pressing enter, it just keeps running each 00:14:06.080 |
Cell right so I can put anything I like for example one plus one 00:14:17.840 |
Yes, Rachel. Oh, this is just a side note, but I wanted to point out that we're using Python 3 here 00:14:24.400 |
Yes, thank you, Python 3, and so you'll get some errors if you're still using Python 2. Mm-hmm. Yeah 00:14:29.560 |
And it is important to switch to Python 3 now; well, for fast.ai it's required 00:14:37.480 |
But you know increasingly a lot of libraries are 00:14:47.400 |
Now it mentions here that you can download the data set for this lesson from this location 00:14:58.360 |
If you used the Paperspace script that we just set up with, this will already be made available for you 00:15:03.680 |
Okay, if you're not, you'll need to wget it 00:15:10.480 |
Crestle is quite a bit slower than Paperspace, and also it 00:15:14.040 |
There are some particular things it doesn't support that we really need, and so there are a couple of extra steps if you're using 00:15:21.600 |
Crestle. You have to run two more cells, right? So you can see these are commented out 00:15:27.320 |
So if you remove the hashes from these and run these two additional cells, that just runs the stuff that you only 00:15:34.320 |
Need for Crestle. I'm using Paperspace, so I'm not going to run it 00:15:43.600 |
Data. So we set up this path to data/dogscats 00:15:47.880 |
That's pre-set-up for you, and so inside there, you can see here, I can use an exclamation mark 00:15:55.800 |
To basically say I don't want to run Python, I want to run bash, 00:15:59.480 |
I want to run a shell. So this runs a bash command, and the bit inside the curly brackets 00:16:05.720 |
Actually refers to a Python variable, so it inserts that Python variable into the bash command 00:16:13.800 |
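In the notebook, a line like `!ls {PATH}` (the variable name here is illustrative) uses two Jupyter features: the `!` runs a bash command rather than Python, and the curly brackets splice a Python variable into that command. A rough plain-Python sketch of the same idea, using a throwaway temp directory:

```python
import os
import tempfile

# PATH is a hypothetical stand-in for the course's data path variable
PATH = tempfile.mkdtemp()
os.makedirs(os.path.join(PATH, "train"))
os.makedirs(os.path.join(PATH, "valid"))

# The notebook's `!ls {PATH}` does roughly this: list the directory
# named by the Python variable PATH
contents = sorted(os.listdir(PATH))
print(contents)  # → ['train', 'valid']
```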
There's a training set and a validation set if you're not familiar with the idea of training sets and validation sets 00:16:21.040 |
It would be a very good idea to check out our machine learning course 00:16:27.040 |
Which tells you a lot about this kind of stuff, like the basics of how to set up and run machine learning 00:16:36.360 |
Would you recommend that people take that course before this one? 00:16:40.340 |
Actually, a lot of students, as they went through these, said they liked doing them together 00:16:53.320 |
Yeah, they cover some similar stuff but all in different directions, so people who have done both say they find that 00:17:01.760 |
They each support each other. I wouldn't say it's a prerequisite 00:17:05.720 |
But you know if I do if I say something like hey 00:17:08.760 |
This is a training set and this is a validation set and you're going I don't know what that means 00:17:12.000 |
At least Google it do a quick read you know because we're assuming 00:17:15.480 |
That you know the very basics of kind of what machine learning is and does to some extent 00:17:23.260 |
And I have a whole blog post on this topic as well 00:17:26.320 |
Okay, and we'll make sure that you link to that from course.fast.ai 00:17:29.680 |
And I also just wanted to say in general with fast.ai our philosophy is to 00:17:34.080 |
Kind of learn things on an as-needed basis. Yeah exactly don't try and learn everything that you think you might need first 00:17:41.560 |
Otherwise you'll never get around to learning the stuff you actually want to learn 00:17:44.360 |
Exactly and that shows up in deep learning. I think 00:17:53.560 |
There's a cats folder and a dogs folder and then inside the validation cats folder is a whole bunch of JPEGs 00:18:00.400 |
The reason that it's set up like this is that this is kind of the most common standard approach for how 00:18:06.940 |
Image classification data sets are shared and provided, and the idea is that each of the images in the cats folder 00:18:17.640 |
Is labeled cats and each of the images in the dogs folder is labeled dogs. Okay? 00:18:26.560 |
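The layout being described, with illustrative file names (the exact directory name is whatever your data path is), looks something like:

```
dogscats/
├── train/
│   ├── cats/    cat.1.jpg, cat.2.jpg, ...   (every image here is labeled "cats")
│   └── dogs/    dog.1.jpg, dog.2.jpg, ...   (every image here is labeled "dogs")
└── valid/
    ├── cats/
    └── dogs/
```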
So this is a pretty standard way to share image classification 00:18:40.800 |
We can see an example of the first of the cats 00:18:49.920 |
This is a Python 3.6 format string, so you can Google for that if you haven't seen it 00:18:54.200 |
It's a very convenient way to do string formatting, and we use it a lot 00:18:57.080 |
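As a quick, generic illustration of Python 3.6 format strings (the folder and file names here are made up):

```python
# An f-string evaluates whatever is inside {} and splices it into the string
folder = "valid/cats"
fname = "cat.1.jpg"
print(f"{folder}/{fname}")   # → valid/cats/cat.1.jpg
print(f"2 + 2 = {2 + 2}")    # → 2 + 2 = 4
```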
So there's our cat, but we're going to mainly be interested in the underlying data that makes up that cat 00:19:07.760 |
It's an image whose shape, that is, the dimensions of the array, is 198 by 179 by 3 00:19:15.080 |
So it's a three-dimensional array also called a rank 3 tensor 00:19:18.520 |
And here are the first four rows and four columns of that image 00:19:30.640 |
Items in it, and this is the red green and blue pixel values between 0 and 255 00:19:37.160 |
So here's a little subset of what a picture actually looks like inside your computer 00:19:43.320 |
So that's that. Our idea will be to take these kinds of numbers and 00:19:48.520 |
Use them to predict whether those kinds of numbers represent a cat 00:19:52.340 |
Or a dog based on looking at lots of pictures of cats and dogs 00:19:56.640 |
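To make the point that an image is just an array of numbers, here is a synthetic stand-in built with NumPy (a real photo would be loaded from disk; the shape matches the cat image in the lesson):

```python
import numpy as np

# A fake 198 x 179 RGB image: rows x columns x (red, green, blue) channels,
# with pixel values between 0 and 255
img = np.random.randint(0, 256, size=(198, 179, 3), dtype=np.uint8)

print(img.shape)    # → (198, 179, 3), a rank 3 tensor
print(img[:4, :4])  # the first four rows and columns, as shown in the lesson
```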
So that's a pretty hard thing to do, and at the point in time when 00:20:02.480 |
This data set was released (it actually comes from a Kaggle competition, the Dogs vs. Cats Kaggle competition) 00:20:11.720 |
The state of the art was 80% accuracy, so computers weren't really able to accurately recognize dogs versus cats at all 00:20:29.960 |
Here are the three lines of code necessary to train a model 00:20:34.680 |
And so let's go ahead and run it so I'll click on this on the cell. I'll press shift enter 00:20:42.160 |
Then we'll wait a couple of seconds for it to pop up and there it goes 00:20:51.660 |
So I've asked it to do three epochs so that means it's going to look at every image 00:20:55.440 |
Three times in total or look at the entire set of images three times 00:20:59.560 |
That's what we mean by an epoch and as we do it's going to print out 00:21:05.880 |
The accuracy; it's the last of the three numbers that prints out, measured on the validation set, okay? 00:21:14.120 |
In short, they're the value of the loss function, which is in this case the cross-entropy loss, 00:21:18.520 |
For the training set and the validation set, and then right at the start here is the epoch number 00:21:29.480 |
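As a rough sketch of what an epoch is, here is a toy training loop: plain gradient descent on a one-parameter model, which has nothing to do with fastai's internals but shows "three passes over the data, printing the loss each time":

```python
import numpy as np

# Toy data generated from y = 3x; we try to recover the weight 3
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 3.0 * xs
w, lr = 0.0, 0.01

for epoch in range(3):              # 3 epochs = 3 full passes over the data
    for x, y in zip(xs, ys):        # each epoch looks at every example once
        grad = 2 * (w * x - y) * x  # derivative of squared error w.r.t. w
        w -= lr * grad
    loss = np.mean((w * xs - ys) ** 2)
    print(epoch, loss)              # the loss shrinks from one epoch to the next
```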
And it took 17 seconds so you can see we've come a long way since 00:21:38.460 |
This actually would have won the Kaggle competition of that time the best in the Kaggle competition was 98.9 00:21:48.160 |
So this may surprise you, that we're getting a 00:21:52.680 |
Kaggle-winning, as of end of 2012, early 2013, 00:21:57.480 |
Image classifier in 17 seconds 00:22:07.880 |
And I think that's because like a lot of people assume that deep learning takes a huge amount of time 00:22:14.800 |
And lots of resources and lots of data and as you'll learn in this course 00:22:23.400 |
One of the ways we've made it much simpler is that this code is written on top of a library we built 00:22:34.560 |
The fastai library is basically a library which takes all of the 00:22:39.960 |
Best-practice approaches that we can find, and so each time a paper comes out that looks interesting, 00:22:46.960 |
We test it out. If it works well for a variety of data sets and we can figure out how to tune it, 00:22:52.020 |
We implement it in fastai. And so fastai kind of curates all this stuff and packages it up for you and 00:22:58.480 |
Much of the time or most the time kind of automatically figures out the best way to handle things 00:23:03.420 |
So the fastai library is why we were able to do this in just three lines of code 00:23:07.560 |
And the reason that we were able to make the fastai library work 00:23:11.760 |
So well is because it in turn sits on top of something called PyTorch, 00:23:18.680 |
A really flexible deep learning and machine learning and GPU computation library written by Facebook 00:23:27.600 |
Most people are more familiar with TensorFlow than PyTorch, because Google markets it pretty heavily 00:23:33.960 |
But most of the top researchers I know nowadays, at least the ones that aren't at Google, have switched across to PyTorch 00:23:40.680 |
Yes, Rachel? And we'll be covering some PyTorch later in the course. Yeah, I mean, one of the things that 00:23:46.880 |
Hopefully you'll really like about fastai is that it's really flexible: you can use all these kind of curated best practices as 00:23:56.560 |
Much or as little as you want, and so it's really easy to hook in at any point and write your own 00:24:02.040 |
Data augmentation, write your own loss function, write your own network architecture, whatever. And so we'll do all of those things 00:24:17.640 |
Take a look at: so what does the validation set's 00:24:22.560 |
Dependent variable, the y, look like? And it's just a bunch of zeros and ones, right? 00:24:27.160 |
So if we look at data.classes, the zeros represent cats and the ones represent dogs 00:24:32.760 |
You'll see here there's basically two objects I'm working with: one is an object called data, 00:24:36.980 |
Which contains the validation and training data, and another one is the object called learn, which contains the model, right? 00:24:44.120 |
So anytime you want to find something out about the data, we can look inside data 00:24:49.320 |
So we want to get predictions for the validation set, and so to do that we can call learn.predict 00:24:57.760 |
So you can see here the first ten predictions, and what it's giving you is a prediction for dog and a prediction for cat 00:25:05.200 |
Now, the way PyTorch generally works, and therefore fastai also works, is that most models return the log 00:25:14.280 |
Of the predictions rather than the probabilities themselves. We'll learn why that is later in the course 00:25:19.900 |
So for now, recognize that to get your probabilities you have to exponentiate the log predictions 00:25:26.600 |
You'll see here we're using NumPy. np is numpy; if you're not familiar with numpy, 00:25:32.720 |
That is one of the things that we assume that you have some familiarity with 00:25:36.400 |
So be sure to check out the material on course.fast.ai to learn the basics of numpy 00:25:48.080 |
It's what we use for fast numerical programming, array computation, that kind of thing 00:25:54.860 |
Okay, so we can get the probabilities using that 00:26:02.120 |
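Since the model returns log probabilities, the pattern is to exponentiate and then threshold. This is a generic NumPy sketch with made-up numbers, not the library's exact output format:

```python
import numpy as np

# Pretend log probabilities for 3 validation images; columns are [cat, dog]
log_preds = np.log(np.array([[0.9, 0.1],
                             [0.2, 0.8],
                             [0.6, 0.4]]))

probs = np.exp(log_preds[:, 1])     # probability of "dog" (column 1)
is_dog = (probs > 0.5).astype(int)  # anything greater than 0.5 we call a dog

print(probs)   # roughly [0.1 0.8 0.4]
print(is_dog)  # → [0 1 0]
```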
There's a few functions here that you can look at yourself if you're interested, but just some plotting functions that we'll use 00:26:13.720 |
Images and so here are some images that it was correct about okay, and so remember one is a dog 00:26:22.360 |
So anything greater than 0.5 is a dog and 0 is a cat, so this one with 10 to the negative 5 is obviously a cat 00:26:32.320 |
Right, so you can see that some of these which it thinks are incorrect obviously are just, you know, images that shouldn't be there at all 00:26:41.320 |
But clearly this one, which it called a dog, is not at all a dog, so there are some obvious mistakes 00:26:53.160 |
Which cats is it the most confident are cats? Which dogs are the most dog-like, the most confident dogs? 00:27:02.320 |
Perhaps more interestingly we can also see which cats is it the most confident are actually dogs 00:27:09.000 |
so which ones it is at the most wrong about and 00:27:11.960 |
Same thing for the ones the dogs that it really thinks are cats and again some of these are just 00:27:18.640 |
Pretty weird. I guess there is a dog in there. Yes, Rachel 00:27:22.700 |
I just say do you want to say more about why you would want to look at your data? 00:27:29.920 |
So yeah, so finally I just mentioned the last one we've got here is to see which ones have the probability closest to 0.5 00:27:38.560 |
So these are the ones that the model knows it doesn't really know what to do with, and some of these it's not surprising 00:27:48.760 |
Always the first thing I do after I build a model is to try to find a way to like visualize what it's built 00:27:59.800 |
Then I need to take advantage of the things it's doing well and fix the things it's doing badly. So in this case 00:28:07.640 |
And often this is the case. I've learned something about the data set itself 00:28:11.240 |
Which is that there are some things that are in here that probably shouldn't be 00:28:20.800 |
Model has room to improve; like, to me that's pretty obviously a 00:28:25.840 |
Dog. But one thing I'm suspicious about here is that this image is very 00:28:39.160 |
The way these algorithms work is it kind of grabs a square piece at a time 00:28:44.320 |
So this rather makes me suspicious that we're going to need to use something called data augmentation 00:28:49.080 |
That we'll learn about later to handle this properly 00:29:03.000 |
We've now built an image classifier and something that you should try now is to grab some data 00:29:15.720 |
Two or more different types of thing put them in different folders and run the same three lines of code 00:29:26.960 |
That it will work for that as well, as long as they are pictures of things like 00:29:33.160 |
the kinds of things that people normally take photos of right, so if they're 00:29:37.800 |
Microscope pictures or pathology pictures or 00:29:41.840 |
CT scans or something this won't work very well as we'll learn about later 00:29:47.360 |
There are some other things we'd need to do to make that work, but for things that look like normal photos 00:29:54.760 |
These you can run exactly the same three lines of code and just point your 00:30:09.120 |
Took those three lines of code, downloaded from Google Images 00:30:12.840 |
Ten examples of pictures of people playing cricket and ten examples of people playing baseball, and built a classifier 00:30:19.800 |
Of those images, which was nearly perfectly correct 00:30:25.400 |
A student actually also tried downloading seven pictures of 00:30:29.360 |
Canadian currency and seven pictures of American currency, and again in that case the model was a hundred percent 00:30:37.280 |
Accurate so you can just go to Google images if you like and download a few things of a few different classes and see 00:30:43.440 |
See what works and tell us on the forum both your successes and your failures 00:30:54.280 |
Train a neural network, but we didn't first of all tell you what a neural network is or what training means or 00:31:04.480 |
Why is that? Well, this is the start of our top-down approach to learning 00:31:11.140 |
And basically the idea is that unlike the way math and technical subjects are usually taught 00:31:17.760 |
where you learn every little element piece by piece and you don't actually get to put them all together and 00:31:23.680 |
Build your own image classifier until third year of graduate school. Our approach is to say from the start 00:31:31.600 |
Hey, let's show you how to train an image classifier and now you can start doing stuff 00:31:36.700 |
And then gradually we dig deeper and deeper and deeper 00:31:46.160 |
Throughout the course you're going to see like new problems that we want to solve 00:31:50.640 |
So for example in the next lesson, we'll look at well 00:31:54.620 |
What if we're not looking at normal kinds of photos, but we're looking at satellite images 00:32:01.180 |
And we'll see why it is that this approach that we're learning today doesn't quite work as well 00:32:06.000 |
And what things do we have to change and so we'll learn enough about the theory to understand why that happens 00:32:12.100 |
And then we'll learn about the libraries and how we can change things with the libraries to make that work better 00:32:20.440 |
So during the course we're gradually going to learn to solve more and more problems as we do 00:32:25.480 |
So we'll need to learn more and more parts of the library more and more bits of the theory until by the end 00:32:31.960 |
We're actually going to learn how to create a 00:32:37.440 |
Neural net architecture from scratch and our own training loop from scratch, and so we'll actually build everything 00:32:47.760 |
Approach. Yes, Rachel and we sometimes also call this the whole game 00:32:52.240 |
Which is inspired by Harvard professor David Perkins 00:32:57.440 |
And so the idea with the whole game is like this is more like how you would learn baseball or music 00:33:02.280 |
With baseball you would get taken to a ball game. You would learn what baseball is 00:33:07.240 |
You would start playing it and it would only be years later that you might learn about the physics of how curveball works 00:33:14.720 |
For example or with music we put an instrument in your hand and you start 00:33:20.040 |
Banging the drum or hitting the xylophone and it's not until years later that you learn about the circle of fifths and understand 00:33:29.160 |
So yeah, so that's this is kind of the approach we're using it's very inspired by 00:33:37.680 |
So what that does mean is to take advantage of this as we peel back the layers 00:33:43.440 |
We want you to keep like looking under the hood yourself as well like experiment a lot because this is a very code driven 00:33:51.880 |
Approach so here's basically what happens right? We start out looking today at 00:33:57.280 |
convolutional neural networks for images and then in a couple of lessons 00:34:02.000 |
We'll start to look at how to use neural nets to look at structured data and then to look at language data and then to look 00:34:10.960 |
And then we kind of then take all of those steps and we go backwards through them in reverse order 00:34:18.040 |
So, by the end of that fourth piece, by the end of lesson four, you will know 00:34:22.120 |
How to create a world-class image classifier, a world-class 00:34:30.160 |
Structured data analysis program, a world-class language classifier, a world-class recommendation system 00:34:36.660 |
And then we're going to go back over all of them again and learn in depth about like well 00:34:43.360 |
And how do we change things around and use it in different situations, for the recommendation systems, structured data, 00:34:51.000 |
Images and then finally back to language. So that's how it's going to work 00:34:56.680 |
So what that kind of means is that most students find that they tend to watch the videos two or three times; 00:35:06.720 |
Watch lesson one two or three times, and lesson two two or three times, and lesson three three times 00:35:11.240 |
But like, they do the whole thing end to end, lessons one through seven, and then go back and start lesson one again 00:35:18.280 |
That's an approach which a lot of people find, when they want to kind of go back and understand all the details, 00:35:23.840 |
Can work pretty well. So I would say, you know, aim to get through to the end of lesson seven 00:35:30.220 |
You know, as quickly as you can, rather than aiming to fully understand every detail from the start 00:35:39.040 |
So basically the plan is that in today's lesson you learn 00:35:46.760 |
In as few lines of code as possible, with as few details as possible, 00:35:52.200 |
How you actually build an image classifier with deep learning, in this case to say: 00:35:57.800 |
Hey, here are some pictures of dogs as opposed to pictures of cats 00:36:05.200 |
How to look at different kinds of images, and particularly we're going to look at images from satellites 00:36:13.960 |
What kinds of things might you be seeing in that image? And there could be multiple things that we're looking at, so a multi-label problem 00:36:23.400 |
From there, we'll move to something which is perhaps the most widely applicable for the most people 00:36:29.540 |
which is looking at what we call structured data, meaning data from 00:36:37.360 |
databases or spreadsheets. So we're going to specifically look at this data set of predicting sales, 00:36:43.080 |
The number of things that are sold at different stores on different dates 00:36:48.840 |
Based on different holidays and and so on and so forth and so we're going to be doing this sales forecasting 00:36:56.520 |
After that we're going to look at language, 00:37:05.120 |
and just like we created image classifiers for any kind of image, 00:37:10.800 |
we'll learn to create NLP classifiers to classify any kind of language in lots of different ways. 00:37:18.720 |
Then we'll look at something called collaborative filtering which is used mainly for recommendation systems 00:37:23.840 |
We're going to be looking at this data set that showed for different people for different movies. What rating did they give it? 00:37:32.760 |
This is maybe an easier way to think about it 00:37:35.560 |
Is there are lots of different users and lots of different movies and then for each one we can look up for each user 00:37:41.480 |
how much they like that movie, and the goal will be, of course, to predict for user and movie combinations 00:37:47.840 |
we haven't seen before: are they likely to enjoy that movie or not? And that's the really common approach used for, like, 00:37:55.640 |
Deciding what stuff to put on your home page when somebody's visiting 00:37:59.400 |
You know what book might they want to read or what film might they want to see or so forth? 00:38:03.880 |
From there we're going to then dig back into language a bit more and we're going to look at 00:38:12.080 |
Actually, we're going to look at the writings of Nietzsche the philosopher and learn how to create our own Nietzsche philosophy from scratch 00:38:21.320 |
So this here perhaps that every life of values of blood of intercourse when it senses there is unscrupulous his very rights and still impulse 00:38:31.240 |
That's actually like some character by character generated text that we built with this recurrent neural network 00:38:41.280 |
And then finally we're going to loop all the way back to computer vision again 00:38:44.680 |
We're going to learn how not just to recognize cats from dogs 00:38:48.440 |
But to actually find like where the cat is with this kind of heat map 00:38:52.160 |
And we're also going to learn how to write our own architectures from scratch 00:38:56.800 |
um, so this is an example of a ResNet, which is the kind of network that we 00:39:01.280 |
are using in today's lesson for computer vision. 00:39:04.880 |
And so we'll actually end up building the network and the training loop from scratch 00:39:09.900 |
And so they're basically the the steps that we're going to be taking from here and at each step. We're going to be getting into 00:39:16.200 |
Increasing amounts of detail about how to actually do these things yourself 00:39:21.320 |
So we've actually heard back from our students of past courses about what they found and 00:39:30.020 |
one of the things that we've heard a lot of students say is that they spent too much time on theory and not enough time running code. 00:39:43.080 |
And even after we tell people about this warning, they still come to the end of the course and often say, I wish I had taken 00:39:52.160 |
seriously that advice, which is to keep running code. 00:39:55.080 |
So these are actual quotes from our forum: "In retrospect, 00:39:59.280 |
I should have spent the majority of my time on the actual code and the notebooks." 00:40:14.120 |
World-class models in a code first approach learning what you need as you go 00:40:19.520 |
It's very different to a lot of the advice you'll read out there, such as this 00:40:23.640 |
person on Hacker News who claimed that the best way to become an ML engineer is to 00:40:32.080 |
Learn all of math learn C and C++ learn parallel programming learn ML 00:40:38.920 |
Algorithms implement them yourself using plain C and finally start doing ML 00:40:43.840 |
So we would say if you want to become an effective practitioner do exactly the opposite of this 00:40:50.240 |
Yes, Rachel? "Oh yeah, I'm just highlighting that 00:40:53.920 |
we think this is bad advice, and this can be very discouraging for a lot of people to come across." Yeah, 00:41:00.760 |
you know, we now have thousands or tens of thousands of people that have done this course, and 00:41:09.160 |
lots and lots of examples of people who have now 00:41:17.580 |
created patents based on deep learning and so forth, and who have done it by doing this course. 00:41:27.560 |
Now one thing to mention: we've now already learned how you can actually train a world-class image classifier in 00:41:35.840 |
17 seconds. I should mention, by the way, that the first time you run that code 00:41:41.600 |
there are two things it has to do that take more than 17 seconds. One is that it downloads a 00:41:47.440 |
pre-trained model from the internet, so you'll see the first time you run it, it'll say downloading model. 00:41:57.360 |
Also, the first time you run it, it pre-computes and caches 00:42:00.200 |
some of the intermediate information that it needs, and that takes about a minute and a half as well. 00:42:10.920 |
So expect the first run to take a few minutes to download and pre-compute stuff; that's normal. If you run it again, you should find it takes about 17 seconds. 00:42:20.320 |
So, image classifiers: you know, you may not feel like you need to recognize cats versus dogs very often on a computer, 00:42:30.720 |
but what's interesting is that these image classification algorithms are really useful for lots and lots of things. 00:42:41.760 |
For example, AlphaGo, which beat the Go world champion: the way it worked was to use something 00:42:49.480 |
at its heart that looked almost exactly like our dogs versus cats image classification algorithm. 00:42:56.360 |
It looked at thousands and thousands of Go boards, 00:43:00.800 |
and for each one there was a label saying whether that Go board ended up being the winning or the losing board. 00:43:10.320 |
So it learned basically an image classifier that was able to look at a Go board and figure out whether it was a good Go board or a bad 00:43:17.000 |
Go board, and that's really the key, most important 00:43:20.800 |
step in playing Go well: to know which move is better. 00:43:25.720 |
Another example is one of our earlier students, who actually used this for fraud detection. 00:43:38.160 |
He had lots of examples of his customers' mouse movements, because they provided kind of this 00:43:46.400 |
user-tracking software to help avoid fraud. So he took the mouse paths, 00:43:52.540 |
basically, of the users on his customers' websites, 00:43:56.680 |
turned them into pictures of where the mouse moved and how quickly it moved, 00:44:01.800 |
and then built an image classifier that took those images 00:44:06.680 |
as input, and as output: was that a fraudulent transaction or not? 00:44:12.480 |
And it turned out to get, you know, really great results for his company. So image classifiers 00:44:18.440 |
are much more flexible than you might imagine. 00:44:26.240 |
So these are some of the ways you can use deep learning, specifically for image recognition. And 00:44:39.520 |
deep learning is not, you know, just a word that means the same thing as machine learning. 00:44:42.680 |
Like, what is it that we're actually doing here when we're doing deep learning? 00:44:46.400 |
Deep learning is a kind of machine learning. 00:44:50.400 |
So machine learning was invented by this guy, Arthur Samuel, who was pretty amazing, in the late 50s. 00:44:57.060 |
He got this IBM mainframe to play checkers better than he could, and the way he did it 00:45:09.520 |
was to get the computer to play against itself lots of times and figure out which kinds of things led to victories and which kinds of things didn't, 00:45:15.680 |
and it used that to kind of almost write its own program. 00:45:19.320 |
And Arthur Samuel actually said in 1962 that he thought that one day the vast majority of computer software 00:45:26.560 |
would be written using this machine learning approach, rather than written by hand, by writing the loops and so forth by hand. 00:45:35.400 |
So I guess that hasn't happened yet, but it seems to be in the process of happening 00:45:41.400 |
I think one of the reasons it didn't happen for a long time is because traditional machine learning actually was very difficult and very 00:45:49.820 |
knowledge- and time-intensive. So, for example, here's something called the computational pathologist, or C-Path, 00:45:57.560 |
from a guy called Andy Beck, back when he was at Stanford. 00:46:08.400 |
And what he did was he took these pathology slides of breast cancer 00:46:17.000 |
he worked with lots of pathologists to come up with ideas about what kinds of 00:46:23.280 |
patterns or features might be associated with patients surviving a long time versus 00:46:30.720 |
dying quickly, basically. And so 00:46:35.800 |
they came up with these ideas, like the relationship between epithelial nuclear neighbors, 00:46:39.320 |
the relationship between epithelial and stromal objects, and so forth. So they came up with all of these ideas of features. 00:46:45.880 |
these are just a few of the hundreds that they thought of and then lots of 00:46:52.840 |
specialist algorithms to calculate all these different features. And then those 00:47:00.360 |
features were passed into a logistic regression 00:47:02.580 |
to predict survival, and it ended up working very well. 00:47:06.920 |
It ended up that the survival predictions were more accurate than pathologists' own survival predictions. 00:47:15.080 |
and so machine learning can work really well, but the point here is that this was a 00:47:19.720 |
An approach that took lots of domain experts and computer experts 00:47:26.040 |
many years of work to actually build this thing, right? 00:47:40.000 |
So specifically, I'm going to show you something different: rather than building a very specific function with all this handcrafted 00:47:51.120 |
feature engineering, we're going to try and create an infinitely flexible function, a function that could solve any problem. 00:47:58.000 |
Right it would solve any problem if only you set the parameters of that function correctly 00:48:03.440 |
And so then we need some all-purpose way of setting the parameters of that function 00:48:08.760 |
And we would need that to be fast and scalable 00:48:11.220 |
Right, now if we had something that had these three things, we wouldn't need that 00:48:17.080 |
incredibly time- and domain-knowledge-intensive approach anymore; instead we could learn all of those things. 00:48:29.320 |
The algorithm in question which has these three properties is called deep learning 00:48:34.440 |
Or if not an algorithm, then maybe we would call it a class of algorithms 00:48:39.240 |
Let's look at each of these three things in turn 00:48:43.560 |
So the underlying function that deep learning uses is something called the neural network 00:48:49.240 |
Now, the neural network: we're going to learn all about it and implement it ourselves from scratch later on in the course. 00:48:56.360 |
But for now all you need to know about it is that it consists of a number of simple linear layers 00:49:03.200 |
interspersed with a number of simple nonlinear layers 00:49:07.040 |
And when you intersperse these layers in this way, 00:49:12.880 |
you get something called the universal approximation theorem, and the universal approximation theorem says that this kind of function can approximate any function 00:49:24.960 |
to arbitrarily close accuracy, as long as you add enough parameters. 00:49:31.880 |
So it's actually provably shown to be an infinitely flexible function 00:49:38.520 |
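The structure described here, simple linear layers interspersed with simple nonlinear layers, can be sketched in a few lines of NumPy. This is an illustrative toy, not the course's fastai code; the layer sizes are arbitrary and the bias terms a real network would have are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with a nonlinearity in between; the sizes
# (3 inputs, 4 hidden units, 1 output) are arbitrary choices.
w1 = rng.normal(size=(3, 4))    # weights of the first linear layer
w2 = rng.normal(size=(4, 1))    # weights of the second linear layer

def relu(x):
    # a simple nonlinear layer: negatives become zero
    return np.maximum(x, 0)

def net(x):
    # linear layer, then nonlinear layer, then linear layer
    return relu(x @ w1) @ w2

x = np.array([1.0, 2.0, 3.0])
print(net(x))                   # one output value, determined by the parameters
```

Fitting the weights `w1` and `w2` so this function does something useful is exactly the job of gradient descent, which comes next.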
Right. So now we need some way to fit the parameters so that this infinitely flexible neural network solves some specific problem and 00:49:46.240 |
so the way we do that is using a technique that 00:49:50.300 |
probably most of you will have come across before at some stage called gradient descent and with gradient descent we basically say 00:49:57.680 |
Okay, well for the different parameters we have 00:50:00.200 |
how good are they at solving my problem, and let's figure out a slightly better set of parameters, 00:50:08.440 |
And a slightly better set of parameters and basically follow down 00:50:11.720 |
The the surface of the loss function downwards. It's kind of like a marble going down to find the minimum and 00:50:19.440 |
As you can see here depending on where you start you end up in different places 00:50:25.160 |
These things are called local minima. Now, interestingly, it turns out that for neural networks in particular, there aren't really multiple different 00:50:39.080 |
local minima; there's basically just one. Or, to think of it another way, 00:50:46.960 |
there are different parts of the space which are all equally good. 00:50:53.880 |
Gradient descent therefore turns out to be actually an excellent way to 00:50:58.400 |
Solve this problem of fitting parameters to neural networks 00:51:04.840 |
The problem is though that we need to do it in a reasonable amount of time and 00:51:09.480 |
It's really only thanks to GPUs that that's become possible 00:51:17.520 |
How many gigaflops per second can you get out of a 00:51:23.920 |
GPU (that's the red and green) versus a CPU (that's the blue)? Right, and this is on a log scale, 00:51:31.760 |
so you can see that, generally speaking, the GPUs are about ten times faster. 00:51:40.720 |
What's really interesting is that nowadays not only is the Titan X about 10 times faster than the E5 Xeon; 00:51:53.600 |
well, actually, a better one to look at would be the GTX 1080 Ti, which is far cheaper, 00:52:01.240 |
whereas the CPU, which is 10 times slower, costs over $4,000. 00:52:18.520 |
And also incredibly cheaply so they've been absolutely key in bringing these three pieces together 00:52:29.640 |
One more thing to cover: as I mentioned, in these neural networks you can intersperse multiple sets of linear and then nonlinear layers. 00:52:36.960 |
In the particular example that's drawn here, there's actually only one 00:52:43.560 |
of what we call hidden layers, one layer in the middle. And 00:52:46.480 |
Something that we learned in the last few years is that these kinds of neural networks although they do 00:52:53.200 |
Support the universal approximation theorem they can solve any given problem arbitrarily closely 00:52:59.320 |
They require an exponentially increasing number of parameters to do so 00:53:05.000 |
So they don't actually solve the fast and scalable for even reasonable size problems 00:53:10.240 |
But we've since discovered that if you create multiple hidden layers, 00:53:16.840 |
then you get super-linear scaling, so you can add a few more hidden layers and get 00:53:25.600 |
more accuracy on multiplicatively more complex problems. And 00:53:29.240 |
That is where it becomes called deep learning. So deep learning means a neural network with multiple hidden layers 00:53:36.680 |
So when you put all this together, it's actually really amazing what happens. 00:53:45.120 |
Google started investing in deep learning in 2012 00:53:53.200 |
They actually hired Geoffrey Hinton, who's kind of the father of deep learning, and his top student Alex Krizhevsky, 00:54:00.040 |
And they started trying to build a team that team became known as Google brain 00:54:09.680 |
Things with these three properties are so incredibly powerful and so incredibly flexible you can actually see over time 00:54:18.320 |
How many projects at Google use deep learning? 00:54:22.420 |
My graph here only goes up through a bit over a year ago 00:54:26.560 |
But it's I know it's been continuing to grow exponentially since then as well 00:54:30.920 |
And so what you see now is around Google that deep learning is used in like every part of the business 00:54:43.960 |
This kind of simple idea, that we can solve machine learning problems using an infinitely flexible function: 00:54:53.520 |
when a big company invests heavily in actually making that happen, 00:54:57.720 |
You see this incredible growth in how much it's used 00:55:01.640 |
So, for example, if you use the Inbox by Google software, 00:55:07.920 |
then when you receive an email from somebody, it will often suggest some replies 00:55:15.920 |
that it could send for you. And so it's actually using deep learning here to read the original email and to generate 00:55:24.240 |
some suggested replies. And so this is a really great example of the kind of stuff that deep learning makes possible. 00:55:33.640 |
Another great example: Microsoft has also, a little bit more recently, invested heavily in deep learning, and so now you can 00:55:43.800 |
use Skype, speaking into it in English, and ask it at the other end to 00:55:49.880 |
translate it in real time to Chinese or Spanish. And then when they talk back to you in Chinese or Spanish, 00:55:55.720 |
Skype will translate the speech in their language into English speech, in real time. 00:56:03.520 |
And again, this is an example of stuff which we can only do thanks to deep learning 00:56:11.880 |
I also think it's really interesting to think about how deep learning can be combined with human expertise 00:56:18.080 |
So here's an example of like drawing something just sketching it out 00:56:22.960 |
And then using a program called neural doodle 00:56:26.080 |
This is from a couple of years ago to then say please take that sketch and render it in the style of an artist 00:56:33.280 |
And so here's the picture that it then created 00:56:37.440 |
rendering it as, you know, an impressionist painting. And I think this is a really great example of how you can combine 00:56:46.480 |
human expertise and what computers are good at. 00:56:50.480 |
So a few years ago I decided to try this myself: what would happen if I took 00:57:02.080 |
deep learning and tried to use it to solve a really important problem? And so the problem I picked was diagnosing lung cancer, because if you find lung cancer early, 00:57:15.640 |
there's a ten times higher probability of survival. 00:57:20.040 |
So it's a really important problem to solve. I got together with three other people; none of us had any medical background. 00:57:33.960 |
We built an algorithm much like the dogs versus cats one we trained at the start of today's lesson, 00:57:46.480 |
and we ended up, after a couple of months, with something with a much lower 00:57:50.720 |
false negative rate and a much lower false positive rate than a panel of four radiologists. 00:57:55.800 |
And we went on to build this, in a startup, into a company called Enlitic, 00:58:01.600 |
which has really become pretty successful and 00:58:03.800 |
Since that time the idea of using deep learning for medical imaging has become 00:58:09.440 |
Hugely popular and it's being used all around the world 00:58:12.760 |
So what I've generally noticed is that, you know, the vast majority of 00:58:18.720 |
the kinds of things that people do in the world currently aren't using deep learning. 00:58:25.040 |
And then each time somebody says oh, let's try using deep learning to improve performance at this thing 00:58:30.880 |
They nearly always get fantastic results and then suddenly everybody in that industry starts using it as well 00:58:37.260 |
So there's just lots and lots of opportunities here at this particular time to use deep learning to help with all kinds of different stuff 00:58:45.000 |
So I've jotted down a few ideas here. These are all things which I know you can use 00:58:51.360 |
deep learning for right now to get good results from 00:58:57.720 |
They're things which people spend a lot of money on, or which have, you know, important business opportunities. 00:59:06.160 |
But these are some examples of things that maybe at your company you could think about applying deep learning for 00:59:15.880 |
What actually happened when we trained that deep learning model earlier? 00:59:21.760 |
And so as I briefly mentioned the thing we created is something called a convolutional neural network or CNN and 00:59:29.520 |
The key piece of a convolutional neural network is the convolution, which is what it's named after. 00:59:44.240 |
The Explained Visually website has an example of a convolution 00:59:50.760 |
kind of in practice over here in the bottom left is a very zoomed in picture of somebody's face and 00:59:56.600 |
Over here on the right is an example of using a convolution on that image 01:00:03.440 |
You can see here. This particular thing is obviously finding 01:00:08.120 |
Edges the edges of his head right top and bottom edges in particular 01:00:17.440 |
Now how is it doing that well if we look at each of these little three by three areas that this is moving over 01:00:23.520 |
It's taking each three by three area of pixels and here are the pixel values 01:00:28.380 |
right each thing in that three by three area and 01:00:31.440 |
it's multiplying each one of those three by three pixels by each one of these 01:00:40.400 |
kernel values. In a convolution, this specific set of nine values is called a kernel. 01:00:47.400 |
It doesn't have to be nine it could be four by four or five by five or two by two or whatever, right? 01:00:52.760 |
In this case, it's a three by three kernel and in fact in deep learning nearly all of our kernels are three by three 01:00:58.760 |
So in this case the kernel is 1 2 1, 0 0 0, -1 -2 -1. So we take each of the 01:01:07.240 |
black-through-white pixel values and we multiply, as you can see, each of them by the corresponding value in the kernel. 01:01:20.400 |
And so if you do that for every three by three area you end up with 01:01:26.040 |
The values that you see over here on the right hand side 01:01:33.640 |
Very low values become black, very high values become white, and so you can see when we're at an edge 01:01:43.960 |
We're obviously going to get higher numbers over here and vice versa. Okay, so that's a convolution 01:01:50.780 |
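The computation just described, sliding a three by three kernel over every patch, multiplying element-wise and summing, can be sketched in NumPy. This is a deliberately slow, readable loop, not how deep learning libraries actually implement convolutions, and the tiny "image" is made up just to show a horizontal edge:

```python
import numpy as np

# The three-by-three edge-detecting kernel from the example above
kernel = np.array([[ 1,  2,  1],
                   [ 0,  0,  0],
                   [-1, -2, -1]])

def conv2d(image, kernel):
    # Slide the kernel over every patch, multiply element-wise, and sum
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = (patch * kernel).sum()
    return out

# A tiny made-up "image": dark rows (0) on top, bright rows (9) below,
# so there is one horizontal edge between rows 1 and 2
img = np.array([[0, 0, 0, 0],
                [0, 0, 0, 0],
                [9, 9, 9, 9],
                [9, 9, 9, 9],
                [9, 9, 9, 9]])

print(conv2d(img, kernel))   # large responses near the edge, zero elsewhere
```

Running this, the output rows that straddle the dark-to-bright boundary get large values, while the uniform region at the bottom gives zero, which is exactly the edge-detecting behavior shown in the lecture.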
So as you can see it is a linear operation and so based on that definition of a neural net 01:01:57.720 |
I described before this can be a layer in our neural network. It is a simple linear operation 01:02:04.220 |
And we're going to look lots more at convolutions later, including building a little spreadsheet. 01:02:11.520 |
So the next thing we're going to do is we're going to add a nonlinear layer 01:02:16.280 |
so a nonlinearity as it's called is something which takes an input value and 01:02:25.480 |
Turns it into some different value in a nonlinear way and you can see this orange picture here is an example of a nonlinear 01:02:32.520 |
function specifically this is something called a sigmoid and 01:02:36.120 |
so a sigmoid is something that has this kind of s shape and 01:02:40.440 |
This is what we used to use as our nonlinearities in neural networks a lot 01:02:45.500 |
Actually nowadays we nearly entirely use something else called a relu or rectified linear unit 01:02:52.200 |
a relu is simply take any negative numbers and replace them with zero and 01:02:58.360 |
Leave any positive numbers as they are so in other words in code that would be 01:03:04.020 |
y = max(x, 0). So max(x, 0) simply says replace the negatives with 0. 01:03:20.000 |
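As a minimal NumPy sketch, the two nonlinearities mentioned here look like this:

```python
import numpy as np

def relu(x):
    # rectified linear unit: replace negatives with zero
    return np.maximum(x, 0)

def sigmoid(x):
    # the older s-shaped nonlinearity mentioned above
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(x))       # negatives become 0, positives pass through unchanged
print(sigmoid(x))    # every value squashed into the range (0, 1)
```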
Regardless of whether you use a sigmoid or a relu or something else 01:03:24.000 |
The key point about taking this combination of a linear layer followed by a element wise nonlinear function is 01:03:32.860 |
That it allows us to create arbitrarily complex shapes as you see in the bottom, right? 01:03:38.080 |
And the reason why (this is all from Michael Nielsen's neuralnetworksanddeeplearning.com, which is really fantastic) is that as 01:03:48.880 |
you change the values of your linear functions, 01:03:53.280 |
it basically allows you to kind of build these arbitrarily tall or thin blocks, and then combine those blocks together. 01:04:02.240 |
And this is actually the essence of the universal approximation theorem this idea that when you have a linear layer 01:04:10.400 |
Feeding into a nonlinearity you can actually create these arbitrarily complex shapes 01:04:16.160 |
So this is the key idea behind why neural networks can solve any computable problem 01:04:22.600 |
So then we need a way as we described to actually 01:04:28.600 |
Set these parameters so it's all very well knowing that we can move the parameters around manually to try to 01:04:36.520 |
Create different shapes, but we have some specific shape. We want how do we get to that shape? 01:04:42.680 |
And so as we discussed earlier the basic idea is to use something called gradient descent 01:04:48.280 |
This is an extract from a notebook actually one of the fast AI lessons 01:04:53.640 |
And it shows actually an example of using gradient descent to solve a simple linear regression problem 01:05:01.560 |
But I can show you the basic idea. Let's say you had a simple quadratic function, 01:05:13.000 |
and you were trying to find the minimum of this quadratic. 01:05:18.040 |
And so in order to find the minimum you start out by randomly picking some point, right? 01:05:27.120 |
And so you go up there and you calculate the value of your quadratic at that point 01:05:31.640 |
So what you now want to do is try to find a slightly better point 01:05:35.960 |
So what you could do is you can move a little bit to the left 01:05:40.680 |
And a little bit to the right to find out which direction is down and what you'll find out 01:05:46.840 |
Is that moving a little bit to the left decreases the value of the function so that looks good, right? 01:05:59.140 |
All right, so that tells you which way is down 01:06:04.760 |
It's the gradient. And so now that we know that going to the left is down we can take a small step in 01:06:13.800 |
To create a new point and then we can repeat the process and say okay 01:06:18.680 |
Which way is down now and we can now take another step and another step and another step another step another step, okay? 01:06:26.240 |
And each time we're getting closer and closer 01:06:29.520 |
So the basic approach here is to say okay. We start we're at some point. We've got some value X 01:06:36.440 |
Which is our current guess right that at time step n 01:06:41.080 |
So then our new guess at time step n+1 is just equal to our previous guess, minus the derivative at that point times some 01:07:00.200 |
small number, because we want to take a small step. 01:07:02.880 |
We need to pick a small number because if we picked a big number right then we say okay 01:07:09.240 |
We know we want to go to the left. Let's jump a big long way to the left 01:07:14.880 |
we might actually end up worse, right? And then we do it again, and end up worse still. So with too big a 01:07:25.960 |
step size, you can actually end up with divergence rather than convergence. 01:07:31.000 |
So this number here we're going to be talking about it a lot during this course 01:07:35.000 |
And we're going to be writing all this stuff out and code from scratch ourselves 01:07:37.760 |
But this number here is called the learning rate 01:07:50.560 |
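The update rule and learning rate described above can be sketched on a simple quadratic, f(x) = x², whose minimum is at zero; the starting point and learning rate below are arbitrary illustrative choices:

```python
# A minimal sketch of gradient descent on f(x) = x**2, whose minimum is
# at x = 0. The update rule is the one described above:
# new guess = previous guess - learning_rate * gradient.

def grad(x):
    return 2 * x              # derivative of f(x) = x**2

x = 5.0                       # some random starting guess
lr = 0.1                      # the learning rate

for step in range(50):
    x = x - lr * grad(x)      # take a small step downhill

print(round(x, 4))            # very close to the true minimum at 0

# With too large a learning rate (for this function, anything above 1.0),
# each step overshoots the minimum and the iterates diverge.
```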
This is an example of basically starting out with some random line and then using gradient descent to gradually make the line 01:07:59.760 |
So what happens when you combine these ideas, right: the convolution, 01:08:04.920 |
the nonlinearity, and gradient descent? Because they're all tiny, simple little things, it doesn't sound that exciting. 01:08:17.640 |
But with enough layers, something really interesting happens. 01:08:26.920 |
So this is a really interesting paper by Matt Zeiler and Rob Fergus, and what they did a few years ago 01:08:36.300 |
was they figured out how to basically draw a picture of what each layer in a deep learning network learned. 01:08:43.840 |
And so they showed that layer one of the network here are nine examples of convolutional filters from layer one of a trained network 01:08:53.960 |
and they found that some of the filters kind of learnt these diagonal lines or 01:08:58.620 |
Simple little grid patterns some of them learnt these simple gradients right and so for each of these filters 01:09:05.800 |
They show nine examples of little pieces of actual photos 01:09:10.840 |
Which activate that filter quite highly right so you can see layer one 01:09:16.600 |
These, remember, are learnt using gradient descent; these filters were not programmed, 01:09:22.720 |
they were learnt using gradient descent. So, in other words, the network was learning these basic feature detectors for itself. 01:09:37.040 |
so layer two then was going to take these as inputs and 01:09:41.360 |
combine them together. And so for layer two, 01:09:46.860 |
these are like nine attempts to draw one example each of the filters in layer two. 01:09:52.700 |
They're pretty hard to draw but what you can do is say for each filter 01:09:57.160 |
What are examples of little bits of images that activated them and you can see by layer two we've got? 01:10:03.640 |
Basically something that's being activated nearly entirely by little bits of sunset 01:10:07.920 |
something that's being activated by circular objects, something activated by 01:10:15.300 |
repeating horizontal lines, something that's being activated by corners. Right, so you can see how we're basically combining layer one features together. 01:10:24.600 |
So if we combine those features together and again, these are all 01:10:29.960 |
convolutional filters learnt through gradient descent, by the third layer it's actually learned to recognize the presence of text. 01:10:38.360 |
Another filter has learned to recognize the presence of petals 01:10:42.160 |
Another filter has learned to recognize the presence of human faces right so just three layers is enough to get some pretty 01:10:50.440 |
Rich behavior so but by the time we get to layer five 01:10:54.760 |
We've got something that can recognize the eyeballs of insects and birds 01:11:03.960 |
Right, so this is kind of how we start with something very simple and end up with very rich behavior, 01:11:14.440 |
thanks to the universal approximation theorem and the use of multiple hidden layers in deep learning. 01:11:27.120 |
So that is what we used when we actually trained our model earlier. 01:11:41.240 |
Let's talk more about this dog versus cat recognizer 01:11:44.840 |
So we've learned the idea that we can look at the pictures that come out of the other end to see what the model is 01:11:50.880 |
classifying well or classifying badly, or which ones it's unsure about. 01:11:56.240 |
But let's talk about like this key thing. I mentioned which is the learning rate 01:12:03.560 |
I just called it L before the learning rate and you might have noticed there's a couple of numbers these kind of magic numbers 01:12:09.960 |
Here the first one is the learning rate, right? 01:12:14.480 |
So this number is how much do you want to multiply the gradient by when you're taking each step in your gradient descent? 01:12:23.400 |
We already talked about why you wouldn't want it to be too high 01:12:26.440 |
Right, but probably also it's obvious to see why you wouldn't want it to be too low, right? If you had it too low 01:12:33.520 |
You would take like a little step and you'd be a little bit closer and a little bit step a little step little step 01:12:40.560 |
And it would take lots and lots and lots of steps and it would take too long 01:12:44.480 |
So setting this number well is actually really important, and it's been driving 01:12:53.120 |
deep learning researchers crazy, because they didn't really know a good way to choose it. 01:13:00.480 |
So the good news is, last year a researcher came up 01:13:07.000 |
with an approach to quite reliably set the learning rate. 01:13:10.880 |
Unfortunately, almost nobody noticed, so almost no deep learning researchers I know of are actually aware of this approach. 01:13:20.640 |
But it's incredibly successful and it's incredibly simple and I'll show you the idea 01:13:25.320 |
It's built into the fast AI library as something called LR find or the learning rate finder and it comes from this paper 01:13:35.400 |
Cyclical learning rates for training neural networks by a terrific researcher called Leslie Smith 01:13:50.960 |
The basic idea starts with something we've seen before: if we're going to optimize something, pick some random point. 01:13:50.960 |
Right, and then specifically he said: take a tiny, tiny step, 01:14:00.480 |
so a learning rate of like 1e-7, 01:14:05.720 |
right, and then do it again and again, but each time increase the learning rate, like, double it. 01:14:12.280 |
So then we try like 2e-7, 4e-7, 8e-7, 01:14:18.240 |
Right and so you can see what's going to happen. It's going to like 01:14:37.440 |
Start doing almost nothing right and it's going to then suddenly the loss function is going to improve very quickly 01:14:43.680 |
Right, but then it's going to step even further again 01:14:51.320 |
Right, let's draw the rest of that line to be clear 01:14:56.280 |
Right and so suddenly it's then going to shoot off and get much worse 01:15:10.600 |
At what point did we see like the best improvement? 01:15:27.080 |
We've got our best improvement right and so we'd say okay. Let's use that 01:15:32.680 |
Learning rate right so in other words if we were to plot 01:15:47.040 |
Right and so what we then want to do is we want to plot 01:15:53.160 |
against the loss. So when I say the loss, I basically mean how accurate is the model; in this case the loss 01:16:01.080 |
would be how far away is the prediction from the right answer 01:16:07.560 |
Right and so if we plotted the learning rate against the loss we'd say like okay initially it didn't do very much 01:16:14.880 |
Right for small learning rates, and then it suddenly improved a lot and then it suddenly got a lot worse 01:16:22.360 |
So that's the basic idea and so we'd be looking for the point where this graph is 01:16:29.920 |
Dropping quickly right we're not looking for its minimum point 01:16:33.000 |
We're not saying like where was at the lowest because that could actually be the point where it's just jumped too far 01:16:46.280 |
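That procedure can be illustrated with a minimal sketch. This is a toy one-parameter model, not the fastai implementation: keep taking gradient steps while doubling the learning rate, record the loss before each step, then pick the rate at which the loss fell fastest.

```python
# Toy parameter; loss(x) = x**2, gradient = 2*x.
x = 10.0
lr = 1e-7
lrs, losses = [], []
while lr < 1.0:
    lrs.append(lr)
    losses.append(x * x)    # loss before taking the step
    x = x - lr * 2 * x      # one gradient-descent step at this rate
    lr = lr * 2             # double the rate for the next step

# The "best" rate is where the loss dropped most in a single step,
# not where the loss itself was lowest.
drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
best_lr = lrs[drops.index(max(drops))]
```

At tiny rates the loss barely moves, then it starts falling faster and faster; in this toy run the steepest single-step drop happens around a learning rate of 0.2.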
So if you create your learn objects in the same way that we did before we'll be learning more about this these details shortly 01:16:54.320 |
If you then call LR find method on that you'll see that it'll start training a model 01:17:01.360 |
like it did before, but it'll generally stop before it gets to a hundred percent, because if it notices the loss getting worse 01:17:12.640 |
then it'll stop automatically. So you can see here it stopped at 84%, and so then you can call 01:17:19.440 |
learn.sched; that gets you the learning rate scheduler 01:17:22.680 |
That's the object which actually does this learning rate finding, and that object has a plot_lr function 01:17:28.240 |
And so you can see here by iteration you can see the learning rate 01:17:32.680 |
All right, so you can see each step the learning rate is getting bigger and bigger 01:17:36.640 |
When you do it this way, you can see it's increasing exponentially 01:17:41.880 |
Another way that Leslie Smith the researcher suggests is to do it linearly 01:17:47.560 |
So I'm actually currently researching with both of these approaches to see which works best 01:17:51.720 |
Recently I've been mainly using exponential, but I'm starting to look more at using linear at the moment 01:17:57.200 |
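The two ramps being compared can be sketched like this. The function names are illustrative, not part of the fastai API:

```python
# Two ways to ramp the learning rate during the finder.
def exp_ramp(start, stop, n):
    """Multiply by a fixed factor each step (exponential spacing)."""
    mult = (stop / start) ** (1.0 / (n - 1))
    return [start * mult ** i for i in range(n)]

def lin_ramp(start, stop, n):
    """Add a fixed amount each step (linear spacing)."""
    step = (stop - start) / (n - 1)
    return [start + step * i for i in range(n)]

exp_lrs = exp_ramp(1e-5, 1e-1, 5)  # each value 10x the previous
lin_lrs = lin_ramp(1e-5, 1e-1, 5)  # equal-sized additive steps
```

Exponential spacing spends most of its steps at small rates, which is why it gives a readable plot over many orders of magnitude; linear spacing samples the large rates more densely.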
And so if we then call sched.plot, that does the plot that I just described: learning rate against 01:18:06.760 |
loss. All right, and so we're looking for the highest learning rate we can find 01:18:16.240 |
that's still improving clearly, right, and so in this case I would say 01:18:20.400 |
10 to the negative 2, because at 10 to the negative 1 it's not improving 01:18:25.200 |
anymore. All right, at 10 to the negative 3 it is also improving 01:18:28.920 |
but I'm trying to find the highest learning rate I can where it's still clearly improving 01:18:33.160 |
So I'd say 10 to the negative 2 right so you might have noticed that when we ran our model before we had 01:18:40.240 |
10 to the negative 2 0.01. So that's why we picked that learning rate 01:18:45.940 |
So there's really only one other number that we have to pick and 01:18:53.700 |
That was this number 3 and so that number 3 controlled how many 01:19:02.100 |
epochs that we run. So an epoch means going through our entire data set of images 01:19:11.820 |
once. Each time, we do what are called mini-batches: we grab like 01:19:17.340 |
64 images at a time and use them to try to improve the model a little bit using gradient descent 01:19:23.260 |
Right, and using all of the images once is called one epoch 01:19:27.420 |
and so at the end of each epoch we print out the accuracy and 01:19:32.900 |
Validation and training loss at the end of the epoch 01:19:41.780 |
how many epochs should we run is kind of the one other question that you need to answer to run these three lines of code and 01:19:55.340 |
What you might find happen is, if you run it for too long, the accuracy will start getting worse 01:20:01.640 |
Right and we'll learn about that why later. It's something called overfitting right so 01:20:06.900 |
You can run it for a while, run lots of epochs, until it starts getting worse, and then 01:20:12.060 |
you know how many epochs you can run. The other thing that might happen is, if you've got like a really big model 01:20:17.780 |
or lots and lots of data, maybe it takes so long you don't have time, and so you just run enough epochs to 01:20:23.500 |
fit into the time you have available. So the number of epochs you run, you know, that's a pretty easy thing to set 01:20:29.580 |
So they're the only two numbers you're going to have to set and so the goal 01:20:34.860 |
This week will be to make sure that you can run 01:20:39.580 |
Not only these three lines of code on the data that I provided 01:20:43.820 |
But to run it on a set of images that you either have on your computer or that you 01:20:50.360 |
Get from work or that you download from Google 01:20:53.420 |
And try to get a sense of which kinds of images it seems to work well for 01:21:02.860 |
What kind of learning rates do you need for different kinds of images how many epochs do you need? 01:21:09.540 |
How does the learning rate change the accuracy you get, and so forth? Like, really experiment, and then 01:21:16.420 |
You know try to get a sense of like what's inside this data object? 01:21:21.140 |
You know what are the y values look like what are these classes mean? 01:21:24.980 |
If you're not familiar with numpy, you know, really practice a lot with numpy, so that by the time you come back for the next lesson 01:21:33.660 |
You know we're going to be digging into a lot more detail, and so you'll really feel ready to do that 01:21:39.060 |
now one thing that's really important to be able to do that is that you need to really know how to use 01:21:47.780 |
numpy, the fastai library, and so forth, and so I want to show you some tricks in Jupyter notebook to make that much easier 01:21:55.580 |
So one trick to be aware of is, if you can't quite remember how to spell something 01:22:04.860 |
(what the method you want is called), you can always hit tab 01:22:10.220 |
and it will list the methods that start with those letters, right, and so that's a quick way to find things 01:22:14.900 |
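Under the hood, tab-completion is essentially prefix-matching on an object's attribute names, which you can reproduce in plain Python:

```python
# Mimic tab-completion: list the names on an object that start with
# the letters typed so far, using dir().
import math

prefix = "co"
matches = sorted(name for name in dir(math) if name.startswith(prefix))
print(matches)  # cos, cosh, copysign, and so on
```

The notebook does the same thing against whatever object sits before the dot, which is why completion works on your own variables too.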
If you then can't remember what the arguments are to a method hit shift tab 01:22:20.360 |
All right, so hitting shift-tab tells you the arguments to the method, so shift-tab is like one of the most helpful things 01:22:35.860 |
Press shift-tab twice, and now, if you're wondering, okay, what does this function do and how does it work, it 01:22:48.220 |
shows you what the parameters are, and shows you what it returns, and gives you examples 01:22:58.380 |
If you press it three times, then it actually pops up a whole little separate window with that information 01:23:05.860 |
One way to grab that window straight away is if you just put question mark at the start 01:23:12.540 |
Then it just brings up that little documentation window 01:23:16.660 |
Now the other thing to be aware of is increasingly during this course 01:23:22.500 |
We're going to be looking at the actual source code of fast AI itself and learning how it's built and why it's built that way 01:23:29.660 |
It's really helpful to look at source code in order to you know 01:23:33.740 |
Understand what you can do and how you can do it 01:23:36.540 |
So if you, for example, wanted to look at the source code for learn.predict, you can just put two question marks 01:23:42.400 |
Okay, and you can see it's popped up the source code right and so it's just a single line of code 01:23:50.300 |
You'll very often find that fast AI methods like they're they're designed to never be more than 01:23:57.420 |
About half a screen full of code and they're often under six lines so you can see this case 01:24:02.660 |
It's calling a function called predict_with_targs, so we could then get the source code for that in the same way 01:24:10.940 |
or get the documentation for it in the same way, and 01:24:16.340 |
so here we are, and finally, that's what it does: it iterates through a data loader, gets the predictions, and then passes them back 01:24:26.980 |
So: two question marks is how to get source code, a single question mark is how to get documentation, and 01:24:33.660 |
shift-tab is how to bring up parameters, or press it more times for more detail 01:24:43.020 |
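The notebook's shift-tab, `?`, and `??` are built on Python's own introspection, so outside Jupyter the standard `inspect` module gives you the same three kinds of information. The `predict` function here is a made-up stand-in, not the fastai one:

```python
import inspect

def predict(model, data):
    """Toy stand-in for a library function."""
    return [model(x) for x in data]

sig = str(inspect.signature(predict))    # like shift-tab: the parameters
doc = inspect.getdoc(predict)            # like '?': the documentation
src = inspect.getsource(inspect.getdoc)  # like '??': a function's source code
print(sig)
```

This is also handy in scripts and debuggers, where the notebook shortcuts aren't available.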
Another really helpful thing to know about is how to use Jupyter notebook well and the button that you want to know is H 01:24:50.180 |
If you press H, it will bring up the keyboard shortcuts 01:24:54.940 |
Palette and so now you can see exactly what Jupyter notebook can do and how to do it 01:25:00.500 |
I personally find all of these functions useful 01:25:03.680 |
So I generally tell students to try and learn four or five different keyboard shortcuts a day 01:25:08.960 |
Try them out see what they do see how they work, and then you can try practicing in that session 01:25:14.940 |
And one very important thing to remember: when you're finished with your work for the day, go back to Paperspace and click on the button 01:25:24.100 |
which stops and starts the machine. So after it's stopped, you'll see it says connection closed, and you'll see it's off 01:25:30.460 |
If you leave it running, you'll be charged for it. Same thing with Crestle: be sure to go to your Crestle 01:25:37.060 |
instance and stop it. You can't just turn your computer off or close the browser 01:25:43.020 |
You actually have to stop it in Crestle or in Paperspace, and don't forget to do that 01:25:53.220 |
Okay, so I think that's all of the information that you need to get started. Please remember about the forums 01:26:04.500 |
but before you ask there, make sure you read the information on course.fast.ai for each lesson 01:26:11.020 |
All right, because that is going to tell you about things that have changed. Okay, so if there's been some change to the 01:26:20.900 |
Jupyter notebook provider we suggest using, or how to set up Paperspace, or anything like that 01:26:28.780 |
Okay, thanks very much for watching, and I look forward to seeing you in the next lesson