
Lesson 1: Deep Learning 2018


Chapters

0:00 Introduction
1:33 Community
2:18 Coding
4:02 Jupyter Notebook
6:00 Paperspace
12:57 Running a cell
15:07 Running more cells
20:21 Training a model
29:04 Training an image classifier
30:50 Top-down approach
33:48 Lesson plan
39:22 Advice from past students
41:26 Image classifiers
44:23 Deep learning vs machine learning
47:33 An infinitely flexible function
48:43 The neural network
49:38 Gradient descent
51:03 GPU vs CPU
52:27 Hidden Layers
53:38 Google Brain
55:03 Google Inbox
55:33 Microsoft Skype
56:10 Neural Doodle
56:52 My Personal Experience
58:45 Deep Learning Ideas

Whisper Transcript

00:00:00.000 | Hi everybody, welcome to practical deep learning for coders. This is part one of our two-part course
00:00:10.720 | Presenting this from the Data Institute in San Francisco
00:00:14.020 | We'll be doing seven lessons in this part of the course
00:00:19.620 | Most of them will be about a couple of hours long; this first one may be a little bit shorter
00:00:26.760 | Practical deep learning for coders is all about getting you up and running with deep learning in practice
00:00:32.080 | Getting world-class results, and it's a really coding-focused approach, as the name suggests,
00:00:38.700 | but we're not going to dumb it down. By the end of the course you'll have learned all of the
00:00:43.220 | theory and details that are necessary to rebuild all of the world-class results we're learning about from scratch
00:00:49.640 | Now I should mention that our videos are hosted on YouTube
00:00:55.600 | But we strongly recommend watching them via our website at course.fast.ai
00:01:00.900 | Although they're exactly the same videos, the important thing about watching them through our website
00:01:07.520 | is that you'll get all of the information you need about updates to libraries, file locations,
00:01:14.160 | further information, frequently asked questions and so forth
00:01:17.980 | So if you're currently on YouTube watching this, why don't you switch over to course.fast.ai now and start watching through there?
00:01:25.480 | And make sure you read all of the material on the page before you start just to make sure that you've got everything you need
00:01:31.460 | The other thing to mention is that there is a really great, strong community at forums.fast.ai
00:01:39.020 | From time to time you'll find that you get stuck
00:01:44.560 | You may get stuck very early on, or you may not get stuck for quite a while, but at some point you might get stuck with understanding
00:01:52.680 | why something works the way it does, or there may be some computer problem that you have, and so forth
00:01:58.200 | On forums.fast.ai there are thousands of other learners talking about every lesson, and lots of other topics besides
00:02:06.040 | It's the most active deep learning community on the internet by far. So
00:02:10.220 | Definitely register there and start getting involved. You'll get a lot more out of this course if you do that
00:02:19.680 | So we're going to start by doing some coding. This is an approach
00:02:24.520 | We're going to be talking about in a moment called the top-down approach to study
00:02:29.160 | But let's learn it by doing it. So let's go ahead and try and actually train a neural network
00:02:36.320 | Now in order to train a neural network, you almost certainly want a GPU
00:02:42.140 | A GPU is a graphics processing unit
00:02:47.640 | It's the kind of chip that's used to help you play games better
00:02:53.240 | They let your computer render the game much more quickly than your CPU can
00:02:59.760 | We'll be talking about them more shortly. But for now, I'm going to show you how you can get access to a GPU
00:03:07.160 | Specifically you're going to need an Nvidia GPU because only Nvidia GPUs support something called CUDA
00:03:16.800 | CUDA is the language and framework that nearly all deep-learning
00:03:20.780 | libraries and practitioners use to do their work
00:03:25.160 | Obviously, it's not ideal that we're stuck with one particular vendor's cards, and over time
00:03:31.480 | We hope to see more competition in this space. But for now, we do need an Nvidia GPU
00:03:35.780 | Your laptop almost certainly doesn't have one unless you specifically went out of your way to buy like a gaming laptop
00:03:44.840 | So almost certainly you will need to rent one
00:03:49.240 | The good news is that renting access,
00:03:52.060 | paying by the second for a GPU-based computer, is pretty easy and pretty cheap
00:03:57.800 | I'm going to show you a couple of options
00:04:00.920 | The first option I'll show you which is
00:04:05.620 | probably the easiest, is called Crestle
00:04:09.200 | If you go to crestle.com and
00:04:14.600 | click on sign up, or if you've been there before, sign in,
00:04:17.240 | you will find yourself at this screen, which has a big button that says Start Jupyter and another switch called Enable GPU
00:04:25.560 | So if we make sure that is set to true, Enable GPU is on, and we click Start Jupyter
00:04:33.880 | We click Start Jupyter
00:04:35.880 | It's going to launch us into something called Jupyter Notebook
00:04:41.040 | Jupyter Notebook, in a recent survey of tens of thousands of data scientists, was rated as the third most important tool
00:04:48.920 | in the data scientist's toolbox. It's really important that you get to learn it well, and all of our courses will be run through Jupyter
00:04:55.760 | Yes, Rachel. You have a question or comment? Oh, I just wanted to point out that you get I believe 10 free hours
00:05:02.000 | So if you wanted to try cressel out
00:05:07.320 | Yeah, they might have changed that recently to fewer hours, but you can check the current pricing
00:05:12.680 | But you certainly get some free hours
00:05:14.680 | The pricing varies because this is actually runs on top of Amazon web services. So at the moment, it's 60 cents an hour
00:05:21.680 | The nice thing is, though, that you can always turn it on,
00:05:26.240 | you know, start your Jupyter without the GPU running, and pay a tenth of that price, which is pretty cool
00:05:34.160 | So Jupyter Notebook is something we'll be doing all of this course in, and so to get started here
00:05:39.160 | we're going to find our particular course, so we'd go to courses and
00:05:42.400 | We'd go to fast AI 2 and
00:05:46.040 | There they are
00:05:49.440 | Things have been moving around a little bit. So it may be in a different spot for you
00:05:53.400 | when you look at this, and we'll make sure all the current information is on the website
00:06:00.000 | Now, having said that, the Crestle approach, as you can see, is basically instant and easy
00:06:08.000 | But if you've got, you know, an extra hour or so to get going, an even better option is
00:06:16.880 | something called Paperspace
00:06:19.440 | Paperspace, unlike Crestle, doesn't run on top of Amazon. They have their own machines
00:06:29.800 | So here's Paperspace, and if I click on New Machine, I
00:06:38.400 | can pick which one of their three data centers to use, so pick the one closest to you. So I'll say West Coast, and
00:06:45.160 | then I'll say Linux, and I'll say Ubuntu 16.04
00:06:50.560 | And then it says choose machine and you can see there's various different machines I can choose from
00:06:57.760 | And pay by the hour
00:06:59.760 | So this is pretty cool. For 40 cents an hour, so it's cheaper than Crestle,
00:07:05.680 | I get a machine that's actually going to be much faster than Crestle's 60-cents-an-hour machine. Or, for 65 cents an hour,
00:07:12.480 | way, way, way faster, right?
00:07:15.040 | So I'm going to actually show you how to get started with the Paperspace approach
00:07:20.000 | Because that actually is going to do everything from scratch
00:07:25.400 | You may find if you try to use the 65-cents-an-hour one that it may require you to contact Paperspace to say
00:07:32.520 | why you want it. That's just an anti-fraud thing. So if you say fast.ai there,
00:07:40.880 | They'll quickly get you up and running. So I'm going to use the cheapest one here 40 cents an hour
00:07:45.360 | You can pick how much storage you want and
00:07:52.720 | Note that you pay for a month of storage as soon as you start the machine up
00:07:56.880 | Right, so don't start and stop lots of machines because each time you pay for that month of storage
00:08:01.160 | I think the 250 gig seven dollar a month option is pretty good
00:08:05.680 | but you really only need 50 gig, so if you're trying to minimize the price you can go there
00:08:09.560 | The only other thing you need to do is turn on public IP so that we can actually log into this and
00:08:17.520 | We can turn off auto snapshot, to save money by not having backups
00:08:21.700 | All right, so if you then click on Create Paperspace, about a minute later you will find
00:08:33.900 | that your machine will pop up. Here is my Ubuntu 16.04 machine
00:08:40.340 | If you check your email
00:08:43.780 | You will find that they have emailed you a password so you can copy that
00:08:51.880 | You can go to your machine and enter your password. Now, to paste the password
00:08:56.760 | you would press Ctrl-Shift-V, or on Mac I guess Cmd-Shift-V
00:09:01.920 | So it's slightly different to normal pasting or of course you can just type it in
00:09:07.400 | And here we are. Now, we can make a little bit more room here by clicking on these little arrows. I
00:09:14.960 | can zoom in a little bit
00:09:17.720 | And so as you can see we've got like a terminal that's sitting inside
00:09:22.240 | Our browser which is kind of quite a handy way to do it
00:09:26.000 | So now we need to configure this for the course, and the way you configure it for the course is you type
00:09:36.760 | http://files.fast.ai/setup/paperspace
00:09:49.640 | Okay, and so that's then going to run a script which is going to set up all of the CUDA drivers
00:09:56.520 | the special Python
00:09:59.280 | distribution we use called Anaconda, all of the libraries, all of the courses,
00:10:06.440 | And the data we use for the first part of the course
00:10:10.280 | Okay, so that takes an hour or so and when it's finished running you'll need to reboot your computer
00:10:17.960 | So to reboot not your own computer
00:10:20.240 | but your Paperspace computer. And so to do that, you can just click on this little circular restart machine button
00:10:26.080 | Okay, and when it comes back up you'll be ready to go. So what you'll find
00:10:31.040 | is that you've now got an anaconda3 directory. That's where your Python is
00:10:37.400 | You've got a data directory, which contains the data for the first part of this course, first lesson, which is dogs and cats
00:10:44.600 | And you've got a fastai directory
00:10:49.880 | That contains everything for this course
00:10:52.280 | so what you should do is
00:10:55.320 | cd fastai, and from time to time you should run git pull, and that will just make sure that all of your
00:11:04.040 | fastai stuff is up to date. And also from time to time
00:11:07.960 | you might want to just check that your Python libraries are up to date, and so you can type conda env update
00:11:13.320 | to do that
00:11:15.960 | All right, so make sure that you've cd'd into fastai, and then you can type jupyter notebook
00:11:23.160 | All right, there it is
00:11:28.000 | So we now have a Jupyter notebook server running, and we want to connect to that, right? And so you can see here
00:11:34.720 | It says copy paste this URL
00:11:36.720 | Into your browser when you connect so if you double click on it
00:11:41.000 | Then that will actually
00:11:43.720 | That will actually copy it for you
00:11:48.160 | Then you can go and paste it, but you need to change this localhost
00:11:53.680 | to be the Paperspace IP address. So if you click on the little arrows to go smaller,
00:11:59.160 | You can see the IP address is here
00:12:01.360 | so I'll just copy that and
00:12:03.960 | paste it
00:12:06.760 | where it used to say localhost, okay?
00:12:08.800 | So it's now HTTP and then my IP and then everything else I copied before and so there it is
00:12:15.360 | So this is the fast.ai
00:12:19.360 | git repo, and our courses are all in courses, and in there the deep learning part one is dl1, and
00:12:27.320 | in there you will find
00:12:30.120 | lesson1.ipynb, an IPython notebook
00:12:34.080 | So here we are ready to go
00:12:41.400 | Depending on whether you're using Crestle or Paperspace or something else, if you check course.fast.ai
00:12:46.560 | we'll keep putting additional videos and links to information about how to set up other
00:12:51.080 | good Jupyter notebook
00:12:53.600 | providers as well
00:12:55.920 | So to run a cell in Jupyter notebook
00:13:01.500 | You select the cell and you hold down shift and press enter or if you've got the toolbar showing
00:13:08.480 | You can just click on the little run button, so you'll notice that some cells contain
00:13:15.040 | code, and some contain text, and some contain pictures, and some contain videos. So this environment basically
00:13:22.840 | is a way that we can give you access to a way to run
00:13:29.260 | experiments, and to kind of tell you what's going on, show pictures
00:13:33.660 | This is why it's such a popular tool in data science: data science is kind of all about running experiments,
00:13:41.960 | really
00:13:44.120 | So let's go ahead and click run
00:13:46.400 | And you'll see that cell's number turn into a star for a moment, and then it finished running
00:13:52.400 | Okay, so let's try the next one this time instead of using the toolbar. I'm going to hold down shift and press enter
00:13:58.100 | And you can see again
00:14:00.080 | It turned into a star, and then it said 2. So if I hold down shift and keep pressing enter, it just keeps running each
00:14:06.080 | Cell right so I can put anything I like for example one plus one
00:14:10.640 | is two
00:14:15.600 | What we're going to do is we're going to...
00:14:17.840 | Yes, Rachel. Oh, this is just a side note, but I wanted to point out that we're using Python 3 here
00:14:24.400 | Yes, thank you, Python 3. And so you'll get some errors if you're still using Python 2. Mm-hmm. Yeah
00:14:29.560 | And it is important to switch to Python 3, you know, now. Well, for fast.ai it's required,
00:14:37.480 | But you know increasingly a lot of libraries are
00:14:42.040 | removing support for Python 2
00:14:44.040 | Thanks Rachel
00:14:47.400 | Now it mentions here that you can download the data set for this lesson from this location
00:14:54.160 | if you're using
00:14:57.040 | Crestle or
00:14:58.360 | the Paperspace script that we just used to set up, then this will already be made available for you
00:15:03.680 | Okay, if you're not, you'll need to wget it, as shown
00:15:06.360 | Now, Crestle is
00:15:10.480 | quite a bit slower than Paperspace, and also
00:15:14.040 | there are some particular things it doesn't support that we really need, and so there are a couple of extra steps if you're using
00:15:21.600 | Crestle: you have to run two more cells. Right, so you can see these are commented out
00:15:26.080 | They've got hashes at the start
00:15:27.320 | So if you remove the hashes from these and run these two additional cells, that just runs the stuff that you only
00:15:34.320 | need for Crestle. I'm using Paperspace, so I'm not going to run it
00:15:38.600 | okay, so
00:15:40.600 | Inside our
00:15:43.600 | data, we set up this path to data/dogscats
00:15:47.880 | That's pre-set-up for you, and so inside there, you can see here, I can use an exclamation mark
00:15:55.800 | to basically say I don't want to run Python, I want to run bash,
00:15:59.480 | I want to run shell. So this runs a bash command, and the bit inside the curly brackets
00:16:05.720 | actually refers to a Python variable, so it inserts that Python variable into the bash command
00:16:11.000 | So here is the contents of our folder
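As an aside, that exclamation-mark-plus-curly-brackets mechanism can be sketched in plain Python. This is a rough stand-in for what the notebook does, using a throwaway temp directory instead of the real data/dogscats folder:

```python
import subprocess
import tempfile
from pathlib import Path

# Stand-in for the lesson's PATH variable ("data/dogscats/"):
PATH = Path(tempfile.mkdtemp())
(PATH / "train").mkdir()
(PATH / "valid").mkdir()

# In the notebook, `!ls {PATH}` substitutes the Python variable into the
# bash command before running it. Outside Jupyter, the equivalent is roughly:
result = subprocess.run(f"ls {PATH}", shell=True,
                        capture_output=True, text=True)
print(result.stdout.split())  # the folder names, e.g. ['train', 'valid']
```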
00:16:13.800 | There's a training set and a validation set if you're not familiar with the idea of training sets and validation sets
00:16:21.040 | It would be a very good idea to check out our
00:16:24.080 | practical machine learning course
00:16:27.040 | which tells you a lot about this kind of stuff, like the basics of how to set up and run machine learning
00:16:34.360 | projects more generally
00:16:36.360 | Would you recommend that people take that course before this one?
00:16:40.340 | Actually, a lot of students, you know, as they went through these, said that they liked doing them together
00:16:46.920 | So you can kind of check it out and and see
00:16:50.200 | the machine learning course
00:16:53.320 | Yeah, they cover some similar stuff but from different directions, so people who have done both, you know, say they find that
00:17:01.760 | they each support each other. I wouldn't say it's a prerequisite
00:17:05.720 | But you know, if I say something like, hey,
00:17:08.760 | this is a training set and this is a validation set, and you're going, I don't know what that means,
00:17:12.000 | at least Google it, do a quick read, you know, because we're assuming
00:17:15.480 | That you know the very basics of kind of what machine learning is and does to some extent
00:17:23.260 | And I have a whole blog post on this topic as well
00:17:26.320 | Okay, and we'll make sure that you link to that from course.fast.ai
00:17:29.680 | And I also just wanted to say in general with fast.ai our philosophy is to
00:17:34.080 | Kind of learn things on an as-needed basis. Yeah exactly don't try and learn everything that you think you might need first
00:17:41.560 | Otherwise you'll never get around to learning the stuff you actually want to learn
00:17:44.360 | Exactly and that shows up in deep learning. I think
00:17:47.200 | particularly a lot yes
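The training-set/validation-set idea mentioned above can be sketched in a few lines of plain Python. The filenames here are made up for illustration; the point is just that the validation set is a random subset you hold out and never train on:

```python
import random

# A made-up list of labeled image filenames; in the lesson the split is
# already given to you as train/ and valid/ folders.
files = ([f"cat.{i}.jpg" for i in range(100)]
         + [f"dog.{i}.jpg" for i in range(100)])
random.seed(42)
random.shuffle(files)

valid_frac = 0.2                       # hold out 20% for validation
n_valid = int(len(files) * valid_frac)
valid, train = files[:n_valid], files[n_valid:]

print(len(train), len(valid))  # 160 40
```

The model only ever sees `train`; the accuracy you report comes from `valid`, so it measures how well the model generalizes rather than how well it memorized.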
00:17:50.040 | Okay, so in our validation folder
00:17:53.560 | There's a cats folder and a dogs folder and then inside the validation cats folder is a whole bunch of JPEGs
00:18:00.400 | The reason that it's set up like this is that this is kind of the most common standard approach for how
00:18:06.940 | image classification data sets are shared and provided, and the idea is that each folder
00:18:13.120 | tells you the label. So each of these
00:18:17.640 | images is labeled cat, and each of the images in the dogs folder is labeled dog, okay?
00:18:23.560 | This is how Keras works as well for example
00:18:26.560 | So this is a pretty standard way to share image classification
00:18:33.800 | files
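A minimal sketch of that folders-as-labels convention, using a throwaway temp directory with placeholder files rather than the real dogs-and-cats data:

```python
import tempfile
from pathlib import Path

# Recreate the dogscats-style layout: the folder an image sits in is its label.
root = Path(tempfile.mkdtemp())
for split in ("train", "valid"):
    for label in ("cats", "dogs"):
        d = root / split / label
        d.mkdir(parents=True)
        (d / f"{label[:-1]}.0.jpg").touch()   # placeholder image file

# Any library reading this layout effectively derives (filename, label) pairs:
labeled = [(p.name, p.parent.name)
           for p in sorted((root / "valid").rglob("*.jpg"))]
print(labeled)  # [('cat.0.jpg', 'cats'), ('dog.0.jpg', 'dogs')]
```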
00:18:37.000 | So we can have a look
00:18:38.800 | So if you go plt.imshow
00:18:40.800 | We can see an example of the first of the cats
00:18:45.920 | If you haven't seen
00:18:47.920 | This before this is a Python
00:18:49.920 | 3.6 format string so you can Google for that if you haven't seen it
00:18:54.200 | It's a very convenient way to do string formatting, and we use it a lot
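A tiny illustration of such a format string, with a made-up path and file list:

```python
PATH = "data/dogscats/"
files = ["cat.1.jpg", "cat.2.jpg"]

# The expressions inside the braces are evaluated at runtime:
first_cat = f"{PATH}valid/cats/{files[0]}"
print(first_cat)  # data/dogscats/valid/cats/cat.1.jpg
```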
00:18:57.080 | So there's our cat, but we're going to mainly be interested in the underlying data that makes up that cat
00:19:05.160 | so specifically
00:19:07.760 | It's an image whose shape, that is, the dimensions of the array, is 198 by 179 by 3
00:19:15.080 | So it's a three-dimensional array also called a rank 3 tensor
00:19:18.520 | And here are the first four rows and four columns of that image
00:19:23.560 | so as you can see
00:19:26.760 | each of those
00:19:28.640 | cells has three
00:19:30.640 | Items in it, and this is the red green and blue pixel values between 0 and 255
00:19:37.160 | So here's a little subset of what a picture actually looks like inside your computer
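You can reproduce the idea with a made-up numpy array of the same shape as that cat photo (random values standing in for real pixels):

```python
import numpy as np

# A stand-in for the cat photo: height x width x 3 colour channels,
# one value in 0-255 per channel -- a three-dimensional array,
# also called a rank-3 tensor.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(198, 179, 3), dtype=np.uint8)

print(img.shape)    # (198, 179, 3)
print(img.ndim)     # 3, i.e. rank 3
print(img[:4, :4])  # the first four rows and columns, as in the lesson
```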
00:19:43.320 | So that's that. Our idea will be to take these kinds of numbers and
00:19:48.520 | Use them to predict whether those kinds of numbers represent a cat
00:19:52.340 | Or a dog based on looking at lots of pictures of cats and dogs
00:19:56.640 | So that's a pretty hard thing to do. And at the point in time when this
00:20:02.480 | data set was released (it actually comes from a Kaggle competition, the Dogs vs. Cats Kaggle competition), which I think
00:20:10.360 | was 2012,
00:20:11.720 | the state of the art was 80% accuracy, so computers weren't really able to at all accurately recognize dogs versus cats
00:20:20.060 | So let's go ahead and train a model
00:20:29.960 | Here are the three lines of code necessary to train a model
00:20:34.680 | And so let's go ahead and run it, so I'll click on the cell, and I'll press shift enter
00:20:42.160 | Then we'll wait a couple of seconds for it to pop up and there it goes
00:20:47.200 | Okay, and it's training
00:20:51.660 | So I've asked it to do three epochs so that means it's going to look at every image
00:20:55.440 | Three times in total or look at the entire set of images three times
00:20:59.560 | That's what we mean by an epoch, and as it does, it's going to print out
00:21:05.880 | the accuracy, which is the last of the three numbers that prints out, on the validation set, okay?
00:21:11.280 | The first two numbers we'll talk about later
00:21:14.120 | In short, they're the values of the loss function, which is in this case the cross-entropy loss,
00:21:18.520 | for the training set and the validation set, and then right at the start here is the epoch number
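As a sketch of what that cross-entropy number measures, here is a hand-rolled numpy version for the binary case, with made-up labels and predictions (not the actual loss code the library runs):

```python
import numpy as np

def binary_cross_entropy(y_true, p):
    """Mean cross-entropy loss for binary labels (1 = dog, 0 = cat)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)   # guard against log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])            # made-up true labels
p = np.array([0.9, 0.1, 0.8, 0.95])   # made-up predicted P(dog)
print(float(binary_cross_entropy(y, p)))  # about 0.121
```

Confident, correct predictions drive the loss toward zero; confident wrong ones make it blow up, which is why it's a better training signal than accuracy alone.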
00:21:23.200 | So you can see it's getting about
00:21:26.360 | 99 percent accuracy
00:21:29.480 | And it took 17 seconds so you can see we've come a long way since
00:21:35.280 | 2012 and in fact even in the competition
00:21:38.460 | This actually would have won the Kaggle competition of that time; the best in the Kaggle competition was 98.9
00:21:46.060 | And we're getting about 99%
00:21:48.160 | so this may surprise you that we're getting a
00:21:52.680 | you know, Kaggle-winning as of the end of 2012, early 2013,
00:21:57.480 | image classifier in 17 seconds
00:22:05.880 | and three lines of code
00:22:07.880 | And I think that's because like a lot of people assume that deep learning takes a huge amount of time
00:22:14.800 | And lots of resources and lots of data and as you'll learn in this course
00:22:20.400 | That in general isn't true
00:22:23.400 | One of the ways we've made it much simpler is that this code is written on top of a library we built
00:22:32.560 | imaginatively called fastai
00:22:34.560 | The fastai library is basically a library which takes all of the
00:22:39.960 | best-practice approaches that we can find. And so each time a paper comes out that looks interesting,
00:22:46.960 | we test it out, and if it works well for a variety of data sets and we can figure out how to tune it,
00:22:52.020 | we implement it in fastai. And so fastai kind of curates all this stuff and packages it up for you, and
00:22:58.480 | much of the time, or most of the time, kind of automatically figures out the best way to handle things
00:23:03.420 | So the fastai library is why we were able to do this in just three lines of code
00:23:07.560 | And the reason that we were able to make the fastai library work
00:23:11.760 | so well is because it in turn sits on top of something called PyTorch,
00:23:16.120 | which is a
00:23:18.680 | really flexible deep learning and machine learning and GPU computation library written by Facebook
00:23:27.600 | Most people are more familiar with TensorFlow than PyTorch, because Google markets it pretty heavily,
00:23:33.960 | but most of the top researchers I know nowadays, at least the ones that aren't at Google, have switched across to PyTorch
00:23:40.680 | Yes, Rachel? And we'll be covering some PyTorch later in the course. Yeah, I mean, one of the things that
00:23:46.880 | hopefully you'll really like about fastai is that it's really flexible, in that you can use all these kind of curated best practices as
00:23:56.560 | much or as little as you want, and so it's really easy to hook in at any point and write your own
00:24:02.040 | data augmentation, write your own loss function, write your own network architecture, whatever, and so we'll do all of those things
00:24:09.420 | in this course
00:24:12.000 | So what does this model look like?
00:24:14.360 | well, what we can do is we can
00:24:17.640 | take a look at what the validation set's
00:24:22.560 | dependent variable, the y, looks like, and it's just a bunch of zeros and ones, right?
00:24:27.160 | So if we look at data.classes, the zeros represent cats and the ones represent dogs
00:24:32.760 | You'll see here there's basically two objects I'm working with: one is an object called data,
00:24:36.980 | which contains the validation and training data, and another one is the object called learn, which contains the model, right?
00:24:44.120 | So anytime you want to find something out about the data we can look inside data
00:24:49.320 | So we want to get predictions for the validation set, and so to do that we can call learn.predict
00:24:57.760 | So you can see here the first ten predictions and what it's giving you is prediction for dog and a prediction for cat
00:25:05.200 | Now, the way PyTorch generally works, and therefore fastai also works, is that most models return the log
00:25:14.280 | of the predictions rather than the probabilities themselves. We'll learn why that is later in the course
00:25:19.900 | So for now, recognize that to get your probabilities you have to take
00:25:23.620 | e to the power of the prediction
00:25:26.600 | You'll see here we're using numpy. np is numpy. If you're not familiar with numpy,
00:25:32.720 | That is one of the things that we assume that you have some familiarity with
00:25:36.400 | So be sure to check out the material on course.fast.ai to learn the basics of numpy,
00:25:44.840 | the way that Python handles all of the
00:25:48.080 | fast numerical programming, array computation, that kind of thing
00:25:54.860 | Okay, so we can get the probabilities using that
00:25:59.300 | using np.exp
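A small numpy sketch of that exp step, with hypothetical log predictions (the threshold line uses the 1 = dog, 0 = cat convention from the lesson):

```python
import numpy as np

# Hypothetical log probabilities, shaped like a column of model output
log_preds = np.array([-11.5, -1e-5, -0.2, -7.3])

probs = np.exp(log_preds)   # undo the log to get probabilities in [0, 1]
is_dog = probs > 0.5        # 1 = dog, 0 = cat convention
print(probs.round(5))
print(is_dog)
```

A log probability near 0 means a probability near 1 (confident dog), while a large negative value like -11.5 means a probability near 0 (confident cat).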
00:26:02.120 | There's a few functions here that you can look at yourself if you're interested, but just some plotting functions that we'll use
00:26:07.600 | And so we can now plot
00:26:11.640 | some random correct
00:26:13.720 | images. And so here are some images that it was correct about, okay? And so remember, one is a dog,
00:26:22.360 | so anything greater than 0.5 is a dog and anything close to 0 is a cat. So this one, at 10 to the negative 5, is obviously a cat
00:26:29.400 | Here are some which are incorrect
00:26:32.320 | Right, so you can see that some of these which it got incorrect are, you know, obviously images that shouldn't be there at all
00:26:41.320 | But clearly this one, which it called a dog, is not at all a dog, so there are some obvious mistakes
00:26:48.320 | We can also take a look at
00:26:53.160 | which cats it is the most confident are cats, and which dogs are the most dog-like, the most confident dogs
00:27:02.320 | Perhaps more interestingly, we can also see which cats it is the most confident are actually dogs,
00:27:09.000 | so which ones it is the most wrong about, and
00:27:11.960 | the same thing for the dogs that it really thinks are cats. And again, some of these are just
00:27:18.640 | Pretty weird. I guess there is a dog in there. Yes, Rachel
00:27:22.700 | I'll just say, do you want to say more about why you would want to look at your data?
00:27:26.680 | Yeah, sure
00:27:29.920 | So yeah, so finally I just mentioned the last one we've got here is to see which ones have the probability closest to 0.5
00:27:38.560 | So these are the ones that the model knows it doesn't really know what to do with, and some of these it's not surprising
00:27:44.520 | So yeah, I mean this is kind of like
00:27:48.760 | Always the first thing I do after I build a model is to try to find a way to like visualize what it's built
00:27:56.640 | Because if I want to make the model better
00:27:59.800 | Then I need to take advantage of the things it's doing well and fix the things it's doing badly. So in this case
00:28:07.640 | And often this is the case. I've learned something about the data set itself
00:28:11.240 | Which is that there are some things that are in here that probably shouldn't be
00:28:14.600 | But it's also clear that the
00:28:20.800 | model has room to improve. Like, to me, that's pretty obviously a
00:28:25.840 | dog. But one thing I'm suspicious about here is this image is very
00:28:31.440 | kind of fat and
00:28:34.600 | short and
00:28:37.240 | As we'll learn,
00:28:39.160 | the way these algorithms work is they kind of grab a square piece at a time
00:28:44.320 | So this rather makes me suspicious that we're going to need to use something called data augmentation
00:28:49.080 | that we'll learn about later, to handle this properly
00:28:53.320 | Okay, so
00:28:58.160 | That's it right we've now built
00:29:03.000 | We've now built an image classifier and something that you should try now is to grab some data
00:29:11.240 | yourself
00:29:13.720 | some pictures of
00:29:15.720 | two or more different types of thing, put them in different folders, and run the same three lines of code
00:29:22.720 | On them, okay, and you'll find
00:29:26.960 | that it will work for that as well, as long as they are pictures of things like
00:29:33.160 | the kinds of things that people normally take photos of, right? So if they're
00:29:37.800 | microscope pictures or pathology pictures or
00:29:41.840 | CT scans or something this won't work very well as we'll learn about later
00:29:47.360 | There are some other things we'd need to do to make that work. But for things that look like normal photos,
00:29:54.760 | you can run exactly the same three lines of code and just point your
00:29:59.440 | path variable somewhere else
00:30:02.440 | To get your own image classifier
00:30:05.320 | so for example
00:30:07.160 | one student
00:30:09.120 | took those three lines of code, downloaded from Google Images
00:30:12.840 | ten examples of pictures of people playing cricket and ten examples of people playing baseball, and built a classifier
00:30:19.800 | Of those images which was nearly perfectly correct
00:30:23.920 | the same
00:30:25.400 | student actually also tried downloading seven pictures of
00:30:29.360 | Canadian currency and seven pictures of American currency, and again in that case the model was a hundred percent
00:30:37.280 | Accurate so you can just go to Google images if you like and download a few things of a few different classes and see
00:30:43.440 | what works, and tell us on the forum both your successes and your failures
00:30:52.280 | So what we just did was to
00:30:54.280 | Train a neural network, but we didn't first of all tell you what a neural network is or what training means or
00:31:02.160 | anything
00:31:04.480 | Why is that? Well, this is the start of our top-down approach to learning
00:31:11.140 | And basically the idea is that unlike the way math and technical subjects are usually taught
00:31:17.760 | where you learn every little element piece by piece and you don't actually get to put them all together and
00:31:23.680 | build your own image classifier until the third year of graduate school, our approach is to say from the start:
00:31:31.600 | Hey, let's show you how to train an image classifier and now you can start doing stuff
00:31:36.700 | And then gradually we dig deeper and deeper and deeper
00:31:42.760 | so the idea is that
00:31:46.160 | Throughout the course you're going to see like new problems that we want to solve
00:31:50.640 | So for example in the next lesson, we'll look at well
00:31:54.620 | What if we're not looking at normal kinds of photos, but we're looking at satellite images
00:32:01.180 | And we'll see why it is that this approach that we're learning today doesn't quite work as well
00:32:06.000 | And what things do we have to change and so we'll learn enough about the theory to understand why that happens
00:32:12.100 | And then we'll learn about the libraries and how we can change things with the libraries to make that work better
00:32:20.440 | So during the course we're gradually going to learn to solve more and more problems as we do
00:32:25.480 | So we'll need to learn more and more parts of the library more and more bits of the theory until by the end
00:32:31.960 | We're actually going to learn how to create a
00:32:35.040 | world-class
00:32:37.440 | neural net architecture from scratch and our own training loop from scratch, and so we'll actually build everything
00:32:44.240 | ourselves
00:32:45.840 | So that's the general
00:32:47.760 | Approach. Yes, Rachel and we sometimes also call this the whole game
00:32:52.240 | Which is inspired by Harvard professor David Perkins
00:32:57.440 | And so the idea with the whole game is like this is more like how you would learn baseball or music
00:33:02.280 | With baseball you would get taken to a ball game. You would learn what baseball is
00:33:07.240 | You would start playing it and it would only be years later that you might learn about the physics of how curveball works
00:33:14.720 | For example or with music we put an instrument in your hand and you start
00:33:20.040 | Banging the drum or hitting the xylophone and it's not until years later that you learn about the circle of fifths and understand
00:33:26.960 | How to construct a cadence for example
00:33:29.160 | So yeah, so that's this is kind of the approach we're using it's very inspired by
00:33:34.840 | David Perkins and other writers on education
00:33:37.680 | So what that means is: to take advantage of this, as we peel back the layers
00:33:43.440 | we want you to keep looking under the hood yourself as well, and experiment a lot, because this is a very code-driven
00:33:51.880 | Approach so here's basically what happens right? We start out looking today at
00:33:57.280 | convolutional neural networks for images and then in a couple of lessons
00:34:02.000 | We'll start to look at how to use neural nets to look at structured data and then to look at language data and then to look
00:34:08.960 | at recommendation system data
00:34:10.960 | And then we kind of then take all of those steps and we go backwards through them in reverse order
00:34:18.040 | So by the end of that fourth piece, that is,
00:34:22.120 | by the end of lesson four, you will know how to create a world-class image classifier, a world-class
00:34:30.160 | structured data analysis program, a world-class language classifier, and a world-class recommendation system
00:34:36.660 | And then we're going to go back over all of them again and learn in depth about like well
00:34:41.240 | What exactly did it do and how did it work?
00:34:43.360 | And how do we change things around and use it in different situations for for the recommendation systems structured data?
00:34:51.000 | Images and then finally back to language. So that's how it's going to work
00:34:56.680 | So what that kind of means is that most students find that they tend to watch the videos two or three times
00:35:04.280 | but not like,
00:35:06.720 | watch lesson one two or three times, then lesson two two or three times, then lesson three two or three times,
00:35:11.240 | but they do the whole thing end to end, lessons one through seven, and then go back and start lesson one again
00:35:18.280 | That's an approach which a lot of people find when they want to kind of go back and understand all the details
00:35:23.840 | that can work pretty well. So I would say, you know, aim to get through to the end of lesson seven
00:35:30.220 | as quickly as you can, rather than aiming to fully understand every detail from the start
00:35:39.040 | So basically the plan is that in today's lesson you learn
00:35:46.760 | in as few lines of code as possible, with as few details as possible,
00:35:52.200 | how you actually build an image classifier with deep learning, in this case to say:
00:35:57.800 | Hey, here are some pictures of dogs as opposed to pictures of cats
00:36:01.760 | Then we're going to learn
00:36:05.200 | how to look at different kinds of images, and particularly we're going to look at images from satellites
00:36:11.840 | and we're going to say, for a satellite image,
00:36:13.960 | what kinds of things might you be seeing in that image, and there could be multiple things that we're looking at, so a multi-label
00:36:21.600 | classification problem
00:36:23.400 | From there, we'll move to something which is perhaps the most widely applicable for the most people
00:36:29.540 | Which is looking at what we call structured data
00:36:32.300 | so,
00:36:35.080 | data that kind of comes from
00:36:37.360 | Databases or spreadsheets, so we're going to specifically look at this data set of predicting sales
00:36:43.080 | The number of things that are sold at different stores on different dates
00:36:48.840 | Based on different holidays and and so on and so forth and so we're going to be doing this sales forecasting
00:36:54.660 | exercise
00:36:56.520 | After that we're going to look at language, and we're going to figure out
00:37:00.620 | what this person
00:37:03.120 | thinks about the movie Zombiegeddon
00:37:05.120 | And we'll be able to figure out how to create just like we create image classifiers for any kind of image
00:37:10.800 | We'll learn to create NLP classifiers to classify any kind of language in lots of different ways
00:37:18.720 | Then we'll look at something called collaborative filtering which is used mainly for recommendation systems
00:37:23.840 | We're going to be looking at this data set that showed for different people for different movies. What rating did they give it?
00:37:30.200 | Here are some of the movies and so
00:37:32.760 | This is maybe an easier way to think about it
00:37:35.560 | Is there are lots of different users and lots of different movies and then for each one we can look up for each user
00:37:41.480 | How much they like that movie, and the goal will be of course to predict, for user-movie combinations
00:37:47.840 | We haven't seen before are they likely to enjoy that movie or not and that's the really common approach used for like
00:37:55.640 | Deciding what stuff to put on your home page when somebody's visiting
00:37:59.400 | you know, what book might they want to read, or what film might they want to see, and so forth
00:38:03.880 | From there we're going to then dig back into language a bit more and we're going to look at
00:38:12.080 | Actually, we're going to look at the writings of Nietzsche the philosopher and learn how to create our own Nietzsche philosophy from scratch
00:38:19.780 | character by character
00:38:21.320 | So this here perhaps that every life of values of blood of intercourse when it senses there is unscrupulous his very rights and still impulse
00:38:28.860 | Love is not actually Nietzsche
00:38:31.240 | That's actually like some character by character generated text that we built with this recurrent neural network
00:38:41.280 | And then finally we're going to loop all the way back to computer vision again
00:38:44.680 | We're going to learn how not just to recognize cats from dogs
00:38:48.440 | But to actually find like where the cat is with this kind of heat map
00:38:52.160 | And we're also going to learn how to write our own architectures from scratch
00:38:56.800 | So this is an example of a ResNet, which is the kind of network that we
00:39:01.280 | are using in today's lesson for computer vision
00:39:04.880 | And so we'll actually end up building the network and the training loop from scratch
00:39:09.900 | And so they're basically the the steps that we're going to be taking from here and at each step. We're going to be getting into
00:39:16.200 | Increasing amounts of detail about how to actually do these things yourself
00:39:21.320 | So we've actually heard back from our students of past courses about what they found and
00:39:30.020 | one of the things that we've heard a lot of students say is that they spend too much time on theory and
00:39:39.880 | research
00:39:41.080 | And not enough time running the code
00:39:43.080 | And even after we give people this warning, they still come to the end of the course and often say: I wish I had
00:39:50.500 | taken that
00:39:52.160 | advice more seriously, which is to keep running code
00:39:55.080 | So these are actual quotes from our forum in retrospect
00:39:59.280 | I should have spent the majority of my time on the actual code and the notebooks
00:40:03.780 | See what goes in see what comes out
00:40:10.400 | This idea that you can create
00:40:14.120 | World-class models in a code first approach learning what you need as you go
00:40:19.520 | It's very different to a lot of the advice you'll read out there such as this
00:40:23.640 | person on Hacker News who claimed that the best way to become an ML engineer is to
00:40:32.080 | Learn all of math learn C and C++ learn parallel programming learn ML
00:40:38.920 | Algorithms implement them yourself using plain C and finally start doing ML
00:40:43.840 | So we would say if you want to become an effective practitioner do exactly the opposite of this
00:40:50.240 | Yes, Rachel. Oh, yeah, I'm just highlighting that this is
00:40:53.920 | We think this is bad advice and this can be very discouraging for a lot of people to come across. Yeah
00:41:00.760 | It's, you know, we now have thousands or tens of thousands of people that have done this course and
00:41:09.160 | Lots and lots of examples of people who are now
00:41:11.820 | running research labs or
00:41:14.680 | Google brain residents or you know
00:41:17.580 | Have created patents based on deep learning and so forth who have done it by doing this course
00:41:22.880 | So the top-down approach works super well
00:41:27.560 | Now one thing to mention: we've now already learned how you can actually train a world-class image classifier in
00:41:35.840 | 17 seconds, I should mention by the way the first time you run that code
00:41:41.600 | there are two things it has to do that take more than 17 seconds one is that it downloads a
00:41:47.440 | Pre-trained model from the internet. So you'll see the first time you run it. It'll say downloading model
00:41:53.160 | So that takes a minute or two
00:41:57.360 | The other is that the first time you run it, it precomputes and caches
00:42:00.200 | Some of the intermediate information that it needs and that takes about a minute and a half as well
00:42:06.120 | So if the first time you run it it takes
00:42:08.600 | three or four minutes
00:42:10.920 | to download and precompute stuff, that's normal. If you run it again, you should find it takes
00:42:16.080 | 20 seconds or so
00:42:20.320 | Image classifiers, you know, you may not feel like you need to recognize cats versus dogs very often on a computer
00:42:28.600 | You can probably do it yourself pretty well
00:42:30.720 | But what's interesting is that these image classification algorithms are really useful for lots and lots of things
00:42:38.120 | For example
00:42:41.760 | AlphaGo, which beat the Go world champion: the way it worked was to use something
00:42:49.480 | At its heart that looked almost exactly like our dogs versus cats image classification algorithm
00:42:56.360 | It looked at thousands and thousands of go boards
00:43:00.800 | And for each one there was a label saying whether that go board ended up being the winning or the losing
00:43:07.400 | player and so it learnt
00:43:10.320 | Basically an image classification that was able to look at a go board and figure out whether it was a good go board or a bad
00:43:17.000 | Go board and that's really the key most important
00:43:20.800 | Step in playing go. Well is to know which which move is better
00:43:25.720 | Another example is one of our earlier students who actually
00:43:32.280 | got a couple of patents for this work
00:43:35.360 | looked at anti-fraud
00:43:38.160 | He had lots of examples of his customers' mouse movements, because they provided kind of this
00:43:46.400 | User tracking software to help avoid fraud and so he took the the mouse paths
00:43:52.540 | basically of the users on his customers websites
00:43:56.680 | Turned them into pictures of where the mouse moved and how quickly it moved
00:44:01.800 | And then built a image classifier that took those images
00:44:06.680 | as input, and as output: was that a fraudulent transaction or not?
00:44:12.480 | And turned out to get you know really great results for his company so image classifiers
00:44:18.440 | are much more flexible than you might imagine
00:44:26.240 | So this is how you know some of the ways you can use deep learning specifically for image recognition and
00:44:32.480 | It's worth understanding that
00:44:35.840 | deep learning is not
00:44:39.520 | You know just a word that means the same thing as machine learning
00:44:42.680 | Like what is it that we're actually doing here when we're doing deep learning?
00:44:46.400 | Instead deep learning is a kind of machine learning
00:44:50.400 | So machine learning was invented by this guy Arthur Samuel, who was pretty amazing. In the late 50s
00:44:57.060 | he got this IBM mainframe to play checkers better than he could, and the way he did it
00:45:04.080 | was he invented machine learning he got the
00:45:07.520 | Mainframe to play against itself
00:45:09.520 | Lots of times and figure out which kinds of things led to victories and which kinds of things didn't
00:45:15.680 | And used that to kind of almost write its own program
00:45:19.320 | And Arthur Samuel actually said in 1962 that he thought that one day the vast majority of computer software
00:45:26.560 | Would be written using this machine learning approach rather than written by hand by writing the loops and so forth by hand
00:45:35.400 | So I guess that hasn't happened yet, but it seems to be in the process of happening
00:45:41.400 | I think one of the reasons it didn't happen for a long time is because traditional machine learning actually was very difficult and very
00:45:49.820 | Knowledge and time intensive so for example here's something called the computational pathologist or CPath
00:45:57.560 | from a guy called Andy Beck, back when he was at Stanford
00:46:03.160 | He's now moved on to
00:46:05.320 | Somewhere on the East Coast Harvard, I think
00:46:08.400 | And what he did was he took these pathology slides of breast cancer
00:46:13.960 | biopsies, right and
00:46:17.000 | he worked with lots of pathologists to come up with ideas about what kinds of
00:46:23.280 | Patterns or features might be associated with
00:46:26.720 | sort of long-term survival versus
00:46:30.720 | dying quickly, basically, and so
00:46:35.800 | they came up with these ideas, like the relationship between epithelial nuclear neighbors,
00:46:39.320 | relationship between epithelial and stromal objects and so forth and so they came up with all of these ideas of features
00:46:45.880 | these are just a few of the hundreds that they thought of and then lots of
00:46:50.000 | smart computer programmers wrote
00:46:52.840 | specialist algorithms to calculate all these different features, and then those
00:47:00.360 | features were passed into a logistic regression
00:47:02.580 | To predict survival and it ended up working very well
00:47:06.920 | It ended up that the survival predictions were more accurate than the pathologists' own survival predictions
00:47:15.080 | and so machine learning can work really well, but the point here is that this was a
00:47:19.720 | An approach that took lots of domain experts and computer experts
00:47:26.040 | Many years of work to actually to build this thing, right?
00:47:33.880 | We really want something
00:47:37.440 | something better and
00:47:40.000 | so specifically I'm going to show you something which rather than being a very specific function with all this very
00:47:48.080 | domain specific
00:47:51.120 | feature engineering we're going to try and create an infinitely flexible function a function that could solve any problem
00:47:58.000 | Right it would solve any problem if only you set the parameters of that function correctly
00:48:03.440 | And so then we need some all-purpose way of setting the parameters of that function
00:48:08.760 | And we would need that to be fast and scalable
00:48:11.220 | Right now if we had something that had these three things
00:48:14.000 | Then you wouldn't need to do this
00:48:17.080 | Incredibly time and domain knowledge intensive approach anymore instead we can learn all of those things
00:48:23.080 | with this
00:48:25.240 | with this algorithm
00:48:27.240 | So as you might have guessed
00:48:29.320 | The algorithm in question which has these three properties is called deep learning
00:48:34.440 | Or if not an algorithm, then maybe we would call it a class of algorithms
00:48:39.240 | Let's look at each of these three things in turn
00:48:43.560 | So the underlying function that deep learning uses is something called the neural network
00:48:49.240 | Now the neural network, we're going to learn all about it and implement it ourselves from scratch later on in the course
00:48:56.360 | But for now all you need to know about it is that it consists of a number of simple linear layers
00:49:03.200 | interspersed with a number of simple nonlinear layers
00:49:07.040 | And when you intersperse these layers in this way
00:49:12.880 | You get something called the universal approximation theorem and the universal approximation theorem says that this kind of function
00:49:21.800 | can solve any given problem
00:49:24.960 | to arbitrarily close accuracy, as long as you add enough parameters
00:49:31.880 | So it's actually provably shown to be an infinitely flexible function
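As a minimal sketch of that structure (pure Python, made-up random weights, not a trained model), here is a linear layer feeding a nonlinearity feeding another linear layer:

```python
import random

def relu(x):
    # the simple nonlinear layer: replace negatives with zero
    return [max(v, 0.0) for v in x]

def linear(x, W, b):
    # the simple linear layer: matrix-vector product plus bias
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Random placeholder parameters: 2 inputs -> 4 hidden units -> 1 output.
random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
b1 = [0.0] * 4
W2 = [[random.uniform(-1, 1) for _ in range(4)]]
b2 = [0.0]

def net(x):
    # linear layer, then nonlinearity, then linear layer
    return linear(relu(linear(x, W1, b1)), W2, b2)
```

The universal approximation theorem says that, with enough hidden units and the right values in `W1`, `b1`, `W2`, `b2`, this shape of function can approximate any given problem; gradient descent, coming up next, is how those values get found.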
00:49:38.520 | Right. So now we need some way to fit the parameters so that this infinitely flexible neural network solves some specific problem and
00:49:46.240 | so the way we do that is using a technique that
00:49:50.300 | probably most of you will have come across before at some stage called gradient descent and with gradient descent we basically say
00:49:57.680 | Okay, well for the different parameters we have
00:50:00.200 | how good are they at solving my problem, and let's figure out a slightly better set of parameters
00:50:08.440 | And a slightly better set of parameters and basically follow down
00:50:11.720 | The the surface of the loss function downwards. It's kind of like a marble going down to find the minimum and
00:50:19.440 | As you can see here depending on where you start you end up in different places
00:50:25.160 | These things are called local minima. Now interestingly, it turns out that for neural networks in particular
00:50:35.840 | There aren't actually multiple different
00:50:39.080 | local minima, there's basically just one, right, or to think of it another way
00:50:46.960 | There are different parts of the space which are all equally good
00:50:53.880 | Gradient descent therefore turns out to be actually an excellent way to
00:50:58.400 | Solve this problem of fitting parameters to neural networks
00:51:04.840 | The problem is though that we need to do it in a reasonable amount of time and
00:51:09.480 | It's really only thanks to GPUs that that's become possible
00:51:14.220 | So GPUs: this chart shows, over the last few years,
00:51:17.520 | how many gigaflops per second you can get out of a
00:51:23.920 | GPU that's the red and green versus a CPU. That's the blue right and this is on a log scale
00:51:31.760 | So you can see that generally speaking the GPUs are
00:51:35.680 | about 10 times faster than the CPUs and
00:51:40.720 | What's really interesting is that nowadays not only is the Titan X about 10 times faster than the E5-
00:51:50.180 | 2699 CPU, but the Titan X,
00:51:53.600 | well, actually a better one to look at would be the GTX 1080 Ti:
00:51:59.240 | that GPU costs about 700 bucks
00:52:01.240 | Whereas the CPU which is 10 times slower costs over $4,000
00:52:06.920 | So GPUs turn out to be able to solve these
00:52:11.800 | Neural network parameter fitting problems
00:52:15.960 | incredibly quickly
00:52:18.520 | And also incredibly cheaply so they've been absolutely key in bringing these three pieces together
00:52:27.800 | Then there's one more piece
00:52:29.640 | which is, as I mentioned, that in these neural networks you can intersperse multiple sets of linear and then nonlinear layers
00:52:36.960 | In the particular example that's drawn here there's actually only one
00:52:43.560 | what we call a hidden layer, one layer in the middle, and
00:52:46.480 | Something that we learned in the last few years is that these kinds of neural networks although they do
00:52:53.200 | Support the universal approximation theorem they can solve any given problem arbitrarily closely
00:52:59.320 | They require an exponentially increasing number of parameters to do so
00:53:05.000 | So they don't actually give us the fast and scalable property, even for reasonable-sized problems
00:53:10.240 | But we've since discovered that if you create multiple hidden layers
00:53:16.840 | Then you get super linear scaling so you can add a few more hidden layers
00:53:22.920 | to get
00:53:24.320 | multiplicatively
00:53:25.600 | more accuracy for multiplicatively more complex problems, and
00:53:29.240 | That is where it becomes called deep learning. So deep learning means a neural network with multiple hidden layers
00:53:36.680 | So when you put all this together, what happens is actually really amazing
00:53:45.120 | Google started investing in deep learning in 2012
00:53:53.200 | actually hired Geoffrey Hinton, who's kind of the father of deep learning, and his top student Alex Krizhevsky
00:54:00.040 | And they started trying to build a team that team became known as Google brain
00:54:08.680 | because
00:54:09.680 | Things with these three properties are so incredibly powerful and so incredibly flexible you can actually see over time
00:54:18.320 | How many projects at Google use deep learning?
00:54:22.420 | My graph here only goes up through a bit over a year ago
00:54:26.560 | But it's I know it's been continuing to grow exponentially since then as well
00:54:30.920 | And so what you see now is around Google that deep learning is used in like every part of the business
00:54:37.440 | and so it's really interesting to see how
00:54:43.960 | this kind of simple idea, that we can solve machine learning problems using an
00:54:51.040 | Algorithm that has these properties
00:54:53.520 | When a big company invests heavily in actually making that happen
00:54:57.720 | You see this incredible growth in how much it's used
00:55:01.640 | So for example if you use the inbox by Google software
00:55:07.920 | Then when you receive an email from somebody it will often
00:55:13.920 | Tell you here are some replies
00:55:15.920 | That I could send for you and so it's actually using deep learning here to read the original email and to generate
00:55:24.240 | some suggested replies and so like this is a really great example of the kind of stuff that
00:55:30.760 | Previously just wasn't possible
00:55:33.640 | Another great example: Microsoft has also, a little bit more recently, invested heavily in deep learning, and so now you can
00:55:43.800 | use Skype, speaking to it in English, and ask it at the other end to
00:55:49.880 | translate it in real time to Chinese or Spanish, and then when they talk back to you in Chinese or Spanish,
00:55:55.720 | Skype will translate the speech in their language into English speech, in real time
00:56:03.520 | And again, this is an example of stuff which we can only do thanks to deep learning
00:56:11.880 | I also think it's really interesting to think about how deep learning can be combined with human expertise
00:56:18.080 | So here's an example of like drawing something just sketching it out
00:56:22.960 | And then using a program called neural doodle
00:56:26.080 | This is from a couple of years ago to then say please take that sketch and render it in the style of an artist
00:56:33.280 | And so here's the picture that it then created
00:56:37.440 | Rendering it as you know impressionist painting, and I think this is a really great example of how
00:56:42.880 | You can use deep learning to help combine
00:56:46.480 | human expertise and what computers are good at
00:56:50.480 | So I a few years ago decided to try this myself like what would happen if I took
00:57:02.080 | Deep learning and tried to use it to solve a really important problem, and so the problem I picked was
00:57:08.120 | diagnosing lung cancer
00:57:10.240 | It turns out if you can find
00:57:12.640 | lung nodules earlier
00:57:15.640 | There's a 10 times higher probability of survival
00:57:20.040 | So it's a really important problem to solve so I got together with three other people none of us had any medical background
00:57:27.600 | And we grabbed a data set of CT scans
00:57:31.880 | We used a convolutional neural network
00:57:33.960 | Much like the dogs versus cats one we trained at the start of today's lesson
00:57:38.840 | to try and predict which
00:57:41.520 | CT scans had
00:57:44.480 | malignant tumors in them
00:57:46.480 | And we ended up after a couple of months with something with a much lower
00:57:50.720 | False negative rate and a much lower false positive rate than a panel of four radiologists
00:57:55.800 | And we went on to build this into a startup, a company called Enlitic,
00:58:01.600 | which has really become pretty successful and
00:58:03.800 | Since that time the idea of using deep learning for medical imaging has become
00:58:09.440 | Hugely popular and it's being used all around the world
00:58:12.760 | So what I've generally noticed is that you know the vast majority of
00:58:18.720 | kind of things that people do in the world currently aren't using deep learning
00:58:25.040 | And then each time somebody says oh, let's try using deep learning to improve performance at this thing
00:58:30.880 | They nearly always get fantastic results and then suddenly everybody in that industry starts using it as well
00:58:37.260 | So there's just lots and lots of opportunities here at this particular time to use deep learning to help with all kinds of different stuff
00:58:45.000 | So I've jotted down a few ideas here. These are all things which I know you can use
00:58:51.360 | deep learning for right now to get good results from
00:58:57.720 | These are things which people spend a lot of money on, or which have, you know, important business opportunities
00:59:03.800 | There's lots more as well
00:59:06.160 | But these are some examples of things that maybe at your company you could think about applying deep learning for
00:59:11.480 | So let's talk about what's actually going on
00:59:15.880 | What actually happened when we trained that deep learning model earlier?
00:59:21.760 | And so as I briefly mentioned the thing we created is something called a convolutional neural network or CNN and
00:59:29.520 | The key piece of a convolutional neural network is the convolution
00:59:34.920 | So here's a great example from a website
00:59:38.880 | I've got the URL up here
00:59:42.040 | called Explained Visually,
00:59:44.240 | and the Explained Visually website has an example of a convolution
00:59:50.760 | kind of in practice over here in the bottom left is a very zoomed in picture of somebody's face and
00:59:56.600 | Over here on the right is an example of using a convolution on that image
01:00:03.440 | You can see here. This particular thing is obviously finding
01:00:08.120 | Edges the edges of his head right top and bottom edges in particular
01:00:17.440 | Now how is it doing that well if we look at each of these little three by three areas that this is moving over
01:00:23.520 | It's taking each three by three area of pixels and here are the pixel values
01:00:28.380 | right each thing in that three by three area and
01:00:31.440 | It's multiplying each one of those three by three pixels by each one of these
01:00:37.320 | three by three
01:00:40.400 | Kernel values in a convolution this specific set of nine values is called a kernel
01:00:47.400 | It doesn't have to be nine it could be four by four or five by five or two by two or whatever, right?
01:00:52.760 | In this case, it's a three by three kernel and in fact in deep learning nearly all of our kernels are three by three
01:00:58.760 | So in this case the kernel is 1, 2, 1; 0, 0, 0; -1, -2, -1. So we take each of the
01:01:07.240 | Black through white pixel values and we multiply as you can see each of them by the corresponding value in the kernel and
01:01:17.880 | Then we add them all together
01:01:20.400 | And so if you do that for every three by three area you end up with
01:01:26.040 | The values that you see over here on the right hand side
01:01:29.000 | Okay, so very low values become
01:01:33.640 | Black very high values become white and so you can see when we're at an edge
01:01:39.480 | where it's black at the bottom and
01:01:41.920 | white at the top
01:01:43.960 | We're obviously going to get higher numbers over here and vice versa. Okay, so that's a convolution
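The multiply-and-add just described can be written out directly. Here is a minimal pure-Python sketch of a 2D convolution (no padding or stride, unlike a real deep learning library) using an edge-detecting kernel like the lesson's:

```python
def conv2d(img, kernel):
    """Slide a small kernel over a 2D image (no padding): at each
    position, multiply elementwise and sum, just as described above."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            row.append(sum(kernel[a][b] * img[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# A top-edge-detecting kernel: positive row above, negative row below.
kernel = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]

# Bright (9) above, dark (0) below: every window straddles the edge.
img = [[9, 9, 9, 9],
       [9, 9, 9, 9],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
```

On this tiny image every 3x3 window sits on the bright-above/dark-below edge, so every output value is strongly positive, which is exactly the "higher numbers at the edge" behavior described above.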
01:01:50.780 | So as you can see it is a linear operation and so based on that definition of a neural net
01:01:57.720 | I described before this can be a layer in our neural network. It is a simple linear operation
01:02:04.220 | And we're going to look at lots more at convolutions later including building a little spreadsheet
01:02:08.840 | that implements them ourselves
01:02:11.520 | So the next thing we're going to do is we're going to add a nonlinear layer
01:02:16.280 | so a nonlinearity as it's called is something which takes an input value and
01:02:25.480 | Turns it into some different value in a nonlinear way and you can see this orange picture here is an example of a nonlinear
01:02:32.520 | function specifically this is something called a sigmoid and
01:02:36.120 | so a sigmoid is something that has this kind of s shape and
01:02:40.440 | This is what we used to use as our nonlinearities in neural networks a lot
01:02:45.500 | Actually nowadays we nearly entirely use something else called a relu or rectified linear unit
01:02:52.200 | a relu is simply take any negative numbers and replace them with zero and
01:02:58.360 | Leave any positive numbers as they are so in other words in code that would be
01:03:04.020 | y = max(x, 0), so max(x, 0) simply says replace the negatives with 0
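In code, the two nonlinearities mentioned here look like this (a plain-Python sketch of the element-wise functions):

```python
import math

def sigmoid(x):
    # the older s-shaped nonlinearity
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # y = max(x, 0): negatives become zero, positives pass through
    return max(x, 0.0)
```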
01:03:20.000 | Regardless of whether you use a sigmoid or a relu or something else
01:03:24.000 | The key point about taking this combination of a linear layer followed by a element wise nonlinear function is
01:03:32.860 | That it allows us to create arbitrarily complex shapes as you see in the bottom, right?
01:03:38.080 | And the reason why is that (this is all from Michael Nielsen's neuralnetworksanddeeplearning.com, a really fantastic
01:03:46.720 | interactive book) as
01:03:48.880 | You change the values of your linear functions
01:03:53.280 | It basically allows you to kind of like build these arbitrarily tall or thin blocks and then combine those blocks together
01:04:02.240 | And this is actually the essence of the universal approximation theorem this idea that when you have a linear layer
01:04:10.400 | Feeding into a nonlinearity you can actually create these arbitrarily complex shapes
01:04:16.160 | So this is the key idea behind why neural networks can solve any computable problem
01:04:22.600 | So then we need a way as we described to actually
01:04:28.600 | Set these parameters so it's all very well knowing that we can move the parameters around manually to try to
01:04:36.520 | Create different shapes, but we have some specific shape we want. How do we get to that shape?
01:04:42.680 | And so as we discussed earlier the basic idea is to use something called gradient descent
01:04:48.280 | This is an extract from a notebook actually one of the fast AI lessons
01:04:53.640 | And it shows actually an example of using gradient descent to solve a simple linear regression problem
01:05:01.560 | But I can show you the basic idea. Let's say you had a simple
01:05:11.000 | Quadratic, right and
01:05:13.000 | So you were trying to find the minimum of this quadratic
01:05:18.040 | And so in order to find the minimum you start out by randomly picking some point, right?
01:05:24.640 | So we say okay, let's pick let's pick here
01:05:27.120 | And so you go up there and you calculate the value of your quadratic at that point
01:05:31.640 | So what you now want to do is try to find a slightly better point
01:05:35.960 | So what you could do is you can move a little bit to the left
01:05:40.680 | And a little bit to the right to find out which direction is down and what you'll find out
01:05:46.840 | Is that moving a little bit to the left decreases the value of the function so that looks good, right?
01:05:52.280 | and so in other words, we're calculating the
01:05:55.160 | derivative of the function at that point
01:05:59.140 | All right, so that tells you which way is down
01:06:04.760 | It's the gradient. And so now that we know that going to the left is down we can take a small step in
01:06:11.320 | that direction
01:06:13.800 | To create a new point and then we can repeat the process and say okay
01:06:18.680 | Which way is down now and we can now take another step and another step and another step another step another step, okay?
01:06:26.240 | And each time we're getting closer and closer
01:06:29.520 | So the basic approach here is to say okay. We start we're at some point. We've got some value X
01:06:36.440 | which is our current guess, right, at time step n
01:06:41.080 | So then our new guess at time step n plus 1 is just equal to our previous guess minus
01:06:49.720 | the derivative
01:06:51.720 | times l, a
01:07:00.200 | small number, because we want to take a small step
01:07:02.880 | We need to pick a small number because if we picked a big number right then we say okay
01:07:09.240 | We know we want to go to the left. Let's jump a big long way to the left
01:07:12.880 | we could go all the way over here and
01:07:14.880 | We actually end up worse right and then we do it again
01:07:18.540 | and now we're even worse again, right, so
01:07:21.880 | if you have too high a
01:07:25.960 | Step size you can actually end up with divergence rather than convergence
01:07:31.000 | So this number here we're going to be talking about it a lot during this course
01:07:35.000 | And we're going to be writing all this stuff out and code from scratch ourselves
01:07:37.760 | But this number here is called the learning rate
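The update rule just described, new guess equals old guess minus the learning rate times the derivative, can be sketched on a toy quadratic. The quadratic, starting point, and learning rate below are arbitrary illustrative choices, not the notebook's actual example:

```python
# Gradient descent on f(x) = (x - 3)**2, whose minimum is at x = 3.
# Update rule: x_new = x - lr * f'(x)
def grad(x):
    return 2 * (x - 3)        # derivative of (x - 3)**2

x = 10.0                      # some random starting guess
lr = 0.1                      # the learning rate: a small number
for _ in range(100):
    x = x - lr * grad(x)      # take a small step downhill

# x is now very close to the true minimum at 3
```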
01:07:40.920 | Okay, so
01:07:48.400 | You can see here
01:07:50.560 | This is an example of basically starting out with some random line and then using gradient descent to gradually make the line
01:07:57.760 | better and better and better
01:07:59.760 | So what happens when you combine these ideas right the convolution?
01:08:04.920 | The non-linearity and gradient descent because they're all tiny small simple little things it doesn't sound that exciting
01:08:12.440 | But if you have enough of these kernels
01:08:17.640 | Right with enough layers something really interesting happens
01:08:21.080 | And we can actually draw them
01:08:23.920 | So here's the
01:08:26.920 | So this is a really interesting paper by Matt Zeiler and Rob Fergus, and what they did a few years ago
01:08:36.300 | Was they figured out how to basically draw a picture of what each layer in a deep learning net network learned?
01:08:43.840 | And so they showed that layer one of the network here are nine examples of convolutional filters from layer one of a trained network
01:08:53.960 | and they found that some of the filters kind of learnt these diagonal lines or
01:08:58.620 | Simple little grid patterns some of them learnt these simple gradients right and so for each of these filters
01:09:05.800 | They show nine examples of little pieces of actual photos
01:09:10.840 | Which activate that filter quite highly right so you can see layer one
01:09:16.600 | These are learnt. Remember, these are learnt using gradient descent; these filters were not programmed
01:09:22.720 | They were learnt using gradient descent right so in other words we were learning
01:09:27.400 | These nine numbers
01:09:37.040 | so layer two then was going to take these as inputs and
01:09:41.360 | Combine them together and so layer two had you know
01:09:46.860 | This is like nine kind of attempts to draw one of the examples of the filters in layer two
01:09:52.700 | They're pretty hard to draw but what you can do is say for each filter
01:09:57.160 | What are examples of little bits of images that activated them and you can see by layer two we've got?
01:10:03.640 | Basically something that's being activated nearly entirely by little bits of sunset
01:10:07.920 | something that's being activated by circular objects
01:10:12.280 | something that's being activated by
01:10:15.300 | Repeating horizontal lines something that's being activated by corners right so you can see how we're basically combining layer one features together
01:10:24.600 | So if we combine those features together, and again, these are all
01:10:29.960 | convolutional filters learnt through gradient descent, by the third layer it's actually learned to recognize the presence of text
01:10:38.360 | Another filter has learned to recognize the presence of petals
01:10:42.160 | Another filter has learned to recognize the presence of human faces right so just three layers is enough to get some pretty
01:10:50.440 | Rich behavior so but by the time we get to layer five
01:10:54.760 | We've got something that can recognize the eyeballs of insects and birds
01:10:59.680 | And something that can recognize
01:11:01.680 | unicycle wheels
01:11:03.960 | Right so so this is kind of where we start with something
01:11:08.340 | Incredibly simple all right
01:11:11.440 | But if we use it at a big enough scale
01:11:14.440 | Thanks to the universal approximation theorem and the use of multiple hidden layers and deep learning
01:11:20.800 | We actually get these very very rich
01:11:24.920 | capabilities
01:11:27.120 | So that is what we used when we actually trained
01:11:30.280 | Our little dog versus cat recognizer, okay
01:11:41.240 | Let's talk more about this dog versus cat recognizer
01:11:44.840 | So we've learned the idea of like we can look at the pictures that come out of the other end to see what the models
01:11:50.880 | Classifying well or classifying badly or which ones it's unsure about
01:11:56.240 | But let's talk about like this key thing. I mentioned which is the learning rate
01:12:01.240 | So I mentioned we have to set this thing
01:12:03.560 | I just called it L before the learning rate and you might have noticed there's a couple of numbers these kind of magic numbers
01:12:09.960 | Here the first one is the learning rate, right?
01:12:14.480 | So this number is how much do you want to multiply the gradient by when you're taking each step in your gradient descent?
01:12:23.400 | We already talked about why you wouldn't want it to be too high
01:12:26.440 | Right, but probably also it's obvious to see why you wouldn't want it to be too low, right? If you had it too low
01:12:33.520 | You would take like a little step and you'd be a little bit closer, and a little step, a little step, a little step
01:12:40.560 | And it would take lots and lots and lots of steps and it would take too long
01:12:44.480 | so setting this number well is actually really important and
01:12:49.880 | For the longest time this was driving
01:12:53.120 | deep learning researchers crazy because they didn't really know a
01:12:57.520 | Good way to set this reliably
01:13:00.480 | So the good news is last year a researcher came up
01:13:07.000 | with an approach to quite reliably set the learning rate
01:13:10.880 | Unfortunately almost nobody noticed, so almost no deep learning researchers I know actually are aware of this approach
01:13:20.640 | But it's incredibly successful and it's incredibly simple and I'll show you the idea
01:13:25.320 | It's built into the fast AI library as something called LR find or the learning rate finder and it comes from this paper
01:13:33.000 | It was actually a 2015 paper, sorry
01:13:35.400 | Cyclical learning rates for training neural networks by a terrific researcher called Leslie Smith
01:13:40.960 | And I'll show you Leslie's idea
01:13:48.120 | So Leslie's idea started out with the same
01:13:50.960 | Basic idea that we've seen before which is if we're going to optimize something pick some random point
01:13:57.560 | Take its gradient
01:14:00.480 | Right and then specifically he said take a tiny tiny step
01:14:05.720 | A tiny step, so a learning rate of like 10 to the negative 7 (1e-7)
01:14:12.280 | Right and then do it again and again, but each time increase the learning rate, like double it
01:14:18.240 | So then we try like 2e-7, 4e-7, 8e-7,
01:14:22.960 | 1e-6, right and so gradually
01:14:26.920 | your steps
01:14:29.320 | Are getting bigger and bigger?
01:14:31.800 | Right and so you can see what's going to happen. It's going to like
01:14:37.440 | Start doing almost nothing right and it's going to then suddenly the loss function is going to improve very quickly
01:14:43.680 | Right, but then it's going to step even further again
01:14:49.320 | Then even further again
01:14:51.320 | Right, let's draw the rest of that line to be clear
01:14:56.280 | Right and so suddenly it's then going to shoot off and get much worse
01:15:03.240 | right, so
01:15:06.040 | The idea then is to go back and say okay
01:15:10.600 | At what point did we see like the best improvement?
01:15:20.560 | So here
01:15:27.080 | We've got our best improvement right and so we'd say okay. Let's use that
01:15:32.680 | Learning rate right so in other words if we were to plot
01:15:36.120 | the learning rate
01:15:39.520 | Over time
01:15:42.520 | It was increasing
01:15:45.080 | like so
01:15:47.040 | Right and so what we then want to do is we want to plot
01:15:50.520 | the learning rate
01:15:53.160 | Against the loss, right. So when I say the loss, I basically mean like how accurate is the model. In this case, the loss
01:16:01.080 | would be how far away is the prediction
01:16:04.960 | from the goal
01:16:07.560 | Right and so if we plotted the learning rate against the loss we'd say like okay initially it didn't do very much
01:16:14.880 | Right for small learning rates, and then it suddenly improved a lot and then it suddenly got a lot worse
01:16:22.360 | So that's the basic idea and so we'd be looking for the point where this graph is
01:16:29.920 | Dropping quickly right we're not looking for its minimum point
01:16:33.000 | We're not saying like where it was at the lowest, because that could actually be the point where it's just jumped too far
01:16:38.400 | We want at what point was it dropping?
01:16:40.560 | the fastest
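Leslie Smith's idea, start tiny and keep doubling the learning rate while recording the loss, can be sketched on the same kind of toy quadratic. The loss function and the exact doubling schedule below are illustrative stand-ins, not the fastai implementation:

```python
# Learning rate finder sketch: begin at 1e-7, double the learning rate after
# every single gradient step, and record the loss as we go.
def loss(x):
    return (x - 3) ** 2

def grad(x):
    return 2 * (x - 3)

x, lr = 10.0, 1e-7
lrs, losses = [], []
while lr < 10:
    lrs.append(lr)
    losses.append(loss(x))     # loss at the current point
    x = x - lr * grad(x)       # one step at the current learning rate
    lr *= 2                    # then double the learning rate
# Early on the loss barely moves; in the middle it drops quickly;
# once the rate is too big, the steps overshoot and the loss gets worse.
```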
01:16:43.000 | So if you go
01:16:46.280 | So if you create your learn objects in the same way that we did before we'll be learning more about this these details shortly
01:16:54.320 | If you then call LR find method on that you'll see that it'll start training a model
01:17:01.360 | Like it did before but it'll generally stop before it gets to a hundred percent because if it notices
01:17:08.200 | That the loss is getting a lot worse
01:17:12.640 | Then it'll stop automatically so that you can see here. It stopped at 84% and so then you can call
01:17:19.440 | learn.sched. That gets you the learning rate scheduler
01:17:22.680 | That's the object which actually does this learning rate finding, and that object has a plot_lr function
01:17:28.240 | And so you can see here by iteration you can see the learning rate
01:17:32.680 | All right, so you can see each step the learning rate is getting bigger and bigger
01:17:36.640 | When you do it this way, you can see it's increasing exponentially
01:17:41.880 | Another way that Leslie Smith the researcher suggests is to do it linearly
01:17:47.560 | So I'm actually currently researching with both of these approaches to see which works best
01:17:51.720 | Recently I've been mainly using exponential, but I'm starting to look more at using linear at the moment
01:17:57.200 | And so if we then call sched.plot, that does the plot that I just described down here
01:18:04.000 | learning rate versus
01:18:06.760 | Loss all right, and so we're looking for the highest learning rate we can find
01:18:12.480 | Where the loss is still improving?
01:18:16.240 | clearly well right and so in this case I would say
01:18:20.400 | 10 to the negative 2. At 10 to the negative 1 it is not improving
01:18:25.200 | All right, at 10 to the negative 3 it is also improving
01:18:28.920 | But I'm trying to find the highest learning rate I can where it's still clearly improving
01:18:33.160 | So I'd say 10 to the negative 2 right so you might have noticed that when we ran our model before we had
01:18:40.240 | 10 to the negative 2 0.01. So that's why we picked that learning rate
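One way to make that rule concrete, pick the highest learning rate where the loss is still clearly improving, is to look for the steepest drop in the recorded loss-versus-learning-rate curve and take the rate just past it. The numbers below are made up for illustration, not real lesson output:

```python
# Recorded (learning rate, loss) pairs: the loss improves, then blows up.
lrs    = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
losses = [2.30, 2.28, 1.90, 0.80, 5.00]

# Finite differences between consecutive losses; the most negative one is
# where the loss was dropping fastest.
drops = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
best = drops.index(min(drops))

# The highest rate at which the loss was still clearly improving.
chosen_lr = lrs[best + 1]
```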
01:18:45.940 | So there's really only one other number that we have to pick and
01:18:53.700 | That was this number 3 and so that number 3 controlled how many
01:19:02.100 | epochs that we run so an epoch means going through our entire data set of images and
01:19:11.820 | Using each of them once. Each time we do what they call mini-batches: we grab like
01:19:17.340 | 64 images at a time and use them to try to improve the model a little bit using gradient descent
01:19:23.260 | Right and using all of the images once is called one epoch
01:19:27.420 | and so at the end of each epoch we print out the accuracy and
01:19:32.900 | Validation and training loss at the end of the epoch
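The epoch and mini-batch vocabulary can be sketched as a plain loop; the batch size of 64 comes from the lesson, while the dataset size and the training step itself are stand-ins:

```python
# One "epoch" = one pass through every image, taken 64 at a time ("mini-batches").
dataset = list(range(1000))   # stand-in for 1,000 training images
batch_size = 64
epochs = 3

for epoch in range(epochs):
    n_batches = 0
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        # ...one gradient-descent step on this mini-batch would go here...
        n_batches += 1
    # At the end of each epoch the lesson prints accuracy and the
    # training and validation losses.
```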
01:19:40.260 | question of
01:19:41.780 | how many epochs should we run is kind of the one other question that you need to answer to run these three lines of code and
01:19:48.620 | The answer really to me is like
01:19:51.940 | As many as you like
01:19:55.340 | What you might find happens is if you run it for too long, the accuracy will start getting worse
01:20:01.640 | Right and we'll learn about that why later. It's something called overfitting right so
01:20:06.900 | You can run it for a while run lots of epochs
01:20:10.740 | Once you see it getting worse
01:20:12.060 | You know how many epochs you can run and the other thing that might happen is if you've got like a really big model
01:20:17.780 | Or lots and lots of data, maybe it takes so long you don't have time, and so you just run enough epochs that
01:20:23.500 | Fit into the time you have available so the number of epochs you run you know that's a pretty easy thing to set
01:20:29.580 | So they're the only two numbers you're going to have to set and so the goal
01:20:34.860 | This week will be to make sure that you can run
01:20:39.580 | Not only these three lines of code on the data that I provided
01:20:43.820 | But to run it on a set of images that you either have on your computer or that you
01:20:50.360 | Get from work or that you download from Google
01:20:53.420 | And I try to get a sense of like which kinds of images does it seem to work well for?
01:20:59.780 | Which ones doesn't it work well for?
01:21:02.860 | What kind of learning rates do you need for different kinds of images how many epochs do you need?
01:21:09.540 | How does the number of the learning rate change the accuracy you get and so forth like really experiment and then?
01:21:16.420 | You know try to get a sense of like what's inside this data object?
01:21:21.140 | You know, what do the y values look like, what do these classes mean?
01:21:24.980 | If you're not familiar with numpy you know really practice a lot with numpy so that by the time you come back for the next
01:21:31.980 | lesson
01:21:33.660 | You know we're going to be digging into a lot more detail, and so you'll really feel ready to do that
01:21:39.060 | now one thing that's really important to be able to do that is that you need to really know how to
01:21:44.720 | work with
01:21:47.780 | Numpy the faster I library and so forth and so I want to show you some tricks in Jupyter notebook to make that much easier
01:21:55.580 | So one trick to be aware of is if you can't quite remember how to spell something right so
01:22:02.260 | If you're not quite sure
01:22:04.860 | What the method you want is, you can always hit tab
01:22:08.220 | And you'll get a list of
01:22:10.220 | Methods that start with that letter right and so that's a quick way to find things
01:22:14.900 | If you then can't remember what the arguments are to a method hit shift tab
01:22:20.360 | All right, so hitting shift tab tells you the arguments to the method so shift tab is like one of the most helpful things
01:22:29.540 | I know
01:22:31.540 | So let's take
01:22:35.860 | shift tab. And so now you might be wondering, like, okay, well, what does this function do and how does it work?
01:22:42.140 | If you press shift tab twice
01:22:44.780 | Then it actually brings up the documentation
01:22:48.220 | Shows you what the parameters are and shows you what it returns and gives you examples
01:22:54.340 | Okay, if you press it three times
01:22:58.380 | Then it actually pops up a whole little separate window with that information
01:23:03.740 | Okay, so shift tab is super helpful
01:23:05.860 | One way to grab that window straight away is if you just put question mark at the start
01:23:12.540 | Then it just brings up that little documentation window
01:23:16.660 | Now the other thing to be aware of is increasingly during this course
01:23:22.500 | We're going to be looking at the actual source code of fast AI itself and learning how it's built and why it's built that way
01:23:29.660 | It's really helpful to look at source code in order to you know
01:23:33.740 | Understand what you can do and how you can do it
01:23:36.540 | So if you for example wanted to look at the source code for learn dot predict you can just put two question marks
01:23:42.400 | Okay, and you can see it's popped up the source code right and so it's just a single line of code
01:23:50.300 | You'll very often find that fast AI methods like they're they're designed to never be more than
01:23:57.420 | About half a screen full of code and they're often under six lines so you can see this case
01:24:02.660 | It's calling predict with targs, so we could then get the source code for that in the same way
01:24:10.940 | And then that's calling a function called predict_with_targs, so we could get the documentation for that in the same way, and
01:24:16.340 | then so here we are, and then finally that's what it does: it iterates through a data loader, gets the predictions, and then passes them back
01:24:23.500 | and so forth, okay, so
01:24:26.980 | question mark question mark is how to get source code a single question mark is how to get documentation and
01:24:33.660 | Shift tab is how to bring up parameters or press it more times
01:24:38.320 | to get the docs
01:24:40.980 | So that's really helpful
01:24:43.020 | Another really helpful thing to know about is how to use Jupyter notebook well and the button that you want to know is H
01:24:50.180 | If you press H, it will bring up the keyboard shortcuts
01:24:54.940 | Palette and so now you can see exactly what Jupyter notebook can do and how to do it
01:25:00.500 | I personally find all of these functions useful
01:25:03.680 | So I generally tell students to try and learn four or five different keyboard shortcuts a day
01:25:08.960 | Try them out see what they do see how they work, and then you can try practicing in that session
01:25:14.940 | And one very important thing to remember: when you're finished with your work for the day, go back to Paperspace and click on that
01:25:23.060 | little button
01:25:24.100 | Which stops and starts the machine so after it stopped you'll see it says connection closed and you'll see it's off
01:25:30.460 | If you leave it running you'll be charged for it. Same thing with Crestle: be sure to go to your Crestle
01:25:37.060 | instance and stop it. You can't just turn your computer off or close the browser
01:25:43.020 | You actually have to stop it in Crestle or in Paperspace, and don't forget to do that
01:25:47.940 | Or you'll end up being charged until
01:25:49.940 | You finally do remember
01:25:53.220 | Okay, so I think that's all of the information that you need to get started please remember about the forums
01:26:00.140 | If you get stuck at any point check them out
01:26:04.500 | But before you do make sure you read the information on course.fast.ai for each lesson
01:26:11.020 | All right because that is going to tell you about like things that have changed okay, so if there's been some change to
01:26:17.920 | which
01:26:20.900 | Jupyter notebook provider we suggest using or how to set up paper space or anything like that
01:26:26.020 | That'll all be on course.fast.ai
01:26:28.780 | Okay, thanks very much for watching and look forward to seeing you in the next lesson