Lesson 3: Practical Deep Learning for Coders 2022
Chapters
0:00 Introduction and survey
1:36 "Lesson 0" How to fast.ai
2:25 How to do a fastai lesson
4:28 How to not self-study
5:28 Highest voted student work
7:56 Pets breeds detector
8:52 Paperspace
10:16 JupyterLab
12:11 Make a better pet detector
13:47 Comparison of all (image) models
15:49 Try out new models
19:22 Get the categories of a model
20:40 What’s in the model
21:23 What does model architecture look like
22:15 Parameters of a model
23:36 Create a general quadratic function
27:20 Fit a function by good hands and eyes
30:58 Loss functions
33:39 Automate the search of parameters for better loss
42:45 The mathematical functions
43:18 ReLU: Rectified linear function
45:17 Infinitely complex function
49:21 A chart of all image models compared
52:11 Do I have enough data?
54:56 Interpret gradients in units?
56:23 Learning rate
60:14 Matrix multiplication
64:22 Build a regression model in spreadsheet
76:18 Build a neural net by adding two regression models
78:31 Matrix multiplication makes training faster
81:01 Watch out! It's chapter 4
82:31 Create dummy variables of 3 classes
83:34 Taste NLP
87:29 fastai NLP library vs Hugging Face library
88:54 Homework to prepare you for the next lesson
Hi everybody and welcome to lesson three of practical deep learning for coders 00:00:04.580 |
we did a quick survey this week to see how people feel the course is tracking, and 00:00:11.900 |
over half of you think it's about right-paced, and of the rest, 00:00:18.320 |
some of you think it's a bit slow and some of you think it's a bit fast 00:00:23.920 |
So hopefully that's about the best we can do 00:00:27.520 |
Generally speaking the first two lessons are a little more easy pacing for anybody who's already familiar with the kind of 00:00:34.220 |
basic technology pieces and then the later lessons get you know more into kind of some of the foundations and today 00:00:43.820 |
You know things like the matrix multiplications and gradients and calculus and stuff like that 00:00:50.400 |
So for those of you who are more math-y and less computer-y you might find this one more comfortable, and vice versa 00:01:04.360 |
official course updates thread where you can see all the up-to-date info about 00:01:09.820 |
Everything you need to know and of course the course website 00:01:13.400 |
As well so by the time you know you watch the video of the lesson 00:01:19.600 |
It's pretty likely that if you come across a question or an issue somebody else will have too, so definitely search 00:01:27.480 |
First and then of course feel free to ask a question yourself on the forum if you can't find your answer 00:01:32.960 |
One thing I did want to point out which you'll see in the lessons thread and the course website is 00:01:48.960 |
Radek's book Meta Learning, which internally is based heavily on all the things that I've said over the years about how to learn fast.ai 00:01:59.080 |
and tidbits about the science of learning itself, and put them into the course 00:02:04.400 |
It's a different course to probably any other you've taken and it's I strongly recommend 00:02:10.600 |
Watching lesson zero as well. The last bit of lesson zero is about how to set up a Linux box from scratch 00:02:18.040 |
which you can happily skip over unless that's of interest, but the rest of it is 00:02:21.260 |
Full of juicy information that I think you'll find useful 00:02:36.680 |
And I generally you know on the video recommend watching it all the way through 00:02:45.000 |
Watch it with lots of pauses running the notebook as you go because otherwise you're kind of like running the notebook 00:02:50.800 |
Without really knowing where it's heading if that makes sense 00:02:55.160 |
And the idea of running the notebook is is you you know, there's a few notebooks you could go through 00:03:00.840 |
So obviously there's the book so going through chapter one of the book going through chapter two of the book as notebooks 00:03:09.320 |
Experimenting with inputs and outputs to try and understand what's going on 00:03:17.780 |
And then trying to repeat the whole thing with a different data set and if you can do that last step, you know, that's 00:03:25.120 |
Quite a stretch goal particularly at the start of the course because there's so many new concepts 00:03:30.200 |
But that really shows that you've got it sorted. Now, for this third bit, reproduce results, 00:03:35.300 |
I recommend using you'll find in the fastbook repo. So the repository for the book. There is a special folder called 00:03:44.120 |
Clean and clean contains all of the same chapters of the book 00:03:47.760 |
But with all of the text removed except for headings and all the outputs removed 00:03:53.160 |
And this is a great way for you to test your understanding of the chapter, which is: before you run each cell 00:04:00.440 |
of the chapter, say to yourself, okay, what's this for? 00:04:03.440 |
And what's it going to output if anything and if you kind of work through that slowly 00:04:09.560 |
That's a great way at any time. You're not sure you can jump back to the 00:04:13.240 |
Version of the notebook with the text to remind yourself and then head back over to the clean version 00:04:18.580 |
So there's an idea for something which a lot of people find really useful for self-study I 00:04:27.640 |
Say self-study, but of course as we've mentioned before 00:04:34.640 |
study is done to some extent with others, for most people 00:04:37.800 |
You know the research shows that you're more likely to stick with things if you're doing it 00:04:43.400 |
as kind of a bit of a social activity. The forums are a great place to find and create 00:04:49.240 |
Study groups and you'll also find on the forums a link to our discord server 00:04:56.080 |
So yes our discord server where there are some study groups there as well 00:05:01.360 |
so I'd you know in person study groups virtual study groups are a great way to 00:05:06.000 |
You know really make good progress and find other people at a similar level to you 00:05:14.560 |
If there isn't a study group going at your level, in your area, in your time zone, 00:05:18.320 |
create one. So just post something saying hey, let's create a study group 00:05:22.280 |
So this week there's been a lot of fantastic activity. I can't show all of it. So what I did was I used the 00:05:31.520 |
summary functionality in the forums to grab all of the things with the highest votes, and so I will quickly show a few of those. We 00:05:44.480 |
love this rock-paper-scissors game where you actually use pictures of the rock-paper-scissors symbols, and apparently 00:05:52.280 |
The computer always loses. That's my favorite kind of game 00:05:57.080 |
There is a lot of Elon around so very handy to have an Elon detector to you know 00:06:02.000 |
Either find more of him if that's what you need or maybe less of him 00:06:05.440 |
I thought this one is very interesting. I love these kind of really interesting ideas. It's like gee 00:06:12.580 |
I wonder if this would work. Can you predict the average? 00:06:22.720 |
And apparently the answer is yeah, actually you can predict it pretty well; here in Brisbane 00:06:28.320 |
it was predicted, I believe, to within one and a half 00:06:33.360 |
I think this student is actually a genuine meteorologist if I remember correctly he built a cloud detector 00:06:39.840 |
So then, building on top of the "what's your favorite Marvel character?", there's now also an "is it a Marvel character?" 00:06:48.160 |
My daughter loves this one: what dinosaur is this? And I'm not as good about dinosaurs as I should be; I feel like there are 00:06:54.920 |
ten times more dinosaurs than there were when I was a kid, so I never know their names. This is very handy 00:07:01.360 |
This is cool. Choose your own adventure where you choose your path using facial expressions 00:07:19.200 |
Application that actually runs on a mobile phone. That's pretty cool 00:07:23.400 |
I wouldn't be surprised to hear that Brian actually works at Microsoft so also an opportunity to promote 00:07:31.400 |
I thought this art movement classifier was interesting in that like there's a really interesting discussion on the forum about 00:07:37.040 |
What it actually shows about similarities between different art movements 00:07:42.400 |
And I thought this redaction detector project was really cool 00:07:47.280 |
As well, and there's a whole tweet thread and blog post and everything about this one particularly great piece of work 00:07:59.960 |
Quickly show you a couple of little tips before we kind of jump into the mechanics of what's behind a neural network 00:08:06.080 |
Which is, I was playing a little bit with how do you make your models better 00:08:13.920 |
during the week, and so I created this pet detector, and this pet detector is not just 00:08:20.360 |
predicting dogs or cats, but what breed is it? 00:08:28.680 |
Now because I put this out on Hugging Face Spaces 00:08:35.000 |
you can download and look at my code, because if you just click Files and versions on the Space, which you can find a link on the 00:08:43.480 |
You can see them all here and you can download it to your own computer 00:09:01.360 |
So in the past I've shown you Colab and I've shown you Kaggle 00:09:05.160 |
And we've also looked at doing stuff on your own computer 00:09:09.680 |
Not so much training models on your computer, but using the models you've trained to create applications 00:09:14.480 |
Paperspace is another website, a bit like Kaggle and Google Colab 00:09:23.040 |
But in particular they have a product called gradient notebooks 00:09:27.320 |
which is, at least as I speak (and things change all the time, so check the course website), 00:09:32.760 |
but as I speak in my opinion is is by far the best platform for 00:09:38.480 |
Running this course and for you know doing experimentation 00:09:42.040 |
I'll explain why as we go. So why haven't I been using it the past two weeks? 00:09:46.680 |
Because I've been waiting for them to build some stuff for us to make it particularly good and they just they just finished 00:09:53.760 |
So I've been using it all week, and it's totally amazing 00:10:00.960 |
so you've got a machine running in the cloud, but the thing that 00:10:05.800 |
was very special about it is that it's a real computer you're using 00:10:11.120 |
It's not like that kind of weird virtual version of things that Kaggle or Colab has 00:10:15.800 |
So if you whack on this button down here, you'll get a full version of JupyterLab 00:10:21.640 |
Or you can switch over to a full version of classic Jupyter notebooks 00:10:28.760 |
And I'm actually going to do stuff in JupyterLab today because it's a pretty good environment for beginners who are not 00:10:37.000 |
Familiar with the terminal which I know a lot of people in the course are in that situation. You can do really everything 00:10:41.800 |
Kind of graphically there's a file browser so here you can see I've got my pets repo 00:10:49.040 |
It's got a git repository thing you can pull and push to git 00:11:04.120 |
And so forth so what I tend to do with this is I tend to go into a full screen 00:11:24.360 |
GPUs and most importantly there's two good features one is that you can pay I think it's eight or nine dollars a month to get better 00:11:31.600 |
GPUs and basically as many as you you know as many hours as you want 00:11:35.520 |
And they have persistent storage so with Colab if you've played with it 00:11:40.640 |
You might have noticed it's annoying you have to muck around with saving things to Google Drive and stuff on Kaggle 00:11:51.640 |
Whereas on Paperspace, you know, whatever you save in your storage is going to be there the next time you come back 00:12:04.200 |
walkthroughs of all of this functionality, so if you're interested in really taking advantage of this, check those out 00:12:17.160 |
that I wanted you to take away from lesson 2 isn't necessarily all the details of how do you use a particular platform to train models and 00:12:27.000 |
Deploy them into applications through through JavaScript or online platforms 00:12:33.200 |
But the key thing I wanted you to understand was the concept. There's really two pieces 00:12:38.580 |
There's the training piece and at the end of the training piece you end up with this model pickle file, right? 00:12:47.880 |
That's now a thing where you feed it inputs, and it spits out outputs 00:12:52.400 |
Based on that model that you trained and then so you don't need 00:12:55.640 |
You know because that happens pretty fast you generally don't need a GPU once you've got that trained 00:13:01.120 |
And so then there's a separate step, which is deploying so I'll show you how I trained my 00:13:12.680 |
So you can see I've got two I Python notebooks 00:13:16.200 |
One is app, which is the one that's going to be doing the inference in production; one is the one where I train the model 00:13:22.880 |
So this first bit I'm going to skip over because you've seen it before I create my image data loaders 00:13:29.360 |
Check that my data looks okay with show batch 00:13:49.360 |
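For reference, a minimal sketch of what creating those image data loaders and checking a batch can look like; the dataset path and the filename regex below are assumptions rather than the exact notebook code:

```python
# Sketch: build DataLoaders for the pet images and eyeball a batch.
# The regex pulls the breed name out of filenames like 'basset_hound_12.jpg'.
from fastai.vision.all import *

path = untar_data(URLs.PETS) / 'images'
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path), pat=r'(.+)_\d+.jpg$',
    valid_pct=0.2, seed=42, item_tfms=Resize(224))
dls.show_batch(max_n=6)   # sanity check: do the images and breed labels look right?
```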
to a notebook I created; actually most of the work was done by Ross Wightman 00:13:55.700 |
Where we can try to improve this by finding a better architecture 00:14:00.900 |
There are I think at the moment in the PyTorch image models library over 500 00:14:09.320 |
Architectures and we'll be learning over the course 00:14:11.560 |
You know what what they are how they differ, but you know broadly speaking 00:14:18.440 |
mathematical functions, you know, which are basically matrix multiplications and 00:14:29.240 |
But most of the time those details don't matter what we care about is three things how fast are they? 00:14:35.520 |
How much memory do they use and how accurate are they and so what I've done here with Ross is we've grabbed all of the 00:14:42.780 |
models from PyTorch image models, and you can see all the code we've got is very very little code 00:14:52.680 |
Now my screen resolutions a bit there we go. Let's do that and so on this plot 00:15:05.280 |
On the x-axis we've got seconds per sample. So how fast is it? 00:15:09.960 |
So to the left is better, that's faster, and going up is how accurate is it? 00:15:15.280 |
So how accurate was it on ImageNet in particular, and so generally speaking you want things that are up towards the top left 00:15:25.280 |
Now we've been mainly working with ResNet and you can see down here 00:15:29.920 |
Here's ResNet 18. Now ResNet 18 is a particularly small and fast version for prototyping 00:15:35.640 |
We often use ResNet 34, which is this one here and you can see this kind of like classic model 00:15:42.320 |
That's very widely used actually nowadays isn't the state-of-the-art anymore 00:15:46.480 |
So we can start to look up at these ones up here and find out some of these better models 00:15:52.440 |
the ones that seem to be the most accurate and 00:15:59.240 |
So I tried them out on my pets and I found that they didn't work particularly well. So I thought okay 00:16:05.680 |
Let's try something else out. So next up. I tried 00:16:12.640 |
This one in here was particularly interesting. It's kind of like super high accuracy. It's the you know, if you want 00:16:18.960 |
0.001 seconds inference time. It's the most accurate. So I tried that. So how do we try that? 00:16:29.480 |
So PyTorch image models is in the timm module, so at the very start I imported that 00:16:44.120 |
and so this is going to show all the convnext models, and 00:16:46.800 |
Here I can find the ones that I just saw and all I need to do is when I create the vision learner 00:16:52.120 |
I just put the name of the model in as a string 00:16:59.200 |
This one is not a string. That's because it's a model that fastai provides in the library 00:17:09.920 |
So if you install timm, so you need to pip install timm or conda install timm 00:17:14.760 |
You'll get hundreds more and you put that in a string 00:17:19.040 |
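A rough sketch of what that looks like in code; the exact model name below is just one plausible choice from the list, not necessarily the one used in the lesson:

```python
# Sketch: list the convnext architectures timm knows about, then train one by
# passing its name to vision_learner as a string (requires `pip install timm`).
import timm

print(timm.list_models('convnext*'))   # all the ConvNeXt variants

learn = vision_learner(dls, 'convnext_tiny_in22k', metrics=error_rate)
learn.fine_tune(3)
```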
So if I now train that, the time for these epochs goes from 20 seconds to 27 seconds. So it is a little bit slower 00:17:31.120 |
down to 5.5 percent. So, you know, that's a pretty big relative difference 00:17:36.560 |
7.2 divided by 5.5. Yeah, so about a 30 percent improvement. So that's pretty fantastic and you know, it's 00:17:53.320 |
Really big ResNet that's widely available and usable on regular GPUs 00:17:59.480 |
So this is this is a big step. And so this is a you know, there's a few architectures nowadays that really are 00:18:05.600 |
probably better choices a lot of the time, like these convnext ones, so if you are not sure what to use 00:18:15.000 |
You might wonder what the names are about. Obviously 00:18:17.660 |
Tiny, small, large, etc. is how big is the model? So that'll be how much memory is it going to take up? 00:18:31.920 |
These ones have been trained on more data. So image net there's two different image net data sets 00:18:37.720 |
There's one that's got a thousand categories of pictures and there's another one. It's about 22,000 categories of pictures 00:18:43.800 |
So this is trained on the one with 22,000 categories 00:18:49.560 |
So these are generally going to be more accurate on kind of standard photos of natural objects 00:18:56.680 |
Okay, so from there I exported my model and that's the end okay, so now I've trained my model and I'm all done 00:19:03.680 |
You know other things you could do obviously is add more epochs for example 00:19:09.600 |
Add image augmentation. There's various things you can do. But you know, I found this this is actually pretty 00:19:17.680 |
If any of you find you can do better, I'd love to hear about it 00:19:21.160 |
So then I'd turn that into an application. I just did the same thing that we saw last week, which was to 00:19:35.720 |
The learner once we load it and call predict spits out a list of 37 numbers 00:19:40.600 |
That's because there are 37 breeds of dog and cat. So these are the probability of each of those breeds 00:19:50.760 |
The answer is that fast AI always stores this information about categories 00:19:55.800 |
This is a category in this case of dog or cat breed in something called the vocab object and it's inside the data loaders 00:20:02.760 |
So we can grab those categories and that's just a list of strings just tells us the order 00:20:07.360 |
So if we now zip together the categories and the probabilities we'll get back a dictionary that tells you 00:20:22.880 |
This was a basset hound, so there you can see, yep, almost certainly a basset hound 00:20:30.760 |
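A sketch of that inference step, roughly as described; the filename is made up:

```python
# Sketch: predict returns the winning class, its index, and the 37 probabilities;
# zipping the vocab with the probabilities gives the dictionary the interface expects.
categories = learn.dls.vocab                        # the 37 breed names, in order

def classify_image(img):
    pred, idx, probs = learn.predict(img)
    return dict(zip(categories, map(float, probs)))

classify_image('basset.jpg')                        # e.g. {'basset_hound': 0.99, ...}
```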
So from there, just like last week, we can go and create our interface and then launch it 00:20:41.100 |
What did we just do really? What is this magic? 00:20:48.480 |
So we can take a look at the model pickle file. It's an object type called a learner and 00:20:56.160 |
A learner has two main things in it. The first is the list of pre-processing steps that you did 00:21:03.040 |
to turn your images into inputs to the model, and that's basically 00:21:13.920 |
So it's your data blocks or your image data loaders or whatever 00:21:17.640 |
and then the second thing most importantly is the trained model and 00:21:23.120 |
So you can actually grab the trained model by just grabbing the dot model attribute 00:21:27.880 |
So I'm just going to call that m and then if I type m I can look at the model and so here it is 00:21:33.580 |
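A sketch of grabbing the trained model out of the learner; the pickle filename is an assumption:

```python
# Sketch: load the exported learner, then pull out the underlying PyTorch model.
from fastai.vision.all import load_learner

learn = load_learner('model.pkl')   # filename assumed; whatever you exported to
m = learn.model                     # a torch.nn.Module: the trained model itself
m                                   # printing it shows the tree of layers below
```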
Lots of stuff. So what is this stuff? Well, we'll learn about it all over time, but basically what you'll find is 00:21:42.560 |
It contains lots of layers because this is a deep learning model and you can see it's kind of like a tree 00:21:49.640 |
That's because lots of the layers themselves consist of layers 00:21:52.760 |
so there's a whole layer called the timm body which is most of it, and 00:21:59.560 |
then right at the end there's a second layer called sequential and 00:22:08.080 |
and then it contains something called stem and something called stages, and then stages contains zero, one, two, etc 00:22:16.880 |
So what is all this stuff? Well, let's take a look at one of them 00:22:20.520 |
So to take a look at one of them, there's a really convenient 00:22:25.480 |
method in PyTorch called get_submodule where we can pass in kind of a dotted string 00:22:33.240 |
navigating through this hierarchy. So 0.model.stem.1 goes zero, model, stem, one 00:22:40.120 |
So this is going to return this layer norm 2d thing. So what is this layer norm 2d thing? 00:22:48.700 |
It's got some code, which is the mathematical function we talked about, and then the other thing that we learned about is it has 00:22:56.140 |
Parameters and so we can list its parameters and look at this. It's just lots and lots and lots of numbers 00:23:01.500 |
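A sketch of that drill-down; the dotted path depends on the architecture, so treat it as an example rather than a fixed name:

```python
# Sketch: navigate the layer hierarchy with a dotted path, then list the
# learned parameters (the "lots and lots of numbers") for that sub-module.
layer = m.get_submodule('0.model.stem.1')
list(layer.parameters())
```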
Let's grab another example. We could have a look at 0.model.stages.0.blocks.1.mlp.fc1, and 00:23:14.420 |
So what's going on here? What are these numbers and where at earth did they come from and how come these? 00:23:22.500 |
Numbers can figure out whether something is a basset hound or not 00:23:30.180 |
To answer that question we're going to have a look at a 00:23:45.700 |
But a local version of it here which I'm going to take you through and the basic idea is 00:23:56.220 |
fitting functions to data. So we start out with a very very flexible, 00:24:00.620 |
in fact an infinitely flexible, as we've discussed, function, a neural network, and 00:24:07.460 |
Which is to recognize the patterns in the data examples we give it 00:24:20.280 |
So let's create a function f which is 3x squared plus 2x 00:24:26.900 |
plus one. Okay, so it's a quadratic with coefficients 3, 2 and 1 00:24:31.620 |
So we can plot that function f and give it a title 00:24:35.140 |
If you haven't seen this before, things between dollar signs is what's called LaTeX 00:24:39.580 |
It's basically how we can create kind of typeset mathematical equations 00:24:47.380 |
And so here you can see the function here you can see the title I passed it and here is 00:24:54.680 |
quadratic, okay, so what we're going to do is we're going to 00:25:02.140 |
Mathematical function we're trying to find as it's obviously much simpler than the function that figures out whether an image is a 00:25:09.100 |
Basset hound or not that we're just going to start super simple 00:25:12.000 |
So this is the real function and we're going to try to to recreate it from some data 00:25:17.340 |
Now it's going to be very helpful if we have an easier way of creating different quadratics 00:25:25.380 |
So I have to find a kind of a general form of a quadratic here 00:25:29.300 |
with coefficients a, b and c, and at some particular point x it's going to be ax squared plus bx plus c 00:25:38.940 |
Okay, so that's for x equals 1.5. That's 3x squared plus 2x plus 1, which is the quadratic we did before 00:25:48.140 |
Now we're going to want to create lots of different quadratics to test them out and find out which one's best 00:26:00.460 |
Somewhat advanced but very very helpful feature of Python that's worth learning if you're not familiar with it 00:26:05.160 |
And it's used in a lot of programming languages. It's called a partial application of a function. Basically. I want this exact function 00:26:11.340 |
but I want to fix the values of a b and c to pick a particular quadratic and 00:26:17.060 |
the way you fix the values of the function is you call this thing in Python called partial and you pass in the function and 00:26:24.580 |
Then you pass in the values that you want to fix 00:26:31.220 |
If I now say make quadratic 3 2 1 that's going to create a quadratic equation with coefficients 3 2 and 1 00:26:40.200 |
And you can see if I then pass in so that's now f if I pass in 1.5. I get the exact same value I did before 00:26:49.900 |
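Here's roughly what those two pieces look like in code (a sketch of the idea, not a verbatim copy of the notebook):

```python
# Sketch: a general quadratic, plus a helper that fixes a, b and c using partial.
from functools import partial

def quad(a, b, c, x):
    "a*x**2 + b*x + c evaluated at x"
    return a * x**2 + b * x + c

def mk_quad(a, b, c):
    "Fix the coefficients, returning a function of x only"
    return partial(quad, a, b, c)

f = mk_quad(3, 2, 1)
f(1.5)    # 3*1.5**2 + 2*1.5 + 1 = 10.75, same as calling quad(3, 2, 1, 1.5)
```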
Okay, so we've now got an ability to create any quadratic 00:26:56.420 |
Equation we want by passing in the parameters of the coefficients of the quadratic 00:27:01.140 |
That gives us a function that we can then just call as just like any normal function 00:27:05.660 |
So that only needs one thing now, which is the value of x because the other three a b and c are now fixed 00:27:15.540 |
We'll get exactly the same shape because it's the same coefficients 00:27:23.460 |
Now I'm going to show an example of some data, some data that 00:27:28.960 |
Matches the shape of this function, but in real life data is never exactly going to match the shape of a function 00:27:36.620 |
It's going to have some noise. So here's a couple of 00:27:43.860 |
So you can see I've still got the basic functional form here, but this data is a bit dotted around it 00:27:54.060 |
The level to which you look at how I implemented these is entirely up to you 00:27:59.180 |
It's not like super necessary, but it's all stuff which you know the kind of things we use quite a lot 00:28:05.020 |
So this is to create normally distributed random numbers 00:28:08.100 |
This is how we set the seed so that each time I run this I've got to get the same random numbers 00:28:14.540 |
This one is actually particularly helpful this creates a 00:28:20.060 |
tensor, so in this case a vector, that goes from negative 2 to 2 in 00:28:26.500 |
equal steps, and there's 20 of them. That's why there's 20 steps along here 00:28:39.860 |
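A sketch of generating that noisy data; the noise levels are assumptions, and `f` is the quadratic created above:

```python
# Sketch: 20 evenly spaced x values, plus noisy samples of the true quadratic.
import torch

def noise(x, scale):          return torch.randn(x.shape) * scale
def add_noise(f, mult, add):  return f * (1 + noise(f, mult)) + noise(f, add)

torch.manual_seed(42)                  # fix the seed so the "random" data repeats
x = torch.linspace(-2, 2, steps=20)    # the 20 steps along the x axis
y = add_noise(f(x), 0.15, 1.5)         # noisy measurements of f at each x
```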
Okay, so as I say the details of that don't matter too much. The main thing to know is we've got some 00:28:46.180 |
Random data now and so this is the idea is now we're going to try to reconstruct the original 00:28:59.220 |
Well what we can do is we can create a function called plot quadratic 00:29:07.540 |
That first of all plots our data as a scatter plot and then it plots a function which is a quadratic 00:29:17.060 |
Now that's a very helpful thing for experimenting 00:29:20.220 |
in Jupyter notebooks, which is @interact 00:29:31.700 |
So here's an example of a quadratic with coefficients 1.5 1.5 1.5 00:29:43.860 |
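A sketch of that interactive plot, assuming ipywidgets is available in the notebook; it reuses x, y and quad from above:

```python
# Sketch: three sliders (a, b, c) over a scatter of the data plus the candidate quadratic.
import matplotlib.pyplot as plt
from ipywidgets import interact

@interact(a=1.5, b=1.5, c=1.5)
def plot_quad(a, b, c):
    plt.scatter(x, y)                   # the noisy data points
    xs = torch.linspace(-2, 2, 100)
    plt.plot(xs, quad(a, b, c, xs))     # the quadratic for the current slider values
```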
So how would we try to make this fit better? Well, I think what I'd do is I'd take the first slider and 00:29:49.340 |
I would try moving it to the left and see if it looks better or worse 00:29:52.700 |
That looks worse to me I think it needs to be more curvy so let's try the other way 00:29:58.340 |
Yeah, that doesn't look bad let's do the same thing for the next slider have it this way 00:30:04.860 |
No, I think that's worse. Let's try the other way 00:30:15.700 |
So you can see what we can do we can basically pick each of the coefficients 00:30:20.700 |
One at a time try increasing a middle bit see if that improves it try decreasing it a little bit 00:30:26.380 |
See if that improves it find the direction that improves it and then slide it in that direction a little bit 00:30:32.100 |
and then when we're done we can go back to the first one and see if 00:30:41.500 |
And actually you can see that's not bad because I know the answer is meant to be 3 2 1 so they're pretty close 00:30:53.900 |
What we're going to do, that's basically how those parameters were 00:30:56.820 |
created, but we obviously don't have time, because, you know, big fancy models have 00:31:03.700 |
often hundreds of millions of parameters. We don't have time to try a hundred million sliders, so we do something better 00:31:11.580 |
Well the first step is we need a better idea of like when I move it is it getting better or is it getting worse? 00:31:23.860 |
Description of machine learning that we learned about chapter one of the book and in lesson one 00:31:32.020 |
Something we can measure which is a number that tells us how good is our model and if we had that then as we move 00:31:38.580 |
The sliders we could check to see whether it's getting better or worse 00:31:44.780 |
So there's lots of different loss functions you can pick but perhaps the most simple and common is 00:31:49.780 |
mean squared error, which is going to take in our predictions 00:31:54.940 |
and the actuals, and we're going to go predictions minus actuals, squared, and take the mean 00:32:05.220 |
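In code that's just (a one-line sketch):

```python
# Mean squared error: predictions minus actuals, squared, then the mean.
def mse(preds, acts): return ((preds - acts) ** 2).mean()
```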
If I now rerun the exact same thing I had before but this time I'm going to calculate the loss the MSE between 00:32:15.940 |
Remember where f is the quadratic we created and the actuals y and this time I'm going to add a title to our function 00:32:32.500 |
We're starting at a mean squared error of eleven point four six. So let's try moving this to the left and see if it gets better 00:32:40.700 |
All right, so around there, okay now let's try this one 00:33:02.820 |
So we're about there and so now we can repeat that process, right? 00:33:07.180 |
So we've we've had each of a B and C move a little bit. Let's go back to a 00:33:11.340 |
Can I get any better than 3.28? Let's try moving left 00:33:14.900 |
Yeah, that was a bit better and for B. Let's try moving left 00:33:41.020 |
But at least we can like we don't have to rely on us to kind of recognize does it look better or worse? 00:33:49.940 |
So the key thing we need to know is for each parameter 00:33:58.060 |
when we move it up, does the loss get better, or when we move it down, does the loss get better? 00:34:07.300 |
We could manually increase the parameter a bit and see if the loss improves and vice versa 00:34:15.260 |
And the much faster way is to calculate its derivative 00:34:18.460 |
So if you've forgotten what a derivative is, no problem. There's lots of tutorials out there 00:34:24.140 |
You could go to Khan Academy or something like that 00:34:25.900 |
But in short the derivative is what I just said the derivative is a function that tells you 00:34:32.060 |
If you increase the input does the output increase or decrease and by how much so that's called the slope 00:34:44.420 |
PyTorch can automatically calculate that for you. So if you 00:34:51.180 |
spent horrifying months learning derivative rules in year 11 and are worried you're going to have to remember them all again, don't worry, you don't 00:34:58.140 |
You don't have to calculate any of this yourself. It's all done for you. Watch this 00:35:02.940 |
So the first thing to do is we need a function that takes the coefficients of the quadratic a b and c as inputs I 00:35:11.540 |
put them all in a list. You'll see why in a moment. I kind of call them parameters 00:35:22.420 |
This star on the front is a very very common thing in Python 00:35:26.460 |
Basically, it takes these parameters and spreads them out to turn them into a b and c and pass each of them to the function 00:35:38.820 |
And then we return the mean squared error of our predictions against our actuals 00:35:45.100 |
So this is a function that's going to take the coefficients of a quadratic and return the loss 00:35:54.260 |
Okay, so if we start with a b and c at 1.5 we get a mean squared error of 11.46 00:36:06.380 |
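A sketch of that function, reusing mk_quad, mse, x and y from above:

```python
# Sketch: take [a, b, c], build that quadratic, and return its MSE on our data.
def quad_mse(params):
    f = mk_quad(*params)    # *params spreads the list into the a, b, c arguments
    return mse(f(x), y)

quad_mse([1.5, 1.5, 1.5])   # roughly the 11.46 quoted above (depends on the noise)
```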
So don't worry about that too much in short in Pytorch 00:36:13.020 |
Everything is a tensor a tensor just means that you don't it doesn't just work with numbers 00:36:17.700 |
It also works with lists or vectors of numbers; that's called a 1d tensor 00:36:22.580 |
rectangles of numbers, so tables of numbers; that's called a 2d tensor 00:36:27.380 |
layers of tables of numbers; that's called a 3d tensor, and so forth. So in this case, this is a single number 00:36:34.580 |
But it's still a tensor. That means it's just wrapped up in the Pytorch 00:36:40.540 |
Machinery that allows it to do things like calculate derivatives, but it's still just the number 11.46 00:36:46.340 |
All right, so what I'm going to do is I'm going to create my parameters a b and c and I'm going to put them all in 00:36:55.220 |
A single 1d tensor a 1d tensor is also known as a rank 1 tensor 00:37:03.540 |
1 tensor and it contains the list of numbers 1.5 1.5 1.5 00:37:15.860 |
for these numbers whenever we use them in a calculation, and the way we do that is we just say requires_grad 00:37:26.180 |
Tensor, it contains 1.5 three times, and it also tells us we've flagged it to say please calculate gradients for this 00:37:34.260 |
particular tensor when we use it in calculations 00:37:38.700 |
So let's now use it in the calculation. We're going to pass it to that quad_mse. That's the function 00:37:45.060 |
we just created that gets the MSE, the mean squared error, for a set of coefficients 00:37:49.780 |
And not surprisingly, it's the same number we saw before 11.46. Okay 00:37:57.900 |
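A sketch of those steps; note how the loss now prints with a grad_fn attached, which is what the next bit is about:

```python
# Sketch: a rank-1 tensor of parameters, flagged so PyTorch tracks gradients,
# then passed through the loss function.
abc = torch.tensor([1.5, 1.5, 1.5], requires_grad=True)

loss = quad_mse(abc)
loss    # e.g. tensor(11.46, grad_fn=<MeanBackward0>): a scalar that knows its history
```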
But there is one thing that's very exciting, which is it's added an extra thing to the end called grad_fn 00:38:02.820 |
And this is the thing that tells us that if we wanted to 00:38:06.840 |
PyTorch knows how to calculate the gradients 00:38:10.460 |
For our inputs and to tell Pytorch just please go ahead and do that calculation 00:38:16.340 |
You call backward on the result of your loss function. Now when I run it nothing happens 00:38:24.260 |
It doesn't look like anything happens. But what does happen is it's just added an attribute called grad 00:38:30.780 |
Which is the gradient to our inputs ABC. So if we run this cell 00:38:34.020 |
This tells me that if I increase a the loss will go down 00:38:40.940 |
If I increase B, the loss will go down a bit less 00:38:45.580 |
You know if I increase C, the loss will go down 00:38:51.980 |
Right. So that means we should increase a B and C 00:38:56.700 |
Well, how much by? Well, given that a's gradient says if you increase a even a little bit the loss 00:39:03.180 |
Improves a lot that suggests we're a long way away from the right answer. So we should probably increase this one a lot 00:39:08.980 |
This one the second most and this one the third most 00:39:15.220 |
This parameter the loss decreases. So in other words, we want to adjust our parameters a B and C 00:39:23.500 |
By the negative of these we want to increase increase increase 00:39:41.540 |
But we're just going to like decrease it a bit. We don't want to jump too far. Okay, so just we're just going to go 00:39:46.460 |
a small distance. So we're just going to somewhat arbitrarily pick 0.01 00:39:52.620 |
So that is now going to create a new set of parameters 00:39:55.940 |
Which are going to be a little bit bigger than before because we subtracted negative numbers 00:40:15.100 |
There's one extra line of code which we didn't mention, which is with torch.no_grad() 00:40:21.340 |
Remember earlier on we said that the parameter abc requires grad, and that means PyTorch will automatically calculate 00:40:28.380 |
its derivative when it's used in a function 00:40:32.380 |
Here it's being used in a function, but we don't want the derivative of this. This is not our loss 00:40:37.900 |
Right. This is us updating the gradients. So this is basically 00:40:42.020 |
the standard part of a PyTorch loop, and every neural net, every deep learning model, pretty much every machine learning model 00:40:51.340 |
at least of this style that you build basically looks like this 00:40:55.020 |
If you look deep inside the fastai source code, you'll see something that basically looks like this 00:41:00.460 |
So we could automate that right? So let's just take those steps which is we're going to 00:41:10.540 |
Calculate let's go back to here. We're going to calculate the mean squared error for our quadratic 00:41:20.340 |
call backward, and then subtract the gradient times a small number from the parameters 00:41:30.380 |
So so far we're up to a loss of ten point one 00:41:32.700 |
So we're going to calculate our loss call dot backward to calculate the gradients 00:41:41.100 |
subtract the gradients times a small number, and print how we're going 00:42:09.500 |
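Putting those steps in a loop gives the most basic version of the optimizer; a sketch (this version also zeroes the gradients each step, which the spoken walkthrough glosses over):

```python
# Sketch: basic gradient descent on the three quadratic coefficients.
for i in range(10):
    loss = quad_mse(abc)
    loss.backward()                  # compute d(loss)/d(abc)
    with torch.no_grad():            # don't track the update itself
        abc -= abc.grad * 0.01       # step each coefficient a small amount downhill
        abc.grad.zero_()             # reset gradients before the next iteration
    print(f'step={i}; loss={loss.item():.2f}')
```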
There they are: 3.2, 1.9, 2.0, so they're definitely heading in the right direction 00:42:17.980 |
That's basically how we do it's called optimization 00:42:23.500 |
Okay, so you'll hear a lot in deep learning about optimizers. This is the most basic kind of 00:42:29.420 |
Optimizer, but they're all built on this principle of course 00:42:33.380 |
It's called gradient descent and you can see why it's called gradient descent. We calculate the gradients and 00:42:38.980 |
Then do a descent which is we're trying to decrease the loss 00:42:49.380 |
The entire foundations of how we create those parameters. So we need one more piece 00:42:55.900 |
Which is what is the mathematical function that we're finding parameters for? 00:42:59.940 |
We can't just use quadratics, right because it's pretty unlikely that the relationship between 00:43:08.420 |
Whether a pixel is part of a basset hound is a quadratic. It's going to be something much more complicated 00:43:20.620 |
We can create an infinitely flexible function from this one tiny thing 00:43:33.020 |
It's a linear function. We've got our output Y 00:43:40.140 |
our input x, and coefficients m and b. This is even simpler than our quadratic, and 00:43:40.140 |
And torch.clip is a function that takes the output y, and if it's less than that number 00:43:56.340 |
It turns it into that number. So in other words, this is going to take anything that's negative and make it zero 00:44:06.820 |
calculate the output of a line, and if it is smaller than zero, it'll make it zero 00:44:18.180 |
To take that function and set the M and B to one and one. So this is now going to be this function here 00:44:26.740 |
Y equals X plus one followed by this torch.clip 00:44:30.740 |
And here's the shape okay as we'd expect it's a line 00:44:39.340 |
and then when it gets below zero, it becomes a horizontal line 00:44:43.500 |
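A sketch of that building block (reusing the x from before):

```python
# Sketch: a line m*x + b, with anything below zero clipped to zero (a ReLU).
import torch
import matplotlib.pyplot as plt

def rectified_linear(m, b, x):
    y = m * x + b
    return torch.clip(y, 0.)

plt.plot(x, rectified_linear(1, 1, x))   # m=1, b=1: a line that flattens out at zero
```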
So we can now do the same thing we can take this plot function and make it interactive 00:44:53.020 |
We can see what happens when we change these two parameters M and B. So we're now plotting 00:45:04.180 |
And B is the intercept for the shift up and down 00:45:20.300 |
Work now, why is this interesting? Well, it's not interesting of itself 00:45:24.700 |
but what we could do is we could take this rectified linear function and 00:45:33.580 |
which adds up two rectified linear functions together 00:45:37.540 |
So there's some slope m1, b1, some second slope m2, b2. We're going to calculate it at some point x, and 00:45:49.540 |
at what that function looks like if we plot it and 00:45:53.020 |
You can see what happens is we get this downward slope and then a hook and then an upward slope 00:45:58.700 |
So if I change M1, it's going to change the slope of that first bit 00:46:06.420 |
Okay, and I'm sure you won't be surprised to hear that M2 changes the slope of the second bit and 00:46:23.220 |
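And the double version is just the sum of two of them; a sketch, reusing rectified_linear from above:

```python
# Sketch: two ReLUs added together, giving a function with two independent slopes.
def double_relu(m1, b1, m2, b2, x):
    return rectified_linear(m1, b1, x) + rectified_linear(m2, b2, x)

plt.plot(x, double_relu(-1.5, -1.5, 1.5, 1.5, x))
```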
because we don't just have to do a double ReLU 00:46:26.380 |
we could add as many ReLUs together as we want 00:46:30.780 |
And if we add as many ReLUs together as we want, then we can have an arbitrarily squiggly function, and with enough ReLUs 00:46:42.300 |
right, so you could imagine incredibly squiggly like I don't know like an audio waveform of me speaking and 00:46:48.860 |
If I gave you a hundred million ReLUs, you could almost exactly match that 00:47:04.660 |
We want things that can have more than one input 00:47:06.660 |
but you can add these together across as many dimensions as you like and so exactly the same thing will give you a 00:47:18.140 |
3D, 4D, 5D and so forth and it's the same idea with this 00:47:35.500 |
Problem is you need some numbers for them, you need parameters. Oh, no problem. We know how to get parameters 00:47:59.020 |
Tweaks to make it faster and make it need less data 00:48:10.500 |
Now I remember a few years ago when I said something like this in a class 00:48:15.300 |
Somebody on the forum was like this reminds me of that thing about how to draw an owl 00:48:26.820 |
The thing I find I have a lot of trouble explaining to students is when it comes to deep learning 00:48:33.420 |
there's nothing between these two steps. When you have 00:48:38.860 |
gradient descent to optimize the parameters and 00:48:51.780 |
So we're going to learn about all these other tweaks and they're all very important 00:48:55.500 |
But when you come down to like trying to understand something in deep learning, just try to keep coming back to remind yourself 00:49:05.940 |
that it's using gradient descent to set some parameters to make a wiggly function 00:49:05.940 |
Which is basically the addition of lots of rectified linear units or something very similar to that 00:49:18.340 |
Okay, so we've got some questions on the forum 00:49:24.660 |
Okay, so question from Zakiya with six upvotes so for those of you 00:49:33.060 |
watching the video what we do in the lesson is we want to make sure that the 00:49:37.740 |
Questions that you hear answered are the ones that people really care about 00:49:41.040 |
So we pick the ones which get the most upvotes. This question is 00:49:46.940 |
Is there perhaps a way to try out all the different models and automatically find the best performing one? 00:49:59.820 |
If we go back to our training script remember there's this thing called list models and 00:50:08.340 |
It's a list of strings. So you can easily add a for loop around this 00:50:16.340 |
architecture in timm.list_models(), and you could do the whole lot, which would be like that, and then you could 00:50:30.580 |
It's going to take a long time for 500 and something models 00:50:34.580 |
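For what it's worth, a rough sketch of what such a loop could look like (untested, and as noted it would be very slow across 500-plus models):

```python
# Sketch: fine-tune each candidate architecture briefly and record its validation results.
results = {}
for arch in timm.list_models('convnext*'):        # or timm.list_models() for everything
    learn = vision_learner(dls, arch, metrics=error_rate)
    learn.fine_tune(1)
    results[arch] = learn.validate()               # [validation loss, error rate]
```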
So generally speaking like I've I've never done anything like that myself 00:50:40.100 |
I would rather look at a picture like this and say like okay. Where am I in? 00:50:44.700 |
the vast majority of the time. This would be, I reckon, the number one mistake of 00:50:51.100 |
beginners I see, is that they jump to these models 00:50:55.540 |
from the start. At the start of a new project, I pretty much only use ResNet 18 00:51:06.220 |
Trying things out and I try different data augmentation. I'm going to try different ways of cleaning the data 00:51:16.420 |
Different external data I can bring in and so I want to be trying lots of things now 00:51:22.180 |
I want to be able to try it as fast as possible, right? So 00:51:25.300 |
Trying better architectures is the very last thing that I do and 00:51:33.980 |
What I do is once I've spent all this time, and I've got to the point where I've got okay 00:51:37.620 |
I've got my ResNet 18 or maybe you know ResNet 34 because it's nearly as fast 00:51:49.540 |
Do I need it more accurate for what I'm doing do I need it faster for what I'm doing? 00:51:54.980 |
Could I accept some trade-off to make it a bit slower to make more accurate and so then I'll have a look and I'll say 00:52:00.260 |
Okay, well I kind of need to be somewhere around 0.001 seconds, and so I try a few of these 00:52:04.980 |
So that would be how I would think about that 00:52:09.860 |
Okay next question from the forum is around how do I know if I have enough data? 00:52:18.300 |
What are some signs that indicate my problem needs more data? 00:52:22.780 |
I think it's pretty similar to the architecture question, so you've got some amount of data 00:52:30.640 |
Presumably you've you know you've started using all the data that you have access to you built your model 00:52:41.360 |
Do you have the accuracy that you need for whatever it is you're doing? 00:52:45.840 |
You can't know until you've trained the model, but as you've seen it only takes a few minutes to train a quick model 00:52:57.280 |
my very strong opinion is that the vast majority of 00:53:00.560 |
Projects I see in industry wait far too long before they train their first model 00:53:06.840 |
You know my opinion you want to train your first model on day one with whatever 00:53:11.720 |
CSV files or whatever that you can hack together 00:53:15.080 |
And you might be surprised that none of the fancy stuff 00:53:19.960 |
You're thinking of doing is necessary because you already have a good enough accuracy for what you need 00:53:24.200 |
Or you might find quite the opposite; you might find that, oh my god, we're basically getting no accuracy at all 00:53:32.600 |
These are things you want to know at the start 00:53:38.040 |
We'll learn lots of techniques both in this part of the course and in part two 00:53:42.360 |
About ways to really get the most out of your data 00:53:45.560 |
In particular there's a reasonably recent technique called semi-supervised learning 00:53:51.880 |
Which actually lets you get dramatically more out of your data 00:53:54.640 |
And we've also started talking already about data augmentation, which is a classic technique you can use 00:54:00.040 |
So you generally speaking it depends how expensive is it going to be to get more data? 00:54:03.960 |
But also what do you mean when you say get more data? Do you mean more labeled data? 00:54:08.000 |
Often it's easy to get lots of inputs and hard to get lots of outputs 00:54:13.040 |
For example in medical imaging where I've spent a lot of time 00:54:16.720 |
It's generally super easy to jump into the radiology archive and grab more CT scans 00:54:22.240 |
But it's maybe very difficult and expensive to 00:54:25.880 |
You know draw segmentation masks and and pixel boundaries and so forth on them 00:54:39.160 |
Or text or whatever and maybe it's harder to get labels 00:54:43.600 |
And again, there's a lot of stuff you can do using things like, as we'll discuss, semi-supervised learning, to actually take advantage of unlabeled data 00:54:56.600 |
Final question here in the quadratic example where we calculated the initial derivatives for A B and C 00:55:02.800 |
We got values of minus 10.8 minus 2.4, etc. What unit are these expressed in? 00:55:08.440 |
Why don't we adjust our parameters by these values themselves? 00:55:11.520 |
So I guess the question here is why are we multiplying it by a small number? 00:55:17.560 |
Okay, let's take those two parts of the question 00:55:32.720 |
how much does, sorry, for each increase in a of 1, so if I increase a from 00:55:42.840 |
where we have 1.5, so if we increase from 1.5 to 2.5 00:55:52.180 |
the loss would go down by about 10.8. Now, that's not exactly right because it's kind of like 00:55:58.280 |
It's kind of like in an infinitely small space right because actually it's going to be curved 00:56:04.600 |
Right, but if it stayed the same, that's like, that's what would happen 00:56:04.600 |
The loss would decrease if it stayed constant 00:56:17.800 |
you know, if the slope stayed the same, the loss would decrease by about 2.1 00:56:17.800 |
Change it directly by these numbers. Well, the reason is 00:56:47.960 |
And there's some kind of interesting theory that says that once you get close enough to the 00:56:57.280 |
Optimal value all functions look like quadratics anyway, right? So we can kind of safely draw it in this kind of shape 00:57:06.080 |
Because this is what they end up looking like if you get close enough 00:57:16.320 |
I used my daughter's favorite pens, the sparkly ones. So we're measuring the slope here 00:57:16.320 |
So that seems to suggest we should jump a really long way. So we jump a really long way 00:57:34.800 |
And what happened? Well, we jumped way too far. And the reason is that that slope 00:57:42.960 |
We moved a lot and so that's generally what's going to happen, right? 00:57:47.320 |
Particularly as you approach the optimal is generally the slopes going to decrease 00:57:51.080 |
So that's why we multiply the gradient by a small number 00:57:55.040 |
And that small number is a very very very important number. It has a special name 00:58:11.520 |
hyperparameter. It's not a parameter; it's not one of the actual coefficients of your function 00:58:17.720 |
But it's a parameter you use to calculate the parameters 00:58:22.000 |
Pretty meta, right? It's a hyperparameter. And so it's something you have to pick. Now, we haven't picked any yet 00:58:30.760 |
In any of the stuff we've done that I remember and that's because fast AI generally picks reasonable defaults 00:58:36.720 |
For most things but later in the course we will learn about how to try and find 00:58:43.800 |
Learning rates and you will find sometimes you need to actually spend some time finding a good learning rate 00:58:49.960 |
You could probably understand the intuition here if you pick a learning rate, that's too big 00:59:03.600 |
Then jump back again and you'll jump too far the other way and you'll actually 00:59:08.480 |
Diverge and so if you ever see when your model is training, it's getting worse and worse 00:59:16.440 |
What would happen on the other hand if you pick a learning rate that's too small? 00:59:25.480 |
you'll take tiny steps, and of course the flatter it gets, the smaller the steps are going to get, and 00:59:34.040 |
So finding the right learning rate is a compromise 00:59:37.560 |
Between the speed at which you find the answer and the possibility that you're actually going to shoot past it and get worse and worse 00:59:43.800 |
Okay, so one of the bits of feedback I got quite a lot in the survey is that people want a break halfway through 00:59:52.720 |
Which I think is a good idea. So I think now is a good time to have a break 00:59:55.520 |
So let's come back in 10 minutes at 25 past 7 01:00:00.160 |
Okay, hope you had a good rest have a good break I should say 01:00:17.760 |
So I want to now show you a really really important 01:00:31.680 |
All right, so we're going to be wanting to do a whole lot of 01:00:36.800 |
mx plus b's, and we don't just want to do mx plus b. We're going to want to have like lots of 01:00:44.000 |
Variables so for example every single pixel of an image would be a separate variable 01:00:49.800 |
so we're going to multiply every single one of those times some coefficient and then add them all together and 01:00:58.920 |
do the clip, the ReLU, and then we're going to do it a second time with a second bunch of parameters 01:01:05.000 |
And then a third time and a fourth time and fifth time 01:01:07.120 |
It's going to be pretty inconvenient to write out a hundred million ReLU's 01:01:13.920 |
But it so happens there's a single mathematical operation that does all of those things for us except for the final replace 01:01:21.840 |
negatives with zeros and it's called matrix multiplication I 01:01:25.400 |
expect everybody at some point did matrix multiplication at high school. I suspect also a lot of you have forgotten how it works 01:01:39.520 |
They give the impression you need years of graduate school study to learn all this linear algebra 01:01:45.280 |
You don't actually all you need almost all the time is matrix multiplication and it couldn't be simpler 01:01:52.800 |
I'm going to show you a couple of different ways 01:01:54.000 |
The first is there's a really cool site called matrixmultiplication.xyz; you can put in any matrix you want 01:02:07.360 |
So this matrix is saying I've got three rows of data 01:02:14.440 |
So maybe they're tiny images with three pixels each, and the value of the first one is 1 2 1 01:02:14.440 |
These are our three sets of coefficients. So we've got a 01:02:31.960 |
b and c in our data, so I guess you'd call it x1, x2 and x3, and then here's our first set of coefficients a, b and c 01:02:44.640 |
So here's what happens when we do matrix multiplication with that second matrix here of coefficients 01:02:58.640 |
These are the multiplications and additions that I mentioned, right? So multiply add, multiply add, multiply add; so that's going to give you 01:03:13.200 |
Second matrix times the first row so that gives you the top left 01:03:21.120 |
So the next one is going to give us two results, right? 01:03:23.920 |
So we've got now the right hand one with the top row and the left hand one with the second row 01:03:34.400 |
And that's it that's what matrix multiplication is it's multiplying things together and adding them up 01:03:40.840 |
So there'd be one more step to do to make this a layer of a neural network 01:03:45.200 |
Which is if this had any negatives we replace them with zeros 01:03:55.880 |
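The same idea in a few lines of PyTorch, with made-up numbers:

```python
# Sketch: 3 rows of data times a column of coefficients, then clip negatives to
# zero; multiply-and-add plus the ReLU is one layer of a neural network.
import torch

data   = torch.tensor([[1., 2., 1.],
                       [0., 1., 1.],
                       [2., 3., 1.]])    # three rows, three values each
coeffs = torch.tensor([[ 0.5],
                       [-1.0],
                       [ 0.3]])          # one coefficient per column
out = (data @ coeffs).clip(0.)           # matrix multiply, then negatives become zero
```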
foundational mathematical operation in basically all of deep learning 01:04:02.360 |
The GPUs that we use, the thing that they are good at is this matrix multiplication 01:04:11.560 |
Which we can basically only do one thing which is to multiply together two four by four matrices 01:04:18.400 |
And then they do that lots of times with bigger matrices 01:04:24.600 |
example of this we're actually going to build a 01:04:27.480 |
complete machine learning model on real data in the spreadsheet 01:04:40.000 |
Fast AI has become kind of famous for a number of things and one of them is using spreadsheets 01:04:44.880 |
To create deep learning models. We haven't done it for a couple of years. I'm pretty pumped to show this to you 01:04:56.480 |
Where there's a competition I actually helped create many years ago called Titanic 01:05:05.920 |
And it's like an ongoing competition. Some 14,000 teams have entered it so far 01:05:15.960 |
There's no end date and the data for it is the data about 01:05:32.440 |
And so I clicked here on the download button to grab it on my computer that gave me a CSV 01:05:43.520 |
The first thing I did then was I just removed a few columns that 01:05:46.840 |
Clearly were not going to be important things like the name of the passengers the passenger ID 01:05:57.400 |
Each row of this is one passenger. The first column is the dependent variable. The dependent variable is the thing we're trying to predict 01:06:07.280 |
The remaining are some information such as what class of the boat (first, second or third class), their sex, their age 01:06:16.680 |
So you should always look for a data dictionary to find out what's what: the number of parents and children, okay 01:06:31.040 |
What was their fare and which of the three cities did they embark on? Okay, so there's that data 01:06:48.140 |
There's all kinds of things we could do for that. But for this purpose, I just decided to remove them and 01:06:57.400 |
I found the same thing for embarked. I removed the blanks as well 01:07:00.900 |
But that left me with nearly all of the data, okay, so then I've put that over here 01:07:11.920 |
Okay, that's the so this these are the columns that came directly from Kaggle 01:07:26.000 |
So basically what we now want to do is we want to multiply each of these by a coefficient 01:07:41.040 |
But you can't multiply male by a coefficient, so I converted all of these to numbers; male and female are very easy 01:07:45.960 |
I created a column called is male and as you can see, there's just an if statement that says if sex is male 01:07:55.320 |
And we can do something very similar for embarked; we can have one column called did they embark in Southampton? 01:08:00.820 |
Same deal, and another column for did they embark in Cherbourg 01:08:09.760 |
And their P class is one two or three which is a number but it's not really 01:08:17.180 |
It's not really a continuous measurement of something. There isn't one, two or three of that thing; they're different 01:08:17.180 |
levels. So I decided to turn those into similar things, into these binary, they're called binary categorical variables 01:08:24.760 |
The other thing that I was thinking well, you know that I kind of tried it and checked out what happened and what happened was 01:08:49.960 |
So I created some random numbers. So to create the random numbers 01:08:57.960 |
Right and I copied those to the right and then I just went copy and I went paste values 01:09:04.240 |
So that gave me some random numbers, and so, just like before where I said a, b and c, 01:09:10.800 |
Let's just start them at 1.5 1.5 1.5 what we do in real life is we start our parameters at random numbers 01:09:22.880 |
Actually, sorry, I slightly lied. I didn't use Rand. I used Rand minus 0.5 01:09:28.160 |
And that way I've got small numbers that were on either side of 0 01:09:32.240 |
So then when I took each of these and I multiplied them by 01:09:39.760 |
fares and ages and so forth, what happened was that these numbers here 01:09:51.600 |
you know, were much bigger than these numbers here, and so in the end all that mattered was what was their fare 01:09:57.680 |
because they were just bigger than everything else 01:10:00.240 |
So I wanted everything to basically go from 0 to 1 these numbers were too big 01:10:05.200 |
So what I did up here is I just grabbed the maximum 01:10:08.040 |
of this column; the maximum of all the fares is 512, and so then 01:10:16.000 |
Actually, I do age first. I did a maximum of age because a similar thing, right? There's 80 year olds and there's two year olds and 01:10:22.360 |
So then I'm over here. I just did okay. Well, what's their age? 01:10:26.880 |
Divided by the maximum and so that way all of these are between 0 and 1 01:10:35.200 |
So that's how I fixed this; this is called normalizing the data 01:10:41.720 |
Now, we haven't done any of these things when we've done stuff with fastai 01:10:47.720 |
That's because fastai does all of these things for you 01:10:54.000 |
But all these things are being done behind the scenes 01:10:59.080 |
For the fare I did something a bit more, which is I noticed there are lots of very small fares, and 01:11:07.760 |
there are also a few very big fares — like $70 versus $7, $7... 01:11:13.000 |
So generally speaking, when you've got a few really big numbers 01:11:17.880 |
and lots of really small numbers — this is really common with 01:11:25.640 |
money, you know, because money kind of follows this relationship where a few people have lots of it 01:11:31.600 |
And they spend huge amounts of it and most people don't have heaps 01:11:34.840 |
If you take the log of something that has that kind of extreme distribution 01:11:39.440 |
you end up with something that's much more evenly distributed. So I've added this column here called log fare 01:11:48.320 |
And these are all around one which isn't bad. I could have normalized that as well 01:11:51.760 |
But I was too lazy. I didn't bother because it seemed okay 01:11:54.520 |
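The log transform is also a one-liner; `log1p` (log of 1 + x) is my choice here purely because it copes with fares of zero, and isn't necessarily what the spreadsheet formula does:

```python
import numpy as np

# Squash the long right tail of the fares; log1p handles zero fares gracefully
df["LogFare"] = np.log1p(df["Fare"])
```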
So at this point you can now see that if we start from here, 01:12:02.680 |
all of these are all around the same kind of level, right? So none of these columns are going to dominate just because their numbers happen to be bigger 01:12:11.640 |
So now I've got my coefficients which are just as I said, they're just random 01:12:21.960 |
I want a·x1 plus b·x2 plus c·x3 plus blah blah blah. Okay, and so to do that 01:12:32.980 |
I can use SUMPRODUCT in Excel. I could have typed it out by hand; it'd be very boring 01:12:37.280 |
But SUMPRODUCT is just going to multiply each of these pairs — 01:12:48.320 |
this one will be multiplied by this one, and so forth — and then they all get added together 01:12:52.580 |
Now one thing, if you're eagle-eyed, you might be wondering: in a linear equation 01:13:02.520 |
there's this constant term, and I do not seem to have any constant term 01:13:06.320 |
I've got something here called const, but I don't have any plus at the end 01:13:13.320 |
Well, there's a nice trick that we pretty much always use in machine learning 01:13:17.360 |
Which is to add a column of data just containing the number one every time 01:13:23.120 |
If you have a column of data containing the number one every time, then that parameter becomes your constant term 01:13:34.600 |
It makes the code a little bit simpler when you do it that way. It's just a trick, but everybody does it 01:13:41.160 |
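Here's a minimal sketch of the column-of-ones trick, continuing from the earlier snippets (the names X, coeffs and Const are mine, not the spreadsheet's):

```python
import numpy as np

# Feature matrix with a constant column of ones appended;
# the coefficient on that column plays the role of the intercept
cols = ["IsMale", "Embarked_S", "Embarked_C", "Pclass_1", "Pclass_2", "Age_norm", "LogFare"]
X = df[cols].copy()
X["Const"] = 1.0
X = X.to_numpy()

coeffs = np.random.rand(X.shape[1]) - 0.5   # small random values either side of zero
preds = X @ coeffs                          # a*x1 + b*x2 + ... + const*1, one SUMPRODUCT per row
```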
Okay, so this is now the result of our linear model 01:13:45.460 |
So this is not — I'm not even going to do ReLU, right? I'm just going to do a plain regression 01:13:54.200 |
Now if you've done regression before you might have learned about it as something you kind of solve with various matrix things 01:14:00.400 |
But in fact, you can solve a regression using gradient descent 01:14:05.040 |
So I've just gone ahead and created a loss for each row. And so the loss is going to be equal to survived minus prediction, squared 01:14:23.280 |
And there they all are — squared errors. And so here I've just 01:14:26.280 |
summed them up. I could have taken the mean; I guess that would have been a bit easier to think about 01:14:32.000 |
But the sum is going to give us the same result. So here's our loss 01:14:35.140 |
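Continuing the sketch, the loss he builds is just the sum of squared errors between the survived column and those predictions:

```python
y = df["Survived"].to_numpy()

# Sum of squared errors, matching the spreadsheet's per-row (survived - prediction)^2
loss = ((y - preds) ** 2).sum()
```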
And so now we need to optimize that using gradient descent 01:14:40.480 |
So Microsoft Excel has a gradient descent optimizer in it called solver 01:14:45.920 |
So I'll click Solver and it'll say, okay, what are you trying to optimize? It's this cell here, and I'm going to do it by changing these coefficient cells 01:15:00.200 |
I'm trying to minimize it. And so we're starting at a loss of 55.78 01:15:30.640 |
And there we go — so it's gone from 1.03 to 0.1, and so we can check the predictions. So for the first one 01:15:42.560 |
it was that they didn't survive, and we predicted they wouldn't survive 01:15:52.800 |
So here you can start to see a few issues, which is that sometimes it's predicting less than zero 01:15:58.080 |
and sometimes it's predicting more than one 01:16:04.720 |
Wouldn't it be cool if we had some way of constraining it to between zero and one and that's an example of some of the things 01:16:10.240 |
We're going to learn about that make this stuff work a little bit better, right? 01:16:13.240 |
But you can see it's doing an okay job. So this is not deep learning 01:16:16.160 |
This is not a neural net yet. This is just a regression 01:16:24.480 |
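Excel's Solver is doing an optimization loop for us; here's a rough hand-rolled equivalent of the same regression, a minimal sketch assuming the X and y arrays from the earlier snippets (PyTorch is used purely because it gives us gradients for free; the learning rate and step count are arbitrary choices, not from the lesson):

```python
import torch

Xt = torch.tensor(X, dtype=torch.float32)   # features, including the constant column
yt = torch.tensor(y, dtype=torch.float32)   # survived (0 or 1)

coeffs_t = (torch.rand(Xt.shape[1]) - 0.5).requires_grad_()

for step in range(2000):
    preds_t = Xt @ coeffs_t                  # linear model: one prediction per passenger
    loss_t = ((yt - preds_t) ** 2).sum()     # same sum-of-squared-errors loss
    loss_t.backward()                        # gradients of the loss w.r.t. the coefficients
    with torch.no_grad():
        coeffs_t -= 0.001 * coeffs_t.grad    # take a small step downhill
        coeffs_t.grad.zero_()
```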
So I'm just going to do it twice. So now rather than one set of coefficients 01:16:29.280 |
I've got two sets and again, I just put in random numbers 01:16:44.800 |
So the first sum product is with my first set of coefficients 01:16:47.840 |
And my second sum product is with my second set of coefficients 01:16:52.800 |
So I'm just calling them linear one and linear two 01:16:56.000 |
Now there's no point just adding those up together, because if you add up two linear functions you get another linear function — we first need to replace the negatives with zeros 01:17:08.400 |
So in Microsoft Excel, ReLU looks like this: if the number is less than zero, 01:17:12.720 |
use zero, otherwise use the number. So that's how we're going to replace the negatives with zeros 01:17:25.440 |
Then we have to add them together. So we add the two ReLUs 01:17:30.560 |
So that's going to be our prediction and then our loss is the same as the other sheet. It's just survived minus prediction squared 01:17:50.800 |
So we run Solver again: optimize AH1, and this time we're changing all of those parameters 01:18:04.000 |
Excel's Solver is not the fastest in the world, but it gets the job done 01:18:18.960 |
So there you go. So we've now created our first deep learning neural network from scratch 01:18:24.880 |
And we did it in Microsoft Excel, everybody's favorite artificial intelligence tool 01:18:34.560 |
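A sketch of the same two-linear-layers-plus-ReLU construction in code, again with hypothetical names and reusing the tensors from the previous snippet — two sets of random coefficients, just as in the spreadsheet:

```python
c1 = (torch.rand(Xt.shape[1]) - 0.5).requires_grad_()
c2 = (torch.rand(Xt.shape[1]) - 0.5).requires_grad_()

for step in range(2000):
    lin1 = Xt @ c1                                   # first linear function
    lin2 = Xt @ c2                                   # second linear function
    preds_t = torch.relu(lin1) + torch.relu(lin2)    # replace negatives with zero, then add
    loss_t = ((yt - preds_t) ** 2).sum()
    loss_t.backward()
    with torch.no_grad():
        for c in (c1, c2):
            c -= 0.001 * c.grad
            c.grad.zero_()
```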
Now, it would be a bit faster and easier if we used matrix multiplication. So let's finally do that 01:18:40.400 |
So this next one is going to be exactly the same as the last one but with matrix multiplication 01:18:47.840 |
You'll notice the key difference now is that our parameters have been transposed 01:18:55.120 |
instead of matching the data in terms of being laid out in columns 01:19:04.240 |
The way matrix multiplication works is that you have to transpose this, so it goes 01:19:10.400 |
with the x and y kind of the opposite way around — the rows and columns the opposite way around 01:19:15.360 |
Other than that it's the same — I've just copied and pasted the same random numbers 01:19:27.120 |
This entire thing here is a single function, which is MMULT — matrix multiply 01:20:07.920 |
We should get the same number — point oh eight, wasn't it? 01:20:16.480 |
Okay, so that's just another way of doing the same thing. And you can see that, um, 01:20:21.520 |
matrix multiplication — it takes, like, a surprisingly long time, at least for me, to get an intuition 01:20:30.160 |
for it, because matrix multiplication wraps all of that up into a single mathematical operation. So I still find it helpful to see it laid out like this 01:20:50.880 |
That is a deep learning neural network in microsoft excel 01:21:09.200 |
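The matrix-multiplication version collapses all the per-column SUMPRODUCTs into one operation. A sketch, continuing the snippet above (stacking the two coefficient sets into a single matrix is my way of expressing the transposition he mentions, not a literal translation of the sheet):

```python
# Stack both coefficient sets into one (n_features, 2) matrix so that a single
# matrix multiply replaces all the per-column SUMPRODUCTs at once
C = torch.stack([c1, c2], dim=1)                # shape: (n_features, 2)
both_linears = Xt @ C                           # shape: (n_rows, 2), one column per linear function
preds_t = torch.relu(both_linears).sum(dim=1)   # ReLU each column, then add them together
```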
So I'd suggest trying the Titanic machine learning competition on Kaggle if you haven't done much machine learning before 01:21:13.040 |
It's certainly worth trying out, just to kind of get a feel for how these all get put together 01:21:31.040 |
Chapter 4 of the book is the chapter where we lose the most people, because, um, to be honest, it's hard 01:21:48.720 |
We're teaching it a very different way in the course to what's in the book 01:21:53.120 |
Um, and you know, you can use the two together, but if you've tried to read the book and been a bit disheartened 01:21:59.040 |
Um, yeah, try, you know, try following through through the spreadsheet instead 01:22:03.360 |
Maybe try creating — like, if you use Numbers or Google Sheets or something, you could try to create your own kind of version of it 01:22:13.520 |
Or you could try to do it yourself from scratch in python 01:22:28.160 |
Okay, a question from Victor Guerra: in the Excel exercise, 01:22:38.560 |
Jeremy is doing some feature engineering. He comes up with two new columns, Pclass 1 and Pclass 2, but no Pclass 3 01:22:53.040 |
Um, is it because if Pclass 1 is zero and Pclass 2 is zero, then Pclass 3 must be one? 01:22:59.920 |
So in a way, two columns are enough to encode the input of the original column. Yes 01:23:09.840 |
There's no need to tell the computer about things it can kind of figure out for itself. Um, so when you create these — they're called dummy variables — for a categorical variable with 01:23:20.560 |
three levels like this one, you need two dummy variables. So in general, a categorical variable with n levels needs n minus one dummy variables 01:23:36.960 |
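pandas can generate those n-minus-one dummy columns for you; `drop_first=True` is what implements the rule just described (note it drops the first level rather than the third — which level you leave out doesn't matter, since the remaining columns still encode it):

```python
import pandas as pd

# A 3-level categorical like Pclass becomes 2 dummy columns when drop_first=True
pclass_dummies = pd.get_dummies(df["Pclass"], prefix="Pclass", drop_first=True)
df = pd.concat([df, pclass_dummies], axis=1)
```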
So what we're going to be doing in our next lesson is looking at natural language processing 01:23:41.600 |
So far we've looked at some computer vision, and just now we've looked at what we call tabular data 01:23:50.480 |
Next up, we're going to be looking at natural language processing. So I'll give you a taste of it 01:23:54.800 |
So you might want to open up the "Getting Started with NLP for Absolute Beginners" notebook 01:24:06.720 |
So here's the Getting Started with NLP for Absolute Beginners notebook. I will say, as a notebook author — 01:24:13.440 |
um, it may sound a bit lame, but I always see when people have upvoted it, and it always makes me really happy 01:24:21.280 |
So remember to upvote these notebooks or any other notebooks you you like 01:24:27.440 |
So if you want to ask any questions or make any comments, I enjoy those as well 01:24:39.440 |
NLP is about, rather than taking, for example, image data and making predictions, taking text data 01:24:46.880 |
That text data, most of the time, is in the form of prose — 01:24:52.720 |
like plain English text. Uh, so, you know, English is the most common language used for NLP 01:24:58.240 |
but there are NLP models in dozens of different languages nowadays 01:25:12.880 |
There are fewer resources in non-English languages, and there's a great opportunity there 01:25:21.760 |
This has actually been one of the things that the fastai community — the global community — has been fantastic at 01:25:33.200 |
An NLP resource was created by a student from the very first fastai course, 01:25:44.400 |
and some of the best resources have come out of fastai alumni and so forth 01:25:48.320 |
So that's a particularly valuable thing you could look at. So if your language is not well represented, that's an opportunity 01:25:57.440 |
So, some examples of things you could use NLP for: perhaps the most common and practically useful, in my opinion, is classification 01:26:06.020 |
Classification means you take a document. Now, when I say a document, that could be just one or two words, 01:26:13.360 |
or it could be a Wikipedia page. So it could be any length. We use the word document; 01:26:16.880 |
it sounds like that implies a specific kind of length, 01:26:19.360 |
but it can be a very short thing or a very long thing. We take a document and we try to figure out a category for it 01:26:25.520 |
Now that can cover many many different kinds of applications. So 01:26:28.640 |
One common one that we'll look at a bit is sentiment analysis 01:26:31.920 |
Um, so for example, is this movie review positive or negative sentiment analysis is very helpful in things like 01:26:38.640 |
marketing and product development, you know in big companies, there's lots and lots of 01:26:43.200 |
You know, information coming in about your product. It's very nice to get it quickly sorted out and kind of track metrics from week to week 01:26:51.040 |
Something like figuring out which author wrote a document would be an example of a 01:26:55.040 |
classification exercise, because you're trying to put it in a category — in this case, which author 01:26:59.680 |
I think there's a lot of opportunity in legal discovery. There are already some products in this area, where in this case the category is: 01:27:06.880 |
is this legal document in scope or out of scope for the court case? 01:27:19.920 |
Or: which part of the organization should it be sent to? Is it urgent or not? 01:27:23.600 |
Stuff like that. So these are examples of categories for classification 01:27:30.960 |
The way we do classification tasks in NLP is going to look very, very similar to what we've seen so far 01:27:41.280 |
But what we're going to do is we're going to use a different library 01:27:44.080 |
The library we're going to use is called hugging face transformers rather than fastai 01:27:51.680 |
The main reason why is because I think it's really helpful to see how things are done in more than one library 01:28:03.840 |
With fastai you can do things at a very high level with very little code, or you can dig deeper and deeper and deeper 01:28:03.840 |
Hugging Face Transformers doesn't have the same high-level API that 01:28:10.560 |
fastai has, so you have to do more stuff manually 01:28:16.160 |
And so at this point of the course, you know, we're going to actually intentionally use a library, which is a little bit less user-friendly 01:28:26.160 |
in order to see kind of what steps you have to go through to use other 01:28:30.400 |
libraries. Having said that, the reason I picked 01:28:34.160 |
this particular library is that it is particularly good 01:28:38.000 |
It has really good models in it. It has a lot of really good techniques in it 01:28:45.200 |
Not at all surprising because they have hired lots and lots of fastai alumni. So they have very high quality people working on it 01:28:54.800 |
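To give a flavour of the Hugging Face library before next lesson, here's its high-level `pipeline` helper with a default English sentiment model — this is just a tiny illustrative example, not the approach the notebook itself uses:

```python
from transformers import pipeline

# Downloads a default pretrained English sentiment-classification model
classifier = pipeline("sentiment-analysis")

print(classifier("This movie was absolutely wonderful!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```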
Before the next lesson, um, yeah, if you've got time, 01:28:58.640 |
take a look at this notebook and take a look at the data we're going to be working with 01:29:18.160 |
The task is to decide whether two concepts are referring to the same thing or not, where those concepts are represented as English text 01:29:22.960 |
And when you think about it, that is a classification task because the document is 01:29:30.640 |
Text one blah text two blah, and then the category is similar or not similar 01:29:36.480 |
And in fact, in this case they actually have scores: 01:29:41.120 |
it's basically going to be 0, 0.25, 0.5, 0.75 or 1 — like, how similar is it? 01:29:46.720 |
But it's basically a classification task when you think of it that way 01:29:52.320 |
So yeah, you can have a look at the data and um 01:29:55.520 |
Next week we're going to go step by step through this notebook 01:29:59.760 |
And we're going to take advantage of that as an opportunity also to talk about the really important 01:30:05.920 |
Topics of validation sets and metrics which are two of the most important topics in 01:30:13.040 |
Not just deep learning but machine learning more generally 01:30:16.480 |
All right. Thanks, everybody. I'll see you next week. Bye