Intro to Machine Learning: Lesson 1
Chapters
0:00 Intro
8:05 Importing Libraries
12:05 Kaggle Competitions
16:00 Downloading Data
21:25 Installing Data
23:28 Running Paths
25:13 Structured Data
28:18 Read CSV
34:08 Evaluation
36:23 Random Forest
42:55 SKlearn
47:11 Stack trace
50:46 Feature engineering
56:16 Categorical variables
Let me introduce everybody to everybody else first of all. 00:00:11.600 |
We're here at the University of San Francisco learning machine learning, or you might be 00:00:19.000 |
watching from home. Here are the University of San Francisco graduate students. 00:00:22.280 |
Thank you everybody and wave back from the future and from home to all the students here. 00:00:29.040 |
If you're watching this on YouTube, please stop, and instead go to course.fast.ai and 00:00:40.760 |
watch the lessons from there. There's nothing wrong with YouTube, but I can't edit these videos after I've created them, 00:00:47.160 |
so I need to be able to give you updated information about what environments to use and how the technology 00:00:57.080 |
has changed. You can also watch the lessons from there; there's lots of other material and so forth. 00:01:08.800 |
Tip number two for the videos: because I can't edit them, all I can do is add these 00:01:13.120 |
things called cards, and cards are little things that appear in the top right-hand corner of the video. 00:01:19.720 |
So by the time this video comes out, I'm going to put a little card there for you to click on. 00:01:26.080 |
Unfortunately they're not easy to notice, so keep an eye out for them, because that's how I'll point you to updated information. 00:01:33.900 |
So welcome, we're going to be learning about machine learning today. 00:01:40.080 |
And so for everybody in the class here, you all have Amazon Web Services set up, so you 00:01:45.200 |
might want to go ahead and launch your AWS instance now or go ahead and launch your Jupyter 00:01:56.340 |
If you don't have a Jupyter notebook set up, then what I recommend is you go to crestle.com, 00:02:03.560 |
sign up and sign in there, and you can then turn off Enable GPU and click 00:02:14.280 |
Start Jupyter, and you'll have a Jupyter notebook instantly. 00:02:17.980 |
That costs you some money, it's 3 cents an hour. 00:02:22.040 |
So if you don't mind spending 3 cents an hour to learn machine learning, here's a good way. 00:02:25.840 |
So I'm going to go ahead and say Start Jupyter. 00:02:29.360 |
And so whatever technique you use, there you go. 00:02:32.760 |
One of the things that you'll find on the website is links to lots of information about 00:02:38.560 |
the costs and benefits and approaches to setting up lots of different environments for Jupyter 00:02:43.160 |
notebook, both for deep learning and for regular machine learning. 00:02:47.800 |
So check them out because there's lots of options. 00:02:52.200 |
So if I then open Jupyter in a new tab, here I am in Crestle, or on AWS, or your own computer. 00:03:04.260 |
We use the Anaconda Python distribution for basically everything. 00:03:10.440 |
And again, there's lots of information on the website about how to set that up. 00:03:17.160 |
We're also assuming that either you're using Crestle, or there's something else which I 00:03:22.600 |
really like called paperspace.com, which is another place you can fire up a Jupyter notebook. 00:03:29.900 |
Both of these already have all of the fastai stuff pre-installed for you. 00:03:36.560 |
So as soon as you open up Crestle or Paperspace, assuming you chose the Paperspace fastai template, you're ready to go. 00:03:46.640 |
If you are using your own computer or AWS, you'll need to go to our GitHub repo, fastai 00:03:57.560 |
And then you'll need to do a conda update to install the libraries, and again, that's all 00:04:03.720 |
information we've got on the website, and we've got some previous workshop videos to 00:04:09.560 |
So for this class, I'm assuming that you have a Jupyter notebook running. 00:04:18.380 |
So here we are in the Jupyter notebook, and if I click on fastai, that's what you get 00:04:25.640 |
if you git clone it, or if you're on Crestle; you can see our repo here. 00:04:32.120 |
All of our lessons are inside the courses folder, and the machine learning part 1 is 00:04:45.000 |
If you're ever looking at my screen and wondering where are you, look up here and you'll see 00:04:57.440 |
And today we're going to be looking at lesson 1, random forests. 00:05:16.400 |
So there's a couple of different ways you can do this, both here in person or on the 00:05:22.320 |
You can either attempt to follow along as you watch, or you can just watch and then 00:05:31.200 |
It's up to you, I would maybe have a loose recommendation to watch now and follow along 00:05:41.480 |
with the video later just because it's quite hard to multitask, and if you're working on 00:05:47.640 |
something you might miss a key piece of information which you're welcome to ask about. 00:05:53.640 |
But if you follow along with the video afterwards, then you can pause, stop, experiment and so 00:06:03.120 |
I'm going to go view, toggle header, view, toggle toolbar, and then full screen it so 00:06:17.520 |
So the basic approach we're going to be taking here is to get straight into code, start building 00:06:29.160 |
We're going to get to all the theory, but at the point where you deeply understand what 00:06:33.520 |
it's for and at the point that you're able to be an effective practitioner. 00:06:39.480 |
So my hope is that you're going to spend your time focusing on experimenting. 00:06:44.300 |
So if you take these notebooks and try different variations of what I show you, try it with 00:06:49.680 |
your own datasets, the more coding you can do, the better, the more you'll learn. 00:06:57.000 |
My suggestion, or at least all of my students have told me, the ones who have gone away 00:07:00.600 |
and spent time studying books of theory rather than coding, found that they learned less 00:07:07.540 |
machine learning and that they often tell me they wish there's more time coding. 00:07:14.880 |
The stuff that we're showing in this course, a lot of it's never been shown before. 00:07:18.400 |
This is not a summary of other people's research. 00:07:22.000 |
This is more a summary of 25 years of work that I've been doing in machine learning. 00:07:27.080 |
So a lot of this is going to be shown for the first time. 00:07:30.120 |
And so that's kind of cool because if you want to write a blog post about something 00:07:33.040 |
that you learn here, you might be building something that a lot of people find super 00:07:39.960 |
There's a great opportunity to practice your technical writing, and here's some examples 00:07:42.840 |
of good technical writing, by showing people stuff. 00:07:47.240 |
It's not like, "Hey, I just learned this thing, I bet you all know it." 00:07:50.120 |
Often it will be, "I just learned this thing and I'm going to tell you about it and other 00:07:55.320 |
In fact, this is the first course ever that's been built on top of the fast AI library, 00:08:00.840 |
so even just stuff in the library is going to be new to everybody. 00:08:07.400 |
When we use Jupyter Notebook or anything else in Python, we have to import the libraries 00:08:16.760 |
Something that's quite convenient is if you use these two auto-reload commands at the 00:08:20.560 |
top of your notebook, you can go in and edit the source code of the modules and your notebook 00:08:26.560 |
will automatically update with those new modules. 00:08:29.320 |
You won't have to restart anything, so that's super handy. 00:08:32.800 |
Then to show your plots inside the notebook, you'll want %matplotlib inline. 00:08:37.300 |
These three lines appear at the top of all of my notebooks. 00:08:44.120 |
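Roughly, those three lines at the top of a notebook are:

```python
%load_ext autoreload
%autoreload 2        # re-import edited modules automatically, no kernel restart needed
%matplotlib inline   # show plots inside the notebook
```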
You'll notice when I import the libraries that for anybody here who is an experienced 00:08:48.360 |
Python programmer, I am doing something that would be widely considered very inappropriate. 00:08:56.720 |
Generally speaking in software engineering, we're taught to specifically figure out what 00:09:05.640 |
The more experienced you are as a Python programmer, the more extremely offensive practices you're 00:09:11.400 |
For example, I don't follow what's called PEP8, which is the normal style of code used 00:09:21.080 |
First is go along with it for a while, don't judge me just yet. 00:09:25.880 |
There's reasons that I do these things, and if it really bothers you, then feel free to 00:09:31.720 |
But the basic idea is data science is not software engineering. 00:09:38.040 |
We're using the same languages, and in the end these things may become software engineering 00:09:46.000 |
But what we're doing right now is we're prototyping models. 00:09:49.680 |
Prototyping models has a very different set of best practices that are taught basically 00:09:58.240 |
But the key is to be able to do things very interactively and very iteratively. 00:10:03.480 |
So for example, from library import star means you don't have to figure out ahead of time 00:10:09.640 |
what you're going to need from that library, it's all there. 00:10:14.020 |
Also because we're in this wonderful interactive Jupyter environment, it lets us understand 00:10:23.840 |
So for example, later on I'm using a function called display. 00:10:34.000 |
So you can just type the name of a function and press shift enter, remember shift enter 00:10:39.640 |
is to run a cell, and it will tell you where it's from. 00:10:43.680 |
So anytime you see a function you're not familiar with, you can find out where it's from. 00:10:49.420 |
And then if you want to find out what it does, put a question mark at the start. 00:11:01.720 |
And then, particularly helpful for the FastAI library, I try to make as many functions as 00:11:07.960 |
possible be no more than about five lines of code, it's going to be really easy to read. 00:11:14.320 |
If you put a second question mark at the start, it shows you the source code of the function. 00:11:25.720 |
Right so all the documentation plus the source code, so you can see nothing has to be mysterious. 00:11:30.960 |
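For example, each of these in its own cell (display comes in with the earlier imports; the same pattern works for any function):

```python
display      # run the cell: the output tells you which module the function comes from
?display     # show the documentation
??display    # show the documentation plus the source code
```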
And we're going to be using, the other library we'll use a lot is scikit-learn, which implements 00:11:40.040 |
The scikit-learn source code is often pretty readable. 00:11:44.140 |
And so very often if I want to really understand something, I'll just go question mark, question 00:11:48.340 |
mark, and the name of the scikit-learn function I'm typing, and I'll just go ahead and read 00:11:54.880 |
As I say, the FastAI library in particular is designed to have source code that's very 00:12:00.120 |
easy to read, and we're going to be reading it a lot. 00:12:07.200 |
So today we're going to be working on a Kaggle competition called Blue Book for Bulldozers. 00:12:12.880 |
So the first thing we need is to get that data. 00:12:16.040 |
So if you go Kaggle, bulldozers, then you can find it. 00:12:24.520 |
So Kaggle competitions allow you to download a real-world dataset, a real problem that 00:12:32.920 |
somebody is trying to solve, and solve it according to a specification that that actual 00:12:37.640 |
person with that actual problem decided would be actually helpful to them. 00:12:42.120 |
So these are pretty authentic experiences for applied machine learning. 00:12:48.480 |
Now of course you're missing all the bits that went before, which was why did this company, 00:12:52.840 |
this startup, decide that predicting the auction sale price of bulldozers was important? 00:13:03.440 |
And that's all important stuff as well, but the focus of this course is really on what 00:13:08.200 |
happens next, which is like how do you actually build the model. 00:13:12.480 |
One of the great things about you working on Kaggle competitions, whether they be running 00:13:16.040 |
now or whether they be old ones, is that you can submit to the leaderboard, even old closed 00:13:22.080 |
competitions, you can submit to the leaderboard and find out how would you have gone. 00:13:26.360 |
And there's really no other way in the world of knowing whether you're competent at this 00:13:32.160 |
kind of data and this kind of model than doing that. 00:13:35.640 |
Because otherwise, if your accuracy is really bad, is it because this is just a very hard problem, 00:13:41.080 |
like it's just not possible, the data is so noisy you can't do better? 00:13:46.120 |
Or is it actually that it's an easy data set and you made a mistake? 00:13:51.600 |
And when you finish this course and apply this to your own projects, this is going to 00:13:58.560 |
be something you find very hard, and there isn't a simple solution to it: you're now 00:14:03.200 |
using something that hasn't been on Kaggle, it's your own data set, so do you have a good model or not? 00:14:12.320 |
So we'll talk about that more during the course. 00:14:16.480 |
And in the end, we just have to know that we have good, effective techniques to reliably 00:14:22.000 |
build baseline models, otherwise there's really no way to know. 00:14:27.440 |
There's no way other than creating a Kaggle competition or getting 100 top data scientists 00:14:33.160 |
to work at your problem to really know what's possible. 00:14:37.000 |
So Kaggle competitions are fantastic for learning. 00:14:41.760 |
And as I've said many times, I've learned more from competing in Kaggle competitions 00:14:49.280 |
So to compete in a Kaggle competition, you need the data. 00:14:52.840 |
This one's an old competition, so it's not running now, but we can still access everything. 00:15:00.040 |
So we first of all want to understand what the goal is. 00:15:03.680 |
And I suggest that you read this later, but basically we're going to try and predict the sale price of heavy equipment at auction. 00:15:10.580 |
And one of the nice things about this competition is that if you're like me, you probably don't 00:15:16.120 |
know very much about heavy industrial equipment options. 00:15:20.400 |
I actually know more than I used to because my toddler loves building equipment, so we 00:15:26.000 |
actually watch YouTube videos about front-end loaders and forklifts. 00:15:37.040 |
So one of the nice things is that machine learning should help us understand a data 00:15:43.920 |
So by picking an area which we're not familiar with, it's a good test of whether we can build 00:15:51.200 |
Because otherwise what can happen is that your intuition about the data can make it 00:15:55.280 |
very difficult for you to be open-minded enough to see what does the data really say. 00:16:00.920 |
It's easy enough to download the data to your computer. 00:16:05.520 |
You just have to click on the data set, so here is train.zip, and click download. 00:16:15.000 |
And so you can go ahead and do that if you're running on your own computer right now. 00:16:18.480 |
If you're running on AWS, it's a little bit harder, because unless you're familiar with 00:16:24.680 |
text-mode browsers like ELinks or Lynx, it's quite tricky to get the data set from Kaggle onto your instance. There are a couple of options. 00:16:33.160 |
One is you can download it to your computer and then SCP it to AWS, so SCP works just 00:16:39.960 |
like SSH but it copies data rather than logging in. 00:16:43.600 |
I'll show you a trick though that I really like, and it relies on using Firefox. 00:16:47.840 |
For some reason Chrome doesn't work correctly with Kaggle for this. 00:16:55.440 |
So if I go on Firefox to the website, eventually, and what we're going to do is we're going 00:17:10.360 |
to use something called the JavaScript console. 00:17:14.720 |
So every web browser comes with a set of tools for web developers to help them see what's 00:17:21.380 |
going on, and you can hit control-shift-i to bring up this web developer 00:17:45.720 |
And so then if I click on train.zip and I click on download, and I'm not even going 00:17:56.760 |
to download it, I'm just going to say cancel, but you'll see down here it's shown me all 00:18:01.440 |
of the network connections that were just initiated. 00:18:05.080 |
And so here's one which is downloading a zip file from storage.googleapis.com, blah blah 00:18:11.240 |
That's probably what I want, that looks good. 00:18:14.080 |
So what you can do is you can right-click on that and say copy, copy as curl. 00:18:21.320 |
So curl is a Unix command like wget that downloads stuff. 00:18:27.240 |
So if I go copy as curl, that's going to create a command that has all of my cookies, headers, 00:18:34.800 |
everything in it necessary to download this authenticated data set. 00:18:39.600 |
So if I now go into my server, and if I paste that, you can see a really really long curl 00:18:56.880 |
One thing I notice is that at least recent versions have started adding this --2.0 thing to the command. 00:19:04.600 |
That doesn't seem to work with all versions of curl, so something you might want to do 00:19:17.400 |
is to pop that into an editor, find that bit and get rid of it, and then use the rest. 00:19:27.000 |
Now one thing to be very careful about: by default, curl downloads the file and displays it in the terminal. 00:19:36.600 |
So if I just ran this, it would display gigabytes of binary data in my terminal. 00:19:42.540 |
So to say that I want to output it using some different file name, I always type -o for the 00:19:48.800 |
output file name, and then the name of the file, bulldozers.zip, and make sure you give it the right extension. 00:20:01.080 |
So in this case the file was train.zip, so bulldozers.zip. 00:20:11.800 |
There it is, and so there it all is, so I could make directory bulldozers, then I could 00:20:24.280 |
move my zip file into there, it's the wrong way around, yes, thank you. 00:20:39.480 |
Okay, and then if you don't have unzip installed, you may need to sudo apt install unzip, or 00:21:06.520 |
if you're on a Mac, that would be brew install unzip; if brew doesn't work, you haven't got 00:21:13.000 |
Homebrew installed, so make sure you install it, and then unzip the file. 00:21:21.920 |
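Putting those steps together, a rough sketch (the URL and cookie header below are placeholders; paste the command your own browser copied and just add the -o part; run it in your server's terminal, or in a notebook cell with the leading !):

```python
# Hypothetical sketch of the download; the signed URL and headers come from "Copy as cURL"
!curl 'https://storage.googleapis.com/kaggle-competitions-data/.../Train.zip?...' -H 'Cookie: ...' -o bulldozers.zip
!mkdir bulldozers
!mv bulldozers.zip bulldozers/
!cd bulldozers && unzip bulldozers.zip
```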
One nice thing is that if you're using Crestle, most of the datasets should already be pre-installed for you. 00:21:33.840 |
So what I can do here is I can say open a new tab, here's a cool trick, in Jupyter you 00:21:40.960 |
can actually say new terminal, and you can actually get a web-based terminal. 00:21:47.200 |
And so you'll find on Crestle there's a /datasets folder, /datasets/kaggle, /datasets/fastai, 00:21:58.600 |
often the things you need are going to be in one of those places. 00:22:04.460 |
So assuming that we don't have it already downloaded (actually Paperspace should have 00:22:09.120 |
most of them as well), then we'd need to go to fastai, let's go into the courses, machine 00:22:14.760 |
learning folder, and what I tend to do is I tend to put all of my data for a course into a folder called data. 00:22:23.760 |
You'll find that if you're using Git, that folder doesn't get added to Git, because it's listed in .gitignore. 00:22:31.840 |
So don't worry about creating the data folder, it's not going to screw anything up. 00:22:37.240 |
So I generally make a folder called data, and then I tend to create folders for everything I work on. 00:22:43.420 |
So in this case, I'll make a bulldozers folder and cd into it (remember, !$ in bash gives you the last word of the previous command). 00:22:56.360 |
I'll go ahead and grab that curl command again and download bulldozers.zip in there; there we go. 00:23:25.360 |
So you can now see, I generally take anything that might change from person to person and put it in a variable at the top. 00:23:36.400 |
So here I just define something called PATH, but if you've used the same path I just did, 00:23:39.680 |
you should just be able to go ahead and run that, and let's go ahead and keep moving along. 00:23:45.120 |
So we've now got all of our libraries imported, and we've set the path to the data. 00:23:52.480 |
You can run shell commands from within Jupyter Notebook by using an exclamation mark. 00:23:59.960 |
So if I want to check what's inside that path, I can go ls data/bulldozers, and you can see 00:24:11.240 |
If you use a Python variable inside a Jupyter shell command, you have to put it in curlies. 00:24:18.920 |
So that makes me feel good that my path is pointing at the right place. 00:24:22.040 |
If you say !ls {PATH} and you get nothing at all, then you're pointing at the wrong place. 00:24:37.760 |
So the curly brackets refer to the fact that I put an exclamation mark at the front, which 00:24:41.880 |
means the rest of this is not a Python command, it's a bash command. 00:24:48.960 |
And bash doesn't know about capital path, because capital path is part of Python. 00:24:54.360 |
So this is a special Jupyter thing which says: expand this Python thing, please, before you pass it to bash. 00:25:14.640 |
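So the pattern in a notebook cell looks roughly like this:

```python
PATH = "data/bulldozers/"   # anything machine-specific lives in a variable at the top
!ls {PATH}                  # the ! runs bash; {PATH} is expanded by Jupyter before bash sees it
```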
So the goal here is to use the training set, which contains data through the end of 2011 00:25:24.040 |
And so the main thing to start with then is of course to look at the data. 00:25:30.640 |
Now the data is in CSV format, so one easy way to look at the data would be to use shell 00:25:37.920 |
command head to look at the first few lines, head, bulldozers, and even tab completion 00:25:51.040 |
So there's a bunch of column headers, and then there's a bunch of data. 00:25:56.400 |
So what we want to do is take this and read it into a nice tabular format. 00:26:01.920 |
So does Terrence putting these glasses on mean I should make this bigger, or is it okay? 00:26:11.360 |
So this kind of data where you've got columns representing a wide range of different types 00:26:17.560 |
of things, such as an identifier, a currency, a date, a size, I refer to this as structured 00:26:27.680 |
Now I say I refer to this as structured data because there have been many arguments in 00:26:32.560 |
the machine learning community on Twitter about what is structured data. 00:26:37.000 |
Weirdly enough, this is like the most important type of distinction between data that looks 00:26:42.560 |
like this and data like images where every column is of the same type. 00:26:48.080 |
That's the most important distinction in machine learning, yet we don't have standard accepted 00:26:55.400 |
So I'm going to use the terms structured and unstructured. 00:26:58.960 |
But note that other people you talk to, particularly in NLP, people use structured to mean something 00:27:07.040 |
So when I refer to structured data, I mean columns of data that can have varying different 00:27:14.720 |
By far the most important tool in Python for working with structured data is pandas. 00:27:20.520 |
Pandas is so important that it's one of the few libraries that everybody imports the same way, as pd. 00:27:27.920 |
So you'll find that one of the things I've got here is from fastai.imports import *. 00:27:37.180 |
The fastai.imports module has nothing but imports of a bunch of hopefully useful tools. 00:27:47.520 |
So all of the code for fastai is inside the fastai directory inside the fastai repo. 00:27:56.600 |
And so you can have a look at imports, and you'll see it's literally just a list of imports. 00:28:11.000 |
So you'll see lots of people using pd.something, they're always talking about pandas. 00:28:21.520 |
And so when we read the CSV file, we just tell it the path to the CSV file, a list of 00:28:28.440 |
any columns that contain dates (that's parse_dates), and I always add low_memory=False, which is going 00:28:33.640 |
to actually make it read more of the file to decide what the types are. 00:28:39.320 |
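Roughly, that cell looks like this (the file and column names come from the Kaggle dataset):

```python
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"])
```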
This here is something called a Python 3.6 format string. 00:28:48.840 |
We've probably used lots of different ways in the past in Python of interpolating variables 00:28:54.440 |
Python 3.6 has a very simple way that you'll probably always want to use from now on. 00:28:59.920 |
And you create a normal string, you type an f at the start, and then if I define a variable, 00:29:10.640 |
I can say f"hello {name}" and it will put the variable's value into the string. 00:29:21.360 |
These are not the same curlies that we saw earlier on in the ls command. 00:29:25.640 |
That ls command is specific to Jupyter and it interpolates python code into shell code. 00:29:34.520 |
These curlies are Python 3.6 format string curlies. 00:29:38.120 |
They require an f at the start, so if I get rid of the f, it doesn't interpolate. 00:29:46.720 |
And the cool thing is, inside those curlies you can write just about any Python code you like. 00:30:03.680 |
And because it's a format string, it doesn't matter if the thing is an integer rather than a string. 00:30:15.400 |
Normally if you try string concatenation with integers, Python complains; no such problem here. 00:30:26.160 |
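A quick experiment you can try in a cell, with a made-up variable:

```python
name = 'Jeremy'           # hypothetical example variable
f'hello {name}'           # -> 'hello Jeremy'
f'2 plus 2 is {2 + 2}'    # any expression works; integers need no str() conversion
'hello {name}'            # without the f, the curlies are left alone
```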
So this is going to read path/train.csv into a thing called a data frame. 00:30:33.600 |
Pandas data frames and R's data frames are pretty similar, so if you've used R before, 00:30:41.120 |
then you'll find that this is reasonably comfortable. 00:30:45.340 |
So this file is 9.3 meg, and its size is 112 meg. 00:31:01.920 |
And it has 400,000 rows in it, so it takes a moment to import it. 00:31:13.960 |
So when it's done, we can type the name of the data frame, df_raw, and then use various methods on it. 00:31:28.240 |
So for example df_raw.tail() will show us the last few rows of the data frame. 00:31:34.640 |
By default it's going to show the columns along the top and the rows down the side, which isn't ideal when you've got lots of columns, 00:31:41.000 |
so I've just added .transpose() to show it the other way around. 00:31:47.280 |
I've created one extra function here, display_all. 00:31:50.240 |
Normally if you just type df_raw, if it's too big to show conveniently, it truncates it. 00:31:58.200 |
So the details don't matter, but this is just changing a couple of settings to say even 00:32:02.720 |
if it's got a thousand rows and a thousand columns, please still show the whole thing. 00:32:10.520 |
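The helper is just a couple of pandas option settings, roughly (display comes in with the earlier imports):

```python
def display_all(df):
    # temporarily lift the display limits so big frames aren't truncated
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)

display_all(df_raw.tail().transpose())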
In Jupyter Notebook you can type a variable of almost any kind, a video, HTML, an image, 00:32:19.680 |
whatever, and it will generally figure out a way of displaying it for you. 00:32:22.960 |
So in this case it's a pandas data frame, it figures it out a way of displaying it for 00:32:27.920 |
And so you can see here that by default it doesn't show me the whole thing. 00:32:40.080 |
This is the last bit, the tail of it, last few rows. 00:32:55.640 |
And then we've got a whole bunch of things we could predict it with. 00:32:58.280 |
And when I start with a dataset, I tend -- yes, Terrence, can I give you this? 00:33:12.480 |
I've read in books that you should never look at the data because of the risk of overfitting. 00:33:19.380 |
So I was actually going to mention, I actually kind of don't, like I want to find out at 00:33:24.920 |
least enough to know that I've managed to import it okay, but I tend not to really study 00:33:29.760 |
it at all at this point because I don't want to make too many assumptions about it. 00:33:35.640 |
I would actually say most books say the opposite; most books do a whole lot of EDA, exploratory data analysis. 00:33:45.200 |
>> Well, I mean the academic books I've read say that's one of the biggest risks of overfitting. 00:33:52.200 |
>> Yeah, so the truth is kind of somewhere in between, and I generally try to do machine 00:33:59.400 |
learning driven EDA, and that's what we're going to learn today. 00:34:07.420 |
So the thing I do care about though is what's the purpose of the project? 00:34:12.920 |
And for Kaggle projects, the purpose is very easy. 00:34:15.760 |
We can just look and find out, there's always an evaluation section, how is it evaluated? 00:34:21.480 |
And this is evaluated on root mean squared log error (RMSLE). 00:34:26.400 |
So this means they're going to look at the difference between the log of our prediction 00:34:30.160 |
of price and the log of the actual price, and then they're going to square it, add it up, and take the average. 00:34:37.500 |
So because they're going to be focusing on the difference of the logs, that means we should take the log of the prices ourselves. 00:34:45.040 |
And this is pretty common: for a price, generally you care not so much about whether you were off by $10, 00:34:53.280 |
but whether you were off by 10%. So if it was a million-dollar thing and you're $100,000 off, or if it's a $10,000 thing 00:34:57.920 |
and you're $1,000 off, often we would consider those equivalent scale issues. 00:35:03.120 |
And so for this auction problem, the organizers are telling us they care about ratios more 00:35:10.240 |
than differences, and so the log is the thing we care about. 00:35:20.840 |
I'm assuming that you have some familiarity with NumPy. 00:35:23.900 |
If you don't, we've got a video called Deep Learning Workshop, which actually isn't just 00:35:28.280 |
for deep learning, it's basically for this as well. 00:35:31.460 |
And one of the parts there, which we've got a time-coded link to, is a quick introduction 00:35:36.920 |
But basically NumPy lets us treat arrays, matrices, vectors, high-dimensional tensors 00:35:42.760 |
as if they're Python variables, and we can do stuff like log to them, and it will apply 00:35:54.680 |
So in this case df_raw.SalePrice is pulling a column out of a pandas data frame, which 00:36:02.400 |
gives us a pandas series, which shows us the sale prices and the indexes. 00:36:17.160 |
And a series can be passed to a NumPy function, which is pretty handy. 00:36:23.160 |
And so you can see here, this is how I can replace a column with a new column. 00:36:31.720 |
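Concretely, that one line is:

```python
df_raw.SalePrice = np.log(df_raw.SalePrice)   # the metric is RMSLE, so we model the log of the price
```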
Now that we've replaced sale price with its log, we can go ahead and try to create our first random forest. 00:36:40.080 |
We'll find out the details later, but in brief, a random forest is a kind of universal machine learning technique. 00:36:48.520 |
It's a way of predicting something that can be of any kind. 00:36:52.180 |
It could be a category, like is it a dog or a cat, or it could be a continuous variable like price. 00:37:01.940 |
It can predict it with columns of pretty much any kind. 00:37:16.640 |
Can it overfit? It can, and we'll learn to check whether it does, but it doesn't generally overfit too badly, 00:37:21.400 |
and it's very, very easy to make to stop it from overfitting. 00:37:25.600 |
You don't need -- and we'll talk more about this -- you don't need a separate validation 00:37:30.300 |
It can tell you how well it generalizes, even if you only have one dataset. 00:37:38.340 |
It doesn't assume that your data is normally distributed. 00:37:41.560 |
It doesn't assume that the relationships are linear. 00:37:44.280 |
It doesn't assume that you've just specified the interactions. 00:37:48.320 |
It requires very few pieces of feature engineering for many different types of situations. 00:37:55.480 |
You don't have to take the log of the data, you don't have to multiply interactions together. 00:37:59.360 |
So in other words, it's a great place to start. 00:38:03.480 |
If your first random forest does very little useful, then that's a sign that there might be a problem with your data. 00:38:12.960 |
Can you please throw it at or towards this gentleman? 00:38:16.040 |
What about the curse of dimensionality when you're using random forests? 00:38:22.120 |
So there's this concept of curse of dimensionality. 00:38:25.040 |
In fact there's two concepts I'll touch on, curse of dimensionality and the no-free lunch 00:38:30.580 |
These are two concepts you'll often hear a lot about. 00:38:34.680 |
They're both largely meaningless and basically stupid, and yet I would say maybe the majority 00:38:42.600 |
of people in the field not only don't know that but think the opposite. 00:38:48.640 |
The curse of dimensionality is this idea that the more columns you have, it basically creates 00:38:56.840 |
And there's this kind of fascinating mathematical idea which is the more dimensions you have, 00:39:02.920 |
the more all of the points sit on the edge of that space. 00:39:06.360 |
So if you've just got a single dimension where things are like random, then they're spread 00:39:12.640 |
Whereas if it's a square, then the probability that they're in the middle means that they 00:39:17.200 |
can't have been on the edge of either dimension, so it's a little bit less likely that they're 00:39:22.840 |
Each dimension you add, it becomes multiplicatively less likely that the point isn't on the edge 00:39:30.380 |
And so basically in higher dimensions, everything sits on the edge. 00:39:34.200 |
And what that means in theory is that the distance between points is much less meaningful. 00:39:39.880 |
And so if we assume that somehow that matters, then it would suggest that when you've got 00:39:44.720 |
lots and lots of columns and you just use them without being very careful to remove 00:39:50.560 |
the ones you don't care about, that somehow things won't work. 00:40:01.200 |
One is that the points still do have different distances away from each other. 00:40:06.120 |
Just because they're on the edge, they still do vary in how far away they are from each 00:40:10.880 |
And so this point is more similar to this point than it is to that point. 00:40:13.920 |
So even things we'll learn about k-nearest neighbors actually work really well, really 00:40:18.600 |
really well in high dimensions despite what the theoreticians claimed. 00:40:22.880 |
And what really happened here was that in the 90s, theory totally took over machine 00:40:31.440 |
And so particularly there was this concept of these things called support vector machines 00:40:34.600 |
that were theoretically very well justified, extremely easy to analyze mathematically, 00:40:39.920 |
and you could kind of prove things about them. 00:40:43.040 |
And we kind of lost a decade of real practical development in my opinion. 00:40:47.080 |
And all these theories became very popular like the curse of dimensionality. 00:40:52.080 |
Nowadays, and a lot of theoreticians hate this, the world of machine learning has become 00:40:58.320 |
very empirical, which is like which techniques actually work. 00:41:01.480 |
And it turns out that in practice, building models on lots and lots of columns works really 00:41:09.000 |
So the other thing to quickly mention is the no free lunch theorem. 00:41:13.000 |
There's a mathematical theorem by that name that you will often hear about that claims 00:41:17.840 |
that there is no type of model that works well for any kind of dataset. 00:41:26.440 |
Which is true, and is obviously true if you think about it, in the mathematical sense, 00:41:32.320 |
any random dataset, by definition it's random. 00:41:36.000 |
So there isn't going to be some way of looking at every possible random dataset that's in 00:41:39.960 |
some way more useful than any other approach. 00:41:43.000 |
In the real world, we look at data which is not random. 00:41:47.240 |
Mathematically we'd say it sits on some lower dimensional manifold, it was created by some 00:41:51.200 |
kind of causal structure, there are some relationships in there. 00:41:57.280 |
So the truth is that we're not using random datasets. 00:42:00.840 |
And so the truth is, in the real world, there are actually techniques that work much better 00:42:06.120 |
than other techniques for nearly all of the datasets you look at. 00:42:10.560 |
And nowadays there are empirical researchers who spend a lot of time studying which techniques actually work best. 00:42:20.180 |
And ensembles of decision trees, of which random forests are one, is perhaps the technique that most often comes out on top. 00:42:29.880 |
And that is despite the fact that until the library that we're showing you today, fastai, 00:42:34.880 |
came along, there wasn't really any standard way to pre-process the data properly and to properly set the parameters. 00:42:46.880 |
So yeah, I think this is where the difference between theory and practice is huge. 00:42:55.000 |
So when I try to create a random forest regressor, what is that? 00:43:00.440 |
OK, it's part of something called sklearn. sklearn is scikit-learn. 00:43:05.880 |
It is by far the most popular and important package for machine learning in Python. 00:43:13.220 |
It's not the best at nearly everything, but it's perfectly good at nearly everything. 00:43:18.600 |
So you might find in the next part of this course, with Yannet, you're going to look 00:43:23.040 |
at a different kind of decision tree ensemble called gradient boosting trees, where actually 00:43:28.640 |
there's something called XGBoost, which is better than the gradient boosting trees in scikit-learn. 00:43:35.320 |
But it's pretty good at everything, so I'm really going to focus on scikit-learn. 00:43:41.440 |
Random forest, you can do two kinds of things with a random forest. 00:43:58.520 |
So you can hit tab in Jupyter Notebook to get tab completion for anything that's in 00:44:05.320 |
You'll see that there's also a random forest classifier. 00:44:09.000 |
So in general, there's an important distinction between things which can predict continuous 00:44:14.720 |
variables, and that's called regression, and therefore a method for doing that would be 00:44:19.440 |
a regressor, and things that predict categorical variables, and that is called classification, 00:44:27.480 |
and the things that do that are called classifiers. 00:44:30.640 |
So in our case, we're trying to predict a continuous variable price. 00:44:34.440 |
So therefore we are doing regression, and therefore we need a regressor. 00:44:39.840 |
A lot of people incorrectly use the word regression to refer to linear regression, which is just one kind of regression. 00:44:48.560 |
Regression means a machine learning model that's trying to predict some kind of continuous output. 00:44:57.300 |
So pretty much everything in scikit-learn has the same form. 00:45:00.300 |
You first of all create an instance of an object for the machine learning model you want. 00:45:04.760 |
You then call fit, passing in the independent variables, the things you want to use to predict, 00:45:11.360 |
and the dependent variable, the thing that you want to predict. 00:45:13.920 |
So in our case, the dependent variable is the data frame's sale price column, and so 00:45:24.640 |
the thing we want to use to predict is everything except that. 00:45:28.000 |
In pandas, the drop method returns a new data frame with a list of columns removed. 00:45:40.800 |
So this here is the data frame containing everything except for sale price. 00:46:00.560 |
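So the first attempt looks roughly like this; as we'll see in a moment, it fails on this raw data, which is the point of the next section:

```python
# RandomForestRegressor comes in with the sklearn imports at the top of the notebook
m = RandomForestRegressor(n_jobs=-1)
# axis=1 means drop a column (not a row); everything except SalePrice is our input
m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)
```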
So to find out, I could hit shift+tab, and that will bring up a quick inspection of the 00:46:09.160 |
In this case, it doesn't quite tell me what I want. 00:46:12.140 |
So if I hit shift+tab twice, it gives me a bit more information. 00:46:17.000 |
Ah yes, and that tells me it's a single label or list-like. 00:46:24.800 |
By the way, if I hit three times, it will give me a whole little window at the bottom. 00:46:33.440 |
Another way of doing that, of course, which we learned, would be question mark, question mark, drop. 00:46:45.040 |
Two question marks give you the source code for it, and a single question mark is the documentation. 00:46:54.480 |
So I think that trick of tab complete, shift+tab parameters, question mark and double question 00:47:00.560 |
mark for the docs and the source code, if you know nothing else about using Python libraries, 00:47:07.040 |
know that because now you know how to find out everything else. 00:47:18.000 |
So anytime you get a stack trace like this, so an error, the trick is to go to the bottom 00:47:24.400 |
because the bottom tells you what went wrong. 00:47:26.480 |
Above it, it tells you the whole chain of functions that called each other to get there. 00:47:31.760 |
Could not convert string to float: 'Conventional'. 00:47:35.760 |
So there was a value inside my dataset, 'Conventional', and it didn't know how to create a model using that string. 00:47:49.120 |
We have to pass numbers to most machine learning models, and certainly to random forests. 00:47:56.880 |
So step one is to convert everything into numbers. 00:48:02.460 |
So our dataset contains both continuous variables, so numbers where the meaning is numeric, like 00:48:09.480 |
price, and it contains categorical variables which could either be numbers where the meaning 00:48:17.840 |
is not continuous, like zip code, or it could be a string, like large, small, and medium. 00:48:28.680 |
We want to basically get to a point where we have a dataset where we can use all of 00:48:34.000 |
So they have to all be numeric, and they have to be usable in some way. 00:48:37.420 |
So one issue is that we've got something called saledate, which you might remember right 00:48:44.360 |
at the top we told it is a date, so it's been parsed as a date, and so you 00:48:49.320 |
can see here its data type, dtype (a very important thing), is datetime64. 00:49:00.360 |
And this is actually where we need to do our first piece of feature engineering. 00:49:08.920 |
So since you've got the catch box, can you tell me what are some of the interesting bits 00:49:15.360 |
Well you can see like a time series pattern, I guess. 00:49:24.140 |
What are some columns that we could pull out of this? 00:49:30.260 |
The date as in... what could we pull out of it as a number? Year, month, quarter... do you want to pass it on? 00:49:38.920 |
Just pass it to your right, you've got some more columns for us? 00:50:00.400 |
I'll give you a few more that you might want to think about would be like, is it a holiday? 00:50:15.920 |
It depends a bit on what you're doing, right? 00:50:18.040 |
So like if you're predicting soda sales in SoMa, you would probably want to know whether there was a baseball game on that day. 00:50:29.120 |
So what's in a date is one of the most important pieces of feature engineering you 00:50:33.200 |
can do, and no machine learning algorithm can tell you whether the Giants were playing that day and that it mattered. 00:50:41.920 |
So this is where you need to do feature engineering. 00:50:44.680 |
So I do as many things automatically as I can for you. 00:50:51.920 |
So here I've got something called add_datepart. 00:51:09.120 |
You'll find most of my functions are less than half a page of code. 00:51:15.080 |
So often rather than having docs, I'm going to try to add docs over time, but they're 00:51:20.640 |
designed that you can understand them by reading the code. 00:51:23.000 |
So we're passing in a data frame, and the name of some field, which in this case was 00:51:27.820 |
saledate, and so in this case we can't go df.fieldname, because that would look for a column literally called "fieldname". 00:51:38.400 |
So df[fieldname] is how we grab a column when that column's name is stored in this variable. 00:51:44.800 |
So we've now got the field itself, the series. 00:51:48.580 |
And so what we're going to do is we're going to go through all of these different strings, 00:51:54.200 |
and this is a piece of Python which actually looks inside an object and finds an attribute with that name. 00:52:02.320 |
So this is going to go through, and again you can Google for Python getattr, it's 00:52:06.360 |
a cool little advanced technique, but this is going to go through and it's going to find 00:52:10.840 |
for this field it's going to find its year attribute. 00:52:16.960 |
Now Pandas has got this interesting idea. Let's look inside: field = df_raw.saledate, 00:52:22.760 |
this is the kind of experiment I want you to do, play around. 00:52:26.800 |
So I've now got that in a field object, and so I can go field.<Tab>, and year isn't in the list. 00:52:41.480 |
Well that's because year is only going to apply to Pandas series that are datetime objects. 00:52:47.920 |
So what Pandas does is it splits out the methods that are specific to particular types into separate attributes. 00:52:55.020 |
So datetime objects will have a dt attribute defined, and that is where you'll find all the date and time parts. 00:53:07.200 |
So what I went through was I went through all of these and picked out all of the ones 00:53:11.040 |
that could ever be interesting for any reason. 00:53:14.120 |
And this is like the opposite of the curse of dimensionality. 00:53:17.040 |
It's like if there is any column or any variant of that column that could ever be interesting 00:53:22.000 |
at all, add that to your data set and every variation of it you can think of. 00:53:27.280 |
There's no harm in adding more columns nearly all the time. 00:53:31.920 |
So in this case we're going to go ahead and add all of these different attributes. 00:53:37.360 |
And so for every one I'm going to create a new field that's going to be called the name 00:53:45.040 |
of your field with the word "date" removed, so it will be "sale", and then the name of the date part. 00:53:51.440 |
So we're going to get a sale year, sale month, sale week, sale day, etc., etc. 00:53:56.560 |
And then at the very end I'm going to remove the original field. 00:54:01.440 |
Because remember we can't use "sale date" directly because it's not a number. 00:54:07.480 |
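A simplified sketch of the idea behind add_datepart (the real fastai version handles more attributes, such as week and quarter boundaries, and more edge cases; np comes in with the earlier imports):

```python
def add_datepart(df, fldname):
    # simplified sketch: pull numeric/boolean parts out of a datetime column
    fld = df[fldname]
    targ_pre = fldname.replace('date', '').replace('Date', '')   # "saledate" -> "sale"
    for n in ('Year', 'Month', 'Day', 'Dayofweek', 'Dayofyear',
              'Is_month_end', 'Is_month_start'):
        df[targ_pre + n] = getattr(fld.dt, n.lower())             # e.g. fld.dt.year
    df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9    # seconds since the epoch
    df.drop(fldname, axis=1, inplace=True)

add_datepart(df_raw, 'saledate')
```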
So you're saying this only worked because it was a date type? 00:54:14.800 |
Did you make it a date type or was it already saved as one in the original? 00:54:20.480 |
And the reason it was a date type is because when we imported it, we said parse_dates with that column listed. 00:54:31.100 |
So as long as it looks date-ish and we tell it to parse it as a date, it'll turn it into a date type. 00:54:38.000 |
Was there a way to do that so it would just look through all the columns and say "if it 00:54:41.480 |
looks like a date, make it a date" or do you have to know which one? 00:54:46.260 |
I think there might be but for some reason it wasn't ideal. 00:54:49.760 |
Maybe it took lots of time or it didn't always work or for some reason I had to list it here. 00:54:56.280 |
I would suggest checking out the docs for pandas.read_csv and maybe on the forum you 00:55:01.360 |
can tell us what you find because I can't remember offhand. 00:55:12.680 |
Let's do that one on the same forum thread that Savannah creates because I think it's 00:55:22.960 |
a reasonably advanced question, but generally speaking the time zone in a properly formatted 00:55:28.760 |
date will be included in the string and it should pull it out correctly and turn it into 00:55:36.960 |
Generally speaking, it should handle it for you. 00:55:42.280 |
So a question: for indexing a column, should you simply use the dot, or the square brackets? 00:55:53.940 |
The square brackets one is safer, particularly if you're assigning to a column. 00:55:58.980 |
If it didn't already exist, you need to use the square brackets format, otherwise you'll 00:56:04.640 |
So the square brackets format is safer, the dot version saves me a couple of keystrokes 00:56:13.440 |
In this particular case, because I wanted to grab something that had something inside 00:56:22.120 |
it, wasn't the name itself, I have to use square brackets. 00:56:25.920 |
So square brackets is going to be your safe bet if in doubt. 00:56:32.080 |
So after I run that, you'll notice that df_raw.columns gives me a list of all of the columns just 00:56:44.160 |
as strings, and at the end, there they all are. 00:56:47.520 |
So it's removed sale date and it's added all those. 00:56:53.880 |
The other problem is that we've got a whole bunch of strings in there. 00:57:18.000 |
So pandas actually has a concept of a category data type, but by default it doesn't turn anything into a category for you. 00:57:26.160 |
So I've created something called train_cats, which creates categorical variables for everything that's a string. 00:57:37.120 |
And so what that's going to do is, behind the scenes, it's going to create a column that's 00:57:40.920 |
actually a number, an integer, and it's going to store a mapping from the integers to the strings. 00:57:50.200 |
The reason it's train_cats is that you use this for the training set. 00:57:53.760 |
More advanced usage is that when we get to looking at the test and validation sets, this 00:58:01.520 |
In fact Terrence came to me the other day and he said, "My model's not working. 00:58:08.120 |
It turned out the reason why was because the mappings he was using from string to number 00:58:12.680 |
in the training set were different to the mappings he was using from string to number 00:58:18.080 |
So therefore in the training set, High might have been 3, but in the test set it might have been 2. 00:58:25.160 |
So the two were totally different, and so the model was basically non-predictive. 00:58:30.920 |
So I have another function called apply_cats, where you can pass in your existing training 00:58:39.520 |
set and it will use the same mappings to make sure your test set or validation set uses 00:58:46.880 |
So when I go train_cats, it's actually not going to make the data frame look different 00:58:52.920 |
Behind the scenes it's going to turn them all into numbers. 00:59:04.480 |
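Usage is roughly (the test data frame here is hypothetical):

```python
train_cats(df_raw)   # fastai helper: turn every string column into a pandas categorical, in place
# Later, for a validation or test set, reuse the training set's string-to-number mappings:
# apply_cats(df_test, df_raw)
```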
Let's see how we go, I'll try and finish on time. 00:59:11.960 |
So you'll see now, remember I mentioned there was this .dt attribute that gives you access 00:59:16.760 |
to everything, assuming it's a datetime; there's a .cat attribute that gives 00:59:21.240 |
you access to things assuming something's a category. 00:59:24.960 |
And so usage_band was a string, and now that I've run train_cats, it's been turned into 00:59:29.720 |
a category, so I can go df_raw.UsageBand.cat and there's a whole bunch of things available there. 00:59:41.040 |
So one of the things we've got there is .categories, and you can see here is the list. 00:59:46.400 |
Now one of the things you might notice is that this list is in a bit of a weird order, 00:59:52.640 |
The truth is, it doesn't matter too much, but what's going to happen when we use the 00:59:57.240 |
random forest is this is going to be 0, this is going to be 1, this is going to be 2, and 01:00:04.440 |
And so we're going to have a decision tree that can split things at a single point. 01:00:08.020 |
So it'd either be high versus low and medium, or medium versus high and low. 01:00:15.760 |
It actually turns out not to work too badly, but it'll work a little bit better if you have these in a sensible order. 01:00:22.140 |
So if you want to reorder a category, then you can just go .cat.set_categories and pass in the order you want. 01:00:31.280 |
And almost every pandas method has an in-place parameter, which rather than returning a new 01:00:38.440 |
data frame, it's going to change that data frame. 01:00:42.080 |
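For this column that looks roughly like:

```python
df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)
# recent pandas versions dropped inplace here, so you would assign the result back instead:
# df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)
```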
So I'm not going to do that for everything; I didn't check that carefully for all the categories that should 01:00:45.240 |
be ordered, but this one seems like a pretty obvious one. 01:00:58.800 |
The usage_band column is what our random forest is actually going to see: these integer codes. 01:01:20.240 |
And as we're going to learn shortly, a random forest consists of a bunch of trees, each of which 01:01:24.520 |
makes a series of single splits, and a single split is going to be either greater than or less than some value. 01:01:33.700 |
So we could split it into high versus low and medium, which semantically makes sense, 01:01:40.520 |
like is it big; or we could split it into medium versus high and low, which doesn't make much sense. 01:01:48.600 |
So in practice, the decision tree could then make a second split to say medium versus high 01:01:53.640 |
and low, and then within the high and low into high and low. 01:01:56.600 |
But by putting it in a sensible order, if it wants to split out low, it can do it in 01:02:03.980 |
And we'll be learning more about this shortly. 01:02:07.760 |
It honestly is not a big deal, but I just wanted to mention it's there. 01:02:12.120 |
It's also good to know that people, when they talk about different types of categorical 01:02:16.440 |
variable, specifically you need to know there's a kind of categorical variable called ordinal. 01:02:21.400 |
And an ordinal categorical variable is one that has some kind of order, like high, medium, 01:02:28.200 |
And random forests aren't terribly sensitive to that fact, but it's worth knowing it's 01:02:44.520 |
It means you can get there with one decision rather than two. 01:02:55.640 |
So for free, we get a negative one which refers to missing. 01:03:01.280 |
And one of the things we're going to do is we're going to actually add one. 01:03:05.200 |
We're going to add one to our codes. (Catch box flying: "Let people know it's coming!") 01:03:13.680 |
So we're going to add one to all of our codes to make missing zero later on. 01:03:18.600 |
So for these categories, you're basically mapping strings to different integers? 01:03:36.960 |
So get_dummies, which we'll get to in a moment, is going to create three separate columns: 01:03:40.800 |
ones and zeros for high, ones and zeros for medium, ones and zeros for low, whereas this 01:03:44.240 |
one creates a single column with an integer, 0, 1, or 2. 01:03:47.800 |
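So the two options look roughly like:

```python
df_raw.UsageBand.cat.codes.head()        # one integer column: -1 for missing, then 0, 1, 2
pd.get_dummies(df_raw.UsageBand).head()  # three 0/1 columns, one per category level
```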
So at this point, as long as we always make sure we use .cat.codes, we've got everything we need. 01:04:07.600 |
All of our strings have been turned into numbers, our dates have been turned into a bunch of 01:04:11.040 |
numeric columns, and everything else is already a number. 01:04:16.360 |
The only other main thing we have to do is notice that we have lots of missing values. 01:04:21.520 |
So here is df_raw.isnull(), which is going to return true or false depending on whether something 01:04:28.160 |
is empty; .sum() is going to add up how many are empty for each series; and then I'm going 01:04:37.040 |
to sort them and divide by the size of the dataset. 01:04:40.960 |
So here we have some things which have quite high percentages of nulls. 01:04:50.080 |
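That cell is roughly:

```python
display_all(df_raw.isnull().sum().sort_index() / len(df_raw))   # fraction of missing values per column
```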
So missing values, we call them in display_all, maybe I didn't run it. 01:05:08.280 |
So we're going to get to that in a moment, but I will point something out, which is reading 01:05:13.160 |
the CSV took a minute or so, the processing took another 10 seconds or so, from time to 01:05:19.120 |
time when I've done a little bit of work I don't want to wait for again, I will tend 01:05:24.880 |
And I'm going to save it in a format called feather format; this is very, very new. 01:05:29.400 |
But what this is going to do is it's going to save it to disk in basically the same format that it sits in in memory. 01:05:35.560 |
This is by far the fastest way to save something, and the fastest way to read it back. 01:05:39.920 |
So most of the folks you deal with, unless they're on the cutting edge, won't be familiar 01:05:44.720 |
with this format, so this will be something you can teach them about. 01:05:49.560 |
It's actually becoming something that's going to be used not just in pandas, but in Java, 01:05:56.560 |
in Spark, in lots of things for communicating across computers because it's incredibly fast. 01:06:04.040 |
And it's actually co-designed by the guy that made pandas by Wes McKinney. 01:06:08.200 |
So we can just go df_raw.to_feather and pass in some name. 01:06:14.480 |
I tend to have a folder called tmp for all of my "as I'm going along" stuff. 01:06:21.800 |
And so when you go os.makedirs, you can pass in any path you like here. 01:06:26.880 |
It won't complain if it's already there, as long as you say exist_ok=True. 01:06:30.400 |
If there are some subdirectories needed, it'll create them for you, so this is a super handy little function to know about. 01:06:38.040 |
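So the save (and later the re-load) looks roughly like this (os and pd come in with the earlier imports):

```python
os.makedirs('tmp', exist_ok=True)          # create the folder (and any parents) if needed
df_raw.to_feather('tmp/bulldozers-raw')    # fast binary save, same layout as in memory
# later, to pick up where you left off:
# df_raw = pd.read_feather('tmp/bulldozers-raw')
```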
So it's not installed, because I'm using Crestle for the first time. 01:06:46.740 |
So if you get a message that something's not installed, if you're using Anaconda, you can conda install it. 01:06:53.760 |
Crestle actually doesn't use Anaconda, it uses pip, so we pip install it and wait for that to finish. 01:07:11.360 |
And so now if I run it, and so sometimes you may find you actually have to restart Jupyter. 01:07:23.600 |
So I won't do that now because we're nearly out of time; if you do restart Jupyter, you'll be fine. 01:07:28.300 |
So from now on, you don't have to rerun all the stuff above. 01:07:32.400 |
You can just say pd.read_feather and we've got our data frame back. 01:07:38.140 |
So the last step we're going to do is to actually replace the strings with the numeric codes. 01:07:47.760 |
And we're going to pull out the dependent variable, sale price, into a separate variable. 01:07:53.560 |
And we're going to also handle missing continuous values. 01:07:59.840 |
So you'll see here we've got a function called proc_df. 01:08:22.800 |
So quite a lot of the functions have a few additional parameters that you can provide, 01:08:27.000 |
and we'll talk about them later, but basically we're providing the data frame to process 01:08:30.720 |
and the name of the dependent variable, the y field name. 01:08:35.740 |
And so all it's going to do is it's going to make a copy of the data frame, it's going 01:08:41.040 |
to grab the y value, it's going to drop the dependent variable from the original, and 01:08:58.680 |
So what we do to fix missing is pretty simple. 01:09:03.080 |
If it's numeric, then we fix it by basically saying, let's first of all check whether it has any missing values. 01:09:10.560 |
So if it does have some missing values, in other words the isnull().sum() is non-zero, 01:09:16.600 |
then we're going to create a new column with the same name as the original plus _na, and 01:09:22.520 |
it's going to be a boolean column with a 1 any time that was missing, and a 0 any time 01:09:28.760 |
We're going to talk about this again next week, but I'll give you the quick version. 01:09:33.280 |
Having done that, we're then going to replace the n_a's, the missing, with the median. 01:09:39.320 |
So anywhere that used to be missing will be replaced with the median, and we'll add a 01:09:42.600 |
new column to tell us which ones were missing. 01:09:46.560 |
We only do that for numeric, we don't need it for categories because Pandas handles categorical 01:09:51.360 |
variables automatically by setting them to -1. 01:09:55.800 |
So what we're going to do is if it's not numeric, and it's a categorical type (we'll talk about 01:10:07.520 |
the maximum number of categories later, but let's assume this is always true, so if it's 01:10:10.720 |
not a numeric type) we're going to replace the column with its codes, the integers, +1. 01:10:18.320 |
So by default Pandas uses -1 for missing, so now 0 will be missing, and 1, 2, 3, 4 will 01:10:32.960 |
So we're going to talk about dummies later on in the course, but basically optionally 01:10:37.800 |
you can say that if you already know about dummy values, they're columns with a small 01:10:41.200 |
number of possible values, you can turn into dummies instead if you're numericalizing them, 01:10:48.120 |
So for now all we're doing is we're using the categorical codes +1, replacing missing 01:10:53.000 |
values with the median, adding an additional column, telling us which ones were replaced, 01:11:02.360 |
So that's what proc_df does, and it runs very quickly. 01:11:07.240 |
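Usage looks roughly like this (depending on the fastai version, proc_df may also return a dict of the medians it used for filling):

```python
df, y = proc_df(df_raw, 'SalePrice')   # numericalize categories, fill missing values, split off the target
```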
So you'll see now, sale price is no longer here. 01:11:11.680 |
We've now got a whole new variable called y that contains sale price. 01:11:17.140 |
You'll see we've got a couple of extra _na columns at the end. 01:11:22.960 |
And if I look at that, everything is a number. 01:11:34.880 |
These Booleans are treated as numbers, they're just considered as 0 or 1, they're just displayed 01:11:42.120 |
So you can see here, is it the end of a month, is it the start of a month, is it the end 01:11:49.920 |
It's kind of funny, right, because we've got things like a model ID, which presumably is 01:11:53.840 |
something like a serial number, or it could be like the model identifier that's created 01:11:58.040 |
by the factory, or something, we've got like a data source ID. 01:12:01.360 |
Some of these are numbers, but they're not continuous. 01:12:05.040 |
It turns out actually random forests work fine with those. 01:12:08.780 |
We'll talk about why and how and a lot about that in detail, but for now all you need to 01:12:14.780 |
So as long as this is all numbers, which it now is, we can now go ahead and create a random 01:12:21.100 |
So m = RandomForestRegressor(n_jobs=-1); random forests are trivially parallelizable. 01:12:27.660 |
So what that means is that if you've got more than one CPU, which basically everybody will 01:12:33.200 |
have on their computers at home, and if you've got a t2.medium or bigger at AWS, you've got several CPUs. 01:12:41.360 |
Trivially parallelizable means that it will split up the work across your different CPUs and run it in parallel. 01:12:48.560 |
So the more CPUs you have, pretty much it will divide the time it takes by that number. 01:12:56.200 |
So n_jobs=-1 tells the random forest regressor to create a separate job, a separate process 01:13:02.800 |
basically, for each CPU you have, so that's pretty much what you want all the time. 01:13:09.060 |
We fit the model using this new data frame we created and the y values we pulled out, and then we get its score. 01:13:15.560 |
The score is going to be the R^2; we'll define that next week, and hopefully some of you already know it. 01:13:20.280 |
1 is very good, 0 is very bad, so as you can see we've immediately got a very high score. 01:13:28.320 |
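That whole first model is roughly just:

```python
m = RandomForestRegressor(n_jobs=-1)   # one job per CPU core
m.fit(df, y)
m.score(df, y)                         # R^2, measured on the data it was trained on, so optimistic
```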
So that looks great, but what we'll talk about next week a lot more is that it's not quite 01:13:35.080 |
great because maybe we had data that had points that looked like this, and we fitted a line 01:13:41.400 |
that looks like this, when actually we wanted one that looks like that. 01:13:46.200 |
The only way to know whether we've actually done a good job is by having some other dataset 01:13:54.320 |
Now we're going to learn about some ways with random forests we can kind of get away without 01:13:57.920 |
even having that other dataset, but for now what we're going to do is we're going to split 01:14:03.800 |
into 12,000 rows which we're going to put in a separate dataset called the validation 01:14:09.400 |
set versus the training set that's going to contain everything else. 01:14:15.040 |
And our dataset is going to be sorted by date, and so that means that the most recent 12,000 01:14:23.680 |
Again, we'll talk more about this next week, it's a really important idea, but for now 01:14:28.640 |
we can just recognize that if we do that and run it, I've created a little thing called 01:14:33.720 |
print_score, and it's going to print out the root mean squared error between the predictions 01:14:38.720 |
and actuals for the training set and the validation set, and the R^2 for the training set and the validation set. 01:14:46.240 |
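The split and the scoring helper look roughly like this (math comes in with the earlier imports):

```python
def split_vals(a, n):
    return a[:n].copy(), a[n:].copy()

n_valid = 12000                      # same size as Kaggle's test set for this competition
n_trn = len(df) - n_valid
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

def rmse(x, y):
    return math.sqrt(((x - y) ** 2).mean())

def print_score(m):
    print([rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)])
```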
And you'll see that actually the R^2 for the training set was 0.98, but for the validation set it was quite a bit lower. 01:14:52.680 |
Then the RMSE, and remember this is on the logs, was 0.09 for the training set and 0.25 for the validation set. 01:15:02.280 |
Now if you actually go to Kaggle and go to the leaderboard, in fact let's do it right 01:15:06.880 |
now: it's got private and public, I'll click on public leaderboard, and we can go down and find 0.25. 01:15:17.720 |
So there are 475 teams, and generally speaking, if you're in the top half of a Kaggle competition, you're doing pretty well. 01:15:29.800 |
So 0.25, here we are: what was it exactly? 0.2507; yeah, that's about 110th place. 01:15:47.540 |
So the idea, this is pretty cool: with no thinking at all, using the defaults of everything, we're already around the top 25%. 01:15:58.080 |
So random forests are insanely powerful, and this totally standardized process is insanely 01:16:09.240 |
So we're going to wrap up, what I'm going to ask you to do for Tuesday is take as many 01:16:17.840 |
Kaggle competitions as you can, whether they be running now or old ones or datasets that 01:16:22.760 |
you're interested in for hobbies or work, and please try it. 01:16:29.160 |
And if it doesn't work, tell us on the forum, here's the dataset I'm using, here's where 01:16:34.160 |
I got it from, here's the stack trace of where I got an error, or here's if you use my print 01:16:41.640 |
score function or something like it, show us what the training versus test set looks 01:16:49.400 |
But what I'm hoping we'll find is that all of you will be pleasantly surprised that with 01:16:54.080 |
an hour or two of information you've got today, you can already get better models than most 01:17:03.240 |
of the very serious practicing data scientists that compete in Kaggle competitions. 01:17:12.040 |
Oh, one more thing, Friday, the other class said a lot of them had class during my office 01:17:18.840 |
hours, so if I made them 1-3 instead of 2-4 on Fridays, is that okay? 01:17:30.680 |
All right, I will talk to somebody who actually knows what they're doing, unlike me, about