
Intro to Machine Learning: Lesson 1


Chapters

0:00 Intro
8:05 Importing Libraries
12:05 Kaggle Competitions
16:00 Downloading Data
21:25 Installing Data
23:28 Running Paths
25:13 Structured Data
28:18 Read CSV
34:08 Evaluation
36:23 Random Forest
42:55 SKlearn
47:11 Stack trace
50:46 Feature engineering
56:16 Categorical variables

Whisper Transcript

00:00:00.000 | Let me introduce everybody to everybody else first of all.
00:00:11.600 | We're here at the University of San Francisco learning machine learning or you might be
00:00:15.600 | at home watching this on video.
00:00:18.000 | Everybody wave.
00:00:22.280 | Here are the University of San Francisco graduate students.
00:00:22.280 | Thank you everybody and wave back from the future and from home to all the students here.
00:00:29.040 | If you're watching this on YouTube, please stop and instead go to course.fast.ai and
00:00:38.320 | watch it from there instead.
00:00:40.760 | There's nothing wrong with YouTube but I can't edit these videos after I've created them,
00:00:47.160 | so I need to be able to give you updated information about what environments to use, how the technology
00:00:54.400 | changes and so you need to go here.
00:00:57.080 | So you can also watch the lessons from here, here's lots of lessons and so forth.
00:01:05.280 | So that's tip number one for the video.
00:01:08.800 | Tip number two for the video is because I can't edit them, all I can do is add these
00:01:13.120 | things called cards and cards are little things that appear in the top right-hand corner of
00:01:18.720 | the screen.
00:01:19.720 | So by the time this video comes out, I'm going to put a little card there right now for you
00:01:23.800 | to click on and try that out.
00:01:26.080 | Unfortunately they're not easy to notice, so keep an eye out for them because they're going
00:01:29.800 | to contain important updates to the video.
00:01:33.900 | So welcome, we're going to be learning about machine learning today.
00:01:40.080 | And so for everybody in the class here, you all have Amazon Web Services set up, so you
00:01:45.200 | might want to go ahead and launch your AWS instance now or go ahead and launch your Jupyter
00:01:52.920 | notebook on your own computer.
00:01:56.340 | If you don't have Jupyter notebook set up, then what I recommend is you go to Crestle,
00:02:03.560 | www.crestle.com, sign up there, and you can then turn off Enable GPU and click
00:02:14.280 | Start Jupyter and you'll have a Jupyter notebook instantly.
00:02:17.980 | That costs you some money, it's 3 cents an hour.
00:02:22.040 | So if you don't mind spending 3 cents an hour to learn machine learning, here's a good way.
00:02:25.840 | So I'm going to go ahead and say Start Jupyter.
00:02:29.360 | And so whatever technique you use, there you go.
00:02:32.760 | One of the things that you'll find on the website is links to lots of information about
00:02:38.560 | the costs and benefits and approaches to setting up lots of different environments for Jupyter
00:02:43.160 | notebook, both for deep learning and for regular machine learning.
00:02:47.800 | So check them out because there's lots of options.
00:02:52.200 | So if I then open Jupyter in a new tab, here I am in Crestle or on AWS or your own computer.
00:03:04.260 | We use the Anaconda Python distribution for basically everything.
00:03:09.200 | You can install that yourself.
00:03:10.440 | And again, there's lots of information on the website about how to set that up.
00:03:17.160 | We're also assuming that either you're using Crestle or there's something else which I
00:03:22.600 | really like called paperspace.com, which is another place you can fire up a Jupyter
00:03:27.360 | notebook pretty much instantly.
00:03:29.900 | Both of these already have all of the fastai stuff pre-installed for you.
00:03:36.560 | So as soon as you open up Crestle or Paperspace, assuming you chose the Paperspace fastai template,
00:03:42.760 | you'll see that there's a fastai folder.
00:03:46.640 | If you are using your own computer or AWS, you'll need to go to our GitHub repo, fastai
00:03:55.440 | and clone it.
00:03:57.560 | And then you'll need to do a conda update to install the libraries, and again, that's all
00:04:03.720 | information we've got on the website, and we've got some previous workshop videos to
00:04:07.280 | help you through all those steps.
00:04:09.560 | So for this class, I'm assuming that you have a Jupyter notebook running.
00:04:18.380 | So here we are in the Jupyter notebook, and if I click on fastai, that's what you get
00:04:25.640 | if you git clone or if you're on Crestle, you can see our repo here.
00:04:32.120 | All of our lessons are inside the courses folder, and the machine learning part 1 is
00:04:42.080 | in the ml1 folder.
00:04:45.000 | If you're ever looking at my screen and wondering where are you, look up here and you'll see
00:04:50.900 | it tells you the path, fastai/courses/ml1.
00:04:57.440 | And today we're going to be looking at lesson 1, random forests.
00:05:01.080 | So here is lesson 1, RF.
00:05:16.400 | So there's a couple of different ways you can do this, both here in person or on the
00:05:21.320 | video.
00:05:22.320 | You can either attempt to follow along as you watch, or you can just watch and then
00:05:28.120 | follow along later with the video.
00:05:31.200 | It's up to you, I would maybe have a loose recommendation to watch now and follow along
00:05:41.480 | with the video later just because it's quite hard to multitask, and if you're working on
00:05:47.640 | something you might miss a key piece of information which you're welcome to ask about.
00:05:53.640 | But if you follow along with the video afterwards, then you can pause, stop, experiment and so
00:06:00.200 | forth.
00:06:01.200 | But anyway, you can choose either way.
00:06:03.120 | I'm going to go view, toggle header, view, toggle toolbar, and then full screen it so
00:06:11.560 | it will get a bit more space.
00:06:17.520 | So the basic approach we're going to be taking here is to get straight into code, start building
00:06:24.680 | models, not to look at theory.
00:06:29.160 | We're going to get to all the theory, but at the point where you deeply understand what
00:06:33.520 | it's for and at the point that you're able to be an effective practitioner.
00:06:39.480 | So my hope is that you're going to spend your time focusing on experimenting.
00:06:44.300 | So if you take these notebooks and try different variations of what I show you, try it with
00:06:49.680 | your own datasets, the more coding you can do, the better, the more you'll learn.
00:06:57.000 | My suggestion, or at least all of my students have told me, the ones who have gone away
00:07:00.600 | and spent time studying books of theory rather than coding, found that they learned less
00:07:07.540 | machine learning, and they often tell me they wish they'd spent more time coding.
00:07:14.880 | The stuff that we're showing in this course, a lot of it's never been shown before.
00:07:18.400 | This is not a summary of other people's research.
00:07:22.000 | This is more a summary of 25 years of work that I've been doing in machine learning.
00:07:27.080 | So a lot of this is going to be shown for the first time.
00:07:30.120 | And so that's kind of cool because if you want to write a blog post about something
00:07:33.040 | that you learn here, you might be building something that a lot of people find super
00:07:38.960 | useful.
00:07:39.960 | There's a great opportunity to practice your technical writing, and here's some examples
00:07:42.840 | of good technical writing, by showing people stuff.
00:07:47.240 | It's not like, "Hey, I just learned this thing, I bet you all know it."
00:07:50.120 | Often it will be, "I just learned this thing and I'm going to tell you about it and other
00:07:53.480 | people haven't seen it."
00:07:55.320 | In fact, this is the first course ever that's been built on top of the fast AI library,
00:08:00.840 | so even just stuff in the library is going to be new to everybody.
00:08:07.400 | When we use Jupyter Notebook or anything else in Python, we have to import the libraries
00:08:13.540 | that we're going to use.
00:08:16.760 | Something that's quite convenient is if you use these two auto-reload commands at the
00:08:20.560 | top of your notebook, you can go in and edit the source code of the modules and your notebook
00:08:26.560 | will automatically update with those new modules.
00:08:29.320 | You won't have to restart anything, so that's super handy.
00:08:32.800 | Then to show your plots inside the notebook, you'll want matplotlib inline.
00:08:37.300 | These three lines appear at the top of all of my notebooks.
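For reference, the three lines in question are the standard Jupyter magics:

    %load_ext autoreload
    %autoreload 2
    %matplotlib inline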
00:08:44.120 | You'll notice when I import the libraries that for anybody here who is an experienced
00:08:48.360 | Python programmer, I am doing something that would be widely considered very inappropriate.
00:08:53.760 | I'm importing star.
00:08:56.720 | Generally speaking in software engineering, we're taught to specifically figure out what
00:09:00.480 | we need and import those things.
00:09:05.640 | The more experienced you are as a Python programmer, the more extremely offensive practices you're
00:09:10.200 | going to see me use.
00:09:11.400 | For example, I don't follow what's called PEP8, which is the normal style of code used
00:09:17.600 | in Python.
00:09:18.600 | I'm going to mention a couple of things.
00:09:21.080 | First is go along with it for a while, don't judge me just yet.
00:09:25.880 | There's reasons that I do these things, and if it really bothers you, then feel free to
00:09:30.720 | change it.
00:09:31.720 | But the basic idea is data science is not software engineering.
00:09:37.040 | There's a lot of overlap.
00:09:38.040 | We're using the same languages, and in the end these things may become software engineering
00:09:45.000 | projects.
00:09:46.000 | But what we're doing right now is we're prototyping models.
00:09:49.680 | Prototyping models has a very different set of best practices that are taught basically
00:09:54.880 | nowhere.
00:09:55.880 | They're not even really written down.
00:09:58.240 | But the key is to be able to do things very interactively and very iteratively.
00:10:03.480 | So for example, from library import star means you don't have to figure out ahead of time
00:10:09.640 | what you're going to need from that library, it's all there.
00:10:14.020 | Also because we're in this wonderful interactive Jupyter environment, it lets us understand
00:10:21.960 | what's in the libraries really well.
00:10:23.840 | So for example, later on I'm using a function called display.
00:10:30.740 | So an obvious question is, what is display?
00:10:34.000 | So you can just type the name of a function and press shift enter, remember shift enter
00:10:39.640 | is to run a cell, and it will tell you where it's from.
00:10:43.680 | So anytime you see a function you're not familiar with, you can find out where it's from.
00:10:49.420 | And then if you want to find out what it does, put a question mark at the start.
00:10:58.280 | And here you have the documentation.
00:11:01.720 | And then, particularly helpful for the FastAI library, I try to make as many functions as
00:11:07.960 | possible be no more than about five lines of code, it's going to be really easy to read.
00:11:14.320 | If you put a second question mark at the start, it shows you the source code of the function.
00:11:25.720 | Right so all the documentation plus the source code, so you can see nothing has to be mysterious.
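Concretely, that introspection workflow in a Jupyter cell looks like this (display is the IPython function being inspected here):

    display       # shift-enter: shows <function display at 0x...>, i.e. where it's from
    ?display      # shows the documentation
    ??display     # shows the documentation plus the source code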
00:11:30.960 | And we're going to be using, the other library we'll use a lot is scikit-learn, which implements
00:11:36.840 | a lot of machine learning stuff in Python.
00:11:40.040 | The scikit-learn source code is often pretty readable.
00:11:44.140 | And so very often if I want to really understand something, I'll just go question mark, question
00:11:48.340 | mark, and the name of the scikit-learn function I'm typing, and I'll just go ahead and read
00:11:51.880 | the source code.
00:11:54.880 | As I say, the FastAI library in particular is designed to have source code that's very
00:12:00.120 | easy to read, and we're going to be reading it a lot.
00:12:07.200 | So today we're going to be working on a Kaggle competition called Blue Book for Bulldozers.
00:12:12.880 | So the first thing we need is to get that data.
00:12:16.040 | So if you go Kaggle, bulldozers, then you can find it.
00:12:24.520 | So Kaggle competitions allow you to download a real-world dataset, a real problem that
00:12:32.920 | somebody is trying to solve, and solve it according to a specification that that actual
00:12:37.640 | person with that actual problem decided would be actually helpful to them.
00:12:42.120 | So these are pretty authentic experiences for applied machine learning.
00:12:48.480 | Now of course you're missing all the bits that went before, which was why did this company,
00:12:52.840 | this startup, decide that predicting the auction sale price of bulldozers was important?
00:12:58.800 | Where did they get the data from?
00:13:00.560 | How did they clean the data?
00:13:02.320 | And so forth.
00:13:03.440 | And that's all important stuff as well, but the focus of this course is really on what
00:13:08.200 | happens next, which is like how do you actually build the model.
00:13:12.480 | One of the great things about you working on Kaggle competitions, whether they be running
00:13:16.040 | now or whether they be old ones, is that you can submit to the leaderboard, even old closed
00:13:22.080 | competitions, you can submit to the leaderboard and find out how would you have gone.
00:13:26.360 | And there's really no other way in the world of knowing whether you're competent at this
00:13:32.160 | kind of data and this kind of model than doing that.
00:13:35.640 | Because otherwise, if your accuracy is really bad, is it because this is just very hard,
00:13:41.080 | like it's just not possible because the data is so noisy that you can't do better?
00:13:46.120 | Or is it actually that it's an easy data set and you made a mistake?
00:13:51.600 | And like when you finish this course and apply this to your own projects, this is going to
00:13:58.560 | be something you're going to find very hard and there isn't a simple solution to it, which
00:14:03.200 | is you're now using something that hasn't been on Kaggle, it's your own data set, do
00:14:08.840 | you have a good enough answer or not?
00:14:12.320 | So we'll talk about that more during the course.
00:14:16.480 | And in the end, we just have to know that we have good, effective techniques to reliably
00:14:22.000 | build baseline models, otherwise there's really no way to know.
00:14:27.440 | There's no way other than creating a Kaggle competition or getting 100 top data scientists
00:14:33.160 | to work at your problem to really know what's possible.
00:14:37.000 | So Kaggle competitions are fantastic for learning.
00:14:41.760 | And as I've said many times, I've learned more from competing in Kaggle competitions
00:14:46.000 | than everything else I've done in my life.
00:14:49.280 | So to compete in a Kaggle competition, you need the data.
00:14:52.840 | This one's an old competition, so it's not running now, but we can still access everything.
00:15:00.040 | So we first of all want to understand what the goal is.
00:15:03.680 | And I suggest that you read this later, but basically we're going to try and predict the
00:15:07.200 | sale price of heavy equipment.
00:15:10.580 | And one of the nice things about this competition is that if you're like me, you probably don't
00:15:20.400 | know very much about heavy industrial equipment auctions.
00:15:20.400 | I actually know more than I used to because my toddler loves building equipment, so we
00:15:26.000 | actually watch YouTube videos about front-end loaders and forklifts.
00:15:30.520 | But two months ago, I was a real layman.
00:15:37.040 | So one of the nice things is that machine learning should help us understand a data
00:15:41.280 | set, not just make predictions about it.
00:15:43.920 | So by picking an area which we're not familiar with, it's a good test of whether we can build
00:15:49.200 | an understanding.
00:15:51.200 | Because otherwise what can happen is that your intuition about the data can make it
00:15:55.280 | very difficult for you to be open-minded enough to see what does the data really say.
00:16:00.920 | It's easy enough to download the data to your computer.
00:16:05.520 | You just have to click on the data set, so here is train.zip, and click download.
00:16:15.000 | And so you can go ahead and do that if you're running on your own computer right now.
00:16:18.480 | If you're running on AWS, it's a little bit harder because unless you're familiar with
00:16:24.680 | text-mode browsers like Elinks or Lynx, it's quite tricky to get the dataset from Kaggle.
00:16:31.400 | So a couple of options.
00:16:33.160 | One is you can download it to your computer and then SCP it to AWS, so SCP works just
00:16:39.960 | like SSH but it copies data rather than logging in.
00:16:43.600 | I'll show you a trick though that I really like, and it relies on using Firefox.
00:16:47.840 | For some reason Chrome doesn't work correctly with Kaggle for this.
00:16:55.440 | So if I go on Firefox to the website, eventually, and what we're going to do is we're going
00:17:10.360 | to use something called the JavaScript console.
00:17:14.720 | So every web browser comes with a set of tools for web developers to help them see what's
00:17:21.380 | going on, and you can hit control-shift-i to bring up this web developer
00:17:41.240 | tools and one of the tabs is network.
00:17:45.720 | And so then if I click on train.zip and I click on download, and I'm not even going
00:17:56.760 | to download it, I'm just going to say cancel, but you'll see down here it's shown me all
00:18:01.440 | of the network connections that were just initiated.
00:18:05.080 | And so here's one which is downloading a zip file from storage.googleapis.com, blah blah
00:18:10.240 | blah.
00:18:11.240 | That's probably what I want, that looks good.
00:18:14.080 | So what you can do is you can right-click on that and say copy, copy as curl.
00:18:21.320 | So curl is a Unix command like wget that downloads stuff.
00:18:27.240 | So if I go copy as curl, that's going to create a command that has all of my cookies, headers,
00:18:34.800 | everything in it necessary to download this authenticated data set.
00:18:39.600 | So if I now go into my server, and if I paste that, you can see a really really long curl
00:18:49.680 | command.
00:18:56.880 | One thing I notice is that at least recent versions have started adding this --2.0 thing
00:19:03.600 | to the command.
00:19:04.600 | That doesn't seem to work with all versions of curl, so something you might want to do
00:19:17.400 | is to pop that into an editor, find it and get rid of it, and then use that instead.
00:19:27.000 | Now one thing to be very careful about, by default curl downloads the file and displays
00:19:34.760 | it in your terminal.
00:19:36.600 | So if I try to display this, it's going to display gigabytes of binary data in my terminal
00:19:41.000 | and crash it.
00:19:42.540 | So to say that I want to output it using some different file name, I always type -o for
00:19:48.800 | output file name, and then the name of the file, bulldozers.zip, and make sure you give
00:19:56.440 | it a suitable extension.
00:20:01.080 | So in this case the file was train.zip, so bulldozers.zip.
00:20:11.800 | There it is, and so there it all is, so I could make directory bulldozers, then I could
00:20:24.280 | move my zip file into there, it's the wrong way around, yes, thank you.
00:20:39.480 | Okay, and then if you don't have unzip installed, you may need to sudo apt install unzip, or
00:21:06.520 | if you're on a Mac, that would be brew install unzip, if brew doesn't work, you haven't got
00:21:13.000 | homebrew installed, so make sure you install it, and then unzip.
00:21:18.920 | And so they're the basic steps.
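As a rough sketch, the same steps can also be run from a Jupyter cell with ! commands. The URL below is a placeholder for whatever "copy as cURL" gives you; the real command will also carry your cookies and headers:

    !mkdir -p data/bulldozers
    !curl 'https://storage.googleapis.com/...' -o data/bulldozers/bulldozers.zip   # placeholder URL
    !unzip -o data/bulldozers/bulldozers.zip -d data/bulldozers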
00:21:21.920 | One nice thing is that if you're using Crestle, most of the datasets should already be pre-installed
00:21:31.920 | for you.
00:21:33.840 | So what I can do here is I can say open a new tab, here's a cool trick, in Jupyter you
00:21:40.960 | can actually say new terminal, and you can actually get a web-based terminal.
00:21:47.200 | And so you'll find on Crestle there's a /datasets folder, /datasets/kaggle, /datasets/fastai,
00:21:58.600 | often the things you need are going to be in one of those places.
00:22:04.460 | So assuming that we don't have it already downloaded, actually Paperspace should have
00:22:09.120 | most of them as well, then we'd need to go to fastai, let's go into the courses, machine
00:22:14.760 | learning folder, and what I tend to do is I tend to put all of my data for a course into
00:22:21.240 | a folder called data.
00:22:23.760 | You'll find that if you're using Git, you'll find that that doesn't get added to Git because
00:22:29.800 | it's in the Git ignore.
00:22:31.840 | So don't worry about creating the data folder, it's not going to screw anything up.
00:22:37.240 | So I generally make a folder called data, and then I tend to create folders for everything
00:22:41.680 | I need there.
00:22:43.420 | So in this case, I'll make the bulldozers directory, cd into it, and remember the last word of the last
00:22:53.900 | command is !$ (exclamation mark dollar).
00:22:56.360 | I'll go ahead and grab that curl command again, and unzip bulldozers.zip, there we go.
00:23:25.360 | So you can now see that anything that might change from person to person, I
00:23:35.400 | generally put in a constant.
00:23:36.400 | So here I just define something called path, but if you've used the same path I just did,
00:23:39.680 | you should just be able to go ahead and run that, and let's go ahead and keep moving along.
00:23:45.120 | So we've now got all of our libraries imported, and we've set the path to the data.
00:23:52.480 | You can run shell commands from within Jupyter Notebook by using an exclamation mark.
00:23:59.960 | So if I want to check what's inside that path, I can go ls data/bulldozers, and you can see
00:24:07.360 | that works.
00:24:08.760 | Or you can even use Python variables.
00:24:11.240 | If you use a Python variable inside a Jupyter shell command, you have to put it in curlies.
00:24:18.920 | So that makes me feel good that my path is pointing at the right place.
00:24:22.040 | If you say !ls {PATH} (capital PATH in curlies), and you get nothing at all, then you're pointing at
00:24:27.920 | the wrong spot.
00:24:30.080 | Let me turn this up here.
00:24:37.760 | So the curly brackets refer to the fact that I put an exclamation mark at the front, which
00:24:41.880 | means the rest of this is not a Python command, it's a bash command.
00:24:48.960 | And bash doesn't know about capital path, because capital path is part of Python.
00:24:54.360 | So this is a special Jupyter thing which says expand this Python thing, please, before you
00:25:00.240 | pass it to the shell.
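So the pattern from the notebook, assuming the data was unzipped into data/bulldozers as above, is roughly:

    PATH = "data/bulldozers/"
    !ls {PATH}   # Jupyter expands the Python variable inside the curlies before bash sees it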
00:25:14.640 | So the goal here is to use the training set, which contains data through the end of 2011
00:25:21.120 | to predict the sale price of bulldozers.
00:25:24.040 | And so the main thing to start with then is of course to look at the data.
00:25:30.640 | Now the data is in CSV format, so one easy way to look at the data would be to use shell
00:25:37.920 | command head to look at the first few lines, head, bulldozers, and even tab completion
00:25:44.200 | works here.
00:25:45.200 | Jupyter does everything.
00:25:47.720 | So here's the first few lines.
00:25:51.040 | So there's a bunch of column headers, and then there's a bunch of data.
00:25:54.880 | So that's pretty hard to look at.
00:25:56.400 | So what we want to do is take this and read it into a nice tabular format.
00:26:01.920 | So does Terrence putting these glasses on mean I should make this bigger, or is it okay?
00:26:07.200 | Is this big enough font size for everybody?
00:26:11.360 | So this kind of data where you've got columns representing a wide range of different types
00:26:17.560 | of things, such as an identifier, a currency, a date, a size, I refer to this as structured
00:26:26.680 | data.
00:26:27.680 | Now I say I refer to this as structured data because there have been many arguments in
00:26:32.560 | the machine learning community on Twitter about what is structured data.
00:26:37.000 | Weirdly enough, this is like the most important type of distinction between data that looks
00:26:42.560 | like this and data like images where every column is of the same type.
00:26:48.080 | That's the most important distinction in machine learning, yet we don't have standard accepted
00:26:54.400 | terms.
00:26:55.400 | So I'm going to use the terms structured and unstructured.
00:26:58.960 | But note that other people you talk to, particularly in NLP, people use structured to mean something
00:27:05.400 | totally different.
00:27:07.040 | So when I refer to structured data, I mean columns of data that can have varying different
00:27:12.320 | types of data in them.
00:27:14.720 | By far the most important tool in Python for working with structured data is pandas.
00:27:20.520 | Pandas is so important that it's one of the few libraries that everybody uses the same
00:27:24.960 | abbreviation for it, which is pd.
00:27:27.920 | So you'll find that one of the things I've got here is from fastai.imports import *.
00:27:37.180 | The fastai.imports module has nothing but imports of a bunch of hopefully useful tools.
00:27:47.520 | So all of the code for fastai is inside the fastai directory inside the fastai repo.
00:28:05.280 | And so you can have a look at imports, and you'll see it's just literally a list of imports.
00:28:05.280 | And you'll find there pandas as pd.
00:28:09.400 | And so everybody does this, right?
00:28:11.000 | So you'll see lots of people using pd.something, they're always talking about pandas.
00:28:16.320 | So pandas lets us read a CSV file.
00:28:21.520 | And so when we read the CSV file, we just tell it the path to the CSV file, a list of
00:28:28.440 | any columns that contain dates, and I always add this low_memory=False, which is going
00:28:33.640 | to actually make it read more of the file to decide what the types are.
00:28:39.320 | This here is something called a Python 3.6 format string.
00:28:44.500 | It's one of the coolest parts of Python 3.6.
00:28:48.840 | We've probably used lots of different ways in the past in Python of interpolating variables
00:28:53.160 | into your strings.
00:28:54.440 | Python 3.6 has a very simple way that you'll probably always want to use from now on.
00:28:59.920 | And you create a normal string, you type in f at the start, and then if I define a variable,
00:29:10.640 | then I can say hello curly's python function.
00:29:20.360 | This is kind of confusing.
00:29:21.360 | These are not the same curlies that we saw earlier on in the ls command.
00:29:25.640 | That ls command is specific to Jupyter and it interpolates python code into shell code.
00:29:34.520 | These curlies are Python 3.6 format string curlies.
00:29:38.120 | They require an f at the start, so if I get rid of the f, it doesn't interpolate.
00:29:44.420 | So the f tells it to interpolate.
00:29:46.720 | And the cool thing is, inside that curlies, you can write any Python code you like, just
00:29:52.560 | about.
00:29:53.560 | So for example, name.upper() gives "Hello JEREMY!"
00:30:00.280 | So I use this all the time.
00:30:03.680 | And it doesn't matter, because it's a format string, it doesn't matter if the thing was
00:30:07.280 | ... I always forget my age, I think I'm 43.
00:30:13.880 | It doesn't matter if it's an integer.
00:30:15.400 | Normally if you try string concatenation with integers, Python complains; no such problem
00:30:22.320 | here.
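A minimal sketch of the f-string demo:

    name, age = 'Jeremy', 43
    f'Hello {name}'           # -> 'Hello Jeremy'
    f'Hello {name.upper()}'   # -> 'Hello JEREMY'; any Python expression works inside the curlies
    f'{name} is {age}'        # -> 'Jeremy is 43'; integers interpolate without complaint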
00:30:26.160 | So this is going to read path/train.csv into a thing called a data frame.
00:30:33.600 | Pandas data frames and R's data frames are pretty similar, so if you've used R before,
00:30:41.120 | then you'll find that this is reasonably comfortable.
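The read_csv call being described here is along these lines:

    df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False,
                         parse_dates=["saledate"])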
00:30:45.340 | So this file is 9.3 meg compressed, and 112 meg uncompressed.
00:31:01.920 | And it has 400,000 rows in it, so it takes a moment to import it.
00:31:13.960 | So when it's done, we can type the name of the data frame, df_raw, and then use various
00:31:27.240 | methods on it.
00:31:28.240 | So for example df_raw.tail() will show us the last few rows of the data frame.
00:31:34.640 | By default it's going to show the columns along the top and the rows down the side,
00:31:38.360 | but in this case there's a lot of columns.
00:31:41.000 | So I've just used .transpose() to show it the other way around.
00:31:47.280 | I've created one extra function here, display_all.
00:31:50.240 | Normally if you just type df_raw, if it's too big to show conveniently, it truncates
00:31:55.720 | it and puts little ellipses in the middle.
00:31:58.200 | So the details don't matter, but this is just changing a couple of settings to say even
00:32:02.720 | if it's got a thousand rows and a thousand columns, please still show the whole thing.
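The helper itself is just a couple of pandas display options wrapped around display, roughly:

    def display_all(df):
        # temporarily raise pandas' display limits so nothing gets truncated
        with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
            display(df)

    display_all(df_raw.tail().transpose())   # last few rows, columns down the side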
00:32:08.520 | So this is finished.
00:32:09.520 | I can actually show you that.
00:32:10.520 | In Jupyter Notebook you can type a variable of almost any kind, a video, HTML, an image,
00:32:19.680 | whatever, and it will generally figure out a way of displaying it for you.
00:32:22.960 | So in this case it's a pandas data frame, and it figures out a way of displaying it for us.
00:32:27.920 | And so you can see here that by default it doesn't show me the whole thing.
00:32:35.840 | So here's the dataset.
00:32:38.280 | We've got a few different rows.
00:32:40.080 | This is the last bit, the tail of it, last few rows.
00:32:44.960 | This is the thing we want to predict, price.
00:32:49.960 | We call this the dependent variable.
00:32:55.640 | And then we've got a whole bunch of things we could predict it with.
00:32:58.280 | And when I start with a dataset, I tend -- yes, Terrence, can I give you this?
00:33:12.480 | I've read in books that you should never look at the data because of the risk of overfit.
00:33:16.600 | Why do you start by looking at the data?
00:33:19.380 | So I was actually going to mention, I actually kind of don't, like I want to find out at
00:33:24.920 | least enough to know that I've managed to import it okay, but I tend not to really study
00:33:29.760 | it at all at this point because I don't want to make too many assumptions about it.
00:33:35.640 | I would actually say most books say the opposite; most books do a whole lot of EDA, exploratory
00:33:42.200 | data analysis first.
00:33:43.200 | >> Academic books.
00:33:44.200 | >> Yeah, academic books.
00:33:45.200 | >> Well, I mean the academic books I've read say that's one of the biggest risks of overfitting.
00:33:52.200 | >> Yeah, so the truth is kind of somewhere in between, and I generally try to do machine
00:33:59.400 | learning driven EDA, and that's what we're going to learn today.
00:34:07.420 | So the thing I do care about though is what's the purpose of the project?
00:34:12.920 | And for Kaggle projects, the purpose is very easy.
00:34:15.760 | We can just look and find out, there's always an evaluation section, how is it evaluated?
00:34:21.480 | And this is evaluated on root mean squared log error (RMSLE).
00:34:26.400 | So this means they're going to look at the difference between the log of our prediction
00:34:30.160 | of price and the log of the actual price, and then they're going to square it and add
00:34:35.440 | them up.
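In code, the metric being described here is something like:

    def rmsle(pred, actual):
        # root mean squared log error: difference of the logs, squared, averaged, square-rooted
        return np.sqrt(((np.log(pred) - np.log(actual)) ** 2).mean())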
00:34:37.500 | So because they're going to be focusing on the difference of the logs, that means that
00:34:42.320 | we should focus on the logs as well.
00:34:45.040 | And this is pretty common, like for a price, generally you care not so much about did I
00:34:49.600 | miss by $10, but did I miss by 10%?
00:34:53.280 | So if it was a million-dollar thing and you're $100,000 off, or if it was a $10,000 thing
00:34:57.920 | and you're $1,000 off, often we would consider those errors equivalent in scale.
00:35:03.120 | And so for this auction problem, the organizers are telling us they care about ratios more
00:35:10.240 | than differences, and so the log is the thing we care about.
00:35:13.920 | So the first thing I do is to take the log.
00:35:17.920 | Now np is NumPy.
00:35:20.840 | I'm assuming that you have some familiarity with NumPy.
00:35:23.900 | If you don't, we've got a video called Deep Learning Workshop, which actually isn't just
00:35:28.280 | for deep learning, it's basically for this as well.
00:35:31.460 | And one of the parts there, which we've got a time-coded link to, is a quick introduction
00:35:35.920 | to NumPy.
00:35:36.920 | But basically NumPy lets us treat arrays, matrices, vectors, high-dimensional tensors
00:35:42.760 | as if they're Python variables, and we can do stuff like log to them, and it will apply
00:35:48.000 | it to everything.
00:35:51.000 | NumPy and pandas work together very nicely.
00:35:54.680 | So in this case df_raw.SalePrice is pulling a column out of a pandas data frame, which
00:36:02.400 | gives us a pandas series, which shows us the sale prices and the indexes.
00:36:17.160 | And a series can be passed to a NumPy function, which is pretty handy.
00:36:23.160 | And so you can see here, this is how I can replace a column with a new column.
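So the replacement is a single line:

    df_raw.SalePrice = np.log(df_raw.SalePrice)   # pandas series in, NumPy log applied elementwise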
00:36:31.720 | Now that we've replaced its sale price with its log, we can go ahead and try to create
00:36:36.320 | a random forest.
00:36:37.320 | What's a random forest?
00:36:40.080 | We'll find out in detail, but in brief, a random forest is a kind of universal machine
00:36:46.760 | learning technique.
00:36:48.520 | It's a way of predicting something that can be of any kind.
00:36:52.180 | It could be a category, like is it a dog or a cat, or it could be a continuous variable
00:37:00.000 | like price.
00:37:01.940 | It can predict it with columns of pretty much any kind.
00:37:06.840 | Pixel data, zip codes, revenues, whatever.
00:37:13.480 | In general, it doesn't overfit.
00:37:16.640 | It can, and we'll learn to check whether it is, but it doesn't generally overfit too badly,
00:37:21.400 | and it's very, very easy to make to stop it from overfitting.
00:37:25.600 | You don't need -- and we'll talk more about this -- you don't need a separate validation
00:37:28.680 | set in general.
00:37:30.300 | It can tell you how well it generalizes, even if you only have one dataset.
00:37:35.440 | It has few, if any, statistical assumptions.
00:37:38.340 | It doesn't assume that your data is normally distributed.
00:37:41.560 | It doesn't assume that the relationships are linear.
00:37:44.280 | It doesn't assume that you've just specified the interactions.
00:37:48.320 | It requires very few pieces of feature engineering for many different types of situations.
00:37:55.480 | You don't have to take the log of the data, you don't have to multiply interactions together.
00:37:59.360 | So in other words, it's a great place to start.
00:38:03.480 | If your first random forest does very little useful, then that's a sign that there might
00:38:09.820 | be problems with your data.
00:38:11.120 | It's designed to work pretty much first off.
00:38:12.960 | Can you please throw it at or towards this gentleman?
00:38:15.040 | Thank you.
00:38:16.040 | What about the curse of dimensionality when you're using random forests?
00:38:20.640 | Yeah, great question.
00:38:22.120 | So there's this concept of curse of dimensionality.
00:38:25.040 | In fact there's two concepts I'll touch on, curse of dimensionality and the no-free lunch
00:38:29.580 | theorem.
00:38:30.580 | These are two concepts you'll often hear a lot about.
00:38:34.680 | They're both largely meaningless and basically stupid, and yet I would say maybe the majority
00:38:42.600 | of people in the field not only don't know that but think the opposite.
00:38:46.860 | So it's well worth explaining.
00:38:48.640 | The curse of dimensionality is this idea that the more columns you have, it basically creates
00:38:54.240 | a space that's more and more empty.
00:38:56.840 | And there's this kind of fascinating mathematical idea which is the more dimensions you have,
00:39:02.920 | the more all of the points sit on the edge of that space.
00:39:06.360 | So if you've just got a single dimension where things are like random, then they're spread
00:39:11.640 | out all over.
00:39:12.640 | Whereas if it's a square, then the probability that they're in the middle means that they
00:39:17.200 | can't have been on the edge of either dimension, so it's a little bit less likely that they're
00:39:21.320 | not on the edge.
00:39:22.840 | Each dimension you add, it becomes multiplicatively less likely that the point isn't on the edge
00:39:28.320 | of at least one dimension.
00:39:30.380 | And so basically in higher dimensions, everything sits on the edge.
00:39:34.200 | And what that means in theory is that the distance between points is much less meaningful.
00:39:39.880 | And so if we assume that somehow that matters, then it would suggest that when you've got
00:39:44.720 | lots and lots of columns and you just use them without being very careful to remove
00:39:50.560 | the ones you don't care about, that somehow things won't work.
00:39:54.800 | That turns out just not to be the case.
00:39:58.880 | It's not the case for a number of reasons.
00:40:01.200 | One is that the points still do have different distances away from each other.
00:40:06.120 | Just because they're on the edge, they still do vary in how far away they are from each
00:40:09.880 | other.
00:40:10.880 | And so this point is more similar to this point than it is to that point.
00:40:13.920 | So even things we'll learn about k-nearest neighbors actually work really well, really
00:40:18.600 | really well in high dimensions despite what the theoreticians claimed.
00:40:22.880 | And what really happened here was that in the 90s, theory totally took over machine
00:40:30.240 | learning.
00:40:31.440 | And so particularly there was this concept of these things called support vector machines
00:40:34.600 | that were theoretically very well justified, extremely easy to analyze mathematically,
00:40:39.920 | and you could kind of prove things about them.
00:40:43.040 | And we kind of lost a decade of real practical development in my opinion.
00:40:47.080 | And all these theories became very popular like the curse of dimensionality.
00:40:52.080 | Nowadays, and a lot of theoreticians hate this, the world of machine learning has become
00:40:58.320 | very empirical, which is like which techniques actually work.
00:41:01.480 | And it turns out that in practice, building models on lots and lots of columns works really
00:41:06.320 | really well.
00:41:09.000 | So the other thing to quickly mention is the no free lunch theorem.
00:41:13.000 | There's a mathematical theorem by that name that you will often hear about that claims
00:41:17.840 | that there is no type of model that works well for any kind of dataset.
00:41:26.440 | Which is true, and is obviously true if you think about it, in the mathematical sense,
00:41:32.320 | any random dataset, by definition it's random.
00:41:36.000 | So there isn't going to be some way of looking at every possible random dataset that's in
00:41:39.960 | some way more useful than any other approach.
00:41:43.000 | In the real world, we look at data which is not random.
00:41:47.240 | Mathematically we'd say it sits on some lower dimensional manifold, it was created by some
00:41:51.200 | kind of causal structure, there are some relationships in there.
00:41:57.280 | So the truth is that we're not using random datasets.
00:42:00.840 | And so the truth is, in the real world, there are actually techniques that work much better
00:42:06.120 | than other techniques for nearly all of the datasets you look at.
00:42:10.560 | And nowadays there are empirical researchers who spend a lot of time studying this, which
00:42:16.840 | techniques work a lot of the time.
00:42:20.180 | And ensembles of decision trees, of which random forests are one, is perhaps the technique
00:42:27.560 | which most often comes up the top.
00:42:29.880 | And that is despite the fact that until the library that we're showing you today, Fast
00:42:34.880 | AI came along, there wasn't really any standard way to pre-process them properly and to properly
00:42:41.120 | set their parameters.
00:42:42.800 | So I think it's even more strong than that.
00:42:46.880 | So yeah, I think this is where the difference between theory and practice is huge.
00:42:55.000 | So when I try to create a random forest regressor, what is that?
00:42:59.440 | Random forest regressor.
00:43:00.440 | OK, it's part of something called sklearn. sklearn is scikit-learn.
00:43:05.880 | It is by far the most popular and important package for machine learning in Python.
00:43:11.240 | It does nearly everything.
00:43:13.220 | It's not the best at nearly everything, but it's perfectly good at nearly everything.
00:43:18.600 | So you might find in the next part of this course with your net, you're going to look
00:43:23.040 | at a different kind of decision tree ensemble called gradient boosting trees, where actually
00:43:28.640 | there's something called xgboost, which is better than gradient boosting trees in scikit-learn.
00:43:35.320 | But it's pretty good at everything, so I'm really going to focus on scikit-learn.
00:43:41.440 | Random forest, you can do two kinds of things with a random forest.
00:43:44.200 | If I hit tab, I haven't imported it.
00:43:48.960 | So let's go back to where we import.
00:43:58.520 | So you can hit tab in Jupyter Notebook to get tab completion for anything that's in
00:44:04.320 | your environment.
00:44:05.320 | You'll see that there's also a random forest classifier.
00:44:09.000 | So in general, there's an important distinction between things which can predict continuous
00:44:14.720 | variables, and that's called regression, and therefore a method for doing that would be
00:44:19.440 | a regressor, and things that predict categorical variables, and that is called classification,
00:44:27.480 | and the things that do that are called classifiers.
00:44:30.640 | So in our case, we're trying to predict a continuous variable price.
00:44:34.440 | So therefore we are doing regression, and therefore we need a regressor.
00:44:39.840 | A lot of people incorrectly use the word regression to refer to linear regression, which is just
00:44:45.560 | not at all true or appropriate.
00:44:48.560 | Regression means a machine learning model that's trying to predict some kind of continuous
00:44:52.080 | outcome.
00:44:53.080 | It has a continuous dependent variable.
00:44:57.300 | So pretty much everything in scikit-learn has the same form.
00:45:00.300 | You first of all create an instance of an object for the machine learning model you want.
00:45:04.760 | You then call fit, passing in the independent variables, the things you want to use to predict,
00:45:11.360 | and the dependent variable, the thing that you want to predict.
00:45:13.920 | So in our case, the dependent variable is the data frame's sale price column, and so
00:45:24.640 | the thing we want to use to predict is everything except that.
00:45:28.000 | In pandas, the drop method returns a new data frame with a list of columns removed.
00:45:35.880 | A list of rows or columns removed.
00:45:37.960 | So axis=1 means removed columns.
00:45:40.800 | So this here is the data frame containing everything except for sale price.
00:45:52.400 | Let's find out.
00:46:00.560 | So to find out, I could hit shift+tab, and that will bring up a quick inspection of the
00:46:08.160 | parameters.
00:46:09.160 | In this case, it doesn't quite tell me what I want.
00:46:12.140 | So if I hit shift+tab twice, it gives me a bit more information.
00:46:17.000 | Ah yes, and that tells me it's a single label or list-like.
00:46:21.120 | List-like means like anything you can index.
00:46:23.000 | In Python, there's lots of things.
00:46:24.800 | By the way, if I hit three times, it will give me a whole little window at the bottom.
00:46:30.160 | So that was shift+tab.
00:46:33.440 | Another way of doing that, of course, which we learned, would be question mark, question
00:46:37.080 | mark, df_raw.drop.
00:46:45.040 | Double question mark would be the source code for it, or a single question mark is the documentation.
00:46:54.480 | So I think that trick of tab complete, shift+tab parameters, question mark and double question
00:47:00.560 | mark for the docs and the source code, if you know nothing else about using Python libraries,
00:47:07.040 | know that because now you know how to find out everything else.
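Putting those pieces together, the cell we're about to run is along these lines (n_jobs=-1 just says use every CPU core):

    from sklearn.ensemble import RandomForestRegressor

    m = RandomForestRegressor(n_jobs=-1)                       # create the model object
    m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)  # independent variables, dependent variable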
00:47:11.720 | So we try to run it and it doesn't work.
00:47:16.480 | So why didn't it work?
00:47:18.000 | So anytime you get a stack trace like this, so an error, the trick is to go to the bottom
00:47:24.400 | because the bottom tells you what went wrong.
00:47:26.480 | Above it, it tells you all of the functions that could cause other functions to get there.
00:47:31.760 | Could not convert string to float: 'Conventional'.
00:47:35.760 | So there was a value inside my dataset, conventional, and it didn't know how to create a model using
00:47:45.520 | that string.
00:47:47.040 | Now that's true.
00:47:49.120 | We have to pass numbers to most machine learning models, and certainly to random forests.
00:47:56.880 | So step one is to convert everything into numbers.
00:48:02.460 | So our dataset contains both continuous variables, so numbers where the meaning is numeric, like
00:48:09.480 | price, and it contains categorical variables which could either be numbers where the meaning
00:48:17.840 | is not continuous, like zip code, or it could be a string, like large, small, and medium.
00:48:25.760 | So categorical and continuous variables.
00:48:28.680 | We want to basically get to a point where we have a dataset where we can use all of
00:48:33.000 | these variables.
00:48:34.000 | So they have to all be numeric, and they have to be usable in some way.
00:48:37.420 | So one issue is that we've got something called sale date, which you might remember right
00:48:44.360 | at the top, we told it that that's a date, so it's been parsed as a date, and so you
00:48:49.320 | can see here it's data type, dtype, very important thing, data type is date time, 64-bit.
00:48:57.120 | So that's not a number.
00:49:00.360 | And this is actually where we need to do our first piece of feature engineering.
00:49:04.960 | Inside a date is a lot of interesting stuff.
00:49:08.920 | So since you've got the catch box, can you tell me what are some of the interesting bits
00:49:13.240 | of information inside a date?
00:49:15.360 | Well you can see like a time series pattern, I guess.
00:49:21.740 | That's true, I didn't express very well.
00:49:24.140 | What are some columns that we could pull out of this?
00:49:27.340 | Year, month, and then the date.
00:49:30.260 | The date as in, tell me at least to be a number, year, month, quarter, you want to pass it
00:49:36.760 | to your right and get some more behind you?
00:49:38.920 | Just pass it to your right, you've got some more columns for us?
00:49:45.860 | Day of month, keep going to the right.
00:49:49.160 | Day of week, yeah.
00:49:58.400 | Week of year?
00:49:59.400 | Yeah, okay.
00:50:00.400 | I'll give you a few more that you might want to think about would be like, is it a holiday?
00:50:07.840 | Is it a weekend?
00:50:10.200 | Was it raining that day?
00:50:12.360 | Was there a sports event that day?
00:50:15.920 | It depends a bit on what you're doing, right?
00:50:18.040 | So like if you're predicting soda sales in SoMa, you would probably want to know was there
00:50:25.680 | a San Francisco Giants ballgame on that day?
00:50:29.120 | So like what's in a date is one of the most important pieces of feature engineering you
00:50:33.200 | can do, and no machine learning algorithm can tell you whether the Giants were playing
00:50:39.280 | that day and that it was important.
00:50:41.920 | So this is where you need to do feature engineering.
00:50:44.680 | So I do as many things automatically as I can for you.
00:50:51.920 | So here I've got something called add date part.
00:50:55.960 | What is that?
00:50:58.360 | It's something inside fastai.structured.
00:51:03.360 | And what is it?
00:51:04.360 | Well, let's read the source code.
00:51:08.120 | Here it is.
00:51:09.120 | You'll find most of my functions are less than half a page of code.
00:51:15.080 | So often rather than having docs, I'm going to try to add docs over time, but they're
00:51:20.640 | designed that you can understand them by reading the code.
00:51:23.000 | So we're passing in a data frame, and the name of some field, which in this case was
00:51:27.820 | sale date, and so in this case we can't go df.fieldname because that would actually find
00:51:35.600 | a field called field name literally.
00:51:38.400 | So df.fieldname is how we grab a column where that column name is stored in this variable.
00:51:44.800 | So we've now got the field itself, the series.
00:51:48.580 | And so what we're going to do is we're going to go through all of these different strings,
00:51:54.200 | and this is a piece of Python which actually looks inside an object and finds an attribute
00:52:01.320 | with that name.
00:52:02.320 | So this is going to go through, and again you can Google for Python get attribute, it's
00:52:06.360 | a cool little advanced technique, but this is going to go through and it's going to find
00:52:10.840 | for this field it's going to find its year attribute.
00:52:16.960 | Now Pandas has got this interesting idea which is, if I actually look inside, let's go field
00:52:22.760 | = df_raw.saledate; this is the kind of experiment I want you to do, play around.
00:52:26.800 | So I've now got that in a field object, and so I can go field. and hit tab,
00:52:36.080 | and let's see, is year in there?
00:52:38.260 | Oh it's not.
00:52:40.480 | Why not?
00:52:41.480 | Well that's because year is only going to apply to Pandas series that are datetime objects.
00:52:47.920 | So what Pandas does is it splits out different methods inside attributes that are specific
00:52:53.800 | to what they are.
00:52:55.020 | So datetime objects will have a dt attribute defined, and that is where you'll find
00:53:02.760 | all the datetime specific stuff.
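That experiment, in a cell:

    field = df_raw.saledate
    field.dt.year   # the datetime parts live under the .dt accessor, not on the series itself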
00:53:07.200 | So what I went through was I went through all of these and picked out all of the ones
00:53:11.040 | that could ever be interesting for any reason.
00:53:14.120 | And this is like the opposite of the curse of dimensionality.
00:53:17.040 | It's like if there is any column or any variant of that column that could ever be interesting
00:53:22.000 | at all, add that to your data set and every variation of it you can think of.
00:53:27.280 | There's no harm in adding more columns nearly all the time.
00:53:31.920 | So in this case we're going to go ahead and add all of these different attributes.
00:53:37.360 | And so for every one I'm going to create a new field that's going to be called the name
00:53:45.040 | of your field with the word "date" removed, so it will be "sale" and then the name of
00:53:50.440 | the attribute.
00:53:51.440 | So we're going to get a sale year, sale month, sale week, sale day, etc etc.
00:53:56.560 | And then at the very end I'm going to remove the original field.
00:54:01.440 | Because remember we can't use "sale date" directly because it's not a number.
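Here is a condensed sketch of add_datepart along the lines just described; the real fastai version pulls out more attributes than this:

    import re

    def add_datepart(df, fldname):
        fld = df[fldname]                           # grab the column whose name is in the variable
        targ_pre = re.sub('[Dd]ate$', '', fldname)  # field name with the word "date" removed
        for n in ('Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear'):
            df[targ_pre + n] = getattr(fld.dt, n.lower())   # e.g. saleYear, saleMonth, ...
        df.drop(fldname, axis=1, inplace=True)      # the original non-numeric field goes away

    add_datepart(df_raw, 'saledate')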
00:54:07.480 | So you're saying this only worked because it was a date type?
00:54:14.800 | Did you make it a date type or was it already saved as one in the original?
00:54:18.520 | Yeah, it's already a date type.
00:54:20.480 | And the reason it was a date type is because when we imported it, we said parse_dates=
00:54:29.200 | and told pandas it's a date type.
00:54:31.100 | So as long as it looks date-ish and we tell it to parse it as a date, it'll turn it into
00:54:37.000 | a date type.
00:54:38.000 | Was there a way to do that so it would just look through all the columns and say "if it
00:54:41.480 | looks like a date, make it a date" or do you have to know which one?
00:54:46.260 | I think there might be but for some reason it wasn't ideal.
00:54:49.760 | Maybe it took lots of time or it didn't always work or for some reason I had to list it here.
00:54:56.280 | I would suggest checking out the docs for pandas.read_csv and maybe on the forum you
00:55:01.360 | can tell us what you find because I can't remember offhand.
00:55:12.680 | Let's do that one on the same forum thread that Savannah creates because I think it's
00:55:22.960 | a reasonably advanced question, but generally speaking the time zone in a properly formatted
00:55:28.760 | date will be included in the string and it should pull it out correctly and turn it into
00:55:34.480 | a universal time zone.
00:55:36.960 | Generally speaking, it should handle it for you.
00:55:42.280 | So for indexing a column, should you simply use the dot, or the df[...] form?
00:55:53.940 | The square brackets one is safer, particularly if you're assigning to a column.
00:55:58.980 | If it didn't already exist, you need to use the square brackets format, otherwise you'll
00:56:03.000 | get weird errors.
00:56:04.640 | So the square brackets format is safer, the dot version saves me a couple of keystrokes
00:56:10.000 | so I probably use it more than I should.
00:56:13.440 | In this particular case, because I wanted to grab something that had something inside
00:56:22.120 | it, wasn't the name itself, I have to use square brackets.
00:56:25.920 | So square brackets is going to be your safe bet if in doubt.
00:56:32.080 | So after I run that, you'll notice that df_raw.columns gives me a list of all of the columns just
00:56:44.160 | as strings, and at the end, there they all are.
00:56:47.520 | So it's removed sale date and it's added all those.
00:56:50.960 | So that's not quite enough.
00:56:53.880 | The other problem is that we've got a whole bunch of strings in there.
00:57:10.160 | So here's low, high, medium.
00:57:18.000 | So pandas actually has a concept of a category data type, but by default it doesn't turn
00:57:24.080 | anything into a category for you.
00:57:26.160 | So I've created something called train_cats, which creates categorical variables for everything
00:57:34.720 | that's a string.
00:57:37.120 | And so what that's going to do is behind the scenes it's going to create a column that's
00:57:40.920 | actually a number, it's an integer, and it's going to store a mapping from the integers
00:57:46.840 | to the strings.
00:57:50.200 | The reason it's train_cats is that you use this for the training set.
00:57:53.760 | More advanced usage is that when we get to looking at the test and validation sets, this
00:57:57.760 | is a really important idea.
00:58:01.520 | In fact Terrence came to me the other day and he said, "My model's not working.
00:58:05.580 | Why not?"
00:58:06.580 | And he figured it out for himself.
00:58:08.120 | It turned out the reason why was because the mappings he was using from string to number
00:58:12.680 | in the training set were different to the mappings he was using from string to number
00:58:16.880 | in the test set.
00:58:18.080 | So therefore in the training set, high might have been 3, but in the test set it might
00:58:23.920 | have been 2.
00:58:25.160 | So the 2 were totally different, and so the model was basically non-predictive.
00:58:30.920 | So I have another function called apply_cats, where you can pass in your existing training
00:58:39.520 | set and it will use the same mappings to make sure your test set or validation set uses
00:58:45.040 | the same mappings.
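In use, that looks like this (df_test here is a hypothetical second data frame, not something loaded above):

    train_cats(df_raw)             # convert every string column into a pandas category
    # later, for a test or validation set:
    # apply_cats(df_test, df_raw)  # reuse df_raw's string-to-number mappings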
00:58:46.880 | So when I go train_cats, it's actually not going to make the data frame look different
00:58:51.920 | at all.
00:58:52.920 | Behind the scenes it's going to turn them all into numbers.
00:58:59.400 | We finish at 12, 11:50.
00:59:04.480 | Let's see how we go, I'll try and finish on time.
00:59:11.960 | So you'll see now, remember I mentioned there was this .dt attribute that gives you access
00:59:16.760 | to everything, assuming it's about the date time, there's a .cat attribute that gives
00:59:21.240 | you access to things assuming something's a category.
00:59:24.960 | And so UsageBand was a string, and now that I've run train_cats, it's turned into
00:59:29.720 | a category, so I can go df_raw.UsageBand.cat and there's a whole bunch of other things
00:59:38.320 | we've got there.
00:59:41.040 | So one of the things we've got there is .categories, and you can see here is the list.
00:59:46.400 | Now one of the things you might notice is that this list is in a bit of a weird order,
00:59:50.720 | high, low, medium.
00:59:52.640 | The truth is, it doesn't matter too much, but what's going to happen when we use the
00:59:57.240 | random forest is this is going to be 0, this is going to be 1, this is going to be 2, and
01:00:02.520 | we're going to be creating decision trees.
01:00:04.440 | And so we're going to have a decision tree that can split things at a single point.
01:00:08.020 | So it'd either be high versus low and medium, or medium versus high and low.
01:00:14.200 | That would be kind of weird.
01:00:15.760 | It actually turns out not to work too badly, but it'll work a little bit better if you
01:00:20.160 | have these in sensible orders.
01:00:22.140 | So if you want to reorder a category, then you can just go cat.set_categories, pass
01:00:27.360 | in the order you want, and tell it that it's ordered.
01:00:31.280 | And almost every pandas method has an inplace parameter which, rather than returning a new
01:00:38.440 | data frame, changes that data frame.
01:00:42.080 | So I'm not going to do that everywhere; I didn't check carefully for other categories that should
01:00:45.240 | be ordered, but this seems like a pretty obvious one.
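The reordering call from the notebook looks roughly like this; note that inplace= worked in the pandas of this era, while newer pandas wants you to assign the result back instead:

```python
df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'],
                                    ordered=True, inplace=True)
```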
01:00:57.520 | Sure.
01:00:58.800 | The UsageBand column is actually going to be, well, this is actually what our random forest
01:01:13.880 | is going to see: these numbers, 1, 0, 2, 1.
01:01:17.720 | And they map to the position in this array.
01:01:20.240 | And as we're going to learn shortly, a random forest consists of a bunch of trees that's
01:01:24.520 | going to make a single split, and a single split is going to be either greater than or
01:01:29.280 | less than 1, or greater than or less than 2.
01:01:33.700 | So we could split it into high versus low-and-medium, which semantically makes sense,
01:01:40.520 | like "is it big?", or we could split it into medium versus high-and-low, which doesn't
01:01:46.880 | make much sense.
01:01:48.600 | So in practice, the decision tree could then make a second split to say medium versus high
01:01:53.640 | and low, and then within the high and low into high and low.
01:01:56.600 | But by putting it in a sensible order, if it wants to split out low, it can do it in
01:02:02.160 | one decision rather than two.
01:02:03.980 | And we'll be learning more about this shortly.
01:02:07.760 | It honestly is not a big deal, but I just wanted to mention it's there.
01:02:12.120 | It's also good to know that people, when they talk about different types of categorical
01:02:16.440 | variable, specifically you need to know there's a kind of categorical variable called ordinal.
01:02:21.400 | And an ordinal categorical variable is one that has some kind of order, like high, medium,
01:02:26.840 | and low.
01:02:28.200 | And random forests aren't terribly sensitive to that fact, but it's worth knowing it's
01:02:34.640 | there and trying it out.
01:02:42.520 | That's what I'm saying.
01:02:43.520 | It helps a little bit.
01:02:44.520 | It means you can get there with one decision rather than two.
01:02:54.400 | Yeah, exactly.
01:02:55.640 | So for free, we get a negative one which refers to missing.
01:03:01.280 | And one of the things we're going to do is actually add one.
01:03:03.640 | Somebody pass the mic back to Paul.
01:03:05.200 | Let people know it's coming!
01:03:13.680 | So we're going to add one to all of our codes, to make missing zero, later on.
01:03:18.600 | So for these categories, you're basically mapping strings to different integers.
01:03:36.960 | So get_dummies, which we'll get to in a moment, is going to create three separate columns:
01:03:40.800 | ones and zeros for high, ones and zeros for medium, ones and zeros for low, whereas this
01:03:44.240 | one creates a single column with an integer, 0, 1, or 2.
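To make the contrast concrete, here is a sketch of the two options (get_dummies is covered properly later in the course):

```python
import pandas as pd

# One integer column: each row holds the code for its category,
# i.e. its position in the categories list
codes = df_raw.UsageBand.cat.codes

# Three 0/1 columns instead, one per level (High, Low, Medium)
dummies = pd.get_dummies(df_raw.UsageBand)
```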
01:03:47.800 | So at this point, as long as we always make sure we use .cat.codes, the thing with the
01:04:04.520 | numbers in it, we're basically done.
01:04:07.600 | All of our strings have been turned into numbers, our dates have been turned into a bunch of
01:04:11.040 | numeric columns, and everything else is already a number.
01:04:16.360 | The only other main thing we have to do is notice that we have lots of missing values.
01:04:21.520 | So here is df_raw.isnull; that's going to return true or false, depending on whether something
01:04:28.160 | is empty, .sum is going to add up how many are empty for each series, and then I'm going
01:04:37.040 | to sort them and divide by the size of the dataset.
01:04:40.960 | So here we have some things which have quite high percentages of nulls.
01:04:50.080 | So those are the missing values; we can look at them with display_all... maybe I didn't run it.
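The line being described is roughly the following, where display_all is the small helper defined earlier in the notebook:

```python
# Fraction of missing values per column, sorted by column name
display_all(df_raw.isnull().sum().sort_index() / len(df_raw))
```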
01:05:08.280 | So we're going to get to that in a moment, but I will point something out, which is that reading
01:05:13.160 | the CSV took a minute or so, and the processing took another 10 seconds or so. From time to
01:05:19.120 | time, when I've done a little bit of work I don't want to wait for again, I will tend
01:05:21.960 | to save where I'm at.
01:05:23.880 | So here I'm going to save it.
01:05:24.880 | And I'm going to save it in a format called feather format; this is very, very new.
01:05:29.400 | But what this is going to do is it's going to save it to disk in exactly the same basic
01:05:33.400 | format that it's actually in RAM.
01:05:35.560 | This is by far the fastest way to save something, and the fastest way to read it back.
01:05:39.920 | So most of the folks you deal with, unless they're on the cutting edge, won't be familiar
01:05:44.720 | with this format, so this will be something you can teach them about.
01:05:47.360 | It's becoming the standard.
01:05:49.560 | It's actually becoming something that's going to be used not just in pandas, but in Java,
01:05:56.560 | in Spark, in lots of things for communicating across computers because it's incredibly fast.
01:06:04.040 | And it's actually co-designed by the guy who made pandas, Wes McKinney.
01:06:08.200 | So we can just go df_raw.to_feather and pass in some name.
01:06:14.480 | I tend to have a folder called temp for all of my "as I'm going along" stuff.
01:06:21.800 | And so when you go os.makedirs, you can pass in any path here you like.
01:06:26.880 | It won't complain if it's already there, if you say exist_ok=True.
01:06:30.400 | If there are some subdirectories, it'll create them for you, so this is a super handy little
01:06:34.560 | function.
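Putting those two steps together, the save looks something like this (the 'tmp' folder name and file name are just the conventions used in the lesson notebook):

```python
import os

os.makedirs('tmp', exist_ok=True)        # creates intermediate dirs, no error if present
df_raw.to_feather('tmp/bulldozers-raw')  # save the frame in feather format
```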
01:06:38.040 | So it's not installed, because I'm using Cressel for the first time.
01:06:45.240 | It's complaining about that.
01:06:46.740 | So if you get a message that something's not installed, if you're using Anaconda, you can
01:06:51.240 | conda install.
01:06:53.760 | Cressel actually doesn't use Anaconda, it uses pip, so we wait for that to run.
01:07:11.360 | And so now if I run it... sometimes you may find you actually have to restart Jupyter.
01:07:23.600 | I won't do that now because we're nearly out of time, but if you do restart Jupyter, you'll
01:07:26.840 | be able to keep moving along.
01:07:28.300 | So from now on, you don't have to rerun all the stuff that I have.
01:07:32.400 | You can just say pd.read_feather and we've got our data frame back.
01:07:38.140 | So the last step we're going to do is to actually replace the strings with the numeric codes.
01:07:47.760 | And we're going to pull out the dependent variable, sale price, into a separate variable.
01:07:53.560 | And we're going to also handle missing continuous values.
01:07:56.760 | And so how are we going to do that?
01:07:59.840 | So you'll see here we've got a function called proc_df.
01:08:05.440 | What is that?
01:08:06.440 | It's inside fastai.structured, again.
01:08:21.040 | And here it is.
01:08:22.800 | So quite a lot of the functions have a few additional parameters that you can provide,
01:08:27.000 | and we'll talk about them later, but basically we're providing the data frame to process
01:08:30.720 | and the name of the dependent variable, the y field name.
01:08:35.740 | And so all it's going to do is it's going to make a copy of the data frame, it's going
01:08:41.040 | to grab the y value, it's going to drop the dependent variable from the original, and
01:08:49.520 | then it's going to fix missing.
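The call itself is just the following; note that the return signature has varied across fastai 0.7 versions (later ones also return a dict recording which medians were used to fill missing values):

```python
from fastai.structured import proc_df

# df: all-numeric copy of df_raw; y: the dependent variable pulled out as an array
df, y = proc_df(df_raw, 'SalePrice')
```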
01:08:53.880 | So how do we fix missing?
01:08:58.680 | So what we do to fix missing is pretty simple.
01:09:03.080 | If it's numeric, then we fix it by basically saying let's first of all check that it does
01:09:09.560 | have some missing.
01:09:10.560 | So if it does have some missing values, so in other words the isnull().sum() is non-zero,
01:09:16.600 | then we're going to create a new column with the same name as the original, plus_na, and
01:09:22.520 | it's going to be a boolean column with a 1 any time that was missing, and a 0 any time
01:09:27.760 | it wasn't.
01:09:28.760 | We're going to talk about this again next week, but I'll give you the quick version.
01:09:33.280 | Having done that, we're then going to replace the n_a's, the missing, with the median.
01:09:39.320 | So anywhere that used to be missing will be replaced with the median, and we'll add a
01:09:42.600 | new column to tell us which ones were missing.
01:09:46.560 | We only do that for numeric, we don't need it for categories because Pandas handles categorical
01:09:51.360 | variables automatically by setting them to -1.
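A minimal sketch of that fix-missing logic for a single numeric column (the real library function also remembers the medians so they can be reused on a test set):

```python
import pandas as pd

def fix_missing_sketch(df, col, name):
    # Only act if the column actually has missing values
    if pd.isnull(col).sum():
        df[name + '_na'] = pd.isnull(col)    # flag column: True where it was missing
        df[name] = col.fillna(col.median())  # fill the gaps with the median
```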
01:09:55.800 | So what we're going to do is, if it's not numeric and it's a categorical type (we'll talk about
01:10:07.520 | the maximum number of categories later, but let's assume this condition is always true),
01:10:10.720 | we're going to replace the column with its codes, the integers, plus 1.
01:10:18.320 | So by default Pandas uses -1 for missing, so now 0 will be missing, and 1, 2, 3, 4 will
01:10:27.080 | be all the other categories.
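And a sketch of the corresponding step for categorical columns:

```python
from pandas.api.types import is_numeric_dtype

def numericalize_sketch(df, col, name):
    # Swap the strings for their integer codes, plus 1, so that
    # pandas' -1 code for missing becomes 0
    if not is_numeric_dtype(col):
        df[name] = col.cat.codes + 1
```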
01:10:32.960 | So we're going to talk about dummies later on in the course, but basically, if you already
01:10:37.800 | know about dummy variables: for columns with a small number of possible values, you can
01:10:41.200 | optionally turn them into dummies instead of numericalizing them, but we're not going
01:10:45.760 | to do that for now.
01:10:48.120 | So for now all we're doing is we're using the categorical codes +1, replacing missing
01:10:53.000 | values with the median, adding an additional column, telling us which ones were replaced,
01:10:58.560 | and removing the dependent variable.
01:11:02.360 | So that's what proc_df does, and it runs very quickly.
01:11:07.240 | So you'll see now, sale price is no longer here.
01:11:11.680 | We've now got a whole new variable called y that contains sale price.
01:11:17.140 | You'll see we've got a couple of extra _na columns at the end.
01:11:22.960 | And if I look at that, everything is a number.
01:11:34.880 | These Booleans are treated as numbers, they're just considered as 0 or 1, they're just displayed
01:11:39.880 | as false and true.
01:11:42.120 | So you can see here, is it the end of a month, is it the start of a month, is it the end
01:11:45.960 | of a quarter?
01:11:49.920 | It's kind of funny, right, because we've got things like a model ID, which presumably is
01:11:53.840 | something like a serial number, or it could be like the model identifier that's created
01:11:58.040 | by the factory, or something, we've got like a data source ID.
01:12:01.360 | Some of these are numbers, but they're not continuous.
01:12:05.040 | It turns out actually random forests work fine with those.
01:12:08.780 | We'll talk about why and how and a lot about that in detail, but for now all you need to
01:12:12.680 | know is no problem.
01:12:14.780 | So as long as this is all numbers, which it now is, we can now go ahead and create a random
01:12:19.080 | forest.
01:12:21.100 | So, m = RandomForestRegressor... random forests are trivially parallelizable.
01:12:27.660 | So what that means is that if you've got more than one CPU, which everybody will basically
01:12:33.200 | on their computers at home, and if you've got a T2.medium or bigger at AWS, you've got
01:12:39.600 | multiple CPUs.
01:12:41.360 | Trivially parallelizable means that it will split up the data across your different CPUs
01:12:46.640 | and basically linearly scale.
01:12:48.560 | So the more CPUs you have, pretty much it will divide the time it takes by that number.
01:12:53.920 | Not exactly, but roughly.
01:12:56.200 | So n_jobs=-1 tells the random forest regressor to create a separate job, a separate process
01:13:02.800 | basically, for each CPU you have, so that's pretty much what you want all the time.
01:13:09.060 | We fit the model using this new data frame we created, using that y value we pulled out,
01:13:13.920 | and then get the score.
01:13:15.560 | The score is going to be the R^2, we'll define that next week, hopefully some of you already
01:13:19.280 | know about the R^2.
01:13:20.280 | 1 is very good, 0 is very bad, so as you can see we've immediately got a very high score.
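The step being described is roughly:

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_jobs=-1)  # one job per CPU
m.fit(df, y)
m.score(df, y)                        # R^2 on the training data: 1 is very good, 0 is very bad
```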
01:13:28.320 | So that looks great, but what we'll talk about next week a lot more is that it's not quite
01:13:35.080 | great because maybe we had data that had points that looked like this, and we fitted a line
01:13:41.400 | that looks like this, when actually we wanted one that looks like that.
01:13:46.200 | The only way to know whether we've actually done a good job is by having some other dataset
01:13:52.120 | that we didn't use to train the model.
01:13:54.320 | Now we're going to learn about some ways with random forests we can kind of get away without
01:13:57.920 | even having that other dataset, but for now what we're going to do is we're going to split
01:14:03.800 | into 12,000 rows which we're going to put in a separate dataset called the validation
01:14:09.400 | set versus the training set that's going to contain everything else.
01:14:15.040 | And our dataset is going to be sorted by date, and so that means that the most recent 12,000
01:14:22.000 | rows are going to be our validation set.
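A sketch of that split, keeping the last 12,000 (most recent) rows for validation; split_vals here is the small helper from the lesson notebook:

```python
def split_vals(a, n):
    # first n rows for training, the rest for validation
    return a[:n].copy(), a[n:].copy()

n_valid = 12000
n_trn = len(df) - n_valid
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
```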
01:14:23.680 | Again, we'll talk more about this next week, it's a really important idea, but for now
01:14:28.640 | we can just recognize that if we do that and run it, I've created a little thing called
01:14:33.720 | print_score, and it's going to print out the root mean squared error between the predictions
01:14:38.720 | and actuals for the training set, for the validation set, the R^2 for the training set
01:14:44.560 | and the validation set.
01:14:46.240 | And you'll see that actually the R^2 for the training was 0.98, but for the validation
01:14:51.200 | was 0.89.
01:14:52.680 | Then the RMSE, and remember this is on the logs, was 0.09 for the training set and 0.25
01:15:00.720 | for the validation set.
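And print_score is roughly the following (remember the RMSE here is on the logged prices):

```python
import math

def rmse(x, y):
    return math.sqrt(((x - y) ** 2).mean())

def print_score(m):
    # RMSE on train, RMSE on validation, R^2 on train, R^2 on validation
    print([rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)])
```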
01:15:02.280 | Now if you actually go to Kaggle and go to the leaderboard, in fact let's do it right
01:15:06.880 | now, it's got private and public, I'll click on public leaderboard, and we can go down
01:15:15.240 | and find out where is 0.25.
01:15:17.720 | So there are 475 teams, and generally speaking if you're in the top half of a Kaggle competition
01:15:28.280 | you're doing pretty well.
01:15:29.800 | So 0.25... here we are... what was it exactly? 0.2507. Yeah, about 110th place.
01:15:44.440 | So we're about in the top 25%.
01:15:47.540 | So this is pretty cool: with no thinking at all, using the defaults for everything,
01:15:55.280 | we're in the top 25% of a Kaggle competition.
01:15:58.080 | So random forests are insanely powerful, and this totally standardized process is insanely
01:16:06.800 | good for any dataset.
01:16:09.240 | So we're going to wrap up, what I'm going to ask you to do for Tuesday is take as many
01:16:17.840 | Kaggle competitions as you can, whether they be running now or old ones or datasets that
01:16:22.760 | you're interested in for hobbies or work, and please try it.
01:16:27.640 | Try this process.
01:16:29.160 | And if it doesn't work, tell us on the forum: here's the dataset I'm using, here's where
01:16:34.160 | I got it from, here's the stack trace of where I got an error.
01:16:41.640 | Or, if you use my print_score function or something like it, show us what the training
01:16:47.040 | versus test scores look like, and we'll try and figure it out.
01:16:49.400 | But what I'm hoping we'll find is that all of you will be pleasantly surprised that with
01:16:54.080 | an hour or two of information you've got today, you can already get better models than most
01:17:03.240 | of the very serious practicing data scientists that compete in Kaggle competitions.
01:17:07.880 | Okay?
01:17:08.880 | Great.
01:17:09.880 | Good luck, and I'll see you on the forums.
01:17:12.040 | Oh, one more thing, Friday, the other class said a lot of them had class during my office
01:17:18.840 | hours, so if I made them 1-3 instead of 2-4 on Fridays, is that okay?
01:17:26.120 | Seminar.
01:17:27.120 | Okay, I have to find a whole other time.
01:17:30.680 | All right, I will talk to somebody who actually knows what they're doing, unlike me, about
01:17:35.640 | finding office hours.
01:17:36.640 | Thank you.
01:17:37.640 | (inaudible)
01:17:38.640 | Absolutely.
01:17:39.640 | (inaudible)