Back to Index

Intro to Machine Learning: Lesson 1


Chapters

0:00 Intro
8:05 Importing Libraries
12:05 Kaggle Competitions
16:00 Downloading Data
21:25 Installing Data
23:28 Running Paths
25:13 Structured Data
28:18 Read CSV
34:08 Evaluation
36:23 Random Forest
42:55 SKlearn
47:11 Stack trace
50:46 Feature engineering
56:16 Categorical variables

Transcript

Let me introduce everybody to everybody else first of all. We're here at the University of San Francisco learning machine learning or you might be at home watching this on video. Everybody wave. Here is the University of San Francisco graduate students. Thank you everybody and wave back from the future and from home to all the students here.

If you're watching this on YouTube, please stop and instead go to course.fast.ai and watch it from there instead. There's nothing wrong with YouTube but I can't edit these videos after I've created them, so I need to be able to give you updated information about what environments to use, how the technology changes and so you need to go here.

So you can also watch the lessons from here, here's lots of lessons and so forth. So that's tip number one for the video. Tip number two for the video is because I can't edit them, all I can do is add these things called cards and cards are little things that appear in the top right-hand corner of the screen.

So by the time this video comes out, I'm going to put a little card there right now for you to click on and try that out. Unfortunately they're not easy to notice, so keep an eye out for that because that's going to be important updates to the video. So welcome, we're going to be learning about machine learning today.

And so for everybody in the class here, you all have Amazon Web Services set up, so you might want to go ahead and launch your AWS instance now or go ahead and launch your Jupyter notebook on your own computer. If you don't have Jupyter notebook set up, then what I recommend is you go to crestle.com, sign up there, and you can then turn off Enable GPU and click Start Jupyter, and you'll have a Jupyter notebook instantly.

That costs you some money, it's 3 cents an hour. So if you don't mind spending 3 cents an hour to learn machine learning, here's a good way. So I'm going to go ahead and say Start Jupyter. And so whatever technique you use, there you go. One of the things that you'll find on the website is links to lots of information about the costs and benefits and approaches to setting up lots of different environments for Jupyter notebook, both for deep learning and for regular machine learning.

So check them out because there's lots of options. So if I then open Jupyter in a new tab, here I am in Crestle, or on AWS, or your own computer. We use the Anaconda Python distribution for basically everything. You can install that yourself. And again, there's lots of information on the website about how to set that up.

We're also assuming that either you're using Crestle or there's something else which I really like called paperspace.com, which is another place where you can fire up a notebook pretty much instantly. Both of these already have all of the fastai stuff pre-installed for you. So as soon as you open up Crestle or Paperspace, assuming you chose the Paperspace fastai template, you'll see that there's a fastai folder.

If you are using your own computer or AWS, you'll need to go to our GitHub repo, fastai and clone it. And then you'll need to do a conda update to install the libraries, and again, that's all information we've got on the website, and we've got some previous workshop videos to help you through all those steps.

So for this class, I'm assuming that you have a Jupyter notebook running. So here we are in the Jupyter notebook, and if I click on fastai, that's what you get if you git clone, or if you're on Crestle, you can see our repo here. All of our lessons are inside the courses folder, and the machine learning part 1 is in the ml1 folder.

If you're ever looking at my screen and wondering where are you, look up here and you'll see it tells you the path, fastai/courses/ml1. And today we're going to be looking at lesson 1, random forests. So here is lesson 1, RF. So there's a couple of different ways you can do this, both here in person or on the video.

You can either attempt to follow along as you watch, or you can just watch and then follow along later with the video. It's up to you, I would maybe have a loose recommendation to watch now and follow along with the video later just because it's quite hard to multitask, and if you're working on something you might miss a key piece of information which you're welcome to ask about.

But if you follow along with the video afterwards, then you can pause, stop, experiment and so forth. But anyway, you can choose either way. I'm going to go view, toggle header, view, toggle toolbar, and then full screen it so it will get a bit more space. So the basic approach we're going to be taking here is to get straight into code, start building models, not to look at theory.

We're going to get to all the theory, but at the point where you deeply understand what it's for and at the point that you're able to be an effective practitioner. So my hope is that you're going to spend your time focusing on experimenting. So if you take these notebooks and try different variations of what I show you, try it with your own datasets, the more coding you can do, the better, the more you'll learn.

My suggestion, or at least what all of my students have told me, is that the ones who went away and spent time studying books of theory rather than coding found that they learned less machine learning, and they often tell me they wish they'd spent more time coding. The stuff that we're showing in this course, a lot of it's never been shown before.

This is not a summary of other people's research. This is more a summary of 25 years of work that I've been doing in machine learning. So a lot of this is going to be shown for the first time. And so that's kind of cool because if you want to write a blog post about something that you learn here, you might be building something that a lot of people find super useful.

There's a great opportunity to practice your technical writing by showing people stuff, and here are some examples of good technical writing. It's not like, "Hey, I just learned this thing, I bet you all know it." Often it will be, "I just learned this thing and I'm going to tell you about it and other people haven't seen it." In fact, this is the first course ever that's been built on top of the fastai library, so even just stuff in the library is going to be new to everybody.

When we use Jupyter Notebook or anything else in Python, we have to import the libraries that we're going to use. Something that's quite convenient is if you use these two auto-reload commands at the top of your notebook, you can go in and edit the source code of the modules and your notebook will automatically update with those new modules.

You won't have to restart anything, so that's super handy. Then to show your plots inside the notebook, you'll want matplotlib inline. These three lines appear at the top of all of my notebooks. You'll notice when I import the libraries that for anybody here who is an experienced Python programmer, I am doing something that would be widely considered very inappropriate.
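For reference, the top of the lesson notebook looks roughly like this (a sketch based on what's described here; the exact list of imports in the real notebook may differ slightly):

    %load_ext autoreload
    %autoreload 2
    %matplotlib inline

    from fastai.imports import *        # pulls in numpy, pandas and friends
    from fastai.structured import *     # helpers for structured/tabular data

    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
    from IPython.display import display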

I'm importing star. Generally speaking in software engineering, we're taught to specifically figure out what we need and import those things. The more experienced you are as a Python programmer, the more extremely offensive practices you're going to see me use. For example, I don't follow what's called PEP8, which is the normal style of code used in Python.

I'm going to mention a couple of things. First is go along with it for a while, don't judge me just yet. There's reasons that I do these things, and if it really bothers you, then feel free to change it. But the basic idea is data science is not software engineering.

There's a lot of overlap. We're using the same languages, and in the end these things may become software engineering projects. But what we're doing right now is we're prototyping models. Prototyping models has a very different set of best practices that are barely taught anywhere. They're not even really written down.

But the key is to be able to do things very interactively and very iteratively. So for example, from library import star means you don't have to figure out ahead of time what you're going to need from that library, it's all there. Also because we're in this wonderful interactive Jupyter environment, it lets us understand what's in the libraries really well.

So for example, later on I'm using a function called display. So an obvious question is, what is display? So you can just type the name of a function and press shift enter, remember shift enter is to run a cell, and it will tell you where it's from. So anytime you see a function you're not familiar with, you can find out where it's from.

And then if you want to find out what it does, put a question mark at the start. And here you have the documentation. And then, particularly helpfully for the fastai library, I try to make as many functions as possible be no more than about five lines of code, so it's going to be really easy to read.

If you put a second question mark at the start, it shows you the source code of the function. Right so all the documentation plus the source code, so you can see nothing has to be mysterious. And we're going to be using, the other library we'll use a lot is scikit-learn, which implements a lot of machine learning stuff in Python.

The scikit-learn source code is often pretty readable. And so very often if I want to really understand something, I'll just go question mark, question mark, and the name of the scikit-learn function I'm typing, and I'll just go ahead and read the source code. As I say, the FastAI library in particular is designed to have source code that's very easy to read, and we're going to be reading it a lot.
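As a quick sketch of those lookups, using display as the example, you can run cells like:

    display      # just the name: tells you which module it comes from
    ?display     # one question mark: shows the documentation
    ??display    # two question marks: shows the source code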

So today we're going to be working on a Kaggle competition called Blue Book for Bulldozers. So the first thing we need is to get that data. So if you search Kaggle for bulldozers, then you can find it. So Kaggle competitions allow you to download a real-world dataset, a real problem that somebody is trying to solve, and solve it according to a specification that that actual person with that actual problem decided would be actually helpful to them.

So these are pretty authentic experiences for applied machine learning. Now of course you're missing all the bits that went before, which was why did this company, this startup, decide that predicting the auction sale price of bulldozers was important? Where did they get the data from? How did they clean the data?

And so forth. And that's all important stuff as well, but the focus of this course is really on what happens next, which is how do you actually build the model. One of the great things about working on Kaggle competitions, whether they're running now or old ones, is that you can submit to the leaderboard, even for old closed competitions, and find out how you would have gone.

And there's really no other way in the world of knowing whether you're competent at this kind of data and this kind of model than doing that. Because otherwise, if your accuracy is really bad, is it because this is just a very hard problem, like it's just not possible, the data is so noisy you can't do better?

Or is it actually that it's an easy dataset and you made a mistake? And when you finish this course and apply this to your own projects, this is something you're going to find very hard, and there isn't a simple solution to it: you're now using something that hasn't been on Kaggle, it's your own dataset, so do you have a good enough answer or not?

So we'll talk about that more during the course. And in the end, we just have to know that we have good, effective techniques to reliably build baseline models, otherwise there's really no way to know. There's no way other than creating a Kaggle competition or getting 100 top data scientists to work on your problem to really know what's possible.

So Kaggle competitions are fantastic for learning. And as I've said many times, I've learned more from competing in Kaggle competitions than everything else I've done in my life. So to compete in a Kaggle competition, you need the data. This one's an old competition, so it's not running now, but we can still access everything.

So we first of all want to understand what the goal is. And I suggest that you read this later, but basically we're going to try and predict the sale price of heavy equipment. And one of the nice things about this competition is that if you're like me, you probably don't know very much about heavy industrial equipment options.

I actually know more than I used to because my toddler loves building equipment, so we actually watch YouTube videos about front-end loaders and forklifts. But two months ago, I was a real layman. So one of the nice things is that machine learning should help us understand a data set, not just make predictions about it.

So by picking an area which we're not familiar with, it's a good test of whether we can build an understanding. Because otherwise what can happen is that your intuition about the data can make it very difficult for you to be open-minded enough to see what does the data really say.

It's easy enough to download the data to your computer. You just have to click on the dataset, so here is train.zip, and click download. And so you can go ahead and do that if you're running on your own computer right now. If you're running on AWS, it's a little bit harder because unless you're familiar with text-mode browsers like ELinks or Lynx, it's quite tricky to get the dataset from Kaggle onto AWS.

So a couple of options. One is you can download it to your computer and then SCP it to AWS, so SCP works just like SSH but it copies data rather than logging in. I'll show you a trick though that I really like, and it relies on using Firefox. For some reason Chrome doesn't work correctly with Kaggle for this.

So if I go on Firefox to the website, eventually, and what we're going to do is we're going to use something called the JavaScript console. So every web browser comes with a set of tools for web developers to help them see what's going on, and you can hit control-shift-i to bring up this web developer tools and one of the tabs is network.

And so then if I click on train.zip and I click on download, and I'm not even going to download it, I'm just going to say cancel, but you'll see down here it's shown me all of the network connections that were just initiated. And so here's one which is downloading a zip file from storage.googleapis.com, blah blah blah.

That's probably what I want, that looks good. So what you can do is you can right-click on that and say copy, copy as curl. So curl is a Unix command like wget that downloads stuff. So if I go copy as curl, that's going to create a command that has all of my cookies, headers, everything in it necessary to download this authenticated data set.

So if I now go into my server, and if I paste that, you can see a really really long curl command. One thing I notice is that at least recent versions have started adding this --2.0 thing to the command. That doesn't seem to work with all versions of curl, so something you might want to do is to pop that into an editor, find that and get rid of it, and then use that instead.

Now one thing to be very careful about: by default curl downloads the file and displays it in your terminal. So if I try to display this, it's going to display gigabytes of binary data in my terminal and crash it. So to say that I want to output it using a different file name, I always type -o for output file name, and then the name of the file, bulldozers.zip, and make sure you give it a suitable extension.

So in this case the file was train.zip, so bulldozers.zip. There it is, and so there it all is, so I could make directory bulldozers, then I could move my zip file into there, it's the wrong way around, yes, thank you. Okay, and then if you don't have unzip installed, you may need to sudo apt install unzip, or if you're on a Mac, that would be brew install unzip; if brew doesn't work, you haven't got Homebrew installed, so make sure you install it, and then unzip.

And so they're the basic steps. One nice thing is that if you're using Crestle, most of the datasets should already be pre-installed for you. So what I can do here is I can say open a new tab, here's a cool trick, in Jupyter you can actually say new terminal, and you can actually get a web-based terminal.

And so you'll find on Crestle there's a /datasets folder, /datasets/kaggle, /datasets/fastai, often the things you need are going to be in one of those places. So assuming that we don't have it already downloaded, actually Paperspace should have most of them as well, then we'd need to go to fastai, let's go into the courses, machine learning folder, and what I tend to do is I tend to put all of my data for a course into a folder called data.

You'll find that if you're using Git, that folder doesn't get added to Git because it's in the .gitignore. So don't worry about creating the data folder, it's not going to screw anything up. So I generally make a folder called data, and then I tend to create folders for everything I need there.

So in this case, I'll make the bulldozers folder and cd into it, and remember the last word of the last command is !$ (exclamation mark dollar). I'll go ahead and grab that curl command again, and unzip bulldozers.zip, there we go. So you can now see, anything that might change from person to person I generally put in a constant.

So here I just define something called path, but if you've used the same path I just did, you should just be able to go ahead and run that, and let's go ahead and keep moving along. So we've now got all of our libraries imported, and we've set the path to the data.

You can run shell commands from within Jupyter Notebook by using an exclamation mark. So if I want to check what's inside that path, I can go ls data/bulldozers, and you can see that works. Or you can even use Python variables. If you use a Python variable inside a Jupyter shell command, you have to put it in curlies.
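Concretely, assuming you put the data in data/bulldozers as above (the exact path is up to you), the two cells look something like this:

    PATH = "data/bulldozers/"

    !ls {PATH}   # Jupyter expands the Python variable PATH before handing the command to the shell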

So that makes me feel good that my path is pointing at the right place. If you say !ls {PATH} and you get nothing at all, then you're pointing at the wrong spot. Yes? Let me turn this up here. So the curly brackets refer to the fact that I put an exclamation mark at the front, which means the rest of this is not a Python command, it's a bash command.

And bash doesn't know about PATH, because PATH is part of Python. So this is a special Jupyter thing which says expand this Python thing, please, before you pass it to the shell. So the goal here is to use the training set, which contains data through the end of 2011, to predict the sale price of bulldozers.

And so the main thing to start with then is of course to look at the data. Now the data is in CSV format, so one easy way to look at the data would be to use shell command head to look at the first few lines, head, bulldozers, and even tab completion works here.

Jupyter does everything. So here are the first few lines. So there's a bunch of column headers, and then there's a bunch of data. So that's pretty hard to look at. So what we want to do is take this and read it into a nice tabular format. So does Terrence putting these glasses on mean I should make this bigger, or is it okay?

Is this big enough font size for everybody? So this kind of data where you've got columns representing a wide range of different types of things, such as an identifier, a currency, a date, a size, I refer to this as structured data. Now I say I refer to this as structured data because there have been many arguments in the machine learning community on Twitter about what is structured data.

Weirdly enough, this is like the most important type of distinction between data that looks like this and data like images where every column is of the same type. That's the most important distinction in machine learning, yet we don't have standard accepted terms. So I'm going to use the terms structured and unstructured.

But note that other people you talk to, particularly in NLP, people use structured to mean something totally different. So when I refer to structured data, I mean columns of data that can have varying different types of data in them. By far the most important tool in Python for working with structured data is pandas.

Pandas is so important that it's one of the few libraries where everybody uses the same abbreviation for it, which is pd. So you'll find that one of the things I've got here is from fastai.imports import *. The fastai.imports module has nothing but imports of a bunch of hopefully useful tools. So all of the code for fastai is inside the fastai directory inside the fastai repo.

And so you can have a look at imports, and you'll see it's literally just a list of imports. And you'll find there import pandas as pd. And so everybody does this, right? So you'll see lots of people using pd.something, they're always talking about pandas. So pandas lets us read a CSV file.

And so when we read the CSV file, we just tell it the path to the CSV file, a list of any columns that contain dates, and I always add this low_memory=False, which is going to make it read more of the file to decide what the types are.
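A sketch of that call, assuming the training file is called Train.csv and the date column is saledate (as in this competition's download):

    df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False,
                         parse_dates=["saledate"])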

This here is something called a Python 3.6 format string. It's one of the coolest parts of Python 3.6. We've probably used lots of different ways in the past in Python of interpolating variables into your strings. Python 3.6 has a very simple way that you'll probably always want to use from now on.

And you create a normal string, but you type an f at the start, and then if I define a variable called name, I can say hello and then name in curlies. This is kind of confusing. These are not the same curlies that we saw earlier on in the ls command. That ls command is specific to Jupyter and it interpolates Python code into shell code.

These curlies are Python 3.6 format string curlies. They require an f at the start, so if I get rid of the f, it doesn't interpolate. So the f tells it to interpolate. And the cool thing is, inside those curlies, you can write just about any Python code you like.

So for example, name.upper() gives me "Hello JEREMY". So I use this all the time. And because it's a format string, it doesn't matter what type the thing is; I always forget my age, I think I'm 43, and it doesn't matter that it's an integer. Normally if you try to do string concatenation with integers, Python complains; no such problem here.
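A minimal sketch of that:

    name = 'Jeremy'
    age = 43

    print(f'Hello {name}')           # -> Hello Jeremy
    print(f'Hello {name.upper()}')   # -> Hello JEREMY
    print(f'{name} is {age}')        # an int interpolates fine, no explicit str() needed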

So this is going to read path/Train.csv into a thing called a data frame. Pandas data frames and R's data frames are pretty similar, so if you've used R before, then you'll find that this is reasonably comfortable. So this file is 9.3 meg compressed and about 112 meg uncompressed, and it has 400,000 rows in it, so it takes a moment to import it.

So when it's done, we can type the name of the data frame, df_raw, and then use various methods on it. So for example df_raw.tail() will show us the last few rows of the data frame. By default it's going to show the columns along the top and the rows down the side, but in this case there's a lot of columns.

So I've just used .transpose() to show it the other way around. I've created one extra function here, display_all. Normally if you just type df_raw, if it's too big to show conveniently, it truncates it and puts little ellipses in the middle. So the details don't matter, but this is just changing a couple of settings to say even if it's got a thousand rows and a thousand columns, please still show the whole thing.
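The helper is roughly this (a sketch of the idea; the exact option values in the notebook may differ):

    def display_all(df):
        # temporarily lift pandas' display limits so nothing gets truncated
        with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
            display(df)

    display_all(df_raw.tail().transpose())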

So this is finished. I can actually show you that. In Jupyter Notebook you can type a variable of almost any kind, a video, HTML, an image, whatever, and it will generally figure out a way of displaying it for you. So in this case it's a pandas data frame, and it figures out a way of displaying it for me.

And so you can see here that by default it doesn't show me the whole thing. So here's the dataset. We've got a few different rows. This is the last bit, the tail of it, last few rows. This is the thing we want to predict, price. We call this the dependent variable.

And then we've got a whole bunch of things we could predict it with. And when I start with a dataset, I tend -- yes, Terrence, can I give you this? I've read in books that you should never look at the data because of the risk of overfit. Why do you start by looking at the data?

So I was actually going to mention, I actually kind of don't, like I want to find out at least enough to know that I've managed to import it okay, but I tend not to really study it at all at this point because I don't want to make too many assumptions about it.

I would actually say most books say the opposite; most books do a whole lot of EDA, exploratory data analysis, first. >> Academic books. >> Yeah, academic books. >> Well, I mean the academic books I've read say that's one of the biggest risks of overfitting. >> Yeah, so the truth is kind of somewhere in between, and I generally try to do machine learning driven EDA, and that's what we're going to learn today.

So the thing I do care about though is what's the purpose of the project? And for Kaggle projects, the purpose is very easy. We can just look and find out, there's always an evaluation section, how is it evaluated? And this is evaluated on root mean squared log error. So this means they're going to look at the difference between the log of our prediction of price and the log of the actual price, and then they're going to square it and add them up.

So because they're going to be focusing on the difference of the logs, that means that we should focus on the logs as well. And this is pretty common, like for a price, generally you care not so much about did I miss by $10, but did I miss by 10%?

So if it was a million-dollar thing and you're $100,000 off, or if it was a $10,000 thing and you're $1,000 off, often we would consider those equivalent scale issues. And so for this auction problem, the organizers are telling us they care about ratios more than differences, and so the log is the thing we care about.

So the first thing I do is to take the log. Now np is NumPy. I'm assuming that you have some familiarity with NumPy. If you don't, we've got a video called Deep Learning Workshop, which actually isn't just for deep learning, it's basically for this as well. And one of the parts there, which we've got a time-coded link to, is a quick introduction to NumPy.

But basically NumPy lets us treat arrays, matrices, vectors, high-dimensional tensors as if they're Python variables, and we can do stuff like take the log of them, and it will apply it to everything. NumPy and pandas work together very nicely. So in this case df_raw.SalePrice is pulling a column out of a pandas data frame, which gives us a pandas series, which shows us the sale prices and the index.
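So the replacement is just this one line (assuming, as in this dataset, that the dependent variable column is called SalePrice):

    df_raw.SalePrice = np.log(df_raw.SalePrice)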

And a series can be passed to a NumPy function, which is pretty handy. And so you can see here, this is how I can replace a column with a new column. Now that we've replaced its sale price with its log, we can go ahead and try to create a random forest.

What's a random forest? We'll find out in detail, but in brief, a random forest is a kind of universal machine learning technique. It's a way of predicting something that can be of any kind. It could be a category, like is it a dog or a cat, or it could be a continuous variable like price.

It can predict it with columns of pretty much any kind: pixel data, zip codes, revenues, whatever. In general, it doesn't overfit. It can, and we'll learn to check whether it is, but it doesn't generally overfit too badly, and it's very, very easy to stop it from overfitting.

You don't need -- and we'll talk more about this -- you don't need a separate validation set in general. It can tell you how well it generalizes, even if you only have one dataset. It has few, if any, statistical assumptions. It doesn't assume that your data is normally distributed.

It doesn't assume that the relationships are linear. It doesn't assume that you've specified the interactions. It requires very few pieces of feature engineering for many different types of situations. You don't have to take the log of the data, you don't have to multiply interactions together. So in other words, it's a great place to start.

If your first random forest does very little useful, then that's a sign that there might be problems with your data. It's designed to work pretty much first off. Can you please throw it at or towards this gentleman? Thank you. What about the curse of dimensionality when you're using random forests?

Yeah, great question. So there's this concept of curse of dimensionality. In fact there's two concepts I'll touch on, curse of dimensionality and the no-free lunch theorem. These are two concepts you'll often hear a lot about. They're both largely meaningless and basically stupid, and yet I would say maybe the majority of people in the field not only don't know that but think the opposite.

So it's well worth explaining. The curse of dimensionality is this idea that the more columns you have, it basically creates a space that's more and more empty. And there's this kind of fascinating mathematical idea which is the more dimensions you have, the more all of the points sit on the edge of that space.

So if you've just got a single dimension where things are random, then they're spread out all over. Whereas if it's a square, then for a point to be in the middle it can't have been on the edge of either dimension, so it's a little bit less likely that it's not on an edge.

Each dimension you add, it becomes multiplicatively less likely that the point isn't on the edge of at least one dimension. And so basically in higher dimensions, everything sits on the edge. And what that means in theory is that the distance between points is much less meaningful. And so if we assume that somehow that matters, then it would suggest that when you've got lots and lots of columns and you just use them without being very careful to remove the ones you don't care about, that somehow things won't work.

That turns out just not to be the case. It's not the case for a number of reasons. One is that the points still do have different distances away from each other. Just because they're on the edge, they still do vary in how far away they are from each other.

And so this point is more similar to this point than it is to that point. So even things we'll learn about k-nearest neighbors actually work really well, really really well in high dimensions despite what the theoreticians claimed. And what really happened here was that in the 90s, theory totally took over machine learning.

And so particularly there was this concept of these things called support vector machines that were theoretically very well justified, extremely easy to analyze mathematically, and you could kind of prove things about them. And we kind of lost a decade of real practical development in my opinion. And all these theories became very popular like the curse of dimensionality.

Nowadays, and a lot of theoreticians hate this, the world of machine learning has become very empirical, which is like which techniques actually work. And it turns out that in practice, building models on lots and lots of columns works really really well. So the other thing to quickly mention is the no free lunch theorem.

There's a mathematical theorem by that name that you will often hear about that claims that there is no type of model that works well for any kind of dataset. Which is true, and is obviously true if you think about it, in the mathematical sense, any random dataset, by definition it's random.

So there isn't going to be some way of looking at every possible random dataset that's in some way more useful than any other approach. In the real world, we look at data which is not random. Mathematically we'd say it sits on some lower dimensional manifold, it was created by some kind of causal structure, there are some relationships in there.

So the truth is that we're not using random datasets. And so the truth is, in the real world, there are actually techniques that work much better than other techniques for nearly all of the datasets you look at. And nowadays there are empirical researchers who spend a lot of time studying this, which techniques work a lot of the time.

And ensembles of decision trees, of which random forests are one, is perhaps the technique which most often comes up the top. And that is despite the fact that until the library that we're showing you today, Fast AI came along, there wasn't really any standard way to pre-process them properly and to properly set their parameters.

So I think it's even more strong than that. So yeah, I think this is where the difference between theory and practice is huge. So when I try to create a random forest regressor, what is that? Random forest regressor. OK, it's part of something called sklearn. sklearn is scikit-learn. It is by far the most popular and important package for machine learning in Python.

It does nearly everything. It's not the best at nearly everything, but it's perfectly good at nearly everything. So you might find in the next part of this course with Yannet, you're going to look at a different kind of decision tree ensemble called gradient boosting trees, where actually there's something called XGBoost, which is better than gradient boosting trees in scikit-learn.

But it's pretty good at everything, so I'm really going to focus on scikit-learn. Random forest, you can do two kinds of things with a random forest. If I hit tab, I haven't imported it. So let's go back to where we import. So you can hit tab in Jupyter Notebook to get tab completion for anything that's in your environment.

You'll see that there's also a random forest classifier. So in general, there's an important distinction between things which can predict continuous variables, and that's called regression, and therefore a method for doing that would be a regressor, and things that predict categorical variables, and that is called classification, and the things that do that are called classifiers.

So in our case, we're trying to predict a continuous variable price. So therefore we are doing regression, and therefore we need a regressor. A lot of people incorrectly use the word regression to refer to linear regression, which is just not at all true or appropriate. Regression means a machine learning model that's trying to predict some kind of continuous outcome.

It has a continuous dependent variable. So pretty much everything in scikit-learn has the same form. You first of all create an instance of an object for the machine learning model you want. You then call fit, passing in the independent variables, the things you want to use to predict, and the dependent variable, the thing that you want to predict.

So in our case, the dependent variable is the data frame's sale price column, and so the thing we want to use to predict it is everything except that. In pandas, the drop method returns a new data frame with a list of rows or columns removed.

So axis=1 means remove columns. So this here is the data frame containing everything except for sale price. Let's find out. So to find out, I could hit shift+tab, and that will bring up a quick inspection of the parameters. In this case, it doesn't quite tell me what I want.

So if I hit shift+tab twice, it gives me a bit more information. Ah yes, and that tells me it's a single label or list-like. List-like means anything you can index, and in Python there are lots of those. By the way, if I hit shift+tab three times, it will give me a whole little window at the bottom.

So that was shift+tab. Another way of doing that, of course, which we learned, would be ??df_raw.drop. Two question marks would be the source code for it, and a single question mark is the documentation. So I think that trick of tab complete, shift+tab for parameters, question mark and double question mark for the docs and the source code, if you know nothing else about using Python libraries, know that, because now you know how to find out everything else.
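So the first attempt looks something like this (a sketch, using the SalePrice column as the dependent variable):

    m = RandomForestRegressor(n_jobs=-1)
    # everything except the dependent variable as the inputs, SalePrice as the target
    m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)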

So we try to run it and it doesn't work. So why didn't it work? So anytime you get a stack trace like this, so an error, the trick is to go to the bottom because the bottom tells you what went wrong. Above it, it tells you all of the functions that could cause other functions to get there.

Could not convert string to float: 'Conventional'. So there was a value inside my dataset, "Conventional", and it didn't know how to create a model using that string. Now that's true. We have to pass numbers to most machine learning models, and certainly to random forests. So step one is to convert everything into numbers.

So our dataset contains both continuous variables, so numbers where the meaning is numeric, like price, and it contains categorical variables which could either be numbers where the meaning is not continuous, like zip code, or it could be a string, like large, small, and medium. So categorical and continuous variables.

We want to basically get to a point where we have a dataset where we can use all of these variables. So they have to all be numeric, and they have to be usable in some way. So one issue is that we've got something called sale date, which you might remember right at the top we told it is a date, so it's been parsed as a date, and you can see here its data type, dtype, very important thing, is datetime64.

So that's not a number. And this is actually where we need to do our first piece of feature engineering. Inside a date is a lot of interesting stuff. So since you've got the catch box, can you tell me what are some of the interesting bits of information inside a date?

Well you can see like a time series pattern, I guess. That's true, I didn't express that very well. What are some columns that we could pull out of this? Year, month, and then the date. The date as in the day of the month, as a number. Year, month, quarter. You want to pass it to your right and get some more from behind you?

Just pass it to your right, you've got some more columns for us? Day of month, keep going to the right. Day of week, yeah. Week of year? Yeah, okay. I'll give you a few more that you might want to think about would be like, is it a holiday? Is it a weekend?

Was it raining that day? Was there a sports event that day? It depends a bit on what you're doing, right? So like if you're predicting soda sales in Soma, you would probably want to know was there a San Francisco Giants ballgame on that day? So like what's in a date is one of the most important pieces of feature engineering you can do, and no machine learning algorithm can tell you whether the Giants were playing that day and that it was important.

So this is where you need to do feature engineering. So I do as many things automatically as I can for you. So here I've got something called add_datepart. What is that? It's something inside fastai.structured. And what is it? Well, let's read the source code. Here it is.
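A slightly simplified sketch of what that function does (a paraphrase; see fastai.structured in the repo for the real thing). It assumes np, pd and re are already available, which they are via fastai.imports:

    def add_datepart(df, fldname, drop=True):
        fld = df[fldname]
        if not np.issubdtype(fld.dtype, np.datetime64):
            df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True)
        targ_pre = re.sub('[Dd]ate$', '', fldname)   # e.g. "saledate" -> "sale"
        for n in ('Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
                  'Is_month_end', 'Is_month_start', 'Is_quarter_end',
                  'Is_quarter_start', 'Is_year_end', 'Is_year_start'):
            df[targ_pre + n] = getattr(fld.dt, n.lower())
        df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9  # seconds since the epoch
        if drop: df.drop(fldname, axis=1, inplace=True)

    add_datepart(df_raw, 'saledate')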

You'll find most of my functions are less than half a page of code. So often rather than having docs, I'm going to try to add docs over time, but they're designed that you can understand them by reading the code. So we're passing in a data frame, and the name of some field, which in this case was sale date, and so in this case we can't go df.fieldname because that would actually find a field called field name literally.

So df[fldname], with square brackets, is how we grab a column where the column name is stored in this variable. So we've now got the field itself, the series. And so what we're going to do is we're going to go through all of these different strings, and getattr is a piece of Python which actually looks inside an object and finds an attribute with that name.

So this is going to go through, and again you can Google for Python get attribute, it's a cool little advanced technique, but this is going to go through and it's going to find for this field it's going to find its year attribute. Now Pandas has got this interesting idea which is if I actually look inside, let's go field equals, this is the kind of experiment I want you to do, play around, sale date.

So I've now got that in a field object, and so I can go field, and I can go field.tab, and let's see, is year in there? Oh it's not. Why not? Well that's because year is only going to apply to Pandas series that are datetime objects. So what Pandas does is it splits out different methods inside attributes that are specific to what they are.

So datetime objects will have a dt attribute defined, and that is where you'll find all the datetime-specific stuff. So what I did was I went through all of these and picked out all of the ones that could ever be interesting for any reason. And this is like the opposite of the curse of dimensionality.

It's like if there is any column or any variant of that column that could ever be interesting at all, add that to your data set and every variation of it you can think of. There's no harm in adding more columns nearly all the time. So in this case we're going to go ahead and add all of these different attributes.

And so for every one I'm going to create a new field that's going to be called the name of your field with the word "date" removed, so it will be "sale" and then the name of the attribute. So we're going to get a sale year, sale month, sale week, sale day, etc etc.

And then at the very end I'm going to remove the original field. Because remember we can't use "sale date" directly because it's not a number. So you're saying this only worked because it was a date type? Did you make it a date type or was it already saved as one in the original?

Yeah, it's already a date type. And the reason it was a date type is because when we imported it, we said parse_dates= and told pandas it's a date. So as long as it looks date-ish and we tell it to parse it as a date, it'll turn it into a date type.

Was there a way to do that so it would just look through all the columns and say "if it looks like a date, make it a date" or do you have to know which one? I think there might be but for some reason it wasn't ideal. Maybe it took lots of time or it didn't always work or for some reason I had to list it here.

I would suggest checking out the docs for pandas.read_csv and maybe on the forum you can tell us what you find because I can't remember offhand. Let's do that one on the same forum thread that Savannah creates because I think it's a reasonably advanced question, but generally speaking the time zone in a properly formatted date will be included in the string and it should pull it out correctly and turn it into a universal time zone.

Generally speaking, it should handle it for you. So, notice, for indexing a column you might think you could simply use the dot notation, df.column. The square brackets one is safer, particularly if you're assigning to a column. If it didn't already exist, you need to use the square brackets format, otherwise you'll get weird errors.

So the square brackets format is safer, the dot version saves me a couple of keystrokes so I probably use it more than I should. In this particular case, because I wanted to grab something that had something inside it, wasn't the name itself, I have to use square brackets. So square brackets is going to be your safe bet if in doubt.

So after I run that, you'll notice that df_raw.columns gives me a list of all of the columns just as strings, and at the end, there they all are. So it's removed sale date and it's added all those. So that's not quite enough. The other problem is that we've got a whole bunch of strings in there.

So here's low, high, medium. So pandas actually has a concept of a category data type, but by default it doesn't turn anything into a category for you. So I've created something called train_cats, which creates categorical variables for everything that's a string. And so what that's going to do is behind the scenes it's going to create a column that's actually a number, it's an integer, and it's going to store a mapping from the integers to the strings.
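A sketch of what train_cats does, in spirit (the real function lives in fastai.structured):

    from pandas.api.types import is_string_dtype

    def train_cats(df):
        for name, col in df.items():
            if is_string_dtype(col):
                # a pandas category stores integer codes plus a mapping back to the strings
                df[name] = col.astype('category').cat.as_ordered()

    train_cats(df_raw)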

The reason it's train_cats is that you use this for the training set. More advanced usage is that when we get to looking at the test and validation sets, this is a really important idea. In fact Terrence came to me the other day and he said, "My model's not working.

Why not?" And he figured it out for himself. It turned out the reason why was because the mappings he was using from string to number in the training set were different to the mappings he was using from string to number in the test set. So therefore in the training set, high might have been 3, but in the test set it might have been 2.

So the two were totally different, and so the model was basically non-predictive. So I have another function called apply_cats, where you pass in your existing training set and it will use the same mappings to make sure your test set or validation set uses the same mappings. So when I go train_cats, it's actually not going to make the data frame look different at all.

Behind the scenes it's going to turn them all into numbers. We finish at 12, 11.50. Let's see how we go, I'll try and finish on time. So you'll see now, remember I mentioned there was this .dt attribute that gives you access to everything, assuming it's about the date time, there's a .cat attribute that gives you access to things assuming something's a category.

And so UsageBand was a string, and so now that I've run train_cats, it's turned it into a category, so I can go df_raw.UsageBand.cat and there's a whole bunch of other things we've got there. So one of the things we've got there is .categories, and you can see here is the list.

Now one of the things you might notice is that this list is in a bit of a weird order, high, low, medium. The truth is, it doesn't matter too much, but what's going to happen when we use the random forest is this is going to be 0, this is going to be 1, this is going to be 2, and we're going to be creating decision trees.

And so we're going to have a decision tree that can split things at a single point. So it'd either be high versus low and medium, or medium versus high and low. That would be kind of weird. It actually turns out not to work too badly, but it'll work a little bit better if you have these in sensible orders.

So if you want to reorder a category, then you can just go cat.set_categories and pass in the order you want, and tell it that it's ordered. And almost every pandas method has an inplace parameter, which rather than returning a new data frame is going to change that data frame. So I'm not going to do that for all of them; I didn't check carefully for every category that should be ordered, but this seems like a pretty obvious one.
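For this column that looks something like the following (assuming the column is called UsageBand):

    # inplace on set_categories was available in the pandas of the time;
    # otherwise, assign the result back to the column.
    df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'],
                                        ordered=True, inplace=True)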

Sure. The UsageBand column is actually going to be, this is actually what our random forest is going to see, these numbers, 1, 0, 2, 1. And they map to the position in this array. And as we're going to learn shortly, a random forest consists of a bunch of trees, each of which is going to make a single split, and a single split is going to be either greater than or less than 1, or greater than or less than 2.

So we could split it into high versus low and medium, which that semantically makes sense. Like is it big, or we could split it into medium versus high and low, which doesn't make much sense. So in practice, the decision tree could then make a second split to say medium versus high and low, and then within the high and low into high and low.

But by putting it in a sensible order, if it wants to split out low, it can do it in one decision rather than two. And we'll be learning more about this shortly. It honestly is not a big deal, but I just wanted to mention it's there. It's also good to know that people, when they talk about different types of categorical variable, specifically you need to know there's a kind of categorical variable called ordinal.

And an ordinal categorical variable is one that has some kind of order, like high, medium, and low. And random forests aren't terribly sensitive to that fact, but it's worth knowing it's there and trying it out. That's what I'm saying. It helps a little bit. It means you can get there with one decision rather than two.

Yeah, exactly. So for free, we get a negative one which refers to missing. And one of the things we're going to do is we're going to actually add one. Somebody pass it back to Paul. We're going to add one to our codes, maybe until he goes, "Let people know it's coming!" So we're going to add one to all of our codes to make missing zero later on.

So for these categories, you're basically mapping strings to different integers. So get_dummies, which we'll get to in a moment, is going to create three separate columns, ones and zeros for high, ones and zeros for medium, ones and zeros for low, whereas this one creates a single column with an integer, 0, 1, or 2.

So at this point, as long as we always make sure we use .cat.codes, the thing with the numbers in, we're basically done. All of our strings have been turned into numbers, our dates have been turned into a bunch of numeric columns, and everything else is already a number. The only other main thing we have to do is notice that we have lots of missing values.

So here is df_raw.isnull(), which is going to return true or false depending on whether something is empty, .sum() is going to add up how many are empty for each series, and then I'm going to sort them and divide by the size of the dataset. So here we have some things which have quite high percentages of nulls.
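In code that's one line (using the display_all helper from earlier):

    display_all(df_raw.isnull().sum().sort_index() / len(df_raw))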

So those are the missing values, wrapped in display_all; maybe I didn't run it. So we're going to get to that in a moment, but I will point something out, which is reading the CSV took a minute or so, the processing took another 10 seconds or so, and from time to time, when I've done a bit of work I don't want to wait for again, I will tend to save where I'm at.

So here I'm going to save it. And I'm going to save it in a format called feather format; this is very, very new. But what this is going to do is it's going to save it to disk in exactly the same basic format that it's actually in, in RAM. This is by far the fastest way to save something, and the fastest way to read it back.

So most of the folks you deal with, unless they're on the cutting edge, won't be familiar with this format, so this will be something you can teach them about. It's becoming the standard. It's actually becoming something that's going to be used not just in pandas, but in Java, in Spark, in lots of things for communicating across computers because it's incredibly fast.

And it's actually co-designed by the guy that made pandas, Wes McKinney. So we can just go df_raw.to_feather and pass in some name. I tend to have a folder called tmp for all of my "as I'm going along" stuff. And so when you go os.makedirs, you can pass in any path you like.

It won't complain if it's already there, if you pass exist_ok=True. If there are some subdirectories, it'll create them for you, so this is a super handy little function. So feather's not installed, because I'm using Crestle for the first time, and it's complaining about that. So if you get a message that something's not installed, if you're using Anaconda, you can conda install it.

Crestle actually doesn't use Anaconda, it uses pip, and so we wait for that to come along. And so now if I run it... Sometimes you may find you actually have to restart Jupyter. I won't do that now because we're nearly out of time, but if you restart Jupyter you'll be able to keep moving along.

So from now on, you don't have to rerun all the stuff that I just did. You can just say pd.read_feather and we've got our data frame back.
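A sketch of the save and load round trip (the folder and file names here are just my choices):

    import os

    os.makedirs('tmp', exist_ok=True)          # create the folder if it isn't there yet
    df_raw.to_feather('tmp/bulldozers-raw')    # written to disk in the same layout as in RAM

    # later, or after restarting the notebook:
    df_raw = pd.read_feather('tmp/bulldozers-raw')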

And we're going to also handle missing continuous values. And so how are we going to do that? So you'll see here we've got a function called proc_df. What is that? It's inside fastai.structured, again. And here it is. So quite a lot of the functions have a few additional parameters that you can provide, and we'll talk about them later, but basically we're providing the data frame to process and the name of the dependent variable, the y field name.

And so all it's going to do is it's going to make a copy of the data frame, it's going to grab the y value, it's going to drop the dependent variable from the original, and then it's going to fix missing. So how do we fix missing? So what we do to fix missing is pretty simple.

If it's numeric, then we fix it by basically saying let's first of all check that it does have some missing values. So if it does have some missing values, in other words the isnull().sum() is non-zero, then we're going to create a new column with the same name as the original plus _na, and it's going to be a boolean column with a 1 any time that value was missing, and a 0 any time it wasn't.

We're going to talk about this again next week, but I'll give you the quick version. Having done that, we're then going to replace the NAs, the missing values, with the median. So anywhere that used to be missing will be replaced with the median, and we'll add a new column to tell us which ones were missing.

We only do that for numeric, we don't need it for categories because Pandas handles categorical variables automatically by setting them to -1. So what we're going to do is if it's not numeric, and it's a categorical type (we'll talk about the maximum number of categories later, but let's assume this is always true, so if it's not a numeric type) we're going to replace the column with its codes, the integers, +1.

So by default Pandas uses -1 for missing, so now 0 will be missing, and 1, 2, 3, 4 will be all the other categories. So we're going to talk about dummies later on in the course, but basically, optionally, if you already know about dummy variables, for columns with a small number of possible values you can turn them into dummies instead of numericalizing them, but we're not going to do that for now.

So for now all we're doing is we're using the categorical codes plus 1, replacing missing values with the median, adding an additional column telling us which ones were replaced, and removing the dependent variable. So that's what proc_df does, and it runs very quickly. So you'll see now, sale price is no longer here.
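The call itself is just this (in the version of the library used in this lesson; later versions also return a dictionary of the values used to fill the NAs):

    df, y = proc_df(df_raw, 'SalePrice')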

We've now got a whole new variable called y that contains sale price. You'll see we've got a couple of extra _na columns at the end. And if I look at that, everything is a number. These booleans are treated as numbers, they're just considered as 0 or 1, they're just displayed as False and True.

So you can see here, is it the end of a month, is it the start of a month, is it the end of a quarter? It's kind of funny, right, because we've got things like a model ID, which presumably is something like a serial number, or it could be like the model identifier that's created by the factory, or something, we've got like a data source ID.

Some of these are numbers, but they're not continuous. It turns out actually random forests work fine with those. We'll talk about why and how and a lot about that in detail, but for now all you need to know is no problem. So as long as this is all numbers, which it now is, we can now go ahead and create a random forest.

So RandomForestRegressor. Random forests are trivially parallelizable. So what that means is that if you've got more than one CPU, which everybody basically will on their computers at home, and if you've got a t2.medium or bigger at AWS, you've got multiple CPUs. Trivially parallelizable means that it will split up the data across your different CPUs and basically linearly scale.

So the more CPUs you have, pretty much it will divide the time it takes by that number. Not exactly, but roughly. So n_jobs=-1 tells the random forest regressor to create a separate job, a separate process basically, for each CPU you have, so that's pretty much what you want all the time.

We fit the model using this new data frame we created and that y value we pulled out, and then get the score. The score is going to be the R^2; we'll define that next week, hopefully some of you already know about the R^2. 1 is very good, 0 is very bad, so as you can see we've immediately got a very high score.
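That's these three lines:

    m = RandomForestRegressor(n_jobs=-1)   # one job per CPU
    m.fit(df, y)
    m.score(df, y)                         # R^2, measured on the same data it was trained on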

So that looks great, but what we'll talk about next week a lot more is that it's not quite great because maybe we had data that had points that looked like this, and we fitted a line that looks like this, when actually we wanted one that looks like that. The only way to know whether we've actually done a good job is by having some other dataset that we didn't use to train the model.

Now we're going to learn about some ways with random forests we can kind of get away without even having that other dataset, but for now what we're going to do is we're going to split into 12,000 rows which we're going to put in a separate dataset called the validation set versus the training set that's going to contain everything else.

And our dataset is going to be sorted by date, and so that means that the most recent 12,000 rows are going to be our validation set. Again, we'll talk more about this next week, it's a really important idea, but for now we can just recognize that if we do that and run it, I've created a little thing called print score and it's going to print out the root mean squared error between the predictions and actuals for the training set, for the validation set, the R^2 for the training set and the validation set.
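A sketch of that split and the print_score helper, close to what's in the notebook:

    import math

    def split_vals(a, n):
        return a[:n].copy(), a[n:].copy()

    n_valid = 12000                      # hold out the 12,000 most recent rows
    n_trn = len(df) - n_valid
    X_train, X_valid = split_vals(df, n_trn)
    y_train, y_valid = split_vals(y, n_trn)

    def rmse(x, y):
        return math.sqrt(((x - y) ** 2).mean())

    def print_score(m):
        print([rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
               m.score(X_train, y_train), m.score(X_valid, y_valid)])

    m = RandomForestRegressor(n_jobs=-1)
    m.fit(X_train, y_train)
    print_score(m)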

And you'll see that actually the R^2 for the training set was 0.98, but for the validation set it was 0.89. Then the RMSE, and remember this is on the logs, was 0.09 for the training set and 0.25 for the validation set. Now if you actually go to Kaggle and go to the leaderboard, in fact let's do it right now, it's got private and public, I'll click on public leaderboard, and we can go down and find out where 0.25 sits.

So there are 475 teams, and generally speaking if you're in the top half of a Kaggle competition you're doing pretty well. So 0.25, here we are, what was it exactly, 0.2507, yeah, about 110th. So we're about in the top 25%. So this is pretty cool: with no thinking at all, using the defaults for everything, we're in the top 25% of a Kaggle competition.

So random forests are insanely powerful, and this totally standardized process is insanely good for any dataset. So we're going to wrap up, what I'm going to ask you to do for Tuesday is take as many Kaggle competitions as you can, whether they be running now or old ones or datasets that you're interested in for hobbies or work, and please try it.

Try this process. And if it doesn't work, tell us on the forum, here's the dataset I'm using, here's where I got it from, here's the stack trace of where I got an error, or here's if you use my print score function or something like it, show us what the training versus test set looks like, we'll try and figure it out.

But what I'm hoping we'll find is that all of you will be pleasantly surprised that with an hour or two of information you've got today, you can already get better models than most of the very serious practicing data scientists that compete in Kaggle competitions. Okay? Great. Good luck, and I'll see you on the forums.

Oh, one more thing, Friday, the other class said a lot of them had class during my office hours, so if I made them 1-3 instead of 2-4 on Fridays, is that okay? Seminar. Okay, I have to find a whole other time. All right, I will talk to somebody who actually knows what they're doing, unlike me, about finding office hours.

Thank you. (inaudible) Absolutely. (inaudible)