Getting In Shape For The Sport Of Data Science
00:00:00.000 |
OK, so we're here at the Melbourne R meetup, and we are talking about some techniques that 00:00:09.560 |
Jeremy Howard has used to do as well as he can in a variety of Kaggle competitions. 00:00:17.920 |
And we're going to start by having a look at some of the tools that I've found useful 00:00:24.120 |
in predictive modelling in general and in Kaggle competitions in particular. 00:00:31.480 |
So I've tried to write down here what I think are some of the key steps. 00:00:37.500 |
So after you download data from a Kaggle competition, you end up with CSV files, generally speaking, 00:00:46.720 |
which can be in all kinds of formats. 00:00:49.800 |
So here's the first thing you see when you open up the time series CSV file. 00:00:57.120 |
So each of these columns is actually-- oh, here we come-- is actually a quarterly time series. 00:01:09.360 |
And so because-- well, for various reasons, each one's different lengths, and they kind 00:01:16.760 |
of start further along, the particular way that this was provided didn't really suit 00:01:24.280 |
And in fact, if I remember correctly, that's already-- I've already adjusted it slightly 00:01:30.640 |
because it originally came in rows rather than columns. 00:01:35.560 |
This is how it originally came in rows rather than columns. 00:01:39.720 |
So this is where this kind of data manipulation toolbox comes in. 00:01:45.280 |
There's all kinds of ways to swap rows and columns around, which is where I started. 00:01:49.720 |
The really simple approach is to select the whole lot, copy it, and then paste it with the transpose option. 00:02:02.380 |
And then having done that, I ended up with something which I could open up in-- let's 00:02:13.440 |
This is the original file in VIM, which is my text editor of choice. 00:02:20.320 |
This is actually a really good time to get rid of all of those kind of leading commas. 00:02:28.000 |
So this is where stuff like VIM is great, even things like Notepad++ and Emacs and any 00:02:32.980 |
of these kind of power user text editors will work fine. 00:02:35.900 |
As long as you know how to use regular expressions-- and if you don't, I'm not going to show you 00:02:43.640 |
So in this case, I'm just going to go, OK, let's use a regular expression. 00:02:47.400 |
So I say, yes, to substitute for the whole file, start with any number of commas, and replace them with nothing. 00:02:59.800 |
So I can now save that, and I've got a nice, easy file that I can start using. 00:03:06.040 |
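The same clean-up can be scripted rather than done interactively in the editor. A minimal R sketch of that regex substitution, assuming the file is called timeseries.csv (the real file name isn't shown in the talk):

```r
# Strip any run of leading commas from every line of the file (assumed name).
lines <- readLines("timeseries.csv")
lines <- gsub("^,+", "", lines)            # same regex idea as the editor substitution
writeLines(lines, "timeseries_clean.csv")
```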
So that's why I've listed this idea of data manipulation tools in my toolbox. 00:03:13.120 |
And to me, VIM, or some regular-expression-capable text editor which can handle large files, is a key part of that toolbox. 00:03:22.520 |
So just in case you didn't catch that: regular expressions. 00:03:30.880 |
Probably the most powerful tool for doing text and data manipulation that I know of. 00:03:40.160 |
The most powerful types of regular expressions, I would say, would be the ones that are in Perl. 00:03:48.860 |
Any C program that uses the PCRE engine has the same regular expressions as Perl, more or less. 00:03:55.920 |
C# and .NET have the same regular expressions as Perl, more or less. 00:03:59.400 |
So this is a nice example of one bunch of people getting it right, and everybody else following along. 00:04:05.960 |
VIM's regular expressions are slightly different, unfortunately, which annoys me no end, but there you go. 00:04:14.240 |
So yeah, make sure you've got a good text editor that you know well how to use. 00:04:20.800 |
Something with a good macro facility is nice as well. 00:04:24.680 |
You can record a series of keystrokes and hit a button, and it repeats them basically as many times as you like. 00:04:31.040 |
I also wrote Perl here because, to me, Perl is a rather unloved programming language, 00:04:40.000 |
but if you think back to where it comes from, it was originally developed as the Swiss Army chainsaw of text processing. 00:04:50.720 |
And today, that is something it still does, I think, better than any other tool. 00:04:55.520 |
It has amazing command line options you can pass to it that do things like run the following 00:05:02.740 |
command on every line in a file, and so on. 00:05:09.440 |
There's a command line option to back up each file to a .bak copy before changing it in place. 00:05:15.120 |
I find with Perl I can do stuff which would take me much, much longer in any other tool. 00:05:24.120 |
Even simple little things: I was hacking some data on the weekend where I had to concatenate 00:05:27.900 |
a whole bunch of CSV files, but I only wanted to keep the header line from the first one, 00:05:32.360 |
because every one of those CSV files had a header line I had to delete. 00:05:36.160 |
So in Perl - in fact, it's probably still going to be sitting here in my history - so in Perl, 00:05:51.760 |
that's basically: minus n means do this on every single row, minus e means I'm not even 00:05:56.720 |
going to write a script file, I'm going to give you the thing to do right here on 00:06:00.240 |
the command line, and here's a piece of rather difficult to comprehend Perl, but trust me, 00:06:05.800 |
what it says is if the line number is greater than one, then print that line. 00:06:10.240 |
So here's something to strip the first line from every file. 00:06:15.240 |
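The talk does this as a Perl one-liner; a rough R equivalent of the same job, with the file names as assumptions:

```r
# Concatenate a directory of CSVs, keeping the header line only from the first file.
files  <- list.files(pattern = "\\.csv$")
header <- readLines(files[1], n = 1)
body   <- unlist(lapply(files, function(f) readLines(f)[-1]))  # drop line 1 of every file
writeLines(c(header, body), "combined.csv")
```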
So this kind of stuff you can do in Perl is great, and I see a lot of people in the forums 00:06:19.800 |
who complain about the format of the data wasn't quite what I expected or not quite 00:06:24.800 |
convenient, can you please change it for me, and I always think, well, this is part of 00:06:28.720 |
data science, this is part of data hacking, this is data munging or data manipulation. 00:06:35.240 |
There's actually a really great book - I don't know if it's hard to find nowadays, 00:06:41.680 |
but I loved it - called Data Munging with Perl, and it's a whole book about all the cool stuff you can do with Perl for this kind of work. 00:06:52.840 |
So okay, I've now got the data into a form where I can kind of load it up into some tool and start looking at it. 00:07:03.280 |
Now, your first reaction might be to think, Excel, not so good for big files, to which 00:07:10.080 |
my reaction would be, if you're just looking at the data for the first time, why are you looking at the whole thing? 00:07:18.600 |
And again, this is the kind of thing you can do in your data manipulation piece, that thing 00:07:22.280 |
I just showed you in Perl - something like "print if rand() > 0.9" - that's going to give you a random sample of roughly a tenth of the rows. 00:07:32.400 |
So if you've got a huge data file, get it to a size that you can easily start playing 00:07:37.020 |
with it, which normally means some random sampling. 00:07:40.320 |
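A minimal sketch of that sampling idea in R, with file names assumed; it still reads the whole file once, which is fine as a one-off:

```r
# Keep the header plus roughly 10% of the remaining rows, so the result opens easily in Excel.
lines <- readLines("huge.csv")
keep  <- c(TRUE, runif(length(lines) - 1) > 0.9)
writeLines(lines[keep], "sample.csv")
```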
So I like to look at it in Excel, and I will show you for a particular competition how 00:07:51.280 |
So let's have a look, for example, at a couple. 00:07:58.200 |
So here's one which the New South Wales government basically ran, which was to predict how long 00:08:03.680 |
it's going to take cars to travel along each segment of the M4 motorway, in each direction. 00:08:12.520 |
The data for this is a lot of columns, because every column is another route, and lots and 00:08:19.940 |
lots of rows, because every row is another two-minute observation, and very hard to get 00:08:25.120 |
a feel for what's going on. There were various terrific attempts on the 00:08:29.960 |
forum at trying to create animated pictures of what the road looks like over time. 00:08:34.840 |
I did something extremely low-tech, which is something I'm proud of spending a lot of 00:08:40.480 |
time doing these things for extremely low-tech, which is I created a simple little macro in 00:08:45.720 |
Excel which selected each column, and then went conditional formatting, color scales, 00:08:54.600 |
red to green, and I ran that on each column, and I got this picture. 00:09:00.280 |
So here's each route on this road, and here's how long it took to travel that route at this particular time. 00:09:08.200 |
And isn't this interesting, because I can immediately see what traffic jams look like. 00:09:14.120 |
See how they kind of flow as you start getting a traffic jam here? 00:09:17.880 |
They flow along the road as time goes on, and you can then start to see at what kind 00:09:23.720 |
of times they happen and where they tend to start, so here's a really big jam. 00:09:30.280 |
So if we go into Sydney in the afternoon, then obviously you start getting these jams 00:09:37.120 |
up here, and as the afternoon progresses, you can see the jam moving so that at 5pm it looks 00:09:44.880 |
like there's actually a couple of them, and at the other end of the road it stays jammed 00:09:50.000 |
until everybody's cleared up through the freeway. 00:09:54.080 |
So you get a real feel for it, and even when it's not peak hour, and even in some of the 00:09:58.520 |
period areas which aren't so busy, you can see that's interesting. 00:10:03.000 |
There are basically parts of the freeway which, out of peak hour, they're basically constant 00:10:12.160 |
So when we actually got on the phone with the RTA to take them through the winning model, 00:10:17.320 |
actually the people that won this competition were kind enough to organise a screencast 00:10:21.600 |
with all the people in the RTA and from Kaggle to show the winning model. 00:10:24.960 |
And the people from RTA said, "Well, this is interesting, because you tell me in your 00:10:29.880 |
model," they said, "What we looked at was we basically created a model that looked at 00:10:34.760 |
for a particular time, for a particular route. 00:10:37.960 |
We looked at the times and routes just before and around it on both sides." 00:10:43.760 |
And I remember one of the guys said, "That's weird, because normally these kind of queues 00:10:49.160 |
traffic jams only go in one direction, so why would you look at both sides?" 00:10:53.600 |
And so I was able to quickly say, "OK, guys, that's true, but have a look at this." 00:10:58.520 |
So if you go to the other end, you can see how sometimes although queues kind of form 00:11:03.200 |
in one direction, they can kind of slide away in the other direction, for example. 00:11:07.880 |
So by looking at this kind of picture, you can see what your model is going to have to do. 00:11:14.520 |
So you can see what kind of inputs it's going to have and how it's going to have to be set up. 00:11:19.280 |
And you can immediately see that if you created the model that basically tried to predict each 00:11:22.820 |
thing based on the previous few periods of the routes around it, whatever modeling technique 00:11:32.520 |
you're using, you're probably going to get a pretty good answer. 00:11:35.640 |
And interestingly, the guys that won this competition, this is basically all they did, 00:11:41.160 |
They used random forests, as it happens, which we'll talk about soon. 00:11:46.700 |
They added a couple of extra things, which was, I think, the rate of change of time. 00:11:53.800 |
So a really good example of how visualization can quite quickly tell you what you need to know. 00:12:03.360 |
This is a recent competition that was set up by the dataists.com blog. 00:12:13.400 |
And what it was, was they wanted to try and create a recommendation system for R packages. 00:12:20.160 |
So they got a bunch of users to say, OK, this user, for this package, doesn't have it installed. 00:12:29.480 |
This user, for this package, does have it installed. 00:12:33.040 |
So you can kind of see how this is structured. 00:12:35.880 |
They added a bunch of additional potential predictors for you. 00:12:39.480 |
How many dependencies does this package have, how many suggestions does this package have, 00:12:43.520 |
how many imports, how many of those task views on CRAN is it included in, is it a core 00:12:50.120 |
package, is it a recommended package, who maintains it, and so forth. 00:12:55.920 |
So I found this not particularly easy to get my head around what this looks like. 00:13:01.320 |
So I used my number one most favorite tool for data visualization and ad hoc analysis, which is a pivot table. 00:13:10.040 |
A pivot table is something which dynamically lets you slice and dice your data. 00:13:16.480 |
So if you've used maybe Tableau or something like that, you'll know the feel. 00:13:21.880 |
This is kind of like Tableau, except it doesn't cost a thousand dollars. 00:13:24.440 |
No, I mean, Tableau's got cool stuff as well, but this is fantastic for most things I find 00:13:31.560 |
And so in this case, I simply drag user ID up to the top, and I dragged package name 00:13:38.760 |
down to the side, and just quickly turned this into a matrix, basically. 00:13:44.760 |
And so you can see here what this data looks like, which is that those nasty people at 00:13:49.800 |
dataists.com have deleted a whole bunch of things in this matrix. 00:13:54.360 |
So that's the stuff that they want us to predict. 00:13:57.160 |
And then we can see that generally, as you expect, there's ones and zeros. 00:14:01.000 |
There's some weird shit going on here where some people have things apparently there twice, 00:14:05.440 |
which suggests to me maybe there's something funny with the data collection. 00:14:10.320 |
And there's other interesting things. There are some things which seem to be quite widely installed. 00:14:22.400 |
And there is this mysterious user number five, who is the world's biggest R package slut. 00:14:33.120 |
And I can only imagine that ADACGH is particularly hard to install, because not even user number five has it. 00:14:42.440 |
So you can see how, creating a simple little picture like this, I can get a sense of what's going on. 00:14:52.500 |
So I took that data in the R package competition, and I thought, well, if I just knew for a 00:15:00.600 |
particular-- so let's say this empty cell is the one we're trying to predict. 00:15:04.400 |
So if I just knew in general how commonly acceptance sampling was installed, and how 00:15:10.040 |
often user number one installed stuff, I'd probably have a good sense of the probability of user number one having it installed. 00:15:19.800 |
So to me, one of the interesting points here was to think, actually, I don't think I care 00:15:26.520 |
So I jumped into R, and all I did was I basically said, OK, read that CSV file in. 00:15:36.960 |
There's a whole bunch of lines here, because this is my entire solution. 00:15:39.080 |
But I'm just going to show you the lines I used for submission number one. 00:15:45.480 |
Although user is a number, I treated it as a factor, because user number one is not 50 times worse 00:15:50.400 |
than user number 50; and those trues and falses, I turned them into ones and zeros to make life a bit easier. 00:15:58.840 |
And now apply the mean function to each user across their installations, and apply the 00:16:04.360 |
mean function to each package across that package's installations. 00:16:08.880 |
So now I've got a couple of lookups, basically, that tell me user number 50 installs this 00:16:13.360 |
percent of packages, this particular package is installed by this percent of users. 00:16:20.680 |
And then I just stuck them, basically, back into my file of predictors. 00:16:26.960 |
So I basically did these simple lookups for each row to find lookup the user and find 00:16:32.640 |
out for that row the mean for that user and the mean for that package. 00:16:40.600 |
At that point, I then created a GLM, in which obviously I had 00:16:51.800 |
my ones and zeroes of installations as the thing I was predicting. 00:16:55.320 |
And my first version I had UP and PP, so these two probabilities as my predictors. 00:17:01.600 |
In fact, no, in the first version it was even easier than that. 00:17:06.760 |
All I did, in fact, was I took the max of those two things. 00:17:12.200 |
So pmax, if you're not familiar with R, is just something that does a max on each row across a couple of vectors. 00:17:17.280 |
In R, nearly everything works on vectors by default, except for max. 00:17:29.720 |
So this user installs 30% of things, and this package is installed by 40% of users. 00:17:38.640 |
And I actually created a GLM with just one predictor. 00:17:42.780 |
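A minimal R sketch of the approach just described; the column names (User, Package, Installed) are assumptions, since the actual script isn't shown:

```r
d <- read.csv("training_data.csv")
d$Installed <- as.numeric(d$Installed)            # TRUE/FALSE -> 1/0
d$User      <- factor(d$User)                     # a user id is a label, not a quantity

user_p <- tapply(d$Installed, d$User, mean)       # what fraction of packages each user installs
pkg_p  <- tapply(d$Installed, d$Package, mean)    # how widely each package is installed

d$UP   <- user_p[as.character(d$User)]            # look the means back up for every row
d$PP   <- pkg_p[as.character(d$Package)]
d$MaxP <- pmax(d$UP, d$PP)                        # element-wise max of the two probabilities

fit <- glm(Installed ~ MaxP, data = d, family = binomial)   # one predictor, as described
```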
The benchmark that was created by the dataists people for this used a GLM on all of those 00:17:48.500 |
predictors, including all kinds of analysis of the manual pages and maintainer 00:17:55.560 |
names and God knows what, and they had an AUC of 0.8. 00:18:00.200 |
This five line of code thing had an AUC of 0.95. 00:18:05.520 |
So the message here is, don't overcomplicate things. 00:18:13.120 |
If people give you data, don't assume that you need to use it, and look at pictures. 00:18:19.560 |
So if we have a look at kind of my progress in there, so here's my first attempt, which 00:18:24.640 |
was basically to multiply the user probability by the package probability. 00:18:30.880 |
And you can see one of the nice things in Kaggle is you get a history of your results. 00:18:34.400 |
So here's my 0.84 AUC, and then I changed it to using the maximum of the two, and there's my 0.95. 00:18:45.480 |
Imagine how powerful this will be when I use all that data that they gave us with a fancy 00:18:54.600 |
So you can really see that actually a bit of focused simple analysis can often take you a really long way. 00:19:00.880 |
So if we look to the next page, we can kind of see where, you know, I kind of kept thinking 00:19:07.120 |
that if a bit of that works, more would be better - I threw more at it, random forests and so on, and that went backwards. 00:19:13.280 |
And then actually I thought, you know, there is one piece of data which is really useful, which is the package dependencies. 00:19:19.840 |
If somebody has installed package A, and it depends on package B, and I know they've got 00:19:25.720 |
package A, then I also know they've got package B. 00:19:32.920 |
That's the kind of thing I find a bit difficult to do in R, because I think R is a slightly awkward tool for that kind of work. 00:19:38.900 |
So I did that piece in a language which I quite like, which is C#, imported it back into R, and 00:19:44.400 |
then as you can see, each time I send something off to Kaggle, I generally copy and paste 00:19:49.000 |
into my notes just the line of code that I ran, so I can see exactly what it was. 00:19:54.220 |
So here I added this dependency graph, and I jumped up to 0.98. 00:20:00.700 |
That's basically as far as I got in this competition, which was enough for sixth place. 00:20:08.480 |
Yes, if somebody has package A, and it depends on package B, then obviously that means they've got package B as well. 00:20:16.460 |
If somebody doesn't have package B, and package A depends on it, then you know they definitely don't have package A. 00:20:23.140 |
And so when I went back and put that in after the competition was over - I realized I 00:20:26.680 |
had forgotten it - I realized I could have come about second if I'd just done that. 00:20:31.000 |
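A rough sketch of those two dependency rules in R; the deps lookup is an assumption for illustration (deps[["A"]] would be the character vector of packages that A depends on):

```r
# Returns 1 or 0 when a hard dependency rule decides the answer, NA otherwise.
predict_from_deps <- function(installed, missing, pkg, deps) {
  # Rule 1: if anything the user has depends on pkg, they must have pkg.
  if (any(vapply(installed, function(a) pkg %in% deps[[a]], logical(1)))) return(1)
  # Rule 2: if pkg depends on something the user is known not to have, they can't have pkg.
  if (length(intersect(deps[[pkg]], missing)) > 0) return(0)
  NA  # no hard rule applies; fall back to the statistical model
}
```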
In fact, to get the top three in this competition, that's probably as much modeling as you needed. 00:20:38.720 |
So I think you can do well in these comps without necessarily being an R expert or necessarily 00:20:43.500 |
being a stats expert, but you do need to kind of dig into the toolbox appropriately. 00:20:51.960 |
So let's go back to my extensive slide presentation. 00:20:58.340 |
So you can see here we talked about data manipulation, about interactive analysis, we've talked 00:21:04.220 |
a bit about visualizations, and I include there even simple things like those tables 00:21:11.720 |
As I just indicated, in my toolbox is some kind of general purpose programming tool. 00:21:19.960 |
And to me, there's kind of three or four clear leaders in this space. 00:21:24.520 |
And I know from speaking to people in the data science world, about half the people know how to program in one of these and about half don't. 00:21:33.740 |
You definitely should, because otherwise all you can do is use stuff that other people have built for you. 00:21:51.400 |
And I would combine it with these particular libraries for, yes, question? 00:21:56.760 |
Yeah, I was just wondering whether you see them as complementary or competing? 00:22:05.880 |
And I'll come to that in the very next bullet point, yes. 00:22:11.040 |
So this general purpose programming tools is for the stuff that R doesn't do that well. 00:22:16.840 |
And even one of the guys who wrote R, Ross Ihaka, says he's not that fond nowadays of various things about it. 00:22:29.200 |
Whereas there are other languages which are just so powerful and so rich and so beautiful. 00:22:33.600 |
I should have actually included some of the functional languages in here too, like Haskell 00:22:39.240 |
But if you've got a good powerful language, a good powerful matrix library, and a good 00:22:45.080 |
powerful machine learning toolkit, you're doing great. 00:22:54.280 |
A REPL is where you type in a line of code, like in R, and it immediately gives you the result back. 00:23:01.920 |
You can use IPython, which is a really fantastic REPL for Python. 00:23:10.320 |
And in fact, the other really nice thing in Python is matplotlib, which gives you a really nice plotting library. 00:23:19.280 |
Much less elegant, but just as effective for C# and just as free is the MSChart controls. 00:23:28.720 |
I've written a kind of a functional layer on top of those to make them easier to do 00:23:32.120 |
analysis with, but they're super fast and super powerful, so that only takes 10 minutes. 00:23:41.080 |
There's a really brilliant thing very, very underutilized called Eigen, which originally 00:23:45.440 |
came from the KDE project and just provides an amazingly powerful kind of vector and scientific computing library for C++. 00:23:59.040 |
Java to me is something that used to be on a par with C# back in the 1.0, 1.1 days. 00:24:05.640 |
It's looking a bit sad nowadays, but on the other hand, it has just about the most powerful 00:24:11.520 |
general purpose machine learning library on top of it, which is Weka. 00:24:15.440 |
So there's a lot to be said for using that combination. 00:24:19.320 |
In the end, if you're a data scientist who doesn't yet know how to program, my message 00:24:24.240 |
And I don't think it matters too much, which one you pick. 00:24:26.840 |
I would be picking one of these, but without it, you're going to be struggling to go beyond 00:24:47.080 |
Yeah, OK, so the question was about visualization tools and an equivalent to SAS JMP. 00:24:55.360 |
Yeah, I would have a look at something like GGobi. 00:25:01.480 |
GGobi is a fascinating tool which plays in the same kind of space - and it's free. It 00:25:17.400 |
supports this concept of brushing, which is this idea that you can look at a whole bunch 00:25:21.600 |
of plots and scatter plots and parallel coordinate plots and all kinds of plots, and you can 00:25:26.360 |
highlight one area of one plot, and it will show you where those points are in all the other plots. 00:25:32.880 |
And so in terms of really powerful visualization libraries, I think GGobi would be where I'd go. 00:25:41.400 |
Having said that, it's amazing how little I use it in real life. 00:25:47.280 |
Because things like Excel and what I'm about to come to, which is ggplot2, although much 00:25:54.320 |
less fancy than things like JMP and Tableau and GGobi, support a hypothesis-driven problem-solving approach really well. 00:26:06.480 |
Something else that I do is I tend to try to create visualizations which meet my particular needs for the problem at hand. 00:26:22.200 |
And the time series problem is one in which I used a very simple ten-line JavaScript piece 00:26:28.680 |
of code to plot every single time series in a huge mess like this. 00:26:35.040 |
Now you kind of might think, well, if you're plotting hundreds and hundreds of time series, 00:26:40.000 |
how much insight are you really getting from that? 00:26:41.880 |
But I found it was amazing how just scrolling through hundreds of time series, how much 00:26:49.680 |
And what I then did was when I started modeling this, was I then turned these into something 00:26:58.480 |
a bit better, which was to basically repeat it, but this time I showed both the orange, 00:27:10.280 |
which is the actuals, and the blues, which is my predictions. 00:27:14.640 |
And then I put the metric of how successful this particular time series was. 00:27:20.880 |
So I kind of found that using more focused kind of visualization development, in this 00:27:30.520 |
case I could immediately see whereabouts were these, which numbers were high, so here's 00:27:37.460 |
one here, point one, that's a bit higher than the others, and I could immediately kind of 00:27:40.280 |
see what have I done wrong, and I could get a feel of how my modeling was going straight. 00:27:45.740 |
So I tend to think you don't necessarily need particularly sophisticated visualization tools, 00:27:52.480 |
they just need to be fairly flexible and you need to know how to drive them to give you 00:27:59.400 |
So through this kind of visualization, I was able to make sure every single chart in this 00:28:05.080 |
competition; if it wasn't matching well, I'd look at it and I'd say, yeah, it's not 00:28:10.840 |
matching well because there was just a shock in some period which couldn't possibly be predicted. 00:28:18.360 |
And so this was one of the competitions that I won, and I really think that this visualization made a big difference. 00:28:29.760 |
So I mentioned I was going to come back to one really interesting plotting tool, which is ggplot2. 00:28:36.080 |
ggplot2 is created by a particularly amazing New Zealander who seems to have more time 00:28:43.480 |
than everybody else in the world combined and creates all these fantastic tools. 00:28:49.140 |
I just wanted to show you what I meant by a really powerful but kind of simple plotting tool. 00:28:55.520 |
Here's something really fascinating, you know how creating scatter plots with lots and lots 00:28:59.680 |
of data is really hard because you end up with just big black blobs. 00:29:04.440 |
So here's a really simple idea, which is why don't you give each point in the data a kind 00:29:09.880 |
of a level of transparency, so that the more they sit on top of each other, it's like transparent 00:29:15.760 |
disks stacking up and getting darker and darker. 00:29:18.920 |
So in this amazing R package called ggplot2, you can do exactly that. 00:29:24.400 |
So here's something that says plot the carats of a diamond against its price, and I want 00:29:30.680 |
you to vary what's called the alpha channel - for the graphics geeks amongst you, you know 00:29:34.020 |
that means kind of the level of transparency - and I want you to basically set the alpha 00:29:39.180 |
channel for each point to be one over 10, or one over 100, or one over 200. 00:29:43.640 |
And you end up with these plots, which actually show you kind of the heat, you know, the amount 00:29:50.840 |
And it's just so much better than any other approach to scatter plots that I've ever seen. 00:29:55.880 |
So simple, and just one little line of code in your ggplot call. 00:30:01.920 |
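The alpha-blended scatter plot described above, using the diamonds data set that ships with ggplot2; with alpha set to 1/100, it takes about 100 overlapping points to reach full darkness:

```r
library(ggplot2)
ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 1/100)   # semi-transparent points reveal where the data piles up
```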
And this, by the way, is in a completely free chapter of the book that he's got up on his website. 00:30:06.480 |
This is a fantastic book - you should definitely buy it - by the author of the package, about the package. 00:30:11.760 |
But this, the most important chapter, is available free on his website, so check it out. 00:30:21.980 |
Here's a simple approach of plotting a loess smoother through a bunch of data - always handy. 00:30:28.020 |
But every time you plot something like this, you should see the confidence intervals. 00:30:34.600 |
The kind of fit you normally want to see is a loess smoother. 00:30:42.560 |
So if you ask ggplot2 for a fit, it gives you a loess smoother by default, and it gives you the confidence interval as well. 00:30:48.320 |
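A small sketch of that, using a built-in data set purely for illustration; geom_smooth() draws both the smoother and, by default, its confidence band:

```r
library(ggplot2)
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "loess")   # loess fit with its confidence ribbon
```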
So it makes it hard to create really bad graphs in ggplot2, although some people have managed. 00:30:59.840 |
Things like box plots all stacked up next to each other, it's such an easy way of seeing 00:31:04.160 |
in this case how the color of diamonds varies. 00:31:07.920 |
They've all got roughly the same median, but some of them have really long tails in their distributions. 00:31:16.920 |
And so impressive that in this chapter of the book, he shows a few options. 00:31:21.480 |
Here's what would happen if you used a jitter approach, and he's got another one down here, 00:31:27.120 |
which is, here's what would happen if you used that alpha transparency approach, and 00:31:31.040 |
you can really compare the different approaches. 00:31:35.360 |
So ggplot2 is something which, and I'll scroll through these so you can see what kind 00:31:39.320 |
of stuff you can do, is a really important part of the toolbox. 00:31:46.000 |
Okay, so we do lots of scatter plots, and scatter plots are really powerful. 00:31:50.920 |
And sometimes you actually want to see how, if the points are kind of ordered chronologically, things change over time. 00:31:57.240 |
So one way to do that is to connect them up with a line, which is pretty bloody hard to read. 00:32:01.840 |
So if you take this exact thing, but just add this simple thing, set the color to be 00:32:07.800 |
related to the year of the date, and then bang. 00:32:11.240 |
Now you can see, by following the color, exactly how this is sorted. 00:32:17.440 |
And so you can see we've got, here's one end here, here's one end here, so ggplot2 again 00:32:23.360 |
has done fantastic things to make us understand this data more easily. 00:32:34.760 |
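Roughly that idea in code, using the economics data set that ships with ggplot2: connect the points in date order and colour the path by year so the direction of time is visible.

```r
library(ggplot2)
ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_path(aes(colour = as.numeric(format(date, "%Y")))) +  # colour traces the years
  labs(colour = "year")
```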
How many people here have used the caret package? 00:32:38.760 |
So I'm not going to show you caret in detail, but I will tell you this. 00:32:42.600 |
If you go into R and you type something like "my model equals train, on my data, using an SVM", that's about all there is to it. 00:32:58.640 |
You've got a command called train, and you can pass in a string which is any of 300 different, 00:33:04.480 |
I think it's about 300 different possible models, classification and regression models. 00:33:11.840 |
And then you can add various things in here about saying I want you to center the data 00:33:15.120 |
first, please, and do a PCA on it first, please, and it just, you know, kind of does it all for you. 00:33:25.680 |
It can do things like remove columns from the data which hardly vary at all, and therefore aren't much use. 00:33:35.040 |
It can automatically remove columns from the data that are highly collinear, but most powerfully 00:33:39.940 |
it's got this wrapper that basically lets you take any of hundreds and hundreds of the most 00:33:43.880 |
powerful algorithms, many of them really hard to use directly, and they can all now be driven through that one train function. 00:33:52.880 |
I don't know how many of you have tried to use SVMs, but they're really hard to get a good result from, 00:33:57.800 |
because they depend so much on the parameters. 00:34:00.880 |
In this version, it automatically does a grid search to find the best parameters. 00:34:06.120 |
So you just run one command and it does the SVM for you. 00:34:10.320 |
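A hedged sketch of that one-command interface; my_predictors, my_outcome and my_new_data are placeholder objects, and "svmRadial" is one of caret's many supported method names:

```r
library(caret)
fit <- train(x = my_predictors, y = my_outcome,
             method     = "svmRadial",                 # which model to fit
             preProcess = c("center", "scale", "pca"),  # centre, scale and PCA the inputs first
             tuneLength = 5)                            # grid-search a few candidate parameter sets
predict(fit, newdata = my_new_data)
```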
So you definitely should be using caret. 00:34:16.720 |
There's one more thing in the toolbox I wanted to mention, which is you need to use some kind of version control. 00:34:24.160 |
How many people here have used a version control tool like git, CVS, SVN? 00:34:29.560 |
Okay, so let me give you an example from our terrific designer at Kaggle. 00:34:40.400 |
He's recently been changing some of the HTML on our site and he checked it into this version control system. 00:34:46.120 |
And it's so nice, right, because I can go back to any file now and I can see exactly 00:34:52.840 |
what was changed and when, and then I can go through and I can say, "Okay, I remember 00:35:11.280 |
And you can see with my version control tool, it's keeping track of everything I can do. 00:35:16.520 |
Can you see how powerful this is for modeling? 00:35:19.520 |
Because you go back through your submission history at Kaggle and you say, "Oh shit, I 00:35:29.960 |
Go back into your version control tool and have a look at the history, so the commits 00:35:38.160 |
list, and you can go back to the date where Kaggle shows you that you had a really shit-hot 00:35:43.640 |
result and you can't now remember how the hell you did it. 00:35:46.760 |
And you go back to that date and you go, "Oh yeah, it's this one here," and you go and 00:35:53.840 |
And it can do all kinds of cool stuff like it can merge back-in results from earlier 00:35:58.360 |
pushes or you can undo the change you made between these two dates, so on and so forth. 00:36:04.640 |
And most importantly, at the end of the competition, when you win, and Anthony sends you an email 00:36:11.120 |
Send us your winning model," and you go, "Oh, I don't have the winning model anymore." 00:36:16.080 |
You can go back into your version control tool and ask for it as it was on the day that 00:36:31.640 |
There's quite a lot of other things I wanted to show you, but I don't have time to do. 00:36:34.400 |
So what I'm going to do is I'm going to jump to this interesting one, which was about predicting 00:36:43.700 |
which grants would be successful or unsuccessful at the University of Melbourne, based on data 00:36:49.320 |
about the people involved in the grant and all kinds of metadata about the grant itself. 00:36:57.240 |
This one's interesting because I won it by a fair margin, kind of from 0.967 to 0.97 is 00:37:07.320 |
It's interesting to think, "What did I do right this time and how did I set this up?" 00:37:17.080 |
Actually what I did in this was I used a random forest. 00:37:19.560 |
So I'm going to tell you guys a bit about random forests. 00:37:22.480 |
What's also interesting in this is I didn't use R at all. 00:37:28.480 |
That's not to say that R couldn't have come up with a pretty interesting answer. 00:37:32.920 |
The guy who came second in this comp used SAS, but I think he used like 12 gig of RAM, multi-core machines, that kind of thing. 00:37:44.760 |
So I'll show you an approach which is very efficient as well as being very powerful. 00:38:04.000 |
The reason that I didn't use R for this is because the data was kind of complex. 00:38:08.200 |
Each grant had a whole bunch of people attached to it. 00:38:14.680 |
I don't know how many of you guys are familiar with kind of normalization strategies, but 00:38:24.600 |
basically, denormalized form basically means you had a whole bunch of information about 00:38:30.120 |
the grant, kind of the date and blah, blah, blah. 00:38:34.680 |
And then there was a whole bunch of columns about person one, did they have a PhD, and 00:38:43.560 |
then there's a whole bunch of columns about person two and so forth for I think it was 00:38:52.320 |
Very, very difficult to model: it's an extremely wide and extremely messy data set. 00:39:01.080 |
It's the kind of thing that general purpose computing tools are pretty good at. 00:39:04.720 |
So I pulled this into C# and created a grants data class where basically I went, okay, read 00:39:12.960 |
through this file, and I created this thing called grants data, and for each line I split 00:39:19.720 |
it on a comma, and I added that grant to this grants data. 00:39:23.560 |
For those people who maybe aren't so familiar with general purpose programming languages, 00:39:29.760 |
you might be surprised to see how readable they are. 00:39:32.760 |
This idea I can say for each something in lines dot select the lines bit by comma, if 00:39:39.360 |
you haven't used anything with portrait you might be surprised that something like C# 00:39:45.840 |
File dot read lines dot skip some lines, this is just a skip the first line of the header, 00:39:51.040 |
and in fact later on I discovered the first couple of years of data were not very predictive 00:39:54.960 |
of today, so I actually skipped all of those. 00:40:00.200 |
And the other nice thing about these kind of tools is okay, what does this dot add do? 00:40:03.960 |
I can press one button and bang, I'm at the definition of .Add. 00:40:08.360 |
These kinds of IDE features are really helpful, and this is equally true of most Python and Java environments. 00:40:19.160 |
So the kind of stuff that I was able to do here was to create all kinds of interesting 00:40:26.400 |
derived variables, like here's one called max year of birth, so this one is one that goes 00:40:33.120 |
through all of the people on this application and finds the one with the largest year of birth. 00:40:39.840 |
Okay, again it's just a single line of code, if you kind of get around the kind of curly 00:40:46.160 |
brackets and things like that the actual logic is extremely easy to understand, you know? 00:40:53.280 |
Things like do any of them have a PhD, well if there's no people in it, none of them do, 00:40:58.440 |
otherwise, oh this is just one person has a PhD, down here somewhere I've got, and he 00:41:06.960 |
has a PhD, bang, straight to there, there you go, does any person have a PhD? 00:41:13.640 |
So I created all these different derived fields, I used pivot tables to kind of work out which 00:41:19.040 |
ones seemed to be quite predictive before I put these together. And so what did I do with them? 00:41:26.880 |
Well I wanted to create a random forest from this. 00:41:30.080 |
Now random forests are a very powerful, very general purpose tool, but the R implementation has some annoying limitations. 00:41:45.560 |
For example, if you have a categorical variable, in other words a factor, it can't have more than 32 levels. 00:41:59.960 |
If you have a continuous variable, so like an integer or a double or whatever, it can't have any missing values. 00:42:10.000 |
So there are these kind of nasty limitations that make it quite difficult, and it's particularly 00:42:16.160 |
difficult to use in this case because things like the RFCD codes had hundreds and hundreds 00:42:20.400 |
of levels, and all the continuous variables were full of nulls, and in fact if I remember 00:42:27.360 |
correctly, even the factors aren't allowed to have nulls, which I find a bit weird because 00:42:33.520 |
to me null is just another factor, they're male or they're female or they're unknown. 00:42:41.680 |
It's still something I should get a model on. 00:42:43.720 |
So I created a system that basically made it easy for me to create a data set up on one. 00:42:54.480 |
So I made this decision, I decided that for doubles that had nulls in them, I created 00:43:05.040 |
something which basically simply added two rows, sorry two columns, one column which 00:43:14.920 |
was is that column null or not, one or zero, and another column which is the actual data 00:43:23.280 |
from that column, so whatever it was, 2.36 blah blah blah blah blah, and wherever there 00:43:33.000 |
was a null, I just replaced it with the median. 00:43:39.440 |
So I now had two columns where I used to have one, and both of them are now modelable. 00:43:48.920 |
Actually it doesn't matter, because every place where this is the median there's a one 00:43:53.960 |
over here, so in my model I'm going to use this as a predictor, I'm going to use this 00:43:58.680 |
as a predictor, so if all of the places that that data column was originally null all meant 00:44:04.400 |
something interesting, then it'll be picked up by the is-null version of the column. 00:44:12.640 |
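A minimal R sketch of that null-handling trick (the original was done in C#, so this is an illustration rather than the actual code): one is-null indicator column plus a median-imputed copy.

```r
add_null_features <- function(df, col) {
  x <- df[[col]]
  df[[paste0(col, "_isnull")]] <- as.numeric(is.na(x))   # 1 wherever the value was missing
  x[is.na(x)] <- median(x, na.rm = TRUE)                 # fill the holes with the median
  df[[paste0(col, "_filled")]] <- x
  df[[col]] <- NULL                                      # drop the original, unmodelable column
  df
}
```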
So to me this is something which I do which I did automatically because it's clearly the 00:44:18.520 |
obvious way to deal with null values, and then as I said in the categorical variables 00:44:25.800 |
I just said okay the factors, if there's a null just treat it as another level, and then 00:44:32.480 |
finally in the factors I said okay, take all of the levels, and if there are more observations 00:44:42.600 |
than - I think it was 100 - then keep it; 00:44:49.400 |
if there are more observations than 25 but less than 100, 00:44:53.200 |
and it was quite predictive, in other words that level was different to 00:44:59.760 |
the others in terms of application success, then keep it; otherwise merge all the rest into one level. 00:45:09.640 |
So that way I basically was able to create a data set which actually I could then feed 00:45:17.760 |
to R, although I think in this case I ended up using my own random forest implementation. 00:45:26.880 |
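A rough sketch of that level-merging heuristic in R; the 100 and 25 thresholds come from the description above, and the 0.1 "predictive enough" cutoff is an assumption purely for illustration:

```r
merge_rare_levels <- function(f, y, big = 100, small = 25, diff = 0.1) {
  f <- as.character(f)
  f[is.na(f)] <- "missing"                       # treat null as just another level
  counts  <- table(f)
  rates   <- tapply(y, f, mean)                  # success rate within each level
  overall <- mean(y)
  keep <- names(counts)[counts >= big |
            (counts >= small & abs(rates[names(counts)] - overall) > diff)]
  factor(ifelse(f %in% keep, f, "other"))        # everything else gets merged together
}
```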
So should we have a quick talk about random forests and how they work? 00:45:35.720 |
So to me there's kind of basically two main types of model, there's these kind of parametric 00:45:41.760 |
models, models with parameters, things where you say oh this bit's linear and this bit's 00:45:46.960 |
interactive with this bit and this bit's kind of logarithmic and I specify how I think this 00:45:56.680 |
system looks, and all the modeling tool does is it fills in parameters, okay this is the 00:46:01.080 |
slope of that linear bit, this is the slope of that logarithmic bit, this is how these 00:46:06.720 |
So things like GLM, very well known parametric tools, then there are these kind of non-parametric 00:46:15.320 |
or semi-parametric models which are things where I don't do any of that, I just say here's 00:46:19.600 |
my data, I don't know how it's related to each other, just build a model, and so things 00:46:24.920 |
like support vector machines, neural nets, random forests, decision trees all have that 00:46:38.080 |
Non-parametric models are not necessarily better than parametric models, I mean think back 00:46:41.880 |
to that example of the R package competition where really all I wanted was some weights 00:46:47.720 |
to say how this max column relates to the outcome - and if all you really want is some 00:46:51.600 |
weights, all you want is some parameters, and so a GLM is perfect. 00:46:58.960 |
GLMs certainly can overfit, but there are ways of creating GLMs that don't; for 00:47:05.600 |
example you can use stepwise regression, or the much more fancy modern version you can 00:47:12.800 |
use GLMnet, which is basically another tool for doing GLMs which doesn't overfit, but 00:47:22.560 |
anytime you don't really know what the model form is, this is where you use a non-parametric 00:47:26.720 |
tool, and random forests are great because they're super, super fast and extremely flexible, 00:47:36.160 |
and they don't really have any parameters to tune, so they're pretty hard to get wrong. 00:47:44.200 |
A random forest is simply, in fact we shouldn't even use this term random forest, because 00:47:50.540 |
a random forest is a trademark term, so we will call it an ensemble of decision trees, 00:48:00.440 |
and in fact the trademark term random forest, I think that was 2001, that wasn't where this 00:48:10.440 |
ensemble of decision trees was invented, it goes all the way back to 1995. 00:48:14.560 |
In fact it was actually kind of independently developed by three different people in 1995, 00:48:22.800 |
The random forest implementation is really just one way of doing it. 00:48:28.200 |
It all rests on a really fascinating observation, which is that if you have a model that is 00:48:37.100 |
really, really, really shit, but it's not quite random, it's slightly better than nothing, 00:48:45.320 |
and if you've got 10,000 of these models that are all different to each other, and they're 00:48:50.840 |
all shit in different ways, but they're all better than nothing, the average of those 00:48:55.760 |
10,000 models will actually be fantastically powerful as a model of its own. 00:49:03.040 |
So this is the wisdom of crowds or ensemble learning techniques. 00:49:08.320 |
You can kind of see why, because if out of these 10,000 models they're all kind of crap 00:49:13.680 |
in different ways, they're all a bit random, they're all a bit better than nothing, 9,999 00:49:18.800 |
of them might basically be useless, but one of them just happened upon the true structure of the data. 00:49:25.640 |
So the other 9,999 will kind of average out, if they're unbiased, not correlated with 00:49:31.640 |
each other, they'll all average out to whatever the average of the data is. 00:49:36.480 |
So any difference in the predictions of this ensemble will all come down to that one model 00:49:42.600 |
which happened to have actually figured it out right. 00:49:45.460 |
Now that's an extreme version, but that's basically the concept behind all these ensemble 00:49:51.400 |
techniques, and if you want to invent your own ensemble technique, all you have to do 00:49:56.160 |
is come up with some learner, some underlying model, which you can randomise in some way 00:50:04.440 |
and each one will be a bit different, and you run it lots of times. 00:50:08.560 |
And generally speaking, this whole approach we call random subspace. 00:50:21.280 |
So random subspace techniques, let me show you how unbelievably easy this is. 00:50:26.680 |
Take any model, any kind of modelling algorithm you like. 00:50:31.680 |
Here's our data, here's all the rows, here's all the columns - and let's grab a random subset of the rows and a random subset of the columns. 00:50:53.440 |
So let's now build a model using that subset of rows and that subset of columns. 00:50:59.720 |
It's not going to be as perfect at recognising the training data as using the full data set, but 00:51:07.960 |
it's one way of building a model. But now let's build a second model. 00:51:13.560 |
This time I'll use this subspace, a different set of rows and a different set of columns. 00:51:19.960 |
No, absolutely not, but I didn't want to draw 4000 lines, so let's pretend. 00:51:28.080 |
So in fact what I'm really doing each time here is I'm pulling out a bunch of random rows and random columns. 00:51:44.960 |
It's just one way of creating a random subspace, but it's a nice easy one, and because I didn't 00:51:51.320 |
do very well at linear algebra, in fact I'm just a philosophy graduate, I don't know any 00:51:54.720 |
linear algebra, I don't know what subspace means well enough to do it properly, but this 00:51:59.920 |
certainly works, and this is all decision trees do. 00:52:03.240 |
So now I'll imagine that we're going to do this, and for each one of these different 00:52:07.240 |
random subspaces we're going to build a decision tree. 00:52:17.360 |
So let's say we've got age, sex, smoker, and lung capacity, and we want to predict people's lung capacity. 00:52:50.080 |
So to build a decision tree, let's assume that this is the particular subset of columns 00:52:54.760 |
and rows in a random subspace, so let's build a decision tree. 00:52:58.400 |
So to build a decision tree, what I do is I say, okay, on which variable, on which predictor, 00:53:08.560 |
and at which point of that predictor, can I do a single split which makes the biggest 00:53:13.920 |
difference possible in my dependent variable? 00:53:17.360 |
So it might turn out that if I looked at this smoker, yes, and no, that the average lung 00:53:29.320 |
capacity for all of the smokers might be 30, and the average for all of the non-smokers 00:53:37.060 |
So literally all I've done is I've just gone through each of these and calculated the average 00:53:40.480 |
for the two groups, and I've found the one split that makes that as big a difference as possible. 00:53:49.200 |
So in those people that are non-smokers, I now, interestingly, with the random forest 00:53:56.960 |
or these decision tree ensemble algorithms, generally speaking, at each point I select another random subspace. 00:54:04.440 |
So I randomly select a new group of columns, but I'm going to use the same rows. 00:54:07.520 |
I obviously have to use the same rows because I'm kind of taking them down the tree. 00:54:12.480 |
So now it turns out that if we look at age amongst the people that are non-smokers, if 00:54:17.520 |
you're less than 18 versus greater than 18, it's the number one biggest thing in this 00:54:22.960 |
random subspace that makes the difference, and that's like 50, and that's like 80. 00:54:28.280 |
And so this is how I create a decision tree, okay? 00:54:34.360 |
So at each point, I've taken a different random subset of columns. 00:54:40.040 |
For the whole tree, I've used the same random subset of rows. 00:54:43.680 |
And at the end of that, I keep going until every one of my leaves either has only one 00:54:51.040 |
or two data points left, or all of the data points at that leaf all have exactly the same 00:54:56.240 |
outcome, the same lung capacity, for example. 00:55:01.080 |
And at that point, I've finished making my decision tree. 00:55:03.800 |
So now I put that aside, and I say, okay, that is decision tree number one. 00:55:12.540 |
And now go back and take a different set of rows and repeat the whole process. 00:55:27.400 |
And at the end of that, I've now got 1,000 decision trees. 00:55:31.280 |
And for each thing I want to predict, I then stick that thing I want to predict down every single tree. 00:55:37.600 |
So the first thing I'm trying to predict might be, you know, a non-smoker who is 16 years 00:55:42.480 |
old, blah, blah, blah, blah, blah, and that gives me a prediction. 00:55:45.400 |
So the predictions for these things at the very bottom is simply what's the average of 00:55:53.520 |
the dependent variable, in this case, the lung capacity for that group. 00:55:57.200 |
So that gives me 50 in decision tree one, and it might be 30 in decision tree two, and 14 in decision tree three. 00:56:05.400 |
And that's given me what I wanted, which is a whole bunch of independent, unbiased, uncorrelated predictions that I can average. 00:56:21.800 |
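A toy sketch of that ensemble in R, for a numeric outcome, using rpart as the underlying tree learner: each tree sees one random subset of rows and columns, grown deep and unpruned, and predictions are averaged. (Real random forest implementations re-pick the candidate columns at every split rather than once per tree, as noted above.)

```r
library(rpart)

grow_ensemble <- function(data, target, n_trees = 1000, row_frac = 0.3, col_frac = 0.3) {
  predictors <- setdiff(names(data), target)
  lapply(seq_len(n_trees), function(i) {
    rows <- sample(nrow(data), ceiling(row_frac * nrow(data)))
    cols <- sample(predictors, max(1, ceiling(col_frac * length(predictors))))
    fit  <- rpart(reformulate(cols, target),
                  data    = data[rows, c(cols, target)],
                  control = rpart.control(cp = 0, minsplit = 2))  # grow deep, no pruning
    list(fit = fit, rows = rows)   # remember the rows so the out-of-bag rows are known later
  })
}

predict_ensemble <- function(ensemble, newdata) {
  rowMeans(sapply(ensemble, function(t) predict(t$fit, newdata)))  # average over all trees
}
```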
If you want to be super cautious and you really need to make sure you're avoiding overfitting 00:56:27.080 |
then what you do is you make sure your random subspaces are smaller. 00:56:34.280 |
So then each tree is shitter than average, whereas if you want to be quick, you make the subspaces bigger. 00:56:45.920 |
So it better reflects the true data that you've got. 00:56:49.640 |
Obviously the less rows and less columns you have each time, the less powerful each tree is. 00:56:58.400 |
And the nice thing about this is that building each of these trees takes like a ten thousandth 00:57:06.560 |
It depends on how much data you've got, but you can build thousands of trees in a few 00:57:17.240 |
And here's a really cool thing, the really cool thing. 00:57:21.360 |
In this tree I built it with these rows, which means that these rows I didn't use to build 00:57:32.840 |
my tree, which means these rows are out of sample for that tree. 00:57:38.600 |
And what that means is I don't need to have a separate cross-validation dataset. 00:57:43.760 |
What it means is I can create a table now of my full dataset, and for each one I can say, 00:57:51.640 |
okay, row number one, how good am I at predicting row number one? 00:57:55.760 |
Well, here's all of my trees from one to a thousand. 00:58:01.080 |
Row number one is in fact one of the things that was included when I created tree number one. 00:58:07.240 |
So I won't use it here, but row number one wasn't included when I built tree two. 00:58:13.000 |
It wasn't included in the random subspace of tree three, and it was included in the one for tree one thousand. 00:58:18.280 |
So what I do is row number one, I send down trees two and three, and I get predictions 00:58:26.720 |
for everything that it wasn't in, and average them out, and that gives me this fantastic 00:58:35.240 |
thing, which is an out-of-bag estimate for row one, and I do that for every row. 00:58:45.360 |
So none of this, all of this stuff which is being predicted here is actually not using 00:58:50.120 |
any of the data that was used to build the trees, and therefore it is truly out of sample 00:58:53.800 |
or out-of-bag, and therefore when I put this all together to create my final whatever it 00:58:58.520 |
is, AUC, or log likelihood, or SSE, or whatever, I can then send that off to Kaggle. 00:59:06.640 |
Kaggle should give you pretty much the same answer, because you're by definition not overfitting. 00:59:13.160 |
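A hedged sketch of that out-of-bag idea, reusing the grow_ensemble structure sketched earlier: each row is scored only by the trees that never saw it, so the resulting error estimate behaves like out-of-sample.

```r
oob_predict <- function(ensemble, data) {
  sapply(seq_len(nrow(data)), function(i) {
    unused <- Filter(function(t) !(i %in% t$rows), ensemble)   # trees that skipped row i
    if (length(unused) == 0) return(NA)                        # every tree happened to see this row
    mean(sapply(unused, function(t) predict(t$fit, data[i, , drop = FALSE])))
  })
}
```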
I'm just wondering if it's possible to just pick a tree - rather than 00:59:18.160 |
averaging out a thousand trees, is it possible to pick that one tree that actually has the 00:59:24.760 |
best performance, and would you recommend it? 00:59:26.960 |
So the question was, can you just pick one tree - would picking that one tree be a good idea? 00:59:34.320 |
That's a really important question, and let's think about what would happen. 00:59:40.160 |
The whole purpose of this was to not overfit. 00:59:43.680 |
So the whole purpose of this was to say each of these trees is pretty crap, but it's better 00:59:50.000 |
than nothing, and so when we average them all out, it tells us something about the true structure of the data. 00:59:56.960 |
If I now go back and do anything to those trees, if I try and prune them, which is in 01:00:01.680 |
the old-fashioned decision tree algorithms, or if I weight them, or if I pick a subset 01:00:07.000 |
of them, I'm now introducing bias based on the training set predictivity. 01:00:13.840 |
So anytime I introduce bias, I now break the laws of ensemble methods fundamentally. 01:00:19.840 |
So the other thing I'd say is there's no point, right? 01:00:25.280 |
Because if you have something where actually you've got so much training data that out 01:00:33.880 |
of sample isn't a big problem or whatever, you just use bigger subspaces and less trees. 01:00:40.840 |
And in fact, the only reason you do that is for time, and because this approach is so 01:00:44.040 |
fast anyway, I wouldn't even bother then, you see? 01:00:47.920 |
And the nice thing about this is, is that you can say, okay, I'm going to use kind of 01:00:54.240 |
this many columns and this many rows in each subspace, right? 01:00:58.240 |
And I've got to start building my trees, and I build tree number one, and I get this out 01:01:07.760 |
And the nice thing is I can watch and see, and it will be monotonic. 01:01:12.840 |
Well, not exactly monotonic, but kind of bumpy monotonic, it will keep getting better on average as I add trees. 01:01:18.880 |
And I can get to a point where I say, okay, that's good enough, I'll stop. 01:01:22.160 |
And as I say, normally we're talking four or five seconds, so time's just not an issue, 01:01:27.040 |
but if you're talking about huge data sets that you can't sample down, this is a way you can manage it. 01:01:35.400 |
So this is a technique that I used in the Grant's prediction competition. 01:01:39.840 |
I did a bunch of things to make it even more random than this. 01:01:46.120 |
One of the big problems here, both in terms of time and lack of randomness, is that all 01:01:50.320 |
of these continuous variables, the official random forest algorithm searches through every 01:01:56.960 |
possible breakpoint to find the very best, which means that every single time that you 01:02:02.520 |
use that particular variable, particularly if it's in the same spot, like at the top 01:02:07.680 |
of the tree, it's going to do the same split, right? 01:02:11.080 |
In the version I wrote, actually, all it does is every time it comes across a continuous 01:02:15.800 |
variable, it randomly picks three breakpoints, so it might try 50, 70, and 90, and it just uses whichever of those works best. 01:02:26.320 |
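A small sketch of that randomised split search, with the scoring (between-group variance) chosen just for illustration:

```r
# Try a handful of random breakpoints of a continuous variable and keep whichever
# best separates the outcome, instead of scanning every possible breakpoint.
best_random_split <- function(x, y, n_try = 3) {
  ux <- unique(x)
  candidates <- if (length(ux) <= n_try) ux else sample(ux, n_try)
  score <- function(cut) {
    left <- y[x <= cut]; right <- y[x > cut]
    if (length(left) == 0 || length(right) == 0) return(-Inf)
    length(left) * (mean(left) - mean(y))^2 + length(right) * (mean(right) - mean(y))^2
  }
  candidates[which.max(sapply(candidates, score))]
}
```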
And to me, this is the secret of good ensemble algorithms: make every model as different from the others as possible. 01:02:34.520 |
Does the distribution of the population variable you're trying to predict matter? 01:02:40.920 |
So the question was, does the distribution of the dependent variable matter? 01:02:45.920 |
And the answer is it doesn't, and the reason it doesn't is because we're using a tree. 01:02:50.380 |
So the nice thing about a tree is, let's imagine that the dependent variable was kind of maybe 01:03:04.000 |
The nice thing is that, as it looks at the independent variables, it's looking at the 01:03:08.800 |
difference in two groups, and trying to find the biggest difference between those two groups. 01:03:13.600 |
So regardless of the distribution, it's more like a rank measure, isn't it? 01:03:18.780 |
It's picked a particular breakpoint, and it's saying which one finds the biggest difference 01:03:24.680 |
So regardless of the distribution of the dependent variable, it's still going to find the same 01:03:28.280 |
breakpoints, because it's really a non-parametric measure. 01:03:33.280 |
We're using something like, for example, Gini, or some other measure of the information gain of the split. 01:03:42.800 |
So this is true of really all decision tree approaches in fact. 01:03:48.040 |
Does it work with the highly imbalanced data set? 01:03:54.560 |
So the question is, does it work for a highly imbalanced data set? 01:03:58.640 |
Sometimes some versions can, and some versions can't. 01:04:02.560 |
The approaches which use more randomization are more likely to work okay, but the problem 01:04:07.120 |
is in highly imbalanced data sets, you can quite quickly end up with nodes which are all the same class. 01:04:14.200 |
So I actually have often found I get better results if I do some stratified sampling, 01:04:20.120 |
so that, for example, think about the R competition, where most people don't have most packages installed. 01:04:31.340 |
So in that case, I tend to say, all right, at least half of that data set is so obviously 01:04:36.480 |
zero, let's just call it zero and just work with the rest, and I do find I often get better results that way. 01:04:44.560 |
Would it be better, instead of using a tree for the forest, to use another algorithm in the ensemble? 01:04:52.160 |
Well, you can't call it forest if you use a different algorithm other than a tree, but 01:04:56.560 |
yes, you can use other random subspace methods. 01:05:02.000 |
A lot of people have been going down that path. 01:05:07.680 |
So GLMnet would be a good example because that's very fast, but GLMnet is parametric. 01:05:16.460 |
The nice thing about decision trees is that they're totally flexible. 01:05:21.080 |
They don't assume any particular data structure. 01:05:23.520 |
They kind of are almost unlimited in the amount of interactions that they can handle, and 01:05:29.960 |
you can build thousands of them very quickly, but there are certainly people who are creating 01:05:33.720 |
other types of random subspace ensemble methods, and I believe some of them are quite effective. 01:05:41.080 |
Interestingly - I can't remember where I saw it - but I have seen some papers which show 01:05:46.400 |
evidence that if you've got a truly flexible underlying model 01:05:53.240 |
and you make it random enough and you create enough of them, it doesn't really matter which 01:05:58.380 |
one you use or how you do it, which is a nice result. 01:06:01.440 |
It kind of suggests that we don't have to spend lots of time trying to come up with 01:06:05.680 |
better and better and better generic predictive modeling tools. 01:06:11.880 |
If you think about it, there are "better and better" versions of this, in quotes, like 01:06:15.240 |
rotation forests, and then there are things like GBMs - gradient boosting machines - and so on. 01:06:20.320 |
In practice, they can be faster for certain types of situation, but the general result 01:06:26.640 |
here is that these ensemble methods are as flexible as you need them to be. 01:06:32.400 |
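As a rough illustration of what a non-tree random subspace ensemble could look like - a sketch only, not any particular published method, assuming x is a numeric matrix and y a numeric response - you could bag glmnet models over random subsets of rows and columns and average their predictions:

```r
library(glmnet)

# Sketch of a random subspace ensemble with a non-tree base learner:
# each model sees a random 20% of the rows and 30% of the columns.
fit_subspace_ensemble <- function(x, y, n_models = 50,
                                  row_frac = 0.2, col_frac = 0.3) {
  lapply(seq_len(n_models), function(i) {
    rows <- sample(nrow(x), floor(row_frac * nrow(x)))
    cols <- sample(ncol(x), max(2, floor(col_frac * ncol(x))))
    list(cols = cols,
         fit  = cv.glmnet(x[rows, cols, drop = FALSE], y[rows]))
  })
}

predict_ensemble <- function(models, newx) {
  preds <- sapply(models, function(m)
    as.numeric(predict(m$fit, newx[, m$cols, drop = FALSE], s = "lambda.min")))
  rowMeans(preds)   # simple average across the ensemble
}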
How do you define the optimal size of the subspace? 01:06:37.740 |
The question is how do you define the optimal size of the subspace, and that's a really good question. 01:06:43.240 |
The answer to it is really nice, and it's that you really don't have to. 01:06:49.600 |
Generally speaking, the fewer rows and the fewer columns you use, the more trees you need, 01:06:56.520 |
but the less you'll overfit and the better results you'll get. 01:07:02.780 |
The nice thing normally is that for most data sets, because of the speed of random forest, 01:07:07.200 |
you can pretty much always pick a row count and a column count that's small enough that 01:07:12.280 |
you're absolutely sure it's going to be fine. 01:07:16.440 |
Sometimes it can become an issue, maybe you've got really huge data sets, or maybe you've 01:07:20.880 |
got really big problems with data imbalance, or hardly any training data, and in these cases 01:07:27.440 |
you can use the kind of approaches which would be familiar to most of us around creating 01:07:33.400 |
a grid of a few different values of the column count and the row count, and trying a few 01:07:39.600 |
out, and watching that graph of how the error improves as you add more trees. 01:07:46.120 |
The truth is it's so insensitive to this that if you pick a number of columns of somewhere 01:07:56.520 |
between 10% and 50% of the total, and a number of rows of between 10% and 50% of the total, you'll generally be fine. 01:08:09.360 |
You just keep adding more trees until you're sick of waiting, or the error curve has obviously flattened out. 01:08:14.600 |
If you do 1,000 trees - again, it really doesn't matter; 01:08:22.200 |
on the whole, it's just not sensitive to that choice. 01:08:28.720 |
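If you do want to check, a rough version of that grid-plus-error-curve approach using the R randomForest package might look like the following; 'train' and 'target' are hypothetical, and a classification problem is assumed so the out-of-bag error per tree is available in err.rate:

```r
library(randomForest)

# Assumes a classification problem with a factor response 'target' in 'train'.
n <- nrow(train)
p <- ncol(train) - 1

for (col_frac in c(0.1, 0.3, 0.5)) {
  for (row_frac in c(0.1, 0.3, 0.5)) {
    rf <- randomForest(
      target ~ ., data = train,
      ntree    = 500,
      mtry     = max(1, floor(col_frac * p)),   # columns tried per split
      sampsize = floor(row_frac * n),           # rows drawn per tree
      replace  = FALSE
    )
    # Out-of-bag error after the final tree; plot(rf) shows the full curve
    # of error against the number of trees.
    cat(col_frac, row_frac, tail(rf$err.rate[, "OOB"], 1), "\n")
  }
}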
Yeah, the R routine actually - so for this idea of a random subspace, there are different ways 01:08:35.240 |
of creating that random subspace, and one key question is: can I pull out a row again once it's already been picked? 01:08:46.160 |
The R randomForest package, and the original Fortran code it's based on, by default let you pull something 01:08:51.920 |
out multiple times, and by default, in fact, if you've got n rows, it will pull 01:08:57.960 |
out n rows, but because it's pulled some out multiple times, on average it will only cover about 63.2% of the distinct rows. 01:09:08.440 |
I don't find I get the best results when I use that, but it doesn't matter, because in the R random 01:09:14.600 |
forest options you can choose whether to sample with or without replacement, and how many rows to draw. 01:09:24.080 |
To me, I'm sure it depends on the dataset, but I guess, I always enter Kaggle competitions 01:09:30.680 |
which are in areas that I've never entered before, kind of domain-wise or algorithm-wise, 01:09:36.680 |
so I guess I'd be getting a good spread of different types of situation, and in the ones 01:09:41.440 |
I've looked at, sampling without replacement is kind of more random, and I also tend to 01:09:48.400 |
pick a much lower n than 63.2% - you know, I tend to use more like 10 or 20% of the data set. 01:09:56.640 |
Yeah, I know the concepts, I guess I can say it. 01:10:10.000 |
That's my experience, but I'm sure it depends on the dataset, and I'm not sure it's terribly important. 01:10:22.160 |
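Two quick things you can verify yourself in R: the roughly 63.2% coverage of a bootstrap sample, and how to ask randomForest to draw a small fraction of rows without replacement instead ('train' and 'target' are again hypothetical):

```r
# A bootstrap sample of size n covers about 1 - 1/e (roughly 63.2%) of the
# distinct rows, which you can check directly:
n <- 100000
length(unique(sample(n, n, replace = TRUE))) / n   # ~ 0.632

# Drawing a much smaller fraction of rows, without replacement, for each tree:
library(randomForest)
rf <- randomForest(
  target ~ ., data = train,
  ntree    = 1000,
  replace  = FALSE,
  sampsize = floor(0.15 * nrow(train))   # roughly 10-20% of the rows
)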
I always split it into two branches - so there are a few possibilities here, as you can see in this tree. 01:10:30.720 |
In this case, I've got something here which is a binary variable, so obviously that can only split into two. 01:10:38.240 |
In this case, I've got something here which is a continuous variable; now it's split it 01:10:43.160 |
into two, but if actually it's going to be optimal to split it into three, then if the 01:10:49.320 |
variable appears again at the next level, it can always split it into another two at that point. 01:11:01.400 |
So it just depends whether, when I did that, remember at every level I repeat the sampling 01:11:07.880 |
of a different bunch of columns, I could absolutely have the same column again in that group, 01:11:13.320 |
and it could so happen that again I find the split point which is the best in that group. 01:11:17.280 |
If you're doing 10,000 trees with 100 levels each it's going to happen lots of times, so 01:11:22.800 |
the nice thing is that if the true underlying system is a single univariate logarithmic 01:11:30.480 |
relationship, these trees will absolutely find that, eventually. 01:11:40.460 |
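You can see that with a toy experiment (not from the talk): a random forest with no notion of logarithms recovering a univariate log relationship from enough noisy data:

```r
library(randomForest)

set.seed(1)
x <- runif(2000, 1, 100)
y <- log(x) + rnorm(2000, sd = 0.1)

rf   <- randomForest(data.frame(x = x), y, ntree = 500)
grid <- data.frame(x = seq(1, 100, length.out = 200))

# The step-wise predictions of many unpruned trees average out into
# something very close to the underlying log curve.
plot(grid$x, predict(rf, grid), type = "l", xlab = "x", ylab = "prediction")
curve(log(x), add = TRUE, lty = 2)   # true relationship, for comparison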
Definitely don't prune the trees - if you prune the trees you introduce bias, so the key thing 01:11:45.760 |
here which makes them so fast and so easy but also so powerful is you don't prune trees. 01:11:57.860 |
No, it doesn't necessarily, because your split point will be such that the two halves won't necessarily be the same size. 01:12:13.080 |
Yeah that's right, because in the under 18 group you could have not that many people, 01:12:21.360 |
and in the over-18 group you can have quite a lot of people, so the weighted average of the scores of the two groups takes those sizes into account. 01:12:29.480 |
Have you ever compared this with gradient boosting machines? 01:12:39.480 |
Gradient boosting machines are interesting - they're a lot harder to understand. Gradient 01:12:45.280 |
boosting machines, I mean, are still basically an ensemble technique, but they work in a more directed way, with each tree fitted to the errors of the trees before it. 01:12:55.680 |
There's a few pieces of theory around gradient boosting machines which are nicer than random 01:12:59.240 |
forests, they ought to be faster and they ought to be more well directed, and you can do things 01:13:04.800 |
like, say, with a gradient boosting machine, that this particular column has a monotonic relationship 01:13:09.680 |
with the dependent variable, so you can actually add constraints, which you can't do with a random forest. 01:13:15.680 |
In my experience I don't need the extra speed of GBMs, because I've just never found it necessary. 01:13:23.760 |
I find them harder to tune - they've got more parameters to deal with - so I haven't found them useful 01:13:29.960 |
for me, and I know in a lot of data mining competitions and also a lot of real-world predictive 01:13:35.560 |
modelling problems, people try both and end up with random forests. 01:13:39.720 |
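For reference, the monotonicity constraint mentioned above is exposed in the R gbm package through the var.monotone argument; the data frame and column names here are invented for illustration:

```r
library(gbm)

# Hypothetical regression problem: force 'age' to have a monotonically
# increasing relationship with 'spend' (+1), leave the others unconstrained (0).
fit <- gbm(
  spend ~ age + region + tenure,
  data              = train,
  distribution      = "gaussian",
  n.trees           = 2000,
  interaction.depth = 3,
  shrinkage         = 0.05,
  var.monotone      = c(1, 0, 0)   # one entry per predictor, in formula order
)

best_iter <- gbm.perf(fit, method = "OOB")            # choose number of trees
preds     <- predict(fit, newdata = test, n.trees = best_iter)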
Well we're probably just about out of time, so maybe if there's any more questions I can 01:13:44.880 |
chat to you guys afterwards, thanks very much.