
Getting In Shape For The Sport Of Data Science


Transcript

OK, so we're here at the Melbourne R meetup, and we are talking about some techniques that Jeremy Howard has used to do as well as he can in a variety of Kaggle competitions. And we're going to start by having a look at some of the tools that I've found useful in predictive modelling in general and in Kaggle competitions in particular.

So I've tried to write down here what I think are some of the key steps. So after you download data from a Kaggle competition, you end up with CSV files, generally speaking, which can be in all kinds of formats. So here's the first thing you see when you open up the time series CSV file.

It's not very hopeful, is it? So each of these columns is actually-- oh, here we come-- is actually quarterly time series data. And so because-- well, for various reasons, each one's a different length, and they kind of start further along, the particular way that this was provided didn't really suit the tools that I was using.

And in fact, if I remember correctly, that's already-- I've already adjusted it slightly because it originally came in rows rather than columns. Yeah, that's right. This is how it originally came in rows rather than columns. So this is where this kind of data manipulation toolbox comes in. There's all kinds of ways to swap rows and columns around, which is where I started.

The really simple approach is to select the whole lot, copy it, and then paste it and say transpose in Excel. And that's one way to do it. And then having done that, I ended up with something which I could open up in-- let's have a look at the original file.

This is the original file in VIM, which is my text editor of choice. This is actually a really good time to get rid of all of those leading commas, because they kind of confuse me. So this is where stuff like VIM is great; even things like Notepad++ and Emacs and any of these kind of power-user text editors will work fine.

As long as you know how to use regular expressions-- and if you don't, I'm not going to show you now, but you should definitely look it up. So in this case, I'm just going to go, OK, let's use a regular expression. So I say, substitute across the whole file: match any number of commas at the start, and replace them with nothing at all.

And you can see it's done in VIM, and I'm done. So I can now save that, and I've got a nice, easy file that I can start using. So that's why I've listed this idea of data manipulation tools in my toolbox. And to me, VIM, or some regular-expression-capable text editor which can handle large files, is something to be familiar with.

So just in case you didn't catch that, that is regular expressions. Probably the most powerful tool for doing text and data manipulation that I know of. Sometimes they're just called regexes. The most powerful types of regular expressions, I would say, would be the ones that are in Perl. They've been widely used elsewhere.

Any C program that uses the PCRE engine has the same regular expressions as Perl, more or less. C# and .NET have the same regular expressions as Perl, more or less. So this is a nice example of one bunch of people getting it right, and everybody else plagiarizing. VIM's regular expressions are slightly different, unfortunately, which annoys me no end, but they still do the job.

So yeah, make sure you've got a good text editor that you know well how to use. Something with a good macro facility is nice as well. Again, VIM's great for that. You can record a series of keystrokes and hit a button, and it repeats it basically on every line.

I also wrote Perl here because, to me, Perl is a rather unloved programming language, but if you think back to where it comes from, it was originally developed as the Swiss Army chainsaw of text processing tools. And today, that is something it still does, I think, better than any other tool.

It has amazing command line options you can pass to it that do things like run the following command on every line in a file, or run it on every line in a file and print the result. There's a command line option to back up each file before changing it in place.

I find with Perl I can do stuff which would take me much, much longer in any other tool. Even simple little things: I was hacking some data on the weekend where I had to concatenate a whole bunch of CSV files, but I only wanted to keep the header line from the first one, so in every other file there was a line I had to delete.

So in Perl-- in fact, it's probably still going to be sitting here in my history-- so in Perl, that's basically: -n means do this on every single line, -e means I'm not even going to write a script file, I'm going to give you the thing to do right here on the command line, and here's a piece of rather difficult to comprehend Perl, but trust me, what it says is if the line number is greater than one, then print that line.

So here's something to strip the first line from every file. So this kind of stuff you can do in Perl is great, and I see a lot of people in the forums who complain that the format of the data wasn't quite what they expected or not quite convenient, can you please change it for me, and I always think, well, this is part of data science, this is part of data hacking, this is data munging or data manipulation.

There's actually a really great book-- I don't know if it's hard to find nowadays, but I loved it-- called Data Munging with Perl, and it's a whole book about all the cool stuff you can do in Perl in a line or two. So okay, I've now got the data into a form where I can kind of load it up into some tool and start looking at it.

What's the tool I normally start with? I normally start with Excel. Now, your first reaction might be to think, Excel, not so good for big files, to which my reaction would be, if you're just looking at the data for the first time, why are you looking at a big file?

Start by sampling it. And again, this is the kind of thing you can do in your data manipulation piece: that thing I just showed you in Perl, with something like print if rand is greater than 0.9, is going to sample every 10 rows, more or less. So if you've got a huge data file, get it down to a size that you can easily start playing with, which normally means some random sampling.
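
If you'd rather do that sampling step in R than at the command line, here's a minimal sketch of the same idea; the file names are placeholders, and it assumes the full file at least fits in memory once:

    big  <- read.csv("big_file.csv")
    keep <- runif(nrow(big)) > 0.9     # TRUE for roughly 10% of rows, like the Perl one-liner
    write.csv(big[keep, ], "sample_for_excel.csv", row.names = FALSE)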

So I like to look at it in Excel, and I will show you for a particular competition how I go about doing that. So let's have a look, for example, at a couple. So here's one which the New South Wales government basically ran, which was to predict how long it's going to take cars to travel along each segment of the M4 motorway, in each direction.

The data for this is a lot of columns, because every column is another route, and lots and lots of rows, because every row is another two-minute observation, and it's very hard to get a feel for what's going on. There were various terrific attempts on the forum at trying to create animated pictures of what the road looks like over time.

I did something extremely low-tech, which is something I'm proud of-- I spend a lot of time doing these extremely low-tech things-- which is I created a simple little macro in Excel which selected each column, and then went conditional formatting, color scales, red to green, and I ran that on each column, and I got this picture.

So here's each route on this road, and here's how long it took to travel that route at this time. And isn't this interesting, because I can immediately see what traffic jams look like. See how they kind of flow as you start getting a traffic jam here? They flow along the road as time goes on, and you can then start to see at what kind of times they happen and where they tend to start, so here's a really big jam.

And it's interesting, isn't it? So if we go into Sydney in the afternoon, then obviously you start getting these jams up here, and as the afternoon progresses, you can see the jam moving so that at 5pm it looks like there's actually a couple of them, and at the other end of the road it stays jammed until everybody's cleared up through the freeway.

So you get a real feel for it, and even when it's not peak hour, and even in some of the periods and areas which aren't so busy, you can see that's interesting. There are basically parts of the freeway which, out of peak hour, have basically constant travel time. And the colours show you that immediately.

You see how easy it is? So when we actually got on the phone with the RTA to take them through the winning model, actually the people that won this competition were kind enough to organise a screencast with all the people in the RTA and from Kaggle to show the winning model.

And the people from the RTA said, "Well, this is interesting." Because the winners had told them, "What we looked at was, we basically created a model that, for a particular time and a particular route, looked at the times and routes just before it and around it on both sides." And I remember one of the RTA guys said, "That's weird, because normally these kinds of traffic-jam queues only go in one direction, so why would you look at both sides?" And so I was able to quickly say, "OK, guys, that's true, but have a look at this." So if you go to the other end, you can see how sometimes, although queues kind of form in one direction, they can kind of slide away in the other direction, for example.

So by looking at this kind of picture, you can see what your model is going to have to be able to model. So you can see what kind of inputs it's going to have and how it's going to have to be set up. And you can immediately see that if you created the model that basically tried to predict each thing based on the previous few periods of the routes around it, whatever modeling technique you're using, you're probably going to get a pretty good answer.

And interestingly, the guys that won this competition, this is basically all they did, a really nice, simple model. They used random forests, as it happens, which we'll talk about soon. They added a couple of extra things, which was, I think, the rate of change of the travel times. But that was basically it.

So a really good example of how visualization can quite quickly tell you what you need to do. I'll show you another example. This is a recent competition that was set up by the dataists.com blog. And what it was, was they wanted to try and create a recommendation system for R packages.

So they got a bunch of users to say, OK, this user, for this package, doesn't have it installed. This user, for this package, does have it installed. So you can kind of see how this is structured. They added a bunch of additional potential predictors for you. How many dependencies does this package have, how many suggestions does this package have, how many imports, how many of those task views on CRAN is it included in, is it a core package, is it a recommended package, who maintains it, and so forth.

So I found it not particularly easy to get my head around what this data looks like. So I used my number one most favorite tool for data visualization and ad hoc analysis, which is a pivot table. A pivot table is something which dynamically lets you slice and dice your data.

So if you've used maybe Tableau or something like that, you'll know the feel. This is kind of like Tableau, except it doesn't cost a thousand dollars. No, I mean, Tableau's got cool stuff as well, but this is fantastic for most things I find I need to do. And so in this case, I simply dragged user ID up to the top, and I dragged package name down to the side, and quickly turned this into a matrix, basically.

And so you can see here what this data looks like, which is that those nasty people at dataists.com deleted a whole bunch of things in this matrix. So that's the stuff that they want us to predict. And then we can see that generally, as you'd expect, there's ones and zeros.

There's some weird shit going on here where some people have things apparently there twice, which suggests to me maybe there's something funny with the data collection. And there's other interesting things. There are some things which seem to be quite widely installed. Most people don't install most packages. And there is this mysterious user number five, who is the world's biggest R package slut.

He or she installs everything that they can. And I can only imagine that ADACGH is particularly hard to install, because not even user number five managed to get around to it. So you can see how creating a simple little picture like this, I can get a sense of what's going on.

So I took that data in the R package competition, and I thought, well, if I just knew for a particular-- so let's say this empty cell is the one we're trying to predict. So if I just knew in general how commonly AcceptanceSampling was installed, and how often user number one installed stuff, I'd probably have a good sense of the probability of user number one installing AcceptanceSampling.

So to me, one of the interesting points here was to think, actually, I don't think I care about any of this stuff. So I jumped into R, and all I did was I basically said, OK, read that CSV file in. There's a whole bunch of lines here, because this is my entire solution.

But I'm just going to show you the lines I used for solution number one. So read in the whole lot. Although user is a number, treat it as a factor, because user number 50 is not 50 times user number one; and those trues and falses, turn them into ones and zeros to make life a bit easier.

And now apply the mean function to each user across their installations, and apply the mean function to each package across that package's installations. So now I've got a couple of lookups, basically, that tell me user number 50 installs this percent of packages, this particular package is installed by this percent of users.

And then I just stuck them, basically, back into my file of predictors. So I basically did these simple lookups for each row: look up the user and find, for that row, the mean for that user and the mean for that package. And that was actually it. At that point, I then created a GLM, in which obviously I had my ones and zeroes of installations as the thing I was predicting.

And my first version I had UP and PP, so these two probabilities as my predictors. In fact, no, in the first version it was even easier than that. All I did, in fact, was I took the max of those two things. So P max, if you're not familiar with R, is just something that does a max on each row individually.

In R, nearly everything works on vectors by default, except for max. So that's why you have to use P max. That's something well worth knowing. So I just took the max. So this user installs 30% of things, and this package is installed by 40% of users. So the max of the two is 40%.

And I actually created a GLM with just one predictor. The benchmark that was created by the dataists.com people for this used a GLM on all of those predictors, including all kinds of analysis of the manual pages and maintainer names and God knows what, and they had an AUC of 0.8.
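
In case it helps to see that in code, here is a minimal sketch of the kind of five lines being described; the file and column names (User, Package, Installed) are placeholders rather than the competition's actual ones:

    d <- read.csv("training_data.csv")
    d$User      <- as.factor(d$User)            # user 50 isn't "50 times" user 1
    d$Installed <- as.numeric(d$Installed)      # TRUE/FALSE to 1/0
    up <- tapply(d$Installed, d$User,    mean)  # how often each user installs things
    pp <- tapply(d$Installed, d$Package, mean)  # how often each package gets installed
    d$MaxP <- pmax(up[as.character(d$User)], pp[as.character(d$Package)])  # row-wise max
    m <- glm(Installed ~ MaxP, data = d, family = binomial)                # one-predictor GLM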

This five-line-of-code thing had an AUC of 0.95. So the message here is, don't overcomplicate things. If people give you data, don't assume that you need to use it, and look at pictures. So if we have a look at kind of my progress in there, so here's my first attempt, which was basically to multiply the user probability by the package probability.

And you can see one of the nice things in Kaggle is you get a history of your results. So here's my 0.84 AUC, and then I changed it to using the maximum of two, and there's my 0.95 AUC. And I thought, oh, that was good. Imagine how powerful this will be when I use all that data that they gave us with a fancy random forest, and it went backwards.

So you can really see that actually a bit of focused, simple analysis can often take you a lot further. So if we look at the next page, we can kind of see where, you know, I kind of kept thinking random forests always work, let's throw more random forests at it, and that went backwards, and then I started adding in a few extra things.

And then actually I thought, you know, there is one piece of data which is really useful, which is that dependency graph. If somebody has installed package A, and it depends on package B, and I know they've got package A, then I also know they've got package B. So I added that piece.

That's the kind of thing I find a bit difficult to do in R, because I think R is a slightly shit programming language. So I did that piece in a language which I quite like, which is C#, imported it back into R, and then, as you can see, each time I send something off to Kaggle, I generally copy and paste into my notes just the line of code that I ran, so I can see exactly what it was.

So here I added this dependency graph, and I jumped up to 0.98. That's basically as far as I got in this competition, which was enough for sixth place. I made a really stupid mistake. Yes, if somebody has package A, and it depends on package B, then obviously that means they've got package B.

I did that. If somebody doesn't have package B, and package A depends on it, then you know they definitely don't have package A. I forgot that piece. When I went back and put that in after the competition was over, I realized that if I'd just done that, I could have come about second.

In fact, to get the top three in this competition, that's probably as much modeling as you needed. So I think you can do well in these comps without necessarily being an R expert or necessarily being a stats expert, but you do need to kind of dig into the toolbox appropriately.

So let's go back to my extensive slide presentation. So you can see here we talked about data manipulation, about interactive analysis, we've talked a bit about visualizations, and I include there even simple things like those tables we did. As I just indicated, in my toolbox is some kind of general purpose programming tool.

And to me, there's kind of three or four clear leaders in this space. And I know from speaking to people in the data science world, about half the people I speak to don't really know how to program. You definitely should, because otherwise all you can do is use stuff that other people have made for you.

And I would be picking from these tools. So I like the highly misunderstood C#. And I would combine it with these particular libraries for-- yes, question? "I was just wondering whether you see these as complementary to R or competing?" Yeah, complementary. And I'll come to that in the very next bullet point, yes.

So these general purpose programming tools are for the stuff that R doesn't do that well. And even one of the guys that wrote R, Ross Ihaka, says he's not that fond nowadays of various aspects of R as a kind of underlying language. Whereas there are other languages which are just so powerful and so rich and so beautiful.

I should have actually included some of the functional languages in here too, like Haskell would be another great choice. But if you've got a good powerful language, a good powerful matrix library, and a good powerful machine learning toolkit, you're doing great. So Python is fantastic. Python also has a really, really nice REPL.

A REPL is like where you type in a line of code like an R, and it immediately gives you the results. And you can keep looking through like that. You can use IPython, which is a really fantastic REPL for Python. And in fact, the other really nice thing in Python is matplotlib, which gives you a really nice charting library.

Much less elegant, but just as effective for C# and just as free is the MSChart controls. I've written a kind of a functional layer on top of those to make them easier to do analysis with, but they're super fast and super powerful, so that only takes 10 minutes. If you use C++, that also works great.

There's a really brilliant thing, very, very underutilized, called Eigen, which originally came from the KDE project and just provides an amazingly powerful kind of vector and scientific programming language on top of C++. Java to me is something that used to be on a par with C# back in the 1.0, 1.1 days.

It's looking a bit sad nowadays, but on the other hand, it has just about the most powerful general purpose machine learning library on top of it, which is Weka. So there's a lot to be said for using that combination. In the end, if you're a data scientist who doesn't yet know how to program, my message is: go and learn to program.

And I don't think it matters too much which one you pick. I would be picking one of these, but without it, you're going to be struggling to go beyond what the tools provide. Question at the back. Yeah, OK, so the question was about visualization tools, and a freely available equivalent to SAS JMP.

Yeah, I would have a look at something like GGobi. GGobi is a fascinating tool which kind of has-- and, not free, but in the same kind of area, we've talked about Tableau-- it supports this concept of brushing, which is this idea that you can look at a whole bunch of plots and scatter plots and parallel coordinate plots and all kinds of plots, and you can highlight one area of one plot, and it will show you where those points are in all the other plots.

And so in terms of really powerful visualization libraries, I think GGobi would be where I would go. Having said that, it's amazing how little I use it in real life. Because things like Excel and what I'm about to come to, which is ggplot2, although much less fancy than things like JMP and Tableau and GGobi, support a hypothesis-driven problem-solving approach very well.

Something else that I do is I tend to try to create visualizations which meet my particular needs. So we talked about the time series problem. And the time series problem is one in which I used a very simple ten-line JavaScript piece of code to plot every single time series in a huge mess like this.

Now you kind of might think, well, if you're plotting hundreds and hundreds of time series, how much insight are you really getting from that? But I found it was amazing how just scrolling through hundreds of time series, how much my brain picked up. And what I then did was when I started modeling this, was I then turned these into something a bit better, which was to basically repeat it, but this time I showed both the orange, which is the actuals, and the blues, which is my predictions.

And then I put in the metric of how successful this particular time series was. So I kind of found that, using this more focused kind of visualization development, in this case I could immediately see whereabouts these were, which numbers were high-- so here's one here, 0.1, that's a bit higher than the others-- and I could immediately kind of see what I had done wrong, and I could get a feel for how my modeling was going straight away.

So I tend to think you don't necessarily need particularly sophisticated visualization tools, they just need to be fairly flexible and you need to know how to drive them to give you what you need. So through this kind of visualization, I was able to check every single chart in this competition: if it wasn't matching well, I'd look at it and I'd say, yeah, it's not matching well because there was just a shock in some period which couldn't possibly be predicted, so that's okay.

And so this was one of the competitions that I won, and I really think that this visualization approach was key. So I mentioned I was going to come back to one really interesting plotting tool, which is ggplot2. ggplot2 is created by a particularly amazing New Zealander who seems to have more time than everybody else in the world combined and creates all these fantastic tools.

Thank you Hadley. I just wanted to show you what I meant by a really powerful but kind of simple plotting tool. Here's something really fascinating, you know how creating scatter plots with lots and lots of data is really hard because you end up with just big black blobs. So here's a really simple idea, which is why don't you give each point in the data a kind of a level of transparency, so that the more they sit on top of each other, it's like transparent disks stacking up and getting darker and darker.

So in the amazing R package called ggplot2, you can do just that. So here's something that says plot the carats of a diamond against its price, and I want you to vary what's called the alpha channel-- the graphics geeks amongst you will know that means kind of the level of transparency-- and I want you to basically set the alpha channel for each point to be one over 10, or one over 100, or one over 200.

And you end up with these plots, which actually show you kind of the density, you know, the amount of data in that area. And it's just so much better than any other approach to scatter plots that I've ever seen. So simple, and just one little line of code in your ggplot.
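
Just to make that concrete, here's a hedged sketch of that one line, using the diamonds data set that ships with ggplot2 (the exact calls in the book may differ a little):

    library(ggplot2)
    # carat vs price for ~54,000 diamonds: with plain points this is just a black blob
    ggplot(diamonds, aes(carat, price)) + geom_point()
    # the same plot, but each point only 1/100th opaque, so overplotting shows up as darkness
    ggplot(diamonds, aes(carat, price)) + geom_point(alpha = 1/100)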

I'll show you another couple of examples. And this, by the way, is in a completely free chapter of the book that he's got up on his website. This is a fantastic book about ggplot2, by the author of the package-- you should definitely buy it. But this, the most important chapter, is available free on his website, so check it out.

I'll show you another couple of examples. Everything's done just right. Here's a simple approach of plotting a loess smoother through a bunch of data, always handy. But every time you plot something, you should see the confidence intervals. No problem. It does it by default. The kind of thing you normally want to see is a loess smoother.

So if you ask for a fit, it gives you the loess smoother by default, and it gives you the confidence interval by default. So it makes it hard to create really bad graphs in ggplot2, although some people have managed, I've noticed. Things like box plots all stacked up next to each other-- it's such an easy way of seeing, in this case, how the price of diamonds varies with their colour.

They've all got roughly the same median, but some of them have really long tails in their prices. What a really powerful plotting device. And it's so impressive that in this chapter of the book, he shows a few options. Here's what would happen if you used a jitter approach, and he's got another one down here, which is what would happen if you used that alpha transparency approach, and you can really compare the different approaches.
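
For reference, the smoother-with-confidence-band and the stacked box plots mentioned above are both one-liners too -- a rough sketch, again on the diamonds data (defaults may vary a little between ggplot2 versions):

    library(ggplot2)
    # a smoother through a sampled scatter plot; the confidence band is drawn by default
    small <- diamonds[sample(nrow(diamonds), 1000), ]
    ggplot(small, aes(carat, price)) + geom_point(alpha = 1/5) + geom_smooth()
    # box plots stacked up side by side: price by colour, similar medians, very different tails
    ggplot(diamonds, aes(color, price)) + geom_boxplot()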

So ggplot2 is something which-- and I'll scroll through these so you can see what kind of stuff you can do-- is a really important part of the toolbox. Here's another one I love, right? Okay, so we do lots of scatter plots, and scatter plots are really powerful.

And sometimes you actually want to see how, if the points are kind of ordered chronologically, they changed over time. So one way to do that is to connect them up with a line-- pretty bloody hard to read. So if you take this exact thing, but just add this simple thing-- set the color to be related to the year of the date-- and then bang.
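
Something along these lines, sketched on the economics data that ships with ggplot2 (the variables here are illustrative, not necessarily the ones on the slide):

    library(ggplot2)
    # connect chronologically ordered points with a path, coloured by year,
    # so you can follow the line through time instead of guessing its order
    ggplot(economics, aes(unemploy / pop, uempmed)) +
      geom_path(aes(colour = as.numeric(format(date, "%Y")))) +
      labs(colour = "year")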

Now you can see, by following the color, exactly how this is sorted. And so you can see we've got one end here, and the other end here, so ggplot2 again has done fantastic things to make us understand this data more easily. One other thing I will mention is caret.

How many people here have used the caret package? So I'm not going to show you caret, but I will tell you this. If you go into R and you type something like: some model equals train, on my data, using an SVM-- that's what a caret command looks like. You've got a command called train, and you can pass in a string which is any of, I think it's about 300, different possible models, classification and regression models.

And then you can add various things in here saying I want you to center the data first, please, and do a PCA on it first, please, and it just, you know, kind of puts all of the pieces together. It can do things like remove columns from the data which hardly vary at all and therefore aren't much use for modeling, automatically.

It can automatically remove columns from the data that are highly collinear, but most powerfully, it's got this wrapper that basically lets you take any of hundreds and hundreds of the most powerful algorithms, which are otherwise really hard to use, and they can all now be driven through one command. And here's the cool bit, right?

Imagine we're doing an SVM. I don't know how many of you have tried to do SVMs, but it's really hard to get a good result because they depend so much on the parameters. In this version, it automatically does a grid search to find the best parameters. So you just create one command and it does the SVM for you.
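
As a hedged sketch of what such a call looks like (my_predictors and my_outcome are made-up placeholders, and "svmRadial" is just one of the many model strings caret accepts):

    library(caret)
    set.seed(1)
    fit <- train(x = my_predictors, y = my_outcome,
                 method     = "svmRadial",                  # an SVM with a radial kernel
                 preProcess = c("center", "scale", "pca"),  # centre and scale, then PCA, first
                 tuneLength = 5,                            # grid-search several values of each tuning parameter
                 trControl  = trainControl(method = "cv", number = 5))  # 5-fold cross-validation
    fit$bestTune   # the parameters the grid search settled on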

So you definitely should be using caret. There's one more thing in the toolbox I wanted to mention, which is you need to use some kind of version control tool. How many people here have used a version control tool like Git, CVS, SVN? Okay, so let me give you an example from our terrific designer at Kaggle.

He's recently been changing some of the HTML on our site and he checked it into this version control tool that we use. And it's so nice, right, because I can go back to any file now and I can see exactly what was changed and when, and then I can go through and I can say, "Okay, I remember that thing broke at about this time.

What changed?" "Oh, I think it was this file here." "Okay, that line was deleted. This line was changed. This section of this line was changed." And you can see with my version control tool, it's keeping track of everything I can do. Can you see how powerful this is for modeling?

Because you go back through your submission history at Kaggle and you say, "Oh shit, I used to be getting 0.97 AUC. Now I'm getting 0.93. I'm sure I'm doing everything the same." Go back into your version control tool and have a look at the history, the commits list, and you can go back to the date where Kaggle shows you that you had a really shit-hot result and you can't now remember how the hell you did it.

And you go back to that date and you go, "Oh yeah, it's this one here," and you go and you have a look and you see what changed. And it can do all kinds of cool stuff, like it can merge back in changes from earlier commits, or you can undo the changes you made between these two dates, and so on and so forth.

And most importantly, at the end of the competition, when you win, and Anthony sends you an email and says, "Fantastic. Send us your winning model," and you go, "Oh, I don't have the winning model anymore." No problem. You can go back into your version control tool and ask for it as it was on the day that you had that fantastic answer.

So there's my toolkit. There's quite a lot of other things I wanted to show you, but I don't have time to. So what I'm going to do is jump to this interesting one, which was about predicting which grants would be successful or unsuccessful at the University of Melbourne, based on structured data about the people involved in the grant and all kinds of metadata about the application.

This one's interesting because I won it by a fair margin, kind of from 0.967 to 0.97 is kind of 25% of the available error. It's interesting to think, "What did I do right this time and how did I set this up?" Actually what I did in this was I used a random forest.

So I'm going to tell you guys a bit about random forests. What's also interesting in this is I didn't use R at all. That's not to say that R couldn't have come up with a pretty interesting answer. The guy who came second in his comp used SAS, but I think he used like 12 gig of RAM, multi-core, huge thing.

Mine ran on my laptop in two seconds. So I'll show you an approach which is very efficient as well as being very powerful. I did this all in C#. The reason that I didn't use R for this is because the data was kind of complex. Each grant had a whole bunch of people attached to it.

It was done in a denormalized form. I don't know how many of you guys are familiar with kind of normalization strategies, but basically, denormalized form basically means you had a whole bunch of information about the grant, kind of the date and blah, blah, blah. And then there was a whole bunch of columns about person one, did they have a PhD, and then there's a whole bunch of columns about person two and so forth for I think it was about 13 people.

It's a very, very difficult to model, extremely wide and extremely messy data set. It's the kind of thing that general purpose computing tools are pretty good at. So I pulled this into C# and created a grants data class, where basically I went, okay, read through this file, and I created this thing called grants data, and for each line I split it on a comma, and I added that grant to this grants data.

For those people who maybe aren't so familiar with general purpose programming languages, you might be surprised to see how readable they are. This idea that I can say: for each something in lines, dot select, split the line by comma-- if you haven't used anything like this before, you might be surprised that something like C# looks so easy.

File dot read lines, dot skip some lines-- this is just to skip the first line, the header; and in fact later on I discovered the first couple of years of data were not very predictive of today, so I actually skipped all of those. And the other nice thing about these kinds of tools is, okay, what does this dot add do?

I can hit one button and bang, I'm at the definition of dot add. These kind of IDE features are really helpful, and this is equally true of most Python and Java and C++ editors as well. So the kind of stuff that I was able to do here was to create all kinds of interesting derived variables, like here's one called max year of birth, so this one goes through all of the people on this application and finds the one with the largest year of birth.

Okay, again it's just a single line of code; if you kind of get around the curly brackets and things like that, the actual logic is extremely easy to understand, you know? Things like, do any of them have a PhD? Well, if there are no people in it, none of them do; otherwise-- oh, this is just whether one person has a PhD; down here somewhere I've got any-has-a-PhD-- bang, straight to there, there you go: does any person have a PhD?

So I created all these different derived fields, and I used pivot tables to kind of work out which ones seemed to be quite predictive before I put these together. And so what did I do with this? Well, I wanted to create a random forest from this. Now random forests are a very powerful, very general purpose tool, but the R implementation of them has some pretty nasty limitations.

For example, if you have a categorical variable, in other words a factor, it can't have any more than 32 levels. If you have a continuous variable, so like an integer or a double or whatever, it can't have any nulls. So there are these kind of nasty limitations that make it quite difficult, and it's particularly difficult to use in this case because things like the RFCD codes had hundreds and hundreds of levels, and all the continuous variables were full of nulls, and in fact if I remember correctly, even the factors aren't allowed to have nulls, which I find a bit weird because to me null is just another factor, they're male or they're female or they're unknown.

It's still something I should be able to model on. So I created a system that basically made it easy for me to set a data set up the way I wanted. So I made this decision: I decided that for doubles that had nulls in them, I created something which basically simply added two rows-- sorry, two columns-- one column which was, is that column null or not, one or zero, and another column which is the actual data from that column, whatever it was, 2.36 or whatever, and wherever there was a null, I just replaced it with the median.

So I now had two columns where I used to have one, and both of them are now modelable. Why is that, why the median? Actually it doesn't matter, because every place where this is the median there's a one over here, so in my model I'm going to use this as a predictor, and I'm going to use this as a predictor, so if all of the places where that data column was originally null meant something interesting, then it'll be picked up by the is-null version of the column.

So to me this is something which I did automatically, because it's clearly the obvious way to deal with null values. And then, as I said, for the categorical variables I just said, okay, for the factors, if there's a null, just treat it as another level. And then finally, for the factors, I said, okay, take all of the levels: if a level has more than, I think it was, 100 observations, then keep it; if it has more than 25 but less than 100 observations and it was quite predictive-- in other words, that level was different to the others in terms of application success-- then keep it; otherwise merge all the rest into one super level called "the rest".
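
Here's a minimal sketch of that recipe in R, for one numeric column and one factor column; x, age and grade are stand-in names, and it skips the "keep a 25-to-100-observation level only if it's predictive" refinement:

    # numeric column: add an is-null flag, then fill the holes with the median
    x$age_null <- as.numeric(is.na(x$age))
    x$age[is.na(x$age)] <- median(x$age, na.rm = TRUE)

    # factor column: treat null as just another level (renamed here for readability)
    x$grade <- addNA(x$grade)
    levels(x$grade)[is.na(levels(x$grade))] <- "missing"

    # rare levels: merge anything with fewer than 25 observations into one level, "the rest"
    counts <- table(x$grade)
    rare   <- names(counts)[counts < 25]
    levels(x$grade)[levels(x$grade) %in% rare] <- "the rest"

Both new columns are then modelable, and the is-null flag carries whatever signal the missingness itself had.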

So that way I basically was able to create a data set which actually I could then feed to R, although I think in this case I ended up using my own random forest implementation. So should we have a quick talk about random forests and how they work? So to me there's kind of basically two main types of model, there's these kind of parametric models, models with parameters, things where you say oh this bit's linear and this bit's interactive with this bit and this bit's kind of logarithmic and I specify how I think this system looks, and all the modeling tool does is it fills in parameters, okay this is the slope of that linear bit, this is the slope of that logarithmic bit, this is how these two things interact.

So things like GLM, very well known parametric tools, then there are these kind of non-parametric or semi-parametric models which are things where I don't do any of that, I just say here's my data, I don't know how it's related to each other, just build a model, and so things like support vector machines, neural nets, random forests, decision trees all have that kind of flexibility.

Non-parametric models are not necessarily better than parametric models. I mean, think back to that example of the R package competition, where really all I wanted was some weights to say how this max column relates-- and if all you really want is some weights, all you want is some parameters, and so a GLM is perfect.

GLMs certainly can overfit, but there are ways of creating GLMs that don't; for example, you can use stepwise regression, or, the much more fancy modern version, you can use GLMnet, which is basically another tool for doing GLMs which doesn't overfit. But any time you don't really know what the model form is, this is where you use a non-parametric tool, and random forests are great because they're super, super fast and extremely flexible, and they don't really have many parameters to tune, so they're pretty hard to get wrong.

So let me show you how that works. A random forest is simply, in fact we shouldn't even use this term random forest, because a random forest is a trademark term, so we will call it an ensemble of decision trees, and in fact the trademark term random forest, I think that was 2001, that wasn't where this ensemble of decision trees was invented, it goes all the way back to 1995.

In fact it was actually kind of independently developed by three different people in 1995, 1996 and 1997. The random forest implementation is really just one way of doing it. It all rests on a really fascinating observation, which is that if you have a model that is really, really, really shit, but it's not quite random, it's slightly better than nothing, and if you've got 10,000 of these models that are all different to each other, and they're all shit in different ways, but they're all better than nothing, the average of those 10,000 models will actually be fantastically powerful as a model of its own.

So this is the wisdom of crowds or ensemble learning techniques. You can kind of see why, because if out of these 10,000 models they're all kind of crap in different ways, they're all a bit random, they're all a bit better than nothing, 9,999 of them might basically be useless, but one of them just happened upon the true structure of the data.

So the other 9,999 will kind of average out: if they're unbiased and not correlated with each other, they'll all average out to whatever the average of the data is. So any difference in the predictions of this ensemble will all come down to that one model which happened to have actually figured it out right.

Now that's an extreme version, but that's basically the concept behind all these ensemble techniques, and if you want to invent your own ensemble technique, all you have to do is come up with some learner, some underlying model, which you can randomise in some way and each one will be a bit different, and you run it lots of times.

And generally speaking, this whole approach we call random subspace. So random subspace techniques, let me show you how unbelievably easy this is. Take any model, any kind of modelling algorithm you like. Here's our data, here's all the rows, here's all the columns. I'm now going to create a random subspace.

Some of the columns, some of the rows. So let's now build a model using that subset of rows and that subset of columns. It's not going to be as good at fitting the training data as using the full lot, but it's one way of building a model. Now let's go and build a second model.

This time I'll use this subspace, a different set of rows and a different set of columns. No, absolutely not, but I didn't want to draw 4000 lines, so let's pretend. So in fact what I'm really doing each time here is I'm pulling out a bunch of random rows and a bunch of random columns.

Correct, and this is a random subspace. It's just one way of creating a random subspace, but it's a nice easy one, and because I didn't do very well at linear algebra-- in fact I'm just a philosophy graduate, I don't know any linear algebra, I don't know what subspace means well enough to do it properly-- but this certainly works, and this is basically all random forests do.

So now imagine that we're going to do this, and for each one of these different random subspaces we're going to build a decision tree. How do we build a decision tree? Easy. Let's create some data. So let's say we've got age, sex, smoker, and lung capacity, and we want to predict people's lung capacity.

So we've got a whole bunch of data there. So to build a decision tree, let's assume that this is the particular subset of columns and rows in a random subspace, so let's build a decision tree. So to build a decision tree, what I do is I say, okay, on which variable, on which predictor, and at which point of that predictor, can I do a single split which makes the biggest difference possible in my dependent variable?

So it might turn out that if I looked at this smoker, yes, and no, that the average lung capacity for all of the smokers might be 30, and the average for all of the non-smokers might be 70. So literally all I've done is I've just gone through each of these and calculated the average for the two groups, and I've found the one split that makes that as big a difference as possible, okay?

And then I keep doing that. So in those people that are non-smokers, I now, interestingly, with the random forest or these decision tree ensemble algorithms, generally speaking, at each point, I select a different group of columns. So I randomly select a new group of columns, but I'm going to use the same rows.

I obviously have to use the same rows because I'm kind of taking them down the tree. So now it turns out that if we look at age amongst the people that are non-smokers, if you're less than 18 versus greater than 18, it's the number one biggest thing in this random subspace that makes the difference, and that's like 50, and that's like 80.

And so this is how I create a decision tree, okay? So at each point, I've taken a different random subset of columns. For the whole tree, I've used the same random subset of rows. And at the end of that, I keep going until every one of my leaves either has only one or two data points left, or all of the data points at that leaf have exactly the same outcome, the same lung capacity, for example.

And at that point, I've finished making my decision tree. So now I put that aside, and I say, okay, that is decision tree number one. Put that aside. And now go back and take a different set of rows and repeat the whole process. And that gives me decision tree number two.

And I do that 1,000 times, whatever. And at the end of that, I've now got 1,000 decision trees. And for each thing I want to predict, I then stick that thing I want to predict down every one of these decision trees. So the first thing I'm trying to predict might be, you know, a non-smoker who is 16 years old, blah, blah, blah, blah, blah, and that gives me a prediction.

So the predictions for these things at the very bottom is simply what's the average of the dependent variable, in this case, the lung capacity for that group. So that gives me 50 in decision tree one and it might be 30 in decision tree two and 14 in decision tree three.

I just take the average from all of those. And that's given me what I wanted, which is a whole bunch of independent, unbiased, not completely crap models. How not completely crap are they? Well, the nice thing is we can pick, right? If you want to be super cautious and you really need to make sure you're avoiding overfitting then what you do is you make sure your random subspaces are smaller.

You pick less rows and less columns. So then each tree is shitter than average, whereas if you want to be quick, you make each one have more rows and more columns. So it better reflects the true data that you've got. Obviously the less rows and less columns you have each time, the less powerful each tree is, and therefore the more trees you need.
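
If you want to see how little code that idea takes, here is a hedged sketch of a bare-bones random-subspace ensemble built out of rpart trees -- not the randomForest package and not the speaker's own implementation, and it re-samples the columns once per tree rather than at every split, just to keep it short. d is a stand-in data frame with a numeric response column y:

    library(rpart)

    n_trees  <- 500          # arbitrary choices for the sketch
    row_frac <- 0.2
    col_frac <- 0.5

    predictors <- setdiff(names(d), "y")
    trees      <- vector("list", n_trees)
    rows_used  <- vector("list", n_trees)

    for (i in seq_len(n_trees)) {
      rows <- sample(nrow(d), round(row_frac * nrow(d)))                  # random subset of rows
      cols <- sample(predictors, round(col_frac * length(predictors)))    # random subset of columns
      rows_used[[i]] <- rows
      # grow a deep, unpruned tree on just that subspace
      trees[[i]] <- rpart(reformulate(cols, "y"), data = d[rows, ],
                          control = rpart.control(cp = 0, minsplit = 2))
    }

    # the ensemble's prediction is just the average of all the trees' predictions
    pred <- rowMeans(sapply(trees, predict, newdata = d))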

And the nice thing about this is that building each of these trees takes like a ten-thousandth of a second, you know, or less. It depends on how much data you've got, but you can build thousands of trees in a few seconds for the kinds of datasets I look at.

So generally speaking, this isn't an issue. And here's a really cool thing, the really cool thing. In this tree I built it with these rows, which means that these rows I didn't use to build my tree, which means these rows are out of sample for that tree. And what that means is I don't need to have a separate cross-validation dataset.

What it means is I can create a table now of my full dataset, and for each one I can say, okay, row number one, how good am I at predicting row number one? Well, here's all of my trees from one to a thousand. Row number one is in fact one of the things that was included when I created tree number one.

So I won't use it here, but row number one wasn't included when I built tree two. It wasn't included in the random subspace of tree three, and it was included in the one for four. So what I do is row number one, I send down trees two and three, and I get predictions for everything that it wasn't in, and average them out, and that gives me this fantastic thing, which is an out-of-band estimate for row one, and I do that for every row.

So all of this stuff which is being predicted here is actually not using any of the data that was used to build the trees, and therefore it is truly out of sample, or out-of-bag, and therefore when I put this all together to create my final whatever it is-- AUC, or log likelihood, or SSE, or whatever-- I can then send that off to Kaggle.

Kaggle should give you pretty much the same answer, because you're by definition not overfitting. Yes, go ahead. I'm just wondering, rather than averaging out a thousand trees, is it possible to pick the one tree that actually has the best performance, and would you recommend it?

Yes, you can, no I wouldn't. So the question was, can you just pick one tree, and would that tree, picking that one tree, be better than what we've just done? And let's think about what, that's a really important question, and let's think about why that won't work. The whole purpose of this was to not overfit.

So the whole purpose of this was to say each of these trees is pretty crap, but it's better than nothing, and so we averaged them all out, it tells us something about the true data, each one can't overfit on its own. If I now go back and do anything to those trees, if I try and prune them, which is in the old-fashioned decision tree algorithms, or if I weight them, or if I pick a subset of them, I'm now introducing bias based on the training set predictivity.

So anytime I introduce bias, I now break the laws of ensemble methods fundamentally. So the other thing I'd say is there's no point, right? Because if you have something where actually you've got so much training data that out of sample isn't a big problem or whatever, you just use bigger subspaces and less trees.

And in fact, the only reason you do that is for time, and because this approach is so fast anyway, I wouldn't even bother then, you see? And the nice thing about this is, is that you can say, okay, I'm going to use kind of this many columns and this many rows in each subspace, right?

And I start building my trees, and I build tree number one, and I get this out-of-bag error. Tree number two, this out-of-bag error. Tree number three, this out-of-bag error. And the nice thing is I can watch and see, and it will be monotonic.

Well, not exactly monotonic, but kind of bumpy monotonic: it will keep getting better on average. And I can get to a point where I say, okay, that's good enough, I'll stop. And as I say, normally we're talking four or five seconds, so time's just not an issue, but if you're talking about huge data sets that you can't sample, this is a way you can watch it go.

So this is a technique that I used in the Grant's prediction competition. I did a bunch of things to make it even more random than this. One of the big problems here, both in terms of time and lack of randomness, is that all of these continuous variables, the official random forest algorithm searches through every possible breakpoint to find the very best, which means that every single time that you use that particular variable, particularly if it's in the same spot, like at the top of the tree, it's going to do the same split, right?

In the version I wrote, actually, all it does is every time it comes across a continuous variable, it randomly picks three breakpoints, so it might try 50, 70, and 90, and it just finds the best of those three. And to me, this is the secret of good ensemble algorithms, is to make every one as different to every other tree as possible.

Does the distribution of the variable you're trying to predict matter? No, not at all. So the question was, does the distribution of the dependent variable matter? And the answer is it doesn't, and the reason it doesn't is because we're using a tree. So the nice thing about a tree is, let's imagine that the dependent variable had a very long-tailed distribution, like so.

The nice thing is that, as it looks at the independent variables, it's looking at the difference in two groups, and trying to find the biggest difference between those two groups. So regardless of the distribution, it's more like a rank measure, isn't it? It's picked a particular breakpoint, and it's saying which one finds the biggest difference between the two groups.

So regardless of the distribution of the dependent variable, it's still going to find the same breakpoints, because it's really a non-parametric measure. We're using something like, for example, Gini, or some other measure of information gain, to build the decision tree. So this is true of really all decision tree approaches, in fact.

Does it work with the highly imbalanced data set? Yes and no. So the question is, does it work for a highly imbalanced data set? Sometimes some versions can, and some versions can't. The approaches which use more randomization are more likely to work okay, but the problem is in highly imbalanced data sets, you can quite quickly end up with nodes which are all the same value.

So I actually have often found I get better results if I do some stratified sampling. So, for example, think about the R competition, where most people-- other than user number five-- don't have 99% of packages installed. So in that case, I tend to say, all right, at least half of that data set is so obviously zero, let's just call it zero and work with the rest, and I do find I often get better answers.

But it does depend. Would it be better, instead of using trees for the forest, to use another algorithm in the random forest? Well, you can't call it a forest if you use an algorithm other than a tree, but yes, you can use other random subspace methods. You can use another classifier.

Yes, you absolutely can. A lot of people have been going down that path. It would have to be fast. So GLMnet would be a good example because that's very fast, but GLMnet is parametric. The nice thing about decision trees is that they're totally flexible. They don't assume any particular data structure.

They're almost unlimited in the amount of interaction that they can handle, and you can build thousands of them very quickly, but there are certainly people creating other types of random subspace ensemble methods, and I believe some of them are quite effective. Interestingly-- I can't remember where I saw it-- but I have seen some papers which show evidence that, if you've got a truly flexible underlying model and you make it random enough and you create enough of them, it doesn't really matter which one you use or how you do it, which is a nice result.

It kind of suggests that we don't have to spend lots of time trying to come up with better and better generic predictive modeling tools. If you think about it, there are "better and better" versions of this, in quotes, like rotation forests, and then there's things like GBM, gradient boosting machines, and so forth.

In practice, they can be faster for certain types of situation, but the general result here is that these ensemble methods are as flexible as you need them to be. How do you define the optimal size of the subspace? The question is how do you define the optimal size of the subspace, and that's a really tricky question.

The answer to it is really nice, and it's that you really don't have to. Generally speaking, the less rows and the less columns you use, the more trees you need, but the less you'll overfit and the better results you'll get. The nice thing normally is that for most data sets, because of the speed of random forest, you can pretty much always pick a row count and a column count that's small enough that you're absolutely sure it's going to be fine.

Sometimes it can become an issue-- maybe you've got really huge data sets, or maybe you've got really big problems with data imbalance, or hardly any training data-- and in these cases you can use the kind of approaches which would be familiar to most of us: creating a grid of a few different values of the column count and the row count, trying a few out, and watching that graph of how it improves as you add more trees.

The truth is it's so insensitive to this that if you pick a number of columns of somewhere between 10% and 50% of the total, and a number of rows of between 10% and 50% of the total, you'll be fine. You just keep adding more trees until you're sick of waiting, or the error curve has obviously flattened out.

If you do 1,000 trees, again, it really doesn't matter; it's not sensitive to that assumption on the whole. Does the R routine sample with replacement? Yeah, the R routine actually does-- so with this idea of a random subspace, there are different ways of creating the random subspace, and one key question is, can I go and pull out a row again that I've already pulled out before?

The R randomForest, and the Fortran code which it's based on, by default let you pull something out multiple times, and by default, in fact, if you've got n rows, it will pull out n rows; but because it's pulled some out multiple times, on average it will cover, I think, 63.2% of the rows.
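
That 63.2% figure drops straight out of the arithmetic of sampling with replacement: drawing n rows from n, the chance a given row is never picked is (1 - 1/n)^n, which tends to e^(-1), about 0.368, as n grows-- so on average about 1 - e^(-1), roughly 63.2%, of the distinct rows end up in the sample.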

I don't get the best results when I use that, but it doesn't matter, because in the R randomForest options you can choose whether it samples with or without replacement, and how many rows it samples. I absolutely think it makes a difference. To me, I'm sure it depends on the dataset, but I guess, I always enter Kaggle competitions which are in areas that I've never entered before, kind of domain-wise or algorithm-wise, so I guess I'd be getting a good spread of different types of situation, and in the ones I've looked at, sampling without replacement is kind of more random, and I also tend to pick a much lower n than 63.2%-- you know, I tend to use more like 10 or 20% of the data in my random subspaces.

Yeah, I know, that's just anecdotal, I guess I'd say. That's my experience, but I'm sure it depends on the dataset, and I'm not sure it's terribly sensitive to it anyway. I always split into two branches, so there are a few possibilities here as you build your decision tree.

In this case, I've got something here which is a binary variable, so obviously that has to be split into two. In this case, I've got something here which is a continuous variable; now it's been split into two, but if actually it would be optimal to split it into three, then if the variable appears again at the next level, it can always be split into another two at a different split point.

I can, absolutely I can. So it just depends whether, when I did that, remember at every level I repeat the sampling of a different bunch of columns, I could absolutely have the same column again in that group, and it could so happen that again I find the split point which is the best in that group.

If you're doing 10,000 trees with 100 levels each, it's going to happen lots of times, so the nice thing is that if the true underlying system is a single univariate logarithmic relationship, these trees will absolutely find that, eventually. Definitely don't prune the trees; if you prune the trees you introduce bias. So the key thing here, which makes them so fast and so easy but also so powerful, is you don't prune trees.

No, it doesn't necessarily, because your split point will be such that the two halves will not necessarily be balanced in count. Yeah, that's right, because in the under-18 group you could have not that many people, and in the over-18 group you can have quite a lot of people, so the weighted average of the two will come out accordingly.

Have you ever compared this with gradient boosting machines? With gradient boosting machines? Yes. Yeah, absolutely I have. Gradient boosting machines are interesting; they're a lot harder to understand. I mean, they're still basically an ensemble technique, but they're more about working with the residuals of previous models. There's a few pieces of theory around gradient boosting machines which are nicer than random forests: they ought to be faster and they ought to be more well directed, and you can do things like say, with a gradient boosting machine, this particular column has a monotonic relationship with the dependent variable, so you can actually add constraints in, which you can't do with random forests.

In my experience, I don't need the extra speed of GBMs, because I've just never found it necessary. I find them harder to use-- they've got more parameters to deal with-- so I haven't found them useful for me, and I know in a lot of data mining competitions, and also a lot of real world predictive modelling problems, people try both and end up with random forests.

Well we're probably just about out of time, so maybe if there's any more questions I can chat to you guys afterwards. Thanks very much. Bye.