
Getting In Shape For The Sport Of Data Science



00:00:00.000 | OK, so we're here at the Melbourne R meetup, and we are talking about some techniques that
00:00:09.560 | Jeremy Howard has used to do as well as he can in a variety of Kaggle competitions.
00:00:17.920 | And we're going to start by having a look at some of the tools that I've found useful
00:00:24.120 | in predictive modelling in general and in Kaggle competitions in particular.
00:00:31.480 | So I've tried to write down here what I think are some of the key steps.
00:00:37.500 | So after you download data from a Kaggle competition, you end up with CSV files, generally speaking,
00:00:46.720 | CSV files, which can be in all kinds of formats.
00:00:49.800 | So here's the first thing you see when you open up the time series CSV file.
00:00:54.480 | It's not very hopeful, is it?
00:00:57.120 | So each of these columns is actually-- oh, here we come-- is actually quarterly time
00:01:03.280 | series data.
00:01:09.360 | And so because-- well, for various reasons, each one's different lengths, and they kind
00:01:16.760 | of start further along, the particular way that this was provided didn't really suit
00:01:22.520 | the tools that I was using.
00:01:24.280 | And in fact, if I remember correctly, that's already-- I've already adjusted it slightly
00:01:30.640 | because it originally came in rows rather than columns.
00:01:34.560 | Yeah, that's right.
00:01:35.560 | This is how it originally came in rows rather than columns.
00:01:39.720 | So this is where this kind of data manipulation toolbox comes in.
00:01:45.280 | There's all kinds of ways to swap rows and columns around, which is where I started.
00:01:49.720 | The really simple approach is to select the whole lot, copy it, and then paste it and
00:01:56.320 | say transpose in Excel.
00:01:59.160 | And that's one way to do it.
00:02:02.380 | And then having done that, I ended up with something which I could open up in-- let's
00:02:10.480 | have a look at the original file.
00:02:13.440 | This is the original file in VIM, which is my text editor of choice.
00:02:20.320 | This is actually a really good time to get rid of all of those kind of bleeding commas
00:02:25.240 | because they kind of confuse me.
00:02:28.000 | So this is where stuff like VIM is great, even things like Notepad++ and Emacs and any
00:02:32.980 | of these kinds of power user text editors will work fine.
00:02:35.900 | As long as you know how to use regular expressions-- and if you don't, I'm not going to show you
00:02:41.040 | now, but you should definitely look it up.
00:02:43.640 | So in this case, I'm just going to go, OK, let's use a regular expression.
00:02:47.400 | So I say, yes, substitute, for the whole file, any number of commas, and
00:02:53.920 | replace them with nothing at all.
00:02:56.480 | And you can see, that's it in VIM, and I'm done.
00:02:59.800 | So I can now save that, and I've got a nice, easy file that I can start using.
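The exact substitute command isn't legible in the recording. Assuming the problem is runs of stray commas at the ends of lines, the same cleanup can be sketched in R with a regular expression (the file names here are made up):

# Strip any run of trailing commas from each line of the raw file
lines <- readLines("timeseries_raw.csv")
writeLines(gsub(",+$", "", lines), "timeseries_clean.csv")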
00:03:06.040 | So that's why I've listed this idea of data manipulation tools in my toolbox.
00:03:13.120 | And to me, VIM or some regular-expression-aware text editor which can handle large files
00:03:20.280 | is something to be familiar with.
00:03:22.520 | So just in case you didn't catch that, that is regular expressions.
00:03:30.880 | Probably the most powerful tool for doing text and data manipulation that I know of.
00:03:36.880 | Sometimes they're just called regexes.
00:03:40.160 | The most powerful types of regular expressions, I would say, would be the ones that are in Perl.
00:03:46.080 | They've been widely used elsewhere.
00:03:48.860 | Any C program that uses the PCRE engine has the same regular expressions as Perl, more
00:03:54.920 | or less.
00:03:55.920 | C# and .NET have the same regular expressions as Perl, more or less.
00:03:59.400 | So this is a nice example of one bunch of people getting it right, and everybody else
00:04:04.960 | plagiarizing.
00:04:05.960 | VIM's regular expressions are slightly different, unfortunately, which annoys me no end, but
00:04:11.520 | they still do the job.
00:04:14.240 | So yeah, make sure you've got a good text editor that you know well how to use.
00:04:20.800 | Something with a good macro facility is nice as well.
00:04:23.000 | Again, VIM's great for that.
00:04:24.680 | You can record a series of keystrokes and hit a button, and it repeats it basically
00:04:29.400 | on every line.
00:04:31.040 | I also wrote Perl here because, to me, Perl is a rather unloved programming language,
00:04:40.000 | but if you think back to where it comes from, it was originally developed as the Swiss Army
00:04:46.180 | chainsaw of text processing tools.
00:04:50.720 | And today, that is something it still does, I think, better than any other tool.
00:04:55.520 | It has amazing command line options you can pass to it that do things like run the following
00:05:02.740 | command on every line in a file, or run the following command on every line in a file
00:05:06.920 | and print the result.
00:05:09.440 | There's a command line option to back up each file before changing it in place.
00:05:15.120 | I find with Perl I can do stuff which would take me a much, much longer time with any
00:05:21.800 | other tool.
00:05:24.120 | Even simple little things: like, I was hacking some data on the weekend where I had to concatenate
00:05:27.900 | a whole bunch of files, but I only wanted to keep the first line from the first one, because
00:05:32.360 | they were a whole bunch of CSV files in which there was a header line I had to delete.
00:05:36.160 | So in Perl -- in fact, it's probably still going to be sitting here in my history -- so in Perl,
00:05:51.760 | that's basically: minus n means do this on every single row, minus e means I'm not even
00:05:56.720 | going to write a script file, I'm going to give you the thing to do right here on
00:06:00.240 | the command line, and here's a piece of rather difficult to comprehend Perl, but trust me,
00:06:05.800 | what it says is if the line number is greater than one, then print that line.
00:06:10.240 | So here's something to strip the first line from every file.
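The Perl itself is hard to make out in the recording. As a rough equivalent of the same idea in R (file names made up), concatenating a set of CSVs while keeping only the first file's header could look like this:

files <- c("part1.csv", "part2.csv", "part3.csv")   # hypothetical file names
out <- readLines(files[1])                          # keep the header from the first file only
for (f in files[-1])
  out <- c(out, readLines(f)[-1])                   # drop line 1 (the header) from the rest
writeLines(out, "combined.csv")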
00:06:15.240 | So this kind of stuff you can do in Perl is great, and I see a lot of people in the forums
00:06:19.800 | who complain about the format of the data wasn't quite what I expected or not quite
00:06:24.800 | convenient, can you please change it for me, and I always think, well, this is part of
00:06:28.720 | data science, this is part of data hacking, this is data munging or data manipulation.
00:06:35.240 | There's actually a really great book -- I don't know if it's hard to find nowadays,
00:06:41.680 | but I loved it -- called Data Munging with Perl, and it's a whole book about all the cool stuff
00:06:46.640 | you can do in Perl in a line or two.
00:06:52.840 | So okay, I've now got the data into a form where I can kind of load it up into some tool
00:06:57.840 | and start looking at it.
00:06:59.420 | What's the tool I normally start with?
00:07:01.520 | I normally start with Excel.
00:07:03.280 | Now, your first reaction might be to think, Excel, not so good for big files, to which
00:07:10.080 | my reaction would be, if you're just looking at the data for the first time, why are you
00:07:13.360 | looking at a big file?
00:07:15.680 | Start by sampling it.
00:07:18.600 | And again, this is the kind of thing you can do in your data manipulation piece: that thing
00:07:22.280 | I just showed you in Perl, with something like "print if rand is greater than 0.9", is going
00:07:28.240 | to sample about one row in every 10, more or less.
00:07:32.400 | So if you've got a huge data file, get it to a size that you can easily start playing
00:07:37.020 | with it, which normally means some random sampling.
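A minimal sketch of that kind of sampling in R (the talk did it as a Perl one-liner that filters lines before they ever reach Excel; the file names here are made up):

d <- read.csv("big_file.csv")
keep <- runif(nrow(d)) > 0.9                        # roughly one row in ten survives
write.csv(d[keep, ], "sample.csv", row.names = FALSE)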
00:07:40.320 | So I like to look at it in Excel, and I will show you for a particular competition how
00:07:47.480 | I go about doing that.
00:07:51.280 | So let's have a look, for example, at a couple.
00:07:58.200 | So here's one which the New South Wales government basically ran, which was to predict how long
00:08:03.680 | it's going to take cars to travel along each segment of the M4 motorway, in each direction.
00:08:12.520 | The data for this is a lot of columns, because every column is another route, and lots and
00:08:19.940 | lots of rows, because every row is another two-minute observation, and very hard to get
00:08:25.120 | a feel for what's going on. There were various terrific attempts on the
00:08:29.960 | forum at trying to create animated pictures of what the road looks like over time.
00:08:34.840 | I did something extremely low-tech -- and I'm actually proud of spending a lot of
00:08:40.480 | time doing these extremely low-tech things -- which is I created a simple little macro in
00:08:45.720 | Excel which selected each column, and then went conditional formatting, color scales,
00:08:54.600 | red to green, and I ran that on each column, and I got this picture.
00:09:00.280 | So here's each route on this road, and here's how long it took to go on that route at this
00:09:07.200 | time.
00:09:08.200 | And isn't this interesting, because I can immediately see what traffic jams look like.
00:09:14.120 | See how they kind of flow as you start getting a traffic jam here?
00:09:17.880 | They flow along the road as time goes on, and you can then start to see at what kind
00:09:23.720 | of times they happen and where they tend to start, so here's a really big jam.
00:09:28.240 | And it's interesting, isn't it?
00:09:30.280 | So if we go into Sydney in the afternoon, then obviously you start getting these jams
00:09:37.120 | up here, and as the afternoon progresses, you can see the jam moving so that at 5pm it looks
00:09:44.880 | like there's actually a couple of them, and at the other end of the road it stays jammed
00:09:50.000 | until everybody's cleared up through the freeway.
00:09:54.080 | So you get a real feel for it, and even when it's not peak hour, and even in some of the
00:09:58.520 | periods and areas which aren't so busy, you can see that's interesting.
00:10:03.000 | There are basically parts of the freeway which, out of peak hour, they're basically constant
00:10:07.280 | travel time.
00:10:08.560 | And the colours are immediately showing you.
00:10:10.760 | You see how easy it is?
00:10:12.160 | So when we actually got on the phone with the RTA to take them through the winning model,
00:10:17.320 | actually the people that won this competition were kind enough to organise a screencast
00:10:21.600 | with all the people in the RTA and from Kaggle to show the winning model.
00:10:24.960 | And the people from RTA said, "Well, this is interesting, because you tell me in your
00:10:29.880 | model," they said, "What we looked at was we basically created a model that looked at
00:10:34.760 | for a particular time, for a particular route.
00:10:37.960 | We looked at the times and routes just before and around it on both sides."
00:10:43.760 | And I remember one of the guys said, "That's weird, because normally these kind of queues
00:10:49.160 | traffic jams only go in one direction, so why would you look at both sides?"
00:10:53.600 | And so I was able to quickly say, "OK, guys, that's true, but have a look at this."
00:10:58.520 | So if you go to the other end, you can see how sometimes although queues kind of form
00:11:03.200 | in one direction, they can kind of slide away in the other direction, for example.
00:11:07.880 | So by looking at this kind of picture, you can see what your model is going to have to
00:11:13.080 | be able to model.
00:11:14.520 | So you can see what kind of inputs it's going to have and how it's going to have to be set up.
00:11:19.280 | And you can immediately see that if you created a model that basically tried to predict each
00:11:22.820 | thing based on the previous few periods of the routes around it, whatever modeling technique
00:11:32.520 | you're using, you're probably going to get a pretty good answer.
00:11:35.640 | And interestingly, the guys that won this competition, this is basically all they did,
00:11:39.640 | a really nice, simple model.
00:11:41.160 | They used random forests, as it happens, which we'll talk about soon.
00:11:46.700 | They added a couple of extra things, which was, I think, the rate of change of time.
00:11:51.000 | But that was basically it.
00:11:53.800 | So a really good example of how visualization can quite quickly tell you what you need to do.
00:11:59.960 | I'll show you another example.
00:12:03.360 | This is a recent competition that was set up by the dataists.com blog.
00:12:13.400 | And what it was, was they wanted to try and create a recommendation system for R packages.
00:12:20.160 | So they got a bunch of users to say, OK, this user, for this package, doesn't have it installed.
00:12:29.480 | This user, for this package, does have it installed.
00:12:33.040 | So you can kind of see how this is structured.
00:12:35.880 | They added a bunch of additional potential predictors for you.
00:12:39.480 | How many dependencies does this package have, how many suggestions does this package have,
00:12:43.520 | how many imports, how many of those task views on CRAN is it included in, is it a core
00:12:50.120 | package, is it a recommended package, who maintains it, and so forth.
00:12:55.920 | So I found this not particularly easy to get my head around what this looks like.
00:13:01.320 | So I used my number one most favorite tool for data visualization and ad hoc analysis,
00:13:09.040 | which is a pivot table.
00:13:10.040 | A pivot table is something which dynamically lets you slice and dice your data.
00:13:16.480 | So if you've used maybe Tableau or something like that, you'll know the feel.
00:13:21.880 | This is kind of like Tableau, except it doesn't cost a thousand dollars.
00:13:24.440 | No, I mean, Tableau's got cool stuff as well, but this is fantastic for most things I find
00:13:30.440 | I need to do.
00:13:31.560 | And so in this case, I simply dragged user ID up to the top, and I dragged package name
00:13:38.760 | down to the side, and just quickly turned this into a matrix, basically.
00:13:44.760 | And so you can see here what this data looks like, which is that those nasty people at
00:13:49.800 | dataists.com deleted a whole bunch of things in this matrix.
00:13:54.360 | So that's the stuff that they want us to predict.
00:13:57.160 | And then we can see that generally, as you expect, there's ones and zeros.
00:14:01.000 | There's some weird shit going on here where some people have things apparently there twice,
00:14:05.440 | which suggests to me maybe there's something funny with the data collection.
00:14:10.320 | And there's other interesting things. There are some things which seem to be quite widely
00:14:16.760 | installed.
00:14:18.560 | Most people don't install most packages.
00:14:22.400 | And there is this mysterious user number five, who is the world's biggest R package slut.
00:14:29.840 | He or she installs everything that they can.
00:14:33.120 | And I can only imagine that ADACGH is particularly hard to install, because not even user number
00:14:39.240 | five managed to get around to it.
00:14:42.440 | So you can see how creating a simple little picture like this, I can get a sense of what's
00:14:48.040 | going on.
00:14:52.500 | So I took that data in the R package competition, and I thought, well, if I just knew for a
00:15:00.600 | particular-- so let's say this empty cell is the one we're trying to predict.
00:15:04.400 | So if I just knew in general how commonly acceptance sampling was installed, and how
00:15:10.040 | often user number one installed stuff, I've probably got a good sense of the probability of user
00:15:16.720 | number one installing acceptance sampling.
00:15:19.800 | So to me, one of the interesting points here was to think, actually, I don't think I care
00:15:24.000 | about any of this stuff.
00:15:26.520 | So I jumped into R, and all I did was I basically said, OK, read that CSV file in.
00:15:36.960 | There's a whole bunch of rows here, because this is my entire solution.
00:15:39.080 | But I'm just going to show you the rows I used for solution number one.
00:15:42.240 | So read in the whole lot.
00:15:45.480 | Although user is a number, treat it as a factor, because user number one is not 50 times worse
00:15:50.400 | than user number 50; those trues and falses, turn them into 1 and 0 to make life a bit easier.
00:15:58.840 | And now apply the mean function to each user across their installations, and apply the
00:16:04.360 | mean function to each package across that package's installations.
00:16:08.880 | So now I've got a couple of lookups, basically, that tell me user number 50 installs this
00:16:13.360 | percent of packages, this particular package is installed by this percent of users.
00:16:20.680 | And then I just stuck them, basically, back into my file of predictors.
00:16:26.960 | So I basically did these simple lookups for each row to find lookup the user and find
00:16:32.640 | out for that row the mean for that user and the mean for that package.
00:16:39.600 | And that was actually it.
00:16:40.600 | At that point, I then created a GLM, in which obviously I had
00:16:51.800 | my ones and zeroes of installations as the thing I was predicting.
00:16:55.320 | And my first version I had UP and PP, so these two probabilities as my predictors.
00:17:01.600 | In fact, no, in the first version it was even easier than that.
00:17:06.760 | All I did, in fact, was I took the max of those two things.
00:17:12.200 | So P max, if you're not familiar with R, is just something that does a max on each row
00:17:16.280 | individually.
00:17:17.280 | In R, nearly everything works on vectors by default, except for max.
00:17:23.080 | So that's why you have to use P max.
00:17:25.040 | That's something well worth knowing.
00:17:28.120 | So I just took the max.
00:17:29.720 | So this user installs 30% of things, and this package is installed by 40% of users.
00:17:36.680 | So the max of the two is 40%.
00:17:38.640 | And I actually created a GLM with just one predictor.
00:17:42.780 | The benchmark that was created by the dataists people for this used a GLM on all of those
00:17:48.500 | predictors, including all kinds of analysis of the manual pages and maintainer
00:17:55.560 | names and God knows what, and they had an AUC of 0.8.
00:18:00.200 | This five line of code thing had an AUC of 0.95.
00:18:05.520 | So the message here is, don't overcomplicate things.
00:18:13.120 | If people give you data, don't assume that you need to use it, and look at pictures.
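Pulling those steps together, a hedged sketch of the kind of R code being described might look like this; the file name and column names are assumptions, since the transcript doesn't show the actual layout:

d <- read.csv("training_data.csv")
d$User <- as.factor(d$User)                # user IDs are labels, not quantities
d$y <- as.numeric(d$Installed)             # TRUE/FALSE -> 1/0

# Proportion of packages each user installs, and of users installing each package
user.p <- tapply(d$y, d$User, mean)
pkg.p  <- tapply(d$y, d$Package, mean)

# Look those two probabilities back up for every row
d$up <- user.p[as.character(d$User)]
d$pp <- pkg.p[as.character(d$Package)]

# First attempt: multiply them; second attempt: the row-wise max as the single predictor
d$mp <- pmax(d$up, d$pp)
fit <- glm(y ~ mp, data = d, family = binomial)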
00:18:19.560 | So if we have a look at kind of my progress in there, so here's my first attempt, which
00:18:24.640 | was basically to multiply the user probability by the package probability.
00:18:30.880 | And you can see one of the nice things in Kaggle is you get a history of your results.
00:18:34.400 | So here's my 0.84 AUC, and then I changed it to using the maximum of the two, and there's my
00:18:40.560 | 0.95 AUC.
00:18:41.560 | And I thought, oh, that was good.
00:18:45.480 | Imagine how powerful this will be when I use all that data that they gave us with a fancy
00:18:50.400 | random forest, and it went backwards.
00:18:54.600 | So you can really see that actually a bit of focused simple analysis can often take
00:18:58.900 | you a lot further.
00:19:00.880 | So if we look to the next page, we can kind of see where, you know, I kind of kept thinking
00:19:06.120 | random forests.
00:19:07.120 | They're the thing that always works, let's give it more random forests -- and that went backwards, and
00:19:10.920 | then I started adding in a few extra things.
00:19:13.280 | And then actually I thought, you know, there is one piece of data which is really useful,
00:19:18.040 | which is that dependency graph.
00:19:19.840 | If somebody has installed package A, and it depends on package B, and I know they've got
00:19:25.720 | package A, then I also know they've got package B.
00:19:28.800 | So I added that piece.
00:19:32.920 | That's the kind of thing I find a bit difficult to do in R, because I think R is a slightly
00:19:37.240 | shit programming language.
00:19:38.900 | So I did that piece in a language which I quite like, which is C#, imported it back into R, and
00:19:44.400 | then as you can see, each time I send something off to Kaggle, I generally copy and paste
00:19:49.000 | into my notes just the line of code that I ran, so I can see exactly what it was.
00:19:54.220 | So here I added this dependency graph, and I jumped up to 0.98.
00:20:00.700 | That's basically as far as I got in this competition, which was enough for sixth place.
00:20:05.920 | I made a really stupid mistake.
00:20:08.480 | Yes, if somebody has package A, and it depends on package B, then obviously that means they've
00:20:13.920 | got package B. I did that.
00:20:16.460 | If somebody doesn't have package B, and package A depends on it, then you know they definitely
00:20:20.520 | don't have package A. I forgot that piece.
00:20:23.140 | And so I went back and put that in after the competition was over, when I realized I
00:20:26.680 | had forgotten it, and I realized I could have come about second if I'd just done that.
00:20:31.000 | In fact, to get the top three in this competition, that's probably as much modeling as you needed.
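That dependency step was done in C# in the talk. Purely to illustrate the two rules (the one used and the one forgotten), here is a hedged R sketch, where deps is an assumed table of Package/DependsOn pairs and installed() is a hypothetical helper returning TRUE, FALSE or NA for what is already known about a user:

apply_dependency_rules <- function(pred, user, pkg, deps, installed) {
  # Rule 1: if the user has anything that depends on pkg, they must have pkg
  dependers <- deps$Package[deps$DependsOn == pkg]
  if (any(installed(user, dependers) %in% TRUE)) return(1)
  # Rule 2 (the piece forgotten during the competition): if pkg depends on
  # something the user definitely doesn't have, they can't have pkg
  needed <- deps$DependsOn[deps$Package == pkg]
  if (any(installed(user, needed) %in% FALSE)) return(0)
  pred                                     # otherwise keep the model's prediction
}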
00:20:38.720 | So I think you can do well in these comps without necessarily being an R expert or necessarily
00:20:43.500 | being a stats expert, but you do need to kind of dig into the toolbox appropriately.
00:20:51.960 | So let's go back to my extensive slide presentation.
00:20:58.340 | So you can see here we talked about data manipulation, about interactive analysis, we've talked
00:21:04.220 | a bit about visualizations, and I include there even simple things like those tables
00:21:10.720 | we did.
00:21:11.720 | As I just indicated, in my toolbox is some kind of general purpose programming tool.
00:21:19.960 | And to me, there's kind of three or four clear leaders in this space.
00:21:24.520 | And I know from speaking to people in the data science world, about half the people
00:21:29.160 | I speak to don't really know how to program.
00:21:33.740 | You definitely should, because otherwise all you can do is use stuff that other people
00:21:37.400 | have made for you.
00:21:38.800 | And I would be picking from these tools.
00:21:42.080 | So I like the highly misunderstood C#.
00:21:51.400 | And I would combine it with these particular libraries for, yes, question?
00:21:56.760 | Yeah, I was just wondering whether you see those as complementary or competing?
00:22:04.880 | Yeah, complementary.
00:22:05.880 | And I'll come to that in the very next bullet point, yes.
00:22:11.040 | So this general purpose programming tools is for the stuff that R doesn't do that well.
00:22:16.840 | And even the guy that wrote R, Ross Ihaka, says he's not that fond nowadays of various aspects
00:22:22.960 | of R as a kind of an underlying language.
00:22:29.200 | Whereas there are other languages which are just so powerful and so rich and so beautiful.
00:22:33.600 | I should have actually included some of the functional languages in here too, like Haskell
00:22:37.160 | would be another great choice.
00:22:39.240 | But if you've got a good powerful language, a good powerful matrix library, and a good
00:22:45.080 | powerful machine learning toolkit, you're doing great.
00:22:49.200 | So Python is fantastic.
00:22:50.720 | Python also has a really, really nice REPL.
00:22:54.280 | A REPL is like where you type in a line of code, like in R, and it immediately gives you
00:22:59.440 | the results.
00:23:00.440 | And you can keep looking through like that.
00:23:01.920 | You can use IPython, which is a really fantastic REPL for Python.
00:23:10.320 | And in fact, the other really nice thing in Python is matplotlib, which gives you a really
00:23:15.280 | nice charting library.
00:23:19.280 | Much less elegant, but just as effective for C# and just as free is the MSChart controls.
00:23:28.720 | I've written a kind of a functional layer on top of those to make them easier to do
00:23:32.120 | analysis with, but they're super fast and super powerful, so that only takes 10 minutes.
00:23:38.760 | If you use C++, that also works great.
00:23:41.080 | There's a really brilliant thing very, very underutilized called Eigen, which originally
00:23:45.440 | came from the KDE project and just provides an amazingly powerful kind of vector and scientific
00:23:52.760 | programming kind of language on top of C++.
00:23:59.040 | Java to me is something that used to be on a par with C# back in the 1.0, 1.1 days.
00:24:05.640 | It's looking a bit sad nowadays, but on the other hand, it has just about the most powerful
00:24:11.520 | general purpose machine learning library on top of it, which is Weka.
00:24:15.440 | So there's a lot to be said for using that combination.
00:24:19.320 | In the end, if you're a data scientist who doesn't yet know how to program, my message
00:24:22.240 | is: go and learn to program.
00:24:24.240 | And I don't think it matters too much, which one you pick.
00:24:26.840 | I would be picking one of these, but without it, you're going to be struggling to go beyond
00:24:32.640 | what the tools provide.
00:24:33.640 | Question at the back.
00:24:34.640 | [INAUDIBLE]
00:24:47.080 | Yeah, OK, so the question was about visualization tools, an equivalent to SAS JMP.
00:24:54.360 | Freely available.
00:24:55.360 | Yeah, I would have a look at something like GGobi.
00:25:01.480 | GGobi is a fascinating tool, which kind of has-- and, not free, but in the same kind of
00:25:11.680 | area, there's Tableau, which we talked about.
00:25:17.400 | Supports this concept of brushing, which is this idea that you can look at a whole bunch
00:25:21.600 | of plots and scatter plots and parallel coordinate plots and all kinds of plots, and you can
00:25:26.360 | highlight one area of one plot, and it will show you where those points are in all the
00:25:31.440 | other plots.
00:25:32.880 | And so in terms of really powerful visualization libraries, I think GGobi would be where I
00:25:39.800 | would go.
00:25:41.400 | Having said that, it's amazing how little I use it in real life.
00:25:47.280 | Because things like Excel and what I'm about to come to, which is ggplot2, although much
00:25:54.320 | less fancy than things like JMP and Tableau and GGobi, support a hypothesis-driven problem-solving
00:26:02.560 | approach very well.
00:26:06.480 | Something else that I do is I tend to try to create visualizations which meet my particular
00:26:14.120 | needs.
00:26:15.120 | So we talked about the time series problem.
00:26:22.200 | And the time series problem is one in which I used a very simple ten-line JavaScript piece
00:26:28.680 | of code to plot every single time series in a huge mess like this.
00:26:35.040 | Now you kind of might think, well, if you're plotting hundreds and hundreds of time series,
00:26:40.000 | how much insight are you really getting from that?
00:26:41.880 | But I found it was amazing how just scrolling through hundreds of time series, how much
00:26:48.040 | my brain picked up.
00:26:49.680 | And what I then did was when I started modeling this, was I then turned these into something
00:26:58.480 | a bit better, which was to basically repeat it, but this time I showed both the orange,
00:27:10.280 | which is the actuals, and the blues, which is my predictions.
00:27:14.640 | And then I put the metric of how successful this particular time series was.
00:27:20.880 | So I kind of found that using more focused kind of visualization development, in this
00:27:30.520 | case I could immediately see whereabouts, you know, which numbers were high, so here's
00:27:37.460 | one here, point one, that's a bit higher than the others, and I could immediately kind of
00:27:40.280 | see what I'd done wrong, and I could get a feel of how my modeling was going straight away.
00:27:45.740 | So I tend to think you don't necessarily need particularly sophisticated visualization tools,
00:27:52.480 | they just need to be fairly flexible and you need to know how to drive them to give you
00:27:56.960 | what you need.
00:27:59.400 | So through this kind of visualization, I was able to check every single chart in this
00:28:05.080 | competition; if it wasn't matching well, I'd look at it and I'd say, yeah, it's not
00:28:10.840 | matching well because there was just a shock in some period which couldn't possibly be
00:28:15.040 | predicted, so that's okay.
00:28:18.360 | And so this was one of the competitions that I won, and I really think that this visualization
00:28:22.880 | approach was key.
00:28:29.760 | So I mentioned I was going to come back to one really interesting plotting tool, which
00:28:33.720 | is ggplot2.
00:28:36.080 | ggplot2 is created by a particularly amazing New Zealander who seems to have more time
00:28:43.480 | than everybody else in the world combined and creates all these fantastic tools.
00:28:48.140 | Thank you Hadley.
00:28:49.140 | I just wanted to show you what I meant by a really powerful but kind of simple plotting
00:28:54.520 | tool.
00:28:55.520 | Here's something really fascinating, you know how creating scatter plots with lots and lots
00:28:59.680 | of data is really hard because you end up with just big black blobs.
00:29:04.440 | So here's a really simple idea, which is why don't you give each point in the data a kind
00:29:09.880 | of a level of transparency, so that the more they sit on top of each other, it's like transparent
00:29:15.760 | disks stacking up and getting darker and darker.
00:29:18.920 | So in the amazing R package called ggplot2, you can do exactly that.
00:29:24.400 | So here's something that says plot the carats of a diamond against its price, and I want
00:29:30.680 | you to vary what's called the alpha channel -- to the graphics geeks amongst you, you know
00:29:34.020 | that means kind of the level of transparency -- and I want you to basically set the alpha
00:29:39.180 | channel for each point to be one over 10, or one over 100, or one over 200.
00:29:43.640 | And you end up with these plots, which actually show you kind of the heat, you know, the amount
00:29:49.840 | of data in that area.
00:29:50.840 | And it's just so much better than any other approach to scatter plots that I've ever seen.
00:29:55.880 | So simple, and just one little line of code in your ggplot.
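Something along these lines reproduces the idea; the diamonds data set ships with ggplot2, though the exact code in the book may differ slightly:

library(ggplot2)

ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 1/100)                # try 1/10, 1/100 and 1/200 and compare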
00:30:00.160 | I'll show you another couple of examples.
00:30:01.920 | And this, by the way, is in a completely free chapter of the book that he's got up on his
00:30:05.480 | website.
00:30:06.480 | This is a fantastic book by the author of the package, about
00:30:09.920 | ggplot2 -- you should definitely buy it.
00:30:11.760 | But this, the most important chapter, is available free on his website, so check it out.
00:30:17.840 | I'll show you another couple of examples.
00:30:20.280 | Everything's done just right.
00:30:21.980 | Here's a simple approach of plotting a loess smoother through a bunch of data, always handy.
00:30:28.020 | But every time you plot something, you should see the confidence intervals.
00:30:31.200 | No problem.
00:30:32.320 | This does it by default.
00:30:34.600 | The kind of thing you want to see normally is a loess smoother.
00:30:42.560 | So if you ask for a fit, it gives you the loess smoother by default, it gives you the confidence
00:30:46.800 | interval by default.
00:30:48.320 | So it makes it hard to create really bad graphs in ggplot2, although some people have
00:30:55.360 | managed, I've noticed.
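For instance, a smoother with its confidence band, both there by default, is just one extra layer:

ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 1/100) +
  geom_smooth()                            # smoother plus confidence ribbon, with sensible defaults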
00:30:59.840 | Things like box plots all stacked up next to each other, it's such an easy way of seeing,
00:31:04.160 | in this case, how price varies with the colour of diamonds.
00:31:07.920 | They've all got roughly the same median, but some of them have really long tails in their
00:31:13.440 | prices.
00:31:14.440 | What a really powerful plotting device.
00:31:16.920 | And so impressive that in this chapter of the book, he shows a few options.
00:31:21.480 | Here's what would happen if you used a jitter approach, and he's got another one down here,
00:31:27.120 | which is, here's what would happen if you used that alpha transparency approach, and
00:31:31.040 | you can really compare the different approaches.
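The stacked box plots, and the jitter and alpha alternatives being compared, are each roughly a one-liner of this form:

ggplot(diamonds, aes(color, price)) + geom_boxplot()
ggplot(diamonds, aes(color, price)) + geom_jitter()
ggplot(diamonds, aes(color, price)) + geom_point(alpha = 1/50, position = "jitter")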
00:31:35.360 | So ggplot2 is something which, and I'll scroll through these so you can see what kind
00:31:39.320 | of stuff you can do, is a really important part of the toolbox.
00:31:44.360 | Here's another one I love, right?
00:31:46.000 | Okay, so we do lots of scatter plots, and scatter plots are really powerful.
00:31:50.920 | And sometimes you actually want to see how, if the points are kind of ordered chronologically,
00:31:55.520 | how did they change over time?
00:31:57.240 | So one way to do that is to connect them up with a line, which is pretty bloody hard to read.
00:32:01.840 | So if you take this exact thing, but just add this simple thing, set the color to be
00:32:07.800 | related to the year of the date, and then bang.
00:32:11.240 | Now you can see, by following the color, exactly how this is sorted.
00:32:17.440 | And so you can see we've got, here's one end here, here's one end here, so ggplot2 again
00:32:23.360 | has done fantastic things to make us understand this data more easily.
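This is probably close to the book's example with the economics data that ships with ggplot2 (the choice of variables here is an assumption):

ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_path(aes(colour = as.numeric(format(date, "%Y"))))   # colour encodes the year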
00:32:32.000 | One other thing I will mention is caret.
00:32:34.760 | How many people here have used the caret package?
00:32:38.760 | So I'm not going to show you caret, but I will tell you this.
00:32:42.600 | If you go into R and you type: some model equals train, on my data, using an SVM -- that's what
00:32:57.560 | a caret command looks like.
00:32:58.640 | You've got a command called train, and you can pass in a string which is any of 300 different,
00:33:04.480 | I think it's about 300 different possible models, classification and regression models.
00:33:11.840 | And then you can add various things in here about saying I want you to center the data
00:33:15.120 | first, please, and I'll do a PCA on it first, please, and it just, you know, it's kind of
00:33:23.440 | puts all of the pieces together.
00:33:25.680 | It can do things like remove columns from the data which hardly vary at all, and therefore
00:33:32.480 | use some modeling to do that automatically.
00:33:35.040 | It can automatically remove columns from the data that are highly collinear, but most powerfully
00:33:39.940 | it's got this wrapper that basically lets you take any of hundreds and hundreds of most
00:33:43.880 | powerful algorithms, really hard to use, and they all now can be done through one algorithm,
00:33:48.720 | through one command.
00:33:49.720 | And here's the cool bit, right?
00:33:51.880 | Imagine we're doing an SVM.
00:33:52.880 | I don't know how many of you have tried to do SVMs, but they're really hard to get a good result
00:33:57.800 | because they depend so much on the parameters.
00:34:00.880 | In this version, it automatically does a grid search to automatically find the best parameters.
00:34:06.120 | So you just create one command and it does the SVM for you.
00:34:10.320 | So you definitely should be using caret.
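A minimal sketch of what such a caret call can look like; the data frame and formula are made up, and the options are just examples of the features being described:

library(caret)

fit <- train(Outcome ~ ., data = my_training_data,       # hypothetical data
             method = "svmRadial",                       # one of caret's many model strings
             preProcess = c("center", "scale", "pca"),   # centre, scale, PCA first
             trControl = trainControl(method = "cv", number = 5),
             tuneLength = 5)                             # grid search over the tuning parameters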
00:34:16.720 | There's one more thing in the toolbox I wanted to mention, which is you need to use some
00:34:21.040 | kind of version control tool.
00:34:24.160 | How many people here have used a version control tool like git, CVS, SVN?
00:34:29.560 | Okay, so let me give you an example from our terrific designer at Kaggle.
00:34:40.400 | He's recently been changing some of the HTML on our site and he checked it into this version
00:34:44.240 | control tool that we use.
00:34:46.120 | And it's so nice, right, because I can go back to any file now and I can see exactly
00:34:52.840 | what was changed and when, and then I can go through and I can say, "Okay, I remember
00:34:57.960 | that thing broke at about this time.
00:35:00.160 | What changed?"
00:35:01.160 | "Oh, I think it was this file here."
00:35:04.720 | "Okay, that line was deleted.
00:35:07.560 | This line was changed.
00:35:09.440 | This section of this line was changed."
00:35:11.280 | And you can see with my version control tool, it's keeping track of everything I can do.
00:35:16.520 | Can you see how powerful this is for modeling?
00:35:19.520 | Because you go back through your submission history at Kaggle and you say, "Oh shit, I
00:35:24.120 | used to be getting 0.97 AUC.
00:35:26.200 | Now I'm getting 0.93.
00:35:27.200 | I'm sure I'm doing everything the same."
00:35:29.960 | Go back into your version control tool and have a look at the history, so the commits
00:35:38.160 | list, and you can go back to the date where Kaggle shows you that you had a really shit-hot
00:35:43.640 | result and you can't now remember how the hell you did it.
00:35:46.760 | And you go back to that date and you go, "Oh yeah, it's this one here," and you go and
00:35:51.400 | you have a look and you see what changed.
00:35:53.840 | And it can do all kinds of cool stuff like it can merge back-in results from earlier
00:35:58.360 | pushes or you can undo the change you made between these two dates, so on and so forth.
00:36:04.640 | And most importantly, at the end of the competition, when you win, and Anthony sends you an email
00:36:09.280 | and says, "Fantastic.
00:36:11.120 | Send us your winning model," and you go, "Oh, I don't have the winning model anymore."
00:36:15.080 | No problem.
00:36:16.080 | You can go back into your version control tool and ask for it as it was on the day that
00:36:21.800 | you had that fantastic answer.
00:36:27.400 | So there's my toolkit.
00:36:31.640 | There's quite a lot of other things I wanted to show you, but I don't have time to do.
00:36:34.400 | So what I'm going to do is I'm going to jump to this interesting one, which was about predicting
00:36:43.700 | which grants would be successful or unsuccessful at the University of Melbourne, based on data
00:36:49.320 | structure about the people involved in the grant and all kinds of metadata about the
00:36:54.880 | application.
00:36:57.240 | This one's interesting because I won it by a fair margin, kind of from 0.967 to 0.97 is
00:37:04.880 | kind of 25% of the available error.
00:37:07.320 | It's interesting to think, "What did I do right this time and how did I set this up?"
00:37:17.080 | Actually what I did in this was I used a random forest.
00:37:19.560 | So I'm going to tell you guys a bit about random forests.
00:37:22.480 | What's also interesting in this is I didn't use R at all.
00:37:28.480 | That's not to say that R couldn't have come up with a pretty interesting answer.
00:37:32.920 | The guy who came second in this comp used SAS, but I think he used like 12 gig of RAM, multi-core,
00:37:39.040 | huge thing.
00:37:41.720 | Mine ran on my laptop in two seconds.
00:37:44.760 | So I'll show you an approach which is very efficient as well as being very powerful.
00:38:00.840 | I did this all in C#.
00:38:04.000 | The reason that I didn't use R for this is because the data was kind of complex.
00:38:08.200 | Each grant had a whole bunch of people attached to it.
00:38:12.440 | It was done in a denormalized form.
00:38:14.680 | I don't know how many of you guys are familiar with kind of normalization strategies, but
00:38:24.600 | basically, denormalized form basically means you had a whole bunch of information about
00:38:30.120 | the grant, kind of the date and blah, blah, blah.
00:38:34.680 | And then there was a whole bunch of columns about person one, did they have a PhD, and
00:38:43.560 | then there's a whole bunch of columns about person two and so forth for I think it was
00:38:48.440 | about 13 people.
00:38:52.320 | Very, very difficult to model: an extremely wide and extremely messy data set.
00:39:01.080 | It's the kind of thing that general purpose computing tools are pretty good at.
00:39:04.720 | So I pulled this into C# and created a grants data class where basically I went, okay, read
00:39:12.960 | through this file, and I created this thing called grants data, and for each line I split
00:39:19.720 | it on a comma, and I added that grant to this grants data.
00:39:23.560 | For those people who maybe aren't so familiar with general purpose programming languages,
00:39:29.760 | you might be surprised to see how readable they are.
00:39:32.760 | This idea that I can say, for each something in lines dot select, split the line by comma -- if
00:39:39.360 | you haven't used anything like this, you might be surprised that something like C#
00:39:44.560 | looks so easy.
00:39:45.840 | File dot read lines dot skip some lines, this is just to skip the first line, the header,
00:39:51.040 | and in fact later on I discovered the first couple of years of data were not very predictive
00:39:54.960 | of today, so I actually skipped all of those.
00:40:00.200 | And the other nice thing about these kind of tools is okay, what does this dot add do?
00:40:03.960 | I can hit one button and bang, I'm at the definition of dot Add.
00:40:08.360 | These kind of IDE features are really helpful, and this is equally true of most Python and
00:40:15.040 | Java and C++ editors as well.
00:40:19.160 | So the kind of stuff that I was able to do here was to create all kinds of interesting
00:40:26.400 | derived variables, like here's one called max year birth, so this one is one that goes
00:40:33.120 | through all of the people on this application and finds the one with the largest year of
00:40:38.840 | birth.
00:40:39.840 | Okay, again it's just a single line of code, if you kind of get around the kind of curly
00:40:46.160 | brackets and things like that the actual logic is extremely easy to understand, you know?
00:40:53.280 | Things like do any of them have a PhD, well if there's no people in it, none of them do,
00:40:58.440 | otherwise, oh this is just one person has a PhD, down here somewhere I've got, and he
00:41:06.960 | has a PhD, bang, straight to there, there you go, does any person have a PhD?
00:41:13.640 | So I created all these different derived fields, I used pivot tables to kind of work out which
00:41:19.040 | ones seemed to be quite predictive before I put these together. And so what did
00:41:25.880 | I do with this?
00:41:26.880 | Well I wanted to create a random forest from this.
00:41:30.080 | Now random forests are a very powerful, very general purpose tool, but the R implementation
00:41:39.800 | of them has some pretty nasty limitations.
00:41:45.560 | For example, if you have a categorical variable, in other words a factor, it can't have any
00:41:55.800 | more than 32 levels.
00:41:59.960 | If you have a continuous variable, so like an integer or a double or whatever, it can't
00:42:08.880 | have any nulls.
00:42:10.000 | So there are these kind of nasty limitations that make it quite difficult, and it's particularly
00:42:16.160 | difficult to use in this case because things like the RFCD codes had hundreds and hundreds
00:42:20.400 | of levels, and all the continuous variables were full of nulls, and in fact if I remember
00:42:27.360 | correctly, even the factors aren't allowed to have nulls, which I find a bit weird because
00:42:33.520 | to me null is just another factor, they're male or they're female or they're unknown.
00:42:41.680 | It's still something I should get a model on.
00:42:43.720 | So I created a system that basically made it easy for me to create a data set I could model on.
00:42:54.480 | So I made this decision, I decided that for doubles that had nulls in them, I created
00:43:05.040 | something which basically simply added two rows, sorry two columns, one column which
00:43:14.920 | was is that column null or not, one or zero, and another column which is the actual data
00:43:23.280 | from that column, so whatever it was, 2.36 blah blah blah blah blah, and wherever there
00:43:33.000 | was a null, I just replaced it with the median.
00:43:39.440 | So I now had two columns where I used to have one, and both of them are now modelable.
00:43:45.720 | Why is that, why the median?
00:43:48.920 | Actually it doesn't matter, because every place where this is the median there's a one
00:43:53.960 | over here, so in my model I'm going to use this as a predictor, I'm going to use this
00:43:58.680 | as a predictor, so if all of the places that that data column was originally null all meant
00:44:04.400 | something interesting, then it'll be picked up by the "is null" version of the column.
00:44:12.640 | So to me this is something which I do which I did automatically because it's clearly the
00:44:18.520 | obvious way to deal with null values, and then as I said in the categorical variables
00:44:25.800 | I just said okay the factors, if there's a null just treat it as another level, and then
00:44:32.480 | finally for the factors I said okay, take all of the levels, and if a level has more observations
00:44:42.600 | than -- I think it was 100 -- then keep it; and if it has more observations than 25
00:44:49.400 | but less than 100, and it was quite predictive, in other words that level was different to
00:44:53.200 | the others in terms of application success, then keep it;
00:44:59.760 | otherwise merge all the rest
00:45:04.520 | into one super level called the rest.
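As a rough R sketch of those two preprocessing tricks (the thresholds are illustrative, and the keep-mid-frequency-levels-only-if-predictive refinement is left out):

# Numeric column with NAs: add an "is missing" flag, then impute the median
fix_numeric <- function(x) {
  flag <- as.numeric(is.na(x))
  x[is.na(x)] <- median(x, na.rm = TRUE)
  data.frame(value = x, is_na = flag)
}

# Factor: treat NA as its own level, and pool rare levels into "the rest"
fix_factor <- function(x, min_obs = 100) {
  x <- as.character(x)
  x[is.na(x)] <- "missing"
  keep <- names(which(table(x) >= min_obs))
  x[!(x %in% keep)] <- "the rest"
  factor(x)
}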
00:45:09.640 | So that way I basically was able to create a data set which actually I could then feed
00:45:17.760 | to R, although I think in this case I ended up using my own random forest implementation.
00:45:26.880 | So should we have a quick talk about random forests and how they work?
00:45:35.720 | So to me there's kind of basically two main types of model, there's these kind of parametric
00:45:41.760 | models, models with parameters, things where you say oh this bit's linear and this bit's
00:45:46.960 | interactive with this bit and this bit's kind of logarithmic and I specify how I think this
00:45:56.680 | system looks, and all the modeling tool does is it fills in parameters, okay this is the
00:46:01.080 | slope of that linear bit, this is the slope of that logarithmic bit, this is how these
00:46:05.160 | two things interact.
00:46:06.720 | So things like GLM, very well known parametric tools, then there are these kind of non-parametric
00:46:15.320 | or semi-parametric models which are things where I don't do any of that, I just say here's
00:46:19.600 | my data, I don't know how it's related to each other, just build a model, and so things
00:46:24.920 | like support vector machines, neural nets, random forests, decision trees all have that
00:46:33.660 | kind of flexibility.
00:46:38.080 | Non-parametric models are not necessarily better than parametric models, I mean think back
00:46:41.880 | to that example of the R package competition where really all I wanted was some weights
00:46:47.720 | to say how does this kind of this max column relate, and if all you really wanted some
00:46:51.600 | weights all you wanted some parameters, and so GLM is perfect.
00:46:58.960 | GLMs certainly can overfit, but there are ways of creating GLMs that don't, for
00:47:05.600 | example you can use stepwise regression, or the much more fancy modern version you can
00:47:12.800 | use GLMnet, which is basically another tool for doing GLMs which doesn't overfit, but
00:47:22.560 | anytime you don't really know what the model form is, this is where you use a non-parametric
00:47:26.720 | tool, and random forests are great because they're super, super fast and extremely flexible,
00:47:36.160 | and they don't really have any parameters to tune, so they're pretty hard to get
00:47:40.840 | wrong.
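As a concrete aside on that glmnet point, before coming back to the tree-based methods: a minimal sketch, where x is a numeric predictor matrix and y a 0/1 response:

library(glmnet)

fit <- cv.glmnet(x, y, family = "binomial")                 # cross-validated, regularised GLM
p <- predict(fit, newx = x, s = "lambda.min", type = "response")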
00:47:41.840 | So let me show you how that works.
00:47:44.200 | A random forest is simply, in fact we shouldn't even use this term random forest, because
00:47:50.540 | a random forest is a trademark term, so we will call it an ensemble of decision trees,
00:48:00.440 | and in fact the trademark term random forest, I think that was 2001, that wasn't where this
00:48:10.440 | ensemble of decision trees was invented, it goes all the way back to 1995.
00:48:14.560 | In fact it was actually kind of independently developed by three different people in 1995,
00:48:20.400 | 1996 and 1997.
00:48:22.800 | The random forest implementation is really just one way of doing it.
00:48:28.200 | It all rests on a really fascinating observation, which is that if you have a model that is
00:48:37.100 | really, really, really shit, but it's not quite random, it's slightly better than nothing,
00:48:45.320 | and if you've got 10,000 of these models that are all different to each other, and they're
00:48:50.840 | all shit in different ways, but they're all better than nothing, the average of those
00:48:55.760 | 10,000 models will actually be fantastically powerful as a model of its own.
00:49:03.040 | So this is the wisdom of crowds or ensemble learning techniques.
00:49:08.320 | You can kind of see why, because if out of these 10,000 models they're all kind of crap
00:49:13.680 | in different ways, they're all a bit random, they're all a bit better than nothing, 9,999
00:49:18.800 | of them might basically be useless, but one of them just happened upon the true structure
00:49:24.320 | of the data.
00:49:25.640 | So the other 9,999 will kind of average out, if they're unbiased, not correlated with
00:49:31.640 | each other, they'll all average out to whatever the average of the data is.
00:49:36.480 | So any difference in the predictions of this ensemble will all come down to that one model
00:49:42.600 | which happened to have actually figured it out right.
00:49:45.460 | Now that's an extreme version, but that's basically the concept behind all these ensemble
00:49:51.400 | techniques, and if you want to invent your own ensemble technique, all you have to do
00:49:56.160 | is come up with some learner, some underlying model, which you can randomise in some way
00:50:04.440 | and each one will be a bit different, and you run it lots of times.
00:50:08.560 | And generally speaking, this whole approach we call random subspace.
00:50:21.280 | So random subspace techniques, let me show you how unbelievably easy this is.
00:50:26.680 | Take any model, any kind of modelling algorithm you like.
00:50:31.680 | Here's our data, here's all the rows, here's all the columns.
00:50:39.280 | I'm now going to create a random subspace.
00:50:50.120 | Some of the columns, some of the rows.
00:50:53.440 | So let's now build a model using that subset of rows and that subset of columns.
00:50:59.720 | It's not going to be as perfect at recognising the training data as using the full data, but
00:51:07.960 | it's one way of building a model. Now let's go and build a second model.
00:51:13.560 | This time I'll use this subspace, a different set of rows and a different set of columns.
00:51:19.960 | No, absolutely not, but I didn't want to draw 4000 lines, so let's pretend.
00:51:28.080 | So in fact what I'm really doing each time here is I'm pulling out a bunch of random
00:51:34.080 | rows and a bunch of random columns.
00:51:38.880 | Correct, and this is a random subspace.
00:51:44.960 | It's just one way of creating a random subspace, but it's a nice easy one, and because I didn't
00:51:51.320 | do very well at linear algebra, in fact I'm just a philosophy graduate, I don't know any
00:51:54.720 | linear algebra, I don't know what subspace means well enough to do it properly, but this
00:51:59.920 | certainly works, and this is all decision trees do.
00:52:03.240 | So now I'll imagine that we're going to do this, and for each one of these different
00:52:07.240 | random subspaces we're going to build a decision tree.
00:52:10.600 | How do we build a decision tree?
00:52:12.480 | Easy.
00:52:13.480 | Let's create some data.
00:52:17.360 | So let's say we've got age, sex, smoker, and lung capacity, and we kind of predict people's
00:52:34.400 | lung capacity.
00:52:35.400 | So we've got a whole bunch of data there.
00:52:50.080 | So to build a decision tree, let's assume that this is the particular subset of columns
00:52:54.760 | and rows in a random subspace, so let's build a decision tree.
00:52:58.400 | So to build a decision tree, what I do is I say, okay, on which variable, on which predictor,
00:53:08.560 | and at which point of that predictor, can I do a single split which makes the biggest
00:53:13.920 | difference possible in my dependent variable?
00:53:17.360 | So it might turn out that if I looked at this smoker, yes, and no, that the average lung
00:53:29.320 | capacity for all of the smokers might be 30, and the average for all of the non-smokers
00:53:35.200 | might be 70.
00:53:37.060 | So literally all I've done is I've just gone through each of these and calculated the average
00:53:40.480 | for the two groups, and I've found the one split that makes that as big a difference
00:53:45.600 | as possible, okay?
00:53:48.200 | And then I keep doing that.
00:53:49.200 | So in those people that are non-smokers, I now, interestingly, with the random forest
00:53:56.960 | or these decision tree ensemble algorithms, generally speaking, at each point, I select
00:54:02.520 | a different group of columns.
00:54:04.440 | So I randomly select a new group of columns, but I'm going to use the same rows.
00:54:07.520 | I obviously have to use the same rows because I'm kind of taking them down the tree.
00:54:12.480 | So now it turns out that if we look at age amongst the people that are non-smokers, if
00:54:17.520 | you're less than 18 versus greater than 18, it's the number one biggest thing in this
00:54:22.960 | random subspace that makes the difference, and that's like 50, and that's like 80.
00:54:28.280 | And so this is how I create a decision tree, okay?
00:54:34.360 | So at each point, I've taken a different random subset of columns.
00:54:40.040 | For the whole tree, I've used the same random subset of rows.
00:54:43.680 | And at the end of that, I keep going until every one of my leaves either has only one
00:54:51.040 | or two data points left, or all of the data points at that leaf all have exactly the same
00:54:56.240 | outcome, the same lung capacity, for example.
00:55:01.080 | And at that point, I've finished making my decision tree.
00:55:03.800 | So now I put that aside, and I say, okay, that is decision tree number one.
00:55:11.040 | Put that aside.
00:55:12.540 | And now go back and take a different set of rows and repeat the whole process.
00:55:20.400 | And that gives me decision tree number two.
00:55:23.760 | And I do that 1,000 times, whatever.
00:55:27.400 | And at the end of that, I've now got 1,000 decision trees.
00:55:31.280 | And for each thing I want to predict, I then stick that thing I want to predict down every
00:55:36.000 | one of these decision trees.
00:55:37.600 | So the first thing I'm trying to predict might be, you know, a non-smoker who is 16 years
00:55:42.480 | old, blah, blah, blah, blah, blah, and that gives me a prediction.
00:55:45.400 | So the predictions for these things at the very bottom is simply what's the average of
00:55:53.520 | the dependent variable, in this case, the lung capacity for that group.
00:55:57.200 | So that gives me 50 in decision tree one and it might be 30 in decision tree two and 14
00:56:02.080 | in decision tree three.
00:56:03.080 | I just take the average from all of those.
00:56:05.400 | And that's given me what I wanted, which is a whole bunch of independent, unbiased, not
00:56:13.600 | completely crap models.
00:56:17.080 | How not completely crap are they?
00:56:18.760 | Well, the nice thing is we can pick, right?
00:56:21.800 | If you want to be super cautious and you really need to make sure you're avoiding overfitting
00:56:27.080 | then what you do is you make sure your random subspaces are smaller.
00:56:31.400 | You pick less rows and less columns.
00:56:34.280 | So then each tree is shitter than average, whereas if you want to be quick, you make
00:56:42.920 | each one have more rows and more columns.
00:56:45.920 | So it better reflects the true data that you've got.
00:56:49.640 | Obviously the less rows and less columns you have each time, the less powerful each tree
00:56:54.360 | is, and therefore the more trees you need.
00:56:58.400 | And the nice thing about this is that building each of these trees takes like a ten thousandth
00:57:04.280 | of a second, you know, or less.
00:57:06.560 | It depends on how much data you've got, but you can build thousands of trees in a few
00:57:10.040 | seconds for the kind of datasets I look at.
00:57:12.440 | So generally speaking, this isn't an issue.
00:57:17.240 | And here's the really cool thing.
00:57:21.360 | In this tree, I built it with these rows here, which means that these other rows I didn't use to build
00:57:32.840 | my tree, which means those rows are out of sample for that tree.
00:57:38.600 | And what that means is I don't need to have a separate cross-validation dataset.
00:57:43.760 | What it means is I can create a table now of my full dataset, and for each one I can say,
00:57:51.640 | okay, row number one, how good am I at predicting row number one?
00:57:55.760 | Well, here's all of my trees from one to a thousand.
00:58:01.080 | Row number one is in fact one of the things that was included when I created tree number one.
00:58:07.240 | So I won't use it here, but row number one wasn't included when I built tree two.
00:58:13.000 | It wasn't included in the random subspace of tree three, and it was included in the one
00:58:18.280 | for tree four.
00:58:18.280 | So what I do is row number one, I send down trees two and three, and I get predictions
00:58:26.720 | for everything that it wasn't in, and average them out, and that gives me this fantastic
00:58:35.240 | thing, which is an out-of-bag estimate for row one, and I do that for every row.
00:58:45.360 | So all of this stuff which is being predicted here is actually not using
00:58:50.120 | any of the data that was used to build the trees, and therefore it is truly out of sample
00:58:53.800 | or out-of-bag, and therefore when I put this all together to create my final whatever it
00:58:58.520 | is, AUC, or log likelihood, or SSE, or whatever, I then send that off to Kaggle.
00:59:06.640 | Kaggle should give you pretty much the same answer, because you're by definition not overfitting.
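Continuing the same sketch, the out-of-bag predictions are stored on the fitted object, so the honest error estimate comes for free:

    # Each row is predicted only by the trees that never saw it (out-of-bag),
    # so this is effectively cross-validation without a separate holdout set.
    oob_pred <- fit$predicted
    oob_rmse <- sqrt(mean((train$lung_capacity - oob_pred)^2))
    in_rmse  <- sqrt(mean((train$lung_capacity - predict(fit, train))^2))
    c(oob = oob_rmse, in_sample = in_rmse)  # the OOB figure is the one to trust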
00:59:12.160 | Yes, go ahead.
00:59:13.160 | I'm just wondering, rather than averaging out a thousand trees, is it
00:59:18.160 | possible to pick the one tree that actually has the
00:59:24.760 | best performance, and would you recommend it?
00:59:25.760 | Yes, you can, no I wouldn't.
00:59:26.960 | So the question was, can you just pick one tree, and would that tree, picking that one
00:59:31.880 | tree, be better than what we've just done?
00:59:34.320 | And let's think about what, that's a really important question, and let's think about
00:59:38.360 | why that won't work.
00:59:40.160 | The whole purpose of this was to not overfit.
00:59:43.680 | So the whole purpose of this was to say each of these trees is pretty crap, but it's better
00:59:50.000 | than nothing, and so we averaged them all out, it tells us something about the true
00:59:53.960 | data, each one can't overfit on its own.
00:59:56.960 | If I now go back and do anything to those trees, if I try and prune them, which is in
01:00:01.680 | the old-fashioned decision tree algorithms, or if I weight them, or if I pick a subset
01:00:07.000 | of them, I'm now introducing bias based on the training set predictivity.
01:00:13.840 | So anytime I introduce bias, I now break the laws of ensemble methods fundamentally.
01:00:19.840 | So the other thing I'd say is there's no point, right?
01:00:25.280 | Because if you have something where actually you've got so much training data that out
01:00:33.880 | of sample isn't a big problem or whatever, you just use bigger subspaces and less trees.
01:00:40.840 | And in fact, the only reason you do that is for time, and because this approach is so
01:00:44.040 | fast anyway, I wouldn't even bother then, you see?
01:00:47.920 | And the nice thing about this is that you can say, okay, I'm going to use kind of
01:00:54.240 | this many columns and this many rows in each subspace, right?
01:00:58.240 | And I've got to start building my trees, and I build tree number one, and I get this out-
01:01:03.160 | of-bag error.
01:01:04.480 | Tree number two, this out-of-bag error.
01:01:06.000 | Tree number three, this out-of-bag error.
01:01:07.760 | And the nice thing is I can watch and see, and it will be monotonic.
01:01:12.840 | Well, not exactly monotonic, but kind of bumpy monotonic, it will keep getting better on
01:01:17.880 | average.
01:01:18.880 | And I can get to a point where I say, okay, that's good enough, I'll stop.
01:01:22.160 | And as I say, normally it's taking four or five seconds, so time's just not an issue,
01:01:27.040 | but if you're talking about huge data sets that you can't sample, this is a way you can
01:01:32.040 | watch it go.
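With the randomForest package, the same idea of watching the error improve as trees are added is available directly, since the fitted object keeps the cumulative out-of-bag error after each tree (continuing the earlier sketch):

    # Cumulative OOB mean squared error after 1, 2, ..., ntree trees.
    head(fit$mse)      # error after the first few trees
    tail(fit$mse, 1)   # error using all trees
    plot(fit)          # bumpy but roughly monotonic decrease; stop when it flattens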
01:01:35.400 | So this is a technique that I used in the grants prediction competition.
01:01:39.840 | I did a bunch of things to make it even more random than this.
01:01:46.120 | One of the big problems here, both in terms of time and lack of randomness, is that for all
01:01:50.320 | of these continuous variables, the official random forest algorithm searches through every
01:01:56.960 | possible breakpoint to find the very best, which means that every single time that you
01:02:02.520 | use that particular variable, particularly if it's in the same spot, like at the top
01:02:07.680 | of the tree, it's going to do the same split, right?
01:02:11.080 | In the version I wrote, actually, all it does is every time it comes across a continuous
01:02:15.800 | variable, it randomly picks three breakpoints, so it might try 50, 70, and 90, and it just
01:02:24.360 | finds the best of those three.
01:02:26.320 | And to me, this is the secret of good ensemble algorithms, is to make every one as different
01:02:31.200 | to every other tree as possible.
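A minimal sketch of that random-breakpoint idea, written for this transcript rather than taken from the actual competition code: instead of searching every possible cut point of a continuous variable, score three random candidates and keep the best.

    # Score a split by the sum of squared errors within the two groups it creates.
    best_random_split <- function(x, y, n_candidates = 3) {
      candidates <- sample(unique(x), min(n_candidates, length(unique(x))))
      score_split <- function(cut) {
        left  <- y[x <  cut]
        right <- y[x >= cut]
        if (length(left) == 0 || length(right) == 0) return(Inf)
        sum((left - mean(left))^2) + sum((right - mean(right))^2)
      }
      scores <- sapply(candidates, score_split)
      list(breakpoint = candidates[which.min(scores)], sse = min(scores))
    }

    # Example with data shaped like the age / lung capacity story above.
    set.seed(1)
    x <- runif(200, 0, 100)
    y <- ifelse(x > 18, 80, 50) + rnorm(200, sd = 5)
    best_random_split(x, y)   # keeps the best of the three random candidates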
01:02:34.520 | Does the distribution of the variable you're trying to predict matter?
01:02:39.200 | No, not at all.
01:02:40.920 | So the question was, does the distribution of the dependent variable matter?
01:02:45.920 | And the answer is it doesn't, and the reason it doesn't is because we're using a tree.
01:02:50.380 | So the nice thing about a tree is, let's imagine that the dependent variable had maybe a
01:02:59.340 | very long-tailed distribution, like so.
01:03:04.000 | The nice thing is that, as it looks at the independent variables, it's looking at the
01:03:08.800 | difference in two groups, and trying to find the biggest difference between those two groups.
01:03:13.600 | So regardless of the distribution, it's more like a rank measure, isn't it?
01:03:18.780 | It's picked a particular breakpoint, and it's saying which one finds the biggest difference
01:03:23.100 | between the two groups.
01:03:24.680 | So regardless of the distribution of the dependent variable, it's still going to find the same
01:03:28.280 | breakpoints, because it's really a non-parametric measure.
01:03:33.280 | We're using something like, for example, Gini, or some other measure of the information
01:03:40.440 | gain of that split to build the decision tree.
01:03:42.800 | So this is true of really all decision tree approaches, in fact.
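For reference, a tiny sketch of the Gini impurity mentioned above: the score of a candidate split depends only on the class proportions in the two groups it creates, which is why the shape of the distribution matters so little. The data here is made up for illustration.

    # Gini impurity of a set of labels, and the weighted impurity of a split.
    gini <- function(labels) {
      p <- table(labels) / length(labels)
      1 - sum(p^2)
    }
    split_gini <- function(labels, go_left) {
      w_left <- mean(go_left)
      w_left * gini(labels[go_left]) + (1 - w_left) * gini(labels[!go_left])
    }

    # Score the split "age < 18" for a toy binary outcome.
    age     <- c(12, 15, 20, 40, 65, 17, 30)
    outcome <- c("low", "low", "high", "high", "high", "low", "high")
    split_gini(outcome, age < 18)   # lower is better (purer groups)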
01:03:48.040 | Does it work with the highly imbalanced data set?
01:03:53.040 | Yes and no.
01:03:54.560 | So the question is, does it work for a highly imbalanced data set?
01:03:58.640 | Sometimes some versions can, and some versions can't.
01:04:02.560 | The approaches which use more randomization are more likely to work okay, but the problem
01:04:07.120 | is in highly imbalanced data sets, you can quite quickly end up with nodes which are
01:04:12.440 | all the same value.
01:04:14.200 | So I actually have often found I get better results if I do some stratified sampling,
01:04:20.120 | so that, for example, think about the R package recommendation competition, where, other
01:04:27.240 | than the most popular handful of packages, most people don't have 99% of packages installed.
01:04:31.340 | So in that case, I tend to say, all right, at least half of that data set is so obviously
01:04:36.480 | zero, let's just call it zero and just work with the rest, and I do find I often get better
01:04:42.000 | answers.
01:04:43.000 | But it does depend.
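One way to do that kind of stratified, down-sampled fitting with the R randomForest package; this is a sketch with simulated data, not the actual competition setup:

    library(randomForest)

    # Simulated, highly imbalanced binary outcome (roughly 1% positive).
    set.seed(7)
    n <- 20000
    X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    y <- factor(ifelse(X$x1 + X$x2 + rnorm(n) > 4, "yes", "no"))
    table(y)

    # Stratified sampling: draw a fixed number of rows from each class for
    # every tree, so the trees see a much more balanced picture than 99:1.
    fit_bal <- randomForest(x = X, y = y, ntree = 500,
                            strata   = y,
                            sampsize = c(no = 500, yes = min(500, sum(y == "yes"))),
                            replace  = FALSE)
    fit_bal$confusion   # OOB confusion matrix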
01:04:44.560 | Would it be better, instead of using a tree for the forest, to use another algorithm in
01:04:50.160 | the random forest?
01:04:52.160 | Well, you can't call it a forest if you use a different algorithm other than a tree, but
01:04:56.560 | yes, you can use other random subspace methods.
01:04:59.280 | You can use another classifier.
01:05:01.000 | Yes, you absolutely can.
01:05:02.000 | A lot of people have been going down that path.
01:05:06.080 | It would have to be fast.
01:05:07.680 | So GLMnet would be a good example because that's very fast, but GLMnet is parametric.
01:05:16.460 | The nice thing about decision trees is that they're totally flexible.
01:05:21.080 | They don't assume any particular data structure.
01:05:23.520 | They kind of are almost unlimited in the amount of interactions that they can handle, and
01:05:29.960 | you can build thousands of them very quickly, but there are certainly people who are creating
01:05:33.720 | other types of random subspace ensemble methods, and I believe some of them are quite effective.
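As an illustration of that idea, here is a sketch, written for this transcript, of a random-subspace ensemble that uses glmnet rather than a tree as the base learner:

    library(glmnet)

    # Simulated data: 2,000 rows, 50 columns, only two of which matter.
    set.seed(3)
    n <- 2000; p <- 50
    X <- matrix(rnorm(n * p), n, p)
    y <- X[, 1] - 2 * X[, 2] + rnorm(n)
    train_idx <- 1:1500
    test_idx  <- 1501:2000

    # Each model sees a random subset of rows and columns; predictions are
    # simply averaged across the ensemble.
    n_models <- 50
    preds <- matrix(NA, length(test_idx), n_models)
    for (b in 1:n_models) {
      rows <- sample(train_idx, 500)
      cols <- sample(p, 15)
      m    <- cv.glmnet(X[rows, cols], y[rows])
      preds[, b] <- predict(m, newx = X[test_idx, cols], s = "lambda.min")
    }
    ensemble_pred <- rowMeans(preds)
    sqrt(mean((y[test_idx] - ensemble_pred)^2))   # test RMSE of the averaged models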
01:05:41.080 | Interestingly, I can't remember where I saw it, but I have seen some papers which show
01:05:46.400 | evidence that, as long as you've got a truly flexible underlying model
01:05:53.240 | and you make it random enough and you create enough of them, it doesn't really matter which
01:05:58.380 | one you use or how you do it, which is a nice result.
01:06:01.440 | It kind of suggests that we don't have to spend lots of time trying to come up with
01:06:05.680 | better and better and better generic predictive modeling tools.
01:06:11.880 | If you think about it, there are "better" versions of this, in quotes, like
01:06:15.240 | rotation forests, and then there's things like GBM, gradient boosting machines, and
01:06:19.320 | so forth.
01:06:20.320 | In practice, they can be faster for certain types of situation, but the general result
01:06:26.640 | here is that these ensemble methods are as flexible as you need them to be.
01:06:32.400 | How do you define the optimal size of the subspace?
01:06:37.740 | The question is how do you define the optimal size of the subspace, and that's a really
01:06:41.440 | tricky question.
01:06:43.240 | The answer to it is really nice, and it's that you really don't have to.
01:06:49.600 | Generally speaking, the less rows and the less columns you use, the more trees you need,
01:06:56.520 | but the less you'll overfit and the better results you'll get.
01:07:02.780 | The nice thing normally is that for most data sets, because of the speed of random forest,
01:07:07.200 | you can pretty much always pick a row count and a column count that's small enough that
01:07:12.280 | you're absolutely sure it's going to be fine.
01:07:16.440 | Sometimes it can become an issue, maybe you've got really huge data sets, or maybe you've
01:07:20.880 | got really big problems with data imbalances, or hardly any training data, and in these cases
01:07:27.440 | you can use the kind of approaches which would be familiar to most of us around creating
01:07:33.400 | a grid of a few different values of the column count and the row count, and trying a few
01:07:39.600 | out, and watching that graph of as you add more trees, how does it improve.
01:07:46.120 | The truth is it's so insensitive to this that if you pick a number of columns of somewhere
01:07:56.520 | between 10% and 50% of the total, and a number of rows of between 10% and 50% of the total,
01:08:08.360 | you'll be fine.
01:08:09.360 | You just keep adding more trees until you're sick of waiting, or the error has obviously flattened out.
01:08:14.600 | If you do 1,000 trees, again, it really doesn't matter; it seems it's
01:08:22.200 | not sensitive to that assumption on the whole.
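A rough version of that grid, reusing the simulated train data from the earlier sketch; within the 10 to 50% range the final out-of-bag error usually moves very little:

    # Try a few subspace sizes and record the final OOB error for each.
    grid <- expand.grid(col_frac = c(0.1, 0.3, 0.5), row_frac = c(0.1, 0.3, 0.5))
    grid$oob_mse <- apply(grid, 1, function(g) {
      f <- randomForest(lung_capacity ~ ., data = train, ntree = 500,
                        mtry     = max(1, floor(g["col_frac"] * (ncol(train) - 1))),
                        sampsize = floor(g["row_frac"] * nrow(train)),
                        replace  = FALSE)
      tail(f$mse, 1)
    })
    grid   # the differences are typically small compared to adding more trees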
01:08:25.400 | The R routine samples with replacement, doesn't it?
01:08:28.720 | Yeah, the R routine actually, so this idea of a random subspace, there are different ways
01:08:35.240 | of creating this random subspace, and one key one is, can I go and pull out a row again
01:08:42.800 | that I've already pulled out before.
01:08:46.160 | The R random forest, and the Fortran code it's based on, by default let you pull something
01:08:51.920 | out multiple times, and by default, in fact, pull out, if you've got n rows, it will pull
01:08:57.960 | out n rows, but because it's pulled out some multiple times, on average, it will cover
01:09:03.000 | I think 63.2% of the rows.
01:09:08.440 | I don't find I get the best results when I use that, but it doesn't matter because in the R random
01:09:14.600 | forest options, you can choose whether it's with or without replacement, and how many rows
01:09:18.560 | to sample.
01:09:19.560 | I absolutely think it makes a difference.
01:09:24.080 | To me, I'm sure it depends on the dataset, but I guess, I always enter Kaggle competitions
01:09:30.680 | which are in areas that I've never entered before, kind of domain-wise or algorithm-wise,
01:09:36.680 | so I guess I'd be getting a good spread of different types of situation, and in the ones
01:09:41.440 | I've looked at, sampling without replacement is kind of more random, and I also tend to
01:09:48.400 | pick much lower n than 63.2%, you know, I tend to use more like 10 or 20% of the data
01:09:54.600 | in my random subspaces.
01:09:56.640 | Yeah, I know the concepts, I guess I can say it.
01:10:10.000 | That's my experience, but I'm sure it depends on the dataset, and I'm not sure it's terribly
01:10:13.880 | sensitive to it anyway.
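The 63.2% figure mentioned above is just 1 - 1/e, and the with/without replacement behaviour is a single option in the randomForest call:

    # Sampling n rows with replacement from n rows leaves any given row out
    # with probability (1 - 1/n)^n, which tends to 1/e.
    n <- 10000
    1 - (1 - 1/n)^n   # about 0.632
    1 - exp(-1)       # the limiting value

    # To sample without replacement and use a smaller fraction of the rows
    # (as described above), set for example:
    #   randomForest(..., replace = FALSE, sampsize = floor(0.2 * nrow(train)))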
01:10:22.160 | I always split it into two branches, so there's a few possibilities here as you go down your
01:10:28.000 | decision tree.
01:10:30.720 | In this case, I've got something here which is a binary variable, so obviously that has
01:10:37.240 | to be split into two.
01:10:38.240 | In this case, I've got something here which is a continuous variable, and it's been split
01:10:43.160 | into two, but if actually it's going to be optimal to split it into three, then if the
01:10:49.320 | variable appears again at the next level, it can always split it into another two at that
01:10:53.480 | other split point.
01:10:55.480 | I can, absolutely I can.
01:11:01.400 | So it just depends whether, when I did that, remember at every level I repeat the sampling
01:11:07.880 | of a different bunch of columns, I could absolutely have the same column again in that group,
01:11:13.320 | and it could so happen that again I find the split point which is the best in that group.
01:11:17.280 | If you're doing 10,000 trees with 100 levels each it's going to happen lots of times, so
01:11:22.800 | the nice thing is that if the true underlying system is a single univariate logarithmic
01:11:30.480 | relationship, these trees will absolutely find that, eventually.
01:11:40.460 | Definitely don't prune the trees; if you prune the trees you introduce bias, so the key thing
01:11:45.760 | here which makes them so fast and so easy but also so powerful is you don't prune trees.
01:11:57.860 | No it doesn't necessarily, because your split point will be such that the two halves will
01:12:07.000 | not necessarily be balanced in count.
01:12:13.080 | Yeah that's right, because in the under 18 group you could have not that many people,
01:12:21.360 | and in the over 18 group you can have quite a lot of people, so the weighted average of
01:12:25.120 | the two takes that into account.
01:12:29.480 | Have you ever compared this with gradient boosting machines?
01:12:36.480 | With gradient boosting machines?
01:12:38.480 | Yeah absolutely I have.
01:12:39.480 | Gradient boosting machines are interesting, they're a lot harder to understand, gradient
01:12:45.280 | boosting machines, I mean they're still basically an ensemble technique, and they work more
01:12:50.840 | with the residuals of previous models.
01:12:55.680 | There's a few pieces of theory around gradient boosting machines which are nicer than random
01:12:59.240 | forests, they ought to be faster and they ought to be better directed, and you can do things
01:13:04.800 | like say with a gradient boosting machine, this particular column has a monotonic relationship
01:13:09.680 | with a dependent variable, so you can actually add constraints in which you can't do with
01:13:14.680 | random forests.
01:13:15.680 | In my experience I don't need the extra speed of GBMs because I just never have found it
01:13:22.760 | necessary.
01:13:23.760 | I find them harder to use; they've got more parameters to deal with, so I haven't found them useful
01:13:29.960 | for me, and I know in a lot of data mining competitions and also a lot of real-world predictive
01:13:35.560 | modelling problems, people try both and end up with random forests.
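For the monotonic-constraint point, a sketch with the R gbm package and simulated data, just to show the kind of constraint being described:

    library(gbm)

    # x1 is forced to have a monotone increasing relationship with the
    # outcome; x2 is left unconstrained. Random forests have no equivalent.
    set.seed(5)
    n <- 2000
    d <- data.frame(x1 = runif(n), x2 = rnorm(n))
    d$y <- log1p(5 * d$x1) + 0.5 * d$x2 + rnorm(n, sd = 0.2)

    fit_gbm <- gbm(y ~ x1 + x2, data = d,
                   distribution = "gaussian",
                   n.trees = 2000, shrinkage = 0.01, interaction.depth = 3,
                   var.monotone = c(1, 0))   # +1 increasing, -1 decreasing, 0 free
    summary(fit_gbm, plotit = FALSE)         # relative influence of each predictor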
01:13:39.720 | Well we're probably just about out of time, so maybe if there's any more questions I can
01:13:44.880 | chat to you guys afterwards, thanks very much.
01:13:49.760 | [END]