fastai v2 walk-thru #8
Chapters
32:41 split the training set and the validation
40:41 create a categorical column in pandas
46:31 set up your processes
Okay. There we go. Hi, everybody. Can you see me? Can you hear me? Okay. So, I'm going to 00:00:29.920 |
turn it off. Okay. Great. Hi. Hope you're ready for Tabular. Oh, and we have Andrew 00:00:51.320 |
Shaw here today as well. Andrew just joined WAMRI, a medical AI research institute, as Research 00:00:59.280 |
Director, and he is the person you may be familiar with from MusicAutobot, but if you 00:01:09.620 |
haven't checked it out, you should, because it's super, super cool. Now he's moving from 00:01:15.520 |
the Jackson 5 to Medical AI Research. All right. So, Tabular is a cool notebook, I think. 00:01:39.960 |
It's a bit of fun. And the basic idea, well, let's start at the end to see how it's going 00:01:51.160 |
to look. So, we're going to look at the adult dataset, which is the one that we use in most 00:01:57.840 |
of the docs in version one, and we used in some of the lessons, I think. It's a pretty 00:02:02.840 |
simple, small dataset. It's got about 32,000 rows and 15 columns. And here's what it 00:02:32.320 |
looks like when we grab it straight from pandas. So, basically, 00:02:45.280 |
in order to create models from this, we need to take the categorical variables and convert 00:02:54.720 |
them into ints, with a possible missing category; and for the continuous variables, if there's 00:03:06.800 |
a missing value in a continuous variable, we need to replace it with something, so normally 00:03:13.960 |
we replace it with something like the median, and then normally we add an additional column 00:03:19.720 |
for each column that has a missing value, and 00:03:23.720 |
that column will be binary: is it missing or not. So, we need to know which 00:03:29.320 |
things are going to be the categorical variables, which column names are going to be the continuous 00:03:35.000 |
variables, so we know how to process each one, we're going to need to know how to split 00:03:42.920 |
our validation and training set, and we need to know what's our dependent variable. So, 00:03:50.560 |
we've created a class called Tabular. Basically, Tabular contains a data frame, and it also 00:04:04.240 |
contains a list of which things are categorical, which things are continuous, and what's your 00:04:10.800 |
dependent variable, and also some processes, which we'll look at in a moment, where they 00:04:18.320 |
do things like turning strings into ints for categories, and filling the missing data, 00:04:24.160 |
and doing normalization of continuous. So that creates a Tabular object, and from a Tabular 00:04:32.640 |
object you can get a data source if you pass in a list of split indexes. So, feel free to 00:04:39.560 |
ask also Andrew if you have any questions as we go. Oh, what's that David? Do you want 00:04:47.840 |
a jagged competition? What's a jagged competition? Tell us more. Oh, Kaggle, which one? So now 00:04:59.440 |
that we've got a data source, we created data loader from it, and so we have a little wrapper 00:05:05.760 |
for that to make it easier for Tabular. And so then we can go show batch. Oh, it's broken 00:05:13.760 |
somehow. Nice one, Jeremy. Damn it. This was all working a moment ago. And then Andrew and 00:05:24.800 |
I were just changing things at the last minute, so what did we break? Okay, I am going to 00:05:40.800 |
ignore that. Alright, so we just broke something apparently before we started recording, but 00:05:50.640 |
anyway show batch would then show the data. And then you can take a test set that's not 00:05:58.320 |
processed. And basically what you want to be able to do with a test set is say I have 00:06:03.200 |
the same categorical names and continuous names and the same dependent variable and 00:06:06.800 |
the same processes, and you should then be able to apply the same pre-processing to the test set. 00:06:13.120 |
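That preprocessing can be sketched in plain pandas. This is just an illustration with made-up data, not the notebook's code; the column names and the `#na` convention mirror what was just described:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "workclass": ["Private", "State-gov", None, "Private"],  # categorical
    "age": [39.0, np.nan, 50.0, 28.0],                       # continuous
})

# Categorical -> ints, reserving 0 for missing/unknown ('#na')
vocab = ["#na"] + sorted(v for v in df["workclass"].dropna().unique())
o2i = {v: i for i, v in enumerate(vocab)}
df["workclass"] = df["workclass"].map(lambda v: o2i.get(v, 0))

# Continuous -> add a binary "was missing" column, then fill with the median
df["age_na"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())
```

Applying the same `vocab`, `o2i` and stored median to a test frame is exactly the "same pre-processing on the test set" idea.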
And so to.new creates a new Tabular object with all that metadata, and then process will do the 00:06:20.000 |
same processing. And so then you can see here, here's the normalized age, normalized education, 00:06:27.680 |
all these things have been turned into ints. Oh, and so forth. Excuse me. So that's basically what 00:06:37.120 |
this looks like. You'll see I've used a subclass of Tabular called TabularPandas. I don't know 00:06:46.080 |
if it's always going to stay this way or not, but currently, well we want to be able to support 00:06:51.040 |
multiple different backends. So currently we have a Pandas backend for Tabular and we're also 00:06:57.520 |
working on a RAPIDS backend. For those of you that haven't seen it, RAPIDS is a really great project 00:07:03.520 |
coming out of NVIDIA; they've got cuDF, which basically 00:07:09.440 |
gives you GPU-accelerated data frames. So that's why we have different subclasses. 00:07:20.320 |
I don't know if this will always be a subclass of Tabular. We may end up instead having 00:07:23.680 |
subclasses of DataSource, but for now we've got subclasses of Tabular. So that's where we're 00:07:29.360 |
heading. All right. Do you have any advice to speed up inference on a tabular learner's predictions? 00:07:40.240 |
Well, hopefully they'll be fast, particularly if you use RAPIDS. I mean, yeah, 00:07:45.600 |
basically, if you use RAPIDS. In fact, if you look at "RAPIDS NVIDIA fast.ai", hopefully, 00:07:54.880 |
yeah, here we go. So Even, who a lot of you will remember from the forums, is at NVIDIA now, 00:08:04.480 |
which is pretty awesome, and he's working on the RAPIDS team and he recently posted how he got a 00:08:10.960 |
15x acceleration and got in the top 20 of a very popular competition on Kaggle 00:08:18.800 |
by combining RAPIDS, PyTorch and fast.ai. So that would be a good place to start, but hopefully by 00:08:24.480 |
the time fast.ai version 2 comes out this would be super simple, and we're working with Even on 00:08:30.000 |
making that happen. So thank you, Even and NVIDIA for that help. 00:08:37.360 |
All right. So let's start at tabular. So basically, the idea of tabular is that we 00:08:50.560 |
have a URL. Oh, yes. I just googled for NVIDIA RAPIDS fast.ai. And here it is: accelerating 00:09:09.200 |
deep learning recommendation systems. I'm sure somebody will hopefully add it to the notes 00:09:12.800 |
as well. Okay. So the basic idea with tabular was I kind of wanted to, I kind of like to have a 00:09:28.160 |
class which kind of has all the information it needs to do what we want it to do. So in this case, 00:09:36.400 |
you know, a data frame doesn't have enough information to actually build models because 00:09:45.520 |
until you know what the categorical variables are, the continuous variables are and what pre-processing 00:09:50.480 |
to do and what the dependent variable is, you know, you can't really do much. So that's the 00:09:55.920 |
basic kind of idea of this design was to have something with that information. So categorical 00:10:03.600 |
names, continuous names, y names (normally it's just one dependent variable, so one y name, 00:10:10.560 |
but you could have more), and processes. So those are basically the four things that we want to 00:10:16.960 |
put in our tabular. So -- oh, that was crazy. There we go. 00:10:44.560 |
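In sketch form (this is a toy stand-in, not the real fastai source), an init that stores those four things alongside the DataFrame looks like:

```python
import pandas as pd

class MiniTabular:
    # Toy version of the Tabular design described above: a DataFrame plus
    # the metadata needed to model it (names mirror the real class).
    def __init__(self, df, procs=None, cat_names=None,
                 cont_names=None, y_names=None):
        self.items = df
        self.procs = list(procs or [])
        self.cat_names = list(cat_names or [])
        self.cont_names = list(cont_names or [])
        self.y_names = list(y_names or [])

to = MiniTabular(pd.DataFrame({"a": [0, 1], "b": [1.5, 2.5], "y": [0, 1]}),
                 cat_names=["a"], cont_names=["b"], y_names=["y"])
```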
All right. So that's why we're passing in those four things. And then the other thing 00:10:57.120 |
we need to know about y is whether it's categorical or continuous, so you can just pass that in 00:11:03.440 |
as a boolean. So Aman wants to know why might we need more than one y-name. So a few examples: 00:11:11.040 |
you could be doing a regression problem where you're trying to predict the x and y coordinates 00:11:15.920 |
of the destination of a taxi ride, or perhaps you're just doing a multi-label classification 00:11:22.320 |
where you can have multiple things being true, would be a couple examples. 00:11:26.560 |
And maybe it's already one-hot encoded or something in that case. 00:11:29.840 |
Okay. So that's the theory behind the design of tabular. So when we initialize it, we just 00:11:40.160 |
pass in that stuff. The processes that we pass in are just going to be transforms, 00:11:46.320 |
so we can dump them in a pipeline. And so this stuff, kind of, you'll see that we just keep reusing the 00:11:51.280 |
same foundational concepts throughout fast.ai version 2, which is a good sign that they're 00:11:55.760 |
strong foundations. So the processes are, you know, things that we want to run a bunch of them, 00:12:03.360 |
we want to depend on what type something is, whether we run it or not, you know, stuff like 00:12:07.520 |
that. So it's got all that kind of behavior we want in the pipeline. Unlike TfmdDS, TfmdDL, 00:12:16.560 |
TfmdList, all of those things apply transformations lazily. On tabular data, 00:12:23.680 |
we don't generally do that. There's a number of reasons why. The first is that unlike kind of 00:12:29.600 |
opening an image or something, it doesn't take a long time to grab a row of data. 00:12:36.640 |
So like it's fine to read the whole lot of rows normally, except in some cases of really, 00:12:42.720 |
really big datasets. The second reason is that most kind of tabular stuff is designed to work 00:12:49.520 |
quickly on lots of rows at a time. So it's going to be much, much faster if you do it ahead of time. 00:12:53.920 |
The third is that most preprocessing is going to be not data augmentation kind of stuff, but more 00:13:00.640 |
just once off cleaning up labels and things like that. So for all these kinds of reasons, 00:13:06.800 |
our processing in tabular is generally done ahead of time rather than lazily, 00:13:11.120 |
but it's still a pipeline of transforms. Okay, so then we're going to store something called 00:13:17.680 |
catY, which will be our y variable if it's categorical, otherwise None, and vice versa 00:13:24.160 |
for contY. So that's our initializer, and let's take a look at it in use. So basically we create 00:13:33.440 |
a data frame containing two columns, and from that we can create a tabular object passing in 00:13:41.120 |
that data frame and saying our cat names, con names, y names, whatever. So in this case we'll 00:13:46.400 |
just have cat names. One thing that's always a good idea is to make sure that things pickle 00:13:51.760 |
okay because for inference and stuff all this metadata has to be pickleable. So dumping 00:13:56.800 |
something to a string and then loading that string up again is always a good idea to make sure it 00:14:00.720 |
works and make sure it's the same as what you started with. So we're inheriting from 00:14:07.280 |
CollBase, and CollBase is just a small little thing in core which defines the basic things you would 00:14:16.480 |
expect to have in a collection, and it implements them by composition. So you pass in some list or 00:14:24.880 |
whatever, and so the length of your CollBase will be the length of that list. So like you can often 00:14:31.040 |
just inherit from something to do this, but composition often can give you some more 00:14:37.520 |
flexibility. So this basically gives you the very quick, you know, __getitem__ is defined by 00:14:43.520 |
calling self.items' __getitem__, and __len__ is defined by calling self.items' __len__, 00:14:47.600 |
and so forth. So by inheriting from CollBase we can then in the init simply say super().__init__(df), 00:14:57.840 |
and so that means we're now going to have something called self.items, 00:15:01.760 |
which is going to be that data frame. And so here we check that, you know, the pickled version of 00:15:10.080 |
items is the same as the TabularObjects version of items. Okay, so let me just press this. 00:15:39.600 |
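That composition idea, plus the pickle round-trip check just mentioned, can be sketched like this (a simplified stand-in, not the real CollBase):

```python
import pickle

class CollBase:
    # Implement collection behavior by composition: delegate to
    # whatever `items` was passed in
    def __init__(self, items): self.items = items
    def __len__(self): return len(self.items)
    def __getitem__(self, k): return self.items[k]

c = CollBase([10, 20, 30])
# round-trip through pickle to check all the state survives
c2 = pickle.loads(pickle.dumps(c))
```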
Okay, so the next thing to notice is that there are various useful little attributes like 00:15:48.320 |
all_cols, all_cats for all categorical columns, all_conts for all continuous columns, 00:15:58.560 |
and then there's also the same with names at the end: all_cont_names, all_cat_names, 00:16:04.080 |
all_col_names. So you can see all_cont_names is just the continuous names plus the continuous y 00:16:10.880 |
if there is one. This would not work except for the fact that we use the capital L's plus, so if you 00:16:19.360 |
add a None to a capital L it doesn't change it, which is exactly what we want most of the 00:16:24.800 |
time. So some of the things in capital L are a bit more convenient than plain lists. I think that's right, 00:16:32.000 |
anyway. Maybe I should double check before I say things that might not be true. Yeah, so you can't 00:16:39.440 |
do that, or else you can do that and get exactly the expected behavior. And Ls, by the way, always 00:16:48.960 |
show you how big they are before they print out their contents. You'll see that all_cols does not 00:17:03.040 |
actually appear anywhere here, and that's because we just created a little thing called add_prop 00:17:08.080 |
that just adds cats, conts and cols. And so for each one it creates a 00:17:15.200 |
read version of the property, which is to just grab whatever cat_names, 00:17:20.720 |
cont_names, etc., which are all defined in here, and then indexes into our data frame with that 00:17:28.160 |
list of columns. And then it creates a setter, which simply sets that list of names to whatever 00:17:40.560 |
value you provide. So that's just a quick way to create the setter and getter versions of all of 00:17:45.920 |
those. So that's where all_cols comes from. So in this case, because the TabularObject 00:17:53.600 |
only has this one column mentioned as being part of what we're modeling, 00:17:59.360 |
even though the data frame had an a and a b, TabularObject.all_cols only has the a 00:18:08.160 |
column in it, because by all_cols it means all of the columns we're using in modeling, so 00:18:13.440 |
continuous and categorical and dependent variables. And so one of the nice things 00:18:20.080 |
is that, you know, because everything is super consistent in the API, we can now just say .show, 00:18:28.240 |
just like everything else, we can say .show, and in this case we see the all_cols data frame. 00:18:41.600 |
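That property-factory trick can be sketched as follows; the names are loosely modelled on the real code, not copied from it:

```python
import pandas as pd

def add_prop(cls, nm):
    # getter: index the DataFrame with the stored list of column names;
    # setter: replace that list of names
    def _get(self): return self.items[getattr(self, f"{nm}_names")]
    def _set(self, v): setattr(self, f"{nm}_names", v)
    setattr(cls, f"all_{nm}s", property(_get, _set))

class Tab:
    def __init__(self, df, cat_names):
        self.items, self.cat_names = df, cat_names

add_prop(Tab, "cat")
t = Tab(pd.DataFrame({"a": [1, 2], "b": [3, 4]}), ["a"])
# t.all_cats is now the DataFrame restricted to the categorical columns
```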
Processes, TabularProcs, are just transforms. Now specifically they're in-place transforms, 00:18:53.360 |
but don't let that bother you, because an in-place transform is simply a transform where, when we 00:19:00.800 |
call it, we return the original thing. So like all the transforms we've seen so far 00:19:07.600 |
return something different to what you passed in, like that's the whole point of them, 00:19:12.160 |
but processes, the whole point of them is that they change the actual stored data, 00:19:18.800 |
so that's why they just return whatever you started with. So that's all in-place transform 00:19:27.040 |
means. And so a TabularProc is just a transform that returns its input when you call it, 00:19:34.240 |
and when you set it up, it just does the normal setup, but it also calls dunder call, in other 00:19:46.000 |
words, self with round brackets. Why does it do that? Well, let's take a look at an example, 00:19:52.400 |
Categorify. So Categorify is a TabularProc where the setup is going to create a 00:20:01.760 |
CategoryMap, so this is just a mapping between the ints and the vocab of a column; 00:20:09.760 |
that's all a CategoryMap is. And so it's going to go through all of the categorical columns, 00:20:21.840 |
and it's going to go .iloc into the data frame for each of those columns, 00:20:26.080 |
and it's going to create a category map for that column. So this is just creating, so self.classes 00:20:34.480 |
then is going to be a dictionary that goes from the column names to the vocab for that categorical 00:20:41.600 |
column. So that's what setup does, right? So setup kind of sets up the metadata, it's the vocab. 00:20:49.440 |
Encodes, on the other hand, is the thing that actually takes a categorical column and converts 00:20:57.120 |
it into ints using the vocab that we created earlier. And so we need them to be two separate 00:21:04.480 |
things because if you think about inference, at inference time you don't want to run setup, 00:21:12.400 |
at inference time you just want to run encodes. But at training time you want to do both, right? 00:21:21.680 |
Any time you do setup, you're definitely also going to want to process. So that's why setup, 00:21:27.520 |
after it sets up, immediately does the encoding, because in practice that's always what you want. 00:21:37.280 |
So that's why we override setup in TabularProc. That's all TabularProc is, it's just a transform 00:21:43.280 |
that when you set it up it also calls it straight away. 00:21:45.840 |
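In sketch form, that behavior (an in-place transform whose setup immediately applies itself) looks like this; again a toy, not the real fastai classes:

```python
class MiniTabularProc:
    # setups() fits state, encodes() applies it; setup() does both,
    # and calling the proc returns its input unchanged (in-place)
    def setups(self, to): pass
    def encodes(self, to): pass
    def setup(self, to):
        self.setups(to)
        return self(to)           # immediately process after setting up
    def __call__(self, to):
        self.encodes(to)
        return to                 # in-place: give back the original object

class Counter(MiniTabularProc):
    # instrumented subclass just to show the call pattern
    def __init__(self): self.n_setups = self.n_encodes = 0
    def setups(self, to): self.n_setups += 1
    def encodes(self, to): self.n_encodes += 1

p = Counter()
obj = object()
out = p.setup(obj)    # one setups call AND one encodes call
```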
Okay, so that, so Categorify is pretty similar to the 00:21:54.960 |
categorized transform we've seen for dependent variables for like image classification, 00:22:04.560 |
but it's a little bit different because it's for TabularObjects. So you can see an example. 00:22:10.880 |
Here's a data frame with 0, 1, 2, 0, 2 in just a single column. So again we can create a TabularObject 00:22:19.680 |
passing in the data frame, passing in any processes we want to run, and passing in, 00:22:25.600 |
so the first thing that we pass in after that will be the category names, the categorical names, 00:22:31.920 |
so we're just going to have one. So once you have created your TabularObject, the next thing you 00:22:40.640 |
want to do is to call setup, and remember that setup's going to do two things. It's going to call 00:22:45.280 |
your setups and it's going to call your encodes. Are there any object detectors? We haven't done 00:22:58.080 |
any models yet, David. We're only really working through the transformation pipeline, but we have 00:23:02.800 |
certainly looked at version 2 object detection in previous lessons, though not in detail. 00:23:09.600 |
We've touched on the places where the object detection data is defined, and 00:23:17.120 |
yeah, hopefully it's clear enough that you can figure that out. 00:23:26.880 |
Yeah, so patch_property does not call it; I mean you can check the code easily enough yourself, 00:23:34.000 |
right? So patch_property, let's see, okay, so no, it doesn't, it calls patch_to, and 00:23:42.240 |
patch_to, no, it doesn't. So there's no way to create a setup using that decorator yet. 00:23:56.240 |
If you can think of a good syntax for that that you think would work well, 00:23:59.680 |
feel free to suggest it or even implement it in a PR. 00:24:02.400 |
Okay, so, all right, so we've created our TabularObject. As we told it to 00:24:13.760 |
categorify, we said this is our only categorical variable. Then we call setup, which is going to 00:24:19.680 |
go ahead and let's have a look. Setup is going to, in our processor, create classes and then 00:24:32.640 |
it's going to change our items by calling apply_cats, which will map the data in that column 00:24:49.680 |
using this dictionary. So map is an interesting pandas method. You can pass it a function, 00:24:57.760 |
which is going to be super slow, so don't do that, but you can also pass it a dictionary 00:25:02.240 |
and that will just map from keys to values in the dictionary. So if we have a look at 00:25:11.520 |
this case, we can see that to.a starts out as 0, 1, 2, 0, 2 and ends up as 1, 2, 3, 1, 3, and the 00:25:24.640 |
reason for that is that the vocab that it created is #na, 0, 1, 2. So whenever we create a 00:25:34.960 |
categorized column for the vocab, we always put a #na at the start, which is similar to what we've 00:25:42.480 |
done in version 1. So that way if you in the future get a value outside of your vocab, 00:25:51.280 |
then that's what we're going to set it to. We're going to set it to #na, and #na is at index 0 in 00:25:57.520 |
the vocab, so everything else shifts along by one. So that's why this 0 became 1, for example. 00:26:04.800 |
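That encoding can be reproduced with plain pandas. Note that `Series.map` accepts a dict, and because a `defaultdict` defines `__missing__`, any value not in the vocab falls back to 0, the `#na` slot:

```python
from collections import defaultdict

import pandas as pd

col = pd.Series([0, 1, 2, 0, 2])
vocab = ["#na"] + sorted(col.unique().tolist())          # ['#na', 0, 1, 2]
# defaultdict(int): unknown keys map to 0, i.e. the '#na' category
o2i = defaultdict(int, {v: i for i, v in enumerate(vocab)})

encoded = col.map(o2i)                 # each value -> its index in the vocab
unseen = pd.Series([3, -1]).map(o2i)   # values outside the vocab -> 0
```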
One of the things that I recently added to Pipeline (because remember that to.procs 00:26:14.080 |
is a pipeline, right?) is a __getattr__, and, 00:26:25.360 |
so this is not an attribute in Pipeline, not surprisingly, so if it finds an attribute 00:26:33.280 |
it doesn't understand, it will try and find that attribute in any transforms inside that pipeline, 00:26:39.440 |
which is exactly what we want, you know? So in this case, it's going to look for a transform 00:26:51.040 |
of type Categorify (it converts the type name to snake_case). So this is very similar, if you've 00:26:58.320 |
watched the part 2, the most recent part 2 videos, we did the same thing for callbacks. Callbacks 00:27:04.160 |
got automatically inserted, and I think version 1 does this too, getting automatically added as 00:27:08.160 |
attributes. So Pipeline does something very similar, but it doesn't actually add them as 00:27:13.600 |
attributes, it uses __getattr__ to do the same thing. So in this case we added a Categorify transform; 00:27:21.280 |
we haven't instantiated it, we just passed in the type, so it's going to instantiate it for us. 00:27:25.840 |
Pipeline will always instantiate your types for you if you don't instantiate them, and so later 00:27:30.720 |
on we want to say, okay, let's find out what the vocab was, which means we need to grab the 00:27:36.160 |
processes out of our tabular object and ask for the Categorify transform. So now that we've got 00:27:43.120 |
that Categorify transform: Categorify defines __getitem__, and it will return the vocab for that 00:27:54.640 |
column. So here is the vocab for that column, and so we can have a look: cat['a'], 00:28:11.600 |
okay, as you can see there it is. And, 00:28:20.480 |
as well as having the items that we just saw, it also has the reverse mapping, o2i, 00:28:28.000 |
so this is, as you can see, goes the opposite direction. So to answer Aman's question, 00:28:36.160 |
yes, this should take care of the mapping in the test set, because if it comes across, 00:28:43.760 |
so it's going to look at o2i when it tries to call apply cats, it's going to try to find in your 00:28:52.000 |
categories, it's going to try and find that column, it's going to grab o2i, which is this dictionary, 00:29:00.960 |
that's then going to use that to map everything in the column, but because it's a default dict, 00:29:05.440 |
if there's anything it doesn't recognize, it will become zero, which is the #na 00:29:11.040 |
category. So yeah, that should all work nicely. You have to figure out how to model it, 00:29:18.480 |
of course, but the data processing will handle it for you. Okay, so 00:29:27.360 |
what else? So now, imagine it's inference time, 00:29:36.720 |
and so we come along with some new data frame that we want to run inference on, 00:29:45.120 |
or it's a test set or whatever, so here's our data frame. So we now have to say, okay, 00:29:50.960 |
I want to create a new tabular object for the same metadata that we had before, so the same 00:29:56.880 |
processes, the same vocab, the same categorical continuous variables. The way to do that is to 00:30:03.200 |
start with an existing tabular object which has that metadata and call new and pass in a new data 00:30:08.720 |
frame that you have, and that's going to give you a new tabular object with the same metadata and 00:30:13.280 |
processes and stuff that we had before, but with this different data in it. Now of course, we don't 00:30:20.640 |
want to call setup on that because setup would replace the vocab, and the whole point is that 00:30:25.680 |
we want to actually use the same vocab for inference on this new data set. So instead you 00:30:34.000 |
call process, and so all process does in tabular is it just calls the processes, 00:30:41.760 |
which is a pipeline, so you can just treat it as a function. 00:30:50.480 |
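The new-then-process pattern can be sketched with toy classes; this shows the shape of it, not the fastai implementation:

```python
import copy
from collections import defaultdict

import pandas as pd

class MiniTabular:
    def __init__(self, df, procs):
        self.items, self.procs = df, procs
    def setup(self):
        for p in self.procs: p.setup(self)   # fit, then apply, each proc
    def new(self, df):
        # same metadata and (already fitted) procs, different data
        t = copy.copy(self)
        t.items = df.copy()
        return t
    def process(self):
        for p in self.procs: p(self)         # apply only, no refitting
        return self

class MiniCategorify:
    def setup(self, to):
        vocab = ["#na"] + sorted(to.items["a"].unique().tolist())
        self.o2i = defaultdict(int, {v: i for i, v in enumerate(vocab)})
        self(to)                             # setup then encode straight away
    def __call__(self, to):
        to.items["a"] = to.items["a"].map(self.o2i)
        return to

to = MiniTabular(pd.DataFrame({"a": [0, 1, 2, 0, 2]}), [MiniCategorify()])
to.setup()
# inference time: same vocab, new data; 3 is unseen so it maps to 0
to_test = to.new(pd.DataFrame({"a": [1, 0, 3]})).process()
```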
So in this case, our vocab was #na, 0, 1, 2, and here there's a couple of things that are not 00:31:00.000 |
in that list of 0, 1, 2: specifically this one is 3 and this one is -1. So 1, 0 and 2 will get 00:31:10.640 |
replaced by 2, 1 and 3, so there's our new a column, and then as we just 00:31:23.840 |
discussed the two things that don't map are going to be 0. So then if you call decode on our processor, 00:31:36.960 |
then as you would expect you end up with the same data you started with, but of course this is now 00:31:44.240 |
going to be #na, because we said we don't know what those are. So like decoding in general in fastai 00:31:53.840 |
doesn't mean you always get back exactly what you started with, right? It's kind of it's trying to 00:32:00.400 |
display the kind of transformed version of the data. In some cases like normalization, it should 00:32:11.280 |
pretty much be exactly what you started with, but for some things like categorify the missing values 00:32:16.640 |
it won't be exactly what you started with. You don't have to pass in just a type name, 00:32:24.400 |
you can instantiate a processor yourself and then pass that in, so then that means you don't have to 00:32:32.800 |
dig it out again like this, so sometimes that's more convenient, so this is just another way of 00:32:39.120 |
doing the same thing. But in this case we're also going to split the training set and the 00:32:44.160 |
validation set, and this is particularly important for things like categorify, because if our training 00:32:49.680 |
set is the first three elements and our validation set is the last two, then this thing here three 00:32:57.040 |
is not in the training set, and so therefore it should not be part of the vocab. So let's make 00:33:02.320 |
sure that that's the case. So here we are, categorical variable, yep it doesn't have three in it, 00:33:08.160 |
the vocab doesn't have three in it. So the way we pass in these split indexes is by calling 00:33:14.640 |
tabular object dot data source, and that converts the tabular object to a data source, the only 00:33:19.760 |
thing you pass it is the list of splits. And so that gives you a standard data source object, 00:33:30.480 |
just like the one that we saw in our last walkthrough. So that's what you get, and so 00:33:37.840 |
that data source, let's take it out, right, so that data source object will have a train, 00:33:44.240 |
for example, and a valid. And those are just other ways of saying subset zero and subset one. 00:33:59.680 |
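A minimal sketch of that split/subset behavior, again just illustrating the shape of DataSource rather than its real code:

```python
class MiniDataSource:
    # items plus a list of index lists; train/valid are just subsets 0/1
    def __init__(self, items, splits):
        self.items, self.splits = items, splits
    def subset(self, i):
        return [self.items[j] for j in self.splits[i]]
    @property
    def train(self): return self.subset(0)
    @property
    def valid(self): return self.subset(1)

dsrc = MiniDataSource(["a", "b", "c", "d", "e"], [[0, 1, 2], [3, 4]])
```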
Should the test set be three comma two? No: these are the indexes of the things 00:34:14.000 |
in the training set, and these are the indexes in the validation set, so these are indexes three and four, 00:34:23.360 |
which are in the validation set. Yes, they're row indexes. So, in terms of looking at 00:34:36.240 |
the code of Tabular, it's super tiny, which is nice; in terms of the things that are more than 00:34:45.280 |
one line, it's basically just datasource. And the only reason data 00:34:51.440 |
source is more than a couple of lines is because in rapids on the GPU, trying to index into a data 00:35:02.720 |
frame with like arbitrary indexes is really, really, really, really slow. So you have to pass 00:35:12.080 |
in a contiguous list of indexes to make rapids fast. So what we do is when you pass in splits, 00:35:20.720 |
we actually concatenate all of those splits together into a single list and we index into 00:35:31.360 |
the data frame with that list. So that's going to shuffle the list so that all the stuff that's in 00:35:36.640 |
the same validation or training set is all together. And so that way, now when we create 00:35:44.080 |
our data source rather than passing in the actual splits, we just pass in a range of all the numbers 00:35:49.760 |
from nought to the length of the first split, and then all of the numbers from the length of the 00:35:55.840 |
first split to the length of the whole thing. And so our data source is then able to always use 00:36:04.640 |
contiguous indexes. So that's why that bit of code is there. Other than that, one thing I don't like 00:36:17.840 |
about Python is that any time you want to create a property, you have to put it on another row 00:36:26.480 |
like this, like in a lot of programming languages you can kind of do that kind of thing on the same 00:36:31.520 |
line but not in Python for some reason. And so I find it takes up a lot of room just for the 00:36:37.040 |
process of saying these are properties. So I just added an alternative syntax which is to create a 00:36:42.160 |
list of all the things that are properties. So that's all that is. Like most of these things, 00:36:48.800 |
it's super tiny. So that's just one line of code. Just goes through and calls property on them to 00:36:55.840 |
make them properties. Oh, okay. So then another thing about this is I kind of tried to make 00:37:10.240 |
tabular look a lot like a DataFrame. And one way that happens is we've inherited from GetAttr, 00:37:16.480 |
which means that any unknown attributes it's going to pass down to whatever is the default 00:37:24.560 |
property, which is self.items, which is a DataFrame. So in other words, it behaves a lot like a DataFrame, 00:37:31.440 |
because anything unknown it will actually pass it along to the DataFrame. But one thing I did want 00:37:37.520 |
to change is in DataFrames it's not convenient to index into a row by row number and a column by 00:37:48.880 |
name. You can use iloc to get row by number, column by number. You can use loc to say row by 00:37:56.400 |
name or index and column by name or index. But most of the time I want to use row numbers and 00:38:03.600 |
column names. So we redefine iloc to use this tabular iloc indexer, which is here. 00:38:13.360 |
And as you can see if you have a row and a column then the columns I actually replace with 00:38:26.160 |
the integer index of the column. So that way we can use column names and row numbers. 00:38:33.520 |
And also it will wrap it back up in a tabular object as well. So we end up with: 00:38:43.360 |
if you index into a tabular object with iloc you'll get back a tabular object. 00:38:53.280 |
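A sketch of that indexer idea: translate column names to integer positions so that `DataFrame.iloc` can take row numbers plus column names. This is a hypothetical mini version, without the wrapping-back-up step:

```python
import pandas as pd

class TabIloc:
    # rows by position, columns by NAME
    def __init__(self, df): self.df = df
    def __getitem__(self, idxs):
        rows, cols = idxs
        # replace each column name with its integer index for .iloc
        col_idxs = [self.df.columns.get_loc(c) for c in cols]
        return self.df.iloc[rows, col_idxs]

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
sub = TabIloc(df)[0:2, ["b"]]    # rows 0-1 by number, column 'b' by name
```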
So then the way Categorify is implemented as you see in encodes is it calls transform, 00:39:02.800 |
passing in a bunch of column names and a function. The function apply cats is the thing we saw before 00:39:12.000 |
which is the thing that calls map, unless you have a pandas categorical column, in which case 00:39:19.200 |
pandas has already done the coding for you, so you just return 00:39:24.560 |
cat.codes + 1. And so how does this function get applied to each of these columns? That's because 00:39:36.960 |
we have a thing called tabular object.transform and that's the thing that at the moment is defined 00:39:44.080 |
explicitly for pandas. And as you can see, for pandas it just says this column equals the 00:39:51.840 |
transformed version of this column, because pandas has a .transform method for series. 00:40:11.440 |
Okay, so Juvian your question there about numerical versus continuous you should watch the 00:40:17.040 |
introduction to machine learning for coders course where we talk about that in a lot of detail. 00:40:23.520 |
That's not something we really have time to cover here. 00:40:26.880 |
All right, so that's Categorify. So here's the other way to use Categorify as I was kind of 00:40:40.800 |
beginning to mention is you can actually create a categorical column in pandas. And one of the 00:40:46.080 |
reasons to do that is so that you can say these are the categories I want to use and they have 00:40:50.480 |
an order and so that way high, medium, low will now be ordered correctly which is super 00:40:56.960 |
useful. Also pandas is just nice and efficient at dealing with categories. So now if we go 00:41:06.320 |
Categorify just like before, it's going to give us exactly the same kind of results except that 00:41:12.720 |
when we look at the categorical processor, these will be in the right order. So we're going to end 00:41:21.120 |
up with things that have been mapped in that way and it will also be done potentially more 00:41:26.160 |
efficiently because it's using the internal pandas cat code stuff. Thank you David, that is very kind. 00:41:38.240 |
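For example, declaring an ordered categorical in pandas (a generic pandas sketch, not fastai code) fixes both the vocabulary and its ordering:

```python
import pandas as pd

# Declaring categories up front fixes the vocabulary and its order,
# so low < medium < high compares the way you'd expect.
col = pd.Series(['medium', 'low', 'high', 'low'])
col = col.astype(pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True))

print(col.cat.codes.tolist())  # -> [1, 0, 2, 0], codes follow declared order
print((col > 'low').tolist())  # -> [True, False, True, False]
```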
I'm so thrilled. I would love to know where you are working as a computer vision data scientist. 00:41:44.880 |
Andrew is starting his job as the in-house research director and data scientist here 00:41:55.840 |
in 10 days' time, and also working very, very, very, very hard on being amazing, which also helps. 00:42:05.600 |
Lots of people do the course and don't end up doing jobs because they don't work as hard. 00:42:14.960 |
Although I think everybody listening to this, by definition, is going to go to an extra level of effort. 00:42:22.160 |
So I'm sure you will all do great. Normalize is you know just something where we're going to 00:42:31.760 |
subtract the means and divide by the standard deviations, and for decodes we're going to do the 00:42:37.920 |
opposite. And you'll see what we generally do in these setup things, we do this in lots of 00:42:44.880 |
places, is we say getattr(dsrc, 'train', dsrc). What this means is 00:42:55.200 |
it's the same as if we'd written df = dsrc.train if hasattr(dsrc, 'train') 00:43:11.600 |
else dsrc. It's the same as writing that, right? And so the reason we're doing that 00:43:20.800 |
is because we want you to be able to either pass in a an actual data source object which has a train 00:43:28.960 |
and a valid or not. You know if you aren't doing separate things to train and valid then that 00:43:36.000 |
should be fine as well, right? And so as long as the thing you pass in, if it does have a train, 00:43:42.320 |
then it should you know it should give you back some kind of object that has the right methods 00:43:48.320 |
that you need. So in the case of you know data source and tabular it will, data source has a train 00:43:57.440 |
attribute that will return a tabular object or if you just pass in a tabular object directly 00:44:05.600 |
then it won't have a train, so dsrc will just stay a tabular object, which has a 00:44:12.480 |
continuous-variables conts attribute. Okay, so now we have this data frame containing just the 00:44:22.720 |
continuous variables and optionally just for our training set if we have one so then we can set up 00:44:29.440 |
our means and standard deviations you know that's the metadata we need. So here we can create our 00:44:37.360 |
normalize, create a data frame to test it out on, pass in that processor this time we just got to 00:44:43.840 |
say these are continuous variables this is one which is a make sure we call setup and so now 00:44:51.280 |
we should find here is the same data that we had before, but let's make it into an array for testing 00:44:59.600 |
and calculate its mean and standard deviation. So we should find, if we go norm.means['a'], 00:45:05.600 |
so self.means was df.mean this is quite nice right in pandas if you call .mean on a data frame 00:45:14.320 |
you will get back a I think it's a series object which you can index into with column names. 00:45:21.600 |
So this is quite neat right that we were able to get all the means and standard deviations 00:45:26.320 |
all at once for all the columns and even apply them to all the columns at once. This is kind of 00:45:31.200 |
the magic of pandas indexes, so I think that's actually pretty nice. So yeah, make sure 00:45:39.120 |
that the mean is m, the standard deviation should be around s, and the values after processing should 00:45:47.120 |
be around (x - m) / s. One thing to notice is that we didn't call setup here. Why 00:46:03.760 |
didn't we call setup? The reason is that if you look at data source, it calls setup. Why? 00:46:16.000 |
Because we now definitely have all the information we need to set it up right. We know the data 00:46:20.640 |
because that was a data frame you passed in and now we know what the training set is because that 00:46:25.200 |
was passed in so there's no reason there's no reason to ask you to manually call setup. 00:46:30.080 |
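A minimal sketch of the Normalize transform being described (illustrative only; fastai's real class hooks into the transform machinery, and handles pulling stats from only the training set via the getattr logic above):

```python
import pandas as pd

class Normalize:
    # Stats come from (ideally) the training set only, then apply everywhere.
    def setup(self, df):
        self.means, self.stds = df.mean(), df.std()
    def encodes(self, df):
        return (df - self.means) / self.stds
    def decodes(self, df):
        return df * self.stds + self.means

df = pd.DataFrame({'a': [0., 1., 2., 3., 4.]})
norm = Normalize()
norm.setup(df)                    # like calling setup, or letting a data source do it
enc = norm.encodes(df)
print(round(enc['a'].mean(), 6))  # -> 0.0
print(bool((norm.decodes(enc)['a'] - df['a']).abs().max() < 1e-9))  # -> True
```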
So you've got two ways to set up your processes. One is to call setup on your tabular object 00:46:40.560 |
or the other is just to create a data source right and it's kind of 00:46:45.840 |
it's something you kind of have to be aware of right because calling data source is not just 00:46:52.400 |
returning a data source it is also modifying your tabular object data to process it and so it's kind 00:46:59.600 |
of like, it's a very non-pure, non-functional kind of approach going on here. We're not copying 00:47:07.600 |
things and returning them; we're changing things in place, and the reason for that is that, 00:47:12.240 |
you know with with tabular data you know you you don't want to be creating lots of copies of it 00:47:18.240 |
you really want to be doing stuff in place; there are, you know, important 00:47:22.560 |
performance issues, so we try to do things just once and do them where they are. So Normalize, in this 00:47:30.000 |
case we're calling setup and so again for inference you know here's some new data set we want to call 00:47:37.920 |
inference on we go tabular object.new on the new data frame we process it we don't call setup 00:47:45.520 |
because we don't want to create a new mean and standard deviation; we want to use the same standard 00:47:49.280 |
deviation and mean that we used for our training. And then here's the version where we use instead a 00:47:58.960 |
data source, so you'll find that the mean and standard deviation 00:48:07.600 |
are computed from 0, 1, 2, because that's the only stuff in the training set, and again, normalization and stuff 00:48:12.480 |
should only be done with the training set. So you know all this stuff of kind of 00:48:17.360 |
using this stuff makes it much harder to screw things up in terms of modeling and accidentally 00:48:25.280 |
do things on the whole data set you know get leakage and stuff like that because you know 00:48:30.240 |
we try to automatically do the right thing for you. Okay so then fill missing is going to 00:48:40.720 |
go through each continuous column and it will see if there are any missing values in the column. 00:48:54.640 |
And if there are missing values then it's going to create a 00:49:06.960 |
na_dict. So if there are any NAs, or any missing, or any nulls, all the same idea in pandas, 00:49:14.640 |
then that column will appear as a column name in this dictionary and the value of it will be 00:49:23.040 |
dependent on what fill strategy you ask for. So fill strategy is a class 00:49:38.720 |
and you can say which of those methods you want to use. Do you want to fill things with the median 00:49:49.360 |
or with some constant or with the mode, right? And so we assume by default that it's the median. 00:49:57.760 |
So this here is actually going to call FillStrategy.median, passing in the column. 00:50:06.640 |
So that's the dictionary we create. So then later on, when encodes is called, 00:50:16.080 |
we actually need to go through and do two things. The first thing is to use the pandas fillna 00:50:31.680 |
to fill missing values with whatever value we put into the dictionary for that column, 00:50:38.320 |
and again we do it in place. Then the second thing is, if you're asked to add an extra column to say 00:50:48.320 |
which ones were filled in, which by default is true, then we're going to add a column with the 00:50:53.920 |
same name with _na at the end, which is going to be a boolean of true if that was 00:51:02.000 |
originally missing and false otherwise. So here you can see we're creating three 00:51:10.960 |
different processors, a FillMissing processor with each of the possible 00:51:15.520 |
strategies. And so then we create a data frame with a missing value and then we just go through 00:51:23.360 |
and create three tabular objects with those three different processes and make sure that the 00:51:31.680 |
na_dict for our a column has the appropriate median or constant or mode as requested. 00:51:39.760 |
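The fill-strategy and _na-column behaviour being tested can be sketched like this (illustrative; `fill_missing` and its signature are assumptions for this sketch, not fastai's exact API):

```python
import pandas as pd

class FillStrategy:
    # Each strategy maps a column (plus an optional constant) to a fill value.
    @staticmethod
    def median(col, fill=None):   return col.median()
    @staticmethod
    def constant(col, fill=None): return fill
    @staticmethod
    def mode(col, fill=None):     return col.dropna().mode()[0]

def fill_missing(df, strategy=FillStrategy.median, fill=0, add_col=True):
    # Record a fill value for every column that actually has missing data.
    na_dict = {c: strategy(df[c], fill) for c in df.columns if df[c].isna().any()}
    for c, v in na_dict.items():
        if add_col:
            df[c + '_na'] = df[c].isna()  # boolean "was missing" flag
        df[c] = df[c].fillna(v)           # fill in place on the frame
    return na_dict

df = pd.DataFrame({'a': [0., 1., None, 1., 5.]})
print(fill_missing(df))     # -> {'a': 1.0}, the median of [0, 1, 1, 5]
print(df['a_na'].tolist())  # -> [False, False, True, False, False]
```

Note the flag column is computed before the fill, otherwise the missing-ness information would already be gone.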
And then remember setup also processes so then we can 00:51:46.800 |
go through and make sure that they have been replaced correctly. 00:51:55.680 |
And also make sure that the tabular object now has a new categorical name, which is 00:52:02.400 |
in this case a_na. So it's not enough just to add it to the data frame; it also has to be added to 00:52:10.160 |
cat_names in the tabular object, because this is a categorical column we want to use for 00:52:16.000 |
modeling. So Madhavan asks shouldn't setups be called in the constructor and no it shouldn't. 00:52:23.680 |
Setups is what transforms call when you call setup using the type dispatch stuff we talked 00:52:31.600 |
about in the transforms walkthrough and so and then setup is something which should be called 00:52:38.880 |
automatically only when we have enough information to know what to set up with and that information 00:52:43.760 |
is only available once you've told us what your training set is so that's why it's called by data 00:52:49.280 |
source not called by the constructor but if you're not going to use a data source then you can call 00:52:56.960 |
it yourself. Okay. Great. So this section is mainly kind of a few more examples of putting 00:53:15.600 |
that all together. All right. So here's a bunch of processes: normalize, categorify, fill missing, 00:53:21.600 |
do nothing at all. Obviously you don't need this one it's just to show you. And here's a data frame 00:53:27.040 |
with a couple of columns. A is the categorical, B is the continuous because remember that was the 00:53:32.960 |
order that we use. It would be probably better if we actually wrote those here at least the first 00:53:39.200 |
time, so you didn't have to remember. There we go. So we call setup, because we're not using a data 00:53:47.760 |
source on this one. And so the processes you'll have noticed explicitly only work on the columns 00:53:59.200 |
of the right type: so Normalize works just on the continuous columns, while 00:54:04.480 |
Categorify goes through the categorical columns. You might have noticed that was all 00:54:15.600 |
cat names and that's because you also want to categorize categorical dependent variables 00:54:21.600 |
but normalize we don't normalize continuous dependent variables. Normally for that you'll 00:54:27.040 |
do like a sigmoid in the model or something like that. So yeah so you can throw them all in there 00:54:36.320 |
and it'll do the right thing for the right columns automatically. So it just goes through and makes 00:54:40.080 |
sure that that all works fine. So these are really just a bunch of tests and examples. 00:54:46.160 |
Okay, so, last section. Okay, so now we have a tabular object which has got some cats 00:54:56.320 |
and some conts and dependent variable y's. If we want to use this for modeling, we need tensors. 00:55:04.960 |
We actually need three tensors, one tensor for the continuous, one for the categorical and one 00:55:11.920 |
for the dependent. And the reason for that is that the continuous and the categorical are of 00:55:16.720 |
different data types, so we can't put them all in the same tensor, because tensors have to be all 00:55:21.440 |
of the same data type. So if you look at the version one tabular stuff it's the same thing 00:55:28.240 |
right, we have those three different tensors. So now we create one normal transform, so a lazy 00:55:37.360 |
transform that's, you know, applied as we're getting our batches. And all we do is we say, okay, 00:55:45.200 |
this is the tabular object which we're going to be transforming. 00:55:52.400 |
And in encodes we don't actually need that state at all. For encodes we're just going 00:56:01.920 |
to grab all of the categorical variables turn them into a tensor and make it a long. And then 00:56:07.120 |
we'll grab all the continuous turn it into a tensor make it a float. And so then the first 00:56:12.240 |
thing our tuple is itself a tuple with those two things so that's our independent variables 00:56:17.920 |
and then our dependent variable is the target turned into a long. This is actually a mistake: 00:56:25.120 |
it shouldn't always turn it into a long; it should only turn it into a long if it's continuous... 00:56:30.400 |
sorry, categorical. Otherwise it should stay continuous, I think. 00:56:39.760 |
No, let's wait until we get to modeling. I can't quite remember whether, if it's categorical, 00:56:44.800 |
they're going to want to encode it. No, we're going to use it as... yeah, that's right, 00:56:51.200 |
so it's a long if it's categorical, but for continuous it has to be a float, 00:56:55.760 |
that's right. So use a float for a continuous target. Okay, so that's a little mistake; 00:57:07.280 |
we haven't done any tabular regression yet in version two. 00:57:10.960 |
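A sketch of the three-tensor encode step (illustrative; `encode_batch` is a made-up name, and this skips fastai's transform plumbing), including the categorical-versus-continuous target dtype point just discussed:

```python
import pandas as pd
import torch

def encode_batch(df, cat_names, cont_names, y_name, y_is_cat=True):
    # Categorical codes become a long tensor, continuous values a float tensor;
    # the target is long for classification, float for regression.
    cats  = torch.tensor(df[cat_names].to_numpy()).long()
    conts = torch.tensor(df[cont_names].to_numpy()).float()
    targ  = torch.tensor(df[y_name].to_numpy())
    targ  = targ.long() if y_is_cat else targ.float()
    return (cats, conts), targ

df = pd.DataFrame({'cat': [1, 2, 0], 'cont': [0.5, -1.0, 2.0], 'y': [0, 1, 1]})
(xc, xf), y = encode_batch(df, ['cat'], ['cont'], 'y')
print(xc.dtype, xf.dtype, y.dtype)  # -> torch.int64 torch.float32 torch.int64
```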
So that's all encodes is going to do so then we'll come back to decodes later right. So in our 00:57:21.040 |
example here we grabbed our path to the adult sample we read the CSV we split it into a test 00:57:28.000 |
set and the main bit made a list of our categorical and continuous a list of the processes we wanted 00:57:37.920 |
to use the indexes of the splits that we wanted so then we can create that tabular as we discussed 00:57:48.000 |
we can turn it into a data source with the splits. Now, you'll see it never mentioned ReadTabBatch, 00:57:58.240 |
and the reason for that is that we don't want to force you to do things that we can do for 00:58:04.080 |
you. So if you just say give me a tabular data loader rather than a normal data loader, a tabular 00:58:10.000 |
data loader is a transformed data loader where we know that, for any after_batch that you asked for, 00:58:18.480 |
we have to also add in ReadTabBatch. So that's how that's automatically added to the 00:58:27.280 |
transforms for you the other thing about tabular data loader is we want to do everything a batch 00:58:39.920 |
at a time so particularly for the rapids on GPU stuff we don't want to pull out individual rows 00:58:47.440 |
and then collect them later everything's done by grabbing a whole batch at a time so we replace 00:58:52.720 |
do_item, which is the thing that normally grabs a single item for collation; we replace it with 00:58:58.560 |
a no-op, noop, right? And then we replace create_batch, which is the thing that 00:59:04.800 |
normally collects things to say don't collate things but instead actually grab all of the samples 00:59:11.040 |
directly from the tabular object using iloc. So if you look at that blog post I mentioned 00:59:21.040 |
from Even at NVIDIA about how they got the 16x speed-up by using RAPIDS, a key piece of that was 00:59:29.040 |
that they wrote their own version of this kind of stuff to kind of do everything batch at a time 00:59:35.280 |
and this is one of the key reasons we replaced the PyTorch data loader is to make this kind of thing 00:59:40.960 |
super easy. So as you can see, creating a kind of batch-at-a-time data loader is seven lines 00:59:46.800 |
of code super nice and easy so yeah I was pretty excited when this came out so quick 01:00:02.240 |
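The replace-create_batch idea can be sketched in a few lines (a toy sketch over a plain DataFrame, not fastai's TabDataLoader): instead of fetching rows one at a time and collating them, slice whole batches out with iloc:

```python
import pandas as pd

class BatchLoader:
    # do_item is effectively a no-op here: each fetch yields a whole batch,
    # sliced from the frame with iloc, so there is nothing to collate.
    def __init__(self, df, bs):
        self.df, self.bs = df, bs
    def __iter__(self):
        for i in range(0, len(self.df), self.bs):
            yield self.df.iloc[i:i + self.bs]

df = pd.DataFrame({'a': range(10)})
print([len(b) for b in BatchLoader(df, bs=4)])  # -> [4, 4, 2]
```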
so that's what happens when we create the tabular data loader 01:00:14.480 |
we could of course also create a data bunch we should probably add this to the example 01:00:21.040 |
and uh yeah that's basically it so then at inference time as we discussed you can now 01:00:35.920 |
do the same .new trick we saw before, then .process, and then you can grab, whatever... 01:00:41.600 |
here's all_cols, which is going to give us a data frame with all the modeling columns, 01:00:46.080 |
and so show_batch will be the decoded version, but this is not the decoded 01:00:51.280 |
version; this is the encoded version that you can pass to your modeling. All right, any questions, 01:00:59.760 |
Andrew? No? Okay, cool. All right, thanks. So that is it, and it's Friday, right? Yeah, so I think 01:01:10.000 |
I think we're on for Monday I'll double check and I'll let you all know um right away I will see