
fastai v2 walk-thru #8


Chapters

32:41 split the training set and the validation
40:41 create a categorical column in pandas
46:31 set up your processes

Whisper Transcript

00:00:00.000 | Okay. There we go. Hi, everybody. Can you see me? Can you hear me? Okay. So, I'm going to
00:00:29.920 | turn it off. Okay. Great. Hi. Hope you're ready for Tabular. Oh, and we have Andrew
00:00:51.320 | Shaw here today as well. Andrew just joined WAMRI, Medical Research Institute as Research
00:00:59.280 | Director, and he is the person you may be familiar with from MusicAutobot, but if you
00:01:09.620 | haven't checked it out, you should, because it's super, super cool. Now he's moving from
00:01:15.520 | the Jackson 5 to Medical AI Research. All right. So, Tabular is a cool notebook, I think.
00:01:39.960 | It's a bit of fun. And the basic idea, well, let's start at the end to see how it's going
00:01:51.160 | to look. So, we're going to look at the adult dataset, which is the one that we use in most
00:01:57.840 | of the docs in version one, and we used in some of the lessons, I think. It's a pretty
00:02:02.840 | simple, small dataset. It's got 32,000 rows in it, and 15 columns in it. And we've got
00:02:32.320 | 32,000 rows. And here's what it looks like when we grab it straight from Pandas. So, basically,
00:02:45.280 | in order to create models from this, we need to take the categorical variables and convert
00:02:54.720 | them into ints, possibly with missing category, the continuous variables, if there's anything
00:03:06.800 | with a missing value in a continuous variable, we need to replace it with something, so normally
00:03:13.960 | we replace it with something like the median, and then normally we add an additional column
00:03:19.720 | for each thing that has a missing value, for each column that has a missing value, and
00:03:23.720 | that column will be binary, which is, is it missing or not. So, we need to know which
00:03:29.320 | things are going to be the categorical variables, which column names are going to be the continuous
00:03:35.000 | variables, so we know how to process each one, we're going to need to know how to split
00:03:42.920 | our validation and training set, and we need to know what's our dependent variable. So,
00:03:50.560 | we've created a class called Tabular. Basically, Tabular contains a data frame, and it also
00:04:04.240 | contains a list of which things are categorical, which things are continuous, and what's your
00:04:10.800 | dependent variable, and also some processes, which we'll look at in a moment, where they
00:04:18.320 | do things like turning strings into ints for categories, and filling the missing data,
00:04:24.160 | and doing normalization of continuous. So that creates a Tabular object, and from a Tabular
00:04:32.640 | object you can get a data source if you pass in a list of split indexes. So, feel free to
00:04:39.560 | ask also Andrew if you have any questions as we go. Oh, what's that David? Do you want
00:04:47.840 | a Kaggle competition? What's the Kaggle competition? Tell us more. Oh, Kaggle, which one? So now
00:04:59.440 | that we've got a data source, we created data loader from it, and so we have a little wrapper
00:05:05.760 | for that to make it easier for Tabular. And so then we can go show batch. Oh, it's broken
00:05:13.760 | somehow. Nice one, Jeremy. Damn it. This was all working a moment ago. And then Andrew and
00:05:24.800 | I were just changing things at the last minute, so what did we break? Okay, I am going to
00:05:40.800 | ignore that. Alright, so we just broke something apparently before we started recording, but
00:05:50.640 | anyway show batch would then show the data. And then you can take a test set that's not
00:05:58.320 | processed. And basically what you want to be able to do with a test set is say I have
00:06:03.200 | the same categorical names and continuous names and the same dependent variable and
00:06:06.800 | the same processes, and you should then be able to apply the same pre-processing to the test set.
00:06:13.120 | And so to.new creates a new Tabular object with all that metadata, and then process will do the
00:06:20.000 | same processing. And so then you can see here, here's the normalized age, normalized education,
00:06:27.680 | all these things have been turned into ints. Oh, and so forth. Excuse me. So that's basically what
00:06:37.120 | this looks like. You'll see I've used a subclass of Tabular called Tabular Pandas. I don't know
00:06:46.080 | if it's always going to stay this way or not, but currently, well we want to be able to support
00:06:51.040 | multiple different backends. So currently we have a Pandas backend for Tabular and we're also
00:06:57.520 | working on a RAPIDS backend. For those of you that haven't seen it, RAPIDS is a really great project
00:07:03.520 | coming out of NVIDIA, which allows you to do, so they've got cuDF, which basically
00:07:09.440 | gives you GPU-accelerated data frames. So that's why we have different subclasses.
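The preprocessing steps described at the start (categoricals to ints with a reserved slot for missing/unknown values, median-filling continuous columns, and adding a binary was-missing column) can be sketched in plain pandas. This is an editor's illustration, not fastai's implementation; the column names and data are made up:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a couple of adult-dataset-style columns (names are illustrative).
df = pd.DataFrame({
    "workclass": ["Private", "Private", None, "State-gov"],  # categorical
    "age":       [25.0, np.nan, 40.0, 31.0],                 # continuous
})

# 1. Categorical -> ints, reserving index 0 for missing/unknown (like #na#).
vocab = ["#na#"] + sorted(v for v in df["workclass"].dropna().unique())
o2i = {v: i for i, v in enumerate(vocab)}
df["workclass"] = df["workclass"].map(lambda v: o2i.get(v, 0))

# 2. Continuous: record which values were missing in a binary indicator column...
df["age_na"] = df["age"].isna().astype(int)
# 3. ...then fill the missing values with the column median.
df["age"] = df["age"].fillna(df["age"].median())
```
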
00:07:20.320 | I don't know if this will always be a subclass of Tabular. We may end up instead of having
00:07:23.680 | subclasses of data source, but for now we've got subclasses of Tabular. So that's where we're
00:07:29.360 | heading. All right. Do you have any advice to speed up inference on a tabular learn of predictions?
00:07:40.240 | Well, hopefully there'll be fast, particularly if you use RAPIDS. I mean, yeah,
00:07:45.600 | basically, if you use RAPIDS. In fact, if you look at RAPIDS, NVIDIA, fast.ai, hopefully,
00:07:54.880 | yeah, here we go. So Even, who a lot of you will remember from the forums, is at NVIDIA now,
00:08:04.480 | which is pretty awesome, and he's working on the RAPIDS team and he recently posted how he got a
00:08:10.960 | 15x acceleration and got in the top 20 of a competition, a very popular competition on Kaggle
00:08:18.800 | by combining RAPIDS, PyTorch, and fast.ai. So that would be a good place to start, but hopefully by
00:08:24.480 | the time fast.ai version 2 comes out this would be super simple, and we're working with Even on
00:08:30.000 | making that happen. So thank you, Even and NVIDIA for that help.
00:08:37.360 | All right. So let's start at tabular. So basically, the idea of tabular is that we
00:08:50.560 | have a URL. Oh, yes. I just googled for NVIDIA AI RAPIDS, fast.ai. And here it is, accelerating,
00:09:09.200 | deep learning recommendation systems. I'm sure we can, somebody will hopefully add it to the notes
00:09:12.800 | as well. Okay. So the basic idea with tabular was I kind of wanted to, I kind of like to have a
00:09:28.160 | class which kind of has all the information it needs to do what we want it to do. So in this case,
00:09:36.400 | you know, a data frame doesn't have enough information to actually build models because
00:09:45.520 | until you know what the categorical variables are, the continuous variables are and what pre-processing
00:09:50.480 | to do and what the dependent variable is, you know, you can't really do much. So that's the
00:09:55.920 | basic kind of idea of this design was to have something with that information. So categorical
00:10:03.600 | names, continuous names, y names; normally it's just one dependent variable, so one y name,
00:10:10.560 | but you could have more and processes. So those are basically the four things that we want to
00:10:16.960 | put in our tabular. So -- oh, that was crazy. There we go.
00:10:44.560 | All right. So that's why we're passing in those four things. And then the other thing
00:10:57.120 | we need to know about y is whether y is categorical or continuous, so you can just pass that in as
00:11:03.440 | a boolean. So Aman wants to know why might we need more than one y-name. So a few examples,
00:11:11.040 | you could be doing a regression problem where you're trying to predict the x and y coordinates
00:11:15.920 | of the destination of a taxi ride, or perhaps you're just doing a multi-label classification
00:11:22.320 | where you can have multiple things being true, would be a couple examples.
00:11:26.560 | And maybe it's already one-hot encoded or something in that case.
00:11:29.840 | Okay. So that's the theory behind the design of tabular. So when we initialize it, we just
00:11:40.160 | pass in that stuff. The processes that we pass in are just going to be transforms,
00:11:46.320 | so we can dump in a pipeline. And so this stuff kind of you'll see that we just keep reusing the
00:11:51.280 | same foundational concepts throughout fast.ai version 2, which is a good sign that they're
00:11:55.760 | strong foundations. So the processes are, you know, things that we want to run a bunch of them,
00:12:03.360 | we want to depend on what type something is, whether we run it or not, you know, stuff like
00:12:07.520 | that. So it's got all that kind of behavior we want in the pipeline. Unlike TfmdDS, TfmdDL,
00:12:16.560 | TfmdList, all of those things apply transformations lazily. On tabular data,
00:12:23.680 | we don't generally do that. There's a number of reasons why. The first is that unlike kind of
00:12:29.600 | opening an image or something, it doesn't take a long time to grab a row of data.
00:12:36.640 | So like it's fine to read the whole lot of rows normally, except in some cases of really,
00:12:42.720 | really big datasets. The second reason is that most kind of tabular stuff is designed to work
00:12:49.520 | quickly on lots of rows at a time. So it's going to be much, much faster if you do it ahead of time.
00:12:53.920 | The third is that most preprocessing is going to be not data augmentation kind of stuff, but more
00:13:00.640 | just once off cleaning up labels and things like that. So for all these kinds of reasons,
00:13:06.800 | our processing in tabular is generally done ahead of time rather than lazily,
00:13:11.120 | but it's still a pipeline of transforms. Okay, so then we're going to store something called
00:13:17.680 | cat_y, which will be our y variable if it's categorical, otherwise None, and vice versa
00:13:24.160 | for cont_y. So that's our initializer, and let's take a look at it in use. So basically we create
00:13:33.440 | a data frame containing two columns, and from that we can create a tabular object passing in
00:13:41.120 | that data frame and saying our cat names, con names, y names, whatever. So in this case we'll
00:13:46.400 | just have cat names. One thing that's always a good idea is to make sure that things pickle
00:13:51.760 | okay because for inference and stuff all this metadata has to be pickleable. So dumping
00:13:56.800 | something to a string and then loading that string up again is always a good idea to make sure it
00:14:00.720 | works and make sure it's the same as what you started with. So we're inheriting from
00:14:07.280 | CollBase, and CollBase is just a small little thing in core which defines the basic things you would
00:14:16.480 | expect to have in a collection, and it implements them by composition. So you pass in some list or
00:14:24.880 | whatever, and so the length of your CollBase will be the length of that list. So like you can often
00:14:31.040 | just inherit from something to do this, but composition often can give you some more
00:14:37.520 | flexibility. So this basically gives you the very quick, you know, getItem is defined by
00:14:43.520 | calling the self.items getItem, and length is defined by calling the self.items length,
00:14:47.600 | and so forth. So by inheriting from CollBase we can then in the init simply say super().__init__(df),
00:14:57.840 | and so that means we're going to now going to have something called self.items,
00:15:01.760 | which is going to be that data frame. And so here we check that, you know, the pickled version of
00:15:10.080 | items is the same as the tabular object's version of items. Okay, so let me just run this.
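The CollBase idea, collection behaviour by composition, can be sketched in a few lines. This is a simplified stand-in for illustration; the real fastai class defines a few more methods:

```python
class CollBase:
    "Define basic collection behaviour by composition: delegate to a wrapped `items`."
    def __init__(self, items): self.items = items
    def __len__(self): return len(self.items)
    def __getitem__(self, k): return self.items[k]
    def __setitem__(self, k, v): self.items[k] = v

class MyColl(CollBase):
    "A subclass gets len/getitem for free by handing its data to super().__init__."
    def __init__(self, items): super().__init__(items)

c = MyColl(["a", "b", "c"])
```
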
00:15:39.600 | Okay, so the next thing to notice is that there are various useful little attributes like
00:15:48.320 | all_cols; all_cats for all categorical columns; all_conts for all continuous columns;
00:15:58.560 | and then there's also the same with names at the end: all_cont_names, all_cat_names,
00:16:04.080 | all_col_names. So you can say all_cont_names is just the continuous names plus the continuous y
00:16:10.880 | if there is one. This would not work except for the fact that we use the capital L plus, so if you
00:16:19.360 | add a none to a capital L plus it doesn't change it, which is exactly what we want most of the
00:16:24.800 | time. So some of the things in capital L are a bit more convenient than this. I think that's right,
00:16:32.000 | anyway. Maybe I should double check before I say things that might not be true. Yeah, so with a plain list you can't
00:16:39.440 | do that, whereas with an L you can do that and get exactly the expected behavior. And Ls, by the way, always
00:16:48.960 | show you how big they are before they print out their contents. You'll see that all_cols does not
00:17:03.040 | actually appear anywhere here, and that's because we just created a little thing called add property
00:17:08.080 | that just adds cats, conts, and cols. And so for each one it creates a
00:17:15.200 | read version of the property, which is to just grab whatever cat_names,
00:17:20.720 | cont names, etc., which are all defined in here, and then indexes into our data frame with that
00:17:28.160 | list of columns. And then it creates a setter, which simply sets that list of names to whatever
00:17:40.560 | value you provide. So that's just a quick way to create the setter and get a version of all of
00:17:45.920 | those. So that's where all_cols comes from. So in this case, because the TabularObject
00:17:53.600 | only has this one column mentioned as being part of what we're modeling,
00:17:59.360 | even though the data frame had an a and a b, the tabular object's all_cols only has the a
00:18:08.160 | column in it, because by all_cols it means all of the columns we're using in modeling, so
00:18:13.440 | continuous and categorical and dependent variables. And so one of the nice things
00:18:20.080 | is that, you know, because everything is super consistent in the API, we can now just say .show,
00:18:28.240 | just like everything else, we can say .show, and in this case we see the all_cols data frame.
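The getter/setter-pair trick just described can be sketched like this. It's a toy version, with a plain dict standing in for the data frame and illustrative names, not fastai's actual helper:

```python
def add_prop(cls, name):
    "Add a `name` property: the getter indexes items by the `<name>_names` list, the setter replaces that list."
    def getter(self):
        return {c: self.items[c] for c in getattr(self, f"{name}_names")}
    def setter(self, names):
        setattr(self, f"{name}_names", names)
    setattr(cls, name, property(getter, setter))

class Tab:
    def __init__(self, items, cat_names, cont_names):
        self.items, self.cat_names, self.cont_names = items, cat_names, cont_names

for n in ("cat", "cont"): add_prop(Tab, n)

t = Tab({"a": [0, 1], "b": [1.5, 2.5], "c": [9, 9]}, ["a"], ["b"])
```

Assigning `t.cat = ["c"]` then replaces the stored name list, so the getter picks up the new columns.
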
00:18:36.080 | Okay, so then
00:18:41.600 | processes, TabularProcesses, are just transforms. Now specifically they're in-place transforms,
00:18:53.360 | but don't let that bother you because in-place transform is simply a transform where when we
00:19:00.800 | call it and then we return the original thing. So like all the transforms we've seen so far
00:19:07.600 | return something different to what you passed in, like that's the whole point of them,
00:19:12.160 | but processes, the whole point of them is that they change the actual stored data,
00:19:18.800 | so that's why they just return whatever you started with. So that's all in-place transform
00:19:27.040 | means. And so a TabularProc is just a transform that returns what you passed in when you call it,
00:19:34.240 | and when you set it up, it just does the normal set up, but it also calls dunder call, in other
00:19:46.000 | words, self with round brackets: self(). Why does it do that? Well, let's take a look at an example,
00:19:52.400 | Categorify. So Categorify is a TabularProc where the setup is going to create a
00:20:01.760 | category map, so this is just a mapping between ints and the string items of the vocab,
00:20:09.760 | that's all a category map is. And so it's going to go through all of the categorical columns,
00:20:21.840 | and it's going to go .iloc into the data frame for each of those columns,
00:20:26.080 | and it's going to create a category map for that column. So this is just creating, so self.classes
00:20:34.480 | then is going to be a dictionary that goes from the column names to the vocab for that categorical
00:20:41.600 | column. So that's what setup does, right? So setup kind of sets up the metadata, it's the vocab.
00:20:49.440 | Encodes, on the other hand, is the thing that actually takes a categorical column and converts
00:20:57.120 | it into ints using the vocab that we created earlier. And so we need them to be two separate
00:21:04.480 | things because if you think about inference, at inference time you don't want to run setup,
00:21:12.400 | at inference time you just want to run encodes. But at training time you want to do both, right?
00:21:21.680 | Any time you do setup, you're definitely also going to want to process. So that's why setup,
00:21:27.520 | after it sets up, immediately does the encoding because that's in practice always what you want.
00:21:37.280 | So that's why we override setup in TabularProc. That's all TabularProc is, it's just a transform
00:21:43.280 | that when you set it up it also calls it straight away.
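That behaviour can be sketched as follows: a proc whose setup computes its state and then immediately applies its encodes. This is toy code for illustration, not fastai's Transform machinery, and the class names are made up:

```python
class Proc:
    "Sketch of a TabularProc: setup computes metadata, then processes straight away."
    def setups(self, col): pass            # compute state (e.g. a vocab)
    def encodes(self, col): return col     # apply the processing
    def __call__(self, col): return self.encodes(col)
    def setup(self, col):
        self.setups(col)
        return self(col)                   # in practice you always want both

class ToyCategorify(Proc):
    def setups(self, col):
        # Reserve index 0 for missing/unknown, like fastai's #na#.
        self.vocab = ["#na#"] + sorted(set(col))
        self.o2i = {v: i for i, v in enumerate(self.vocab)}
    def encodes(self, col):
        return [self.o2i.get(v, 0) for v in col]

proc = ToyCategorify()
encoded = proc.setup([0, 1, 2, 0, 2])
```

At inference time you would skip `setup` and just call the proc, reusing the stored vocab; unseen values fall back to 0.
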
00:21:45.840 | Okay, so that, so Categorify is pretty similar to the
00:21:54.960 | categorized transform we've seen for dependent variables for like image classification,
00:22:04.560 | but it's a little bit different because it's for TabularObjects. So you can see an example.
00:22:10.880 | Here's a data frame with 0, 1, 2, 0, 2 in just a single column. So again we can create a TabularObject
00:22:19.680 | passing in the data frame, passing in any processes we want to run, and passing in,
00:22:25.600 | so the first thing that we pass in after that will be the category names, the categorical names,
00:22:31.920 | so we're just going to have one. So once you have created your TabularObject, the next thing you
00:22:40.640 | want to do is to call setup, and remember that setup's going to do two things. It's going to call
00:22:45.280 | your setups and it's going to call your encodes. Are there any object detectors? We haven't done
00:22:58.080 | any models yet, David. We're only really working through the transformation pipeline, but we have
00:23:02.800 | certainly looked at version 2's object detectors in previous lessons, like, well, not in detail.
00:23:09.600 | We've touched on the places where the object detection data is defined, and
00:23:17.120 | yeah, hopefully it's clear enough that you can figure that out.
00:23:26.880 | Yeah, so patch property does not call, I mean you can check the code easily enough yourself,
00:23:34.000 | right? So patch property, let's see, okay, so no, it doesn't, it calls patch_to,
00:23:42.240 | patch_to, no, it doesn't. So there's no way to create a setup using that decorator yet.
00:23:56.240 | If you can think of a good syntax for that that you think would work well,
00:23:59.680 | feel free to suggest it or even implement it in a PR.
00:24:02.400 | Okay, so, all right, so we've created our TabularObject. As we told it to
00:24:13.760 | categorify, we said this is our only categorical variable. Then we call setup, which is going to
00:24:19.680 | go ahead and let's have a look. Setup is going to, in our processor, create classes and then
00:24:32.640 | it's going to change our items to call apply_cats, which will map the data in that column
00:24:49.680 | using this dictionary. So map is an interesting pandas method. You can pass it a function,
00:24:57.760 | which is going to be super slow, so don't do that, but you can also pass it a dictionary
00:25:02.240 | and that will just map from keys to values in the dictionary. So if we have a look at
00:25:11.520 | at this case, we can see that to.a starts out as 0, 1, 2, 0, 2 and ends up as 1, 2, 3, 1, 3, and the
00:25:24.640 | reason for that is that the vocab that it created is #na#, 0, 1, 2. So whenever we create a
00:25:34.960 | categorized column for the vocab, we always put a #na# at the start, which is similar to what we've
00:25:42.480 | done in version 1. So that way if you in the future get a value outside of your vocab,
00:25:51.280 | then that's what we're going to set it to. We're going to set it to #na#, and so therefore 1, if we
00:25:57.520 | index that into the vocab, maps to the value 0. So that's why the 0 became a 1, for example.
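The dict-based map just described, plus the unknown-goes-to-zero behaviour via a defaultdict, can be sketched in plain pandas. Illustrative data, not fastai's code:

```python
from collections import defaultdict

import pandas as pd

col = pd.Series([0, 1, 2, 0, 2])
vocab = ["#na#"] + sorted(col.unique())

# defaultdict(int): anything not in the vocab maps to 0, the #na# slot.
o2i = defaultdict(int, {v: i for i, v in enumerate(vocab)})

# Passing a dict-like to .map is much faster than passing a function.
encoded = col.map(o2i)
unseen  = pd.Series([1, 0, 3, -1, 2]).map(o2i)   # 3 and -1 fall back to 0
```
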
00:26:04.800 | One of the things that I recently added to pipeline, because remember that to.procs
00:26:14.080 | is a pipeline, right? And one of the things I added to pipeline is a getattr, which will
00:26:25.360 | try to, so this is not an attribute in pipeline, not surprisingly, so if it finds an attribute
00:26:33.280 | it doesn't understand, it will try and find that attribute in any transforms inside that pipeline,
00:26:39.440 | which is exactly what we want, you know? So in this case, it's going to look for a transform
00:26:51.040 | with the type Categorify, and it converts the type name to snake case. So this is very similar if you've
00:26:58.320 | watched the part 2, the most recent part 2 videos, we did the same thing for callbacks. Callbacks
00:27:04.160 | got automatically inserted, and I think version 1 does this too, getting automatically added as
00:27:08.160 | attributes. So pipeline does something very similar, but it doesn't actually add them as
00:27:13.600 | attributes, it uses getatra to do the same thing. So in this case we added a categoryify transform,
00:27:21.280 | so we haven't instantiated it, we just passed in the type, so it's going to instantiate it for us.
00:27:25.840 | Pipeline will always instantiate your types for you if you don't instantiate them, and so later
00:27:30.720 | on we want to say, okay, let's find out what the vocab was, which means we need to grab the
00:27:36.160 | processes out of our tabular object and ask for the Categorify transform. So now that we've got
00:27:43.120 | that Categorify transform, Categorify defines getitem and it will return the vocab for that
00:27:54.640 | column. So here is the vocab for that column, and so we can have a look, cat a,
00:28:11.600 | okay, as you can see there it is, and that is
00:28:14.800 | actually of type CategoryMap, so
00:28:20.480 | as well as having the items that we just saw, it also has the reverse mapping,
00:28:28.000 | so this is, as you can see, goes the opposite direction. So to answer Aman's question,
00:28:36.160 | yes, this should take care of the mapping in the test set, because if it comes across,
00:28:43.760 | so it's going to look at o2i when it tries to call apply cats, it's going to try to find in your
00:28:52.000 | categories, it's going to try and find that column, it's going to grab o2i, which is this dictionary,
00:29:00.960 | that's then going to use that to map everything in the column, but because it's a default dict,
00:29:05.440 | if there's anything it doesn't recognize, it will become zero, which is the #na#
00:29:11.040 | category. So yeah, that should all work nicely. You have to figure out how to model it,
00:29:18.480 | of course, but the data processing will handle it for you. Okay, so
00:29:27.360 | what else? So now, imagine it's inference time,
00:29:36.720 | and so we come along with some new data frame that we want to run inference on,
00:29:45.120 | or it's a test set or whatever, so here's our data frame. So we now have to say, okay,
00:29:50.960 | I want to create a new tabular object for the same metadata that we had before, so the same
00:29:56.880 | processes, the same vocab, the same categorical continuous variables. The way to do that is to
00:30:03.200 | start with an existing tabular object which has that metadata and call new and pass in a new data
00:30:08.720 | frame that you have, and that's going to give you a new tabular object with the same metadata and
00:30:13.280 | processes and stuff that we had before, but with this different data in it. Now of course, we don't
00:30:20.640 | want to call setup on that because setup would replace the vocab, and the whole point is that
00:30:25.680 | we want to actually use the same vocab for inference on this new data set. So instead you
00:30:34.000 | call process, and so all process does in tabular is it just calls the processes,
00:30:41.760 | which is a pipeline, so you can just treat it as a function.
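Since the processes are a pipeline, calling them is just function application. A minimal sketch of that idea, an editor's toy class rather than fastai's Pipeline:

```python
class ToyPipeline:
    "A pipeline of procs is itself callable: it applies each proc in order."
    def __init__(self, procs): self.procs = procs
    def __call__(self, x):
        for p in self.procs:
            x = p(x)
        return x

# Two illustrative stand-in procs.
double = lambda xs: [v * 2 for v in xs]
shift  = lambda xs: [v + 1 for v in xs]

pipe = ToyPipeline([double, shift])
out = pipe([1, 2, 3])
```
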
00:30:50.480 | So in this case, our vocab was 0, 1, 2 and an #na#, and here there's a couple of things that are not
00:31:00.000 | in that list of 0, 1, 2, specifically this one is 3 and this one is -1. So 1, 0 and 2 will get
00:31:10.640 | replaced by their vocab indexes, so 2, 1, 3, 2, 1, 3, so there's our new a column, and then as we just
00:31:23.840 | discussed the two things that don't map are going to be 0. So then if you call decode on our processor,
00:31:36.960 | then as you would expect you end up with the same data you started with, but of course this is now
00:31:44.240 | going to be #na#, because we said we don't know what those are. So like decoding in general in fastai
00:31:53.840 | doesn't mean you always get back exactly what you started with, right? It's kind of it's trying to
00:32:00.400 | display the kind of transformed version of the data. In some cases like normalization, it should
00:32:11.280 | pretty much be exactly what you started with, but for some things like categorify the missing values
00:32:16.640 | it won't be exactly what you started with. You don't have to pass in just a type name,
00:32:24.400 | you can instantiate a processor yourself and then pass that in, so then that means you don't have to
00:32:32.800 | dig it out again like this, so sometimes that's more convenient, so this is just another way of
00:32:39.120 | doing the same thing. But in this case we're also going to split the training set and the
00:32:44.160 | validation set, and this is particularly important for things like categorify, because if our training
00:32:49.680 | set is the first three elements and our validation set is the last two, then this thing here three
00:32:57.040 | is not in the training set, and so therefore it should not be part of the vocab. So let's make
00:33:02.320 | sure that that's the case. So here we are, categorical variable, yep it doesn't have three in it,
00:33:08.160 | the vocab doesn't have three in it. So the way we pass in these split indexes is by calling
00:33:14.640 | tabular object dot data source, and that converts the tabular object to a data source, the only
00:33:19.760 | thing you pass it is the list of splits. And so that gives you a standard data source object,
00:33:30.480 | just like the one that we saw in our last walkthrough. So that's what you get, and so
00:33:37.840 | that data source, let's take it out, right, so that data source object will have a train,
00:33:44.240 | for example, and a valid. And those are just other ways of saying subset zero and subset one.
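The splitting and train-only vocab points above can be sketched like this. Toy data mirroring the example, where the 3 appears only in the validation rows; not fastai's code:

```python
col = [0, 1, 2, 3, 2]
splits = ([0, 1, 2], [3, 4])   # train indexes, valid indexes

# subset(0) / subset(1) are just "index into the items with each split".
train, valid = [[col[i] for i in s] for s in splits]

# The vocab is built from the training subset only, so the valid-only
# value (the 3) doesn't make it in.
vocab = ["#na#"] + sorted(set(train))
```
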
00:33:59.680 | Should the test set be three comma two? No, these are the indexes of the things
00:34:14.000 | in the training set, and these are the indexes in the validation set, so these are indexes three and four
00:34:23.360 | in the validation set. Yes, they're iloc indexes. So data source, so like in terms of looking at
00:34:36.240 | the code of tabular, it's super tiny, which is nice, in terms of the things that are more than
00:34:45.280 | one line in it, because there's a bunch of things in store, and data source. And the only reason data
00:34:51.440 | source is more than a couple of lines is because in RAPIDS, on the GPU, trying to index into a data
00:35:02.720 | frame with like arbitrary indexes is really, really, really, really slow. So you have to pass
00:35:12.080 | in a contiguous list of indexes to make RAPIDS fast. So what we do is when you pass in splits,
00:35:20.720 | we actually concatenate all of those splits together into a single list and we index into
00:35:31.360 | the data frame with that list. So that's going to shuffle the list so that all the stuff that's in
00:35:36.640 | the same validation or training set is all together. And so that way, now when we create
00:35:44.080 | our data source rather than passing in the actual splits, we just pass in a range of all the numbers
00:35:49.760 | from nought to the length of the first split, and then all of the numbers from the length of the
00:35:55.840 | first split to the length of the whole thing. And so our data source is then able to always use
00:36:04.640 | contiguous indexes. So that's why that bit of code is there. Other than that, one thing I don't like
00:36:17.840 | about Python is that any time you want to create a property, you have to put it on another row
00:36:26.480 | like this, like in a lot of programming languages you can kind of do that kind of thing on the same
00:36:31.520 | line but not in Python for some reason. And so I find it takes up a lot of room just for the
00:36:37.040 | process of saying these are properties. So I just added an alternative syntax which is to create a
00:36:42.160 | list of all the things that are properties. So that's all that is. Like most of these things,
00:36:48.800 | it's super tiny. So that's just one line of code. Just goes through and calls property on them to
00:36:55.840 | make them properties. Oh, okay. So then another thing about this is I kind of tried to make
00:37:10.240 | tabular look a lot like DataFrame. And one way that happens is we've inherited from GetAttr
00:37:16.480 | which means that any unknown attributes it's going to pass down to whatever is the default
00:37:24.560 | property, which is self.items, which is a DataFrame. So in other words, it behaves a lot like a DataFrame
00:37:31.440 | because anything unknown it will actually pass it along to the DataFrame. But one thing I did want
00:37:37.520 | to change is in DataFrames it's not convenient to index into a row by row number and a column by
00:37:48.880 | name. You can use iloc to get row by number, column by number. You can use loc to say row by
00:37:56.400 | name or index and column by name or index. But most of the time I want to use row numbers and
00:38:03.600 | column names. So we redefine iloc to use this tabular iloc indexer, which is here.
00:38:13.360 | And as you can see if you have a row and a column then the columns I actually replace with
00:38:26.160 | the integer index of the column. So that way we can use column names and row numbers.
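That name-to-position translation can be sketched directly in pandas. The helper name here is made up for illustration; fastai wraps this idea in its own indexer class:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 11, 12], "b": ["x", "y", "z"]}, index=["p", "q", "r"])

def iloc_by_name(df, rows, col_names):
    "Row *numbers* plus column *names*: translate names to positions, then use plain .iloc."
    col_idx = [df.columns.get_loc(c) for c in col_names]
    return df.iloc[rows, col_idx]

sub = iloc_by_name(df, [0, 2], ["b"])
```
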
00:38:33.520 | And also it will wrap it back up in a tabular object as well. So we end up with
00:38:43.360 | if you index into a tabular object with iloc you'll get back a tabular object.
00:38:53.280 | So then the way Categorify is implemented as you see in encodes is it calls transform,
00:39:02.800 | passing in a bunch of column names and a function. The function apply cats is the thing we saw before
00:39:12.000 | which is the thing that calls map unless you have a pandas categorical column in which case you
00:39:19.200 | actually already have the pandas has done the coding for you. So you just return it which is
00:39:24.560 | cat.codes+1. And so how does this function get applied to each of these columns? That's because
00:39:36.960 | we have a thing called tabular object.transform and that's the thing that at the moment is defined
00:39:44.080 | explicitly for pandas. And as you can see for pandas it just is this column equals the
00:39:51.840 | transform version of this column because pandas has a dot transform method for series.
00:39:58.720 | So that's all we needed to do there.
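The per-column transform just described can be sketched as a small helper over pandas' Series.transform. The function name is illustrative, not fastai's actual method:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

def transform_cols(df, cols, f):
    "Apply `f` to each named column in place, via pandas' Series.transform."
    for c in cols:
        df[c] = df[c].transform(f)

transform_cols(df, ["a", "b"], lambda x: x + 1)
```
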
00:40:11.440 | Okay, so Juvian, your question there about numerical versus continuous: you should watch the
00:40:17.040 | introduction to machine learning for coders course where we talk about that in a lot of detail.
00:40:23.520 | That's not something we really have time to cover here.
00:40:26.880 | All right, so that's Categorify. So here's the other way to use Categorify as I was kind of
00:40:40.800 | beginning to mention is you can actually create a categorical column in pandas. And one of the
00:40:46.080 | reasons to do that is so that you can say these are the categories I want to use and they have
00:40:50.480 | an order and so that way high, medium, low will now be ordered correctly which is super
00:40:56.960 | useful. Also pandas is just nice and efficient at dealing with categories. So now if we go
00:41:06.320 | Categorify just like before, it's going to give us exactly the same kind of results except that
00:41:12.720 | when we look at the categorical processor, these will be in the right order. So we're going to end
00:41:21.120 | up with things that have been mapped in that way and it will also be done potentially more
00:41:26.160 | efficiently because it's using the internal pandas cat code stuff. Thank you David, that is very kind.
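The ordered-categorical idea just described can be shown with plain pandas; this is standard pandas usage, not fastai-specific code:

```python
import pandas as pd

# Declaring the categories up front fixes both the set of levels and
# their order, so low < medium < high codes and compares correctly:
s = pd.Series(['high', 'low', 'medium', 'low'])
s = s.astype(pd.CategoricalDtype(['low', 'medium', 'high'], ordered=True))

print(s.cat.codes.tolist())  # [2, 0, 1, 0]
print(s.min())               # 'low' -- comparisons respect the declared order
```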
00:41:38.240 | I'm so thrilled. I would love to know where you are working as a computer vision data scientist.
00:41:43.440 | It's a very cool job.
00:41:44.880 | Andrew is starting his job as the in-house research director and data scientist here
00:41:55.840 | in 10 days' time, and he's also worked very, very, very, very hard on being amazing, which also helps.
00:42:05.600 | Lots of people do the course and don't end up getting jobs because they don't work as hard.
00:42:14.960 | Although I think everybody listening to this by definition is putting in an extra level of effort.
00:42:22.160 | So I'm sure you will all do great. Normalize is, you know, just something where we're going to
00:42:31.760 | subtract the means and divide by the standard deviations, and for decodes we're going to do the
00:42:37.920 | opposite. And you'll see what we generally do in these setup things, we do this in lots of
00:42:44.880 | places, is we say getattr(dsrc, 'train', dsrc). What this means is
00:42:55.200 | it's the same as if we'd written df = dsrc.train if hasattr(dsrc, 'train')
00:43:11.600 | else dsrc. It's the same as writing that, right? And so the reason we're doing that
00:43:20.800 | is because we want you to be able to either pass in a an actual data source object which has a train
00:43:28.960 | and a valid or not. You know if you aren't doing separate things to train and valid then that
00:43:36.000 | should be fine as well, right? And so as long as the thing you pass in, if it does have a train,
00:43:42.320 | then it should you know it should give you back some kind of object that has the right methods
00:43:48.320 | that you need. So in the case of you know data source and tabular it will, data source has a train
00:43:57.440 | attribute that will return a tabular object or if you just pass in a tabular object directly
00:44:05.600 | then it won't have a train, so dsrc will just stay a tabular object, which has a
00:44:12.480 | continuous-variables conts attribute. Okay, so now we have this data frame containing just the
00:44:22.720 | continuous variables and optionally just for our training set if we have one so then we can set up
00:44:29.440 | our means and standard deviations you know that's the metadata we need. So here we can create our
00:44:37.360 | normalize, create a data frame to test it out on, pass in that processor; this time we just have to
00:44:43.840 | say these are the continuous variables, and make sure we call setup. And so now
00:44:51.280 | we should find here is the same data that we had before, but let's make it into an array for testing
00:44:59.600 | and calculate the mean and standard deviation. So we should find, if we go norm.means['a'],
00:45:05.600 | so self.means was df.mean this is quite nice right in pandas if you call .mean on a data frame
00:45:14.320 | you will get back a I think it's a series object which you can index into with column names.
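Putting the getattr idiom and the pandas column-wise stats together, a minimal Normalize-like processor might look like this; the class and method names follow the transcript, but this is a sketch, not fastai's implementation:

```python
import pandas as pd

class NormalizeLike:
    # Hypothetical sketch of the Normalize processor described above.
    def setup(self, dsrc):
        # Use the training set if the object has one, else the object itself:
        df = getattr(dsrc, 'train', dsrc)
        # df.mean() / df.std() return Series indexed by column name, so the
        # stats for every continuous column are computed in one go:
        self.means, self.stds = df.mean(), df.std()

    def encodes(self, df):
        return (df - self.means) / self.stds

    def decodes(self, df):
        return df * self.stds + self.means

df = pd.DataFrame({'a': [0., 1., 2., 3., 4.]})
norm = NormalizeLike()
norm.setup(df)           # plain DataFrame has no .train, so df itself is used
enc = norm.encodes(df)
print(norm.means['a'])   # 2.0
# decodes round-trips back to the original values (up to float error)
dec = norm.decodes(enc)
```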
00:45:21.600 | So this is quite neat right that we were able to get all the means and standard deviations
00:45:26.320 | all at once for all the columns and even apply them to all the columns at once. This is kind of
00:45:31.200 | the magic of Python indexes so I think that's that's actually pretty nice. So yeah make sure
00:45:39.120 | that the mean is m, the standard deviation should be around s, and the values after processing should
00:45:47.120 | be around (x - m) / s. One thing to notice is that we didn't call setup here. Why
00:46:03.760 | didn't we call setup and the reason why is that if you look at data source it calls setup. Why?
00:46:16.000 | Because we now definitely have all the information we need to set it up right. We know the data
00:46:20.640 | because that was a data frame you passed in and now we know what the training set is because that
00:46:25.200 | was passed in so there's no reason there's no reason to ask you to manually call setup.
00:46:30.080 | So you've got two ways to set up your processes. One is to call setup on your tabular object
00:46:40.560 | or the other is just to create a data source right and it's kind of
00:46:45.840 | it's something you kind of have to be aware of, right, because calling data source is not just
00:46:52.400 | returning a data source, it is also modifying your tabular object's data to process it, and so it's kind
00:46:59.600 | of like a very non-pure, non-functional kind of approach going on here. We're not changing
00:47:07.600 | things and returning them we're like changing things in place and the reason for that is that
00:47:12.240 | you know, with tabular data you don't want to be creating lots of copies of it,
00:47:18.240 | you really want to be doing stuff in place; there are, you know, important
00:47:22.560 | performance issues, so we try to do things just once and do them where they are. So normalize in this
00:47:30.000 | case we're calling setup and so again for inference you know here's some new data set we want to call
00:47:37.920 | inference on we go tabular object.new on the new data frame we process it we don't call setup
00:47:45.520 | because we don't want to create a new mean and standard deviation; we want to use the same standard
00:47:49.280 | deviation and mean that we used for our training. And then here's the version where we use instead
00:47:58.960 | data source. So you'll find that the mean and standard deviation now
00:48:07.600 | come from just 0, 1, 2, because that's the only stuff in the training set, and again, normalization and stuff
00:48:12.480 | should only be done with the training set. So, you know, all this stuff of kind of
00:48:17.360 | using this stuff makes it much harder to screw things up in terms of modeling and accidentally
00:48:25.280 | do things on the whole data set you know get leakage and stuff like that because you know
00:48:30.240 | we try to automatically do the right thing for you. Okay so then fill missing is going to
00:48:40.720 | go through each continuous column and it will see if there are any missing values in the column.
00:48:54.640 | And if there are missing values then it's going to create a
00:49:06.960 | na_dict. So if there are any NAs or any missing or any nulls (all the same idea in pandas),
00:49:14.640 | then that column will appear as a column name in this dictionary, and the value of it will be
00:49:23.040 | dependent on what fill strategy you ask for. So fill strategy is a class
00:49:33.200 | that contains three different methods
00:49:38.720 | and you can say which of those methods you want to use. Do you want to fill things with the median
00:49:49.360 | or with some constant or with the mode, right? And so we assume by default that it's the median.
00:49:57.760 | So this here is actually going to call fill strategy.median passing in the column
00:50:04.160 | and so it's going to return the median.
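The fill-strategy dispatch described here can be sketched as a class holding three plain functions, one per strategy; the exact signatures are assumptions, not necessarily fastai's:

```python
import pandas as pd

class FillStrategy:
    # Sketch of the three fill strategies described above: each takes the
    # column and a constant fill value, and returns the value to fill with.
    def median(c, fill):   return c.median()               # ignores NaNs
    def constant(c, fill): return fill                     # user-supplied value
    def mode(c, fill):     return c.dropna().mode().iloc[0]  # most common value

c = pd.Series([1.0, None, 3.0, 3.0])
print(FillStrategy.median(c, 0))    # 3.0
print(FillStrategy.constant(c, 0))  # 0
print(FillStrategy.mode(c, 0))      # 3.0
```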
00:50:06.640 | So that's the dictionary we create. So then later on when you're calling encodes
00:50:16.080 | we actually need to go through and do two things. The first thing is to use the pandas fillna
00:50:31.680 | to fill missing values with whatever value we put into the dictionary for that column,
00:50:38.320 | and again we do it in place. Then the second thing is, if you've asked to add an extra column to say
00:50:48.320 | which ones were filled in, which by default is true, then we're going to add a column with the
00:50:53.920 | same name with _na at the end, which is going to be a boolean of true if that was
00:51:02.000 | originally missing and false otherwise. So here you can see we're creating three different processes
00:51:10.960 | which are just processes, a fill missing processor with each of the possible
00:51:15.520 | strategies. And so then we create a data frame with a missing value and then we just go through
00:51:23.360 | and create three tabular objects with those three different processes and make sure that the
00:51:31.680 | na_dict for our 'a' column has the appropriate median or constant or mode as requested.
00:51:39.760 | And then remember setup also processes so then we can
00:51:46.800 | go through and make sure that they have been replaced correctly.
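The two-step fill just described (fillna with the stored value, plus a boolean `_na` column) is plain pandas, and can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, None, 3.0]})
na_dict = {'a': 2.0}   # e.g. the column median, stored at setup time

for c, fill in na_dict.items():
    # Record which rows were originally missing, then fill in place:
    df[c + '_na'] = df[c].isna()
    df[c] = df[c].fillna(fill)

print(df['a'].tolist())     # [1.0, 2.0, 3.0]
print(df['a_na'].tolist())  # [False, True, False]
```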
00:51:55.680 | And also make sure that the tabular object now has a categorical name, which is
00:52:02.400 | in this case a_na. So it's not enough just to add it to the data frame, it also has to be added to
00:52:10.160 | cat_names in the tabular object, because this is a categorical column we want to use for
00:52:16.000 | modeling. So Madhavan asks shouldn't setups be called in the constructor and no it shouldn't.
00:52:23.680 | Setups is what transforms call when you call setup using the type dispatch stuff we talked
00:52:31.600 | about in the transforms walkthrough and so and then setup is something which should be called
00:52:38.880 | automatically only when we have enough information to know what to set up with and that information
00:52:43.760 | is only available once you've told us what your training set is so that's why it's called by data
00:52:49.280 | source not called by the constructor but if you're not going to use a data source then you can call
00:52:56.960 | it yourself. Okay. Great. So this section is mainly kind of a few more examples of putting
00:53:15.600 | that all together. All right. So here's a bunch of processes: normalize, categorify, fill missing,
00:53:21.600 | do nothing at all. Obviously you don't need this one it's just to show you. And here's a data frame
00:53:27.040 | with a couple of columns. A is the categorical, B is the continuous because remember that was the
00:53:32.960 | order that we use. It would be probably better if we actually wrote those here at least the first
00:53:39.200 | time so you didn't have to remember. There we go. So we call setup, because we're not using a data
00:53:47.760 | source on this one. And so the processes you'll have noticed explicitly only work on the columns
00:53:59.200 | of the right type, so these work just on the continuous columns for normalize; for
00:54:04.480 | categorify, it goes through the categorical columns. You might have noticed that was all
00:54:15.600 | cat names, and that's because you also want to categorify categorical dependent variables,
00:54:21.600 | but for normalize, we don't normalize continuous dependent variables. Normally for that you'll
00:54:27.040 | do like a sigmoid in the model or something like that. So yeah so you can throw them all in there
00:54:36.320 | and it'll do the right thing for the right columns automatically. So it just goes through and makes
00:54:40.080 | sure that that all works fine. So these are really just a bunch of tests and examples.
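The routing just described, where each processor only touches columns of its own kind, can be shown with a toy sketch (the helper names `categorify` and `normalize` here are made up, standing in for the real processors):

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'x'],   # categorical column
                   'b': [1.0, 2.0, 3.0]})  # continuous column
cat_names, cont_names = ['a'], ['b']

def categorify(df, cols):
    # Map categories to 1-based integer codes, as discussed earlier:
    for c in cols:
        df[c] = df[c].astype('category').cat.codes + 1

def normalize(df, cols):
    # Subtract the mean and divide by the standard deviation:
    for c in cols:
        df[c] = (df[c] - df[c].mean()) / df[c].std()

# Each processor is handed only the columns of its own kind:
categorify(df, cat_names)
normalize(df, cont_names)
print(df['a'].tolist())  # [1, 2, 1]
print(df['b'].mean())    # ~0.0 after normalization
```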
00:54:46.160 | Okay so last section which is okay so now we have a tabular object which has got some cats
00:54:56.320 | and some cons and dependent variable wise. If we want to use this for modeling we need tensors.
00:55:04.960 | We actually need three tensors, one tensor for the continuous, one for the categorical and one
00:55:11.920 | for the dependent. And the reason for that is that they're continuous in the categorical of
00:55:16.720 | different data types so we can't put them all on the same tensor because tensors have to be all
00:55:21.440 | of the same data type. So if you look at the version one tabular stuff it's the same thing
00:55:28.240 | right, we have those three different tensors. So now we create one normal transform, so a lazy
00:55:37.360 | transform that's, you know, applied as we're getting our batches, and all we do is we say, okay,
00:55:52.400 | this is the tabular object which we're going to be transforming, and we just grab that.
00:56:01.920 | And in encodes we don't actually need that state at all. For encodes we're just going
00:56:01.920 | to grab all of the categorical variables turn them into a tensor and make it a long. And then
00:56:07.120 | we'll grab all the continuous turn it into a tensor make it a float. And so then the first
00:56:12.240 | thing in our tuple is itself a tuple with those two things, so that's our independent variables,
00:56:17.920 | and then our dependent variable is the target turned into a long. This is actually a mistake:
00:56:25.120 | it shouldn't always turn it into a long; it should only turn it into a long if it's continuous,
00:56:30.400 | sorry, categorical; otherwise it should be float, I think.
00:56:39.760 | No let's wait until we get to modeling I can't quite remember it's if it's categorical
00:56:44.800 | where they're going to want to encode it. No we're going to use it as yeah that's right
00:56:51.200 | so it's that's right so it's a long if it's categorical but for continuous it has to be float
00:56:55.760 | that's right. So we should use float for a continuous target. Okay, so that's a little mistake;
00:57:07.280 | we haven't done any tabular regression yet in version two.
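The three-tensor split described above can be sketched with NumPy dtypes standing in for the torch `.long()` / `.float()` calls; the column names and values here are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'workclass': [1, 2, 1],         # already categorified ints
                   'age':       [0.5, -0.3, 1.1],  # already normalized floats
                   'salary':    [0, 1, 0]})        # categorical target

cats  = df[['workclass']].to_numpy(np.int64)   # would be .long() in PyTorch
conts = df[['age']].to_numpy(np.float32)       # would be .float() in PyTorch
# Long for a categorical dependent variable; a regression target would
# stay float instead, as discussed above:
targ  = df['salary'].to_numpy(np.int64)

# The independent variables are themselves a tuple, then the target:
batch = ((cats, conts), targ)
print(cats.dtype, conts.dtype, targ.dtype)  # int64 float32 int64
```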
00:57:10.960 | So that's all encodes is going to do so then we'll come back to decodes later right. So in our
00:57:21.040 | example here we grabbed our path to the adult sample we read the CSV we split it into a test
00:57:28.000 | set and the main bit made a list of our categorical and continuous a list of the processes we wanted
00:57:37.920 | to use the indexes of the splits that we wanted so then we can create that tabular as we discussed
00:57:48.000 | we can turn it into a data source with the splits. Now you'll see here it never mentioned read tab
00:57:58.240 | batch, and the reason for that is that we don't want to force you to do things that we can do for
00:58:04.080 | you. So if you just say give me a tabular data loader rather than a normal data loader, then that tabular
00:58:10.000 | data loader is a transformed data loader where we know that for any after batch that you asked for,
00:58:18.480 | we have to also add in read tab batch. So that's how that's automatically added to the
00:58:27.280 | transforms for you. The other thing about tabular data loader is we want to do everything a batch
00:58:39.920 | at a time. So particularly for the RAPIDS on-GPU stuff, we don't want to pull out individual rows
00:58:47.440 | and then collate them later; everything's done by grabbing a whole batch at a time. So we replace
00:58:52.720 | do_item, which is the thing that normally grabs a single item for collation; we replace it with do
00:58:58.560 | nothing, replace it with noop, right. And then we replace create_batch, which is the thing that
00:59:04.800 | normally collates things, to say: don't collate things, but instead actually grab all of the samples
00:59:11.040 | directly from the tabular object using iloc. So, if you look at that blog post I mentioned
00:59:21.040 | from Even at NVIDIA about how they got the 16x speed-up by using RAPIDS, a key piece of that was
00:59:29.040 | that they wrote their own version of this kind of stuff to kind of do everything batch at a time
00:59:35.280 | and this is one of the key reasons we replaced the PyTorch data loader is to make this kind of thing
00:59:40.960 | super easy. So as you can see, creating a batch-at-a-time data loader is seven lines
00:59:46.800 | of code, super nice and easy. So yeah, I was pretty excited when this came out so quickly.
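A minimal sketch of what such a batch-at-a-time loader looks like; the class name and interface here are invented, standing in for the real tabular data loader: `do_item` becomes a no-op, and `create_batch` slices a whole batch out of the frame in one iloc call instead of collating individual rows:

```python
import pandas as pd

class BatchWiseDL:
    # Sketch: no per-row fetching followed by collation; each batch is
    # sliced from the DataFrame in a single .iloc call.
    def __init__(self, df, bs):
        self.df, self.bs = df, bs

    def do_item(self, i):
        return i  # no-op: items are not fetched individually

    def create_batch(self, idxs):
        return self.df.iloc[list(idxs)]  # grab the whole batch at once

    def __iter__(self):
        for i in range(0, len(self.df), self.bs):
            yield self.create_batch(range(i, min(i + self.bs, len(self.df))))

df = pd.DataFrame({'a': range(10)})
dl = BatchWiseDL(df, bs=4)
sizes = [len(b) for b in dl]
print(sizes)  # [4, 4, 2]
```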
00:59:54.240 | okay so
01:00:02.240 | so that's what happens when we create the tabular data loader
01:00:14.480 | we could of course also create a data bunch we should probably add this to the example
01:00:21.040 | and uh, yeah, that's basically it. So then at inference time, as we discussed, you can now
01:00:35.920 | do the same .new trick we saw before and then .process, and then you can grab, whatever,
01:00:41.600 | here's all_cols, which is going to give us a data frame with all the modeling columns.
01:00:46.080 | And so show_batch will be the decoded version, but this is not the decoded
01:00:51.280 | version; this is the encoded version that you can pass to your modeling. All right, any questions,
01:00:59.760 | Andrew? No? Okay, cool. All right, thanks. So that is it, and uh, it's Friday, right? Yeah, so I think
01:01:10.000 | I think we're on for Monday I'll double check and I'll let you all know um right away I will see
01:01:15.440 | you later bye