Back to Index

fastai v2 walk-thru #8


Chapters

0:00
32:41 split the training set and the validation set
40:41 create a categorical column in pandas
46:31 set up your processes

Transcript

Okay. There we go. Hi, everybody. Can you see me? Can you hear me? Okay. So, I'm going to turn it off. Okay. Great. Hi. Hope you're ready for Tabular. Oh, and we have Andrew Shaw here today as well. Andrew just joined WAMRI, the medical research institute, as Research Director, and he is the person you may be familiar with from MusicAutobot. If you haven't checked it out, you should, because it's super, super cool.

Now he's moving from the Jackson 5 to Medical AI Research. All right. So, Tabular is a cool notebook, I think. It's a bit of fun. And the basic idea, well, let's start at the end to see how it's going to look. So, we're going to look at the adult dataset, which is the one that we use in most of the docs in version one, and we used in some of the lessons, I think.

It's a pretty simple, small dataset. It's got 32,000 rows in it, and 15 columns. And here's what it looks like when we grab it straight from pandas. So, basically, in order to create models from this, we need to take the categorical variables and convert them into ints, possibly with a missing category. For the continuous variables, if there's anything with a missing value, we need to replace it with something, normally something like the median, and then normally we add an additional column for each column that has a missing value, and that column will be binary: is it missing or not.

So, we need to know which things are going to be the categorical variables, which column names are going to be the continuous variables, so we know how to process each one, we're going to need to know how to split our validation and training set, and we need to know what's our dependent variable.

So, we've created a class called Tabular. Basically, Tabular contains a data frame, and it also contains a list of which things are categorical, which things are continuous, and what's your dependent variable, and also some processes, which we'll look at in a moment. They do things like turning strings into ints for categories, filling the missing data, and normalizing the continuous variables.
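Here is a minimal sketch of how that ends up looking, using the released fastai v2 API (which may differ in small details from the dev notebook shown in this walkthrough); the particular column lists are just an assumption for the adult dataset:

```python
from fastai.tabular.all import *

# Grab the adult sample and build a Tabular(Pandas) object from a DataFrame,
# the categorical/continuous column names, the procs, the target and the splits.
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter(valid_pct=0.2)(range_of(df))

to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names='salary', splits=splits)
dls = to.dataloaders(bs=64)   # wraps the data source in tabular data loaders
dls.show_batch()
```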

So that creates a Tabular object, and from a Tabular object you can get a data source if you pass in a list of split indexes. So, feel free to also ask Andrew if you have any questions as we go. Oh, what's that David? Do you want a jagged competition?

What's a jagged competition? Tell us more. Oh, Kaggle, which one? So now that we've got a data source, we create a data loader from it, and we have a little wrapper for that to make it easier for Tabular. And so then we can go show batch. Oh, it's broken somehow.

Nice one, Jeremy. Damn it. This was all working a moment ago. And then Andrew and I were just changing things at the last minute, so what did we break? Okay, I am going to ignore that. Alright, so we just broke something apparently before we started recording, but anyway show batch would then show the data.

And then you can take a test set that's not processed. And basically what you want to be able to do with a test set is say I have the same categorical names and continuous names and the same dependent variable and the same processes, and you should then be able to apply the same pre-processing to the test set.

And so to.new creates a new Tabular object with all that metadata, and then process will do the same processing. And so then you can see here, here's the normalized age, normalized education, all these things have been turned into ints. Oh, and so forth. Excuse me. So that's basically what this looks like.
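In the released API that .new / process step looks roughly like this (a sketch, reusing the to and df from the snippet above; the released API may differ slightly from the dev version shown here):

```python
# Apply the already-fitted preprocessing to new data at inference time.
test_df = df.iloc[:10].copy()   # pretend this is unseen data with the same columns
to_test = to.new(test_df)       # same cat/cont/y names and procs, but new data
to_test.process()               # run the procs without re-running setup
to_test.items.head()            # categories as ints, continuous columns normalized
```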

You'll see I've used a subclass of Tabular called TabularPandas. I don't know if it's always going to stay this way or not, but currently, well, we want to be able to support multiple different backends. So currently we have a Pandas backend for Tabular and we're also working on a RAPIDS backend.

For those of you that haven't seen it, RAPIDS is a really great project coming out of NVIDIA. They've got cuDF, which basically gives you GPU-accelerated data frames. So that's why we have different subclasses. I don't know if this will always be a subclass of Tabular.

We may end up instead having subclasses of data source, but for now we've got subclasses of Tabular. So that's where we're heading. All right. Do you have any advice to speed up inference on a tabular learner's predictions? Well, hopefully they'll be fast, particularly if you use RAPIDS.

I mean, yeah, basically, if you use RAPIDS. In fact, if you look at RAPIDS, NVIDIA, fast.ai, hopefully, yeah, here we go. So Even, who a lot of you will remember from the forums, is at NVIDIA now, which is pretty awesome, and he's working on the RAPIDS team, and he recently posted how he got a 15x acceleration and got in the top 20 of a very popular competition on Kaggle by combining RAPIDS, PyTorch and fast.ai.

So that would be a good place to start, but hopefully by the time fast.ai version 2 comes out this would be super simple, and we're working with Even on making that happen. So thank you, Even and NVIDIA for that help. All right. So let's start at tabular. So basically, the idea of tabular is that we have a URL.

Oh, yes. I just googled for NVIDIA AI RAPIDS, fast.ai. And here it is: accelerating deep learning recommendation systems. I'm sure somebody will hopefully add it to the notes as well. Okay. So the basic idea with tabular was I kind of like to have a class which has all the information it needs to do what we want it to do.

So in this case, you know, a data frame doesn't have enough information to actually build models because until you know what the categorical variables are, the continuous variables are and what pre-processing to do and what the dependent variable is, you know, you can't really do much. So that's the basic kind of idea of this design was to have something with that information.

So categorical names, continuous names, y names (normally it's just one dependent variable, so one y name, but you could have more), and processes. So those are basically the four things that we want to put in our tabular. So -- oh, that was crazy. There we go. All right. So that's why we're passing in those four things.

And then the other thing we need to know about y is whether y is categorical or continuous, so you can just pass that in as a boolean. So Aman wants to know why we might need more than one y name. A few examples: you could be doing a regression problem where you're trying to predict the x and y coordinates of the destination of a taxi ride, or perhaps you're doing a multi-label classification where you can have multiple things being true; those would be a couple of examples.

And maybe it's already one-hot encoded or something in that case. Okay. So that's the theory behind the design of tabular. So when we initialize it, we just pass in that stuff. The processes that we pass in are just going to be transforms, so we can dump them in a pipeline.

And so with this stuff, you'll see that we just keep reusing the same foundational concepts throughout fast.ai version 2, which is a good sign that they're strong foundations. So the processes are things where we want to run a bunch of them, and whether we run each one depends on what type something is, stuff like that.

So it's got all that kind of behavior we want in the pipeline. Unlike TfmdDS, TfmdDL and TfmdList, all of those things apply transformations lazily; on tabular data, we don't generally do that. There are a number of reasons why. The first is that, unlike opening an image or something, it doesn't take a long time to grab a row of data.

So like it's fine to read the whole lot of rows normally, except in some cases of really, really big datasets. The second reason is that most kind of tabular stuff is designed to work quickly on lots of rows at a time. So it's going to be much, much faster if you do it ahead of time.

The third is that most preprocessing is going to be not data augmentation kind of stuff, but more just once off cleaning up labels and things like that. So for all these kinds of reasons, our processing in tabular is generally done ahead of time rather than lazily, but it's still a pipeline of transforms.

Okay, so then we're going to store something called cat_y, which will be our y variable if it's categorical, otherwise None, and vice versa for cont_y. So that's our initializer, and let's take a look at it in use. So basically we create a data frame containing two columns, and from that we can create a tabular object passing in that data frame and saying our cat names, con names, y names, whatever.

So in this case we'll just have cat names. One thing that's always a good idea is to make sure that things pickle okay because for inference and stuff all this metadata has to be pickleable. So dumping something to a string and then loading that string up again is always a good idea to make sure it works and make sure it's the same as what you started with.
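That check is just a pickle round trip, something like this (a sketch, assuming a processed tabular object to as above):

```python
import pickle

# All the metadata and data should survive a pickle round trip,
# since saving the object for inference relies on that.
to2 = pickle.loads(pickle.dumps(to))
assert to2.items.equals(to.items)                   # same underlying DataFrame
assert list(to2.cat_names) == list(to.cat_names)    # same metadata
```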

So we're inheriting from CollBase, and CollBase is just a small little thing in core which defines the basic things you would expect to have in a collection, and it implements them by composition. So you pass in some list or whatever, and the length of your CollBase will be the length of that list.

So you can often just inherit from something to do this, but composition can often give you some more flexibility. So this basically gives you the quick version: __getitem__ is defined by calling self.items's __getitem__, length is defined by calling self.items's length, and so forth.

So by inheriting from CollBase we can then in the init simply say super().__init__(df), and that means we're now going to have something called self.items, which is going to be that data frame. And so here we check that the pickled version of items is the same as the tabular object's version of items.
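As a rough sketch of that composition idea (not the actual fastcore source):

```python
# The collection behaviour is all delegated to self.items.
class MyCollBase:
    def __init__(self, items):   self.items = items
    def __len__(self):           return len(self.items)
    def __getitem__(self, k):    return self.items[k]
    def __setitem__(self, k, v): self.items[k] = v

class MyTabular(MyCollBase):
    def __init__(self, df, cat_names=None, cont_names=None, y_names=None):
        super().__init__(df)   # the DataFrame becomes self.items
        self.cat_names, self.cont_names, self.y_names = cat_names, cont_names, y_names
```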

Okay, so let me just run this. Okay, so the next thing to notice is that there are various useful little attributes, like all_cols, all_cats for all categorical columns, all_conts for all continuous columns, and then there's also the same with names at the end: all_cont_names, all_cat_names, all_col_names.

So all_cont_names is just the continuous names plus the continuous y if there is one. This would not work except for the fact that we use the capital L's plus: if you add a None to a capital L it doesn't change it, which is exactly what we want most of the time.

So some of the things in capital L are a bit more convenient than a plain list. I think that's right, anyway. Maybe I should double check before I say things that might not be true. Yeah, so with a plain list you can't do that, whereas with an L you can do that and get exactly the expected behavior.

And Ls, by the way, always show you how big they are before they print out their contents. You'll see that all_cols does not actually appear anywhere here, and that's because we just created a little helper that adds each of those properties: cats, conts and cols, plus the all_ versions.

And so for each one it creates a read version of the property, which just grabs the relevant cat_names, cont_names, etc. (which are all defined in here) and then indexes into our data frame with that list of columns. And then it creates a setter, which simply sets those columns to whatever value you provide.

So that's just a quick way to create the setter and getter versions of all of those. So that's where all_cols comes from. So in this case, because the tabular object only has this one column mentioned as being part of what we're modeling, even though the data frame had an a and a b, the tabular object's all_cols only has the a column in it, because all_cols means all of the columns we're using in modeling: continuous, categorical and dependent variables.
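A sketch of that property-generating idea (mirroring the mechanism, not the actual fastai helper; MiniTabular and _add_prop here are made up for illustration):

```python
import pandas as pd

class MiniTabular:
    def __init__(self, df, cat_names=(), cont_names=(), y_names=()):
        self.items = df
        self.cat_names, self.cont_names, self.y_names = map(list, (cat_names, cont_names, y_names))
    @property
    def all_col_names(self): return self.cat_names + self.cont_names + self.y_names

def _add_prop(cls, name):
    # getter indexes the DataFrame with the relevant column names; setter writes back
    def _get(self):    return self.items[getattr(self, f'{name}_names')]
    def _set(self, v): self.items[getattr(self, f'{name}_names')] = v
    setattr(cls, name, property(_get, _set))

for n in ('cat', 'cont', 'all_col'): _add_prop(MiniTabular, n)

t = MiniTabular(pd.DataFrame({'a': [0, 1], 'b': [1., 2.]}), cat_names=['a'])
print(t.all_col)   # only column 'a' appears: 'b' isn't used for modelling
```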

And so one of the nice things is that, because everything is super consistent in the API, we can now just say .show, just like everything else, and in this case we see the all_cols data frame. Okay, so then processes, TabularProcs, are just transforms.

Now specifically they're in-place transforms, but don't let that bother you, because an in-place transform is simply a transform where, when we call it, we return the original thing. So all the transforms we've seen so far return something different to what you passed in, and that's the whole point of them; but the whole point of processes is that they change the actual stored data, so that's why they just return whatever you started with.

So that's all in-place transform means. And a TabularProc is just a transform that returns the thing you call it on, and when you set it up, it does the normal setup but it also calls dunder call, in other words self(...). Why does it do that? Well, let's take a look at an example, Categorify.

So Categorify is a TabularProc where the setup is going to create a CategoryMap, which is just a mapping between ints and the items of the vocab; that's all a CategoryMap is. And so it's going to go through all of the categorical columns, it's going to .iloc into the data frame for each of those columns, and it's going to create a CategoryMap for that column.

So this is just creating, so self.classes then is going to be a dictionary that goes from the column names to the vocab for that categorical column. So that's what setup does, right? So setup kind of sets up the metadata, it's the vocab. Encodes, on the other hand, is the thing that actually takes a categorical column and converts it into ints using the vocab that we created earlier.

And so we need them to be two separate things because, if you think about inference, at inference time you don't want to run setup; at inference time you just want to run encodes. But at training time you want to do both, right? And any time you do a setup, you're definitely also going to want to process.

So that's why setup, after it sets up, immediately does the encoding, because in practice that's always what you want. So that's why we override setup in TabularProc. That's all TabularProc is: it's just a transform that, when you set it up, also calls itself straight away. Okay, so Categorify is pretty similar to the Categorize transform we've seen for dependent variables in image classification, but it's a little bit different because it's for tabular objects.
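Before the example, here's a sketch of that TabularProc pattern (just the shape of the idea, not the fastai source):

```python
# A proc mutates the tabular object in place, and setting it up immediately
# applies it, because at training time you always process right after fitting state.
class MyProc:
    def setup(self, to):
        self.setups(to)      # fit whatever state is needed (e.g. build a vocab) ...
        return self(to)      # ... then immediately apply it
    def __call__(self, to):
        self.encodes(to)
        return to            # "in place": return the same object we were given
    def setups(self, to):  pass   # overridden by real procs
    def encodes(self, to): pass   # overridden by real procs
```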

So you can see an example. Here's a data frame with 0, 1, 2, 0, 2 in just a single column. So again we can create a tabular object, passing in the data frame, passing in any processes we want to run, and then the first thing we pass in after that will be the categorical names, so we're just going to have one.

So once you have created your tabular object, the next thing you want to do is to call setup, and remember that setup's going to do two things: it's going to call your setups and it's going to call your encodes. Are there any object detectors? We haven't done any models yet, David.

We're only really working through the transformation pipeline, but we have certainly looked at version 2 object detection in previous lessons (well, not in detail). We've touched on the places where the object detection data is defined, and yeah, hopefully it's clear enough that you can figure that out.

Yeah, so patch_property does not call... I mean, you can check the code easily enough yourself, right? So patch_property, let's see... okay, so no, it doesn't; it calls patch_to... no, it doesn't. So there's no way to create a setup using that decorator yet. If you can think of a good syntax for that that you think would work well, feel free to suggest it or even implement it in a PR.

Okay, so, all right, so we've created our tabular object. We told it to use Categorify, and we said this is our only categorical variable. Then we call setup, which is going to go ahead and, let's have a look: setup is going to, in our processor, create classes, and then it's going to change our items by calling apply cats, which will map the data in that column using this dictionary.

So map is an interesting pandas method. You can pass it a function, which is going to be super slow, so don't do that, but you can also pass it a dictionary, and that will just map from keys to values in the dictionary. So if we have a look at this case, we can see that to.a starts out as 0, 1, 2, 0, 2 and ends up as 1, 2, 3, 1, 3, and the reason for that is that the vocab that it created is #na#, 0, 1, 2.

So whenever we create the vocab for a categorified column, we always put a #na# at the start, which is similar to what we've done in version 1. So that way, if in the future you get a value outside of your vocab, then that's what we're going to set it to.

We're going to set it to #na#, and an encoded 1, if we index it into the vocab, maps back to 0; that's why this 0 became 1, for example. One of the things that I recently added to Pipeline (because remember that to.procs is a pipeline, right?) is a __getattr__. Categorify is not an attribute of Pipeline, not surprisingly, so if it finds an attribute it doesn't understand, it will try and find that attribute in any of the transforms inside that pipeline, which is exactly what we want, you know?

So in this case, it's going to look for a transform whose type is Categorify, converted to snake case. This is very similar, if you've watched the most recent part 2 videos, to what we did for callbacks: callbacks got automatically added as attributes, and I think version 1 does this too.

So pipeline does something very similar, but it doesn't actually add them as attributes; it uses __getattr__ to do the same thing. So in this case we added a Categorify transform; we haven't instantiated it, we just passed in the type, so it's going to instantiate it for us. Pipeline will always instantiate your types for you if you don't instantiate them. And so later on we want to say, okay, let's find out what the vocab was, which means we need to grab the procs out of our tabular object and ask for the categorify transform.

So now that we've got that categorify transform: Categorify defines __getitem__, and it will return the vocab for that column. So here is the vocab for that column, and we can have a look at cat['a']; okay, as you can see, there it is, and that is actually a CategoryMap, so as well as having the items that we just saw, it also has the reverse mapping, which, as you can see, goes in the opposite direction.

So to answer Aman's question: yes, this should take care of the mapping in the test set. When it calls apply cats, it's going to look at o2i: it's going to find that column in your classes, grab o2i, which is this dictionary, and use that to map everything in the column. But because it's a defaultdict, anything it doesn't recognize will become zero, which is the na category.
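In plain pandas, that mapping behaviour amounts to something like this (a sketch of the idea, not the actual fastai code):

```python
from collections import defaultdict
import pandas as pd

# Vocab with '#na#' first, a reverse mapping that defaults to 0 for unseen
# values, and .map with a dict (fast) rather than a function (slow).
col   = pd.Series([0, 1, 2, 0, 2])
vocab = ['#na#'] + sorted(col.unique())                          # ['#na#', 0, 1, 2]
o2i   = defaultdict(int, {v: i for i, v in enumerate(vocab)})    # unknown key -> 0
print(col.map(o2i).tolist())                          # [1, 2, 3, 1, 3]
print(pd.Series([1, 0, 3, -1, 2]).map(o2i).tolist())  # [2, 1, 0, 0, 3]: 3 and -1 fall into #na#
```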

So yeah, that should all work nicely. You have to figure out how to model it, of course, but the data processing will handle it for you. Okay, so what else? So now, imagine it's inference time, and so we come along with some new data frame that we want to run inference on, or it's a test set or whatever, so here's our data frame.

So we now have to say, okay, I want to create a new tabular object for the same metadata that we had before, so the same processes, the same vocab, the same categorical continuous variables. The way to do that is to start with an existing tabular object which has that metadata and call new and pass in a new data frame that you have, and that's going to give you a new tabular object with the same metadata and processes and stuff that we had before, but with this different data in it.

Now of course, we don't want to call setup on that because setup would replace the vocab, and the whole point is that we want to actually use the same vocab for inference on this new data set. So instead you call process, and so all process does in tabular is it just calls the processes, which is a pipeline, so you can just treat it as a function.

So in this case, our vocab was #na#, 0, 1, 2, and here there are a couple of things that are not in that list of 0, 1, 2; specifically, this one is 3 and this one is -1. So the 1, 0 and 2 will get replaced by their indexes into the vocab, 2, 1 and 3, so there's our new a column, and then, as we just discussed, the two things that don't map are going to be 0.

So then if you call decode on our processor, then as you would expect you end up with the same data you started with, but of course these are now going to be #na#, because we said we don't know what those are. So decoding in general in fastai doesn't mean you always get back exactly what you started with, right?

It's trying to display the transformed version of the data. In some cases, like normalization, it should pretty much be exactly what you started with, but for some things, like Categorify with missing or unknown values, it won't be exactly what you started with. You don't have to pass in just a type name; you can instantiate a processor yourself and then pass that in, and that means you don't have to dig it out again like this, so sometimes that's more convenient. So this is just another way of doing the same thing.

But in this case we're also going to split the training set and the validation set, and this is particularly important for things like categorify, because if our training set is the first three elements and our validation set is the last two, then this thing here three is not in the training set, and so therefore it should not be part of the vocab.

So let's make sure that that's the case. So here we are, categorical variable, yep it doesn't have three in it, the vocab doesn't have three in it. So the way we pass in these split indexes is by calling tabular object dot data source, and that converts the tabular object to a data source, the only thing you pass it is the list of splits.

And so that gives you a standard data source object, just like the one that we saw in our last walkthrough. So that's what you get, and that data source (let's check it out) will have a train, for example, and a valid. And those are just other ways of saying subset zero and subset one.
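With the released fastai v2 API the splits go straight into the constructor rather than into a separate datasource() call, but the effect is the same: the procs get fitted on the training split only, and you get train and valid subsets (a sketch; the dev API in this walkthrough differs slightly):

```python
import pandas as pd
from fastai.tabular.all import TabularPandas, Categorify

df_small = pd.DataFrame({'a': [0, 1, 2, 3, 2]})
to_small = TabularPandas(df_small, procs=[Categorify], cat_names=['a'],
                         splits=[[0, 1, 2], [3, 4]])    # train on rows 0-2, validate on 3-4
print(to_small.train.items)   # encoded with a vocab built from 0, 1, 2 only
print(to_small.valid.items)   # the 3 was never seen, so it encodes to 0 (the #na# slot)
```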

Should the test set be 3, 2? No: these are the indexes of the things in the training set, and these are the indexes in the validation set, so indexes three and four are in the validation set. Yes, they're row indexes. So, data source. In terms of looking at the code of Tabular, it's super tiny, which is nice; the only things that are more than one line in it are the constructor and data source.

And the only reason data source is more than a couple of lines is because in RAPIDS, on the GPU, trying to index into a data frame with arbitrary indexes is really, really, really slow. So you have to pass in a contiguous list of indexes to make RAPIDS fast.

So what we do is, when you pass in splits, we actually concatenate all of those splits together into a single list and we index into the data frame with that list. So that's going to reorder the data so that all the stuff that's in the same validation or training set is together.

And so that way, now when we create our data source rather than passing in the actual splits, we just pass in a range of all the numbers from nought to the length of the first split, and then all of the numbers from the length of the first split to the length of the whole thing.
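That trick looks roughly like this (a sketch of the idea in plain pandas, not the fastai source):

```python
import pandas as pd

# Concatenate the splits, reorder the DataFrame once, then replace the splits
# with plain ranges so that every later lookup uses contiguous indexes
# (which is what makes RAPIDS/cuDF fast).
def make_contiguous(df, splits):
    order = [i for split in splits for i in split]       # all split indexes, in order
    df = df.iloc[order].reset_index(drop=True)           # one (possibly slow) gather
    new_splits, start = [], 0
    for s in splits:
        new_splits.append(list(range(start, start + len(s))))
        start += len(s)
    return df, new_splits

df, splits = make_contiguous(pd.DataFrame({'a': range(5)}), [[4, 0, 2], [1, 3]])
print(splits)   # [[0, 1, 2], [3, 4]]
```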

And so our data source is then able to always use contiguous indexes. So that's why that bit of code is there. Other than that, one thing I don't like about Python is that any time you want to create a property, you have to put it on another row like this, like in a lot of programming languages you can kind of do that kind of thing on the same line but not in Python for some reason.

And so I find it takes up a lot of room just for the process of saying these are properties. So I just added an alternative syntax which is to create a list of all the things that are properties. So that's all that is. Like most of these things, it's super tiny.

So that's just one line of code; it just goes through and calls property on them to make them properties. Oh, okay. So then another thing about this is that I tried to make Tabular look a lot like a DataFrame. And one way that happens is we've inherited from GetAttr, which means that any unknown attribute is going to get passed down to whatever the default property is, which is self.items, which is a DataFrame.

So in other words, it behaves a lot like a DataFrame, because anything unknown it will actually pass along to the DataFrame. But one thing I did want to change is that in DataFrames it's not convenient to index into a row by row number and a column by name. You can use iloc to get a row by number and a column by number.

You can use loc to say row by name or index and column by name or index. But most of the time I want to use row numbers and column names. So we redefine iloc to use this tabular iloc indexer, which is here. And as you can see, if you have a row and a column, then the columns I actually replace with the integer index of the column.

So that way we can use column names and row numbers. And it will also wrap the result back up in a tabular object, so if you index into a tabular object with iloc you'll get back a tabular object. So then, the way Categorify is implemented, as you see in encodes, is it calls transform, passing in a bunch of column names and a function.

The function apply cats is the thing we saw before, which is the thing that calls map, unless you have a pandas categorical column, in which case pandas has already done the encoding for you, so you just return it, which is cat.codes + 1. And so how does this function get applied to each of these columns?

That's because we have a transform method on the tabular object, and that's the thing that at the moment is defined explicitly for pandas. And as you can see, for pandas it just is: this column equals the transformed version of this column, because pandas has a .transform method for Series. So that's all we needed to do there.
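In other words, the column-wise apply boils down to something like this (a sketch of the idea in plain pandas, not the fastai source):

```python
import pandas as pd

# "Transform these columns with f" is just df[col] = df[col].transform(f)
# for each column, writing the result back in place.
def transform_cols(df, cols, f):
    for c in cols:
        df[c] = df[c].transform(f)
    return df

df = pd.DataFrame({'a': [1., 2., 3.], 'b': [10., 20., 30.]})
transform_cols(df, ['a'], lambda x: x * 2)   # only column 'a' is doubled
print(df)
```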

Okay, so Juvian, your question there about numerical versus continuous: you should watch the Introduction to Machine Learning for Coders course, where we talk about that in a lot of detail; that's not something we really have time to cover here. All right, so that's Categorify. So here's the other way to use Categorify, as I was kind of beginning to mention: you can actually create a categorical column in pandas.

And one of the reasons to do that is so that you can say these are the categories I want to use and they have an order and so that way high, medium, low will now be ordered correctly which is super useful. Also pandas is just nice and efficient at dealing with categories.

So now if we go Categorify just like before, it's going to give us exactly the same kind of results except that when we look at the categorical processor, these will be in the right order. So we're going to end up with things that have been mapped in that way and it will also be done potentially more efficiently because it's using the internal pandas cat code stuff.

Thank you David, that is very kind. I'm so thrilled. I would love to know where you are working as a computer vision data scientist; it's a very cool job. Andrew is starting his job as the in-house research director and data scientist here in 10 days' time. And working very, very hard on being amazing also helps.

Lots of people do the course and don't end up getting jobs because they don't work as hard. Although I think everybody listening to this is by definition going to an extra level of effort, so I'm sure you will all do great. Normalize is, you know, just something where we're going to subtract the means and divide by the standard deviations, and for decodes we're going to do the opposite.

And you'll see what we generally do in these setup things (we do this in lots of places) is we say getattr(dsrc, 'train', dsrc). What this means is it's the same as if we'd written: df = dsrc.train if hasattr(dsrc, 'train') else dsrc.

It's the same as writing that, right? And the reason we're doing that is because we want you to be able to pass in either an actual data source object, which has a train and a valid, or not. If you aren't doing separate things to train and valid then that should be fine as well, right?

And so, as long as the thing you pass in, if it does have a train, gives you back some kind of object that has the right methods that you need. So in the case of data source and tabular it will: data source has a train attribute that will return a tabular object; or, if you just pass in a tabular object directly, then it won't have a train, so dsrc will just be the tabular object itself, which has a conts attribute for the continuous variables.
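That idiom is just Python's three-argument getattr; a tiny standalone sketch:

```python
# Work on the training subset if the object exposes one, otherwise on the whole thing.
class WithTrain:
    train = 'just the training rows'

class Plain:
    pass

for obj in (WithTrain(), Plain()):
    data = getattr(obj, 'train', obj)   # obj.train if hasattr(obj, 'train') else obj
    print(type(obj).__name__, '->', data)
```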

Okay, so now we have this data frame containing just the continuous variables, and optionally just for our training set if we have one, so then we can set up our means and standard deviations; that's the metadata we need. So here we can create our Normalize, create a data frame to test it out on, and pass in that processor (this time we just have to say these are the continuous variables, this is the one), and make sure we call setup. And so now we should find: here is the same data that we had before, but let's make it into an array for testing and calculate its mean and standard deviation. So we should find, if we go norm.means['a']... so self.means was df.mean(), and this is quite nice, right: in pandas, if you call .mean on a data frame, you get back (I think) a Series object, which you can index into with column names.

So this is quite neat, right, that we were able to get all the means and standard deviations at once for all the columns, and even apply them to all the columns at once. This is kind of the magic of pandas indexing, so I think that's actually pretty nice.

So yeah, make sure that the mean is m, the standard deviation should be around s, and the values after processing should be around (x - m) / s.
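In plain pandas/numpy, that check amounts to something like this (a sketch; whether the std uses ddof=0 here is an assumption):

```python
import numpy as np, pandas as pd

# Normalize with the fitted statistics, then decode back and compare.
df = pd.DataFrame({'a': [0., 1., 2., 3., 4.]})
m, s = df['a'].mean(), df['a'].std(ddof=0)   # population std: an assumption
enc = (df['a'] - m) / s                      # what encodes does
dec = enc * s + m                            # what decodes does
assert np.allclose(dec, df['a'])
```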

One thing to notice is that we didn't call setup here. Why not? The reason is that if you look at data source, it calls setup, because at that point we definitely have all the information we need to set it up: we know the data, because that was the data frame you passed in, and now we know what the training set is, because that was passed in too. So there's no reason to ask you to manually call setup.

So you've got two ways to set up your processes: one is to call setup on your tabular object, and the other is just to create a data source. And it's something you have to be aware of, because calling data source is not just returning a data source: it is also modifying your tabular object's data to process it. So it's a very non-pure, non-functional kind of approach going on here.

We're not copying things and returning them; we're changing things in place. And the reason for that is that with tabular data you don't want to be creating lots of copies of it; you really want to be doing stuff in place, because there are important performance issues. So we try to do things just once, and do them where they are.

So for Normalize, in this case we're calling setup. And again, for inference: here's some new data set we want to run inference on; we go tabular object .new on the new data frame, and we process it. We don't call setup, because we don't want to create a new mean and standard deviation; we want to use the same standard deviation and mean that we used for our training.

And then here's the version where we instead use a data source, and you'll find that the mean and standard deviation are now computed from 0, 1, 2, because that's the only stuff in the training set; and again, normalization and stuff should only be done with the training set.

So all of this makes it much harder to screw things up in terms of modeling, to accidentally do things on the whole data set and get leakage and stuff like that, because we try to automatically do the right thing for you.

Okay, so then FillMissing is going to go through each continuous column and see if there are any missing values in the column. And if there are missing values, then it's going to create an na dictionary: if there are any NAs, or any missing values, or any nulls (the same idea in pandas), then that column will appear as a column name in this dictionary, and the value for it will depend on what fill strategy you ask for.

So FillStrategy is a class that contains three different methods, and you can say which of those methods you want to use: do you want to fill things with the median, or with some constant, or with the mode? And we assume by default that it's the median.

So this here is actually going to call FillStrategy.median, passing in the column, and it's going to return the median. So that's the dictionary we create. So then later on, when encodes is called, we actually need to go through and do two things. The first thing is to use the pandas fillna to fill missing values with whatever value we put into the dictionary for that column, and again, we do it in place.

Then the second thing is, if you've asked it to add an extra column to say which ones were filled in (which by default is true), then we're going to add a column with the same name with _na at the end, which is going to be a boolean: true if that value was originally missing, and false otherwise.

So here you can see we're creating three different procs: a FillMissing processor with each of the possible strategies. And then we create a data frame with a missing value, and then we just go through and create three tabular objects with those three different processes, and make sure that the na dict for our a column has the appropriate median or constant or mode, as requested.

And then, remember, setup also processes, so then we can go through and make sure that the values have been replaced correctly. And also make sure that the tabular object now has a new categorical name, which in this case is a_na. So it's not enough just to add it to the data frame; it also has to be added to cat names in the tabular object, because this is a categorical column we want to use for modeling.
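A sketch of what that fill-missing behaviour amounts to in plain pandas (not the fastai source; fill_missing here is a made-up helper):

```python
import numpy as np, pandas as pd

# Record a fill value per column that has missing data, fill it in place,
# and add a boolean <col>_na column marking which rows were originally missing.
def fill_missing(df, cont_names, strategy='median'):
    na_dict = {}
    for c in cont_names:
        if df[c].isna().any():
            na_dict[c] = {'median':   df[c].median(),
                          'constant': 0,
                          'mode':     df[c].mode()[0]}[strategy]
            df[f'{c}_na'] = df[c].isna()
            df[c] = df[c].fillna(na_dict[c])
    return na_dict

df = pd.DataFrame({'a': [0., 1., np.nan, 1., 2.]})
print(fill_missing(df, ['a']))   # {'a': 1.0}
print(df)                        # filled column plus the new a_na column
```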

So Madhavan asks, shouldn't setups be called in the constructor? And no, it shouldn't. setups is what transforms call when you call setup, using the type dispatch stuff we talked about in the transforms walkthrough. And setup is something which should be called automatically only when we have enough information to know what to set up with, and that information is only available once you've told us what your training set is. So that's why it's called by data source and not by the constructor; but if you're not going to use a data source, then you can call it yourself.

Okay. Great. So this section is mainly a few more examples of putting that all together. All right. So here's a bunch of processes: Normalize, Categorify, FillMissing, and one that does nothing at all; obviously you don't need that one, it's just to show you. And here's a data frame with a couple of columns.

A is the categorical, B is the continuous, because remember that was the order that we use. It would probably be better if we actually wrote those here, at least the first time, so you didn't have to remember. There we go. So we call setup, because we're not using a data source on this one.

And so, as you'll have noticed, the processes explicitly work only on the columns of the right type: Normalize works just on the continuous columns, and Categorify goes through the categorical columns. You might have noticed that was all cat names, and that's because you also want to categorify categorical dependent variables, but we don't normalize continuous dependent variables.

Normally for that you'll do like a sigmoid in the model or something like that. So yeah so you can throw them all in there and it'll do the right thing for the right columns automatically. So it just goes through and makes sure that that all works fine. So these are really just a bunch of tests and examples.

Okay, so the last section. So now we have a tabular object which has got some cats and some conts and a dependent variable y. If we want to use this for modeling, we need tensors. We actually need three tensors: one tensor for the continuous, one for the categorical and one for the dependent.

And the reason for that is that the continuous and the categorical are of different data types, so we can't put them all in the same tensor, because tensors have to be all of the same data type. So if you look at the version one tabular stuff, it's the same thing: we have those three different tensors.

So now we create one normal transform, a lazy transform that's applied as we're getting our batches, and all we do is we say, okay, this is the tabular object which we're going to be transforming; actually, in encodes we don't really need that state at all.

For encodes, we're just going to grab all of the categorical variables, turn them into a tensor, and make it a long. And then we'll grab all the continuous variables, turn them into a tensor, and make it a float. And so then the first element of our tuple is itself a tuple with those two things, so that's our independent variables, and then our dependent variable is the target turned into a long.

This is actually a mistake: it shouldn't always turn it into a long. It should only turn it into a long if it's a continuous, sorry, categorical y; otherwise it should be continuous, I think. No, let's wait until we get to modeling, I can't quite remember; it's if it's categorical that we're going to want to encode it.

No, we're going to use it as... yeah, that's right: it's a long if it's categorical, but for continuous it has to be a float, that's right. So we need to use float for a continuous target. Okay, so that's a little mistake; we haven't done any tabular regression yet in version two.
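A sketch of what that encodes amounts to (plain pandas and PyTorch, not the fastai source; read_tab_batch here is a made-up stand-in):

```python
import pandas as pd
import torch

# Categorical columns -> one long tensor, continuous columns -> one float tensor,
# target -> long for classification or float for regression.
def read_tab_batch(items, cat_names, cont_names, y_name, y_is_cat=True):
    cats  = torch.tensor(items[cat_names].values).long()
    conts = torch.tensor(items[cont_names].values).float()
    targ  = torch.tensor(items[y_name].values)
    targ  = targ.long() if y_is_cat else targ.float()
    return (cats, conts), targ

df = pd.DataFrame({'c': [1, 2, 1], 'x': [0.1, -0.3, 1.2], 'y': [0, 1, 0]})
(xc, xf), yb = read_tab_batch(df, ['c'], ['x'], 'y')
print(xc.dtype, xf.dtype, yb.dtype)   # torch.int64 torch.float32 torch.int64
```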

So that's all encodes is going to do, so we'll come back to decodes later, right. So in our example here, we grabbed our path to the adult sample, we read the CSV, we split it into a test set and the main bit, we made a list of our categorical and continuous variables, a list of the processes we wanted to use, and the indexes of the splits that we wanted. So then we can create that Tabular as we discussed, and we can turn it into a data source with the splits. Now, you'll see it never mentioned ReadTabBatch, and the reason for that is that we don't want to force you to do things that we can do for you. So if you just say give me a tabular data loader rather than a normal data loader, a tabular data loader is a transformed data loader where we know that, for any after_batch that you ask for, we have to also add in ReadTabBatch. So that's how that gets automatically added to the transforms for you.

The other thing about the tabular data loader is that we want to do everything a batch at a time; particularly for the RAPIDS-on-GPU stuff, we don't want to pull out individual rows and then collate them later, so everything's done by grabbing a whole batch at a time. So we replace do_item, which is the thing that normally grabs a single item for collation, with a no-op, and then we replace create_batch, which is the thing that normally collates things, to say: don't collate things, but instead actually grab all of the samples directly from the tabular object using iloc. So if you look at that blog post I mentioned, from Even at NVIDIA, about how they got the 16x speed-up by using RAPIDS, a key piece of that was that they wrote their own version of this kind of stuff to do everything a batch at a time, and this is one of the key reasons we replaced the PyTorch data loader: to make this kind of thing super easy. So as you can see, creating a batch-at-a-time data loader is seven lines of code, super nice and easy. So yeah, I was pretty excited when this came together so quickly.

Okay, so that's what happens when we create the tabular data loader. We could of course also create a data bunch; we should probably add this to the example. And yeah, that's basically it. So then at inference time, as we discussed, you can now do the same .new trick we saw before, and then .process, and then you can grab whatever you need; here's all_cols, which is going to give us a data frame with all the modeling columns. Show batch will be the decoded version, but this is not the decoded version, this is the encoded version that you can pass to your modeling.

All right, any questions, Andrew? No? Okay, cool. All right, thanks, so that is it. And it's Friday, right? Yeah, so I think we're on for Monday; I'll double check and I'll let you all know right away. I will see you later. Bye.