Back to Index

fastai v2 walk-thru #9


Chapters

0:00
8:18 creating a tabular without passing in splits
10:27 attaching VS Code to a remote terminal
46:50 handle multiple indices
54:36 using an array of masks

Transcript

Hi, can you all see me and hear me okay? Great. What does 5x5 mean, Fred? And does anybody have any requests for stuff they would like to see today? If so, feel free to ask. Oh, good. I think 5x5 is referring to the weightlifting I do. Five sets of five reps.

Okay. So in the absence of requests, I will show you something that's changed. We've renamed things a little bit, as you can see. Transform now has some more stuff in it; specifically, it has Pipeline -- Pipeline has been moved into transform. Transform and Pipeline are not specifically to do with data; they're just ways of doing functions and dispatch, basically.

And then data.core -- no, we don't have a tentative release date -- data.core contains TfmdDL (the transformed data loader), DataBunch, TfmdList, and DataSource. So there's no TfmdDS anymore. And then 06 is a new module called data.transforms, which is where some standard transforms live, basically -- and not just standard transforms, but also stuff you would use in standard transforms, like get_files and the splitting functions and stuff like that.

So those are those three things. If you're interested, I can tell you a bit about what happened with DataSource and TfmdDS, because it's kind of an interesting design question, and I'm not sure I have a simple rule of thumb for it. But basically, we like to have layers where each thing does one thing, and does it separately from other things.

But if you have too many layers, then debugging gets confusing, and so I find my approach to designing is extremely iterative. In fact, it's entirely iterative -- I don't really design much upfront at all. And I found that we were getting weird bugs in DataSource, and weird bugs mean that something's not clear enough, or that the things you think are in your head aren't really in your head the way you thought they were.

And I realized that a DataSource without any filters -- without any subsets, basically -- was the same as a TfmdDS. And that made me think having two separate classes for those things kind of seemed weird, and I wondered: if we put them all in the same class, what would it look like?

And as you can see, DataSource, doing the job of both DataSource and TfmdDS now, is, if anything, shorter than the DataSource that was inheriting, and it's ended up clearer, which is interesting. The only thing it inherits from is something called FilteredBase, which is super tiny. It's basically just something where you have to define subset, and then it's going to define train and valid properties for each of your two subsets.

And the other thing it does is it adds a databunch method, which will create a DataBunch containing a default TfmdDL for each of your subsets. And one of the nice things is that means that TfmdList can also inherit from FilteredBase, which means that you can create a DataBunch from a TfmdList, or you can create a training or validation set from a TfmdList.
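
Roughly, the idea being described is something like this -- a sketch based on the description above, not the actual source, with TfmdDL and DataBunch assumed to be the classes from data.core just mentioned:

```python
class FilteredBase:
    "A rough sketch: subclasses define `subset(i)` and get train/valid and databunch for free."
    def subset(self, i): raise NotImplementedError
    @property
    def train(self): return self.subset(0)   # first subset: training set
    @property
    def valid(self): return self.subset(1)   # second subset: validation set
    def databunch(self, **kwargs):
        # build a default TfmdDL per subset and wrap them in a DataBunch
        return DataBunch(TfmdDL(self.train, **kwargs), TfmdDL(self.valid, **kwargs))
```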

So yeah, if you don't need multiple independent pipelines creating a tuple of things, then this might be an easy way to create really simple data sources. So yeah, the tests that were in TfmdDS are still here -- all the same tests are still here -- but now they all say DataSource.

So here's an example of a DataSource test without any splits being applied, so there's no use of train or valid or whatever. And here's one that does have filters applied, so we can check train and valid, as you can see. And then the actual creation of the filters is done in TfmdList.

So that's part of why DataSource is so simple now: DataSource is simply something that contains a TfmdList for each transform pipeline that you pass in. So that's a change. And our code ended up much simpler, it's easier to debug, and the weird bugs we had went away, so that was all good.
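
As a rough sketch of that idea (again, not the actual source; TfmdList is assumed to be in scope):

```python
class DataSource(FilteredBase):
    "Sketch: hold one TfmdList per transform pipeline; indexing returns a tuple across them."
    def __init__(self, items, tfms=None, filts=None):
        # each pipeline of transforms gets its own TfmdList over the same items
        self.tls = [TfmdList(items, t, filts=filts) for t in tfms]
    def __len__(self): return len(self.tls[0])
    def __getitem__(self, i): return tuple(tl[i] for tl in self.tls)
    # subset(i) would return the same thing restricted to filter `i`, which is all FilteredBase needs
```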

Another change, which is less substantive, is tabular. Now that we have this FilteredBase class to subclass, tabular doesn't really need to use DataSource anymore. It can just inherit from FilteredBase, and it will get a train and validation set automatically. So it just has to define subset, because that is the one not-implemented thing that subclasses have to define.

And so now, as you can see, Tabular is actually a bit smaller and simpler. It doesn't have to have a datasource method anymore. If you want to create a train and valid set, you can just pass in splits. So if we have a look, here's an example of creating a tabular without passing in splits, and so it just acts like a normal data-frame type of thing.

And here is a processed one with Categorify, just like before. And then here is one with splits, which -- that's going to be confusing, because we call it splits in one place and filts in the other. We should change that; we'll add a note. OK. How do I navigate code?

I mainly use Vim, so I don't have to too much, because my code's pretty small and self-contained. But if I do need to jump around, I just use Vim's tags functionality. And the other thing is the nb source link -- actually, what's the one that gives it to us? nb_source_link.

There we go. So you can do it this way: you get something you can click on, and it will take you straight to the right spot. That looks like it is -- yep, it is Pipeline. OK, so that's another option. Quite often I just want to see how something is defined, in which case I'll just use the question marks to double-check.

But yeah, you can attach VS Code to a remote terminal easily enough, and so you can always explore it through VS Code or whatever. But yeah, it works fine in Vim. So I could go :tag pip<Tab>, and it will tab-complete to Pipeline, and there is the class, as you can see.

And then if I go to -- oh, I want to know what Transform is, so Ctrl-], and that will take me straight to the definition of Transform, and so forth. Yeah, I guess most editors do the same stuff. And don't forget, in local you've got a full, browsable set of modules.

What kind of weird bugs did I have? Oh, you know how it is -- after you've fixed a bug, you can throw it out of your head. One of the big challenges is around setup. Setup's actually quite tricky. So what we do in Pipeline's setup is we first of all make a copy of our transforms.

We then clear our transforms, and then we go through the copy of the transforms and add them back one at a time. And before adding each one back, we call its setup, and then we add it. If you don't do this -- if you just call setup on all of them after adding them all -- you kind of have this weird thing where all of your transforms are being called even before they're set up.
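
In other words, the pattern is roughly this (a sketch, not the exact Pipeline source):

```python
def setup(self, items=None):
    "Sketch of the pattern: set up each transform before it is added back to the pipeline."
    tfms = list(self.fs)   # copy the transforms
    self.fs.clear()        # empty the existing list *in place* (see below for why in place matters)
    for t in tfms:
        t.setup(items)     # set up the transform first...
        self.fs.append(t)  # ...then add it back, so only set-up transforms ever get called
```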

If you don't do it that way, you kind of have to add checks inside your transforms to see whether they're set up yet or not, and if they're not, do nothing, which is super awkward. And so one of the problems was in the train and valid subsets: they both had their own kind of copy of the same pipeline.

And previously, I wasn't clearing it out in place like this. Instead, I was doing something like self.fs, tfms = [], self.fs. So before, I was doing it like that, which kind of looks like it's doing the same thing, right? It's setting self.fs to be empty, and it's setting tfms to be my previous set, and it looks the same.

But the problem is that if there are other pipelines pointing at the same list of transforms, they're not being emptied out by that, whereas self.fs.clear() does empty them out. So that was an example of a weird bug with the old version: things weren't being set up properly.
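
It's the usual Python aliasing issue; a tiny self-contained illustration:

```python
a = [1, 2, 3]
b = a          # e.g. a second pipeline holding a reference to the same transform list

a = []         # rebinding: only the name `a` changes; b is still [1, 2, 3]

a = b
a.clear()      # in place: the shared list itself is emptied, so b is now [] too
```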

And it was kind of hard to debug, because there were just a few too many layers. OK. So in tabular now, we don't have to call the tabular object's data source method anymore. We can just pass splits -- which I think I'll rename to "filts", or maybe I'll rename them all to "splits"; anyway, we'll make them more consistent.

We can just pass that into our constructor. And the other thing about this is we don't have to call setup anymore: we have all the information we need to set up as soon as we instantiate this, so we just call setup directly in the init. Another example of weird bugs to avoid, again, is the subset functionality.

When we subset, we want to create a new tabular object with a slice -- the split of what we want. But we had to make sure that in new we pass setup equals false, otherwise when you create the subset it's going to rerun setup, which would be annoying. We found that bug because we added some tests and found they weren't passing.

So we always try to think of tests that we can add. So yeah, tabular_rapids you can check out -- it's in notebook 42. It's missing an underscore from the front, so that suggests that -- well, I haven't been working on that; it's been Sylvain's baby -- but that suggests that it should be more or less working.

So you could certainly try it out. It certainly hasn't been much used, though, so it might be a bit buggy still, but hopefully you'll find it's working. I believe it's a lot faster than the pandas one. OK. So those are those changes. Everything else here is basically the same.

Oh, and then the other thing I did is I added databunch. That was nice and easy, because databunch is now in FilteredBase, so we get that for free. Sorry, Marlon, I don't know what you mean by probabilistic inference. OK. So that's that. So maybe we can go back and look at 00 and 01 a little bit.

That'll be fun. And actually, I don't know if you remember, but 00 and 01 aren't quite the start. There are all the ones that start with 9, which are the notebook tooling stuff, which I don't know that we'll bother looking at. But there's also a special one, which is imports.py, and that is not generated by a notebook.

So we actually start with imports.py. That's got all the imports, as you can see. These types here, I think, are only in Python 3.7, so we patch them in if they're missing. And then we have a tiny number of little functions, just for checking equality, or doing nothing, or checking if something's an iterator or a collection.

I think these are probably things we needed in the notebook-tooling notebooks -- that's why they're here. So that one's not created by a notebook. So yeah. Going all the way back to 00, the first thing I wanted to write was something which would test whether a and b could successfully be compared using some comparator.

For example, test whether [1, 2] and [1, 2] are equal. The problem is that this could pass and be wrong -- what if test always returned true? I actually needed a way to test whether it successfully fails. The idea with my tests is that they always throw an exception if they fail, specifically an assertion error.

The reason for that is that if you run a notebook that causes an exception, you'll get a nice stack trace and all that kind of a thing. So it's a good way to show a test failure, in my opinion. So that means I needed to have a way to test for failures.

You can't test for failures by just passing the code in directly like that, because that would actually run the code, it would cause an exception, and that's it -- the exception has already happened. So you always have to put a lambda there so it doesn't actually run it. So the first thing I actually needed to do was create a test_fail function, which will try to call the function.

And if there is an exception, then, if you passed in contains -- which says I want you to make sure that the string of the exception contains something -- we make sure either that you didn't pass that, or that it was in there, and then we return. So if you didn't end up in the exception clause, then you failed --

you didn't get an exception. So that's test_fail. That was kind of step one: something that would allow us to test for failures. And here's something that checks that we actually get a failure. And so then we can test our test with equals and not-equals, for both failing and succeeding cases.
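
A sketch of those two helpers, as described (close to, but not necessarily identical to, the real 00_test code):

```python
def test_fail(f, msg='', contains=''):
    "Check that calling `f` raises an exception, optionally containing the string `contains`."
    try:
        f()
    except Exception as e:
        assert not contains or contains in str(e)
        return
    assert False, f"Expected an exception but none was raised. {msg}"

def test(a, b, cmp, cname=None):
    "Assert that `cmp(a, b)` holds, with a readable message on failure."
    if cname is None: cname = cmp.__name__
    assert cmp(a, b), f"{cname}:\n{a}\n{b}"

import operator
test(1, 1, operator.eq)                      # succeeds
test_fail(lambda: test(1, 2, operator.eq))   # the lambda delays execution, as described above
```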

So all_equal was one of the things that was defined in local.imports, but we can still display it here. And then we can create not-equals. And, yeah, so then we can start using the fact that we have a general-purpose test of a and b with some comparator to start defining things like test_eq, which is the one we normally use for testing that a and b are equal.

And then this is just what's printed if there's a failure -- it'll tell us what the failure was. So equals tries to kind of do the right thing. If either of them has an array-equals method, then we should use that to test for equality -- that's kind of the Python or NumPy protocol for checking for array equality.

If one of them is an ndarray, we can use NumPy. If one of them is a string or a dict or a set, we can just use operator.eq. If one of them is an iterator, we can use all_equal, which, as you can see, checks whether everything in each one is equal.

Otherwise we'll just use operator equality. So we try to make equals work across a variety of types, and that's why you can see test_eq being checked with all kinds of things, like arrays and dictionaries and data frames and series and so forth. So that's the main one we use all the time in our tests.
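
Putting that dispatch together, it looks roughly like this (simplified; the tensor case and some details are omitted):

```python
import itertools, operator
from collections.abc import Iterable
import numpy as np

def all_equal(a, b):
    "Check that corresponding elements of `a` and `b` are equal."
    return all(equals(x, y) for x, y in itertools.zip_longest(a, b))

def equals(a, b):
    "Simplified sketch of the comparison dispatch described above."
    cmp = (np.array_equal if isinstance(a, np.ndarray) or isinstance(b, np.ndarray) else
           operator.eq    if isinstance(a, (str, dict, set)) else
           all_equal      if isinstance(a, Iterable) or isinstance(b, Iterable) else
           operator.eq)
    return cmp(a, b)
```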

Sometimes we use test_eq_type, which tests whether a and b are equal, and also tests whether their types are equal. And if you pass a list or a tuple, then it will also check that the types of all of their contents are equal. Then there's a test for not-equals, and a test that two things are close.

Okay. So that's 00. All right, I'm not going to look at metaclasses just yet. So here is 01, core. Quite often we use patch -- for example, we use it for ls: we have here def ls(self: Path), and it has @patch on it. What that does is, if we say p = Path('.'), you can go p.ls().

So how does that work? Well, remember, a decorator in Python is simply passed its function as an argument. So in this particular case, for @patch on def func, patch will be passed func. And then, in that function, we want to find out what to patch -- we want to patch this parameter's type.

And so to find that parameter's type, we go through all of the annotations and just find the first one. Which means, in some ways -- I mean, it won't tell you if you do something dumb like that; it'll still end up being patched to T3. But that's fine.

I don't always check for every dumb thing you might do -- just as long as the behavior works correctly when used correctly, and the really obvious mistakes are checked for. So that's going to tell us the type we're patching, and then it will patch_to that type with this function.
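
A simplified sketch of what those two decorators do (the real versions also copy metadata with functools, as discussed below):

```python
def patch_to(cls, as_prop=False):
    "Sketch: add the decorated function to `cls`, as a property if `as_prop`."
    def _inner(f):
        setattr(cls, f.__name__, property(f) if as_prop else f)
        return f
    return _inner

def patch(f):
    "Sketch: patch `f` into the class given by its first parameter's annotation."
    cls = next(iter(f.__annotations__.values()))   # the first annotation's type
    return patch_to(cls)(f)

# Usage, as in the walkthrough:
from pathlib import Path

@patch
def ls(self: Path): return list(self.iterdir())

p = Path('.')
p.ls()   # now works
```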

And so here's patch_to, which -- there's really not much to tell you about that. It just goes through and uses the functools stuff to make sure all of the metadata is correct, and it will set, in this class, with this name, the function that we asked for. Which is better, Windows or Ubuntu?

Oh, it's up to you. I use Ubuntu on my server here, as you see, and I use Windows on my computer because I like to draw things a lot when I'm talking, so I like to use something with a stylus. And, yeah, there's a lot I like about Windows on my desktop.

OK, so that's patch. Then we've got a different thing, which is patch_property. And patch_property does the same thing as patch, but it passes as_prop=True, which, as you can see, simply turns the function into a property. Because remember, when you say @property in Python, property is just a decorator, so you can use it as a function.

So here it is being used as a property. So why not use wraps? The -- what was it? Oh, yeah. This is obviously the comment that was telling me it was something about Pipeline. This is basically doing the same thing as functools.update_wrapper, or whatever it's called: it's setting the function, with its name, onto the attribute.

I don't remember anymore. Maybe this is now obsolete -- I added a comment here to remind myself why I did it, but now I don't understand the comment, so I'm not sure. functools.update_wrapper -- let's see what it looks like. So it uses WRAPPER_ASSIGNMENTS as defined, goes through each one, grabs it, and sets it to the value.

So I'm not doing this bit, and I don't remember what that is, but maybe there was some reason why we do that -- yeah, I'm not sure. Maybe we can use it now. OK. So then we have things like delegates -- yeah, sorry, I know you meant wraps, but wraps just calls update_wrapper, so that's all wraps is.

As you can see -- functools.wraps -- yeah, so that's all it is. OK. So delegates we've kind of looked at before. That's the thing that allows us -- you can either call delegates passing in nothing at all, in which case it will delegate your init to your base class's init. So you can see here how I'm testing it, right?

I've added a little thing called test_sig, which checks that the signature string of f is equal to whatever you pass it. So here you can see we've got a foo, and we've got a, b=1, and kwargs, and then kwargs is being delegated to basefoo, which has e and c=2.

And so that's not a kwarg, but that is a kwarg, so it's therefore going to end up as a, b=1, and c=2. So we can see the signature is grabbing that stuff from basefoo. Actually, the other thing we could test -- no, actually, that's not the right place to test it.
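
So the test being described looks roughly like this (names approximate; delegates and test_sig are assumed to be in scope from the notebook):

```python
def basefoo(e, c=2): pass

@delegates(basefoo)
def foo(a, b=1, **kwargs): pass

# **kwargs has been replaced by basefoo's keyword argument c=2 (e has no default, so it isn't copied)
test_sig(foo, '(a, b=1, c=2)')
```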

That's fine; we should get rid of this. This one, use_kwargs, is mainly used by other functions -- we don't normally use it directly -- but it's something where you can basically say: I want you to replace kwargs with y and z. So you can see here I've got a, b=1, kwargs, and then that's it.

These add y and z, and so, as you can see here, it's added y and z. We don't normally use it directly, and you can see it's just grabbing the signature and replacing stuff in the signature. But it is used in that very important funcs_kwargs thing that we use all the time.

That's the thing where we say: these methods, this list of methods, are things that you could pass in as kwargs, and if you do, it will replace the method here. And so, as you can see there, I use use_kwargs to replace the signature with the correct signature. And here you can see I am using functools.update_wrapper, which I could also have done by saying @wraps on the old init -- I guess that would have worked just as well.

I'm trying to remember why this is here, and I now don't. What am I doing with that? Ah, yes. Okay. So we've got funcs_kwargs here, and we said b is in our methods. So if I create something of that type, then b is going to return 2, because that's the method.

But then I can pass something in and say: no, replace b with a method that returns 3 -- and make sure that's what happened. And then, instead of passing in a function or a lambda, you can pass in a method, and if you pass in a method, it's going to get self as well.

So to tell it that something should be a method, you put @method above it. And the way that is done is using this little trick here, which is to replace f with a types.MethodType wrapper -- and that's what's checked here: check to see whether something's a method. Okay, so that's what that does.
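
Roughly, the usage being described (a sketch; funcs_kwargs, method, and test_eq are assumed to be in scope from 01_core):

```python
@funcs_kwargs
class T:
    _methods = ['b']                         # names that may be overridden via kwargs
    def __init__(self, **kwargs): assert not kwargs
    def a(self): return 1
    def b(self): return 2

test_eq(T().b(), 2)                          # default method
test_eq(T(b=lambda: 3).b(), 3)               # plain function/lambda: no `self` is passed

@method
def _f(self): return 4                       # marked as a method, so it receives `self`
test_eq(T(b=_f).b(), 4)
```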

I added this little decorator that uses an external thing called typechecked, which basically does runtime type checking -- it's part of this thing called typeguard. Although, honestly, I haven't actually used it since I added it, so I might remove it, or we might decide to use it more widely.

But basically, what it does is: if you add an annotation and then you try to call it with the wrong type, it'll fail. It's an interesting idea; I haven't found myself wanting it much yet. Okay. What else is there to show you here? add_docs we've seen plenty of times.

So here's an example: we've got some class with some functions, and if we then say add_docs, we can say "these are my docstrings for each function", and I can then just check that it does in fact get those docstrings. Okay. So that's that.
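
For example, something like this (assuming add_docs and test_eq from 01_core are in scope):

```python
class _T:
    def f(self): pass
    def g(self): pass

# attach docstrings to the class's methods after the fact
add_docs(_T, f="docstring for f",
             g="docstring for g")

test_eq(_T.f.__doc__, "docstring for f")
test_eq(_T.g.__doc__, "docstring for g")
```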

And then GetAttr, I guess we've pretty much seen by now. GetAttr is the thing that we inherit from in order to get dunder getattr for free. Specifically, what it's going to do is try to find the unknown attribute in self.default. So here's an example where we set self.default to whatever you pass in.

So we passed in "Hi", so we would expect to be able to do .lower -- that would make a lot more sense if this was capitalized; there we go. And it fails if we try to say .upper, because _xtra is the list of things that we are allowed to delegate.

Although, by default, it will delegate everything. So dir in Python gives you back a list of all of the attributes, so by default we can use anything that's in self.default, as long as it doesn't start with an underscore, because that would be private. And dunder dir is the thing that Python calls when you call dir.

So when you do, say, tab completion, that's how it does tab completion. We then do a custom dir, which is looking at everything in the type, and everything in the object, and anything else that you add manually. So here we check that lower has been added to our dir.
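
The usage being described is roughly this (a sketch; GetAttr, test_eq, and test_fail are assumed in scope, with _xtra and default named as above):

```python
class _C(GetAttr):
    _xtra = ['lower']                         # the allow-list of attributes we delegate
    def __init__(self, o): self.default = o   # unknown attributes are looked up here

c = _C('Hi')
test_eq(c.lower(), 'hi')                      # delegated to str.lower
test_fail(lambda: c.upper())                  # 'upper' is not in _xtra, so it is not delegated
assert 'lower' in dir(c)                      # the custom __dir__ includes delegated attributes
```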

Sometimes you don't want to inherit from GetAttr, but instead you want to do it manually. So you can also define your own dunder getattr and simply return this delegate_attr, which will basically do exactly the same thing, except you don't get the dunder dir thing. One more thing.

Set state. When you override dunder getattr in Python, it often kills pickle. And so we just -- I think we just looked it up on Stack Overflow and found a fix. Pickle will use dunder setstate to restore state when unpickling, basically, and I don't quite remember why, but somehow defining this fixes pickling.

That's why that's there. Okay. So the last one for today is L. This is the main one. So L is a CollBase which also has GetAttr, and it also uses NewChkMeta to make sure that if you pass in an L, it just gives you back what you started with, rather than creating another one.

CollBase is just something which contains -- composes -- some items, and basically everything is just delegated down to them: it delegates down len, getitem, setitem, delitem, repr, and iter. If you don't know what any of these things are, check the Python data model docs. So then L adds a lot of behavior, which is best understood by looking at the tests, I think.
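
A sketch of what CollBase amounts to (close to what's described; the real class may differ in detail):

```python
class CollBase:
    "Base class composing `self.items`, delegating the basic collection dunders to it."
    def __init__(self, items):   self.items = items
    def __len__(self):           return len(self.items)
    def __getitem__(self, k):    return self.items[k]
    def __setitem__(self, k, v): self.items[k] = v
    def __delitem__(self, k):    del(self.items[k])
    def __repr__(self):          return repr(self.items)
    def __iter__(self):          return iter(self.items)
```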

So you can pass in pretty much anything to an L that you could otherwise pass to a normal Python list -- list(range(12)), say -- we try to make it behave as much like a Python list as possible. And if you pass in the same things, in fact, you can see we actually check that that's the same as list(range(12)).

But then we have other nice little things. So we can do .reverse, for example, as you can see. Now, reverse is actually not listed anywhere here, as you can see, and the reason for that is that we inherited from GetAttr, that default is set to self.items, and list has a reverse.

So actually, all we were doing is delegating to list. Okay. We have a dunder setitem, as you can see, so we can set something -- t[3] = 'h'. And then some of the nice stuff that we're adding is being able to, in a more NumPy style, set multiple things to multiple values and retrieve multiple things.

Yeah. So that's some basic functionality in L. You can create an empty one, which should be equal to an empty list, of course; you can append to it just like a list can, and += to it just like a list can. You can add things onto the left of it instead of the right, which a list can't.

You can multiply, just like a list can. Unlike a list, you can negate -- so this is the negation operation: true, false, false becomes false, true, true. So then here's an interesting one: cycle. Cycle simply calls itertools.cycle, so that's a useful thing to know about, basically -- itertools.cycle.

Let's just try itertools.cycle([1, 2, 3]). And then we'll need to just grab the first little bit of that, otherwise it'll be infinitely long, and I don't have an infinite amount of RAM. So islice -- grab the first bit, the first 12. Oh, and then we'll need to listify that so you can see it.
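
In plain Python, that demo is just:

```python
import itertools

# cycle repeats 1, 2, 3 forever; islice grabs the first 12 items; list makes them visible
list(itertools.islice(itertools.cycle([1, 2, 3]), 12))
# [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
```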

Okay. So as you can see, what cycle does is one, two, three, one, two, three, one, two, three -- it'll do that forever -- and then we sliced the first 12. So we can say L([1, 2, 3]).cycle(), for example, and then we can do the same thing: itertools.islice that, and then list that -- oops.

And then slice by how much -- there we go, same thing. All right, so, questions: how do I handle multiple indices? We handle multiple indices by defining getitem. So getitem is going to check whether the index that's passed in is an indexer or not. What's an indexer?

An indexer is something that is either an int, or is something that has an ndim property which is zero. Why is that? Because of this: t = [1, 2, 3]; t[1] -- that 1 is an indexer -- but here's something else that's an indexer: import torch; torch.tensor(1) -- that's an indexer too, okay?

And that's because torch.tensor(1).ndim is zero. But you can't do that, okay? So that's what is_indexer is checking for. So if it is an indexer, then we call underscore get, which, as you can see, checks if it's an indexer, and if it is, it simply tries to find out whether self.items has an iloc.

In this case, it doesn't, so it's just going to give us self.items[i]. But your question is: what happens if it's a list? In that case, we're going to end up over here: we're going to create a new L containing self._get(idx) -- which, in this case, is not an indexer.

So we're going to convert a mask to indexes -- if it's booleans, it'll convert them into indexes. And then it'll check: does it have iloc (which it doesn't here), does it have dunder array (which it doesn't here) -- so then it's going to return a list comprehension. And so that's how come that works.
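
Putting the indexing logic together, a simplified sketch (not the actual L source) looks like this:

```python
def _mask2idxs(mask):
    "Convert a boolean mask into integer indexes."
    return [i for i, m in enumerate(mask) if m]

def getitem_sketch(items, idx):
    "Simplified sketch of L-style indexing: one item for an indexer, many for a list or mask."
    if isinstance(idx, int) or getattr(idx, 'ndim', None) == 0:
        return items[idx]                                     # single indexer (int or 0-dim tensor)
    if len(idx) > 0 and isinstance(idx[0], bool):
        idx = _mask2idxs(idx)                                 # boolean mask -> integer indexes
    return [items[i] for i in idx]                            # multiple indices -> list comprehension

getitem_sketch(['a', 'b', 'c'], 1)                    # 'b'
getitem_sketch(['a', 'b', 'c'], [0, 2])               # ['a', 'c']
getitem_sketch(['a', 'b', 'c'], [True, False, True])  # ['a', 'c']
```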

OK, yeah, so how does None plus an L work? As I mentioned, it's in dunder add, and specifically here you can see we create a new L containing all of the items in a plus b, listified -- and listify(None) is an empty list. So that's why that works. OK, so here you can see we've got an infinite number of ones.

And if we zip that with t, where t is L(range(4)), that should be the same as zipping range(4) with four ones. So that works there. L.range is almost the same as normal range, except it returns an L. Shuffled does what it sounds like -- and we actually have a test_shuffled now, I think, so we can use that instead.

So mapped is basically the same as calling map(f, t), except that there are a few differences. One is that map returns a map object, whereas our mapped actually does the mapping. So t.mapped, as you can see. And you can pass in arguments, as you can see --

also keyword arguments. So we use that quite a lot. OK, so there are tons of things you can construct an L with. You can construct it with a list, you can construct it with another L, you can construct it with a string (in which case it will stay as a string), or with a range.

You can construct it with a generator. Now, this is different from how Python lists work. If I go list of an array of zero, like this, then, as you can see, that gets converted into a list containing zero -- or zero comma one, if your array is zero and one. Whereas L doesn't do that by default: L will create a single item containing the array.

Because most of the time, particularly with tensors, you don't want to unwrap them into a list -- you want to actually put the tensor or the array into the list. Is there any way to know how L is shuffled? Not with shuffled; you would have to use indexes or something for that.

OK. So that's an important difference. If you want the same behavior that list has, then you can pass use_list=True to give you the same behavior as list -- so instead of having an array with zero and one in it, that will actually create two items now, zero and one.

So that does exactly the same thing as list would do, if you say use_list=True. OK. You can pass the match parameter to the constructor to get the same behavior as listify had in version 1, which is basically to say: make this list as long as this list.
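
In code, the behaviours just described look roughly like this (assuming L is imported; exact defaults may have changed since):

```python
import numpy as np

a = np.array([0, 1])
len(L(a))                    # 1 -- the array is kept as a single item
len(list(a))                 # 2 -- plain list() unwraps it into its elements
len(L(a, use_list=True))     # 2 -- same behaviour as list()
L(1, match=[1, 2, 3])        # [1, 1, 1] -- made as long as the `match` collection
```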

That's why that will create one, one, one. Here's the test that confirms that L(t) is t -- note that "is" means they're the same reference, identical objects. OK. And so then you can see some of the methods. So here's checking getitem -- as you can see here, we're using an array of masks instead.

So that's just like NumPy: the mask array has to have the same number of booleans as the length of the list. It has a .unique, as you can see. And this one is basically telling you the reverse mapping -- so it's a mapping from where each value is: where is the three, for example -- it's in location zero, one, two.

Whereas the one, it's in location zero -- so it's a dictionary. So val2idx and unique are kind of the two things you need to create a vocab. We can filter -- this is basically the same as the filter function in Python, but it's going to return an L. Here's mapped.

Mapped dict is kind of handy: it does exactly the same as mapped, but rather than returning a list, it returns a dictionary from the original value in the list to the value of the function. So that's pretty handy. Zipped is basically the same as zipping lists -- as you can see, it returns an L.

One nice thing you can add to zipped, though, is that if the lists are different lengths, you can say cycled=True and it will replicate the shorter one, as you can see -- it'll cycle through it again to make it the same length as the longer one. Otherwise, cycled=False behaves the same way as normal zip.

And then mapped zip basically takes the result of that zipped and puts it into a map. So, for example, if we do mapped zip with multiplication, then it's going to zip one, two, three with two, three, four, and then apply multiplication to each pair, to give us element-wise multiplication.

It won't be fast like NumPy, so don't use this instead of NumPy, but it's quite handy sometimes. zipwith will take this L and zip it with this list, as you can see. And here's the same thing with the map as well -- that's the same thing as before. itemgot is just going to apply -- which one is it?

itemgetter -- oh, it's an operator, of course it is. It applies operator.itemgetter to every item of a list. So our t is (1,0), (2,1), (3,2), (2,2), and t.itemgot(1) will return the element at index 1 from each of those, so it will be 0, 1, 2, 2. I use that a lot, actually.

attrgot is basically the same thing, but it's going to return this attribute from each thing. So here we've got a=3, b=4 and a=1, b=2, so this will be the b from each: 4, 2. We use that quite a lot too. sorted is pretty obvious; range is pretty obvious. All right.

So there's a little guided tour of the first half of 01 core. Thanks for tuning in, and I'll see you all next time. Bye-bye.