fastai v2 overview (at S4TF Design Meeting 2019-11-08)
Chapters
0:00 Introduction
3:00 literate programming
5:00 high-level API
9:00 training loop
11:10 callbacks
12:50 mixed precision
14:25 optimizer
17:00 data block
21:10 tensors
27:55 function overloading
31:10 function pipeline
35:15 optimized pipeline
38:23 generic optimizer
41:27 runtime
44:55 JIT
49:28 Data Science
00:00:00.000 |
to myself, to Paige, to Chris, to Ed, or others. 00:00:14.320 |
to the man who needs no introduction, Jeremy, 00:00:29.080 |
for when Adam presented a little bit of Hasktorch code 00:00:34.040 |
a couple of weeks ago, which I thought was super cool. 00:00:37.280 |
And so mainly my goal here is to kind of encourage 00:00:42.240 |
other people to present cool things in other languages 00:00:45.800 |
and libraries, because I think it's a great way for us 00:00:50.840 |
But as tends to happen when you say, can somebody please do x, 00:00:54.840 |
somebody else says, hey, why don't you do x first? 00:01:00.160 |
about the library that Sylvain and I have been working on. 00:01:11.360 |
so for quite a while now, it's a library for PyTorch called 00:01:21.040 |
And I think there are things we can learn from it 00:01:30.800 |
But I'm going to focus on trying to sell you on Fast AI 00:01:46.880 |
And a lot of people think that a higher-level API is 00:01:56.800 |
on top of the serious business of TensorFlow or PyTorch 00:02:02.680 |
when I show you actually what's involved in a truly 00:02:13.600 |
to it in the meeting notes, and that will link you 00:02:27.280 |
Well, this is an example of what Fast AI V2 source code looks 00:02:35.680 |
Jeremy, we are just having a little trouble actually 00:02:41.920 |
OK, so that probably means I failed to present my screen. 00:03:03.600 |
So yeah, so here is an example of what the Fast AI V2 source 00:03:25.440 |
But actually, you'll find that also this pixel shuffle 00:03:29.120 |
appears here in a standard layers.py module, which 00:03:44.520 |
So we've developed a new literate programming system 00:03:48.040 |
that allows you to write code and have it automatically 00:03:53.720 |
turned into nice modules, which even do things 00:03:57.680 |
that most people don't bother to do because they're 00:04:00.160 |
annoying if they're not automatic, like setting __all__ 00:04:06.240 |
Also coming out of that, automatically, is documentation. 00:04:09.920 |
So all that gets turned into hyperlink documentation, 00:04:14.200 |
including links directly to the source code and automatic doc 00:04:24.800 |
And the tests are used both to document the behavior expected. 00:04:31.880 |
this test is a very good description of exactly what 00:04:34.760 |
it is and also ensures that our code is working. 00:04:38.200 |
And those tests can all be put in continuous integration 00:04:43.200 |
So that's the first interesting thing about fastai v2 00:04:47.120 |
is it's the first truly literate programming-based system 00:04:55.800 |
for every part of this, which is kind of a theme for fastai v2. 00:05:03.640 |
and I found something that didn't quite work the way we 00:05:06.800 |
wanted it at any part of the stack, we wrote our own. 00:05:12.800 |
with no particular deadline and trying to do everything 00:05:18.200 |
So the layered API of fastai v2 starts at the applications 00:05:24.800 |
layer, which is where most beginners will start. 00:05:33.280 |
which is the released version of the software 00:05:36.720 |
But V2, everything is rewritten from scratch. 00:05:43.920 |
The idea is that in one, two, three, four lines of code, 00:05:49.480 |
you can create a state-of-the-art computer vision 00:05:57.920 |
With nearly the same one, two, three, four lines of code-- 00:06:03.080 |
five lines of code in this case, because we're also displaying-- 00:06:07.240 |
you can create a state-of-the-art segmentation model. 00:06:13.880 |
is, to the best of my knowledge, still better 00:06:16.280 |
than any published result on this particular CamVid data set. 00:06:19.560 |
So these five lines of code are super good five lines of code. 00:06:23.000 |
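To make the applications layer concrete, here is a minimal sketch in the style of the released fastai v2; it is a Pets image classifier rather than the CamVid segmentation example on the slide, and the exact code shown in the talk may differ.

```python
# A sketch of the application-layer style being described, using names from the
# released fastai v2; not the exact code shown on the slide.
from fastai.vision.all import *

path = untar_data(URLs.PETS)                             # download a sample dataset
dls = ImageDataLoaders.from_name_re(                     # build DataLoaders from file names
    path, get_image_files(path/'images'),
    pat=r'(.+)_\d+.jpg', item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)   # transfer learning, sensible defaults
learn.fine_tune(1)                                       # carefully selected hyperparameters
dls.show_batch()                                         # every application gets show_batch
```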
And as you can see, it includes a line of code, 00:06:26.400 |
which, if you say show batch, it will display your data 00:06:35.520 |
and the color-coded pixels overlaid on top of the picture. 00:06:50.000 |
is a system that we developed and wrote up along 00:06:52.480 |
with Sebastian Ruder for transfer learning 00:06:59.080 |
is working on IMDB on a single epoch in four minutes. 00:07:03.320 |
The accuracy here is basically what was the state-of-the-art 00:07:13.720 |
Basically, a few lines of code, nearly exactly the same lines 00:07:16.960 |
of code, and you'll get a great result from your tabular data 00:07:29.280 |
is designed to be something where, regardless 00:07:36.920 |
using sensible defaults and carefully selected 00:07:39.240 |
hyperparameters, automatically, largely done for you 00:07:45.040 |
for the most common kinds of problems that people look at. 00:07:49.160 |
And that bit doesn't look that different to V1, 00:07:52.840 |
but understanding how we get to that is kind of interesting 00:08:07.840 |
on quite a few years of research to figure out 00:08:10.000 |
what are the best ways to solve various problems along the way. 00:08:19.640 |
that they've been working in TF2 for a while, 00:08:22.360 |
and for some reason they couldn't figure out, 00:08:24.800 |
all of their models are suddenly working much better. 00:08:29.240 |
getting all these nice kind of curated best practices. 00:08:32.960 |
And somebody else on Twitter saw that and said, yep, 00:08:37.440 |
We were trying TensorFlow, spent months tweaking, 00:08:41.440 |
A couple of days later, we were getting better results. 00:08:44.200 |
So these kind of carefully curated defaults and algorithms 00:08:48.880 |
and high-level APIs that do things right for you 00:08:51.360 |
the first time, even for experienced practitioners, 00:09:00.960 |
that are more, I think, interesting for a Swift 00:09:05.640 |
go into how we make that work, the more stuff 00:09:09.920 |
you'll see, which will be a great fit, I think, with Swift. 00:09:22.840 |
actually, I guess the foundation layer is new. 00:09:24.840 |
So the mid-layer, I guess I'd say, is more rewritten for V1. 00:09:34.640 |
One of the bits which is the most interesting 00:09:47.080 |
This is what a training loop looks like in PyTorch. 00:10:09.120 |
Run the model, get the loss, do the gradients, 00:10:11.960 |
step the optimizer, do that a bunch of times. 00:10:23.600 |
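For reference, that basic loop is roughly the following in plain PyTorch; this is a generic sketch, not the slide's code, and model, loss_func, opt, train_dl and n_epochs are assumed to be defined elsewhere.

```python
# A generic sketch of the basic PyTorch training loop described above;
# model, loss_func, opt, train_dl and n_epochs are assumed to exist.
for epoch in range(n_epochs):
    for xb, yb in train_dl:
        pred = model(xb)            # run the model
        loss = loss_func(pred, yb)  # get the loss
        loss.backward()             # compute the gradients
        opt.step()                  # step the optimizer
        opt.zero_grad()             # reset and do it a bunch of times
```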
statistics in TensorBoard or in fastprogress or whatever. 00:10:28.840 |
You might want to schedule various hyperparameters 00:10:33.160 |
You might want to add various different kinds 00:10:45.680 |
have to write a new training loop for every time 00:10:56.840 |
Or you try and write one training loop which does 00:11:03.800 |
which only did a tiny subset of the things I just said 00:11:12.960 |
Now, the idea of callbacks has been around in deep learning 00:11:21.720 |
is that every callback is actually a two-way callback. 00:11:27.680 |
It can read gradients, parameters, data, so forth. 00:11:33.720 |
So it can actually change anything at any time. 00:11:37.120 |
So the callbacks are, we say, infinitely flexible. 00:11:41.720 |
We feel pretty confident in that because the training loop 00:11:49.680 |
to do any of the tweaks that I showed you before. 00:11:53.000 |
So even the entirety of training GANs can be done in a callback. 00:11:58.600 |
So basically, we switch out our basic training loop 00:12:01.800 |
and replace it with one with the same five steps 00:12:09.920 |
So that means, for example, if you want to do a scheduler, 00:12:16.560 |
sets the optimizer's learning rate to some function. 00:12:21.440 |
you can write an on-epoch-end callback that checks the metrics 00:12:27.360 |
Or you can do parallel training, set up data parallel, 00:12:44.280 |
at the end of the backward step, and so forth. 00:12:50.880 |
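Here is a minimal sketch of what a training loop driven by two-way callbacks can look like. The Learner class and event names below are illustrative, not fastai v2's exact API; the point is that every step is wrapped in callbacks that can read and modify any training state.

```python
# A minimal sketch of a two-way-callback training loop. Event names and this
# Learner class are illustrative, not fastai v2's exact API.
class Learner:
    def __init__(self, model, data, loss_func, opt, cbs):
        self.model, self.data, self.loss_func = model, data, loss_func
        self.opt, self.cbs, self.stop = opt, cbs, False

    def cb(self, event):
        for c in self.cbs: getattr(c, event, lambda learn: None)(self)

    def fit(self, n_epochs):
        self.cb('begin_fit')
        for self.epoch in range(n_epochs):
            for self.xb, self.yb in self.data:
                self.cb('begin_batch')        # e.g. a scheduler sets the optimizer's lr here
                self.pred = self.model(self.xb)
                self.loss = self.loss_func(self.pred, self.yb)
                self.cb('after_loss')         # e.g. mixup can replace self.loss here
                self.loss.backward()
                self.cb('after_backward')     # e.g. gradient clipping
                self.opt.step(); self.opt.zero_grad()
            self.cb('after_epoch')            # e.g. early stopping checks metrics, sets self.stop
            if self.stop: break
        self.cb('after_fit')
```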
all things that have been written with fastAI callbacks, 00:12:57.520 |
All of NVIDIA's recommendations, mixed precision training, 00:13:03.360 |
will be added automatically if you just add a to_fp16 00:13:15.240 |
can be combined with multi-GPU and one-cycle training 00:13:25.080 |
And so trying to create a state-of-the-art model, which 00:13:30.840 |
involves combining state-of-the-art regularization 00:13:33.200 |
and mixed precision and distributed training and so 00:13:39.600 |
But with this approach, it's actually just a single extra 00:13:47.080 |
to work with each other and are tested to work with each other. 00:13:51.200 |
So for instance, here is mix-up data augmentation, 00:13:54.760 |
which is an incredibly powerful data augmentation method that 00:13:59.320 |
has powered lots of state-of-the-art results. 00:14:02.440 |
And as you can see, it's under a screen of code. 00:14:17.960 |
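For a feel of what "under a screen of code" means, here is a sketch of mixup written as a callback for the Learner sketch above. It follows the mixup paper's recipe; fastai v2's actual MixUp callback differs in its details.

```python
# A sketch of mixup as a callback for the Learner sketch above; this follows the
# mixup paper's recipe, not fastai v2's exact implementation.
import numpy as np, torch

class MixUp:
    def __init__(self, alpha=0.4): self.alpha = alpha
    def begin_batch(self, learn):
        lam = np.random.beta(self.alpha, self.alpha)
        perm = torch.randperm(learn.xb.size(0))
        self.lam, self.yb2 = lam, learn.yb[perm]
        # blend each input with another randomly chosen input from the same batch
        learn.xb = lam * learn.xb + (1 - lam) * learn.xb[perm]
    def after_loss(self, learn):
        # the loss is the same blend of the losses against both sets of labels
        learn.loss = self.lam * learn.loss + (1 - self.lam) * learn.loss_func(learn.pred, self.yb2)
```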
full of all kinds of assumptions and only one 00:14:36.760 |
been lots and lots of different optimizers appearing 00:14:52.520 |
as decoupled weight decay Adam, was added to PyTorch 00:15:15.040 |
On the other hand, FastAI's implementation, as you can see, 00:15:22.840 |
two lines of code and this little bit of gray here. 00:15:25.760 |
So it's kind of like two and a half, three lines of code 00:15:37.600 |
see what's different for each of these state of the art 00:15:46.120 |
can be added and removed by just changing two things-- 00:15:58.920 |
during training, such as the gradients or the gradient 00:16:01.520 |
squared, or you might use dampening, or momentum, 00:16:07.000 |
uses those stats to change the weights in some way. 00:16:15.320 |
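The split being described is between "stats" that record something during training and "steppers" that use those stats to update the weights. The sketch below illustrates that split with SGD-with-momentum; the names and signatures are my own, not fastai v2's actual optimizer API.

```python
# A sketch of the "stats plus steppers" idea; names and signatures are illustrative,
# not fastai v2's actual optimizer API.
import torch

def average_grad(state, p, mom=0.9, **kw):
    # A *stat*: track an exponential moving average of each parameter's gradients.
    state['grad_avg'] = state.get('grad_avg', torch.zeros_like(p.grad)) * mom + p.grad
    return state

def momentum_step(state, p, lr=0.1, **kw):
    # A *stepper*: use the recorded stats to change the weights.
    p.data.add_(state['grad_avg'], alpha=-lr)

class GenericOptimizer:
    # Run every stat, then every stepper, for each parameter.
    def __init__(self, params, stats, steppers, **hypers):
        self.params, self.stats, self.steppers, self.hypers = list(params), stats, steppers, hypers
        self.state = {p: {} for p in self.params}
    def step(self):
        for p in self.params:
            if p.grad is None: continue
            for stat in self.stats: self.state[p] = stat(self.state[p], p, **self.hypers)
            for stepper in self.steppers: stepper(self.state[p], p, **self.hypers)
    def zero_grad(self):
        for p in self.params:
            if p.grad is not None: p.grad.detach_(); p.grad.zero_()

# SGD with momentum is then just a particular choice of stats and steppers:
# opt = GenericOptimizer(model.parameters(), [average_grad], [momentum_step], mom=0.9, lr=0.1)
```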
able to implement all these different optimizers. 00:16:27.000 |
which came out of Google and was super cool at reducing 00:16:30.960 |
the pre-training time from three days to 76 minutes, 00:16:34.880 |
we were able to implement that in this tiny piece of code. 00:16:38.840 |
And one of the nice things is that when you compare it 00:16:41.720 |
to the math, it really looks almost line for line 00:16:51.560 |
So it makes it really easy to do research as well, 00:16:55.160 |
because you can quite directly bring the equations across 00:17:04.240 |
is the data block API, which is something we had in version 1 00:17:26.560 |
helped us to rethink it in a more idiomatic Swift way. 00:17:36.840 |
And we ended up with something that was quite a bit nicer. 00:17:40.160 |
So there's been a nice interaction and interplay 00:17:42.400 |
between fast AI in Python and Swift AI in Swift 00:17:52.480 |
is something where you define each of the key things 00:17:56.680 |
that the program needs to know to flexibly get your data 00:18:04.160 |
So it needs to know what type of data do you have, 00:18:08.120 |
how do you get that data, how do you split it 00:18:13.320 |
and then put that all together into a data bunch, which 00:18:17.520 |
It's literally, I think, four lines of code, which just 00:18:20.760 |
has the validation set and the training set in one place. 00:18:26.840 |
So with a data block, you just say, OK, my types, 00:18:31.960 |
I want to create a black and white pillow image for my x 00:18:43.240 |
And to split those files into training and validation, 00:18:52.240 |
And to get the labels, use this function, which is 00:19:04.840 |
And so once you've done this, you end up with a data bunch. 00:19:08.640 |
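In the released fastai v2 that example looks roughly like the sketch below (black-and-white images and categories, MNIST-style). Note that the "data bunch" of the talk became DataLoaders in the released API, and the code shown on the slide may differ slightly.

```python
# A sketch of the data block example being described, using released fastai v2 names.
from fastai.vision.all import *

path = untar_data(URLs.MNIST_TINY)
mnist = DataBlock(
    blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),  # the types: B&W image x, category y
    get_items=get_image_files,                           # how to get the items
    splitter=GrandparentSplitter(),                      # how to split train/validation (by folder)
    get_y=parent_label)                                  # how to label each item
dls = mnist.dataloaders(path)
dls.show_batch()                                         # and show_batch works, as before
```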
And as I mentioned before, everything has a show batch. 00:19:12.360 |
So one of the nice things is it makes it very easy for you 00:19:14.840 |
to look at your data regardless of whether it's 00:19:17.080 |
tabular, or collaborative filtering, or vision, or text, 00:19:21.600 |
If it was audio, it would show you a spectrogram 00:19:28.120 |
So you can do custom labeling with data blocks 00:19:32.880 |
by using, for example, a regular expression labeler. 00:19:37.240 |
You can get your labels from an external file or data frame. 00:19:44.080 |
So this thing here knows it's a multi-label classification 00:19:47.600 |
So it's automatically put a semicolon between each label. 00:19:51.800 |
Again, it's still basically just three lines of code 00:19:59.600 |
And you can see, really, the only thing I had to change here 00:20:02.640 |
was that my dependent variable has been changed from category 00:20:09.000 |
And again, automatically, I show batch works. 00:20:11.640 |
And we can train a model from that straight away as well. 00:20:17.400 |
So here, I've just changed my dependent variable 00:20:20.600 |
And so now, it knows how to behave with that. 00:20:24.960 |
So now, I've changed my dependent variable to bounding box. 00:20:27.600 |
And you can see, I've got my bounding boxes here. 00:20:36.040 |
So actually, going back, I have a couple of questions 00:20:41.000 |
So the code, you've got sort of the x's and y's. 00:20:47.680 |
And these sound like these different data types roughly 00:21:00.240 |
And actually, this is the way it looked about three weeks ago. 00:21:12.000 |
And this is the bit that I think is the most relevant to Swift. 00:21:14.600 |
A lot of this, I think, would be a lot easier to write in Swift. 00:21:31.840 |
with a data type called tensor, which has no semantics to it. 00:21:37.720 |
And so those tensors actually represent something 00:21:51.040 |
and say dot flip, or dot rotate, or dot resample, 00:21:58.280 |
Well, the answer is you can't, because it's just 00:22:08.680 |
So you can now have a tensor image, a tensor point, 00:22:13.720 |
And you can define a flip left, right for each. 00:22:16.560 |
And so this is some of the source code from-- 00:22:18.480 |
we've written our own computer vision library, 00:22:20.920 |
so that now you can say flip LR, and it flips the puppy. 00:22:26.800 |
And if it was key points, it would flip the key points. 00:22:30.640 |
If it was a bounding box, it would flip the bounding boxes, 00:22:34.880 |
So this is an example of how tensors which carry around 00:22:40.360 |
It's also nice that I can just say dot show, right? 00:22:43.280 |
So dot show is something that's defined for all fastai v2 00:22:53.400 |
It could even be a tuple containing a tensor, 00:22:56.560 |
and some bounding boxes, and some bounding box classes. 00:22:59.080 |
Whatever it is, it will be able to display it. 00:23:02.720 |
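The sketch below shows the semantic-tensor idea in recent PyTorch: Tensor subclasses that carry their meaning, so behavior like flip_lr and show can be defined per type. It is illustrative only; fastai v2's implementation needs extra hacks (mentioned earlier) so that operations keep the subtype, whereas recent PyTorch preserves subclasses and provides Tensor.as_subclass out of the box.

```python
# A sketch of semantic tensor subtypes; illustrative, not fastai v2's implementation.
import torch
import matplotlib.pyplot as plt

class TensorImage(torch.Tensor):
    def flip_lr(self): return self.flip(-1)                    # flip the width axis
    def show(self, ax=None):
        (ax or plt.gca()).imshow(self.permute(1, 2, 0).cpu().numpy())

class TensorPoint(torch.Tensor):
    def flip_lr(self):                                         # points flip by negating x coords
        res = self.clone(); res[..., 0] = -res[..., 0]; return res

img = torch.rand(3, 64, 64).as_subclass(TensorImage)
img.flip_lr().show()                                           # flips (and shows) the image
pts = torch.tensor([[0.5, 0.2]]).as_subclass(TensorPoint)
pts.flip_lr()                                                  # flips the points instead
```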
It will be able to convert it into batches for modeling, 00:23:10.480 |
So with that, we can now create, for example, 00:23:18.200 |
And we can say that the encoding of that random transformation 00:23:21.640 |
is defined for a pillow image or any tensor type. 00:23:31.480 |
Or we could do the dihedral symmetry transforms 00:23:36.280 |
Before we call it, we grab a random number between 0 and 7 00:23:40.000 |
to decide which of the eight transposes to do. 00:23:51.200 |
And so now we can call that transform a bunch of times. 00:23:55.480 |
And each time, we'll get back a different random augmentation. 00:23:59.000 |
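A minimal sketch of that random-transform idea follows: draw the random state once per call, then apply the same choice to every element of the tuple so x and y stay in sync. It is illustrative only; fastai v2's RandTransform and dihedral transform differ in their details.

```python
# A sketch of a random dihedral transform: one random draw per call, applied to
# every tuple element. Illustrative, not fastai v2's exact classes.
import random, torch

class RandDihedral:
    def __call__(self, batch):
        k = random.randint(0, 7)                        # one of the 8 dihedral transposes
        return tuple(self.encode(x, k) for x in batch)  # same k for every tuple element
    def encode(self, x, k):
        if k & 1: x = x.flip(-1)                        # horizontal flip
        if k & 2: x = x.flip(-2)                        # vertical flip
        if k & 4: x = x.transpose(-1, -2)               # diagonal transpose
        return x

img, mask = torch.rand(3, 8, 8), torch.rand(1, 8, 8)
aug = RandDihedral()
samples = [aug((img, mask)) for _ in range(3)]          # a different augmentation each call
```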
So a lot of these things become nice and easy. 00:24:02.560 |
Maxim asked, why isn't tensor a backing data structure 00:24:06.320 |
Tensor image is a tensor, which is an image type. 00:24:13.200 |
Why isn't-- he says, why isn't tensor a backing-- 00:24:38.480 |
Inheritance just has the benefit that all the normal stuff you 00:24:41.760 |
can do with a tensor, you can do with a tensor that 00:24:46.600 |
doesn't mean you now don't want to be able to do fancy indexing 00:24:54.120 |
or stack it with other tensors across some axis. 00:25:09.720 |
We have a version that uses composition as well, 00:25:11.680 |
and it uses Python's nice getattr functionality 00:25:29.280 |
decided to officially implement semantic tensor subtypes now. 00:25:35.080 |
And so hopefully, in the next version of PyTorch, 00:25:37.320 |
you won't have to use the extremely ugly hacks 00:25:48.600 |
some of these ideas will be brought over there. 00:25:51.760 |
Can I ask you, so how does the type propagate? 00:26:00.320 |
So Chris and I had a conversation about this a few months ago, 00:26:05.280 |
and I said I'm banging my head around this issue of types 00:26:12.160 |
And Chris casually mentioned, oh, yes, that thing 00:26:16.320 |
So I went home, and that was one of these phrases 00:26:19.120 |
I thought only functional programming dweebs talked 00:26:22.000 |
about, and I would never care about a tensor. 00:26:24.600 |
And we have to care about it, because it actually 00:26:27.320 |
And it's basically the idea that if you have a tensor image 00:26:30.160 |
and you add one to it, you want to get back a tensor image, 00:26:33.920 |
because it should be an image that's a bit brighter 00:26:42.280 |
hacky, partial higher-kinded type implementation 00:26:53.400 |
you will nearly always get back the correctly subtyped tensor. 00:26:58.680 |
Yeah, I mean, I saw the PyTorch team recently sort of talking 00:27:01.600 |
about their named indexing extensions for their tensors 00:27:06.680 |
as well, and I assume they have a similar kind of challenge 00:27:13.640 |
has named dimensions, you want to propagate those along. 00:27:25.640 |
But yeah, we talked to the PyTorch team at the DevCon, 00:27:30.520 |
and we certainly are planning to bring these ideas together. 00:27:38.040 |
Yeah, I just mean that I assume that that feature has 00:27:44.560 |
So it would be interesting to see what they do. 00:27:55.400 |
Not only do we get to be able to say .show batch, 00:28:00.840 |
And in this case, it knows what the independent variables type 00:28:05.240 |
is, it knows what the dependent variables type is, 00:28:07.720 |
and it even knows things like, hey, for a classification task, 00:28:12.360 |
If they're not, by default, I will highlight them in red. 00:28:18.240 |
the things that drive our ability to easily add 00:28:30.240 |
We had to write our own type dispatch system. 00:28:36.040 |
and those type annotations are actually semantic. 00:28:38.720 |
And so we now have the joyfully modern idea of function 00:28:45.080 |
overloading in Python, which has made life a lot easier, 00:28:51.800 |
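Below is a simplified sketch of dispatch on argument types. fastai v2 wrote its own type dispatch system keyed off type annotations; the standard library's functools.singledispatch is used here purely to illustrate the behavior, and the types are the hypothetical ones from the earlier sketch.

```python
# A simplified sketch of type-based dispatch; fastai v2 uses its own TypeDispatch,
# singledispatch is just a stand-in to illustrate the behavior.
from functools import singledispatch
import torch

class TensorImage(torch.Tensor): pass
class TensorPoint(torch.Tensor): pass

@singledispatch
def flip_lr(x):
    raise TypeError(f"don't know how to flip a {type(x).__name__}")

@flip_lr.register
def _(x: TensorImage): return x.flip(-1)

@flip_lr.register
def _(x: TensorPoint):
    res = x.clone(); res[..., 0] = -res[..., 0]; return res

flip_lr(torch.rand(3, 4, 4).as_subclass(TensorImage))   # picks the TensorImage overload
```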
Do you have many users that are using this yet? 00:28:59.960 |
But there is an enthusiastic early adopter community 00:29:08.000 |
So for example, the user-contributed audio library 00:29:14.960 |
I've also built a medical imaging library on top of it, 00:29:17.280 |
and I've written a series of five notebooks showing how 00:29:33.320 |
because there's this very strongly-held conception 00:29:41.760 |
of typing in the world, and I'm curious how they react to that. 00:29:46.040 |
The extremely biased subset of early-adopter fastai 00:29:53.120 |
And they tend to be people who have gone pretty deep 00:29:59.920 |
who wrote something called Music Autobot, which 00:30:03.920 |
in case you haven't seen it yet, which is something 00:30:07.120 |
where you can generate music using a neural network. 00:30:11.120 |
You can put in some melodies and some chords, 00:30:13.480 |
and it will auto-complete some additional melodies and chords. 00:30:16.520 |
Or you can put it in a melody, and it will automatically 00:30:19.360 |
add chords, or you can add chords that create melody. 00:30:23.960 |
And so he had to write his own MIDI library, fastai.midi. 00:30:29.200 |
He rewrote it in V2, and he said it's just like so, so, 00:30:33.640 |
so much easier, thanks to those mid-tier APIs. 00:30:44.720 |
I've been helping with some of the audio stuff, 00:30:51.240 |
So it makes things a lot more flexible than version 1. 00:30:56.440 |
So that's probably my favorite thing about it, 00:31:02.360 |
Nothing is like, well, it's got to be this way, 00:31:10.800 |
Another piece of the transform part of the foundation 00:31:10.800 |
is the partially reversible composed function 00:31:20.760 |
really rolls off the tongue, we call them Transform and Pipeline. 00:31:35.600 |
and function composition to work in deep learning 00:31:43.560 |
The first is you often want to dispatch over tuples. 00:31:47.200 |
And what I mean by that is if you have a function called 00:31:52.760 |
flip left right, and you have a tuple representing 00:31:59.200 |
a mini batch where your independent variable is 00:32:03.480 |
is a set of bounding boxes, if you say flip left right 00:32:07.160 |
on that tuple, you would expect both the x and the y 00:32:11.680 |
to be flipped and to be flipped with the type appropriate 00:32:21.400 |
send each element of a tuple to the function separately 00:32:26.000 |
and/or dispatch according to their types automatically. 00:32:30.720 |
We've mentioned type retention, so the kind of basic 00:32:51.160 |
would take the word dog and convert it to the number 1, 00:32:55.520 |
perhaps, which is what you need for modeling. 00:33:02.720 |
So you need to reverse that transform and turn 1 back 00:33:08.120 |
Often those transforms also need data driven setup. 00:33:12.320 |
For example, in that example of dog becoming 1, 00:33:16.000 |
there needs to be something that actually creates that vocab 00:33:18.360 |
automatically, recognizing what are all the possible classes, 00:33:21.520 |
so it can create a different index for each one 00:33:39.040 |
At the lowest level is a class called transform, 00:33:41.920 |
which is a callable, which also has a decode, 00:33:48.720 |
does the type retention, higher-kinded type thing, 00:33:51.400 |
and does the dispatch over tuples by default. 00:33:54.720 |
So then a pipeline is something that does function composition 00:34:00.120 |
And it knows about, for example, setting up transforms. 00:34:08.400 |
is a bit tricky because you have to make sure 00:34:21.600 |
And then we have something that applies a pipeline 00:34:23.760 |
to a collection to give you an indexable, lazily transformed 00:34:31.040 |
to get back an independent variable, for instance. 00:34:47.520 |
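The sketch below illustrates that Transform/Pipeline idea: data-driven setup, an encode direction for modeling, and a decode direction to get back something displayable. The class names and structure here are simplified stand-ins; fastai v2's real Transform and Pipeline add the type retention and tuple dispatch described above.

```python
# A sketch of the Transform/Pipeline idea: setup, encodes, decodes, and composition.
# Simplified stand-ins, not fastai v2's real classes.
class Categorize:
    def setup(self, items):
        self.vocab = sorted(set(items))                     # data-driven: build the vocab
        self.o2i = {o: i for i, o in enumerate(self.vocab)}
    def encodes(self, o): return self.o2i[o]                # 'dog' -> 1
    def decodes(self, i): return self.vocab[i]              # 1 -> 'dog'

class SimplePipeline:
    # Compose transforms, setting each one up on data processed by the previous ones.
    def __init__(self, tfms): self.tfms = tfms
    def setup(self, items):
        for t in self.tfms:
            if hasattr(t, 'setup'): t.setup(items)
            items = [t.encodes(o) for o in items]
    def __call__(self, o):
        for t in self.tfms: o = t.encodes(o)
        return o
    def decode(self, o):
        for t in reversed(self.tfms): o = t.decodes(o)
        return o

pipe = SimplePipeline([Categorize()])
pipe.setup(['cat', 'dog', 'dog'])
pipe('dog'), pipe.decode(1)                                 # -> (1, 'dog')
```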
So in the end, all this stuff makes a lot of things 00:34:55.720 |
For example, the language model data loader in Fast AI v1 00:35:05.440 |
In Fast AI v2, it's less than a screen of code 00:35:08.480 |
by leveraging these powerful abstractions and foundations. 00:35:16.400 |
So then finally-- and again, this is something 00:35:22.160 |
we worked really hard to make everything extremely well 00:35:26.280 |
So for example, preprocessing and natural language processing, 00:35:30.200 |
we created a parallel generator in Python, which you can then 00:35:35.840 |
basically pass a class to that defines some setup and a call. 00:35:44.200 |
So for example, tokenization is done in parallel 00:35:50.000 |
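As a rough illustration of running preprocessing like tokenization in parallel, here is a sketch using only the standard library; fastai v2 has its own parallel generator that takes a class defining a setup and a call, which this merely stands in for.

```python
# A sketch of parallel preprocessing; a stand-in for fastai v2's parallel generator.
from concurrent.futures import ProcessPoolExecutor

def tokenize(text):
    return text.lower().split()              # stand-in for a real tokenizer

def parallel_tokenize(texts, n_workers=4, chunksize=64):
    with ProcessPoolExecutor(n_workers) as ex:
        yield from ex.map(tokenize, texts, chunksize=chunksize)

if __name__ == '__main__':
    toks = list(parallel_tokenize(["Some text to tokenize", "And some more text"]))
```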
But perhaps the thing I'm most excited about, 00:35:57.640 |
both in Python and Swift, is the optimized pipeline 00:36:04.960 |
So pretty much all of the transforms we've done can 00:36:14.880 |
So for example, when you do the flip left right 00:36:17.720 |
I showed you earlier, it'll actually run on the GPU, 00:36:20.920 |
as will warp, as will zoom, as will even things like crop. 00:36:26.960 |
So one of the basics of this is the affine coordinate transform, 00:36:36.280 |
which are very powerful PyTorch functions, which 00:36:40.560 |
would be great things to actually write in script 00:36:57.480 |
that lets us do a very wide range of data augmentations 00:37:03.840 |
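The sketch below shows the basic mechanism: build one affine matrix per image (here a random horizontal flip plus a small rotation), turn it into a sampling grid with F.affine_grid, and resample the whole batch with F.grid_sample. This is a simplified illustration; fastai v2 composes many such matrices into a single resample, which is where much of the speed comes from.

```python
# A sketch of GPU batch augmentation via an affine coordinate transform.
import math, torch
import torch.nn.functional as F

def rand_flip_rotate(batch, max_deg=10.0):
    bs = batch.size(0)
    flip = (torch.rand(bs, device=batch.device) < 0.5).float() * 2 - 1   # -1 or 1 per image
    ang = (torch.rand(bs, device=batch.device) * 2 - 1) * math.radians(max_deg)
    theta = torch.zeros(bs, 2, 3, device=batch.device)                   # one 2x3 matrix per image
    theta[:, 0, 0] = ang.cos() * flip
    theta[:, 0, 1] = -ang.sin()
    theta[:, 1, 0] = ang.sin() * flip
    theta[:, 1, 1] = ang.cos()
    grid = F.affine_grid(theta, batch.shape, align_corners=False)        # sampling coordinates
    return F.grid_sample(batch, grid, align_corners=False)               # one resample for the batch

imgs = torch.rand(8, 3, 64, 64)          # add .cuda() to run the whole thing on the GPU
aug = rand_flip_rotate(imgs)
```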
For those of you that know about the DALI library 00:37:11.680 |
It's pretty similar in terms of its performance. 00:37:14.280 |
But the nice thing is, all the stuff you write, 00:37:19.720 |
So with DALI, if they don't have the exact transformation 00:37:24.160 |
you want, and there's a pretty high chance that they won't, 00:37:29.000 |
Or else with fast AI v2, you can write your own 00:37:41.480 |
So this kind of stuff, I feel like because Swift 00:37:46.360 |
is a much faster, more hackable language than Python, 00:37:52.800 |
or at least hackable in the sense of performance, 00:37:55.800 |
I guess not as hackable in terms of its type system necessarily, 00:37:59.640 |
I feel like we can build even more powerful foundations 00:38:07.040 |
and pipelines and a real Swift for TensorFlow computer vision 00:38:19.800 |
Stuff like that, I think, would be super cool. 00:38:41.560 |
seems relatively straightforward for Swift to handle. 00:38:44.120 |
Are there other sorts of things that you think 00:38:49.040 |
Yeah, the thing I really want you to think about, 00:38:51.240 |
and we've kind of been nagging you on and off since March, 00:38:58.760 |
Having them as a value type the way they are now 00:39:07.760 |
that I really, really want you guys to look into and build 00:39:16.680 |
and it's only kind of partially doing what we need it to do. 00:39:20.280 |
So I talked to Alexis about this idea quite a bit, 00:39:28.480 |
could be some type that represents the actual block 00:39:34.760 |
of GPU memory in a way where we can easily share that. 00:39:39.680 |
In practice, we've realized the vast majority of the time, 00:39:45.360 |
we want to refer to that exact piece of memory on the GPU, 00:39:49.760 |
not this idea of a tensor which may magically copy itself 00:39:56.560 |
And so, for example, with the generic optimizer, 00:40:02.440 |
part of this layer group, and this layer group 00:40:08.760 |
So I actually said to Ed, hey, could you please 00:40:12.800 |
have a look at the Swift AI generic optimizer, 00:40:15.480 |
because it's trying to be a similar design to the fast AI 00:40:20.480 |
V2 optimizer, but it's currently pretty unattractive. 00:40:26.120 |
The second is I feel like creating a really good computer 00:40:31.240 |
vision library is something which could be done now-ish. 00:40:35.800 |
When I tried to do it, I was getting kind of race conditions 00:40:41.960 |
and freezes inside Swift, and I don't have the Swift skills 00:40:45.400 |
to know where they were coming from or how to fix them. 00:40:50.480 |
I think all of my answers is, go back to the stuff 00:40:52.680 |
that we all built together back in March, April, May, 00:41:00.320 |
and build models with it, and put them in production, 00:41:04.400 |
and see the bits where it hits where you get stuck, 00:41:08.080 |
because you'll find things like, oh, there's no grid sample, 00:41:11.320 |
and, oh, there's race conditions in the interaction of OpenCV, 00:41:17.120 |
and the optimizer doesn't quite work properly, and that stuff. 00:41:28.000 |
I think we're also trying to figure out right now what 00:41:45.000 |
On the other hand, the performance, particularly in 00:41:48.880 |
So I think one of the things we're kicking around 00:42:02.720 |
I think that's the thing that's been stopping us all 00:42:05.000 |
from using stuff like Swift AI to actually build models, 00:42:14.240 |
I don't have to be polite-- so not at Google. 00:42:17.240 |
And it implements everything in six different ways 00:42:24.640 |
to be digging into these higher level APIs a lot more 00:42:34.240 |
XLA doesn't provide all the things in the box. 00:42:41.120 |
I'm just so kind to let stuff that we need it. 00:42:48.320 |
We're also investing a lot in AD and finishing that off. 00:42:51.040 |
Yeah, I mean, all the right work is being done. 00:42:57.360 |
Yes, I think the challenge that we're really struggling with 00:42:59.920 |
is this decision to stick with the TensorFlow runtime 00:43:09.520 |
agree this is one of the major blockers for adoption and use. 00:43:15.160 |
I mean, especially if you want to take advantage of Swift, 00:43:18.440 |
which we do, you need something where the kernel launch 00:43:25.560 |
time is tiny or better still kind of non-existent 00:43:31.120 |
Otherwise, it's-- yeah, you don't really get the benefits. 00:43:36.240 |
so I'll say I'll answer your question in a second. 00:43:40.520 |
that XLA doesn't have really fast kernel launch time 00:43:47.280 |
On the other hand, there are a lot of opportunities 00:43:51.040 |
to do, for example, Fusion and other things like that 00:43:57.960 |
is this combination of tracing plus compilation, which 00:44:09.840 |
Just yesterday, there was a really fantastic talk 00:44:11.840 |
from some folks at Intel talking about their code generation 00:44:15.040 |
algorithms that they're bringing over to MLIR, which I'm really, 00:44:23.520 |
Getting the ideal code gen for NVIDIA GPUs, for example, 00:44:36.880 |
of the different teams and the different companies 00:44:42.480 |
And the Intel stuff that they presented yesterday 00:44:53.480 |
The other thing I might mention in terms of tails 00:44:56.040 |
on the other side, what's life like in the Python world, 00:44:59.560 |
things that are and aren't working well over there. 00:45:04.040 |
The kind of the answer to Swift for TensorFlow in the PyTorch 00:45:19.160 |
is a dialect or subset of Python that they actually parse. 00:45:29.240 |
It has reached the point now where it can actually 00:45:38.520 |
a bunch of our students actually have been working on a thing 00:45:41.360 |
called Mish, including a young researcher who 00:45:53.560 |
And when it just took me half an hour to create a JIT version 00:45:58.160 |
and it ran at the same speed as somebody else's 00:46:05.280 |
two or three lines of code, that's working pretty well. 00:46:09.160 |
Although for bigger things, like a new batch norm implementation 00:46:22.400 |
not just for Python, but the whole world of non-Google 00:46:26.280 |
people, is that the best computer vision models by far 00:46:29.960 |
are largely those that are coming out of Google, 00:46:32.160 |
like EfficientNets and MixNets, from Quoc Le's team. 00:46:36.160 |
They run very slowly and with a lot of memory on GPUs. 00:46:41.040 |
And so we tried wrapping an entire EfficientNet 00:46:44.840 |
and MixNet into a JIT-ed thing, so it wouldn't be so slow. 00:46:49.080 |
The MixNet didn't work at all, and the EfficientNet 00:46:53.840 |
So that's kind of the status of JIT in PyTorch 00:47:00.720 |
The way I look at this from the compiler-y code generation 00:47:03.960 |
piece is that I think the MLIR pieces are all 00:47:07.080 |
They're just going to take a while to get here. 00:47:10.240 |
XLA, as far as I know, is state of the art in code generation. 00:47:17.080 |
The challenge of those, it does have sort of limitations 00:47:19.800 |
like static shapes and the number of ops it supports. 00:47:23.280 |
You kind of have to be within its world for it to be useful. 00:47:34.680 |
TorchScript, my understanding is that the base model 00:47:39.320 |
of TorchScript and the interpreters they have, 00:47:46.000 |
But the kernel fusion piece is still fairly early 00:47:48.480 |
when it's mostly pointwise operations, for example. 00:47:54.960 |
they're partly a limitation of the Python type system. 00:48:01.800 |
that can work with different numbers of channels 00:48:03.720 |
well, you're out of luck because they use Python type 00:48:06.680 |
limitations, which have no way of saying it's 00:48:12.680 |
So then you have to hard code all these assumptions 00:48:19.920 |
Well, so I mean, I think there's other spaces 00:48:25.320 |
I mean, this isn't the highest priority at this moment. 00:48:31.840 |
should we encode dtype in the static type system? 00:48:36.960 |
If you just say tensor, then you get rid of all the generics 00:48:46.120 |
But then I think if you go with more semantic tensor types 00:48:52.440 |
What you want is the semantics, and that you're actually 00:48:56.240 |
Like for mixed precision, we're switching stuff from one type 00:49:01.080 |
Depending on whether you're doing a loss function 00:49:04.640 |
you need to be changing between half and single. 00:49:09.720 |
that would be really interesting in terms of ergonomics, 00:49:13.520 |
but also simplification, which I think would be great. 00:49:17.120 |
Your point about the optimizer is that the key path 00:49:19.920 |
have all kinds of weirdness because you have multiple dtypes 00:49:41.240 |
wanting to put lots and lots of verbose generic type 00:49:59.000 |
and combine it with an XLA-based approach that 00:50:09.560 |
so a couple of weeks ago, I presented the layering proposal 00:50:12.440 |
to separate out libtensor from libdeep learning 00:50:16.200 |
so that we can then get the freedom to then iterate 00:50:19.240 |
at that level and have multiple explorations on top. 00:50:23.280 |
So the progress update on there is that I started-- 00:50:26.760 |
we have the two different packages now in Swift APIs 00:50:31.400 |
so that you can depend only on one as opposed to the other. 00:50:36.080 |
that I caused while doing the initial move of the random 00:50:38.640 |
number generators out of what will become libdeep learning. 00:50:46.360 |
Well, I think that Jeremy is fundamentally right 00:50:48.760 |
that we need to spend more time with Swift AI 00:50:50.520 |
and the optimized designs and re-evaluate the training 00:51:13.840 |
Yeah, so we have to get those two things right. 00:51:18.560 |
so that we can build on it and take a program instead of-- 00:51:32.840 |
I think the Python model is to not check things 00:51:35.040 |
and see what things crashed at runtime, if I understand. 00:51:35.040 |
I mean, I think that there's a couple of different options 00:51:42.160 |
But again, one of the things that PyTorch is doing 00:51:45.240 |
is they're doing more conversions with dtypes. 00:51:49.920 |
it will actually promote an int8 into an int32, for example. 00:51:54.720 |
But that's the kind of thing that is just very nice. 00:51:59.440 |
And it just eliminates a certain kind of error. 00:52:01.520 |
On the other hand, it's kind of like broadcasting where 00:52:03.680 |
it makes certain things just work at the cost of potentially 00:52:10.760 |
I think if you do things that don't make sense, 00:52:13.640 |
like you try to do a floating point operation on an integer, 00:52:22.240 |
towards a much more runtime-centric approach. 00:52:33.360 |
of the major benefits of having a fast host language 00:52:38.120 |
And so now you can have super dynamic abstractions 00:52:43.600 |
In PyTorch, you do get a pretty clear runtime error. 00:52:47.440 |
If there's a type mismatch, it doesn't just crash. 00:52:49.720 |
It will tell you what to expect and what it got. 00:52:55.720 |
I think there are other ways around sort of encoding things 00:53:02.560 |
into the static-type system that you have to adhere to. 00:53:06.280 |
I think Adam's work on transitioning perfectly 00:53:08.360 |
shows that you can still get a lot of benefits 00:53:10.560 |
of static analysis without necessarily encoding 00:53:16.400 |
That said, I think it's still an open question as to how far 00:53:18.840 |
we can really push that and where we end up landing. 00:53:22.200 |
Yeah, I think it's just a really, really great opportunity 00:53:27.960 |
to re-evaluate these things as other pieces are coming together. 00:53:31.760 |
Maxim asks, why is runtime checking preferable 00:53:46.560 |
And so as we're trying to iterate on the programming 00:53:49.160 |
model, making sure that things are as dynamic as you want them 00:53:54.760 |
And then we should think about how static analysis can 00:54:11.160 |
because we're actively re-implementing pieces 00:54:15.080 |
So that's actually a lot more complicated than it sounds. 00:54:18.520 |
I would just say that MLIR is a broad scale compiler 00:54:28.000 |
technology that solves lots of problems. XLA, as a name, 00:54:35.160 |
And so I wouldn't over-index on the number of letters, I guess. 00:54:45.800 |
And once Swift-TensorFlow sits on top of MLIR, 00:55:03.080 |
And things that are XLA are changing in their implementation 00:55:07.880 |
And so there's a big investment going on in all these pieces 00:55:14.720 |
if you ignore which letters get attached to them, 00:55:16.760 |
the effort here culminates in a much more flexible 00:55:19.800 |
code generation stack, support for dynamic shapes, 00:55:25.680 |
It's just that different pieces in this very complicated 00:55:28.880 |
technology come together at different points as well. 00:55:31.640 |
I don't know what the marketing-- the crack compiler marketing 00:55:39.120 |
team will end up labeling the resultant kind. 00:55:46.560 |
went into-- unless there's any pressing questions, 00:55:55.560 |
I think next week, Mark will be up talking about some of his work 00:56:03.600 |
There's some pretty good things that Mark's been up to there. 00:56:06.520 |
It's also exciting that AD is getting upstreamed to master, 00:56:13.240 |
Have a great week, and see you all next week.