
fastai v2 overview (at S4TF Design Meeting 2019-11-08)


Chapters

0:00 Introduction
3:00 literate programming
5:00 high-level API
9:00 training loop
11:10 callbacks
12:50 mixed precision
14:25 optimizer
17:00 datablock
21:10 tensors
27:55 function overloading
31:10 function pipeline
35:15 optimized pipeline
38:23 generic optimizer
41:27 runtime
44:55 JIT
49:28 Data Science

Transcript

to myself, to Paige, to Chris, to Ed, or others. Today, we actually have a very short agenda and a very welcome guest. So with that, I'd like to hand it off to the man who needs no introduction, Jeremy, to talk about fastai v2. Thanks, Brandon. So this actually comes out of my enthusiasm for when Adam presented a little bit of Hasktorch code a couple of weeks ago, which I thought was super cool.

And so mainly my goal here is to kind of encourage other people to present cool things in other languages and libraries, because I think it's a great way for us all to learn what cool stuff you can do. But as tends to happen when you say, can somebody please do x, somebody else says, hey, why don't you do x first?

So here I am doing x, where x is telling you about the library that Sylvain and I have been working on, basically since Chris Lattner and I finished our last Swift and fastai lesson, so for quite a while now. It's a library for PyTorch called fastai. And I think there are things we can learn from it regarding cool stuff we can do in Swift.

But I'm going to focus on trying to sell you on Fast AI rather than on the Swift bits. But where I think of Swift things, I will mention them as we go. So Fast AI is a library, as I said, that sits on top of PyTorch. And a lot of people think that a higher-level API is this small little thing that you slap on top of the serious business of TensorFlow or PyTorch or whatever.

But hopefully, you'll be convinced when I show you actually what's involved in a truly modern high-level API that there's actually quite a lot going on. If you want to check it out, I put a link to it in the meeting notes, and that will link you to the notebooks, the development notebooks.

So that's the first weird thing. What the hell are development notebooks? Well, this is an example of what fastai v2 source code looks like. It's written, as you see, in notebooks. Jeremy, we are just having a little trouble actually seeing that part. OK, so that probably means I failed to present my screen.

Shall I endeavor to do that? That would be great. Present your entire screen. Yeah, that explains a lot. There you go. Excellent. Nope. No. No, we don't. Nope. There we go. There we go. Victory. All right. Sorry about that. So yeah, so here is an example of what the Fast AI V2 source code looks like.

It has links. It has titles. It has pictures. It has code. And this may seem like a painful way to develop, because these are notebooks that are designed for interactive stuff, not for normal development. But actually, you'll find that this pixel shuffle also appears here in a standard layers.py module, which you can import in the usual way.

So we've developed a new literate programming system that allows you to write code and have it automatically turned into nice modules, which even do things that most people don't bother to do because they're annoying unless they're automatic, like setting __all__ so it only exports the things that you want.
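
To give a flavor of that export step, here is a toy sketch (this is not fastai's actual notebook exporter, just an illustration of pulling #export-flagged notebook cells into a module with an __all__):

```python
# A toy sketch of the idea, not fastai's actual notebook exporter:
# collect code cells flagged with "#export" and write them to a module
# whose __all__ lists the top-level definitions.
import json, re

def export_notebook(nb_path, module_path):
    cells = json.load(open(nb_path))['cells']
    exported = [''.join(c['source']) for c in cells
                if c['cell_type'] == 'code' and ''.join(c['source']).startswith('#export')]
    names = re.findall(r'^(?:def|class)\s+(\w+)', '\n'.join(exported), flags=re.M)
    with open(module_path, 'w') as f:
        f.write(f'__all__ = {names!r}\n\n')
        f.write('\n\n'.join(exported))

# export_notebook('01_layers.ipynb', 'layers.py')   # hypothetical file names
```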

Also coming out of that is automatic documentation. So all of that gets turned into hyperlinked documentation, including links directly to the source code, and automatic doc strings and parameter lists. Also, you'll see tests. And the tests both document the expected behavior, so if you're not sure what pixel shuffle is, this test is a very good description of exactly what it is, and ensure that our code is working.

And those tests can all be put in continuous integration and so forth. So that's the first interesting thing about fastai v2: it's the first truly literate-programming-based system I've worked on. And it's been an absolute delight. So we've written our own framework for every part of this, which is kind of a theme for fastai v2.

Basically, every time Sylvain and I found something that didn't quite work the way we wanted at any part of the stack, we wrote our own. So it's kind of like building something with no particular deadline and trying to do everything the very, very best we can. So the layered API of fastai v2 starts at the applications layer, which is where most beginners will start.

And it looks a lot like fastai v1, which is the released version of the software that people have seen before. But in v2, everything is rewritten from scratch. It's totally new. There's no code borrowed. But the top-level API looks quite similar. The idea is that in one, two, three, four lines of code, you can create a state-of-the-art computer vision classifier, including transfer learning.
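
For a concrete flavor, here is roughly what that looks like with the released fastai v2 API (the pets example and the is_cat labeling function are the standard quick-start example, not necessarily the exact notebook shown in the talk, and it is spread over a few extra lines here for readability):

```python
# Image classification with transfer learning in a handful of lines of fastai v2.
from fastai.vision.all import *

path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()   # in this dataset, cat filenames are capitalized

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2,
    label_func=is_cat, item_tfms=Resize(224))

learn = cnn_learner(dls, resnet34, metrics=error_rate)  # pretrained on ImageNet
learn.fine_tune(1)                                      # transfer learning
```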

With nearly the same one, two, three, four lines of code-- five lines of code in this case, because we're also displaying-- you can create a state-of-the-art segmentation model. And actually, when I say state-of-the-art, for example, this segmentation model is, to the best of my knowledge, still better than any published result on this particular CamVid dataset.

So these five lines of code are a super good five lines of code. And as you can see, they include a line of code which, if you say show_batch, will display your data in an appropriate format; in this case, for segmentation, showing you a picture with the color-coded pixels overlaid on top of it.

The same basic four lines of code will do text classification. So here's the basis of ULMFiT, which is a system for transfer learning in natural language processing that we developed and wrote up along with Sebastian Ruder. And as you can see, this is working on IMDB, a single epoch in four minutes.

The accuracy here is basically what was state-of-the-art as of a couple of years ago. Tabular or time series analysis, same deal. Basically, a few lines of code, nearly exactly the same lines of code, and you'll get a great result from your tabular data, and ditto for collaborative filtering.

So the high-level API for fastai v2 is designed to be something where, regardless of what application you're working on, you can get a great result from it using sensible defaults and carefully selected hyperparameters, largely done for you automatically for the most common kinds of problems that people look at. And that bit doesn't look that different to v1, but understanding how we get to that is kind of interesting and involves getting deeper and deeper.

This approach, though, does work super well. And partly, it's because this is based on quite a few years of research to figure out the best ways to solve various problems along the way. And when people actually try using fastai, they're often surprised. So this person posted on our forum that they'd been working in TF2 for a while and, for some reason they couldn't figure out,

all of their models were suddenly working much better. And the answer is, basically, that they're getting all these nice curated best practices. And somebody else on Twitter saw that and said, yep, we found the same thing. We were trying TensorFlow, spent months tweaking, and then we switched to fastai.

A couple of days later, we were getting better results. So these kinds of carefully curated defaults and algorithms and high-level APIs that do things right for you the first time can give you better results faster, even for experienced practitioners. But it's actually the other pieces that are more interesting, I think, for a Swift conversation, because the deeper we go into how we make that work, the more stuff you'll see that will be a great fit, I think, with Swift.

So the mid-layer API is something which is largely new to fastai-- actually, I guess the foundation layer is new; the mid-layer, I'd say, is more a rewrite of what was in v1. And it contains some of the things that make those high-level APIs easy. One of the bits which is the most interesting is the training loop itself.

And I thank Sylvain for the set of slides we have for the training loop. This is what a training loop looks like in PyTorch. We calculate some predictions. We get a loss. We do a backwards pass to get the gradients. We do an optimizer step. And then, optionally, from time to time,

we'll zero the gradients, depending on whether we're doing gradient accumulation. So this is what that loop looks like: run the model, get the loss, do the gradients, step the optimizer, and do that a bunch of times. But if you want to do something interesting, you'll need to add something to the loop, like keeping track of your training statistics in TensorBoard or in fastprogress or whatever.
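
For reference, the plain PyTorch loop being described is roughly the following (model, train_dl, loss_func and opt are assumed to already exist); everything that follows is about what happens when you start bolting extra behavior onto it:

```python
# The basic five-step training loop in plain PyTorch.
for epoch in range(n_epochs):
    for xb, yb in train_dl:
        pred = model(xb)              # 1. calculate predictions
        loss = loss_func(pred, yb)    # 2. compute the loss
        loss.backward()               # 3. backward pass to get gradients
        opt.step()                    # 4. optimizer step
        opt.zero_grad()               # 5. zero the gradients (unless accumulating)
```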

You might want to schedule various hyperparameters in various different ways. You might want to add various different kinds of regularization. You may want to do mixed precision training. You may want to do GANs. So this is a problem, because either you have to write a new training loop every time you want to add a different tweak.

And making all those tweaks work together then becomes incredibly complicated. Or you try to write one training loop which does everything you can think of. This is the training loop for fastai 0.7, which only did a tiny subset of the things I just said but was getting ridiculous. Or you can add callbacks at each step.

Now, the idea of callbacks has been around in deep learning APIs for a long time. But what's very different about fastai is that every callback is actually a two-way callback. It can read absolutely everything: it can read gradients, parameters, data, and so forth. And it can write them. So it can actually change anything at any time.

So the callbacks are, we say, infinitely flexible. We feel pretty confident in that, because the training loop in fastai has not needed to be modified to do any of the tweaks that I showed you before. So even the entirety of training GANs can be done in a callback. So basically, we switch out our basic training loop and replace it with one with the same five steps, but with callbacks between every step.
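
Here is a toy standalone sketch of that shape (hypothetical class and event names, not fastai's actual Learner and Callback classes), just to show what the same five steps with callbacks between every step, and a two-way callback, look like:

```python
# A toy callback-driven loop: same five steps, a hook between each one, and
# callbacks receive the whole loop object so they can read *and* write state.
class Loop:
    def __init__(self, model, dl, loss_func, opt, cbs):
        self.model, self.dl, self.loss_func, self.opt, self.cbs = model, dl, loss_func, opt, cbs
    def _event(self, name):
        for cb in self.cbs: getattr(cb, name, lambda loop: None)(self)
    def fit(self, n_epochs):
        self._event('begin_fit')
        for self.epoch in range(n_epochs):
            for self.xb, self.yb in self.dl:
                self._event('begin_batch')
                self.pred = self.model(self.xb);                self._event('after_pred')
                self.loss = self.loss_func(self.pred, self.yb); self._event('after_loss')
                self.loss.backward();                           self._event('after_backward')
                self.opt.step();                                self._event('after_step')
                self.opt.zero_grad()
            self._event('after_epoch')
        self._event('after_fit')

class LRScheduler:
    "A two-way callback: it reaches in and mutates the optimizer's learning rate."
    def __init__(self, sched_func): self.sched_func = sched_func
    def begin_batch(self, loop):
        for g in loop.opt.param_groups: g['lr'] = self.sched_func(loop.epoch)
```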

So that means, for example, if you want to do a scheduler, you can define a begin_batch that sets the optimizer's learning rate using some function. Or if you want to do early stopping, you can write an epoch-end callback that checks the metrics and stops training. Or you can do parallel training: set up DataParallel, and at the end of training, take DataParallel off again.

Gradient clipping: you have access to the parameters themselves, so you can clip the gradient norms at the end of the backward step, and so forth. So all of these different things have been written with fastai callbacks, including, for example, mixed precision. All of NVIDIA's recommendations for mixed precision training will be added automatically if you just add to_fp16 at the end of your learner call.

And really importantly, for example, all of those mixed precision things can be combined with multi-GPU and one-cycle training and gradient accumulation and so forth. And so trying to create a state-of-the-art model, which involves combining state-of-the-art regularization and mixed precision and distributed training and so forth is a really, really, really hard job.

But with this approach, it's actually just a single extra line of code to add each feature. And they all are explicitly designed to work with each other and are tested to work with each other. So for instance, here is mixup data augmentation, which is an incredibly powerful data augmentation method that has powered lots of state-of-the-art results.

And as you can see, it's under a screen of code. By comparison, here is the version of mixup from the paper. Not only is it far longer, but it only works with one particular dataset, one particular optimizer, and one particular kind of metric, and it is full of all kinds of assumptions, and so forth.

So that's an example of these mid-tier APIs. Another one is the optimizer. It looks like there have been lots and lots of different optimizers appearing in the last year or two, but it actually turns out that they're all minor tweaks on each other. Most libraries don't write them this way.

So for example, AdamW, also known as decoupled weight decay Adam, was added to PyTorch quite recently, in the last month or two. And it required writing a whole new class and a whole new step to implement. And it was like two or three years after the paper was released.

On the other hand, fastai's implementation, as you can see, involves a single extra function containing two lines of code, plus this little bit of gray here. So it's kind of like two and a half, three lines of code to implement the same thing. Because what we did was realize: let's refactor the idea of an optimizer, see what's different for each of these state-of-the-art optimizers that have appeared recently, and make it so that each of those things can be added and removed by just changing two things: stats and steppers.

A stat is something that you measure during training, such as the gradients or the gradient squared, or you might use dampening, or momentum, or whatever. And then a stepper is something that uses those stats to change the weights in some way. And you can combine those things together. And by combining these, we've been able to implement all these different optimizers.
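
Here is a rough standalone sketch of that factoring (hypothetical names, not fastai's actual Optimizer class): an optimizer is just a list of stats plus a list of steppers, so something like decoupled weight decay becomes one more stepper rather than a whole new class.

```python
# A sketch of the stats/steppers factoring of an optimizer.
import torch

def average_grad(state, p, mom=0.9, **kwargs):
    "Stat: keep a running average of the gradients."
    if 'grad_avg' not in state: state['grad_avg'] = torch.zeros_like(p.grad)
    state['grad_avg'].mul_(mom).add_(p.grad)
    return state

def momentum_step(state, p, lr, **kwargs):
    "Stepper: SGD-with-momentum step using the recorded average."
    p.data.add_(state['grad_avg'], alpha=-lr)

def weight_decay(state, p, lr, wd=0.01, **kwargs):
    "Stepper: decoupled weight decay is just one more two-line stepper."
    p.data.mul_(1 - lr * wd)

class ComposedOptimizer:
    def __init__(self, params, stats, steppers, **hypers):
        self.params, self.stats, self.steppers, self.hypers = list(params), stats, steppers, hypers
        self.state = {p: {} for p in self.params}
    def step(self):
        for p in self.params:
            if p.grad is None: continue
            for stat in self.stats:       self.state[p] = stat(self.state[p], p, **self.hypers)
            for stepper in self.steppers: stepper(self.state[p], p, **self.hypers)
    def zero_grad(self):
        for p in self.params:
            if p.grad is not None: p.grad.detach_(); p.grad.zero_()

# opt = ComposedOptimizer(model.parameters(), stats=[average_grad],
#                         steppers=[weight_decay, momentum_step], lr=1e-3, mom=0.9)
```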

So for instance, the LAMB optimizer, which came out of Google and was super cool at reducing pre-training time from three days to 76 minutes, we were able to implement in this tiny piece of code. And one of the nice things is that when you compare it to the math, it really looks almost line-for-line identical, except ours is a little bit nicer because we refactored some of the math.

So it makes it really easy to do research as well, because you can quite directly bring the equations across into your code. Then the last of the mid-tier APIs is the data block API, which is something we had in version 1 as well. But when we were porting that to Swift, we had an opportunity to rethink it.

And actually, Alexis Gallagher in particular helped us to rethink it in a more idiomatic Swift way. And it came out really nicely. And so then we took the result of that and ported it back into Python. And we ended up with something that was quite a bit nicer. So there's been a nice interaction and interplay between fastai in Python and SwiftAI in Swift in terms of helping each other's APIs.

But basically, the data block API is something where you define each of the key things that the program needs to know to flexibly get your data into a form you can put in a model. So it needs to know what type of data do you have, how do you get that data, how do you split it into a training set and a validation set, and then put that all together into a data bunch, which is just a simple little class.

It's literally, I think, four lines of code, which just has the validation set and the training set in one place. So with a data block, you just say, OK, my types, I want to create a black and white pillow image for my x and a category for my y.

And to get the list of files for those, I need to use this function. And to split those files into training and validation, use this function, which looks at the grandparent directory name. And to get the labels, use this function, which uses the parent directory name.
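
In the released fastai v2 API, that comes out roughly like this (the names are from the current library, and MNIST_TINY is just a small sample version of the dataset used here for illustration):

```python
# The MNIST data block: two types, how to get items, how to split, how to label.
from fastai.vision.all import *

path = untar_data(URLs.MNIST_TINY)

mnist = DataBlock(
    blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),  # types: b&w image x, category y
    get_items=get_image_files,                           # how to get the items
    splitter=GrandparentSplitter(),                      # train/valid from grandparent dir name
    get_y=parent_label)                                  # label from the parent dir name

dls = mnist.dataloaders(path)
dls.show_batch()
```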

And so with that, that's enough to give you MNIST, for instance. And so once you've done this, you end up with a data bunch. And as I mentioned before, everything has a show batch. So one of the nice things is it makes it very easy for you to look at your data regardless of whether it's tabular, or collaborative filtering, or vision, or text, or even audio.

If it was audio, it would show you a spectrogram and let you play the sound. So you can do custom labeling with data blocks by using, for example, a regular expression labeler. You can get your labels from an external file or data frame. And they could be multi-labels. So this thing here knows it's a multi-label classification task.

So it's automatically put a semicolon between each label. Again, it's still basically just three lines of code to define the data block. So here's a data block for segmentation. And you can see, really, the only thing I had to change here was that my dependent variable has been changed from category to pillow mask.

And again, automatically, I show batch works. And we can train a model from that straight away as well. You could do key points. So here, I've just changed my dependent variable to tensor point. And so now, it knows how to behave with that. Object detection. So now, I've changed my dependent variable to bounding box.

And you can see, I've got my bounding boxes here. Text, and so forth. So actually, going back, I have a couple of questions if you have time. Yeah. So in the code, you've got sort of the x's and y's. And it sounds like these different data types roughly conform to a protocol.

Yep. We're going to get to that in a moment. Absolutely. OK. Fantastic. That's an excellent way to think of it. And actually, this is the way it looked about three weeks ago. Now, it looks even more like a protocol. So yes, this is where it all comes from, which is the foundation APIs.

And this is the bit that I think is the most relevant to Swift. A lot of this, I think, would be a lot easier to write in Swift. So the first thing that we added to PyTorch was object-oriented tensors. For too long, we've all been satisfied with a data type called tensor, which has no semantics to it.

And so those tensors actually represent something like a sentence, or a picture of a cat, or a recording of somebody saying something. So why can't I take one of those tensors and say dot flip, or dot rotate, or dot resample, or dot translate to German? Well, the answer is you can't, because it's just a tensor without a type.

So we have added types to tensors. So you can now have a tensor image, a tensor point, a tensor bounding box. And you can define a flip_lr for each. And so this is some of the source code from the computer vision library we've written, so that now you can say flip_lr, and it flips the puppy.

And if it was key points, it would flip the key points. If it was a bounding box, it would flip the bounding boxes, and so forth. So this is an example of how tensors which carry around semantics are nice. It's also nice that I can just say .show, right?

So .show is something that's defined for all fastai v2 tensor types. And it will just display that tensor. It could even be a tuple containing a tensor, some bounding boxes, and some bounding box classes. Whatever it is, it will be able to display it. It will be able to convert it into batches for modeling, and so forth.
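
Here is a toy standalone sketch of that idea (assumed class names and a deliberately simplified show, not fastai's actual implementation):

```python
# Semantic tensor subtypes: the same verb does the type-appropriate thing,
# and every type knows how to display itself.
import torch
import matplotlib.pyplot as plt

class TensorImage(torch.Tensor):
    def flip_lr(self):
        return self.flip(-1)                    # flip the last (width) axis
    def show(self, ax=None):
        ax = ax or plt.gca()
        ax.imshow(self.permute(1, 2, 0).cpu())  # CHW -> HWC for display
        ax.axis('off')

class TensorPoint(torch.Tensor):
    def flip_lr(self):
        out = self.clone()
        out[..., 0] = -out[..., 0]              # mirror x (assuming a -1..1 coordinate space)
        return out

# img.flip_lr() flips pixels; pnts.flip_lr() mirrors the x coordinates instead.
```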

So with that, we can now create, for example, a random transformation called FlipItem. And we can say that the encoding of that random transformation is defined for a Pillow image or any tensor type. And in each case, the implementation is simply to call x.flip_lr. Or we could do the dihedral symmetry transforms in the same way.

Its before_call grabs a random number between 0 and 7 to decide which of the eight transposes to do, and then encodes calls x.dihedral with that thing we just got. And so now we can call that transform a bunch of times. And each time, we'll get back a different random augmentation.

So a lot of these things become nice and easy. Hey, Jeremy. Maxim asked, why isn't tensor a backing data structure for an image type? Tensor image is a tensor, which is an image type. Why isn't-- he says, why isn't tensor a backing-- why not have a different type named image, I guess, that has a tensor inside of it?

Do you mean why inherit rather than compose? Apparently, yes, that. Yeah. So inheritance-- I mean, you can do both. And you can create identical APIs. Inheritance just has the benefit that all the normal stuff you can do with a tensor, you can do with a tensor that happens to be an image.

So just because a tensor is an image doesn't mean you now don't want to be able to do fancy indexing to it, or do an LU decomposition of it, or stack it with other tensors across some axis. So basically, a tensor image ought to have all the behavior of a tensor plus additional behavior.

So that's why we use inheritance. We have a version that uses composition as well, and it uses Python's nice __getattr__ functionality to pass on all of the behavior of tensor. But it comes out more nicely in Python when you do inheritance. And actually, the PyTorch team has decided to officially implement semantic tensor subtypes now.

And so hopefully, in the next version of PyTorch, you won't have to use the extremely ugly hacks that we had to use to make this work. You'll be able to use the real ones. And hopefully, you'll see in TorchVision some of these ideas will be brought over there. Can I ask you, so how does the type propagate?

So if you do arithmetic on an image tensor, do you get an image tensor back there? So Chris and I had a conversation about this a few months ago, and I said I'm banging my head around this issue of types not carrying around their behavior. And Chris casually mentioned, oh, yes, that thing is called higher-kinded types.

So I went home, and that was one of these phrases I thought only functional programming dweebs talked about and I would never have to care about. But we do have to care about it, because it actually matters a lot. And it's basically the idea that if you have a tensor image and you add one to it, you want to get back a tensor image, because it should be an image that's a bit brighter rather than something that loses its type.
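
As a toy illustration of what that type retention means in practice (a hypothetical helper, not fastai's actual code):

```python
# Ops on a plain torch.Tensor return a plain tensor, so cast the result back
# to the richer semantic subtype of the input.
import torch

def retain_type(new, old):
    "If old had a semantic tensor subtype, give new the same subtype."
    if isinstance(old, torch.Tensor) and type(old) is not torch.Tensor and isinstance(new, torch.Tensor):
        return new.as_subclass(type(old))
    return new

# brighter = retain_type(img + 1, img)   # still a TensorImage, just a bit brighter
```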

So we implemented our own, again, hacky, partial higher-kinded type implementation in fastai v2. So for any of these things that you do to a tensor of a subtype, you will nearly always get back the correctly subtyped tensor. Yeah, I mean, I saw PyTorch recently talking about their named dimension extensions for their tensors as well, and I assume they have a similar kind of challenge there, where when you start doing arithmetic and other things like that on a tensor that has named dimensions, you want to propagate those along.

Yeah, so we haven't started using that yet, because it hasn't quite landed in stable. But yeah, we talked to the PyTorch team at the DevCon, and we certainly are planning to bring these ideas together. They're orthogonal but related concerns. Yeah, I just mean that I assume that that feature has the same problem, the same challenge.

I assume so, yeah. So it would be interesting to see what they do. Yeah, yeah, it would. Yeah, so it's kind of nice. Not only do we get to be able to say .show_batch, but you can even go .show_results. And in this case, it knows what the independent variable's type is, it knows what the dependent variable's type is, and it even knows things like, hey, for a classification task, those two things should be the same.

If they're not, by default, it will highlight them in red. So these lower-level foundations are the things that drive our ability to easily add this higher-level functionality. So this is the kind of ugly stuff we wouldn't have to do in Swift. We had to write our own type dispatch system.

We can annotate things with types, and those type annotations are actually semantic. And so we now have the joyfully modern idea of function overloading in Python, which has made life a lot easier, and which you already have. Do you have many users that are using this yet? Or is it still pre-release?

It's still pre-release. It's not even alpha. But there is an enthusiastic early adopter community who is using it. So for example, the user-contributed audio library has already been ported to it. I've also built a medical imaging library on top of it, and I've written a series of five notebooks showing how to do CT scan analysis with it.

So it's kind of like, it works. And-- I was curious what your users think of it, because there's this very strongly-held conception that Python folks hate types. And you're kind of providing a little bit of typing in the world, and I'm curious how they react to that. The extremely biased subset of early-adopter fastai enthusiasts who are using it love it.

And they tend to be people who have gone pretty deep in the past. So for example, my friend Andrew Shaw, who wrote something called Music Autobot, which is one of the coolest things in the world, in case you haven't seen it yet, which is something where you can generate music using a neural network.

You can put in some melodies and some chords, and it will auto-complete some additional melodies and chords. Or you can put in a melody, and it will automatically add chords, or you can add chords and it will create a melody. And so he had to write his own MIDI library, fastai.midi.

He rewrote it in v2, and he said it's just so, so, so much easier, thanks to those mid-tier APIs. So yeah, at this stage, it's-- I was just going to jump in quickly. I've been helping with some of the audio stuff, and it's been really awesome.

So it makes things a lot more flexible than version 1. So that's probably my favorite thing about it: everything can be interchanged. Nothing is like, well, it's got to be this way, because that's how it is. Yeah, that's cool. Cool, thanks. Another piece of the foundation is the partially reversible composed function pipeline dispatched over collections, which really rolls off the tongue; we call them Transforms and Pipelines.

Basically, the idea is that the way you kind of want function dispatch to work and function composition to work in deep learning is a little different to other places. There's a couple of things. The first is you often want to dispatch over tuples. And what I mean by that is if you have a function called flip left right, and you have a tuple representing a mini batch where your independent variable is a picture and your dependent variable is a set of bounding boxes, if you say flip left right on that tuple, you would expect both the x and the y to be flipped and to be flipped with the type appropriate method.

So our transforms will automatically send each element of a tuple to the function separately and/or dispatch according to their types automatically. We've mentioned type retention, so that's the kind of basic higher-kinded type stuff we need. One interesting thing is that it's not only encoding, in other words applying the function; you often need to be able to decode, which is to de-apply the function.

So for example, a categorization transform would take the word dog and convert it to the number 1, perhaps, which is what you need for modeling. But then when your predictions come back, you need to know what 1 represents. So you need to reverse that transform and turn 1 back into dog.

Often those transforms also need data driven setup. For example, in that example of dog becoming 1, there needs to be something that actually creates that vocab automatically, recognizing what are all the possible classes, so it can create a different index for each one and then apply that to the validation set.

And quite often these transforms also have some kind of state, such as the vocab. So we built this bunch of stuff that builds on top of each other. At the lowest level is a class called Transform, which is a callable that also has a decode, does the type retention, higher-kinded type thing, and does the dispatch over tuples by default.
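
Here is a deliberately simplified standalone sketch of one such transform (it leaves out the type dispatch and tuple handling that the real Transform class does): encode, decode, and data-driven setup.

```python
# A Categorize-like transform: setup builds the vocab, __call__ encodes,
# decode reverses it so predictions can be turned back into labels.
class Categorize:
    def setup(self, items):
        self.vocab = sorted(set(items))
        self.o2i = {o: i for i, o in enumerate(self.vocab)}
    def __call__(self, o):             # encode: 'dog' -> 1
        return self.o2i[o]
    def decode(self, i):               # decode: 1 -> 'dog'
        return self.vocab[i]

cat = Categorize()
cat.setup(['cat', 'dog', 'dog', 'fish'])
assert cat('dog') == 1 and cat.decode(1) == 'dog'
```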

So then a pipeline is something that does function composition over transforms. And it knows about, for example, setting up transforms. And setting up transforms in a pipeline is a bit tricky because you have to make sure that at each level of the pipeline, only the previous steps have been applied before you set up the next step.

So it does little things like that. And then we have something that applies a pipeline to a collection to give you an indexable, lazily transformed collection. And then you can do those in parallel to get back an independent variable, for instance. And then finally, we've built a data loader, which will apply these things in parallel and create collated batches.

So in the end, all this stuff makes a lot of things much easier. For example, the language model data loader in Fast AI v1 was like pages of code. In TensorFlow, it's pages of code. In Fast AI v2, it's less than a screen of code by leveraging these powerful abstractions and foundations.

So then finally-- and again, this is something I think Swift will be great for-- we worked really hard to make everything extremely well optimized. So for example, for preprocessing in natural language processing, we created a parallel generator in Python, which you can basically pass a class to that defines some setup and a call, and it can automatically parallelize that.

And it can automatically parallelize that. So for example, tokenization is done in parallel in a pretty memory efficient way. But perhaps the thing I'm most excited about, both in Python and Swift, is the optimized pipeline running on the GPU. So pretty much all of the transforms we've done can and by default do run on the GPU.

So for example, the flip left-right I showed you earlier will actually run on the GPU, as will warp, as will zoom, as will even things like crop. So one of the basics of this is the affine coordinate transform, which uses affine_grid and grid_sample, which are very powerful PyTorch functions, and which would be great things to actually write with Swift for TensorFlow's new metaprogramming, because they don't exist in TensorFlow, or at least not in any very complete way.
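
For a flavor of what that looks like, here is a minimal batch rotation built on those two PyTorch functions (the rotation helper itself is just an illustrative sketch, not fastai's actual affine transform):

```python
# Rotate a whole batch of images on whatever device they live on, using
# torch.nn.functional.affine_grid and grid_sample.
import math, torch
import torch.nn.functional as F

def rotate_batch(x, degrees):
    "x is a batch of images (N, C, H, W); returns the rotated batch."
    theta = torch.tensor(degrees * math.pi / 180, device=x.device)
    cos, sin = torch.cos(theta), torch.sin(theta)
    zero = torch.zeros_like(cos)
    mat = torch.stack([torch.stack([cos, -sin, zero]),
                       torch.stack([sin,  cos, zero])]).unsqueeze(0)   # (1, 2, 3)
    mat = mat.expand(x.size(0), -1, -1)                                # one matrix per image
    grid = F.affine_grid(mat, x.shape, align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

# x = torch.rand(8, 3, 224, 224, device='cuda')   # runs entirely on the GPU
# x_rot = rotate_batch(x, 10)
```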

But with these basic ideas, we can create this affine coordinate transform that lets us do a very wide range of data augmentations in parallel on the GPU. For those of you who know about the DALI library that NVIDIA has created, this provides a lot of the same benefits as DALI.

It's pretty similar in terms of its performance. But the nice thing is, all the stuff you write, you write in Python, not in CUDA. So with DALI, if they don't have the exact transformation you want, and there's a pretty high chance that they won't, then you're stuck. Whereas with fastai v2, you can write your own in a few lines of Python.

You can test it out in a Jupyter Notebook. It makes life super easy. So this kind of stuff, I feel like because Swift is a much faster, more hackable language than Python, or at least hackable in the sense of performance, I guess not as hackable in terms of its type system necessarily, I feel like we can build even more powerful foundations and pipelines and a real Swift for TensorFlow computer vision library, leveraging the metaprogramming and leveraging Swift numerics.

Stuff like that, I think, would be super cool. And so that is the end of that. That was great. That was excellent. Thank you very much, Jeremy. My pleasure. So just sort of thinking through, so as you're propagating along the self-type amongst the transformations, that seems relatively straightforward for Swift to handle.

Are there other sorts of things that you think we should start thinking about now? Yeah, the thing I really want you to think about, and we've kind of been nagging you on and off since March, is the way that tensors are represented. Having them as a value type the way they are now makes some things hard or impossible.

So the generic optimizer is a thing that I really, really want you guys to look into and build properly. Currently, it uses ugly key path hacks, and it's only kind of partially doing what we need it to do. So I talked to Alexis about this idea quite a bit, and we kind of thought maybe there could be some type that represents the actual block of GPU memory in a way where we can easily share that.

In practice, we've realized the vast majority of the time, we want to refer to that exact piece of memory on the GPU, not this idea of a tensor which may magically copy itself if I change something. And so, for example, with the generic optimizer, we need to be able to say, oh, this layer is part of this layer group, and this layer group has these things that need to happen to it.

So I actually said to Ed, hey, could you please have a look at the Swift AI generic optimizer, because it's trying to be a similar design to the fast AI V2 optimizer, but it's currently pretty unattractive. The second is I feel like creating a really good computer vision library is something which could be done now-ish.

When I tried to do it, I was getting kind of race conditions and freezes inside Swift, and I don't have the Swift skills to know where they were coming from or how to fix them. It would be nice if folks could-- I think all of my answer is, go back to the stuff that we all built together back in March, April, May, and try to start using it in real life, and build models with it, and put them in production, and see the bits where you get stuck, because you'll find things like, oh, there's no grid_sample, and, oh, there's race conditions in the interaction with OpenCV, and the optimizer doesn't quite work properly, and that stuff.

That makes sense. I think we're also trying to figure out right now what the right path is with the runtime. So we've historically been building on top of the TensorFlow runtime, which is great for a lot of reasons. It has a lot of functionality in the box. It does pretty much everything.

On the other hand, the performance, particularly in eager mode, is not great. So I think one of the things we're kicking around is the idea of going more directly into XLA. Yeah. Well, I think that's a thing that's been-- And XLA being a stepping stone towards MLIR in the bigger future, which is also coming.

I think that's the thing that's been stopping us all from using stuff like SwiftAI to actually build models, because the auto diff has memory leaks, and the TensorFlow runtime is-- I don't have to be polite, I'm not at Google-- it's molasses. And it implements everything in six different ways in six different places, and so forth.

So yeah, I think everybody's going to be digging into these higher-level APIs a lot more once the foundations are where they need to be. Yeah, and so the trade-off there is, if we go in that direction now, XLA doesn't provide all the things in the box. But I think that's probably fine.

We can fill in the stuff that we need. And so I think we're talking about that, trying to decide what to do there. We're also investing a lot in AD and finishing that off. Yeah, I mean, all the right work's being done. It's just, you know, it's just early days.

It's just, you know, it's just early days. Yes, I think the challenge that we're really struggling with is this decision to stick with the TensorFlow runtime or to move on to something else. That, I think, is complicated, but I agree this is one of the major blockers for adoption of use.

Yeah. I mean, especially if you want to take advantage of Swift, which we do, you need something where the kernel launch time is tiny or better still kind of non-existent because you can write everything in Swift. Otherwise, it's-- yeah, you don't really get the benefits. Yeah, and one of the-- so I'll say I'll answer your question in a second.

But one of the trade-offs there is that XLA doesn't have really fast kernel launch time because it effectively JIT compiles things before launching it. On the other hand, there are a lot of opportunities to do, for example, Fusion and other things like that that can offset it. And one of the nice hybrid models you get is this combination of tracing plus compilation, which I think could be really interesting.

Yeah. Said asked what's going on with MLIR. There's tons of stuff going on. It's really exciting. Just yesterday, there was a really fantastic talk from some folks at Intel talking about their code generation algorithm that are bringing over to MLIR, which I'm really, really, really excited about. And so there's tons of stuff going on.

Getting the ideal code gen for NVIDIA GPUs, for example, is probably still six-plus months away. And I don't know how much plus that is. But what I'm encouraging is for the community to come together and collaborate, instead of the different teams and the different companies kind of each doing their own thing.

And the Intel stuff that they presented yesterday is super, super impressive. So we'll see what happens with that. The other thing I might mention, in terms of tales from the other side: what's life like in the Python world, things that are and aren't working well over there.

The kind of answer to Swift for TensorFlow in the PyTorch world is the JIT. So it's basically to trace your Python code and attempt to figure out what it's doing and create what they call TorchScript, which is a dialect and subset of Python; or else actually parsing your Python code is also an option, and turning that into TorchScript.

It has reached the point now where it can actually be used for good. So one of our students created-- a bunch of our students actually have been working on a thing called Mish, including the young researcher who designed the original thing. It's a very nice activation function that's outperforming everything else that anybody is trying it on.

And it was pretty slow. And it just took me half an hour to create a JIT version, and it ran at the same speed as somebody else's hand-created CUDA code. So for small things like that, where it's two or three lines of code, it's working pretty well. Although for bigger things, like a new batch norm implementation we tried to do during the last course, the performance wasn't there.

Or if we actually tried to take-- one of the big problems at the moment, not just for Python but for the whole world of non-Google people, is that the best computer vision models by far are largely those that are coming out of Google, like EfficientNets and MixNets from Quoc Le's team.

They run very slowly and with a lot of memory on GPUs. And so we tried wrapping an entire EfficientNet and MixNet into a JIT-ed thing, so it wouldn't be so slow. The MixNet didn't work at all, and the EfficientNet was a little bit slower. So that's kind of the status of JIT in PyTorch is bits of it are useful.

The way I look at this from the compiler-y code generation piece is that I think the MLIR pieces are all going in the right direction. They're just going to take a while to get here. XLA, as far as I know, is state of the art in code generation. For the things it does, it does quite well.

The challenge is that it does have limitations, like static shapes and the number of ops it supports. You kind of have to be within its world for it to be useful. But it has a very useful-- it has a large subset of the world that it covers very well.

It covers a pretty useful world. TorchScript, my understanding is that the base model of TorchScript and the interpreters they have, I understand that's quite nice. But the kernel fusion piece is still fairly early; it's mostly element-wise operations, for example. I don't find it all that nice.

I mean, simple things like-- they're partly a limitation of the Python type system. So if you want to be able to write things that can work with different numbers of channels, you're out of luck, because they use Python type annotations, which have no way of saying it's a tuple of size n.

You have to say it's a tuple of size 3. So then you have to hard code all these assumptions into your code. Lots of stuff I find pretty frustrating. I see. Interesting. Well, so I mean, I think there's other spaces that I'm eager to reevaluate as-- I mean, this isn't the highest priority at this moment.

But in terms of our APIs, there's still very legit questions around, should we encode dtype in the static type system? Or should we just say tensor? If you just say tensor, then you get rid of all the generics everywhere, which cleans up tons of code at the cost of losing some of the checking.

But then I think if you go with the more semantic tensor types that Jeremy was pushing forward, you actually really don't even want the dtype. What you want is the semantics, and then you're actually in a better spot. Right. Like for mixed precision, we're switching stuff from one type to another all the time.

Depending on whether you're doing a loss function or a gradient calculation or whatever, you need to be changing between half and single. So if we went that direction, I think that would be really interesting in terms of ergonomics, but also simplification, which I think would be great. Your point about the optimizer is that the key paths have all kinds of weirdness, because you have multiple dtypes and you want to be generic over dtype.

And so that's really unpleasant right now. Yeah. I think also for Swift wanting to bring over a big world of Python using data scientists, they're definitely not going to be wanting to put lots and lots of verbose generic type annotations in their Jupyter notebooks. Yep. Yeah. So I don't know when we'll have cycles to re-evaluate those APIs, but I think we should go do a fresh take of this and combine it with an XLA-based approach that changes a lot of the trade-offs.

Right. So it would be really interesting. Yeah. I mean, I think in my mind, right, so a couple of weeks ago, I presented the layering proposal to separate out libtensor from libdeep learning so that we can then get the freedom to then iterate at that level and have multiple explorations on top.

So the progress update on there is that I started-- we have the two different packages now in Swift APIs so that you can depend only on one as opposed to the other. And Dan helped fix all the issues that I caused while doing the initial move of the random number generators out of what will become libdeep learning.

That said, it's still very early, and I have a lot more code to move. Well, I think that Jeremy is fundamentally right that we need to spend more time with SwiftAI and the optimizer designs and re-evaluate the training-with-callbacks systems and things like that. Yeah. As each of these variables changes, it affects other parts of the system.

And different trade-offs, I think, should be re-evaluated as we go. But I think that getting AD bulletproof is super important. And performance. Yeah, so we have to get those two things right. We'll then upstream and integrate it in Swift so that we can build on it and take it from there instead of-- yeah.

Quick question about tensor.dtype. I wonder if we would add any type assertions in any functions. I think the Python model is to not check things and just let things crash at runtime, if I understand. I don't know. I mean, I think that there's a couple of different options there. I don't know what the right answer is.

But again, one of the things that PyTorch is doing is more conversions between dtypes. So if you take an int8 and add it to an int32, it will actually promote the int8 to an int32, for example. I mean, hardly rocket science. But that's the kind of thing that is just very nice.
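
As a quick illustration of that promotion behavior in a recent PyTorch:

```python
import torch

a = torch.tensor([1], dtype=torch.int8)
b = torch.tensor([1], dtype=torch.int32)
print((a + b).dtype)   # torch.int32: the int8 operand gets promoted
```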

And it just eliminates a certain kind of error. On the other hand, it's kind of like broadcasting, where it makes certain things just work at the cost of potentially, again, surprising you in some cases. So I don't know about that. I think if you do things that don't make sense, like you try to do a floating point operation on an integer, then you would want it to be a runtime error.

But I think that our model is trending towards a much more runtime-centric approach. I think, ironically, Swift for TensorFlow started out very static. But now, for me, I'm realizing one of the major benefits of having a fast language is that dynamic is free. And so now you can have super dynamic abstractions that let you do these things in a nice way.

In PyTorch, you do get a pretty clear runtime error. If there's a type mismatch, it doesn't just crash. It will tell you what it expected and what it got. Yeah. And one of the nice things about eager mode is that then you get a stack trace. I think there are other ways around sort of encoding things into the static type system that you have to adhere to.

I think Adam's work on transitioning perfectly shows that you can still get a lot of benefits of static analysis without necessarily encoding into the type system. Yep. That said, I think it's still an open question as to how far we can really push that and where we end up landing.

Yeah, I think it's just a really, really great opportunity to re-evaluate these things as other pieces are coming together. Maxim asks, why is runtime checking preferable over static analysis? I think it's more that we're still trying to figure out what dimensions you want to be flexible on. And so doing things dynamically is sort of the ultimate in flexibility.

And so as we're trying to iterate on the programming model, making sure that things are as dynamic as you want them to be is sometimes nice. And then we should think about how static analysis can help catch errors sooner. Yeah, exactly. And so this is just a spectrum. And it's not that one end of the spectrum is better than the other.

It's about where in the spectrum you end up. And Nicholas asks, how are MLIR and XLA related? That is a super complicated question, because we're actively re-implementing pieces of XLA in terms of MLIR. So that's actually a lot more complicated than it sounds. I would just say that MLIR is a broad-scale compiler technology that solves lots of problems.

XLA, as a name, is typically thought of as a thing that turns tensors into efficient code. And so I wouldn't over-index on the number of letters, I guess. And once Swift for TensorFlow sits on top of MLIR, we'll still use XLA to target TPUs. Yeah, so I mean, this is internal work.

But we're doing a lot to change and enhance the TPU software stack in XLA. And things that are XLA are changing in their implementation as well. And so there's a big investment going on in all these pieces right now. And I think that more generally-- again, if you ignore which letters get attached to them-- the effort here culminates in a much more flexible code generation stack, support for dynamic shapes, and custom ops, and things like that.

It's just that different pieces of this very complicated technology come together at different points as well. I don't know what the marketing-- what the crack compiler marketing team will end up labeling the resulting thing. Excellent. We're slightly over time, so unless there's any pressing questions, thank everyone for joining.

And see you all next week. I think next week, Mark will be up talking about some of his work on testing the autodiff system to ensure that it's really reliable. There's some pretty good things that Mark's been up to there. It's also exciting that AD is getting upstreamed to master, too, which is really cool.

Thanks, everyone. Have a great week, and see you all next week. Thank you, Jeremy. Thank you. Bye.