
fastai v2 overview (at S4TF Design Meeting 2019-11-08)


Chapters

0:00 Introduction
3:00 literate programming
5:00 high-level API
9:00 training loop
11:10 callbacks
12:50 mixed precision
14:25 optimizer
17:00 data block
21:10 tensors
27:55 function overloading
31:10 function pipeline
35:15 optimized pipeline
38:23 generic optimizer
41:27 runtime
44:55 JIT
49:28 Data Science


00:00:00.000 | to myself, to Paige, to Chris, to Ed, or others.
00:00:04.480 | Today, we actually have a very short agenda
00:00:10.040 | and a very welcome guest.
00:00:12.240 | So with that, I'd like to hand it off
00:00:14.320 | to the man who needs no introduction, Jeremy,
00:00:17.000 | to talk about Fast AI v2.
00:00:20.400 | Thanks, Brandon.
00:00:23.400 | So this actually comes out of my enthusiasm
00:00:29.080 | for when Adam presented a little bit of Hasktorch code
00:00:34.040 | a couple of weeks ago, which I thought was super cool.
00:00:37.280 | And so mainly my goal here is to kind of encourage
00:00:42.240 | other people to present cool things in other languages
00:00:45.800 | and libraries, because I think it's a great way for us
00:00:48.160 | all to learn what cool stuff you can do.
00:00:50.840 | But as tends to happen when you say, can somebody please do x,
00:00:54.840 | somebody else says, hey, why don't you do x first?
00:00:57.200 | So here I am doing x, where x is telling you
00:01:00.160 | about the library that Sylvain and I have been working on.
00:01:03.480 | Basically, since Chris Lattner and I
00:01:06.600 | finished our last Swift and Fast AI lesson,
00:01:11.360 | so for quite a while now, it's a library for PyTorch called
00:01:19.040 | Fast AI.
00:01:21.040 | And I think there are things we can learn from it
00:01:27.680 | regarding cool stuff we can do in Swift.
00:01:30.800 | But I'm going to focus on trying to sell you on Fast AI
00:01:33.840 | rather than on the Swift bits.
00:01:35.720 | But where I think of Swift things,
00:01:37.120 | I will mention them as we go.
00:01:40.880 | So Fast AI is a library, as I said,
00:01:44.760 | that sits on top of PyTorch.
00:01:46.880 | And a lot of people think that a higher-level API is
00:01:53.280 | this small little thing that you slap
00:01:56.800 | on top of the serious business of TensorFlow or PyTorch
00:01:59.760 | or whatever.
00:02:00.720 | But hopefully, you'll be convinced
00:02:02.680 | when I show you actually what's involved in a truly
00:02:06.840 | modern high-level API that there's actually
00:02:09.200 | quite a lot going on.
00:02:11.000 | If you want to check it out, I put a link
00:02:13.600 | to it in the meeting notes, and that will link you
00:02:17.880 | to the notebooks, the development notebooks.
00:02:23.000 | So that's the first weird thing.
00:02:24.720 | What the hell are development notebooks?
00:02:27.280 | Well, this is an example of what Fast AI V2 source code looks
00:02:31.880 | like.
00:02:33.000 | It's written, as you see, in notebooks.
00:02:35.680 | Jeremy, we are just having a little trouble actually
00:02:39.640 | doing that seeing part.
00:02:41.920 | OK, so that probably means I failed to present my screen.
00:02:46.560 | Shall I endeavor to do that?
00:02:49.680 | That would be great.
00:02:50.280 | Present your entire screen.
00:02:52.520 | Yeah, that explains a lot.
00:02:55.200 | There you go.
00:02:57.160 | Excellent.
00:02:57.680 | Nope.
00:02:58.640 | No, we don't.
00:02:59.120 | Nope.
00:02:59.360 | There we go.
00:03:00.280 | There we go.
00:03:01.040 | Victory.
00:03:01.520 | All right.
00:03:02.640 | Sorry about that.
00:03:03.600 | So yeah, so here is an example of what the Fast AI V2 source
00:03:08.520 | code looks like.
00:03:09.960 | It has links.
00:03:10.880 | It has titles.
00:03:11.680 | It has pictures.
00:03:13.000 | It has code.
00:03:16.000 | And this may seem like a painful way
00:03:20.000 | to develop because these are notebooks that
00:03:21.920 | are designed for interactive stuff,
00:03:23.760 | not for normal development.
00:03:25.440 | But actually, you'll find that also this pixel shuffle
00:03:29.120 | appears here in a standard layers.py module, which
00:03:41.680 | you can import in the usual way.
00:03:44.520 | So we've developed a new literate programming system
00:03:48.040 | that allows you to write code and have it automatically
00:03:53.720 | turned into nice modules, which even do things
00:03:57.680 | that most people don't bother to do because they're
00:04:00.160 | annoying unless they're automatic, like setting __all__
00:04:03.280 | so it only exports the things that you want.
00:04:06.240 | Also coming out of that is automatically documentation.
00:04:09.920 | So all that gets turned into hyperlink documentation,
00:04:14.200 | including links directly to the source code and automatic doc
00:04:18.680 | strings and parameter lists.
00:04:22.200 | Also, you'll see tests.
00:04:24.800 | And the tests are used both to document the behavior expected.
00:04:29.520 | So if you're not sure what pixel shuffle is,
00:04:31.880 | this test is a very good description of exactly what
00:04:34.760 | it is and also ensures that our code is working.
00:04:38.200 | And those tests can all be put in continuous integration
00:04:41.720 | and so forth.
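
As a rough illustration of the literate-programming workflow being described: a notebook cell marked for export becomes part of a regular module, and a test cell doubles as documentation and as a CI test. This is a simplified sketch of nbdev-style conventions, not fastai's actual source; the class and markers below are illustrative.

```python
# Simplified sketch of an nbdev-style notebook (markers/names are illustrative).
import torch
import torch.nn as nn

# --- notebook cell marked for export: collected into the generated layers.py ---
class PixelShuffleUpsample(nn.Module):
    "Upsample by `scale` using a 1x1 conv followed by pixel shuffle (sketch only)."
    def __init__(self, ni, nf, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(ni, nf * scale**2, kernel_size=1)
        self.shuf = nn.PixelShuffle(scale)
    def forward(self, x): return self.shuf(self.conv(x))

# --- notebook test cell: documents the expected behavior and runs in CI ---
x = torch.randn(1, 16, 8, 8)
assert PixelShuffleUpsample(16, 8)(x).shape == (1, 8, 16, 16)
```
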
00:04:43.200 | So that's the first interesting thing about fastai v2
00:04:47.120 | is it's the first truly literate programming-based system
00:04:50.760 | I've worked on.
00:04:51.600 | And it's been an absolute delight.
00:04:54.040 | So we've written our own framework
00:04:55.800 | for every part of this, which is kind of a theme for fastai v2.
00:05:02.200 | Basically, every time Sylvain
00:05:03.640 | and I found something that didn't quite work the way we
00:05:06.800 | wanted it at any part of the stack, we wrote our own.
00:05:10.800 | So it's kind of like building something
00:05:12.800 | with no particular deadline and trying to do everything
00:05:15.040 | the very, very best we can.
00:05:18.200 | So the layered API of FastAIV2 starts at the applications
00:05:24.800 | layer, which is where most beginners will start.
00:05:29.960 | And it looks a lot like FastAIV1,
00:05:33.280 | which is the released version of the software
00:05:35.080 | that people have seen before.
00:05:36.720 | But V2, everything is rewritten from scratch.
00:05:39.360 | It's totally new.
00:05:40.200 | There's no code borrowed.
00:05:41.920 | But the top-level API looks quite similar.
00:05:43.920 | The idea is that in one, two, three, four lines of code,
00:05:49.480 | you can create a state-of-the-art computer vision
00:05:53.440 | classifier, including transfer learning.
00:05:57.920 | With nearly the same one, two, three, four lines of code--
00:06:03.080 | five lines of code in this case, because we're also displaying--
00:06:07.240 | you can create a state-of-the-art segmentation model.
00:06:10.200 | And actually, when I say state-of-the-art,
00:06:12.320 | for example, this segmentation model
00:06:13.880 | is, to the best of my knowledge, still better
00:06:16.280 | than any published result on this particular CamVid data set.
00:06:19.560 | So these five lines of code are super good five lines of code.
00:06:23.000 | And as you can see, it includes a line of code,
00:06:26.400 | which, if you say show batch, it will display your data
00:06:30.520 | in an appropriate format, in this case,
00:06:32.920 | showing you segmentation, a picture,
00:06:35.520 | and the color-coded pixels overlaid on top of the picture.
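
For reference, a sketch of the kind of "few lines of code" application-layer usage being described. The names below follow the released fastai v2 API (untar_data, ImageDataLoaders, cnn_learner, fine_tune) and may differ slightly from the pre-release notebooks shown in the talk.

```python
from fastai.vision.all import *

# a transfer-learning image classifier in a handful of lines (released-API names)
path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(path, get_image_files(path/"images"),
                                    pat=r"(.+)_\d+.jpg", item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)

dls.show_batch()  # the "show batch" line: displays images with their labels
```
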
00:06:40.920 | The same basic four lines of code
00:06:42.800 | will do text classification.
00:06:44.920 | So here's the basis of ULMFIT, which
00:06:50.000 | is a system that we developed and wrote up along
00:06:52.480 | with Sebastian Ruder for transfer learning
00:06:55.280 | in natural language processing.
00:06:57.120 | And as you can see, in here, this
00:06:59.080 | is working on IMDB on a single epoch in four minutes.
00:07:03.320 | The accuracy here is basically what was the state-of-the-art
00:07:06.720 | as of a couple of years ago.
00:07:07.800 | Tabular or time series analysis, same deal.
00:07:13.720 | Basically, a few lines of code, nearly exactly the same lines
00:07:16.960 | of code, and you'll get a great result from your tabular data
00:07:21.960 | and Ditto for collaborative filtering.
00:07:24.760 | So the high-level API for fastAIV2
00:07:29.280 | is designed to be something where, regardless
00:07:32.040 | of what application you're working on,
00:07:34.480 | you can get a great result from it
00:07:36.920 | using sensible defaults and carefully selected
00:07:39.240 | hyperparameters, automatically, largely done for you
00:07:45.040 | for the most common kinds of problems that people look at.
00:07:49.160 | And that bit doesn't look that different to V1,
00:07:52.840 | but understanding how we get to that is kind of interesting
00:07:57.640 | and involves getting deeper and deeper.
00:08:01.320 | This approach, though, does work super well.
00:08:05.400 | And partly, it's because this is based
00:08:07.840 | on quite a few years of research to figure out
00:08:10.000 | what are the best ways to solve various problems along the way.
00:08:13.680 | And when people actually try using fastai,
00:08:15.760 | they're often surprised.
00:08:17.400 | So this person posted on our forum
00:08:19.640 | that they've been working in TF2 for a while,
00:08:22.360 | and for some reason they couldn't figure out,
00:08:24.800 | all of their models were suddenly working much better.
00:08:27.720 | And the answer is, basically, they're
00:08:29.240 | getting all these nice kind of curated best practices.
00:08:32.960 | And somebody else on Twitter saw that and said, yep,
00:08:36.080 | we found the same thing.
00:08:37.440 | We were trying TensorFlow, spent months tweaking,
00:08:39.680 | and then we switched to fastAI.
00:08:41.440 | A couple of days later, we were getting better results.
00:08:44.200 | So these kind of carefully curated defaults and algorithms
00:08:48.880 | and high-level APIs that do things right for you
00:08:51.360 | the first time, even for experienced practitioners,
00:08:55.080 | can give you better results faster.
00:08:59.360 | But it's actually the other pieces
00:09:00.960 | that are more, I think, interesting for a Swift
00:09:03.880 | conversation, because the deeper we
00:09:05.640 | go into how we make that work, the more stuff
00:09:09.920 | you'll see, which will be a great fit, I think, with Swift.
00:09:13.560 | So the mid-layer API is something
00:09:18.440 | which is largely new to fast--
00:09:22.840 | actually, I guess the foundation layer is new.
00:09:24.840 | So the mid-layer, I guess I'd say, is more rewritten from V1.
00:09:28.480 | And it contains some of the things that
00:09:31.240 | make those high-level APIs easy.
00:09:34.640 | One of the bits which is the most interesting
00:09:37.800 | is the training loop itself.
00:09:42.120 | And I thank Sylvain for the set of slides
00:09:44.480 | we have for the training loop.
00:09:47.080 | This is what a training loop looks like in PyTorch.
00:09:50.400 | We calculate some predictions.
00:09:52.480 | We get a loss.
00:09:53.680 | We do a backwards pass to get the gradients.
00:09:56.840 | We do an optimizer step.
00:09:58.240 | And then optionally, from time to time,
00:10:00.400 | we'll zero the gradients, depending on whether we're
00:10:02.440 | doing gradient accumulation.
00:10:07.480 | So this is what that loop looks like.
00:10:09.120 | Run the model, get the loss, do the gradients,
00:10:11.960 | step the optimizer, do that a bunch of times.
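
In code, the bare loop being described looks roughly like this (a minimal sketch, not fastai's actual implementation):

```python
import torch

def train(model, opt, loss_func, train_dl, epochs=1):
    model.train()
    for epoch in range(epochs):
        for xb, yb in train_dl:
            pred = model(xb)            # calculate predictions
            loss = loss_func(pred, yb)  # get a loss
            loss.backward()             # backward pass for the gradients
            opt.step()                  # optimizer step
            opt.zero_grad()             # zero the gradients (skip to accumulate)
```
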
00:10:16.000 | But if you want to do something interesting,
00:10:19.600 | you'll need to add something to the loop,
00:10:21.440 | like keeping track of your training
00:10:23.600 | statistics in TensorBoard or in fastprogress or whatever.
00:10:28.840 | You might want to schedule various hyperparameters
00:10:31.080 | in various different ways.
00:10:33.160 | You might want to add various different kinds
00:10:34.760 | of regularization.
00:10:37.200 | You may want to do mixed precision training.
00:10:40.320 | You may want to do GANs.
00:10:43.280 | So this is a problem because either you
00:10:45.680 | have to write a new training loop for every time
00:10:49.560 | you want to add a different tweak.
00:10:51.040 | And making all those tweaks work together
00:10:52.960 | then becomes incredibly complicated.
00:10:56.840 | Or you try and write one training loop which does
00:10:59.960 | everything you can think of.
00:11:01.120 | This is the training loop for fastAI 0.7,
00:11:03.800 | which only did a tiny subset of the things I just said
00:11:06.440 | but was getting ridiculous.
00:11:08.760 | Or you can add callbacks at each step.
00:11:12.960 | Now, the idea of callbacks has been around in deep learning
00:11:16.880 | APIs for a long time.
00:11:19.480 | But what's very different about fastAI
00:11:21.720 | is that every callback is actually a two-way callback.
00:11:25.680 | It can read absolutely everything.
00:11:27.680 | It can read gradients, parameters, data, so forth.
00:11:32.160 | And it can write them.
00:11:33.720 | So it can actually change anything at any time.
00:11:37.120 | So the callbacks are, we say, infinitely flexible.
00:11:41.720 | We feel pretty confident in that because the training loop
00:11:46.280 | in fastAI has not needed to be modified
00:11:49.680 | to do any of the tweaks that I showed you before.
00:11:53.000 | So even the entirety of training GANs can be done in a callback.
00:11:58.600 | So basically, we switch out our basic training loop
00:12:01.800 | and replace it with one with the same five steps
00:12:05.040 | but callbacks between every step.
00:12:09.920 | So that means, for example, if you want to do a scheduler,
00:12:14.680 | you can define a batch begin that
00:12:16.560 | sets the optimizer's learning rate to some function.
00:12:19.760 | Or if you want to do early stopping,
00:12:21.440 | you can write an on epoch end that checks the metrics
00:12:25.400 | and stops training.
00:12:27.360 | Or you can do parallel training, set up data parallel,
00:12:31.760 | and at the end of training,
00:12:34.960 | take data parallel off again.
00:12:37.600 | Gradient clipping, you have access
00:12:40.000 | to the parameters themselves.
00:12:41.360 | So you can clip the gradient norms
00:12:44.280 | at the end of the backward step, and so forth.
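
A hedged sketch of the two-way callback idea. The hook names echo fastai v2's style (begin_batch, after_backward, and so on), but both the names and the Learner attributes assumed here (model, dls, loss_func, opt) are illustrative rather than the library's exact API.

```python
import torch

class Callback:
    def begin_batch(self, learn): pass
    def after_backward(self, learn): pass
    def after_epoch(self, learn): pass

class GradientClip(Callback):
    "Two-way callback: it can read and modify the gradients directly."
    def __init__(self, max_norm=1.0): self.max_norm = max_norm
    def after_backward(self, learn):
        torch.nn.utils.clip_grad_norm_(learn.model.parameters(), self.max_norm)

def fit(learn, epochs, cbs):
    # the same five-step loop as before, with callbacks between every step
    for epoch in range(epochs):
        for xb, yb in learn.dls.train:
            for cb in cbs: cb.begin_batch(learn)
            loss = learn.loss_func(learn.model(xb), yb)
            loss.backward()
            for cb in cbs: cb.after_backward(learn)
            learn.opt.step(); learn.opt.zero_grad()
        for cb in cbs: cb.after_epoch(learn)
```
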
00:12:48.240 | So all of these different things are
00:12:50.880 | all things that have been written with fastAI callbacks,
00:12:55.600 | including, for example, mixed precision.
00:12:57.520 | All of NVIDIA's recommendations, mixed precision training,
00:13:03.360 | will be added automatically if you just add a to_fp16
00:13:06.680 | at the end of your learner call.
00:13:07.920 | And really importantly, for example,
00:13:13.800 | all of those mixed precision things
00:13:15.240 | can be combined with multi-GPU and one-cycle training
00:13:20.760 | and gradient accumulation and so forth.
00:13:25.080 | And so trying to create a state-of-the-art model, which
00:13:30.840 | involves combining state-of-the-art regularization
00:13:33.200 | and mixed precision and distributed training and so
00:13:36.000 | forth is a really, really, really hard job.
00:13:39.600 | But with this approach, it's actually just a single extra
00:13:42.680 | line of code to add each feature.
00:13:45.360 | And they all explicitly are designed
00:13:47.080 | to work with each other and are tested to work with each other.
00:13:51.200 | So for instance, here is mix-up data augmentation,
00:13:54.760 | which is an incredibly powerful data augmentation method that
00:13:59.320 | has powered lots of state-of-the-art results.
00:14:02.440 | And as you can see, it's under a screen of code.
00:14:05.440 | By comparison, here is the version
00:14:08.440 | of mix-up from the paper.
00:14:10.920 | Not only is it far longer, but it only
00:14:13.000 | works with one particular data set
00:14:15.320 | and one particular optimizer and is
00:14:17.960 | full of all kinds of assumptions and only one
00:14:20.080 | particular kind of metric and so forth.
00:14:24.440 | So that's an example of these mid-tier APIs.
00:14:28.560 | Another one is the optimizer.
00:14:32.800 | It looks like there have
00:14:36.760 | been lots and lots of different optimizers appearing
00:14:39.720 | in the last year or two.
00:14:43.040 | It actually turns out that they're
00:14:44.400 | all minor tweaks on each other.
00:14:47.160 | Most libraries don't write them this way.
00:14:49.360 | So for example, Adam W, also known
00:14:52.520 | as decoupled weight decay Adam, was added to PyTorch
00:14:58.560 | quite recently in the last month or two.
00:15:02.800 | And it required writing a whole new class
00:15:07.120 | and a whole new step to implement.
00:15:10.240 | And it took-- it was like two or three years
00:15:12.600 | after the paper was released.
00:15:15.040 | On the other hand, FastAI's implementation, as you can see,
00:15:19.400 | involves a single extra function containing
00:15:22.840 | two lines of code and this little bit of gray here.
00:15:25.760 | So it's kind of like two and a half, three lines of code
00:15:28.800 | to implement the same thing.
00:15:30.560 | Because what we did was we realized,
00:15:34.160 | let's refactor the idea of an optimizer,
00:15:37.600 | see what's different for each of these state of the art
00:15:41.960 | optimizers that have appeared recently,
00:15:44.120 | and make it so that each of those things
00:15:46.120 | can be added and removed by just changing two things--
00:15:51.240 | stats and steppers.
00:15:53.880 | A stat is something that you measure
00:15:58.920 | during training, such as the gradients or the gradient
00:16:01.520 | squared, or you might use dampening, or momentum,
00:16:03.840 | or whatever.
00:16:04.840 | And then a stepper is something that
00:16:07.000 | uses those stats to change the weights in some way.
00:16:11.720 | And you can combine those things together.
00:16:13.640 | And by combining these, we've been
00:16:15.320 | able to implement all these different optimizers.
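
A hedged sketch of the stats-and-steppers refactoring: each tweak is a small function applied per parameter, and composing them gives you a new optimizer. The signatures and the SimpleOptimizer class below are illustrative, not fastai's exact API.

```python
def weight_decay(p, lr, wd, **kwargs):
    "Decoupled weight decay (the AdamW tweak) as a roughly two-line stepper."
    if wd != 0: p.data.mul_(1 - lr * wd)
    return p

def sgd_step(p, lr, **kwargs):
    p.data.add_(p.grad.data, alpha=-lr)
    return p

class SimpleOptimizer:
    "Apply a list of stepper functions to every parameter on each step."
    def __init__(self, params, steppers, **hypers):
        self.params, self.steppers, self.hypers = list(params), steppers, hypers
    def step(self):
        for p in self.params:
            if p.grad is not None:
                for stepper in self.steppers: stepper(p, **self.hypers)
    def zero_grad(self):
        for p in self.params:
            if p.grad is not None: p.grad.detach_(); p.grad.zero_()

# opt = SimpleOptimizer(model.parameters(), [weight_decay, sgd_step], lr=0.1, wd=1e-2)
```
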
00:16:18.200 | So for instance, the LAMB optimizer,
00:16:27.000 | which came out of Google and was super cool at reducing
00:16:30.960 | the pre-training time from three days to 76 minutes,
00:16:34.880 | we were able to implement that in this tiny piece of code.
00:16:38.840 | And one of the nice things is that when you compare it
00:16:41.720 | to the math, it really looks almost line for line
00:16:45.960 | identical, except ours is a little bit nicer
00:16:49.040 | because we refactored some of the math.
00:16:51.560 | So it makes it really easy to do research as well,
00:16:55.160 | because you can quite directly bring the equations across
00:16:58.760 | into your code.
00:17:01.320 | Then the last of the mid-tier APIs
00:17:04.240 | is the data block API, which is something we had in version 1
00:17:10.880 | as well.
00:17:12.240 | But when we were porting that to Swift,
00:17:21.040 | we had an opportunity to rethink it.
00:17:23.040 | And actually, Alexis Gallagher in particular
00:17:26.560 | helped us to rethink it in a more idiomatic Swift way.
00:17:31.320 | And it came out really nicely.
00:17:32.960 | And so then we took the result of that
00:17:34.600 | and ported it back into Python.
00:17:36.840 | And we ended up with something that was quite a bit nicer.
00:17:40.160 | So there's been a nice interaction and interplay
00:17:42.400 | between fast AI in Python and Swift AI in Swift
00:17:47.200 | in terms of helping each other's APIs.
00:17:50.520 | But basically, the data block API
00:17:52.480 | is something where you define each of the key things
00:17:56.680 | that the program needs to know to flexibly get your data
00:18:01.120 | into a form you can put in a model.
00:18:04.160 | So it needs to know what type of data do you have,
00:18:08.120 | how do you get that data, how do you split it
00:18:11.120 | into a training set and a validation set,
00:18:13.320 | and then put that all together into a data bunch, which
00:18:15.960 | is just a simple little class.
00:18:17.520 | It's literally, I think, four lines of code, which just
00:18:20.760 | has the validation set and the training set in one place.
00:18:26.840 | So with a data block, you just say, OK, my types,
00:18:31.960 | I want to create a black and white pillow image for my x
00:18:36.000 | and a category for my y.
00:18:38.600 | And to get the list of files for those,
00:18:40.840 | I need to use this function.
00:18:43.240 | And to split those files into training and validation,
00:18:46.280 | use this function, which is looking
00:18:47.960 | at the grandparent path directory name.
00:18:52.240 | And to get the labels, use this function, which
00:18:55.800 | uses the parent directory's name.
00:18:58.840 | And so with that, that's enough to give you
00:19:02.720 | MNIST, for instance.
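
A sketch of that MNIST data block in code. The names follow the released fastai v2 API (DataBlock, PILImageBW, GrandparentSplitter, parent_label, and dataloaders in place of the older "data bunch" wording); the pre-release version in the talk may differ slightly.

```python
from fastai.vision.all import *

mnist = DataBlock(
    blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),   # x type, y type
    get_items=get_image_files,                             # how to get the files
    splitter=GrandparentSplitter(train_name="training", valid_name="testing"),
    get_y=parent_label)                                    # label from parent folder

path = untar_data(URLs.MNIST)
dls = mnist.dataloaders(path)
dls.show_batch()
```
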
00:19:04.840 | And so once you've done this, you end up with a data bunch.
00:19:08.640 | And as I mentioned before, everything has a show batch.
00:19:12.360 | So one of the nice things is it makes it very easy for you
00:19:14.840 | to look at your data regardless of whether it's
00:19:17.080 | tabular, or collaborative filtering, or vision, or text,
00:19:20.240 | or even audio.
00:19:21.600 | If it was audio, it would show you a spectrogram
00:19:24.240 | and let you play the sound.
00:19:28.120 | So you can do custom labeling with data blocks
00:19:32.880 | by using, for example, a regular expression labeler.
00:19:37.240 | You can get your labels from an external file or data frame.
00:19:42.320 | And they could be multi-labels.
00:19:44.080 | So this thing here knows it's a multi-label classification
00:19:46.800 | task.
00:19:47.600 | So it's automatically put a semicolon between each label.
00:19:51.800 | Again, it's still basically just three lines of code
00:19:54.800 | to define the data block.
00:19:57.440 | So here's a data block for segmentation.
00:19:59.600 | And you can see, really, the only thing I had to change here
00:20:02.640 | was that my dependent variable has been changed from category
00:20:05.680 | to pillow mask.
00:20:09.000 | And again, automatically, I show batch works.
00:20:11.640 | And we can train a model from that straight away as well.
00:20:15.800 | You could do key points.
00:20:17.400 | So here, I've just changed my dependent variable
00:20:19.440 | to tensor point.
00:20:20.600 | And so now, it knows how to behave with that.
00:20:23.440 | Object detection.
00:20:24.960 | So now, I've changed my dependent variable to bounding box.
00:20:27.600 | And you can see, I've got my bounding boxes here.
00:20:31.480 | Text, and so forth.
00:20:36.040 | So actually, going back, I have a couple of questions
00:20:38.560 | if you're--
00:20:39.160 | OK, time.
00:20:40.280 | Yeah.
00:20:41.000 | So the code, you've got sort of the x's and y's.
00:20:47.680 | And these sounds like these different data types roughly
00:20:51.240 | conform to a protocol.
00:20:53.800 | We're going to get to that in a moment.
00:20:55.200 | Absolutely.
00:20:56.200 | Fantastic.
00:20:58.720 | That's an excellent way to think of it.
00:21:00.240 | And actually, this is the way it looked about three weeks ago.
00:21:02.880 | Now, it looks even more like a protocol.
00:21:04.520 | So yes, this is where it all comes from,
00:21:09.880 | which is the foundation APIs.
00:21:12.000 | And this is the bit that I think is the most relevant to Swift.
00:21:14.600 | A lot of this, I think, would be a lot easier to write in Swift.
00:21:21.200 | So the first thing that we added to PyTorch
00:21:25.800 | was object-oriented tensors.
00:21:28.760 | For too long, we've all been satisfied
00:21:31.840 | with a data type called tensor, which has no semantics to it.
00:21:37.720 | And so those tensors actually represent something
00:21:40.720 | like a sentence, or a picture of a cat,
00:21:44.480 | or a recording of somebody saying something.
00:21:49.120 | So why can't I take one of those tensors
00:21:51.040 | and say dot flip, or dot rotate, or dot resample,
00:21:55.360 | or dot translate to German?
00:21:58.280 | Well, the answer is you can't, because it's just
00:22:01.720 | a tensor without a type.
00:22:04.080 | So we have added types to tensors.
00:22:08.680 | So you can now have a tensor image, a tensor point,
00:22:11.920 | a tensor bounding box.
00:22:13.720 | And you can define a flip left, right for each.
00:22:16.560 | And so this is some of the source code from--
00:22:18.480 | we've written our own computer vision library,
00:22:20.920 | so that now you can say flip LR, and it flips the puppy.
00:22:26.800 | And if it was key points, it would flip the key points.
00:22:30.640 | If it was a bounding box, it would flip the bounding boxes,
00:22:33.560 | and so forth.
00:22:34.880 | So this is an example of how tensors which carry around
00:22:39.240 | semantics are nice.
00:22:40.360 | It's also nice that I can just say dot show, right?
00:22:43.280 | So dot show is something that's defined for all fast AIV2
00:22:48.560 | tensor types.
00:22:49.680 | And it will just display that tensor.
00:22:53.400 | It could even be a tuple containing a tensor,
00:22:56.560 | and some bounding boxes, and some bounding box classes.
00:22:59.080 | Whatever it is, it will be able to display it.
00:23:02.720 | It will be able to convert it into batches for modeling,
00:23:07.400 | and so forth.
00:23:10.480 | So with that, we can now create, for example,
00:23:14.800 | a random transformation called flip item.
00:23:18.200 | And we can say that the encoding of that random transformation
00:23:21.640 | is defined for a pillow image or any tensor type.
00:23:26.680 | And in each case, the implementation
00:23:28.920 | is simply to call x dot flip LR.
00:23:31.480 | Or we could do the dihedral symmetry transforms
00:23:34.400 | in the same way.
00:23:36.280 | In before_call, grab a random number between 0 and 7
00:23:40.000 | to decide which of the eight transposes to do.
00:23:45.560 | And then in encodes, call x dot dihedral
00:23:48.720 | with that thing we just got.
00:23:51.200 | And so now we can call that transform a bunch of times.
00:23:55.480 | And each time, we'll get back a different random augmentation.
00:23:59.000 | So a lot of these things become nice and easy.
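
A hedged sketch of the idea in plain PyTorch: the class names echo fastai v2's TensorImage/TensorPoint and its type-dispatched transforms, but the bodies are simplified, and it assumes a recent PyTorch where Tensor subclasses are preserved through operations.

```python
import torch, random

class TensorImage(torch.Tensor):
    def flip_lr(self): return self.flip(-1)            # flip the width axis

class TensorPoint(torch.Tensor):
    def flip_lr(self):
        out = self.clone(); out[..., 0] = -out[..., 0]  # mirror x-coordinates
        return out

class FlipItem:
    "Random flip that dispatches on the input's own type via its flip_lr."
    def __init__(self, p=0.5): self.p = p
    def __call__(self, x):
        return x.flip_lr() if random.random() < self.p else x

img = torch.rand(3, 8, 8).as_subclass(TensorImage)
print(type(FlipItem(p=1.0)(img)).__name__)   # TensorImage, not a bare Tensor
```
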
00:24:01.480 | Hey, Jeremy.
00:24:02.560 | Maxim asked, why isn't tensor a backing data structure
00:24:05.480 | for an image type?
00:24:06.320 | Tensor image is a tensor, which is an image type.
00:24:13.200 | Why isn't-- he says, why isn't tensor a backing--
00:24:17.960 | why not have a different type named image,
00:24:20.200 | I guess, that has a tensor inside of it?
00:24:23.680 | Do you mean why inherit rather than compose?
00:24:28.240 | Apparently, yes, that.
00:24:29.720 | Yeah.
00:24:31.360 | So inheritance-- I mean, you can do both.
00:24:36.200 | And you can create identical APIs.
00:24:38.480 | Inheritance just has the benefit that all the normal stuff you
00:24:41.760 | can do with a tensor, you can do with a tensor that
00:24:43.640 | happens to be an image.
00:24:44.960 | So just because a tensor is an image
00:24:46.600 | doesn't mean you now don't want to be able to do fancy indexing
00:24:49.440 | to it, or do an LU decomposition of it,
00:24:54.120 | or stack it with other tensors across some axis.
00:24:57.920 | So basically, a tensor image ought
00:25:03.400 | to have all the behavior of a tensor
00:25:05.120 | plus additional behavior.
00:25:06.800 | So that's why we use inheritance.
00:25:09.720 | We have a version that uses composition as well,
00:25:11.680 | and it uses Python's nice get atra functionality
00:25:16.240 | to pass on all of the behavior of tensor.
00:25:23.240 | But it comes out more nicely in Python
00:25:25.680 | when you do inheritance.
00:25:27.240 | And actually, the PyTorch team has
00:25:29.280 | decided to officially implement semantic tensor subtypes now.
00:25:35.080 | And so hopefully, in the next version of PyTorch,
00:25:37.320 | you won't have to use the extremely ugly hacks
00:25:40.320 | that we had to use to make this work.
00:25:43.560 | You'll be able to use the real ones.
00:25:46.080 | And hopefully, you'll see in TorchVision
00:25:48.600 | some of these ideas will be brought over there.
00:25:51.760 | Can I ask you, so how does the type propagate?
00:25:55.760 | So if you do arithmetic on an image tensor,
00:25:58.160 | do you get an image tensor back there?
00:26:00.320 | So Chris and I had a conversation about this a few months ago,
00:26:05.280 | and I said I'm banging my head around this issue of types
00:26:10.360 | not carrying around their behavior.
00:26:12.160 | And Chris casually mentioned, oh, yes, that thing
00:26:14.560 | is called higher-kinded types.
00:26:16.320 | So I went home, and that was one of those phrases
00:26:19.120 | I thought only functional programming dweebs talked
00:26:22.000 | about, and that I would never care about.
00:26:24.600 | And we have to care about it, because it actually
00:26:26.720 | matters a lot.
00:26:27.320 | And it's basically the idea that if you have a tensor image
00:26:30.160 | and you add one to it, you want to get back a tensor image,
00:26:33.920 | because it should be an image that's a bit brighter
00:26:36.320 | rather than something that loses its type.
00:26:38.960 | So we implemented our own, again,
00:26:42.280 | hacky, partial, higher kind of type implementation
00:26:45.520 | in FastAIV2. So any of these things
00:26:49.160 | that you do to a tensor of a subtype,
00:26:53.400 | you will nearly always get back the correctly subtype tensor.
00:26:58.680 | Yeah, I mean, I saw the PyTorch recently sort of talking
00:27:01.600 | about their named indexing extensions for their tensors
00:27:06.680 | as well, and I assume they have a similar kind of challenge
00:27:09.320 | there, where when you start doing arithmetic
00:27:11.720 | and other things like that on a tensor that
00:27:13.640 | has named dimensions, you want to propagate those along.
00:27:18.240 | Yeah, so we haven't started using that yet,
00:27:22.400 | because it hasn't quite landed at stable.
00:27:25.640 | But yeah, we talked to the PyTorch team at the DevCon,
00:27:30.520 | and we certainly are planning to bring these ideas together.
00:27:35.960 | They're orthogonal but related concerns.
00:27:38.040 | Yeah, I just mean that I assume that that feature has
00:27:41.520 | the same problem, the same challenge.
00:27:43.120 | I assume so, yeah.
00:27:44.560 | So it would be interesting to see what they do.
00:27:48.240 | Yeah, yeah, it would.
00:27:50.040 | Yeah, so it's kind of nice.
00:27:55.400 | Not only do we get to be able to say .show batch,
00:27:58.320 | but you can even go .show results.
00:28:00.840 | And in this case, it knows what the independent variables type
00:28:05.240 | is, it knows what the dependent variables type is,
00:28:07.720 | and it even knows things like, hey, for a classification task,
00:28:10.840 | those two things should be the same.
00:28:12.360 | If they're not, by default, it will highlight them in red.
00:28:15.080 | So these lower-level foundations are
00:28:18.240 | the things that drive our ability to easily add
00:28:20.680 | this higher-level functionality.
00:28:24.920 | So this is the kind of ugly stuff
00:28:28.040 | we wouldn't have to do in Swift.
00:28:30.240 | We had to write our own type dispatch system.
00:28:34.280 | We can annotate things with types,
00:28:36.040 | and those type annotations are actually semantic.
00:28:38.720 | And so we now have the joyfully modern idea of function
00:28:45.080 | overloading in Python, which has made life a lot easier,
00:28:48.560 | and you already have that in Swift.
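
A small sketch of the kind of annotation-driven dispatch being described, using fastcore's typedispatch decorator; the describe function is made up for illustration, and the usage follows the released library, which may differ slightly from the pre-release code shown.

```python
from fastcore.dispatch import typedispatch

@typedispatch
def describe(x): return f"something of type {type(x).__name__}"

@typedispatch
def describe(x: int): return f"an integer: {x}"

@typedispatch
def describe(x: str): return f"a string: {x!r}"

print(describe(3))       # an integer: 3
print(describe("cat"))   # a string: 'cat'
print(describe(2.5))     # falls back to the untyped version
```
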
00:28:51.800 | Do you have many users that are using this yet?
00:28:55.360 | Or is it still out of reference?
00:28:56.840 | It's still pre-released.
00:28:57.920 | It's not even alpha.
00:28:59.960 | But there is a enthusiastic early adopter community
00:29:05.920 | who is using it.
00:29:08.000 | So for example, the user-contributed audio library
00:29:12.440 | has already been ported to it.
00:29:14.960 | I've also built a medical imaging library on top of it,
00:29:17.280 | and I've written a series of five notebooks showing how
00:29:19.960 | to do CT scan analysis with it.
00:29:23.480 | So it's kind of like it works.
00:29:28.680 | And--
00:29:30.880 | I was curious what your users think of it,
00:29:33.320 | because there's this very strongly-held conception
00:29:36.440 | that Python folks hate types.
00:29:39.640 | And you're kind of providing a little bit
00:29:41.760 | of typing in the world, and I'm curious how they react to that.
00:29:46.040 | The extremely biased subset of early-adopter fastai
00:29:49.960 | enthusiasts who are using it love it.
00:29:53.120 | And they tend to be people who have gone pretty deep
00:29:56.120 | in the past.
00:29:57.040 | So for example, my friend Andrew Shaw,
00:29:59.920 | who wrote something called Music Autobot, which
00:30:02.240 | is one of the coolest things in the world,
00:30:03.920 | in case you haven't seen it yet, which is something
00:30:07.120 | where you can generate music using a neural network.
00:30:11.120 | You can put in some melodies and some chords,
00:30:13.480 | and it will auto-complete some additional melodies and chords.
00:30:16.520 | Or you can put it in a melody, and it will automatically
00:30:19.360 | add chords, or you can add chords that create melody.
00:30:23.960 | And so he had to write his own MIDI library, fastai.midi.
00:30:29.200 | He rewrote it in V2, and he said it's just like so, so,
00:30:33.640 | so much easier, thanks to those mid-tier APIs.
00:30:38.160 | So yeah, at this stage, it's easy as to--
00:30:41.560 | I was just going to jump in quick.
00:30:44.720 | I've been helping with some of the audio stuff,
00:30:47.120 | and it's been really awesome.
00:30:51.240 | So it makes things a lot more flexible than version 1.
00:30:56.440 | So that's probably my favorite thing about it,
00:30:58.800 | is everything can be interchanged.
00:31:02.360 | Nothing is like, well, it's got to be this way,
00:31:04.640 | because that's how it is.
00:31:06.680 | Yeah, that's cool.
00:31:08.520 | Cool, thanks.
00:31:10.800 | Another piece of the foundation
00:31:14.880 | is the partially reversible composed function
00:31:18.080 | pipeline dispatched over collections, which
00:31:20.760 | really rolls off the tongue, so we call them Transform and Pipeline.
00:31:24.920 | Basically, the idea is that the way
00:31:29.480 | you kind of want function dispatch to work
00:31:35.600 | and function composition to work in deep learning
00:31:39.240 | is a little different to other places.
00:31:42.160 | There's a couple of things.
00:31:43.560 | The first is you often want to dispatch over tuples.
00:31:47.200 | And what I mean by that is if you have a function called
00:31:52.760 | flip left right, and you have a tuple representing
00:31:59.200 | a mini batch where your independent variable is
00:32:01.640 | a picture and your dependent variable
00:32:03.480 | is a set of bounding boxes, if you say flip left right
00:32:07.160 | on that tuple, you would expect both the x and the y
00:32:11.680 | to be flipped and to be flipped with the type appropriate
00:32:16.360 | method.
00:32:17.760 | So our transforms will automatically
00:32:21.400 | send each element of a tuple to the function separately
00:32:26.000 | and/or dispatch according to their types automatically.
00:32:30.720 | We've mentioned type retention, so the kind of basic
00:32:33.120 | higher-kinded type stuff we need.
00:32:37.720 | One interesting thing is not only encoding,
00:32:40.440 | so in other words, applying the function,
00:32:43.880 | you often need to be able to decode,
00:32:45.600 | which is to deapply the function.
00:32:48.480 | So for example, a categorization transform
00:32:51.160 | would take the word dog and convert it to the number 1,
00:32:55.520 | perhaps, which is what you need for modeling.
00:32:58.800 | But then when your predictions come back,
00:33:00.480 | you need to know what 1 represents.
00:33:02.720 | So you need to reverse that transform and turn 1 back
00:33:06.520 | into dog.
00:33:08.120 | Often those transforms also need data driven setup.
00:33:12.320 | For example, in that example of dog becoming 1,
00:33:16.000 | there needs to be something that actually creates that vocab
00:33:18.360 | automatically, recognizing what are all the possible classes,
00:33:21.520 | so it can create a different index for each one
00:33:24.880 | and then apply that to the validation set.
00:33:28.240 | And quite often these transforms also
00:33:30.080 | have some kind of state, such as the vocab.
00:33:35.520 | So we built this bunch of stuff that
00:33:37.920 | builds on top of each other.
00:33:39.040 | At the lowest level is a class called transform,
00:33:41.920 | which is a callable, which also has a decode,
00:33:48.720 | does the type retention, higher-kinded type thing,
00:33:51.400 | and does the dispatch over tuples by default.
00:33:54.720 | So then a pipeline is something that does function composition
00:33:57.400 | over transforms.
00:34:00.120 | And it knows about, for example, setting up transforms.
00:34:06.000 | And setting up transforms in a pipeline
00:34:08.400 | is a bit tricky because you have to make sure
00:34:11.560 | that at each level of the pipeline,
00:34:14.040 | only the previous steps have been applied
00:34:17.000 | before you set up the next step.
00:34:19.200 | So it does little things like that.
00:34:21.600 | And then we have something that applies a pipeline
00:34:23.760 | to a collection to give you an indexable, lazily transformed
00:34:28.000 | collection.
00:34:29.080 | And then you can do those in parallel
00:34:31.040 | to get back an independent variable, for instance.
00:34:35.600 | And then finally, we've built a data loader,
00:34:40.120 | which will apply these things in parallel
00:34:45.360 | and create collated batches.
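
A hedged, plain-Python sketch of the Transform/Pipeline idea: a transform has an encode (apply), an optional decode (un-apply), and a data-driven setup, and a pipeline composes transforms and sets each one up on data with only the previous steps applied. fastai's real classes add the tuple dispatch and type retention discussed above; the code here only shows the shape of the design.

```python
class Categorize:
    def setup(self, items):                     # data-driven setup: build the vocab
        self.vocab = sorted(set(items))
        self.o2i = {o: i for i, o in enumerate(self.vocab)}
    def encode(self, o):  return self.o2i[o]    # "dog" -> 1
    def decode(self, i):  return self.vocab[i]  # 1 -> "dog"

class SimplePipeline:
    def __init__(self, tfms): self.tfms = tfms
    def setup(self, items):
        # set up each transform on data that has only the previous steps applied
        for t in self.tfms:
            t.setup(items)
            items = [t.encode(o) for o in items]
    def encode(self, o):
        for t in self.tfms: o = t.encode(o)
        return o
    def decode(self, o):
        for t in reversed(self.tfms): o = t.decode(o)
        return o

cat = Categorize(); cat.setup(["cat", "dog", "dog"])
print(cat.encode("dog"), cat.decode(1))   # 1 dog
```
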
00:34:47.520 | So in the end, all this stuff makes a lot of things
00:34:54.560 | much easier.
00:34:55.720 | For example, the language model data loader in Fast AI v1
00:35:00.040 | was like pages of code.
00:35:02.520 | In TensorFlow, it's pages of code.
00:35:05.440 | In Fast AI v2, it's less than a screen of code
00:35:08.480 | by leveraging these powerful abstractions and foundations.
00:35:16.400 | So then finally-- and again, this is something
00:35:19.320 | I think Swift will be great for--
00:35:22.160 | we worked really hard to make everything extremely well
00:35:25.280 | optimized.
00:35:26.280 | So for example, for preprocessing in natural language processing,
00:35:30.200 | we created a parallel generator in Python, which you can then
00:35:35.840 | basically pass a class to that defines a setup and a call.
00:35:39.600 | And it can automatically parallelize that.
00:35:44.200 | So for example, tokenization is done in parallel
00:35:48.520 | in a pretty memory efficient way.
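
A hedged sketch of that pattern using the standard library; fastai's own "parallel generator" helper (which you pass a class with some setup and a __call__) wraps roughly this idea in a more memory-efficient, streaming way.

```python
from concurrent.futures import ProcessPoolExecutor

class SimpleTokenizer:
    def __init__(self):
        # per-worker setup would go here (e.g. loading a spaCy model)
        pass
    def __call__(self, text):
        return text.lower().split()

def tokenize_all(texts, n_workers=4):
    tok = SimpleTokenizer()
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        return list(ex.map(tok, texts, chunksize=64))

# tokenize_all(["A first document.", "And a second one."])
```
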
00:35:50.000 | But perhaps the thing I'm most excited about,
00:35:57.640 | both in Python and Swift, is the optimized pipeline
00:36:03.520 | running on the GPU.
00:36:04.960 | So pretty much all of the transforms we've done can
00:36:11.640 | and by default do run on the GPU.
00:36:14.880 | So for example, when you do the flip left right
00:36:17.720 | I showed you earlier, it'll actually run on the GPU,
00:36:20.920 | as will warp, as will zoom, as will even things like crop.
00:36:26.960 | So one of the basics of this is the affine coordinate transform,
00:36:33.360 | which uses affine grid and grid sample,
00:36:36.280 | which are very powerful PyTorch functions, which
00:36:40.560 | would be great things to actually write in script
00:36:44.480 | for TensorFlow's new meta programming,
00:36:46.920 | because they don't exist in TensorFlow,
00:36:50.320 | or at least not in any very complete way.
00:36:53.400 | But with these basic ideas, we can
00:36:55.800 | create this affine coordinate transform
00:36:57.480 | that lets us do a very wide range of data augmentations
00:37:01.680 | in parallel on the GPU.
00:37:03.840 | For those of you that know about the DALI library
00:37:06.440 | that NVIDIA has created, this provides
00:37:08.880 | a lot of the same benefits of DALI.
00:37:11.680 | It's pretty similar in terms of its performance.
00:37:14.280 | But the nice thing is, all the stuff you write,
00:37:17.000 | you write it in Python, not in CUDA.
00:37:19.720 | So with DALI, if they don't have the exact transformation
00:37:24.160 | you want, and there's a pretty high chance that they won't,
00:37:27.920 | then you're stuck.
00:37:29.000 | Or else with fast AI v2, you can write your own
00:37:33.120 | in a few lines of Python.
00:37:34.680 | You can test it out in a Jupyter Notebook.
00:37:37.920 | It makes life super easy.
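
A hedged sketch of the affine_grid/grid_sample pattern being described, as a standalone example: build a random rotation matrix per image in the batch, then warp the whole batch on the GPU in one call. fastai's AffineCoordTfm composes several such matrices (rotate, zoom, warp, flip) before a single grid_sample; this is not that code.

```python
import math
import torch
import torch.nn.functional as F

def random_rotate_batch(x, max_deg=10.0):
    "x: a (bs, c, h, w) batch already on the GPU; rotate each image by a random angle."
    bs = x.shape[0]
    theta = (torch.rand(bs, device=x.device) * 2 - 1) * max_deg * math.pi / 180
    cos, sin = theta.cos(), theta.sin()
    mat = torch.zeros(bs, 2, 3, device=x.device)     # per-image 2x3 affine matrices
    mat[:, 0, 0], mat[:, 0, 1] = cos, -sin
    mat[:, 1, 0], mat[:, 1, 1] = sin, cos
    grid = F.affine_grid(mat, x.shape, align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

# x = torch.rand(64, 3, 224, 224, device="cuda"); x_aug = random_rotate_batch(x)
```
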
00:37:41.480 | So this kind of stuff, I feel like because Swift
00:37:46.360 | is a much faster, more hackable language than Python,
00:37:52.800 | or at least hackable in the sense of performance,
00:37:55.800 | I guess not as hackable in terms of its type system necessarily,
00:37:59.640 | I feel like we can build even more powerful foundations
00:38:07.040 | and pipelines and a real Swift for TensorFlow computer vision
00:38:13.360 | library, leveraging the metaprogramming
00:38:17.000 | and leveraging Swift numerics.
00:38:19.800 | Stuff like that, I think, would be super cool.
00:38:24.440 | And so that is the end of that.
00:38:28.840 | That was great.
00:38:29.600 | That was excellent.
00:38:30.680 | Thank you very much, Jeremy.
00:38:32.320 | My pleasure.
00:38:34.960 | So just sort of thinking through,
00:38:37.120 | so as you're propagating along the self-type
00:38:39.960 | amongst the transformations, that
00:38:41.560 | seems relatively straightforward for Swift to handle.
00:38:44.120 | Are there other sorts of things that you think
00:38:46.520 | we should start thinking about now?
00:38:49.040 | Yeah, the thing I really want you to think about,
00:38:51.240 | and we've kind of been nagging you on and off since March,
00:38:53.720 | is the way that tensors are represented.
00:38:58.760 | Having them as a value type the way they are now
00:39:02.760 | makes some things hard or impossible.
00:39:05.560 | So the generic optimizer is a thing
00:39:07.760 | that I really, really want you guys to look into and build
00:39:12.040 | properly.
00:39:12.520 | Currently, it uses ugly key path hacks,
00:39:16.680 | and it's only kind of partially doing what we need it to do.
00:39:20.280 | So I talked to Alexis about this idea quite a bit,
00:39:25.560 | and we kind of thought maybe there
00:39:28.480 | could be some type that represents the actual block
00:39:34.760 | of GPU memory in a way where we can easily share that.
00:39:39.680 | In practice, we've realized the vast majority of the time,
00:39:45.360 | we want to refer to that exact piece of memory on the GPU,
00:39:49.760 | not this idea of a tensor which may magically copy itself
00:39:54.040 | if I change something.
00:39:56.560 | And so, for example, with the generic optimizer,
00:39:59.320 | we need to be able to say, oh, this layer is
00:40:02.440 | part of this layer group, and this layer group
00:40:04.840 | has these things that need to happen to it.
00:40:08.760 | So I actually said to Ed, hey, could you please
00:40:12.800 | have a look at the Swift AI generic optimizer,
00:40:15.480 | because it's trying to be a similar design to the fast AI
00:40:20.480 | V2 optimizer, but it's currently pretty unattractive.
00:40:26.120 | The second is I feel like creating a really good computer
00:40:31.240 | vision library is something which could be done now-ish.
00:40:35.800 | When I tried to do it, I was getting kind of race conditions
00:40:41.960 | and freezes inside Swift, and I don't have the Swift skills
00:40:45.400 | to know where they were coming from or how to fix them.
00:40:48.040 | It would be nice if folks could like--
00:40:50.480 | I think all of my answers is, go back to the stuff
00:40:52.680 | that we all built together back in March, April, May,
00:40:57.320 | and try to start using it in real life,
00:41:00.320 | and build models with it, and put them in production,
00:41:04.400 | and see the bits where it hits where you get stuck,
00:41:08.080 | because you'll find things like, oh, there's no grid sample,
00:41:11.320 | and, oh, there's race conditions in the interaction of OpenCV,
00:41:17.120 | and the optimizer doesn't quite work properly, and that stuff.
00:41:26.200 | That makes sense.
00:41:28.000 | I think we're also trying to figure out right now what
00:41:30.960 | the right path is with the runtime.
00:41:33.520 | So we've historically been building
00:41:35.400 | on top of the TensorFlow runtime, which
00:41:37.240 | is great for a lot of reasons.
00:41:38.960 | It has a lot of functionality in the box.
00:41:41.840 | It does pretty much everything.
00:41:45.000 | On the other hand, the performance, particularly in
00:41:46.920 | eager mode, is not great.
00:41:48.880 | So I think one of the things we're kicking around
00:41:50.760 | is the idea of going more directly into XLA.
00:41:54.320 | Yeah.
00:41:55.200 | Well, I think that's a thing that's been--
00:41:56.760 | And XLA being a stepping stone towards MLIR
00:42:00.680 | in the bigger future, which is also coming.
00:42:02.720 | I think that's the thing that's been stopping us all
00:42:05.000 | from using stuff like Swift AI to actually build models,
00:42:08.480 | because the auto diff has memory leaks,
00:42:11.520 | and the TensorFlow runtime is--
00:42:14.240 | I don't have to be polite, since I'm not at Google.
00:42:16.040 | So it's molasses.
00:42:17.240 | And it implements everything in six different ways
00:42:20.320 | in six different places, and so forth.
00:42:22.120 | So yeah, I think everybody's going
00:42:24.640 | to be digging into these higher level APIs a lot more
00:42:27.640 | once the foundations are where they're at.
00:42:29.920 | Yeah, and so the trade-off there is
00:42:32.600 | if we go with that direction now,
00:42:34.240 | XLA doesn't provide all the things in the box.
00:42:37.080 | But I think that's probably fine.
00:42:39.760 | We can just add
00:42:41.120 | the stuff that we need.
00:42:45.080 | And so I think we're talking about that,
00:42:46.640 | trying to decide what to do there.
00:42:48.320 | We're also investing a lot in AD and finishing that off.
00:42:51.040 | Yeah, I mean, all the right work is being done.
00:42:53.640 | It's just, you know, it's just early days.
00:42:57.360 | Yes, I think the challenge that we're really struggling with
00:42:59.920 | is this decision to stick with the TensorFlow runtime
00:43:02.600 | or to move on to something else.
00:43:06.840 | That, I think, is complicated, but I
00:43:09.520 | agree this is one of the major blockers for adoption of use.
00:43:14.520 | Yeah.
00:43:15.160 | I mean, especially if you want to take advantage of Swift,
00:43:18.440 | which we do, you need something where the kernel launch
00:43:25.560 | time is tiny or better still kind of non-existent
00:43:29.120 | because you can write everything in Swift.
00:43:31.120 | Otherwise, it's-- yeah, you don't really get the benefits.
00:43:35.080 | Yeah, and one of the--
00:43:36.240 | so I'll say I'll answer your question in a second.
00:43:38.640 | But one of the trade-offs there is
00:43:40.520 | that XLA doesn't have really fast kernel launch time
00:43:42.760 | because it effectively JIT compiles things
00:43:45.600 | before launching it.
00:43:47.280 | On the other hand, there are a lot of opportunities
00:43:51.040 | to do, for example, Fusion and other things like that
00:43:54.080 | that can offset it.
00:43:55.200 | And one of the nice hybrid models you get
00:43:57.960 | is this combination of tracing plus compilation, which
00:44:02.000 | I think could be really interesting.
00:44:03.400 | Yeah.
00:44:03.920 | [INAUDIBLE]
00:44:05.320 | Said asked what's going on with MLIR.
00:44:07.200 | There's tons of stuff going on.
00:44:08.480 | It's really exciting.
00:44:09.840 | Just yesterday, there was a really fantastic talk
00:44:11.840 | from some folks at Intel talking about their code generation
00:44:15.040 | algorithm that are bringing over to MLIR, which I'm really,
00:44:17.840 | really, really excited about.
00:44:20.120 | And so there's tons of stuff going on.
00:44:23.520 | Getting the ideal code gen for NVIDIA GPUs, for example,
00:44:26.960 | is probably still six plus months away.
00:44:29.400 | And I don't know how much plus that is.
00:44:32.400 | But what I'm encouraging is the community
00:44:35.200 | to come together and collaborate instead
00:44:36.880 | of the different teams and the different companies
00:44:39.800 | like kind of being in front of me.
00:44:42.480 | And the Intel stuff that they presented yesterday
00:44:44.960 | is super, super impressive.
00:44:48.160 | So we'll see what happens with that.
00:44:50.480 | The other thing I might--
00:44:51.520 | [INTERPOSING VOICES]
00:44:53.480 | The other thing I might mention in terms of tails
00:44:56.040 | on the other side, what's life like in the Python world,
00:44:59.560 | things that are and aren't working well over there.
00:45:04.040 | The kind of the answer to Swift for TensorFlow in the PyTorch
00:45:07.880 | world is JIT.
00:45:10.520 | So it's basically to trace your Python code
00:45:15.360 | and attempt to figure out what it's doing
00:45:17.280 | and create what they call TorchStrip, which
00:45:19.160 | is a dialect of subset of Python or else to actually parse.
00:45:24.760 | Your Python code is also an option
00:45:26.360 | and turn it into TorchStrip.
00:45:29.240 | It has reached the point now where it can actually
00:45:32.320 | be used for good.
00:45:34.160 | So one of our students created--
00:45:38.520 | a bunch of our students actually have been working on a thing
00:45:41.360 | called Mesh, including a young researcher who
00:45:45.080 | designed the original thing.
00:45:46.320 | It's a very nice activation function
00:45:48.560 | that's about performing everything else
00:45:50.160 | that anybody is trying it on.
00:45:51.560 | And it was pretty slow.
00:45:53.560 | And when we just took me half an hour to create a JIT version
00:45:58.160 | and it ran at the same speed as somebody else's
00:46:01.080 | hand-created CUDA code.
00:46:03.120 | So for small things like that, where it's
00:46:05.280 | two or three lines of code, that's working pretty well.
00:46:09.160 | Although for bigger things, like a new batch norm implementation
00:46:12.480 | we tried to do during the last course,
00:46:15.440 | the performance wasn't there.
00:46:17.320 | Or if we actually tried to take--
00:46:19.480 | one of the big problems at the moment,
00:46:22.400 | not just for Python, but the whole world of non-Google
00:46:26.280 | people, is that the best computer vision models by far
00:46:29.960 | are largely those that are coming out of Google,
00:46:32.160 | like EfficientNets and MixNets, from Quoc Le's team.
00:46:36.160 | They run very slowly and with a lot of memory on GPUs.
00:46:41.040 | And so we tried wrapping an entire EfficientNet
00:46:44.840 | and MixNet into a JIT-ed thing, so it wouldn't be so slow.
00:46:49.080 | The MixNet didn't work at all, and the EfficientNet
00:46:51.840 | was a little bit slower.
00:46:53.840 | So that's kind of the status of JIT in PyTorch
00:46:57.480 | is bits of it are useful.
00:47:00.720 | The way I look at this from the compiler-y code generation
00:47:03.960 | piece is that I think the MLIR pieces are all
00:47:06.080 | going the right direction.
00:47:07.080 | They're just going to take a while to get here.
00:47:10.240 | XLA, as far as I know, is state of the art and code generation.
00:47:14.520 | For the things it does, it does quite well.
00:47:17.080 | The challenge of those, it does have sort of limitations
00:47:19.800 | like static shapes and the number of office supports.
00:47:23.280 | You kind of have to be within its world for it to be useful.
00:47:27.120 | But it has a very useful--
00:47:28.320 | it has a large subset of the world
00:47:29.680 | that it covers very well.
00:47:30.760 | It has a pretty useful world.
00:47:32.240 | It has a pretty useful world.
00:47:34.680 | TorchScript, my understanding is that the base model
00:47:39.320 | of TorchScript and the interpreters they have,
00:47:42.920 | I understand that's quite nice.
00:47:46.000 | But the kernel fusion piece is still fairly early
00:47:48.480 | when it's mostly on-wise operations, for example.
00:47:50.920 | I don't find them that quite nice.
00:47:52.480 | I mean, simple things like--
00:47:54.960 | they're partly a limitation of the Python type system.
00:47:59.200 | So you want to be able to write things
00:48:01.800 | that can work with different numbers of channels
00:48:03.720 | while you're out of luck because they use Python type
00:48:06.680 | limitations, which have no way of saying it's
00:48:09.160 | a tuple of size n.
00:48:10.720 | You have to say it's a tuple of size 3.
00:48:12.680 | So then you have to hard code all these assumptions
00:48:14.680 | into your code.
00:48:16.240 | Lots of stuff I find pretty frustrating.
00:48:18.880 | I see.
00:48:19.400 | Interesting.
00:48:19.920 | Well, so I mean, I think there's other spaces
00:48:21.880 | that I'm eager to reevaluate as--
00:48:25.320 | I mean, this isn't the highest priority at this moment.
00:48:28.040 | But in terms of our APIs, there's
00:48:29.880 | still very legit questions around,
00:48:31.840 | should we encode d-type in the static type system?
00:48:34.560 | Or should we just say tensor?
00:48:36.960 | If you just say tensor, then you get rid of all the generics
00:48:39.600 | everywhere, cleans up tons of code
00:48:43.040 | at the cost of losing some of the checking.
00:48:46.120 | But then I think if you go with more semantic tensor types
00:48:49.080 | that Jerry was pushing forward, you actually
00:48:51.000 | really don't even want the d-type.
00:48:52.440 | What you want is the semantics, and that you're actually
00:48:54.800 | in a better spot.
00:48:55.600 | Right.
00:48:56.240 | Like for mixed precision, we're switching stuff from one type
00:48:59.320 | to another all the time.
00:49:01.080 | Depending on whether you're doing a loss function
00:49:03.080 | or a gradient calculation or whatever,
00:49:04.640 | you need to be changing between half and single.
00:49:08.120 | So if we went that direction, I think
00:49:09.720 | that would be really interesting in terms of ergonomics,
00:49:13.520 | but also simplification, which I think would be great.
00:49:17.120 | Your point about the optimizer is that the key paths
00:49:19.920 | have all kinds of weirdness because you have multiple d-types
00:49:22.720 | and you want to be generic over d-type.
00:49:24.400 | And so that's really unpleasant right now.
00:49:27.360 | Yeah.
00:49:28.040 | I think also for Swift wanting to bring over
00:49:33.560 | a big world of Python using data scientists,
00:49:39.160 | they're definitely not going to be
00:49:41.240 | wanting to put lots and lots of verbose generic type
00:49:45.560 | annotations in their Jupyter notebooks.
00:49:48.320 | Yeah.
00:49:50.560 | So I don't know when we'll have cycles
00:49:52.800 | to re-evaluate those APIs, but I think
00:49:55.360 | we should go do a fresh take of this
00:49:59.000 | and combine it with an XLA-based approach that
00:50:01.800 | changes a lot of the trade-offs.
00:50:04.520 | Right.
00:50:05.560 | So it would be really interesting.
00:50:07.480 | Yeah.
00:50:07.960 | I mean, I think in my mind, right,
00:50:09.560 | so a couple of weeks ago, I presented the layering proposal
00:50:12.440 | to separate out libtensor from libdeep learning
00:50:16.200 | so that we can then get the freedom to then iterate
00:50:19.240 | at that level and have multiple explorations on top.
00:50:23.280 | So the progress update on there is that I started--
00:50:26.760 | we have the two different packages now in Swift APIs
00:50:31.400 | so that you can depend only on one as opposed to the other.
00:50:34.360 | And Dan helped fix all the issues
00:50:36.080 | that I caused while doing the initial move of the random
00:50:38.640 | number generators out of what will become libdeep learning.
00:50:43.280 | That said, it's still very early,
00:50:44.680 | and I have a lot more code to move.
00:50:46.360 | Well, I think that Jeremy is fundamentally right
00:50:48.760 | that we need to spend more time with SwiftAI
00:50:50.520 | and the optimized designs and re-evaluate the training loop
00:50:54.160 | with callback systems and things like that.
00:50:57.120 | Yeah.
00:50:58.160 | As each of these variables changes,
00:51:01.520 | it affects other parts of the system.
00:51:02.880 | And the different trade-offs, I think,
00:51:05.400 | should be re-evaluated because of that.
00:51:07.680 | But I think that getting AD bulletproof
00:51:10.720 | is super important.
00:51:12.960 | And performance.
00:51:13.840 | Yeah, so we have to get those two things right.
00:51:16.200 | We'll then upstream and integrate it into Swift
00:51:18.560 | so that we can build on it and take a program instead of--
00:51:23.760 | yeah.
00:51:25.920 | Quick question about tensor.dtype.
00:51:28.560 | I wonder if we would add any type assertions
00:51:31.520 | in any functions.
00:51:32.840 | I think the Python model is to not check things,
00:51:35.040 | and things just crash at runtime, if I understand.
00:51:38.160 | I don't know.
00:51:38.840 | I mean, I think that there's a couple of different options
00:51:40.320 | there.
00:51:40.480 | I don't know what the right answer is.
00:51:42.160 | But again, one of the things that PyTorch is doing
00:51:45.240 | is they're doing more coercions with dtypes.
00:51:48.000 | So if you take an Int8 and add it to an Int32,
00:51:49.920 | it will actually promote the Int8 to an Int32, for example.
00:51:53.560 | I mean, it's not rocket science.
00:51:54.720 | But that's the kind of thing that is just very nice.
00:51:59.440 | And it just eliminates a certain kind of error.
00:52:01.520 | On the other hand, it's kind of like broadcasting where
00:52:03.680 | it makes certain things just work at the cost of potentially
00:52:07.280 | again, being surprising in some cases.
00:52:08.760 | So I don't know about that.
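(For reference, a small sketch of what that promotion behavior looks like in recent PyTorch versions.)

```python
import torch

a = torch.tensor([1, 2, 3], dtype=torch.int8)
b = torch.tensor([10, 20, 30], dtype=torch.int32)

c = a + b          # the int8 operand is promoted automatically
print(c.dtype)     # torch.int32
```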
00:52:10.760 | I think if you do things that don't make sense,
00:52:13.640 | like you try to do a floating point operation on an integer,
00:52:18.760 | then you would want it to be a runtime error.
00:52:20.720 | But I think that our model is tending
00:52:22.240 | towards a much more runtime-centric approach.
00:52:25.800 | I think, ironically, Swift for TensorFlow
00:52:27.400 | started out very static.
00:52:29.280 | But now, for me, I'm realizing one
00:52:33.360 | of the major benefits of having a fast language
00:52:35.680 | is that dynamism is free.
00:52:38.120 | And so now you can have super dynamic abstractions
00:52:41.520 | that you can do these things in a nice way.
00:52:43.600 | In PyTorch, you do get a pretty clear runtime error.
00:52:47.440 | If there's a type mismatch, it doesn't just crash.
00:52:49.720 | It will tell you what to expect and what it got.
00:52:52.200 | Yeah.
00:52:52.720 | And one of the nice things about eager mode
00:52:54.120 | is that then you get a stack trace.
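(A small sketch of the kind of eager-mode error being described; the exact wording of the message varies between PyTorch versions, and eager execution also gives you a full Python traceback.)

```python
import torch

weights = torch.randn(3, 3)                   # float32
ints = torch.ones(3, 3, dtype=torch.long)     # int64

try:
    weights @ ints                            # matmul requires matching dtypes
except RuntimeError as e:
    print(e)   # e.g. "expected scalar type Float but found Long"
```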
00:52:55.720 | I think there are other ways around sort of encoding things
00:53:02.560 | into the static type system that you have to adhere to.
00:53:06.280 | I think Adam's work on transitioning perfectly
00:53:08.360 | shows that you can still get a lot of benefits
00:53:10.560 | of static analysis without necessarily encoding
00:53:12.600 | things into the type system.
00:53:16.400 | That said, I think it's still an open question as to how far
00:53:18.840 | we can really push that and where we end up landing.
00:53:22.200 | Yeah, I think it's just a really, really great opportunity
00:53:27.960 | to re-evaluate these things as other pieces are coming together.
00:53:31.760 | Maxim asks, why is runtime checking preferable
00:53:34.120 | over static analysis?
00:53:36.200 | I think it's more that we're still
00:53:38.280 | trying to figure out what dimensions you
00:53:39.840 | want to be flexible on.
00:53:41.880 | And so doing things dynamically is sort
00:53:44.560 | of the ultimate in flexibility.
00:53:46.560 | And so as we're trying to iterate on the programming
00:53:49.160 | model, making sure that things are as dynamic as you want them
00:53:52.480 | to be is sometimes nice.
00:53:54.760 | And then we should think about how static analysis can
00:53:57.040 | help catch errors sooner.
00:53:58.480 | Yeah, exactly.
00:53:59.240 | And so this is just a spectrum.
00:54:01.000 | And it's not that one end of the spectrum
00:54:02.760 | is better than the other.
00:54:03.600 | It's about where in the spectrum you end up.
00:54:05.560 | And Nicholas asks,
00:54:07.120 | how are MLIR and XLA related?
00:54:09.360 | That is a super complicated question
00:54:11.160 | because we're actively re-implementing pieces
00:54:13.560 | of XLA in terms of MLIR.
00:54:15.080 | So that's actually a lot more complicated than it sounds.
00:54:18.520 | I would just say that MLIR is a broad scale compiler
00:54:28.000 | technology that solves lots of problems. XLA, as a name,
00:54:31.320 | is typically thought of as a thing
00:54:33.320 | that turns tensors into efficient code.
00:54:35.160 | And so I wouldn't over-index on the number of letters, I guess.
00:54:45.800 | And once Swift for TensorFlow sits on top of MLIR,
00:54:48.440 | we'll still use XLA to target TPUs.
00:54:50.480 | Yeah, so I mean, this is internal work.
00:54:54.920 | But we're doing a lot to change and enhance
00:54:59.240 | the TPU software stack in XLA.
00:55:03.080 | And things that are XLA are changing in their implementation
00:55:07.200 | as well.
00:55:07.880 | And so there's a big investment going on in all these pieces
00:55:10.400 | right now.
00:55:13.240 | And I think that more generally-- again,
00:55:14.720 | if you ignore which letters get attached to them,
00:55:16.760 | the effort here culminates in a much more flexible
00:55:19.800 | code generation stack, support for dynamic shapes,
00:55:23.720 | and custom ops, and things like that.
00:55:25.680 | It's just that different pieces in this very complicated
00:55:28.880 | technology come together at different points as well.
00:55:31.640 | I don't know what the marketing-- the crack compiler marketing
00:55:39.120 | team will end up labeling the resulting thing.
00:55:41.000 | Excellent.
00:55:44.840 | We're slightly over time, so I just
00:55:46.560 | wanted to-- unless there are any pressing questions,
00:55:51.160 | thank everyone for joining.
00:55:52.920 | And see you all next week.
00:55:55.560 | I think next week, Mark will be up talking about some of his work
00:55:58.600 | on testing the autodiff system to ensure
00:56:02.480 | that it's really reliable.
00:56:03.600 | There's some pretty good things that Mark's been up to there.
00:56:06.520 | It's also exciting that AD is getting upstreamed to master,
00:56:09.040 | too, which is really cool.
00:56:12.600 | Thanks, everyone.
00:56:13.240 | Have a great week, and see you all next week.
00:56:15.000 | Thank you, Jeremy.
00:56:15.680 | Thank you.