fastai v2 overview (at S4TF Design Meeting 2019-11-08)
Chapters
0:00 Introduction
3:00 literate programming
5:00 high-level API
9:00 training loop
11:10 callbacks
12:50 mixed precision
14:25 optimizer
17:00 data block
21:10 tensors
27:55 function overloading
31:10 function pipeline
35:15 optimized pipeline
38:23 generic optimizer
41:27 runtime
44:55 JIT
49:28 Data Science
00:00:00.000 |
to myself, to Paige, to Chris, to Ed, or others. 00:00:14.320 |
to the man who needs no introduction, Jeremy, 00:00:29.080 |
for when Adam presented a little bit of Hasktorch code 00:00:34.040 |
a couple of weeks ago, which I thought was super cool. 00:00:37.280 |
And so mainly my goal here is to kind of encourage 00:00:42.240 |
other people to present cool things in other languages 00:00:45.800 |
and libraries, because I think it's a great way for us 00:00:50.840 |
But as tends to happen when you say, can somebody please do x, 00:00:54.840 |
somebody else says, hey, why don't you do x first? 00:01:00.160 |
about the library that Sylvain and I have been working on. 00:01:11.360 |
so for quite a while now, it's a library for PyTorch called 00:01:21.040 |
And I think there are things we can learn from it 00:01:30.800 |
But I'm going to focus on trying to sell you on Fast AI 00:01:46.880 |
And a lot of people think that a higher-level API is 00:01:56.800 |
on top of the serious business of TensorFlow or PyTorch 00:02:02.680 |
when I show you actually what's involved in a truly 00:02:13.600 |
to it in the meeting notes, and that will link you 00:02:27.280 |
Well, this is an example of what Fast AI V2 source code looks 00:02:35.680 |
Jeremy, we are just having a little trouble actually 00:02:41.920 |
OK, so that probably means I failed to present my screen. 00:03:03.600 |
So yeah, so here is an example of what the Fast AI V2 source 00:03:25.440 |
But actually, you'll find that also this pixel shuffle 00:03:29.120 |
appears here in a standard layers.py module, which 00:03:44.520 |
So we've developed a new literate programming system 00:03:48.040 |
that allows you to write code and have it automatically 00:03:53.720 |
turned into nice modules, which even do things 00:03:57.680 |
that most people don't bother to do because they're 00:04:00.160 |
annoying if they're not automatic, like setting __all__ 00:04:06.240 |
Also coming out of that, automatically, is documentation. 00:04:09.920 |
So all that gets turned into hyperlink documentation, 00:04:14.200 |
including links directly to the source code and automatic doc 00:04:24.800 |
And the tests are used both to document the behavior expected. 00:04:31.880 |
this test is a very good description of exactly what 00:04:34.760 |
it is and also ensures that our code is working. 00:04:38.200 |
And those tests can all be put in continuous integration 00:04:43.200 |
So that's the first interesting thing about fastai v2 00:04:47.120 |
is it's the first truly literate programming-based system 00:04:55.800 |
for every part of this, which is kind of a theme for fastai v2. 00:05:03.640 |
and I found something that didn't quite work the way we 00:05:06.800 |
wanted it at any part of the stack, we wrote our own. 00:05:12.800 |
with no particular deadline and trying to do everything 00:05:18.200 |
So the layered API of fastai v2 starts at the applications 00:05:24.800 |
layer, which is where most beginners will start. 00:05:33.280 |
which is the released version of the software 00:05:36.720 |
But V2, everything is rewritten from scratch. 00:05:43.920 |
The idea is that in one, two, three, four lines of code, 00:05:49.480 |
you can create a state-of-the-art computer vision 00:05:57.920 |
With nearly the same one, two, three, four lines of code-- 00:06:03.080 |
five lines of code in this case, because we're also displaying-- 00:06:07.240 |
you can create a state-of-the-art segmentation model. 00:06:13.880 |
is, to the best of my knowledge, still better 00:06:16.280 |
than any published result on this particular CamVid data set. 00:06:19.560 |
So these five lines of code are super good five lines of code. 00:06:23.000 |
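To make the applications layer concrete, here is a minimal sketch in the style of the released fastai v2; it is a Pets image classifier rather than the CamVid segmentation example on the slide, and the exact code shown in the talk may differ.

```python
# A sketch of the application-layer style being described, using names from the
# released fastai v2; not the exact code shown on the slide.
from fastai.vision.all import *

path = untar_data(URLs.PETS)                             # download a sample dataset
dls = ImageDataLoaders.from_name_re(                     # build DataLoaders from file names
    path, get_image_files(path/'images'),
    pat=r'(.+)_\d+.jpg', item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)   # transfer learning, sensible defaults
learn.fine_tune(1)                                       # carefully selected hyperparameters
dls.show_batch()                                         # every application gets show_batch
```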
And as you can see, it includes a line of code, 00:06:26.400 |
which, if you say show batch, it will display your data 00:06:35.520 |
and the color-coded pixels overlaid on top of the picture. 00:06:50.000 |
is a system that we developed and wrote up along 00:06:52.480 |
with Sebastian Ruder for transfer learning 00:06:59.080 |
is working on IMDB on a single epoch in four minutes. 00:07:03.320 |
The accuracy here is basically what was the state-of-the-art 00:07:13.720 |
Basically, a few lines of code, nearly exactly the same lines 00:07:16.960 |
of code, and you'll get a great result from your tabular data 00:07:29.280 |
is designed to be something where, regardless 00:07:36.920 |
using sensible defaults and carefully selected 00:07:39.240 |
hyperparameters, automatically, largely done for you 00:07:45.040 |
for the most common kinds of problems that people look at. 00:07:49.160 |
And that bit doesn't look that different to V1, 00:07:52.840 |
but understanding how we get to that is kind of interesting 00:08:07.840 |
on quite a few years of research to figure out 00:08:10.000 |
what are the best ways to solve various problems along the way. 00:08:19.640 |
that they've been working in TF2 for a while, 00:08:22.360 |
and for some reason they couldn't figure out, 00:08:24.800 |
all of their models are suddenly working much better. 00:08:29.240 |
getting all these nice kind of curated best practices. 00:08:32.960 |
And somebody else on Twitter saw that and said, yep, 00:08:37.440 |
We were trying TensorFlow, spent months tweaking, 00:08:41.440 |
A couple of days later, we were getting better results. 00:08:44.200 |
So these kind of carefully curated defaults and algorithms 00:08:48.880 |
and high-level APIs that do things right for you 00:08:51.360 |
the first time, even for experienced practitioners, 00:09:00.960 |
that are more, I think, interesting for a Swift 00:09:05.640 |
go into how we make that work, the more stuff 00:09:09.920 |
you'll see, which will be a great fit, I think, with Swift. 00:09:22.840 |
actually, I guess the foundation layer is new. 00:09:24.840 |
So the mid-layer, I guess I'd say, is more rewritten for V1. 00:09:34.640 |
One of the bits which is the most interesting 00:09:47.080 |
This is what a training loop looks like in PyTorch. 00:10:09.120 |
Run the model, get the loss, do the gradients, 00:10:11.960 |
step the optimizer, do that a bunch of times. 00:10:23.600 |
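For reference, that basic loop is roughly the following in plain PyTorch; this is a generic sketch, not the slide's code, and model, loss_func, opt, train_dl and n_epochs are assumed to be defined elsewhere.

```python
# A generic sketch of the basic PyTorch training loop described above;
# model, loss_func, opt, train_dl and n_epochs are assumed to exist.
for epoch in range(n_epochs):
    for xb, yb in train_dl:
        pred = model(xb)            # run the model
        loss = loss_func(pred, yb)  # get the loss
        loss.backward()             # compute the gradients
        opt.step()                  # step the optimizer
        opt.zero_grad()             # reset and do it a bunch of times
```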
statistics in TensorBoard or in fastprogress or whatever. 00:10:28.840 |
You might want to schedule various hyperparameters 00:10:33.160 |
You might want to add various different kinds 00:10:45.680 |
have to write a new training loop for every time 00:10:56.840 |
Or you try and write one training loop which does 00:11:03.800 |
which only did a tiny subset of the things I just said 00:11:12.960 |
Now, the idea of callbacks has been around in deep learning 00:11:21.720 |
is that every callback is actually a two-way callback. 00:11:27.680 |
It can read gradients, parameters, data, so forth. 00:11:33.720 |
So it can actually change anything at any time. 00:11:37.120 |
So the callbacks are, we say, infinitely flexible. 00:11:41.720 |
We feel pretty confident in that because the training loop 00:11:49.680 |
to do any of the tweaks that I showed you before. 00:11:53.000 |
So even the entirety of training GANs can be done in a callback. 00:11:58.600 |
So basically, we switch out our basic training loop 00:12:01.800 |
and replace it with one with the same five steps 00:12:09.920 |
So that means, for example, if you want to do a scheduler, 00:12:16.560 |
sets the optimizer's learning rate to some function. 00:12:21.440 |
you can write an on-epoch-end callback that checks the metrics 00:12:27.360 |
Or you can do parallel training, set up data parallel, 00:12:44.280 |
at the end of the backward step, and so forth. 00:12:50.880 |
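Here is a minimal sketch of what a training loop driven by two-way callbacks can look like. The Learner class and event names below are illustrative, not fastai v2's exact API; the point is that every step is wrapped in callbacks that can read and modify any training state.

```python
# A minimal sketch of a two-way-callback training loop. Event names and this
# Learner class are illustrative, not fastai v2's exact API.
class Learner:
    def __init__(self, model, data, loss_func, opt, cbs):
        self.model, self.data, self.loss_func = model, data, loss_func
        self.opt, self.cbs, self.stop = opt, cbs, False

    def cb(self, event):
        for c in self.cbs: getattr(c, event, lambda learn: None)(self)

    def fit(self, n_epochs):
        self.cb('begin_fit')
        for self.epoch in range(n_epochs):
            for self.xb, self.yb in self.data:
                self.cb('begin_batch')        # e.g. a scheduler sets the optimizer's lr here
                self.pred = self.model(self.xb)
                self.loss = self.loss_func(self.pred, self.yb)
                self.cb('after_loss')         # e.g. mixup can replace self.loss here
                self.loss.backward()
                self.cb('after_backward')     # e.g. gradient clipping
                self.opt.step(); self.opt.zero_grad()
            self.cb('after_epoch')            # e.g. early stopping checks metrics, sets self.stop
            if self.stop: break
        self.cb('after_fit')
```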
all things that have been written with fastAI callbacks, 00:12:57.520 |
All of NVIDIA's recommendations, mixed precision training, 00:13:03.360 |
will be added automatically if you just add a to_fp16 00:13:15.240 |
can be combined with multi-GPU and one-cycle training 00:13:25.080 |
And so trying to create a state-of-the-art model, which 00:13:30.840 |
involves combining state-of-the-art regularization 00:13:33.200 |
and mixed precision and distributed training and so 00:13:39.600 |
But with this approach, it's actually just a single extra 00:13:47.080 |
to work with each other and are tested to work with each other. 00:13:51.200 |
So for instance, here is mix-up data augmentation, 00:13:54.760 |
which is an incredibly powerful data augmentation method that 00:13:59.320 |
has powered lots of state-of-the-art results. 00:14:02.440 |
And as you can see, it's under a screen of code. 00:14:17.960 |
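For a feel of what "under a screen of code" means, here is a sketch of mixup written as a callback for the Learner sketch above. It follows the mixup paper's recipe; fastai v2's actual MixUp callback differs in its details.

```python
# A sketch of mixup as a callback for the Learner sketch above; this follows the
# mixup paper's recipe, not fastai v2's exact implementation.
import numpy as np, torch

class MixUp:
    def __init__(self, alpha=0.4): self.alpha = alpha
    def begin_batch(self, learn):
        lam = np.random.beta(self.alpha, self.alpha)
        perm = torch.randperm(learn.xb.size(0))
        self.lam, self.yb2 = lam, learn.yb[perm]
        # blend each input with another randomly chosen input from the same batch
        learn.xb = lam * learn.xb + (1 - lam) * learn.xb[perm]
    def after_loss(self, learn):
        # the loss is the same blend of the losses against both sets of labels
        learn.loss = self.lam * learn.loss + (1 - self.lam) * learn.loss_func(learn.pred, self.yb2)
```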
full of all kinds of assumptions and only one 00:14:36.760 |
been lots and lots of different optimizers appearing 00:14:52.520 |
as decoupled weight decay Adam, was added to PyTorch 00:15:15.040 |
On the other hand, FastAI's implementation, as you can see, 00:15:22.840 |
two lines of code and this little bit of gray here. 00:15:25.760 |
So it's kind of like two and a half, three lines of code 00:15:37.600 |
see what's different for each of these state of the art 00:15:46.120 |
can be added and removed by just changing two things-- 00:15:58.920 |
during training, such as the gradients or the gradient 00:16:01.520 |
squared, or you might use dampening, or momentum, 00:16:07.000 |
uses those stats to change the weights in some way. 00:16:15.320 |
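The split being described is between "stats" that record something during training and "steppers" that use those stats to update the weights. The sketch below illustrates that split with SGD-with-momentum; the names and signatures are my own, not fastai v2's actual optimizer API.

```python
# A sketch of the "stats plus steppers" idea; names and signatures are illustrative,
# not fastai v2's actual optimizer API.
import torch

def average_grad(state, p, mom=0.9, **kw):
    # A *stat*: track an exponential moving average of each parameter's gradients.
    state['grad_avg'] = state.get('grad_avg', torch.zeros_like(p.grad)) * mom + p.grad
    return state

def momentum_step(state, p, lr=0.1, **kw):
    # A *stepper*: use the recorded stats to change the weights.
    p.data.add_(state['grad_avg'], alpha=-lr)

class GenericOptimizer:
    # Run every stat, then every stepper, for each parameter.
    def __init__(self, params, stats, steppers, **hypers):
        self.params, self.stats, self.steppers, self.hypers = list(params), stats, steppers, hypers
        self.state = {p: {} for p in self.params}
    def step(self):
        for p in self.params:
            if p.grad is None: continue
            for stat in self.stats: self.state[p] = stat(self.state[p], p, **self.hypers)
            for stepper in self.steppers: stepper(self.state[p], p, **self.hypers)
    def zero_grad(self):
        for p in self.params:
            if p.grad is not None: p.grad.detach_(); p.grad.zero_()

# SGD with momentum is then just a particular choice of stats and steppers:
# opt = GenericOptimizer(model.parameters(), [average_grad], [momentum_step], mom=0.9, lr=0.1)
```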
able to implement all these different optimizers. 00:16:27.000 |
which came out of Google and was super cool at reducing 00:16:30.960 |
the pre-training time from three days to 76 minutes, 00:16:34.880 |
we were able to implement that in this tiny piece of code. 00:16:38.840 |
And one of the nice things is that when you compare it 00:16:41.720 |
to the math, it really looks almost line for line 00:16:51.560 |
So it makes it really easy to do research as well, 00:16:55.160 |
because you can quite directly bring the equations across 00:17:04.240 |
is the data block API, which is something we had in version 1 00:17:26.560 |
helped us to rethink it in a more idiomatic Swift way. 00:17:36.840 |
And we ended up with something that was quite a bit nicer. 00:17:40.160 |
So there's been a nice interaction and interplay 00:17:42.400 |
between fast AI in Python and Swift AI in Swift 00:17:52.480 |
is something where you define each of the key things 00:17:56.680 |
that the program needs to know to flexibly get your data 00:18:04.160 |
So it needs to know what type of data do you have, 00:18:08.120 |
how do you get that data, how do you split it 00:18:13.320 |
and then put that all together into a data bunch, which 00:18:17.520 |
It's literally, I think, four lines of code, which just 00:18:20.760 |
has the validation set and the training set in one place. 00:18:26.840 |
So with a data block, you just say, OK, my types, 00:18:31.960 |
I want to create a black and white pillow image for my x 00:18:43.240 |
And to split those files into training and validation, 00:18:52.240 |
And to get the labels, use this function, which is 00:19:04.840 |
And so once you've done this, you end up with a data bunch. 00:19:08.640 |
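In the released fastai v2 that example looks roughly like the sketch below (black-and-white images and categories, MNIST-style). Note that the "data bunch" of the talk became DataLoaders in the released API, and the code shown on the slide may differ slightly.

```python
# A sketch of the data block example being described, using released fastai v2 names.
from fastai.vision.all import *

path = untar_data(URLs.MNIST_TINY)
mnist = DataBlock(
    blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),  # the types: B&W image x, category y
    get_items=get_image_files,                           # how to get the items
    splitter=GrandparentSplitter(),                      # how to split train/validation (by folder)
    get_y=parent_label)                                  # how to label each item
dls = mnist.dataloaders(path)
dls.show_batch()                                         # and show_batch works, as before
```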
And as I mentioned before, everything has a show batch. 00:19:12.360 |
So one of the nice things is it makes it very easy for you 00:19:14.840 |
to look at your data regardless of whether it's 00:19:17.080 |
tabular, or collaborative filtering, or vision, or text, 00:19:21.600 |
If it was audio, it would show you a spectrogram 00:19:28.120 |
So you can do custom labeling with data blocks 00:19:32.880 |
by using, for example, a regular expression labeler. 00:19:37.240 |
You can get your labels from an external file or data frame. 00:19:44.080 |
So this thing here knows it's a multi-label classification 00:19:47.600 |
So it's automatically put a semicolon between each label. 00:19:51.800 |
Again, it's still basically just three lines of code 00:19:59.600 |
And you can see, really, the only thing I had to change here 00:20:02.640 |
was that my dependent variable has been changed from category 00:20:09.000 |
And again, automatically, I show batch works. 00:20:11.640 |
And we can train a model from that straight away as well. 00:20:17.400 |
So here, I've just changed my dependent variable 00:20:20.600 |
And so now, it knows how to behave with that. 00:20:24.960 |
So now, I've changed my dependent variable to bounding box. 00:20:27.600 |
And you can see, I've got my bounding boxes here. 00:20:36.040 |
So actually, going back, I have a couple of questions 00:20:41.000 |
So the code, you've got sort of the x's and y's. 00:20:47.680 |
And these sound like these different data types roughly 00:21:00.240 |
And actually, this is the way it looked about three weeks ago. 00:21:12.000 |
And this is the bit that I think is the most relevant to Swift. 00:21:14.600 |
A lot of this, I think, would be a lot easier to write in Swift. 00:21:31.840 |
with a data type called tensor, which has no semantics to it. 00:21:37.720 |
And so those tensors actually represent something 00:21:51.040 |
and say dot flip, or dot rotate, or dot resample, 00:21:58.280 |
Well, the answer is you can't, because it's just 00:22:08.680 |
So you can now have a tensor image, a tensor point, 00:22:13.720 |
And you can define a flip left, right for each. 00:22:16.560 |
And so this is some of the source code from-- 00:22:18.480 |
we've written our own computer vision library, 00:22:20.920 |
so that now you can say flip LR, and it flips the puppy. 00:22:26.800 |
And if it was key points, it would flip the key points. 00:22:30.640 |
If it was a bounding box, it would flip the bounding boxes, 00:22:34.880 |
So this is an example of how tensors which carry around 00:22:40.360 |
It's also nice that I can just say dot show, right? 00:22:43.280 |
So dot show is something that's defined for all fastai v2 00:22:53.400 |
It could even be a tuple containing a tensor, 00:22:56.560 |
and some bounding boxes, and some bounding box classes. 00:22:59.080 |
Whatever it is, it will be able to display it. 00:23:02.720 |
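The sketch below shows the semantic-tensor idea in recent PyTorch: Tensor subclasses that carry their meaning, so behavior like flip_lr and show can be defined per type. It is illustrative only; fastai v2's implementation needs extra hacks (mentioned earlier) so that operations keep the subtype, whereas recent PyTorch preserves subclasses and provides Tensor.as_subclass out of the box.

```python
# A sketch of semantic tensor subtypes; illustrative, not fastai v2's implementation.
import torch
import matplotlib.pyplot as plt

class TensorImage(torch.Tensor):
    def flip_lr(self): return self.flip(-1)                    # flip the width axis
    def show(self, ax=None):
        (ax or plt.gca()).imshow(self.permute(1, 2, 0).cpu().numpy())

class TensorPoint(torch.Tensor):
    def flip_lr(self):                                         # points flip by negating x coords
        res = self.clone(); res[..., 0] = -res[..., 0]; return res

img = torch.rand(3, 64, 64).as_subclass(TensorImage)
img.flip_lr().show()                                           # flips (and shows) the image
pts = torch.tensor([[0.5, 0.2]]).as_subclass(TensorPoint)
pts.flip_lr()                                                  # flips the points instead
```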
It will be able to convert it into batches for modeling, 00:23:10.480 |
So with that, we can now create, for example, 00:23:18.200 |
And we can say that the encoding of that random transformation 00:23:21.640 |
is defined for a pillow image or any tensor type. 00:23:31.480 |
Or we could do the dihedral symmetry transforms 00:23:36.280 |
Before we call it, we grab a random number between 0 and 7 00:23:40.000 |
to decide which of the eight transposes to do. 00:23:51.200 |
And so now we can call that transform a bunch of times. 00:23:55.480 |
And each time, we'll get back a different random augmentation. 00:23:59.000 |
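A minimal sketch of that random-transform idea follows: draw the random state once per call, then apply the same choice to every element of the tuple so x and y stay in sync. It is illustrative only; fastai v2's RandTransform and dihedral transform differ in their details.

```python
# A sketch of a random dihedral transform: one random draw per call, applied to
# every tuple element. Illustrative, not fastai v2's exact classes.
import random, torch

class RandDihedral:
    def __call__(self, batch):
        k = random.randint(0, 7)                        # one of the 8 dihedral transposes
        return tuple(self.encode(x, k) for x in batch)  # same k for every tuple element
    def encode(self, x, k):
        if k & 1: x = x.flip(-1)                        # horizontal flip
        if k & 2: x = x.flip(-2)                        # vertical flip
        if k & 4: x = x.transpose(-1, -2)               # diagonal transpose
        return x

img, mask = torch.rand(3, 8, 8), torch.rand(1, 8, 8)
aug = RandDihedral()
samples = [aug((img, mask)) for _ in range(3)]          # a different augmentation each call
```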
So a lot of these things become nice and easy. 00:24:02.560 |
Maxim asked, why isn't tensor a backing data structure 00:24:06.320 |
Tensor image is a tensor, which is an image type. 00:24:13.200 |
Why isn't-- he says, why isn't tensor a backing-- 00:24:38.480 |
Inheritance just has the benefit that all the normal stuff you 00:24:41.760 |
can do with a tensor, you can do with a tensor that 00:24:46.600 |
doesn't mean you now don't want to be able to do fancy indexing 00:24:54.120 |
or stack it with other tensors across some axis. 00:25:09.720 |
We have a version that uses composition as well, 00:25:11.680 |
and it uses Python's nice getattr functionality 00:25:29.280 |
decided to officially implement semantic tensor subtypes now. 00:25:35.080 |
And so hopefully, in the next version of PyTorch, 00:25:37.320 |
you won't have to use the extremely ugly hacks 00:25:48.600 |
some of these ideas will be brought over there. 00:25:51.760 |
Can I ask you, so how does the type propagate? 00:26:00.320 |
So Chris and I had a conversation about this a few months ago, 00:26:05.280 |
and I said I'm banging my head around this issue of types 00:26:12.160 |
And Chris casually mentioned, oh, yes, that thing 00:26:16.320 |
So I went home, and that was one of these phrases 00:26:19.120 |
I thought only functional programming dweebs talked 00:26:22.000 |
about, and I would never care about a tensor. 00:26:24.600 |
And we have to care about it, because it actually 00:26:27.320 |
And it's basically the idea that if you have a tensor image 00:26:30.160 |
and you add one to it, you want to get back a tensor image, 00:26:33.920 |
because it should be an image that's a bit brighter 00:26:42.280 |
hacky, partial higher-kinded type implementation 00:26:53.400 |
you will nearly always get back the correctly subtyped tensor. 00:26:58.680 |
Yeah, I mean, I saw the PyTorch team recently sort of talking 00:27:01.600 |
about their named indexing extensions for their tensors 00:27:06.680 |
as well, and I assume they have a similar kind of challenge 00:27:13.640 |
has named dimensions, you want to propagate those along. 00:27:25.640 |
But yeah, we talked to the PyTorch team at the DevCon, 00:27:30.520 |
and we certainly are planning to bring these ideas together. 00:27:38.040 |
Yeah, I just mean that I assume that that feature has 00:27:44.560 |
So it would be interesting to see what they do. 00:27:55.400 |
Not only do we get to be able to say .show batch, 00:28:00.840 |
And in this case, it knows what the independent variables type 00:28:05.240 |
is, it knows what the dependent variables type is, 00:28:07.720 |
and it even knows things like, hey, for a classification task, 00:28:12.360 |
If they're not, by default, I will highlight them in red. 00:28:18.240 |
the things that drive our ability to easily add 00:28:30.240 |
We had to write our own type dispatch system. 00:28:36.040 |
and those type annotations are actually semantic. 00:28:38.720 |
And so we now have the joyfully modern idea of function 00:28:45.080 |
overloading in Python, which has made life a lot easier, 00:28:51.800 |
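Below is a simplified sketch of dispatch on argument types. fastai v2 wrote its own type dispatch system keyed off type annotations; the standard library's functools.singledispatch is used here purely to illustrate the behavior, and the types are the hypothetical ones from the earlier sketch.

```python
# A simplified sketch of type-based dispatch; fastai v2 uses its own TypeDispatch,
# singledispatch is just a stand-in to illustrate the behavior.
from functools import singledispatch
import torch

class TensorImage(torch.Tensor): pass
class TensorPoint(torch.Tensor): pass

@singledispatch
def flip_lr(x):
    raise TypeError(f"don't know how to flip a {type(x).__name__}")

@flip_lr.register
def _(x: TensorImage): return x.flip(-1)

@flip_lr.register
def _(x: TensorPoint):
    res = x.clone(); res[..., 0] = -res[..., 0]; return res

flip_lr(torch.rand(3, 4, 4).as_subclass(TensorImage))   # picks the TensorImage overload
```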
Do you have many users that are using this yet? 00:28:59.960 |
But there is an enthusiastic early adopter community 00:29:08.000 |
So for example, the user-contributed audio library 00:29:14.960 |
I've also built a medical imaging library on top of it, 00:29:17.280 |
and I've written a series of five notebooks showing how 00:29:33.320 |
because there's this very strongly-held conception 00:29:41.760 |
of typing in the world, and I'm curious how they react to that. 00:29:46.040 |
The extremely biased subset of early-adopter fastai 00:29:53.120 |
And they tend to be people who have gone pretty deep 00:29:59.920 |
who wrote something called Music Autobot, which 00:30:03.920 |
in case you haven't seen it yet, which is something 00:30:07.120 |
where you can generate music using a neural network. 00:30:11.120 |
You can put in some melodies and some chords, 00:30:13.480 |
and it will auto-complete some additional melodies and chords. 00:30:16.520 |
Or you can put it in a melody, and it will automatically 00:30:19.360 |
add chords, or you can add chords that create melody. 00:30:23.960 |
And so he had to write his own MIDI library, fastai.midi. 00:30:29.200 |
He rewrote it in V2, and he said it's just like so, so, 00:30:33.640 |
so much easier, thanks to those mid-tier APIs. 00:30:44.720 |
I've been helping with some of the audio stuff, 00:30:51.240 |
So it makes things a lot more flexible than version 1. 00:30:56.440 |
So that's probably my favorite thing about it, 00:31:02.360 |
Nothing is like, well, it's got to be this way, 00:31:10.800 |
Another piece of the transform part of the foundation 00:31:10.800 |
is the partially reversible composed function 00:31:20.760 |
really rolls off the tongue, we call them Transform and Pipeline. 00:31:35.600 |
and function composition to work in deep learning 00:31:43.560 |
The first is you often want to dispatch over tuples. 00:31:47.200 |
And what I mean by that is if you have a function called 00:31:52.760 |
flip left right, and you have a tuple representing 00:31:59.200 |
a mini batch where your independent variable is 00:32:03.480 |
is a set of bounding boxes, if you say flip left right 00:32:07.160 |
on that tuple, you would expect both the x and the y 00:32:11.680 |
to be flipped and to be flipped with the type appropriate 00:32:21.400 |
send each element of a tuple to the function separately 00:32:26.000 |
and/or dispatch according to their types automatically. 00:32:30.720 |
We've mentioned type retention, so the kind of basic 00:32:51.160 |
would take the word dog and convert it to the number 1, 00:32:55.520 |
perhaps, which is what you need for modeling. 00:33:02.720 |
So you need to reverse that transform and turn 1 back 00:33:08.120 |
Often those transforms also need data driven setup. 00:33:12.320 |
For example, in that example of dog becoming 1, 00:33:16.000 |
there needs to be something that actually creates that vocab 00:33:18.360 |
automatically, recognizing what are all the possible classes, 00:33:21.520 |
so it can create a different index for each one 00:33:39.040 |
At the lowest level is a class called transform, 00:33:41.920 |
which is a callable, which also has a decode, 00:33:48.720 |
does the type retention, higher-kinded type thing, 00:33:51.400 |
and does the dispatch over tuples by default. 00:33:54.720 |
So then a pipeline is something that does function composition 00:34:00.120 |
And it knows about, for example, setting up transforms. 00:34:08.400 |
is a bit tricky because you have to make sure 00:34:21.600 |
And then we have something that applies a pipeline 00:34:23.760 |
to a collection to give you an indexable, lazily transformed 00:34:31.040 |
to get back an independent variable, for instance. 00:34:47.520 |
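The sketch below illustrates that Transform/Pipeline idea: data-driven setup, an encode direction for modeling, and a decode direction to get back something displayable. The class names and structure here are simplified stand-ins; fastai v2's real Transform and Pipeline add the type retention and tuple dispatch described above.

```python
# A sketch of the Transform/Pipeline idea: setup, encodes, decodes, and composition.
# Simplified stand-ins, not fastai v2's real classes.
class Categorize:
    def setup(self, items):
        self.vocab = sorted(set(items))                     # data-driven: build the vocab
        self.o2i = {o: i for i, o in enumerate(self.vocab)}
    def encodes(self, o): return self.o2i[o]                # 'dog' -> 1
    def decodes(self, i): return self.vocab[i]              # 1 -> 'dog'

class SimplePipeline:
    # Compose transforms, setting each one up on data processed by the previous ones.
    def __init__(self, tfms): self.tfms = tfms
    def setup(self, items):
        for t in self.tfms:
            if hasattr(t, 'setup'): t.setup(items)
            items = [t.encodes(o) for o in items]
    def __call__(self, o):
        for t in self.tfms: o = t.encodes(o)
        return o
    def decode(self, o):
        for t in reversed(self.tfms): o = t.decodes(o)
        return o

pipe = SimplePipeline([Categorize()])
pipe.setup(['cat', 'dog', 'dog'])
pipe('dog'), pipe.decode(1)                                 # -> (1, 'dog')
```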
So in the end, all this stuff makes a lot of things 00:34:55.720 |
For example, the language model data loader in Fast AI v1 00:35:05.440 |
In Fast AI v2, it's less than a screen of code 00:35:08.480 |
by leveraging these powerful abstractions and foundations. 00:35:16.400 |
So then finally-- and again, this is something 00:35:22.160 |
we worked really hard to make everything extremely well 00:35:26.280 |
So for example, preprocessing and natural language processing, 00:35:30.200 |
we created a parallel generator in Python, which you can then 00:35:35.840 |
basically pass a class to that defines some setup and a call. 00:35:44.200 |
So for example, tokenization is done in parallel 00:35:50.000 |
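As a rough illustration of running preprocessing like tokenization in parallel, here is a sketch using only the standard library; fastai v2 has its own parallel generator that takes a class defining a setup and a call, which this merely stands in for.

```python
# A sketch of parallel preprocessing; a stand-in for fastai v2's parallel generator.
from concurrent.futures import ProcessPoolExecutor

def tokenize(text):
    return text.lower().split()              # stand-in for a real tokenizer

def parallel_tokenize(texts, n_workers=4, chunksize=64):
    with ProcessPoolExecutor(n_workers) as ex:
        yield from ex.map(tokenize, texts, chunksize=chunksize)

if __name__ == '__main__':
    toks = list(parallel_tokenize(["Some text to tokenize", "And some more text"]))
```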
But perhaps the thing I'm most excited about, 00:35:57.640 |
both in Python and Swift, is the optimized pipeline 00:36:04.960 |
So pretty much all of the transforms we've done can 00:36:14.880 |
So for example, when you do the flip left right 00:36:17.720 |
I showed you earlier, it'll actually run on the GPU, 00:36:20.920 |
as will warp, as will zoom, as will even things like crop. 00:36:26.960 |
So one of the basics of this is the affine coordinate transform, 00:36:36.280 |
which are very powerful PyTorch functions, which 00:36:40.560 |
would be great things to actually write in script 00:36:57.480 |
that lets us do a very wide range of data augmentations 00:37:03.840 |
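The sketch below shows the basic mechanism: build one affine matrix per image (here a random horizontal flip plus a small rotation), turn it into a sampling grid with F.affine_grid, and resample the whole batch with F.grid_sample. This is a simplified illustration; fastai v2 composes many such matrices into a single resample, which is where much of the speed comes from.

```python
# A sketch of GPU batch augmentation via an affine coordinate transform.
import math, torch
import torch.nn.functional as F

def rand_flip_rotate(batch, max_deg=10.0):
    bs = batch.size(0)
    flip = (torch.rand(bs, device=batch.device) < 0.5).float() * 2 - 1   # -1 or 1 per image
    ang = (torch.rand(bs, device=batch.device) * 2 - 1) * math.radians(max_deg)
    theta = torch.zeros(bs, 2, 3, device=batch.device)                   # one 2x3 matrix per image
    theta[:, 0, 0] = ang.cos() * flip
    theta[:, 0, 1] = -ang.sin()
    theta[:, 1, 0] = ang.sin() * flip
    theta[:, 1, 1] = ang.cos()
    grid = F.affine_grid(theta, batch.shape, align_corners=False)        # sampling coordinates
    return F.grid_sample(batch, grid, align_corners=False)               # one resample for the batch

imgs = torch.rand(8, 3, 64, 64)          # add .cuda() to run the whole thing on the GPU
aug = rand_flip_rotate(imgs)
```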
For those of you that know about the DALI library 00:37:11.680 |
It's pretty similar in terms of its performance. 00:37:14.280 |
But the nice thing is, all the stuff you write, 00:37:19.720 |
So with DALI, if they don't have the exact transformation 00:37:24.160 |
you want, and there's a pretty high chance that they won't, 00:37:29.000 |
Or else with fast AI v2, you can write your own 00:37:41.480 |
So this kind of stuff, I feel like because Swift 00:37:46.360 |
is a much faster, more hackable language than Python, 00:37:52.800 |
or at least hackable in the sense of performance, 00:37:55.800 |
I guess not as hackable in terms of its type system necessarily, 00:37:59.640 |
I feel like we can build even more powerful foundations 00:38:07.040 |
and pipelines and a real Swift for TensorFlow computer vision 00:38:19.800 |
Stuff like that, I think, would be super cool. 00:38:41.560 |
seems relatively straightforward for Swift to handle. 00:38:44.120 |
Are there other sorts of things that you think 00:38:49.040 |
Yeah, the thing I really want you to think about, 00:38:51.240 |
and we've kind of been nagging you on and off since March, 00:38:58.760 |
Having them as a value type the way they are now 00:39:07.760 |
that I really, really want you guys to look into and build 00:39:16.680 |
and it's only kind of partially doing what we need it to do. 00:39:20.280 |
So I talked to Alexis about this idea quite a bit, 00:39:28.480 |
could be some type that represents the actual block 00:39:34.760 |
of GPU memory in a way where we can easily share that. 00:39:39.680 |
In practice, we've realized the vast majority of the time, 00:39:45.360 |
we want to refer to that exact piece of memory on the GPU, 00:39:49.760 |
not this idea of a tensor which may magically copy itself 00:39:56.560 |
And so, for example, with the generic optimizer, 00:40:02.440 |
part of this layer group, and this layer group 00:40:08.760 |
So I actually said to Ed, hey, could you please 00:40:12.800 |
have a look at the Swift AI generic optimizer, 00:40:15.480 |
because it's trying to be a similar design to the fast AI 00:40:20.480 |
V2 optimizer, but it's currently pretty unattractive. 00:40:26.120 |
The second is I feel like creating a really good computer 00:40:31.240 |
vision library is something which could be done now-ish. 00:40:35.800 |
When I tried to do it, I was getting kind of race conditions 00:40:41.960 |
and freezes inside Swift, and I don't have the Swift skills 00:40:45.400 |
to know where they were coming from or how to fix them. 00:40:50.480 |
I think all of my answers is, go back to the stuff 00:40:52.680 |
that we all built together back in March, April, May, 00:41:00.320 |
and build models with it, and put them in production, 00:41:04.400 |
and see the bits where it hits where you get stuck, 00:41:08.080 |
because you'll find things like, oh, there's no grid sample, 00:41:11.320 |
and, oh, there's race conditions in the interaction of OpenCV, 00:41:17.120 |
and the optimizer doesn't quite work properly, and that stuff. 00:41:28.000 |
I think we're also trying to figure out right now what 00:41:45.000 |
On the other hand, the performance, particularly in 00:41:48.880 |
So I think one of the things we're kicking around 00:42:02.720 |
I think that's the thing that's been stopping us all 00:42:05.000 |
from using stuff like Swift AI to actually build models, 00:42:14.240 |
I don't have to be polite-- so not at Google. 00:42:17.240 |
And it implements everything in six different ways 00:42:24.640 |
to be digging into these higher level APIs a lot more 00:42:34.240 |
XLA doesn't provide all the things in the box. 00:42:41.120 |
I'm just so kind to let stuff that we need it. 00:42:48.320 |
We're also investing a lot in AD and finishing that off. 00:42:51.040 |
Yeah, I mean, all the right work is being done. 00:42:57.360 |
Yes, I think the challenge that we're really struggling with 00:42:59.920 |
is this decision to stick with the TensorFlow runtime 00:43:09.520 |
agree this is one of the major blockers for adoption and use. 00:43:15.160 |
I mean, especially if you want to take advantage of Swift, 00:43:18.440 |
which we do, you need something where the kernel launch 00:43:25.560 |
time is tiny or better still kind of non-existent 00:43:31.120 |
Otherwise, it's-- yeah, you don't really get the benefits. 00:43:36.240 |
so I'll say I'll answer your question in a second. 00:43:40.520 |
that XLA doesn't have really fast kernel launch time 00:43:47.280 |
On the other hand, there are a lot of opportunities 00:43:51.040 |
to do, for example, Fusion and other things like that 00:43:57.960 |
is this combination of tracing plus compilation, which 00:44:09.840 |
Just yesterday, there was a really fantastic talk 00:44:11.840 |
from some folks at Intel talking about their code generation 00:44:15.040 |
algorithms that they're bringing over to MLIR, which I'm really, 00:44:23.520 |
Getting the ideal code gen for NVIDIA GPUs, for example, 00:44:36.880 |
of the different teams and the different companies 00:44:42.480 |
And the Intel stuff that they presented yesterday 00:44:53.480 |
The other thing I might mention in terms of tails 00:44:56.040 |
on the other side, what's life like in the Python world, 00:44:59.560 |
things that are and aren't working well over there. 00:45:04.040 |
The kind of the answer to Swift for TensorFlow in the PyTorch 00:45:19.160 |
is a dialect or subset of Python that they actually parse. 00:45:29.240 |
It has reached the point now where it can actually 00:45:38.520 |
a bunch of our students actually have been working on a thing 00:45:41.360 |
called Mish, including a young researcher who 00:45:53.560 |
And when it just took me half an hour to create a JIT version 00:45:58.160 |
and it ran at the same speed as somebody else's 00:46:05.280 |
two or three lines of code, that's working pretty well. 00:46:09.160 |
Although for bigger things, like a new batch norm implementation 00:46:22.400 |
not just for Python, but the whole world of non-Google 00:46:26.280 |
people, is that the best computer vision models by far 00:46:29.960 |
are largely those that are coming out of Google, 00:46:32.160 |
like EfficientNets and MixNets, from Quoc Le's team. 00:46:36.160 |
They run very slowly and with a lot of memory on GPUs. 00:46:41.040 |
And so we tried wrapping an entire EfficientNet 00:46:44.840 |
and MixNet into a JIT-ed thing, so it wouldn't be so slow. 00:46:49.080 |
The MixNet didn't work at all, and the EfficientNet 00:46:53.840 |
So that's kind of the status of JIT in PyTorch 00:47:00.720 |
The way I look at this from the compiler-y code generation 00:47:03.960 |
piece is that I think the MLIR pieces are all 00:47:07.080 |
They're just going to take a while to get here. 00:47:10.240 |
XLA, as far as I know, is state of the art in code generation. 00:47:17.080 |
The challenge of those, it does have sort of limitations 00:47:19.800 |
like static shapes and the number of ops it supports. 00:47:23.280 |
You kind of have to be within its world for it to be useful. 00:47:34.680 |
TorchScript, my understanding is that the base model 00:47:39.320 |
of TorchScript and the interpreters they have, 00:47:46.000 |
But the kernel fusion piece is still fairly early 00:47:48.480 |
when it's mostly pointwise operations, for example. 00:47:54.960 |
they're partly a limitation of the Python type system. 00:48:01.800 |
that can work with different numbers of channels 00:48:03.720 |
well, you're out of luck because they use Python type 00:48:06.680 |
limitations, which have no way of saying it's 00:48:12.680 |
So then you have to hard code all these assumptions 00:48:19.920 |
Well, so I mean, I think there's other spaces 00:48:25.320 |
I mean, this isn't the highest priority at this moment. 00:48:31.840 |
should we encode dtype in the static type system? 00:48:36.960 |
If you just say tensor, then you get rid of all the generics 00:48:46.120 |
But then I think if you go with more semantic tensor types 00:48:52.440 |
What you want is the semantics, and that you're actually 00:48:56.240 |
Like for mixed precision, we're switching stuff from one type 00:49:01.080 |
Depending on whether you're doing a loss function 00:49:04.640 |
you need to be changing between half and single. 00:49:09.720 |
that would be really interesting in terms of ergonomics, 00:49:13.520 |
but also simplification, which I think would be great. 00:49:17.120 |
Your point about the optimizer is that the key path 00:49:19.920 |
have all kinds of weirdness because you have multiple dtypes 00:49:41.240 |
wanting to put lots and lots of verbose generic type 00:49:59.000 |
and combine it with an XLA-based approach that 00:50:09.560 |
so a couple of weeks ago, I presented the layering proposal 00:50:12.440 |
to separate out libtensor from libdeep learning 00:50:16.200 |
so that we can then get the freedom to then iterate 00:50:19.240 |
at that level and have multiple explorations on top. 00:50:23.280 |
So the progress update on there is that I started-- 00:50:26.760 |
we have the two different packages now in Swift APIs 00:50:31.400 |
so that you can depend only on one as opposed to the other. 00:50:36.080 |
that I caused while doing the initial move of the random 00:50:38.640 |
number generators out of what will become libdeep learning. 00:50:46.360 |
Well, I think that Jeremy is fundamentally right 00:50:48.760 |
that we need to spend more time with Swift AI 00:50:50.520 |
and the optimized designs and re-evaluate the training 00:51:13.840 |
Yeah, so we have to get those two things right. 00:51:18.560 |
so that we can build on it and take a program instead of-- 00:51:32.840 |
I think the Python model is to not check things 00:51:35.040 |
and see what things crashed at runtime, if I understand. 00:51:35.040 |
I mean, I think that there's a couple of different options 00:51:42.160 |
But again, one of the things that PyTorch is doing 00:51:45.240 |
is they're doing more conversions with dtypes. 00:51:49.920 |
it will actually promote an int8 into an int32, for example. 00:51:54.720 |
But that's the kind of thing that is just very nice. 00:51:59.440 |
And it just eliminates a certain kind of error. 00:52:01.520 |
On the other hand, it's kind of like broadcasting where 00:52:03.680 |
it makes certain things just work at the cost of potentially 00:52:10.760 |
I think if you do things that don't make sense, 00:52:13.640 |
like you try to do a floating point operation on an integer, 00:52:22.240 |
towards a much more runtime-centric approach. 00:52:33.360 |
of the major benefits of having a fast host language 00:52:38.120 |
And so now you can have super dynamic abstractions 00:52:43.600 |
In PyTorch, you do get a pretty clear runtime error. 00:52:47.440 |
If there's a type mismatch, it doesn't just crash. 00:52:49.720 |
It will tell you what to expect and what it got. 00:52:55.720 |
I think there are other ways around sort of encoding things 00:53:02.560 |
into the static-type system that you have to adhere to. 00:53:06.280 |
I think Adam's work on transitioning perfectly 00:53:08.360 |
shows that you can still get a lot of benefits 00:53:10.560 |
of static analysis without necessarily encoding 00:53:16.400 |
That said, I think it's still an open question as to how far 00:53:18.840 |
we can really push that and where we end up landing. 00:53:22.200 |
Yeah, I think it's just a really, really great opportunity 00:53:27.960 |
to re-evaluate these things as other pieces are coming together. 00:53:31.760 |
Maxim asks, why is runtime checking preferable 00:53:46.560 |
And so as we're trying to iterate on the programming 00:53:49.160 |
model, making sure that things are as dynamic as you want them 00:53:54.760 |
And then we should think about how static analysis can 00:54:11.160 |
because we're actively re-implementing pieces 00:54:15.080 |
So that's actually a lot more complicated than it sounds. 00:54:18.520 |
I would just say that MLIR is a broad scale compiler 00:54:28.000 |
technology that solves lots of problems. XLA, as a name, 00:54:35.160 |
And so I wouldn't over-index on the number of letters, I guess. 00:54:45.800 |
And once Swift-TensorFlow sits on top of MLIR, 00:55:03.080 |
And things that are XLA are changing in their implementation 00:55:07.880 |
And so there's a big investment going on in all these pieces 00:55:14.720 |
if you ignore which letters get attached to them, 00:55:16.760 |
the effort here culminates in a much more flexible 00:55:19.800 |
code generation stack, support for dynamic shapes, 00:55:25.680 |
It's just that different pieces in this very complicated 00:55:28.880 |
technology come together at different points as well. 00:55:31.640 |
I don't know what the marketing-- the crack compiler marketing 00:55:39.120 |
team will end up labeling the resultant kind. 00:55:46.560 |
went into-- unless there's any pressing questions, 00:55:55.560 |
I think next week, Mark will be up talking about some of his work 00:56:03.600 |
There's some pretty good things that Mark's been up to there. 00:56:06.520 |
It's also exciting that AD is getting upstreamed to master, 00:56:13.240 |
Have a great week, and see you all next week.