Jeremy Howard demo for Mojo launch
00:00:00.000 |
But instead of telling you all this now, I'd love to introduce Jeremy Howard, who will demo it for you. 00:00:11.400 |
You know, I realised it's been 30 years since I first trained a neural network. 00:00:16.740 |
And to be honest, I haven't been that satisfied with any of the languages that I've been able to use. 00:00:23.440 |
In fact, I complained to Chris about this when I first met him years ago. 00:00:28.520 |
And Chris has been telling me ever since, don't worry Jeremy, one day we are going to fix this. 00:00:34.780 |
The thing that I really want to fix is to see a language where I can write performant, 00:00:41.720 |
flexible, hardcore code, but it should also be concise, readable, understandable code. 00:00:48.960 |
And I think that actually Chris and his team here have done it with this new language called 00:00:56.100 |
Mojo. Mojo is actually a superset of Python, so I can use my Python code. 00:01:00.960 |
Here, check this out, I'll show you what I mean. 00:01:05.920 |
But this notebook is no normal notebook, this is a Mojo notebook. 00:01:09.800 |
And by way of demonstration, because this is the most fundamental foundational algorithm 00:01:15.620 |
in deep learning, we're going to look at matrix multiplication. 00:01:17.820 |
Now of course Mojo's got its own, we don't need to write our own, but we're just showing 00:01:21.480 |
you we can actually write our own high-performance matrix multiplication. 00:01:26.200 |
I start by comparing to Python, that's very easy to do because we can just type percent 00:01:29.840 |
Python in a Mojo notebook, and then it actually is going to run it on the CPython interpreter. 00:01:34.860 |
So here's our basic matrix multiplication, go across the rows and the columns, multiply 00:01:39.040 |
together, add it up, let's write a little matrix and a little benchmark and try it out. 00:01:51.000 |
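The notebook code itself isn't captured in this transcript, but the triple-loop algorithm he describes looks roughly like this in plain Python (the 128x128 size and the timing harness are illustrative):

```python
import random
import time

def matmul(C, A, B):
    # Go across the rows and the columns, multiply together, add it up.
    for m in range(len(A)):
        for n in range(len(B[0])):
            for k in range(len(B)):
                C[m][n] += A[m][k] * B[k][n]

# A little matrix and a little benchmark.
n = 128
A = [[random.random() for _ in range(n)] for _ in range(n)]
B = [[random.random() for _ in range(n)] for _ in range(n)]
C = [[0.0] * n for _ in range(n)]

start = time.perf_counter()
matmul(C, A, B)
print(f"{time.perf_counter() - start:.3f} seconds")
```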
Well actually, believe it or not, we just take that code. 00:01:53.560 |
We copy and paste it into a new cell without the percent Python, and because Mojo is a 00:01:59.120 |
superset of Python, this runs too, but this time it runs in Mojo, not in Python. 00:02:05.120 |
And immediately we get an eight and a half times speed up. 00:02:09.680 |
Now there's a lot of performance left on the table here, and to go faster, we're going 00:02:18.720 |
to need our own matrix type. Of course, we can use the one that Mojo provides for us, but just to show you that we can, we'll write our own. 00:02:25.140 |
So we're actually creating a struct here, so this is nice, compact in memory, and it's 00:02:29.080 |
got the normal things we're used to, dunder get item and dunder set item, and stuff you 00:02:33.080 |
don't expect to see in Python, like alloc and like SIMD. 00:02:37.360 |
And as you can see, the whole thing fits in about a page of code, a screen of code. 00:02:43.400 |
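The struct itself isn't reproduced in the transcript; a plain-Python stand-in with the dunder methods he mentions might look like this (a flat list stands in for the alloc'd buffer, and the SIMD parts have no plain-Python equivalent):

```python
class Matrix:
    """Row-major matrix over a flat buffer, indexed with (m, n) pairs."""
    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.data = [0.0] * (rows * cols)  # stand-in for the alloc'd memory

    def __getitem__(self, idx):  # dunder get item
        m, n = idx
        return self.data[m * self.cols + n]

    def __setitem__(self, idx, value):  # dunder set item
        m, n = idx
        self.data[m * self.cols + n] = value
```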
And so to use it, we copy and paste the code again, but this time just add a type annotation. 00:02:52.280 |
Suddenly things are looking pretty amazing, but there's a lot more we can do. 00:02:56.880 |
We can look at doing, if our CPU supports it, say eight elements at a time using SIMD. 00:03:06.880 |
There's quite a bit of code, but we can do it manually, and we get a 570 times speed up. 00:03:13.240 |
But better still, we can just call vectorize. 00:03:15.440 |
So just write a dot product operation, call vectorize, and it will automatically handle 00:03:19.960 |
it on SIMD for us with the same performance speed up. 00:03:23.880 |
So that's going to be happening in the innermost loop. 00:03:27.400 |
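Mojo's SIMD types and vectorize can't be written in plain Python, but the shape of the change can: the innermost loop collapses into one vector operation. A hedged NumPy analogue of that restructuring:

```python
import numpy as np

def matmul_vectorized(A, B):
    rows, inner = A.shape
    C = np.zeros((rows, B.shape[1]))
    for m in range(rows):
        for k in range(inner):
            # The whole innermost loop becomes a single vector update,
            # the step Mojo's vectorize would split into SIMD-width chunks.
            C[m, :] += A[m, k] * B[k, :]
    return C

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(matmul_vectorized(A, B), A @ B)
```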
And in the outermost loop, what if we just call parallelize? 00:03:32.380 |
And now suddenly the rows are going to be done on separate cores for a 2000 times speed up. 00:03:38.120 |
So we've only got four cores going on here, so it's not huge. 00:03:40.060 |
If you've got more cores, it'll be much bigger. 00:03:42.420 |
This is something you absolutely can't do with Python. 00:03:45.460 |
You can do some very, very basic parallel processing with Python, but it's literally 00:03:50.400 |
creating separate processes and having to move memory around, and it's pretty nasty. 00:03:54.840 |
And there's all kinds of complexities around the global interpreter lock, and so forth. 00:04:01.240 |
And so suddenly we've got a 2000 times faster matrix multiplication written from scratch. 00:04:06.300 |
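For contrast, here's roughly what the process-based parallelism described above looks like in stock Python: each row runs in a separate worker process, and the inputs are copied across the process boundary. The sizes and the four-worker pool are illustrative.

```python
import numpy as np
from multiprocessing import Pool

np.random.seed(0)  # so parent and spawned workers build identical matrices
A = np.random.rand(256, 256)
B = np.random.rand(256, 256)

def row_times_B(m):
    # Runs in a worker process; A and B were copied or re-created there.
    return A[m] @ B

if __name__ == "__main__":
    with Pool(4) as pool:  # four cores, as in the demo
        C = np.vstack(pool.map(row_times_B, range(A.shape[0])))
    assert np.allclose(C, A @ B)
```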
We can also make sure that we're using the cache really effectively by doing tiling, 00:04:11.120 |
so doing a few bits of memory that's close to each other at a time and reusing them. 00:04:16.160 |
Tiling is as easy as creating this little tiling function and then calling it to tile the computation. 00:04:23.600 |
So now we've got something that is parallelized, tiled, and vectorized for a 2170 times speed up. 00:04:31.320 |
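The tiling function from the demo isn't in the transcript; the idea itself, working on cache-sized blocks and reusing them before moving on, can be sketched in Python like this (the tile size of 16 is an arbitrary choice):

```python
import numpy as np

def matmul_tiled(A, B, tile=16):
    n = A.shape[0]  # assumes square matrices with n divisible by tile
    C = np.zeros((n, n))
    for m0 in range(0, n, tile):
        for n0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Multiply one pair of cache-sized blocks and accumulate.
                C[m0:m0+tile, n0:n0+tile] += (
                    A[m0:m0+tile, k0:k0+tile] @ B[k0:k0+tile, n0:n0+tile]
                )
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(matmul_tiled(A, B), A @ B)
```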
We can also add unrolling for a 2200 times speed up. 00:04:36.160 |
So vectorized unroll is already built into Mojo, so we don't even have to write that. 00:04:41.480 |
Now there's a lot of complexity here though, like what tile size do we use, how many processors, 00:04:47.400 |
what SIMD size, all this mess to worry about. 00:04:51.160 |
And each different person you deploy to is going to have different versions of these. 00:04:54.520 |
They'll have different memory, they're going to have different CPUs, and so forth. 00:05:01.860 |
We can create an auto-tuned version by simply calling auto-tune. 00:05:05.880 |
So if we want an auto-tune tile size, we just say hey Mojo, try these different tile sizes 00:05:10.400 |
for us, figure out which one's the fastest, compile the fastest version for us, cache 00:05:15.800 |
it for this individual computer, and then use that parallelized, tiled, unrolled, vectorized matrix multiplication. 00:05:29.400 |
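Mojo's autotune does this at compile time; a rough runtime sketch of the same idea, reusing the matmul_tiled sketch above (the candidate sizes are illustrative):

```python
import time

_best_tile = None  # cached result for this individual computer

def autotuned_matmul(A, B, candidates=(4, 8, 16, 32, 64)):
    global _best_tile
    if _best_tile is None:
        # Try each tile size once and keep whichever runs fastest here.
        timings = {}
        for tile in candidates:
            start = time.perf_counter()
            matmul_tiled(A, B, tile)  # from the tiling sketch above
            timings[tile] = time.perf_counter() - start
        _best_tile = min(timings, key=timings.get)
    return matmul_tiled(A, B, _best_tile)
```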
Now it's not just linear algebra stuff, we can do really iterative stuff like calculating the Mandelbrot set. 00:05:36.580 |
So we can create our own complex number type and it's going to be a struct, so again it's nice and compact in memory. 00:05:41.880 |
It looks like absolutely standard Python, as you can see, multiplying, subtracting using dunder methods. 00:05:49.440 |
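The struct isn't shown in the transcript; a plain-Python approximation of a complex type with those operator methods might look like this:

```python
class Complex:
    def __init__(self, re, im):
        self.re, self.im = re, im

    def __add__(self, other):  # dunder add
        return Complex(self.re + other.re, self.im + other.im)

    def __sub__(self, other):  # dunder sub
        return Complex(self.re - other.re, self.im - other.im)

    def __mul__(self, other):  # dunder mul
        return Complex(self.re * other.re - self.im * other.im,
                       self.re * other.im + self.im * other.re)

    def squared_norm(self):
        return self.re * self.re + self.im * self.im
```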
And to create the Mandelbrot kernel, we just take the classic Mandelbrot set equation, 00:05:54.680 |
iterative equation, and pop it in Python here, and then we can call it a bunch of times in 00:06:00.920 |
a loop, returning at the appropriate time to compute the Mandelbrot set. 00:06:21.560 |
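As a sketch of that kernel, here's the classic escape-time loop written against the Complex stand-in above (the iteration cap of 200 is a guess, not necessarily what the demo used):

```python
def mandelbrot_kernel(c, max_iters=200):
    # Iterate z = z*z + c and return how many steps it takes to escape.
    z = Complex(0.0, 0.0)
    for i in range(max_iters):
        z = z * z + c
        if z.squared_norm() > 4.0:  # escaped: |z| > 2
            return i
    return max_iters
```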
plt is importing the Python module matplotlib, np is importing the module numpy, and the rest 00:06:27.880 |
of it, this is actually Mojo code, but it's also Python code, and it works. 00:06:36.400 |
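The plotting cell might look roughly like this in plain Python, using the kernel sketched above; the bounds and resolution are guesses:

```python
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-2.0, 0.6, 960)
ys = np.linspace(-1.1, 1.1, 720)
img = np.array([[mandelbrot_kernel(Complex(x, y)) for x in xs] for y in ys])

plt.imshow(img, extent=(xs[0], xs[-1], ys[0], ys[-1]), origin="lower")
plt.show()
```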
And I don't know if you remember, but Chris actually said the Mandelbrot set is 35,000 00:06:40.560 |
times faster than Python, and that's because we can also do an even faster version where 00:06:45.320 |
we're handling it with SIMD, and we can actually create the kind of iterative algorithm that 00:06:51.400 |
you just can't do in Python, even with the help of stuff like NumPy. 00:06:57.360 |
This is something which is really unique to Mojo. 00:07:01.280 |
So we now have something here, which is incredibly flexible, incredibly fast, can utilize the 00:07:07.200 |
hardware you have no matter what it is, and is really understandable to Python programmers. 00:07:14.600 |
I think finally, we're at a point where we are going to have something where I actually want to write all of my code.