Jeremy Howard demo for Mojo launch

00:00:00.000 | But instead of telling you all this now, I'd love to introduce Jeremy Howard, who will

00:00:04.480 | show you how Mojo works in practice.

00:00:10.400 | Thanks Chris.

00:00:11.400 | You know, I realised it's been 30 years since I first trained a neural network.

00:00:16.740 | And to be honest, I haven't been that satisfied with any of the languages that I've been able

00:00:21.320 | to use throughout that time.

00:00:23.440 | In fact, I complained to Chris about this when I first met him years ago.

00:00:28.520 | And Chris has been telling me ever since, don't worry Jeremy, one day we are going to fix this.

00:00:34.780 | The thing that I really want to fix is to see a language where I can write performant,

00:00:41.720 | flexible, hardcore code, but it should also be concise, readable, understandable code.

00:00:48.960 | And I think that actually Chris and his team here have done it with this new language called

00:00:55.100 | Mojo.

00:00:56.100 | Mojo is actually a super set of Python, so I can use my Python code.

00:01:00.960 | Here, check this out, I'll show you what I mean.

00:01:03.720 | So here is a notebook, right?

00:01:05.920 | But this notebook is no normal notebook, this is a Mojo notebook.

00:01:09.800 | And by way of demonstration, because this is the most fundamental foundational algorithm

00:01:15.620 | in deep learning, we're going to look at matrix multiplication.

00:01:17.820 | Now of course Mojo's got its own, we don't need to write our own, but we're just showing

00:01:21.480 | you we can actually write our own Hive performance matrix multiplication.

00:01:26.200 | I start by comparing to Python, that's very easy to do because we can just type percent

00:01:29.840 | Python in a Mojo notebook, and then it actually is going to run it on the CPython interpreter.

00:01:34.860 | So here's our basic matrix multiplication, go across the rows and the columns, multiply

00:01:39.040 | together, add it up, let's write a little matrix and a little benchmark and try it out,

00:01:45.240 | and oh dear, 0.005 gigaflops.

00:01:48.600 | That's not great.

00:01:49.600 | How do we speed it up?

00:01:51.000 | Well actually, believe it or not, we just take that code.

00:01:53.560 | We copy and paste it into a new cell without the percent Python, and because Mojo is a

00:01:59.120 | superset of Python, this runs too, but this times it runs in Mojo, not in Python.

00:02:05.120 | And immediately we get an eight and a half times speed up.

00:02:09.680 | Now there's a lot of performance left on the table here, and to go faster, we're going

00:02:14.480 | to want a nice, fast, compact matrix type.

00:02:18.720 | Of course, we can use the one that Mojo provides for us, but just to show you that we can,

00:02:23.000 | here we've implemented it from scratch.

00:02:25.140 | So we're actually creating a struct here, so this is nice, compact in memory, and it's

00:02:29.080 | got the normal things we're used to, dunder get item and dunder set item, and stuff you

00:02:33.080 | don't expect to see in Python, like alloc and like SIMD.

00:02:37.360 | And as you can see, the whole thing fits in about a page of code, a screen of code.

00:02:41.160 | So that's our matrix.

00:02:43.400 | And so to use it, we copy and paste the code again, but this time just add a type annotation.

00:02:48.320 | These are matrices.

00:02:49.600 | And now it's a 300 times speed up.

00:02:52.280 | Suddenly things are looking pretty amazing, but there's a lot more we can do.

00:02:56.880 | We can look at doing, if our CPU supports it, say eight elements at a time using SIMD

00:03:04.160 | instructions.

00:03:05.280 | It's a bit of a mess to do that manually.

00:03:06.880 | There's quite a bit of code, but we can do it manually, and we get a 570 times speed

00:03:12.240 | up.

00:03:13.240 | But better still, we can just call vectorize.

00:03:15.440 | So just write a dot product operation, call vectorize, and it will automatically handle

00:03:19.960 | it on SIMD for us with the same performance speed up.

00:03:23.880 | So that's going to be happening in the innermost loop.

00:03:25.800 | We're going to be using SIMD.

00:03:27.400 | And in the outermost loop, what if we just call parallelize?

00:03:31.380 | This is something we can do.

00:03:32.380 | And now suddenly the rows are going to be done on separate cores for a 2000 times speed

00:03:37.120 | up.

00:03:38.120 | So we've only got four cores going on here, so it's not huge.

00:03:40.060 | If you've got more cores, it'll be much bigger.

00:03:42.420 | This is something you absolutely can't do with Python.

00:03:45.460 | You can do some very, very basic parallel processing with Python, but it's literally

00:03:50.400 | creating separate processes and having to move memory around, and it's pretty nasty.

00:03:54.840 | And there's all kinds of complexities around the global interpreter lock, and so forth

00:03:58.400 | as well.

00:03:59.400 | This is how easy it is in Mojo.

00:04:01.240 | And so suddenly we've got a 2000 times faster matrix multiplication written from scratch.

00:04:06.300 | We can also make sure that we're using the cache really effectively by doing tiling,

00:04:11.120 | so doing a few bits of memory that's close to each other at a time and reusing them.

00:04:16.160 | Tiling is as easy as creating this little tiling function and then calling it to tile

00:04:22.080 | our function.

00:04:23.600 | So now we've got something that is parallelized, tiled, and vectorized for a 2170 times speed

00:04:29.840 | up over Python.

00:04:31.320 | We can also add unrolling for a 2200 times speed up.

00:04:36.160 | So vectorized unroll is already built into Mojo, so we don't even have to write that.

00:04:41.480 | Now there's a lot of complexity here though, like what tile size do we use, how many processors,

00:04:47.400 | what Cindy size, all this mess to worry about.

00:04:51.160 | And each different person you deploy to is going to have different versions of these.

00:04:54.520 | They'll have different memory, they're going to have different CPUs, and so forth.

00:04:59.380 | No worries, look at what you can do.

00:05:01.860 | We can create an auto-tuned version by simply calling auto-tune.

00:05:05.880 | So if we want an auto-tune tile size, we just say hey Mojo, try these different tile sizes

00:05:10.400 | for us, figure out which one's the fastest, compile the fastest version for us, cache

00:05:15.800 | it for this individual computer, and then use that parallelized, tiled, unrolled, vectorized

00:05:24.420 | for a 4164 times speed up.

00:05:27.240 | So this is pretty remarkable, right?

00:05:29.400 | Now it's not just linear algebra stuff, we can do really iterative stuff like calculating

00:05:35.360 | Mandelbrot.

00:05:36.580 | So we can create our own complex number type and it's going to be a struct, so again it's

00:05:40.140 | going to be compact in memory.

00:05:41.880 | It looks like absolutely standard Python, as you can see, multiplying, subtracting using

00:05:47.240 | the operations.

00:05:49.440 | And to create the Mandelbrot kernel, we just take the classic Mandelbrot set equation,

00:05:54.680 | iterative equation, and pop it in Python here, and then we can call it a bunch of times in

00:06:00.920 | a loop, returning at the appropriate time to compute the Mandelbrot set.

00:06:06.960 | That's all very well and good.

00:06:08.320 | Did it work?

00:06:09.320 | Well, it'd be nice to look at it.

00:06:10.840 | So how would you look at it?

00:06:11.840 | Well, it'd be nice to use Matplotlib.

00:06:13.680 | Oh, no worries.

00:06:15.520 | Every single Python library works in Mojo.

00:06:19.560 | And you can import it.

00:06:20.560 | Check this out.

00:06:21.560 | Plot is import the Python module Matplotlib, np is import the module numpy, and the rest

00:06:27.880 | of it, this is actually Mojo code, but it's also Python code, and it works.

00:06:36.400 | And I don't know if you remember, but Chris actually said the Mandelbrot set is 35,000

00:06:40.560 | times faster than Python, and that's because we can also do an even faster version where

00:06:45.320 | we're handling it with SIMD, and we can actually create the kind of iterative algorithm that

00:06:51.400 | you just can't do in Python, even with the help of stuff like NumPy.

00:06:57.360 | This is something which is really unique to Mojo.

00:07:01.280 | So we now have something here, which is incredibly flexible, incredibly fast, can utilize the

00:07:07.200 | hardware you have no matter what it is, and is really understandable to Python programmers

00:07:13.380 | like you and me.

00:07:14.600 | I think finally, we're at a point where we are going to have something where I actually

00:07:18.800 | enjoy writing neural networks.

00:07:22.760 | Wow, how awesome was that?