But instead of telling you all this now, I'd love to introduce Jeremy Howard, who will show you how Mojo works in practice. Thanks Chris. You know, I realised it's been 30 years since I first trained a neural network. And to be honest, I haven't been that satisfied with any of the languages that I've been able to use throughout that time.
In fact, I complained to Chris about this when I first met him years ago. And Chris has been telling me ever since: don't worry Jeremy, one day we're going to fix this. The thing I really want is a language where I can write performant, flexible, hardcore code, but which is also concise, readable, understandable code.
And I think that actually Chris and his team here have done it with this new language called Mojo. Mojo is actually a superset of Python, so I can use my Python code. Here, check this out, I'll show you what I mean. So here is a notebook, right? But this notebook is no normal notebook, this is a Mojo notebook.
And by way of demonstration, because this is the most fundamental, foundational algorithm in deep learning, we're going to look at matrix multiplication. Now of course Mojo's got its own matrix multiplication, so we don't need to write our own, but we're just showing you that we can actually write our own high-performance matrix multiplication.
I start by comparing to Python. That's very easy to do, because we can just type %%python in a Mojo notebook, and then it actually runs on the CPython interpreter. So here's our basic matrix multiplication: go across the rows and the columns, multiply together, add it up. Let's create a little matrix and a little benchmark and try it out, and oh dear, 0.005 gigaflops.
That's not great. How do we speed it up? Well actually, believe it or not, we just take that code. We copy and paste it into a new cell without the %%python, and because Mojo is a superset of Python, this runs too, but this time it runs in Mojo, not in Python.
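For reference, the kernel in question is a plain triple loop, roughly like this sketch. The name matmul_untyped and the .rows/.cols attributes are illustrative rather than verbatim from the demo, but the key point holds: the body is valid in both Python and Mojo, which is what makes the copy-and-paste trick work.

```mojo
# Naive matrix multiplication. This exact code runs under the CPython
# interpreter (in a %%python cell) and, unchanged, as Mojo.
def matmul_untyped(C, A, B):
    for m in range(C.rows):          # rows of the output
        for k in range(A.cols):      # shared inner dimension
            for n in range(C.cols):  # columns of the output
                C[m, n] += A[m, k] * B[k, n]
```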
And immediately we get an eight and a half times speed up. Now there's a lot of performance left on the table here, and to go faster, we're going to want a nice, fast, compact matrix type. Of course, we can use the one that Mojo provides for us, but just to show you that we can, here we've implemented it from scratch.
So we're actually creating a struct here, so this is nice and compact in memory, and it's got the normal things we're used to, the dunder methods __getitem__ and __setitem__, and stuff you don't expect to see in Python, like alloc and like SIMD. And as you can see, the whole thing fits in about a page of code, a screen of code.
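A cut-down sketch of such a struct, written in launch-era Mojo syntax (the standard library names and signatures here, like DTypePointer and simd_load, are from that era and have since evolved; the notebook's real version has a few more methods):

```mojo
from memory.unsafe import DTypePointer
from random import rand

struct Matrix:
    var data: DTypePointer[DType.float32]  # flat, dense float32 storage
    var rows: Int
    var cols: Int

    fn __init__(inout self, rows: Int, cols: Int):
        # alloc: raw allocation, no Python object overhead per element
        self.data = DTypePointer[DType.float32].alloc(rows * cols)
        rand(self.data, rows * cols)
        self.rows = rows
        self.cols = cols

    fn __getitem__(self, y: Int, x: Int) -> Float32:
        return self.load[1](y, x)

    fn __setitem__(self, y: Int, x: Int, val: Float32):
        self.store[1](y, x, val)

    fn load[nelts: Int](self, y: Int, x: Int) -> SIMD[DType.float32, nelts]:
        # read nelts contiguous elements in one vector load
        return self.data.simd_load[nelts](y * self.cols + x)

    fn store[nelts: Int](self, y: Int, x: Int, val: SIMD[DType.float32, nelts]):
        self.data.simd_store[nelts](y * self.cols + x, val)
```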
So that's our matrix. And so to use it, we copy and paste the code again, but this time just add a type annotation: these are matrices. And now it's a 300 times speed up. Suddenly things are looking pretty amazing, but there's a lot more we can do. We can process, if our CPU supports it, say eight elements at a time using SIMD instructions.
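The typed version is the same loop nest with Matrix annotations, which is what lets Mojo compile it statically instead of interpreting it (a sketch, same naming caveats as above):

```mojo
fn matmul_typed(C: Matrix, A: Matrix, B: Matrix):
    # identical loop structure to the Python version; the type
    # annotations are what let Mojo skip the interpreter entirely
    for m in range(C.rows):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]
```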
It's a bit of a mess to do that manually. There's quite a bit of code, but we can do it manually, and we get a 570 times speed up. But better still, we can just call vectorize. So we just write a dot product operation, call vectorize, and it will automatically handle the SIMD for us with the same performance speed up.
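In launch-era Mojo the vectorized inner loop looked roughly like this; simdwidthof, vectorize, and the old vectorize[width, func](size) argument order are from that era's standard library and have since changed:

```mojo
from algorithm import vectorize
from sys.info import simdwidthof

# number of float32 lanes the host CPU can process at once
alias nelts = simdwidthof[DType.float32]()

fn matmul_vectorized(C: Matrix, A: Matrix, B: Matrix):
    for m in range(C.rows):
        for k in range(A.cols):
            @parameter
            fn dot[nelts: Int](n: Int):
                # one multiply-add over nelts columns at a time
                C.store[nelts](m, n,
                    C.load[nelts](m, n) + A[m, k] * B.load[nelts](k, n))
            # vectorize runs dot in SIMD-width chunks and handles
            # the leftover columns for us
            vectorize[nelts, dot](C.cols)
```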
So that's going to be happening in the innermost loop. We're going to be using SIMD. And in the outermost loop, what if we just call parallelize? This is something we can do. And now suddenly the rows are going to be done on separate cores for a 2000 times speed up.
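Parallelizing the outer loop is then one more wrapper, sketched here in the same launch-era style (early versions also threaded a Runtime argument through these calls, omitted here):

```mojo
from algorithm import parallelize

fn matmul_parallelized(C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        for k in range(A.cols):
            @parameter
            fn dot[nelts: Int](n: Int):
                C.store[nelts](m, n,
                    C.load[nelts](m, n) + A[m, k] * B.load[nelts](k, n))
            vectorize[nelts, dot](C.cols)
    # each row of the output is computed on a separate core
    parallelize[calc_row](C.rows)
```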
So we've only got four cores going on here, so it's not huge. If you've got more cores, it'll be much bigger. This is something you absolutely can't do with Python. You can do some very, very basic parallel processing with Python, but it's literally creating separate processes and having to move memory around, and it's pretty nasty.
And there's all kinds of complexities around the global interpreter lock, and so forth as well. This is how easy it is in Mojo. And so suddenly we've got a 2000 times faster matrix multiplication written from scratch. We can also make sure that we're using the cache really effectively by doing tiling: working on a few blocks of memory that are close to each other at a time and reusing them.
Tiling is as easy as creating this little tiling function and then calling it to tile our function. So now we've got something that is parallelized, tiled, and vectorized for a 2170 times speed up over Python. We can also add unrolling for a 2200 times speed up. And vectorize_unroll is already built into Mojo, so we don't even have to write that.
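That little tiling function is only a few lines; roughly this, in launch-era syntax (Tile2DFunc and the capturing function-type notation are from that era):

```mojo
# a Tile2DFunc takes the tile shape as parameters and a tile origin
# as arguments
alias Tile2DFunc = fn[Int, Int] (Int, Int) capturing -> None

fn tile[tiled_fn: Tile2DFunc, tile_x: Int, tile_y: Int](end_x: Int, end_y: Int):
    # walk the iteration space in (tile_x, tile_y) blocks, so each
    # block's data stays hot in cache while we reuse it
    for y in range(0, end_y, tile_y):
        for x in range(0, end_x, tile_x):
            tiled_fn[tile_x, tile_y](x, y)
```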
Now there's a lot of complexity here though: what tile size do we use, how many processors, what SIMD width, all this mess to worry about. And every machine you deploy to is going to have different versions of these. They'll have different memory, they're going to have different CPUs, and so forth.
No worries, look at what you can do. We can create an autotuned version by simply calling autotune. So if we want an autotuned tile size, we just say: hey Mojo, try these different tile sizes for us, figure out which one's the fastest, compile the fastest version for us, cache it for this individual computer, and then use that parallelized, tiled, unrolled, vectorized version for a 4164 times speed up.
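Autotuning was a built-in feature of early Mojo (it was later removed from the language), and the core of it looked roughly like this sketch; the surrounding search-and-benchmark boilerplate, and the tiled loop body itself, are omitted here:

```mojo
from autotune import autotune

@adaptive
fn matmul_autotune_impl(C: Matrix, A: Matrix, B: Matrix):
    # the compiler builds one variant per candidate value, benchmarks
    # them, and caches the fastest one for this particular machine
    alias tile_size = autotune(1, 2, 4, 8, 16, 32)
    # ... parallelized, tiled, unrolled, vectorized body using tile_size ...
```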
So this is pretty remarkable, right? Now it's not just linear algebra stuff; we can do really iterative stuff, like calculating the Mandelbrot set. So we can create our own complex number type, and it's going to be a struct, so again it's going to be compact in memory. It looks like absolutely standard Python, as you can see, multiplying and subtracting using the standard operators.
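A minimal version of such a complex-number struct, again in launch-era syntax (the register_passable decorator and Self-returning initializer are that era's idiom for compact value types; the demo's actual type had more to it):

```mojo
@register_passable("trivial")
struct Complex:
    var re: Float32
    var im: Float32

    fn __init__(re: Float32, im: Float32) -> Self:
        return Self {re: re, im: im}

    fn __add__(self, rhs: Complex) -> Complex:
        return Complex(self.re + rhs.re, self.im + rhs.im)

    fn __mul__(self, rhs: Complex) -> Complex:
        # (a + bi)(c + di) = (ac - bd) + (ad + bc)i
        return Complex(self.re * rhs.re - self.im * rhs.im,
                       self.re * rhs.im + self.im * rhs.re)

    fn squared_norm(self) -> Float32:
        return self.re * self.re + self.im * self.im
```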
And to create the Mandelbrot kernel, we just take the classic iterative Mandelbrot equation, z = z² + c, and pop it in here, and then we can call it a bunch of times in a loop, returning at the appropriate time, to compute the Mandelbrot set. That's all very well and good.
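The kernel itself is just the escape-time loop; the function name and the MAX_ITERS value below are illustrative, not verbatim from the demo:

```mojo
alias MAX_ITERS = 200  # iteration cap; a demo-style choice, not canonical

fn mandelbrot_kernel(c: Complex) -> Int:
    var z = c
    for i in range(MAX_ITERS):
        # the classic iteration: z -> z^2 + c
        z = z * z + c
        # once |z| > 2 the orbit provably diverges, so return early
        if z.squared_norm() > 4:
            return i
    return MAX_ITERS
```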
Did it work? Well, it'd be nice to look at it. So how would you look at it? It'd be nice to use Matplotlib. No worries: every single Python library works in Mojo, and you can just import it. Check this out: plt is the imported Python module matplotlib, np is the imported module numpy, and the rest of it, this is actually Mojo code, but it's also Python code, and it works.
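The interop is literally a module import; a minimal sketch (the notebook's actual plotting code is longer, and plot_something here is a made-up name):

```mojo
from python import Python

fn plot_something() raises:
    # import_module loads the real CPython packages; np and plt behave
    # like ordinary Python objects from here on
    let np = Python.import_module("numpy")
    let plt = Python.import_module("matplotlib.pyplot")

    let x = np.arange(100)
    plt.plot(x, np.sin(x / 10.0))
    plt.show()
```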
And I don't know if you remember, but Chris actually said Mojo runs the Mandelbrot set 35,000 times faster than Python. That's because we can also do an even faster version where we're handling it with SIMD, and we can actually create the kind of iterative algorithm that you just can't do in Python, even with the help of stuff like NumPy.
This is something which is really unique to Mojo. So we now have something here which is incredibly flexible, incredibly fast, can utilize the hardware you have no matter what it is, and is really understandable to Python programmers like you and me. I think, finally, we're at a point where we have something I'm actually going to enjoy writing neural networks in.
Wow, how awesome was that?