Unlocking Developer Productivity across CPU and GPU with MAX: Chris Lattner

00:00:15.080 |
I'm here to talk to you about Modular and accelerating 00:00:33.680 |
a lot of ways to make it super easy to build a prototype 00:00:39.640 |
But despite the availability of all these different endpoints, 00:00:45.320 |
Sometimes you might want to go and control your data 00:00:48.840 |
instead of sending your data to somebody else. 00:00:50.920 |
Sometimes you might want to integrate it into your own security 00:00:53.980 |
because you've got your critical company data in your model, 00:00:57.260 |
and you don't want to fine-tune it somewhere else. 00:01:02.780 |
A lot of things in building proprietary models 00:01:09.860 |
Of course, the inference endpoints are expensive, 00:01:14.020 |
Sometimes there's hardware out there that's really interesting, 00:01:16.300 |
and you want to explore beyond the mainstream, 00:01:21.340 |
And if you care about any of these things, what you need to do 00:01:29.120 |
Well, if you have-- many of you have explored this, I'm sure. 00:01:33.800 |
It used to be that we had things like PyTorch and TensorFlow 00:01:37.680 |
But as inference became more important, the world shifted. 00:01:40.800 |
First, we got ONNX, TensorRT, things like this. 00:01:43.760 |
And today, we have an explosion of these different frameworks, 00:01:50.720 |
And that's cool if you care about that one model. 00:01:52.440 |
But if you have many different things you want to deploy 00:01:54.200 |
and work with, it's very frustrating 00:01:55.720 |
to have to switch between all these different technologies. 00:02:00.640 |
You all know there's this gigantic array of different technologies 00:02:04.080 |
that get used to build real-world things in production. 00:02:07.160 |
And of course, none of these are really actually designed for Gen AI. 00:02:11.720 |
So my concern about this, my objection to the status quo, 00:02:15.440 |
is that this fragmentation slows down getting the research 00:02:18.600 |
and the innovations coming into Gen AI into your products. 00:02:22.440 |
And I think we've seen so many of these demos. 00:02:24.560 |
Last year was really the year of the Gen AI demo. 00:02:27.520 |
But still, we're struggling to get Gen AI into products 00:02:39.880 |
If you don't, let's sympathize with the plight of the AI engineer. 00:02:43.760 |
Because y'all, these folks that are building this, 00:02:47.640 |
have new models and optimizations coming out every week. 00:02:52.160 |
Every product needs to be enhanced with Gen AI. 00:02:54.000 |
This is not one thing that we're getting dumped on. 00:02:56.480 |
And there's so much to do, we can't even keep up. 00:03:02.520 |
And of course, once you get something that actually works, 00:03:04.920 |
the costs end up making it very difficult to scale these things, 00:03:07.440 |
because getting things into production means suddenly you're paying on a per-unit basis. 00:03:15.200 |
We should look at the concerns and look at the challenges faced here. 00:03:28.880 |
And so I'll give you a quick intro of what we're doing and kind of our approach on this. 00:03:39.400 |
We have brought together some of the world's experts that built all of these things. 00:03:45.200 |
We built compilers like LLVM and MLIR and XLA and all of these different things. 00:03:51.080 |
So what I can say about that is that we learned a lot, and I apologize, because we know why 00:03:59.640 |
it is so frustrating to use all these things. 00:04:01.640 |
But really, the world looked very different five years ago. 00:04:11.640 |
And so our goal is to make it so you can own your AI. 00:04:20.560 |
You can do this, and it can be much easier than how the current systems work today. 00:04:28.040 |
Well, what we're doing is really going back to the basics. 00:04:31.080 |
We're bringing together the best-in-class technologies into one stack, not one solution per model. 00:04:38.640 |
Our goal is to lift up Python developers and PyTorch users. 00:04:42.320 |
This is where the entire industry is, and so we want to work with existing people. 00:04:45.960 |
We're not trying to say, hey, ditch everything you know and try something new. 00:04:50.140 |
We want to gradually teach and give folks new tools so they can have superpowers, so they 00:04:56.660 |
And finally, so I spent a lot of time at Apple. 00:05:01.140 |
Like, you want to build on top of infrastructure. 00:05:02.500 |
You do not want to have to be experts in the infrastructure. 00:05:04.500 |
And this is the way all of this stuff should work. 00:05:07.500 |
And unfortunately, it's just not the case today in AI. 00:05:10.180 |
And so at Modular, we're building this technology called MAX. 00:05:17.740 |
One is an AI framework, which I'll spend a bunch of time on. 00:05:29.740 |
We're not going to spend a lot of time talking about that today. 00:05:31.740 |
And so if you dive into this AI framework, well, we see it as two things. 00:05:43.620 |
And both halves of this are really important. 00:05:46.140 |
And MAX is currently very focused on inference. 00:05:48.940 |
And so these are areas where PyTorch is challenging at times. 00:05:53.340 |
This is where Gen AI is driving us crazy with cost and complexity. 00:05:56.940 |
And so really focusing on this problem is something that we're all about. 00:06:01.900 |
The other thing, as I said before, is Python. 00:06:07.820 |
We also have other options, including C++, which we'll talk about later. 00:06:13.580 |
Well, as I said, we work with PyTorch out of the box. 00:06:15.820 |
You can bring your models, and your model works. 00:06:17.260 |
We can talk to the wide array of PyTorchy things, 00:06:20.780 |
like ONNX and TorchScript and torch.compile and all this stuff. 00:06:28.380 |
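As a concrete illustration of that interop surface, here is the kind of standard PyTorch export code being referred to. This is a sketch: the model and file names are made up, and none of it is MAX-specific code.

```python
import torch
import torch.nn as nn

# Stand-in model; any torch.nn.Module you already have works the same way.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
example_input = torch.randn(1, 512)

# TorchScript: trace into a serialized graph an inference stack can load.
traced = torch.jit.trace(model, example_input)
traced.save("model.pt")

# ONNX: export the same model as an ONNX graph.
torch.onnx.export(model, example_input, "model.onnx", opset_version=17)

# torch.compile: keep eager ergonomics, hand the graph to a compiler backend.
compiled = torch.compile(model)
output = compiled(example_input)
```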
If you want to go deeper, you can use native APIs. 00:06:30.780 |
Native APIs are great if you speak the language of KV caches and paged attention, 00:06:37.020 |
and you care about pushing the state of the art of LLMs and other Gen AI techniques. 00:06:41.740 |
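For readers who don't already speak that language: a KV cache stores the keys and values of every previously generated token so each decode step only attends against the cache instead of recomputing the whole prefix, and paged attention takes it further by keeping that cache in fixed-size blocks allocated on demand. A minimal single-head NumPy sketch of the caching idea follows; it is illustrative only, not MAX's or any particular library's API.

```python
import numpy as np

d = 64                       # head dimension
k_cache = np.zeros((0, d))   # grows by one row per generated token
v_cache = np.zeros((0, d))

def decode_step(x):
    """Attend the new token against everything cached so far."""
    global k_cache, v_cache
    q, k, v = x, x, x                      # real models apply learned projections here
    k_cache = np.vstack([k_cache, k[None, :]])
    v_cache = np.vstack([v_cache, v[None, :]])
    scores = k_cache @ q / np.sqrt(d)      # one dot product per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache               # weighted sum of cached values

out = decode_step(np.random.randn(d))
```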
And also, MAX is very different in that it really rebuilds a ton of the stack. 00:06:49.020 |
We do not build on top of cuDNN and the NVIDIA libraries, nor on top of the Intel libraries. 00:06:55.980 |
We replace all that with a single consistent stack, which is a really different approach. 00:07:02.860 |
So what you get is you get a whole bunch of technology that you don't have to worry about. 00:07:06.380 |
And so, again, as a next generation technology, you get a lot of fancy compiler technologies, 00:07:11.500 |
run times, high performance kernels, like all this stuff in the box. 00:07:16.380 |
And you don't have to worry about it, which is really the point. 00:07:24.460 |
And so there are lots of different reasons why people might want to use an alternative thing. 00:07:29.980 |
For example, developer velocity, your team being more productive, 00:07:34.460 |
that's actually incredibly important, particularly if you're pushing state of the art. 00:07:39.420 |
And so I'll do the same thing that kind of people generally do, 00:07:42.780 |
is go and talk about the quantifiable thing, which is performance. 00:07:47.660 |
We just shipped a release that has our fancy int4 and int6 k-quantization approach. 00:08:00.140 |
So if you're using llama.cpp today on cloud CPUs, this is actually a pretty big deal. 00:08:06.860 |
And 5x can have a pretty big impact on the actual perceived latency of your product 00:08:15.420 |
And the way this is possible is, again, this combination of really crazy compiler technology. 00:08:23.900 |
But the fact that you don't have to care about that is actually pretty nice. 00:08:26.940 |
It's also pretty nice that this isn't just one model. 00:08:30.140 |
This is, you know, we have this technology that makes it easy to do int4. 00:08:34.140 |
And then we demonstrate it with a model that people are very familiar with. 00:08:37.980 |
And so if you care about this kind of stuff, this is actually pretty interesting. 00:08:41.420 |
And it's a next generation approach to a lot of the things that are very familiar. 00:08:48.940 |
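To give a rough idea of what int4-style quantization involves, here is a generic blockwise scheme sketched in NumPy. This is illustrative only: it is not Modular's kernels or llama.cpp's exact Q4_K/Q6_K formats, and the block size of 32 is just a plausible choice. Weights are grouped into small blocks, each block stores one float scale, and each weight becomes a 4-bit integer, cutting storage roughly 8x versus float32.

```python
import numpy as np

def quantize_int4_blockwise(w, block=32):
    """Quantize a 1-D float weight vector to int4 values with one scale per block."""
    w = w.reshape(-1, block)
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 7.0, 1e-12)  # int4 covers [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from the int4 values and scales."""
    return (q * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int4_blockwise(w)
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```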
And so, I mean, so far we've been talking about CPUs. 00:08:54.060 |
And what I would say, and what I've seen, is that CPUs in AI are kind of well understood. 00:09:03.420 |
And so I'll talk just a little bit about our approach on this. 00:09:06.140 |
And so first, before I tell you what we're doing, let me tell you our dream. 00:09:15.020 |
Imagine a world where you can program a GPU as easily as you can program a CPU in Python. 00:09:27.500 |
That is a very different thing than the world is today. 00:09:33.420 |
Imagine a world in which you can actually get better utilization from the GPUs you're already paying for. 00:09:39.260 |
I don't know your workload, but you're probably somewhere between 30%, maybe 50% utilization, 00:09:44.940 |
which means you're paying for, like, two to three times the amount of GPU that you should be. 00:09:48.780 |
And that is understandable given the technology today. 00:09:53.340 |
But that's not great for lots of obvious reasons. 00:09:55.660 |
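The arithmetic behind that claim is simple: your effective cost per useful GPU-hour is the sticker price divided by utilization. A tiny sketch with made-up numbers:

```python
hourly_rate = 4.00  # $/GPU-hour, hypothetical figure
for utilization in (0.30, 0.50, 0.90):
    effective = hourly_rate / utilization
    print(f"{utilization:.0%} utilization -> ${effective:.2f} per useful GPU-hour")
# At 30% utilization, every hour of useful work effectively costs ~3.3x list price.
```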
Imagine a world where you have the full power of CUDA. 00:09:59.340 |
So you don't have to say there's a powerful thing and there's an easy-to-use thing. 00:10:04.860 |
You can have one technology stack that scales. 00:10:10.220 |
This is something where, you know, NVIDIA has a lot of very good software people, 00:10:14.220 |
and they've been working on this for 15 years. 00:10:17.740 |
I don't run 15-year-old software on my cell phone. 00:10:20.140 |
Like, it doesn't run BlackBerry software either. 00:10:22.380 |
And I think that it's time to really rethink this technology stack and push the world forward. 00:10:32.700 |
You use one line of code to switch from CPU to GPU. 00:10:41.660 |
Because the way this is usually implemented is by having a big fork at the top of two completely separate stacks. 00:10:52.220 |
And so as a consequence, nothing actually works the same except for the thing on the slide. 00:10:57.820 |
And so what modular has done here is we've gone down and said, 00:11:01.180 |
let's replace that entire layer of technology. 00:11:11.900 |
And then make it work super easily, super predictably. 00:11:19.420 |
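For context on the "one line of code" pattern being criticized, this is what it looks like in PyTorch today. Standard PyTorch is shown purely to illustrate the pattern; MAX's own API is not shown here.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(1024, 1024).to(device)   # the "one line" switch
x = torch.randn(8, 1024, device=device)
y = model(x)

# Same source line, but CPU and GPU dispatch into entirely separate backend
# implementations (CPU kernels vs. cuBLAS/cuDNN), which is the "big fork"
# described above and why behavior and performance don't always match.
```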
So if you are an advanced developer, and so many of you don't want to know about this, 00:11:25.820 |
If you are an advanced developer, like I said, you get the full power of CUDA. 00:11:29.260 |
And so if you want, you can go write custom kernels directly against MAX. 00:11:37.020 |
And for advanced developers, which I'm not going to dive too deeply into, 00:11:40.460 |
it's way easier to use than things like the Triton language and things like this. 00:11:44.620 |
And it has good developer tools, and it has all the things that you'd expect from a world-class 00:11:49.900 |
implementation of GPU programming technology. 00:11:51.980 |
For people who don't want to write kernels, you also get a very fancy autofusing compiler and 00:11:57.740 |
And so you get good performance for the normal cases without having to write the hand-fused 00:12:02.540 |
kernels, which is, again, a major usability improvement. 00:12:08.380 |
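A toy example of what an auto-fusing compiler saves you from writing by hand (plain Python/NumPy for illustration, not MAX's compiler output): the unfused version materializes a temporary array for every op and re-reads memory each time, while the fused version does all the elementwise work in a single pass.

```python
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

def unfused(x):
    # Each op allocates a full temporary and makes another pass over memory.
    t1 = x * 2.0
    t2 = t1 + 1.0
    return np.maximum(t2, 0.0)

def fused(x):
    # One pass, no temporaries. (Pure Python makes this loop slow; the point is
    # the memory-access pattern a fusing compiler emits as native code.)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = max(x[i] * 2.0 + 1.0, 0.0)
    return out
```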
Like, there's a lot of things out there that promise to be easy. 00:12:12.940 |
A lot of the reason to use a GPU in the first place is about performance. 00:12:16.380 |
And so one of the things that I think is pretty cool, and one of the things that's very important 00:12:20.700 |
to Modular is that we're now comparing against the standards. 00:12:25.820 |
In this case, NVIDIA, they're experts in their architecture. 00:12:30.380 |
And so if you go look at, again, there's a million ways to measure things, a microbenchmark. 00:12:34.940 |
Go look at the core operation within a neural network, matrix multiplication. 00:12:40.220 |
This is the most important thing for a wide variety of workloads and, again, one set of data. 00:12:45.660 |
But we compare against cuBLAS, the hard-coded thing, and then also against CUTLASS, 00:12:55.260 |
And so MAX is meeting and beating both of these, you know, by just a little bit. 00:13:00.620 |
I mean, you know, it depends on your bar, and data is complicated. 00:13:04.060 |
But, you know, if you're winning by 30%, 30% is actually a pretty big deal given the amount of 00:13:09.340 |
cost, the amount of complexity, the amount of effort that goes into these kinds of things. 00:13:13.420 |
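For reference, matmul comparisons like this are usually quoted as achieved FLOP/s: multiplying an (M, K) matrix by a (K, N) matrix takes roughly 2*M*N*K floating-point operations, divided by wall-clock time. A minimal sketch of that measurement follows, using NumPy on CPU purely to show the formula; the slide's numbers came from GPU kernels measured against cuBLAS and CUTLASS.

```python
import time
import numpy as np

M = N = K = 2048
a = np.random.randn(M, K).astype(np.float32)
b = np.random.randn(K, N).astype(np.float32)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

flops = 2 * M * N * K   # one multiply and one add per (m, n, k) triple
print(f"{flops / elapsed / 1e9:.1f} GFLOP/s achieved")
# In practice you'd warm up and average over many runs before comparing numbers.
```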
And so I've talked a lot about the what, but I haven't talked about the how. 00:13:18.700 |
And so the how is actually a very important part of this, and I'll just give you a sample of this. 00:13:23.500 |
So we are crazy enough that we decided to go rebuild the world's 00:13:27.180 |
first AI stack from the bottom up for Gen AI. 00:13:29.980 |
And as part of doing that, what we realized is we had to go even deeper. 00:13:36.140 |
We have a new programming language. It's called Mojo. 00:13:40.700 |
And so the thing about Mojo is if you don't want to know about Mojo, you don't have to use Mojo. 00:13:46.140 |
You can just use MAX. It's fine. But we had to build Mojo in order to build MAX. 00:13:51.500 |
And I'll tell you just a couple of things about this. Our goal is that Mojo is the best way to 00:13:56.060 |
extend Python. And that means that you can get out of C, C++, and Rust. 00:14:01.020 |
And so what is it as a programming language? It's a full -- it's Pythonic. So it looks like Python. 00:14:06.620 |
It feels like Python. Everything you know about Python comes over. And you don't have to relearn 00:14:10.620 |
everything, which is a really big deal. You get a full tool chain. You can download it on your 00:14:14.780 |
computer. You can use it in Visual Studio Code. It's open source. Available on Linux, Mac, Windows. 00:14:20.140 |
200,000 people, 20,000 people in Discord. It's really cool. I would love for you to go check it 00:14:26.220 |
out if you're interested in this. But what is Mojo? Like, what actually is it? Fine. There's a programming 00:14:33.660 |
language thing going on. Well, what we decided is we decided that AI needs two things. It needs 00:14:39.500 |
everything that's amazing about Python. This is, in my opinion, the developers. This is the ecosystem. 00:14:46.620 |
This is the libraries. This is the community. This is even the package managing. And, like, 00:14:52.220 |
all the things that people are used to using already. Those are the things that are great about Python. 00:14:57.660 |
But what is not great about Python, unfortunately, is its implementation. And so, what we've done is 00:15:02.540 |
we've combined the things that are great about Python with some very fancy highfalutin compiler-y stuff, 00:15:08.700 |
MLIR, all this good stuff that then allows us to build something really special. And so, 00:15:15.420 |
while it looks like Python, please do forget everything you know about Python, because this is a different 00:15:20.700 |
beast. And I'm not going to give you a full hour-long presentation on Mojo. But I'll give you 00:15:27.180 |
one example of why it's a different beast, and I'll pull it back to something many of you care about, 00:15:30.540 |
which is performance. And what I'll say is that Mojo is fast. How fast? Well, it depends. This isn't a 00:15:37.820 |
slightly faster Python. This is a working-back-from-the-speed-of-light-of-the-hardware kind of system. And so, 00:15:43.820 |
many people out there have found that it's about 100 times to 1,000 times faster. In crazy cases, 00:15:48.540 |
it can be even better than that. But the speed is not the point. The point is what it means. 00:15:55.260 |
And so, in Python, for example, you should never write a for loop. Python is not designed for writing 00:16:01.420 |
for loops, if you care about performance, at least. In Mojo, you can go write code that does arbitrary 00:16:09.180 |
things. This is an example pulled from our Llama 3 written in Mojo that does tokenization using a standard 00:16:14.220 |
algorithm. It's chasing linked lists. It has if statements, for loops. It's just normal code. 00:16:19.020 |
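To give a flavor of the kind of code being described, here is a hypothetical sketch in Python-style syntax (not Modular's actual Llama 3 tokenizer): a BPE-style merge loop that walks a linked list of tokens, exactly the branchy, pointer-chasing code that is painful to make fast in plain Python but that Mojo is meant to let you write directly.

```python
class Node:
    """One token in a doubly linked list over the text being tokenized."""
    def __init__(self, token):
        self.token = token
        self.prev = None
        self.next = None

def merge_pass(head, merge_table):
    """Greedily merge adjacent token pairs found in the table: just ifs and loops."""
    changed = True
    while changed:
        changed = False
        node = head
        while node is not None and node.next is not None:
            if (node.token, node.next.token) in merge_table:
                node.token = node.token + node.next.token   # splice the pair into one token
                node.next = node.next.next
                if node.next is not None:
                    node.next.prev = node
                changed = True
            node = node.next
    return head
```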
And it's Python. It feels like Python. And that is really the point. And so, for you, the benefit 00:16:24.780 |
of Mojo is, first of all, you can ignore it if you don't want to care about it. But if you do, you don't 00:16:29.820 |
have to learn C, C++. You have lower cost by default versus Python because performance is cost. It means that, 00:16:37.900 |
as a researcher, if you use this, you can actually have full stack hackability. And if you're a manager, 00:16:43.180 |
it means that you don't have to have people that know Rust on your team and C++ and things like 00:16:47.740 |
this. You can have a much more coherent engineering structure where you're able to scale into the 00:16:52.460 |
problem no matter where it is. And so, if you want to see something super polarizing, go check the 00:16:56.780 |
Modular blog, and we'll explain how it's actually faster than Rust, which many people consider to be the 00:17:00.860 |
gold standard, even though it's, again, a 15-year-old language. So, I have to wrap things up. 00:17:05.740 |
They'll get mad at me if I go over. The thing that I'm here to say is that many of you may want to 00:17:11.820 |
go beyond the APIs. And they're fantastic. There's amazing technology out there. I'm very excited about 00:17:17.980 |
them, too. But if you care about control over your data, you want to integrate into your security, 00:17:23.180 |
you want customization, you want to save money, you want portability across hardware, then you need 00:17:26.860 |
to get on to something else. And so, if you're interested in these things, then MAX can be very 00:17:30.460 |
interesting to you. MAX is free. You can download it today. It's totally available. You can go nuts. 00:17:36.460 |
We didn't talk about production or deployment or things like this, but if you want to do that, 00:17:40.940 |
we can also help. We support production deployment on Kubernetes, SageMaker, and we can make it super 00:17:45.580 |
easy for you. Our GPU support, like I said, is actually really hard. We're working really hard on 00:17:51.500 |
this. We want to do this right. And so, it'll launch officially in September. If you join our Discord, 00:17:56.780 |
you can get early access, and we'd be very happy to work with you ahead of that, too. 00:18:00.060 |
We're cranking out new stuff all the time. And so, if you are interested in learning more, you can 00:18:06.700 |
check out modular.com. Find us on GitHub. A lot of this is open source. And join our Discord. Thank you, everyone.