BRENT JOHNSON: All right. Good morning, everyone. I'm here to talk to you about Modular and accelerating the pace of AI. You know what Gen AI is, so I'm not going to tell you all about that. Let me tell you one of the things I think is really cool about it, and very different from certain other technologies: it's super easy to deploy.
There's lots of great endpoints out there. There's a lot of good implementations, a lot of ways to make it super easy to build a prototype and get going very quickly. But despite all the availability of all these different endpoints, sometimes you do have other needs. Sometimes you might want to go and control your data instead of sending your data to somebody else.
Sometimes you might want to integrate it into your own security, because you've got your critical company data in your model, and you don't want to fine-tune it somewhere else. Sometimes you want to customize the model. There's research happening all the time, and building proprietary models that work best for your use cases can make your applications even better.
Of course, the inference endpoints are expensive, and so sometimes you want to save money. Sometimes there's hardware out there that's really interesting, and you want to explore beyond the mainstream. And if you care about any of these things, what you need to do is go beyond the endpoint.
And so how do you do that? Well, many of you have explored this, I'm sure. The answer has shifted. It used to be that we had things like PyTorch and TensorFlow and Caffe and things like this. But as inference became more important, the world shifted. First, we got ONNX, TensorRT, things like this.
And today, we have an explosion of these different frameworks, some of which are specific to one model. And that's cool if you care about that one model. But if you have many different things you want to deploy and you want to work with, it's very frustrating to have to switch between all these different technologies.
And of course, it's not just the model. You all know there's this gigantic array of different technologies that get used to build real-world things in production. And of course, none of these were really designed for Gen AI. So my concern, my objection to the status quo, is that this fragmentation slows down getting the research and innovation happening in Gen AI into your products.
And I think we've seen so many of these demos. Last year was really the year of the Gen AI demo. But still, we're struggling to get Gen AI into products in an economical and good way. And so whose fault is it? Well, is it our fault? Many of you are AI engineers.
If you're not, let's sympathize with the plight of the AI engineer. Because y'all, the folks building this stuff, have new models and optimizations coming out every week. Every product needs to be enhanced with Gen AI, so it's not just one thing getting dumped on us. And there's so much to do, we can't even keep up.
There's no time to deal with new hardware and all the other exciting new features. And of course, once you get something that actually works, the costs end up making it very difficult to scale, because getting into production means suddenly you're paying on a per-unit basis.
So it's not the AI engineer's fault. We should look at the concerns and look at the challenges faced here. And so I think that we need a new approach. We've learned so much. Let's look at what we need to do. How do we solve and improve the world here?
This is what Modular is about. And so I'll give you a quick intro of what we're doing and kind of our approach on this. First of all, who are we? Modular is a fairly young company. We've been around for a couple of years. We have brought together some of the world's experts that built all of these things.
And so we've built TensorFlow and PyTorch. We built compilers like LLVM and MLIR and XLA and all of these different things. So what I can say about that is that we learned a lot, and I apologize, because we know why it is so frustrating to use all these things.
But really, the world looked very different five years ago. Gen AI didn't exist. It's understandable. We tried really hard, but we have learned. And so our goal is to make it so you can own your AI. You can own your data. You can control your product. You can deploy where you want to.
And you can do all of that much more easily than the current systems allow today. And so how? Well, what we're doing is really going back to the basics. We're bringing together the best-in-class technologies into one stack, not one solution per model. Our goal is to lift up Python developers and PyTorch users.
This is where the entire industry is, and so we want to work with existing people. We're not trying to say, hey, ditch everything you know and try something new. We want to gradually teach and give folks new tools so they can have superpowers. And finally, I spent a lot of time at Apple.
Like, I want things that just work. Like, you want to build on top of infrastructure. You do not want to have to be experts in the infrastructure. And this is the way all of this stuff should work. And unfortunately, it's just not the case today in AI. And so at Modular, we're building this technology called Max.
I'll explain super fast what this is. Max is two things. One is an AI framework, which I'll spend a bunch of time on. The AI framework is free and widely available. We'll talk about it today. The other is our managed services. This is how Modular makes money. Very traditional. We're not going to spend a lot of time talking about that today.
And so if you dive into this AI framework, well, we see it as two things. It's the best way to deploy PyTorch. It's also the best way to do Gen AI. And both halves of this are really important. And Max is currently very focused on inference. And so these are areas where PyTorch is challenging at times.
This is where Gen AI is driving us crazy with cost and complexity. And so really focusing on this problem is something that we're all about. The other thing, as I said before, is Python. So we natively speak Python. That is where the entire world is. We also have other options, including C++, which we'll talk about later.
So how do we approach this? Well, as I said, we work with PyTorch out of the box. You can bring your models, and your model works. We can talk to the wide array of PyTorch-y things, like ONNX and TorchScript and torch.compile and all this stuff. And so you can pick your path.
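Concretely, the usual "bring your model" paths look something like this minimal PyTorch sketch. The model and file names here are placeholders, and the step of actually loading the exported artifact into Max is left to its docs rather than shown here.

```python
# A minimal sketch (not from the talk) of the usual "bring your PyTorch model" export paths.
# Model shapes and file names are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
example_input = torch.randn(1, 128)

# Path 1: TorchScript - trace into a serialized module that inference stacks can load.
torch.jit.trace(model, example_input).save("model.torchscript")

# Path 2: ONNX - export a portable graph format that many runtimes (including Max) ingest.
torch.onnx.export(model, example_input, "model.onnx",
                  input_names=["x"], output_names=["logits"])
```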
And that's all good. If you want to go deeper, you can use the native APIs. The native APIs are great if you speak the language of KV caches and paged attention and things like this, and you care about pushing the state of the art of LLMs and other Gen AI techniques.
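For anyone not fluent in that language yet: a KV cache just means keeping the keys and values of already-generated tokens around during decoding, so each new token attends against cached state instead of recomputing the whole prefix; paged attention is about how that cache gets laid out in memory. Here's a toy sketch of the idea, in plain PyTorch for illustration, not Max's native API:

```python
# Toy illustration of a KV cache during autoregressive decoding (not Max's API).
import torch

def attend(q, k_cache, v_cache):
    # q: [1, d]; k_cache/v_cache: [t, d] for the t tokens generated so far
    scores = (q @ k_cache.T) / (q.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_cache

d = 64
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

for step in range(4):                  # pretend we decode 4 tokens
    q = torch.randn(1, d)              # query for the new token
    k, v = torch.randn(1, d), torch.randn(1, d)
    k_cache = torch.cat([k_cache, k])  # append the new key/value instead of
    v_cache = torch.cat([v_cache, v])  # recomputing attention over the whole prefix
    out = attend(q, k_cache, v_cache)
# Paged attention changes how k_cache/v_cache are stored (fixed-size blocks), not the math.
```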
That's very cool. And also, Max is very different in that it really rebuilds a ton of the stack, which I don't have time to talk about. But we do not build on top of cuDNN and the NVIDIA libraries, or on top of the Intel libraries. We replace all of that with a single consistent stack, which is a really different approach.
And I'll talk about what that means later. So what you get is you get a whole bunch of technology that you don't have to worry about. And so, again, as a next generation technology, you get a lot of fancy compiler technologies, run times, high performance kernels, like all this stuff in the box.
And you don't have to worry about it, which is really the point. Now, why would you use Max? So it's an AI framework. You have one, right? And so there are lots of different reasons why people might want to use an alternative thing. For example, developer velocity, your team being more productive, that's actually incredibly important, particularly if you're pushing state of the art.
But it's also very hard to quantify. And so I'll do the same thing people generally do and talk about the quantifiable thing, which is performance. And so I'll give you one example of this. We just shipped a release that has our fancy int4 and int6 k-quantization approach.
This is actually 5x faster than llama.cpp. So if you're using llama.cpp today on cloud CPUs, this is actually a pretty big deal. And 5x can have a pretty big impact on the perceived latency, performance, and cost characteristics of your product. And the way this is possible is, again, this combination of really crazy compiler technology and other stuff under the covers.
But the fact that you don't have to care about that is actually pretty nice. It's also pretty nice that this isn't just one model. We have technology that makes int4 easy, and then we demonstrate it with a model that people are very familiar with.
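For a rough sense of what k-quant-style int4 schemes do, here is an illustrative sketch; it is not Modular's implementation, and the group size and rounding choices are made up for the example. The core idea is storing small integers per block of weights plus a per-block scale:

```python
# Toy blockwise int4 quantization: 4-bit integers per block plus one scale per block.
# Illustrative only; real k-quant schemes use more elaborate block structures.
import numpy as np

def quantize_int4(weights, group_size=32):
    w = weights.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-8  # int4 range is [-8, 7]
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int4(w)
error = np.abs(dequantize_int4(q, s) - w).mean()   # small reconstruction error, ~4 bits/weight
```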
And so if you care about this kind of stuff, this is actually pretty interesting. And it's a next generation approach to a lot of the things that are very familiar. But it's also done in a generalizable way. Now, CPUs are cool. And so, I mean, so far we've been talking about CPUs.
But GPUs are also cool. And what I would say, and what I've seen, is that CPUs in AI are kind of well understood, but GPUs are where most of the pain is. And so I'll talk just a little bit about our approach on this. And so first, before I tell you what we're doing, let me tell you our dream.
And this is not a small ambition. This is kind of a crazy dream. Imagine a world where you can program a GPU as easily as you can program a CPU, in Python. Not in C++; in Python. That is a very different thing from the world today. Imagine a world in which you can actually get better utilization from the GPUs you're already paying for.
I don't know your workload, but you're probably somewhere between 30%, maybe 50% utilization, which means you're paying for, like, two to three times the amount of GPU that you should be. And that is understandable given the technology today. But that's not great for lots of obvious reasons. Imagine a world where you have the full power of CUDA.
So you don't have to say there's a powerful thing and there's an easy-to-use thing. You can have one technology stack that scales. Well, this is something that is really hard. This is something where, you know, NVIDIA has a lot of very good software people, and they've been working on this for 15 years.
But I don't know about you. I don't run 15-year-old software on my cell phone. Like, it doesn't run BlackBerry software either. And I think that it's time to really rethink this technology stack and push the world forward. And that's what we're trying to do. And so how does it work?
Well, you know, it's just like PyTorch. You change one line of code and switch from CPU to GPU. Ha-ha. We've all seen this, right? This doesn't say anything. I actually hate this kind of a demo. Because the way this is usually implemented is by having a big fork at the top into two completely different technology stacks.
One built on top of Intel MKL. One built on top of CUDA. And so as a consequence, nothing actually works the same except for the thing on the slide. And so what Modular has done here is we've gone down and said, let's replace that entire layer of technology. Let's replace the matrix multiplications.
Let's replace the fused attention layers. Let's replace the graph thingies. Let's replace all this kind of stuff. And then make it work super easily, super predictably, all stitched together. And yeah, it looks fine on the slide. But the slide is missing the point.
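For reference, the one-liner being poked at is the familiar PyTorch device switch; the point is that everything interesting hides underneath it:

```python
# The familiar "one line of code" device switch in PyTorch. Two very different
# software stacks (MKL-ish on CPU, CUDA on GPU) sit behind this single line.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)   # the one line that changes
x = torch.randn(8, 512, device=device)
y = model(x)
```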
So if you are an advanced developer (and many of you don't want to know about this, and that's cool), you get the full power of CUDA, like I said. And so if you want, you can go write custom kernels directly against Max. And that's great. And for advanced developers, which I'm not going to dive too deeply into, it's way easier to use than things like the Triton language and things like this.
And it has good developer tools, and it has all the things that you'd expect from a world-class implementation of GPU programming technology. For people who don't want to write kernels, you also get a very fancy autofusing compiler and things like this. And so you get good performance for the normal cases without having to write the hand-fused kernels, which is, again, a major usability improvement.
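To make "autofusing" concrete, here's a toy illustration; torch.compile is used purely as a familiar stand-in for the idea, not as Max's mechanism. Written eagerly, each elementwise op materializes its own intermediate tensor and makes its own trip through memory; a fusing compiler can collapse the chain into one kernel.

```python
# Toy illustration of fusion: three elementwise ops, three intermediates when run eagerly.
# A fusing compiler can emit a single kernel for the whole chain.
import torch
import torch.nn.functional as F

def bias_gelu_scale(x, b):
    y = x + b          # intermediate 1
    z = F.gelu(y)      # intermediate 2
    return z * 0.5     # intermediate 3

fused = torch.compile(bias_gelu_scale)   # stand-in for an auto-fusing compiler

x, b = torch.randn(1024, 1024), torch.randn(1024)
out = fused(x, b)
```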
Now, you know, it's cool. Like, there are a lot of things out there that promise to be easy. But what about performance? A lot of the reason to use a GPU in the first place is performance. And so one of the things that I think is pretty cool, and that's very important to Modular, is that we're not comparing against some standard baseline.
We're comparing against the vendor's best. In this case, NVIDIA; they're the experts in their own architecture. And so, again, there are a million ways to measure things, and this is a microbenchmark. Go look at the core operation within a neural network, matrix multiplication. This is the most important thing for a wide variety of workloads and, again, this is one set of data.
But we compare against cuBLAS, the hard-coded thing, and then also against CUTLASS, the more programmable C++-y thing. And so Max is meeting and beating both of these, you know, by just a little bit. I mean, you know, it depends on your bar, and data is complicated. But, you know, if you're winning by 30%, 30% is actually a pretty big deal given the amount of cost, the amount of complexity, and the amount of effort that goes into these kinds of things.
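For context, a matmul microbenchmark like that is typically taken roughly like this; the shapes, warmup, and iteration counts below are arbitrary and are not Modular's methodology.

```python
# Rough GEMM microbenchmark sketch: time the same matmul repeatedly and report throughput.
import time
import torch

def bench_matmul(m, n, k, iters=50, device="cuda"):
    a = torch.randn(m, k, device=device, dtype=torch.float16)
    b = torch.randn(k, n, device=device, dtype=torch.float16)
    for _ in range(10):              # warmup so setup costs don't pollute the timing
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()         # GPU work is asynchronous; sync before stopping the clock
    seconds = (time.perf_counter() - start) / iters
    return 2 * m * n * k / seconds / 1e12   # achieved TFLOP/s

# print(bench_matmul(4096, 4096, 4096))
```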
And so I've talked a lot about the what, but I haven't talked about the how. And so the how is actually a very important part of this, and I'll just give you a sample of it. So we are crazy enough that we decided to go rebuild the world's AI stack from the bottom up for Gen AI.
And as part of doing that, what we realized is we had to go even deeper. And so we built a programming language. We have a new programming language. It's called Mojo. And so the thing about Mojo is if you don't want to know about Mojo, you don't have to use Mojo.
You can just use Max. It's fine. But we had to build Mojo in order to build Max. And I'll tell you just a couple of things about this. Our goal is that Mojo is the best way to extend Python. And that means that you can get out of C, C++, and Rust.
And so what is it as a programming language? It's fully Pythonic. So it looks like Python. It feels like Python. Everything you know about Python comes over, and you don't have to retrain on everything, which is a really big deal. You get a full toolchain. You can download it on your computer.
You can use it in Visual Studio Code. It's open source. It's available on Linux, Mac, and Windows. Over 200,000 people have used it, and there are more than 20,000 people in our Discord. It's really cool. I would love for you to go check it out if you're interested in this. But what is Mojo? Like, what actually is it? Fine, there's a programming language thing going on.
Well, what we decided is we decided that AI needs two things. It needs everything that's amazing about Python. This is, in my opinion, the developers. This is the ecosystem. This is the libraries. This is the community. This is even the package managing. And, like, all the things that people are used to using already.
Those are the things that are great about Python. But what is not great about Python, unfortunately, is its implementation. And so, what we've done is we've combined the things that are great about Python with some very fancy highfalutin compiler-y stuff, MLIR, all this good stuff that then allows us to build something really special.
And so, while it looks like Python, please forget everything you know about Python, because this is a different beast. And I'm not going to give you a full hour-long presentation on Mojo. But I'll give you one example of why it's a different beast, and I'll tie it back to something many of you care about, which is performance.
And what I'll say is that Mojo is fast. How fast? Well, it depends. This isn't a slightly faster Python. This is a working-at-the-speed-of-light-of-the-hardware kind of system. And so, many people out there have found that it's about 100 to 1,000 times faster.
In crazy cases, it can be even better than that. But the speed is not the point. The point is what it means. And so, in Python, for example, you should never write a for loop. Python is not designed for writing for loops, if you care about performance, at least.
In Mojo, you can go write code that does arbitrary things. This is an example pulled from our Llama 3 written in Mojo that does tokenization using a standard algorithm. It's chasing linked lists. It has if statements and for loops. It's just normal code. And it's Python. It feels like Python.
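The slide's actual Mojo code isn't reproduced here, but the flavor of it is ordinary loop-and-if code along these lines. This is a greedy longest-match tokenizer used as a stand-in for the algorithm on the slide, written Python-style since Mojo is Pythonic:

```python
# Illustrative only: plain, loop-heavy tokenizer logic of the kind being described.
# Walk the text, greedily match the longest known token, fall back to single characters.
def greedy_tokenize(text, vocab, max_token_len=16):
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # try the longest possible piece first, then shrink
        for length in range(min(max_token_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                match = piece
                break
        if match is None:          # unknown text: emit a single character
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

tokens = greedy_tokenize("hello world", {"hello", " world", "he", "llo"})
```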
And that is really the point. And so, for you, the benefit of Mojo is, first of all, you can ignore it if you don't want to care about it. But if you do, you don't have to learn C, C++. You have lower cost by default versus Python because performance is cost.
It means that, as a researcher, if you use this, you can actually have full stack hackability. And if you're a manager, it means that you don't have to have people that know Rust on your team and C++ and things like this. You can have a much more coherent engineering structure where you're able to scale into the problem no matter where it is.
And so, if you want to see something super polarizing, go check out the Modular blog, where we explain how it's actually faster than Rust, which many people consider to be the gold standard, even though it's, again, a 15-year-old language. So, I have to wrap things up. They'll get mad at me if I go over.
The thing that I'm here to say is that many of you may want to go beyond the endpoints. And they're fantastic. There's amazing technology out there. I'm very excited about them, too. But if you care about control over your data, you want to integrate into your own security, you want customization, you want to save money, you want portability across hardware, then you need to move on to something else.
And so, if you're interested in these things, then Max can be very interesting to you. Max is free. You can download it today. It's totally available. You can go nuts. We didn't talk about production or deployment or things like this, but if you want to do that, we can also help.
We support production deployment on Kubernetes, SageMaker, and we can make it super easy for you. Our GPU support, like I said, is actually really hard. We're working really hard on this. We want to do this right. And so, it'll launch officially in September. If you join our Discord, you can get early access, and we'd be very happy to work with you ahead of that, too.
We're cranking out new stuff all the time. And so, if you are interested in learning more, you can check out modular.com. Find us on GitHub. A lot of this is open source. And join our Discord. Thank you, everyone. We'll see you in the next one.