Unlocking Developer Productivity across CPU and GPU with MAX: Chris Lattner


00:00:00.280 | BRENT JOHNSON: All right.
00:00:13.880 | Good morning, everyone.
00:00:15.080 | I'm here to talk to you about Modular and accelerating
00:00:17.640 | the pace of AI.
00:00:19.880 | You know what Gen AI is.
00:00:21.280 | I'm not going to tell you all about this.
00:00:22.760 | Let me tell you one of the things I think
00:00:24.200 | is really cool about it and very different
00:00:26.160 | than certain other technologies is
00:00:28.160 | that it's super easy to deploy.
00:00:30.300 | There's lots of great endpoints out there.
00:00:32.120 | There's a lot of good implementations,
00:00:33.680 | a lot of ways to make it super easy to build a prototype
00:00:37.000 | and get going very quickly.
00:00:39.640 | But despite all the availability of all these different endpoints,
00:00:43.080 | sometimes you do have other needs.
00:00:45.320 | Sometimes you might want to go and control your data
00:00:48.840 | instead of sending your data to somebody else.
00:00:50.920 | Sometimes you might want to integrate it into your own security
00:00:53.980 | because you've got your critical company data in your model,
00:00:57.260 | and you don't want to fine tune it somewhere else.
00:00:59.300 | Sometimes you want to customize the model.
00:01:01.100 | There's research happening all the time.
00:01:02.780 | A lot of things in building proprietary models
00:01:05.580 | that work best for your use cases
00:01:07.660 | can make your applications even better.
00:01:09.860 | Of course, the inference endpoints are expensive,
00:01:12.220 | and so sometimes you want to save money.
00:01:14.020 | Sometimes there's hardware out there that's really interesting,
00:01:16.300 | and you want to explore out from the mainstream,
00:01:19.460 | and you want to go do this.
00:01:21.340 | And if you care about any of these things, what you need to do
00:01:24.280 | is you need to go beyond the endpoint.
00:01:27.400 | And so how do you do that?
00:01:29.120 | Well, if you have-- many of you have explored this, I'm sure.
00:01:32.600 | The answer has shifted.
00:01:33.800 | It used to be that we had things like PyTorch and TensorFlow
00:01:36.240 | and Caffe and things like this.
00:01:37.680 | But as inference became more important, the world shifted.
00:01:40.800 | First, we got ONNX, TensorRT, things like this.
00:01:43.760 | And today, we have an explosion of these different frameworks,
00:01:47.600 | some of which are specific to one model.
00:01:50.720 | And that's cool if you care about that one model.
00:01:52.440 | But if you have many different things you want to deploy
00:01:54.200 | and you want to work with, it's very frustrating
00:01:55.720 | to have to switch between all these different technologies.
00:01:58.680 | And of course, it's not just the model.
00:02:00.640 | You all know there's this gigantic array of different technologies
00:02:04.080 | that get used to build real-world things in production.
00:02:07.160 | And of course, none of these are really actually designed for Gen AI.
00:02:11.720 | So my concern about this, my objection to the status quo,
00:02:15.440 | is that this fragmentation slows down getting the research
00:02:18.600 | and the innovations coming into Gen AI into your products.
00:02:22.440 | And I think we've seen so many of these demos.
00:02:24.560 | Last year was really the year of the Gen AI demo.
00:02:27.520 | But still, we're struggling to get Gen AI into products
00:02:30.080 | in an economical and good way.
00:02:33.760 | And so whose fault is it?
00:02:35.840 | Well, is it our fault?
00:02:37.960 | Many of you are AI engineers.
00:02:39.880 | If you're not, let's sympathize with the plight of the AI engineer.
00:02:43.760 | Because y'all, these folks that are building this,
00:02:47.640 | have new models and optimizations coming out every week.
00:02:52.160 | Every product needs to be enhanced with Gen AI.
00:02:54.000 | And this is not the only thing that's getting dumped on us.
00:02:56.480 | And there's so much to do, we can't even keep up.
00:02:58.640 | There's no time to deal with new hardware
00:03:00.080 | and all the other exciting new features.
00:03:02.520 | And of course, once you get something that actually works,
00:03:04.920 | the costs end up making it very difficult to scale these things,
00:03:07.440 | because getting things into production means suddenly you're paying on a per-unit basis.
00:03:13.320 | So it's not the AI engineer's fault.
00:03:15.200 | We should look at the concerns and look at the challenges faced here.
00:03:19.520 | And so I think that we need a new approach.
00:03:21.720 | We've learned so much.
00:03:22.720 | Let's look at what we need to do.
00:03:24.320 | How do we solve and improve the world here?
00:03:27.320 | This is what Modular is about.
00:03:28.880 | And so I'll give you a quick intro of what we're doing and kind of our approach on this.
00:03:33.200 | First of all, who are we?
00:03:34.200 | Modular is a fairly young company.
00:03:37.080 | We've been around for a couple of years.
00:03:39.400 | We have brought together some of the world's experts that built all of these things.
00:03:43.200 | And so we've built TensorFlow and PyTorch.
00:03:45.200 | We built compilers like LLVM and MLIR and XLA and all of these different things.
00:03:51.080 | So what I can say about that is that we learned a lot, and I apologize, because we know why
00:03:59.640 | it is so frustrating to use all these things.
00:04:01.640 | But really, the world looked very different five years ago.
00:04:06.520 | Gen AI didn't exist.
00:04:07.720 | It's understandable.
00:04:08.840 | We tried really hard, but we have learned.
00:04:11.640 | And so our goal is to make it so you can own your AI.
00:04:15.760 | You can own your data.
00:04:16.760 | You can control your product.
00:04:18.320 | You can deploy where you want to.
00:04:20.560 | You can do this and make it much easier than the current systems work today.
00:04:26.760 | And so how?
00:04:28.040 | Well, what we're doing is really going back to the basics.
00:04:31.080 | We're bringing together the best-in-class technologies into one stack, not one solution per model.
00:04:38.640 | Our goal is to lift Python developers, PyTorch users.
00:04:42.320 | This is where the entire industry is, and so we want to work with existing people.
00:04:45.960 | We're not trying to say, hey, ditch everything you know and try something new.
00:04:50.140 | We want to gradually teach and give folks new tools so they
00:04:55.300 | can have superpowers.
00:04:56.660 | And finally, so I spent a lot of time at Apple.
00:04:58.940 | Like, I want things that just work.
00:05:01.140 | Like, you want to build on top of infrastructure.
00:05:02.500 | You do not want to have to be experts in the infrastructure.
00:05:04.500 | And this is the way all of this stuff should work.
00:05:07.500 | And unfortunately, it's just not the case today in AI.
00:05:10.180 | And so at Modular, we're building this technology called Max.
00:05:12.740 | I'll explain super fast what this is.
00:05:15.740 | Max is two things.
00:05:17.740 | One is an AI framework, which I'll spend a bunch of time on.
00:05:21.420 | The AI framework is free, widely available.
00:05:24.740 | We'll talk about it today.
00:05:25.740 | The other is our managed services.
00:05:27.740 | This is how Modular makes money.
00:05:28.740 | Very traditional.
00:05:29.740 | We're not going to spend a lot of time talking about that today.
00:05:31.740 | And so if you dive into this AI framework, well, we see it as two things.
00:05:38.300 | It's the best way to deploy PyTorch.
00:05:40.700 | It's also the best way to do Gen AI.
00:05:43.620 | And both halves of this are really important.
00:05:46.140 | And Max is currently very focused on inference.
00:05:48.940 | And so these are areas where PyTorch is challenging at times.
00:05:53.340 | This is where Gen AI is driving us crazy with cost and complexity.
00:05:56.940 | And so really focusing on this problem is something that we're all about.
00:06:01.900 | The other thing, as I said before, is Python.
00:06:04.380 | So we natively speak Python.
00:06:06.220 | That is where the entire world is.
00:06:07.820 | We also have other options, including C++, which we'll talk about later.
00:06:11.980 | So how do we approach this?
00:06:13.580 | Well, as I said, we work with PyTorch out of the box.
00:06:15.820 | You can bring your models, and your models work.
00:06:17.260 | We can talk to the wide array of PyTorchy things,
00:06:20.780 | like ONNX and TorchScript and torch.compile and all this stuff.
00:06:25.260 | And so you can pick your path.
00:06:26.780 | And that's all good.
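
As a rough sketch of those paths using standard PyTorch export APIs (the model and file names here are hypothetical; the talk's claim is simply that MAX can consume artifacts like these):

```python
import torch
import torchvision

# Hypothetical example model; any torch.nn.Module works the same way.
model = torchvision.models.resnet18(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)

# One path: export to ONNX, one of the formats mentioned above.
torch.onnx.export(model, example_input, "resnet18.onnx")

# Another path: trace to TorchScript and save the artifact.
scripted = torch.jit.trace(model, example_input)
scripted.save("resnet18.pt")
```
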
00:06:28.380 | If you want to go deeper, you can use native APIs.
00:06:30.780 | Native APIs are great if you speak the language of KV caches and paged attention
00:06:35.660 | and things like this.
00:06:37.020 | And you care about pushing the state of the art of LLM and other Gen AI techniques.
00:06:40.700 | That's very cool.
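
For anyone less fluent in that vocabulary: a KV cache stores the attention keys and values of earlier tokens so each decode step only computes attention for the newest token. A minimal, framework-agnostic sketch with hypothetical shapes (not MAX's actual API):

```python
import torch

n_heads, head_dim = 8, 64  # hypothetical dimensions

# The cache grows by one position per generated token.
k_cache = torch.empty(0, n_heads, head_dim)
v_cache = torch.empty(0, n_heads, head_dim)

def decode_step(q_new, k_new, v_new):
    """Append this step's key/value, then attend over the whole cache."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new.unsqueeze(0)])  # [seq, heads, dim]
    v_cache = torch.cat([v_cache, v_new.unsqueeze(0)])
    scores = torch.einsum("hd,shd->hs", q_new, k_cache) / head_dim ** 0.5
    return torch.einsum("hs,shd->hd", scores.softmax(dim=-1), v_cache)

# One decode step with random activations, just to show the shapes.
out = decode_step(*(torch.randn(n_heads, head_dim) for _ in range(3)))
print(out.shape)  # torch.Size([8, 64])
```
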
00:06:41.740 | And also, Max is very different in that it really rebuilds a ton of the stack,
00:06:47.660 | which I don't have time to talk about.
00:06:49.020 | But we do not build on top of cuDNN and the NVIDIA libraries and on top of the Intel libraries.
00:06:55.980 | We replace all that with a single consistent stack, which is a really different approach.
00:06:59.980 | And I'll talk about what that means later.
00:07:02.860 | So what you get is you get a whole bunch of technology that you don't have to worry about.
00:07:06.380 | And so, again, as a next generation technology, you get a lot of fancy compiler technologies,
00:07:11.500 | runtimes, high-performance kernels, like all this stuff in the box.
00:07:16.380 | And you don't have to worry about it, which is really the point.
00:07:19.100 | Now, why would you use Max?
00:07:21.980 | So it's an AI framework.
00:07:23.340 | You have one, right?
00:07:24.460 | And so there are lots of different reasons why people might want to use an alternative thing.
00:07:29.980 | For example, developer velocity, your team being more productive,
00:07:34.460 | that's actually incredibly important, particularly if you're pushing state of the art.
00:07:37.900 | But it's also very hard to quantify.
00:07:39.420 | And so I'll do the same thing that kind of people generally do,
00:07:42.780 | is go and talk about the quantifiable thing, which is performance.
00:07:46.060 | And so I'll give you one example of this.
00:07:47.660 | We just shipped a release that has our fancy int4/int6 K-quantization approach.
00:07:54.620 | This is actually 5x faster than llama.cpp.
00:08:00.140 | So if you're using llama.cpp today on cloud CPUs, this is actually a pretty big deal.
00:08:06.860 | And 5x can have a pretty big impact on the actual perceived latency of your product
00:08:12.940 | and performance and cost characteristics.
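
As a rough illustration of what block-wise int4 quantization means in general (this is not Modular's actual scheme, just the common store-4-bit-integers-plus-a-per-block-scale idea):

```python
import numpy as np

def quantize_int4_blockwise(weights, block_size=32):
    """Quantize a 1-D float array to 4-bit integers with one scale per block."""
    blocks = weights.reshape(-1, block_size)
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-12) / 7.0
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)  # int4 range
    return q, scales

def dequantize(q, scales):
    return (q * scales).astype(np.float32)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int4_blockwise(w)
print("max abs error:", np.abs(w - dequantize(q, s).reshape(-1)).max())
```
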
00:08:15.420 | And the way this is possible is, again, this combination of really crazy compiler and technology
00:08:21.980 | and other stuff underneath the covers.
00:08:23.900 | But the fact that you don't have to care about that is actually pretty nice.
00:08:26.940 | It's also pretty nice that this isn't just one model.
00:08:30.140 | This is, you know, we have this make-it-easy-to-do-int4 technology.
00:08:34.140 | And then we demonstrate it with a model that people are very familiar with.
00:08:37.980 | And so if you care about this kind of stuff, this is actually pretty interesting.
00:08:41.420 | And it's a next generation approach to a lot of the things that are very familiar.
00:08:45.260 | But it's also done in a generalizable way.
00:08:47.100 | Now, CPUs are cool.
00:08:48.940 | And so, I mean, so far we've been talking about CPUs.
00:08:51.900 | But GPUs are also cool.
00:08:54.060 | And what I would say and what I've seen is that CPUs in AI are kind of well understood.
00:09:01.260 | But GPUs are where most of the pain is.
00:09:03.420 | And so I'll talk just a little bit about our approach on this.
00:09:06.140 | And so first, before I tell you what we're doing, let me tell you our dream.
00:09:11.340 | And this is not a small ambition.
00:09:13.660 | This is kind of a crazy dream.
00:09:15.020 | Imagine a world where you can program a GPU as easily as you can program a CPU in Python.
00:09:23.500 | Not C++ in Python.
00:09:27.500 | That is a very different thing than the world is today.
00:09:33.420 | Imagine a world in which you can actually get better utilization from the GPUs you're already paying for.
00:09:39.260 | I don't know your workload, but you're probably somewhere between 30%, maybe 50% utilization,
00:09:44.940 | which means you're paying for, like, two to three times the amount of GPU that you should be.
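
The arithmetic behind that claim is just the inverse of utilization (a back-of-the-envelope sketch with a made-up hourly price):

```python
hourly_price = 4.00  # hypothetical $/GPU-hour

for utilization in (0.30, 0.50, 0.90):
    effective = hourly_price / utilization  # dollars per hour of useful work
    print(f"{utilization:.0%} utilization -> ${effective:.2f} per useful "
          f"GPU-hour ({1 / utilization:.1f}x the fully-utilized cost)")
```
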
00:09:48.780 | And that is understandable given the technology today.
00:09:53.340 | But that's not great for lots of obvious reasons.
00:09:55.660 | Imagine a world where you have the full power of CUDA.
00:09:59.340 | So you don't have to say there's a powerful thing and there's an easy-to-use thing.
00:10:04.860 | You can have one technology stack that scales.
00:10:08.460 | Well, this is something that is really hard.
00:10:10.220 | This is something where, you know, NVIDIA has a lot of very good software people,
00:10:14.220 | and they've been working on this for 15 years.
00:10:15.820 | But I don't know about you.
00:10:17.740 | I don't run 15-year-old software on my cell phone.
00:10:20.140 | Like, it doesn't run BlackBerry software either.
00:10:22.380 | And I think that it's time to really rethink this technology stack and push the world forward.
00:10:26.860 | And that's what we're trying to do.
00:10:29.180 | And so how does it work?
00:10:30.780 | Well, you know, it's just like PyTorch.
00:10:32.700 | You use one line of code and switch out CPU to GPU.
00:10:35.100 | Ha-ha.
00:10:36.860 | We've all seen this, right?
00:10:38.220 | This doesn't say anything.
00:10:39.260 | I actually hate this kind of a demo.
00:10:41.660 | Because the way this is usually implemented is by having a big fork at the top of two completely
00:10:47.900 | different technology stacks.
00:10:49.420 | One built on top of Intel MKL.
00:10:51.100 | One built on top of CUDA.
00:10:52.220 | And so as a consequence, nothing actually works the same except for the thing on the slide.
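
The "one line of code" pattern being poked at here is the standard PyTorch device switch, which looks roughly like this:

```python
import torch

model = torch.nn.Linear(1024, 1024)
x = torch.randn(8, 1024)

# The classic one-line switch: the same script targets CPU or GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, x = model.to(device), x.to(device)
y = model(x)

# Underneath, though, the CPU and GPU paths dispatch to completely different
# vendor libraries (e.g., MKL kernels vs. cuBLAS/cuDNN kernels), which is the
# "big fork" between two technology stacks described above.
```
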
00:10:57.820 | And so what Modular has done here is we've gone down and said,
00:11:01.180 | let's replace that entire layer of technology.
00:11:04.060 | Let's replace the matrix multiplications.
00:11:06.140 | Let's replace the fuse attention layers.
00:11:08.540 | Let's replace the graph thingies.
00:11:10.220 | Let's replace all this kind of stuff.
00:11:11.900 | And then make it work super easily, super predictably.
00:11:14.220 | And let's make it all stitched together.
00:11:16.300 | And yeah, it looks fine on the slide.
00:11:17.980 | But the slide is missing the point.
00:11:19.420 | So if you are an advanced developer, and so many of you don't want to know about this,
00:11:25.100 | and that's cool.
00:11:25.820 | If you are an advanced developer, like I said, you get the full power of CUDA.
00:11:29.260 | And so if you want, you can go write custom kernels directly against Max.
00:11:35.580 | And that's great.
00:11:37.020 | And for advanced developers, which I'm not going to dive too deeply into,
00:11:40.460 | it's way easier to use than things like the Triton language and things like this.
00:11:44.620 | And it has good developer tools, and it has all the things that you'd expect from a world-class
00:11:49.900 | implementation of GPU programming technology.
00:11:51.980 | For people who don't want to write kernels, you also get a very fancy autofusing compiler and
00:11:57.100 | things like this.
00:11:57.740 | And so you get good performance for the normal cases without having to write the hand-fused
00:12:02.540 | kernels, which is, again, a major usability improvement.
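
For context on what "autofusing" buys you: instead of launching one memory-bound GPU kernel per elementwise operation, a fusing compiler merges the chain into a single kernel. The same idea is visible in PyTorch's torch.compile, shown here purely as an analogy (MAX's compiler is its own implementation):

```python
import torch

def bias_gelu(x, b):
    # Two elementwise ops (add, GELU) that a fusing compiler can emit
    # as one kernel instead of two separate passes over memory.
    return torch.nn.functional.gelu(x + b)

fused = torch.compile(bias_gelu)  # compiles and fuses the elementwise chain

x, b = torch.randn(4096, 4096), torch.randn(4096)
y = fused(x, b)
```
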
00:12:05.740 | Now, you know, it's cool.
00:12:08.380 | Like, there's a lot of things out there that promise to be easy.
00:12:10.780 | But what about performance?
00:12:12.940 | A lot of the reason to use a GPU in the first place is about performance.
00:12:16.380 | And so one of the things that I think is pretty cool, and one of the things that's very important
00:12:20.700 | to modular is that we're not comparing against the standards.
00:12:23.900 | to Modular is that we're not comparing against the standards.
00:12:25.820 | In this case, NVIDIA, they're experts in their architecture.
00:12:30.380 | And so if you go look at, again, there's a million ways to measure things, a microbenchmark.
00:12:34.940 | Go look at the core operation within a neural network, matrix multiplication.
00:12:40.220 | This is the most important thing for a wide variety of workloads and, again, one set of data.
00:12:45.660 | But we compare against cuBLAS, the hard-coded thing, and then also against CUTLASS,
00:12:52.620 | the more programmable C++-y thing.
00:12:55.260 | And so Max is meeting and beating both of these, you know, by just a little bit.
00:13:00.620 | I mean, you know, it depends on your bar, and data is complicated.
00:13:04.060 | But, you know, if you're winning by 30%, 30% is actually a pretty big deal given the amount of
00:13:09.340 | cost, the amount of complexity, the amount of effort that goes into these kinds of things.
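
For a sense of how such comparisons are typically measured, here is a hedged sketch of a matmul microbenchmark; it times plain PyTorch (which dispatches to cuBLAS under the hood) rather than MAX, so it only illustrates the methodology, not the numbers on the slide:

```python
import time
import torch

def bench_matmul(n=4096, iters=50, dtype=torch.float16):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):          # warm-up iterations
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    seconds = time.perf_counter() - start
    return 2 * n**3 * iters / seconds / 1e12  # achieved TFLOP/s

if torch.cuda.is_available():
    print(f"{bench_matmul():.1f} TFLOP/s")
```
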
00:13:13.420 | And so I've talked a lot about the what, but I haven't talked about the how.
00:13:18.700 | And so the how is actually a very important part of this, and I'll just give you a sample of this.
00:13:23.500 | So we are crazy enough that we decided to go rebuild the world's
00:13:27.180 | first AI stack from the bottom up for Gen AI.
00:13:29.980 | And as part of doing that, what we realized is we had to go even deeper.
00:13:33.900 | And so we built a programming language.
00:13:36.140 | We have a new programming language. It's called Mojo.
00:13:40.700 | And so the thing about Mojo is if you don't want to know about Mojo, you don't have to use Mojo.
00:13:46.140 | You can just use Max. It's fine. But we had to build Mojo in order to build Max.
00:13:51.500 | And I'll tell you just a couple of things about this. Our goal is that Mojo is the best way to
00:13:56.060 | extend Python. And that means that you can get out of C, C++, and Rust.
00:14:01.020 | And so what is it as a programming language? It's a full -- it's Pythonic. So it looks like Python.
00:14:06.620 | It feels like Python. Everything you know about Python comes over. And you don't have to retrain
00:14:10.620 | everything, which is a really big deal. You get a full tool chain. You can download it on your
00:14:14.780 | computer. You can use it in Visual Studio Code. It's open source. Available on Linux, Mac, Windows.
00:14:20.140 | 200,000 people, 20,000 people in Discord. It's really cool. I would love for you to go check it
00:14:26.220 | out if you're interested in this. But what is Mojo? Like, what actually is it? Fine. There's a programming
00:14:33.660 | language thing going on. Well, what we decided is we decided that AI needs two things. It needs
00:14:39.500 | everything that's amazing about Python. This is, in my opinion, the developers. This is the ecosystem.
00:14:46.620 | This is the libraries. This is the community. This is even the package managing. And, like,
00:14:52.220 | all the things that people are used to using already. Those are the things that are great about Python.
00:14:57.660 | But what is not great about Python, unfortunately, is its implementation. And so, what we've done is
00:15:02.540 | we've combined the things that are great about Python with some very fancy highfalutin compiler-y stuff,
00:15:08.700 | MLIR, all this good stuff that then allows us to build something really special. And so,
00:15:15.420 | while it looks like Python, please do forget everything you know about Python, because this is a different
00:15:20.700 | beast. And I'm not going to give you a full hour-long presentation on Mojo. But I'll give you
00:15:27.180 | one example of why it's a different beast, and I'll pull it back to something many of you care about,
00:15:30.540 | which is performance. And what I'll say is that Mojo is fast. How fast? Well, it depends. This isn't a
00:15:37.820 | slightly faster Python. This is a working-at-the-speed-of-light-of-the-hardware kind of system. And so,
00:15:43.820 | many people out there have found that it's about 100 times to 1,000 times faster. In crazy cases,
00:15:48.540 | it can be even better than that. But the speed is not the point. The point is what it means.
00:15:55.260 | And so, in Python, for example, you should never write a for loop. Python is not designed for writing
00:16:01.420 | for loops, if you care about performance, at least. In Mojo, you can go write code that does arbitrary
00:16:09.180 | things. This is an example pulled from our Llama 3 written in Mojo that does tokenization using a standard
00:16:14.220 | algorithm. It's chasing linked lists. It has if statements, for loops. It's just normal code.
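
To make that concrete, here is the kind of plain, loop-heavy code being described, written as ordinary Python for illustration (a toy greedy tokenizer, not the actual Mojo code from the Llama 3 implementation); the claim is that Mojo lets you write essentially this same code and have it compile to native speed:

```python
def greedy_tokenize(text, vocab, max_piece=16):
    """Greedy longest-match tokenization: just ifs, for loops, and slices."""
    tokens = []
    i = 0
    while i < len(text):
        best_len = 0
        best_id = vocab.get(text[i], 0)  # fall back to a single character
        # Try progressively longer substrings starting at position i.
        for j in range(i + 1, min(i + max_piece, len(text)) + 1):
            if text[i:j] in vocab:
                best_len = j - i
                best_id = vocab[text[i:j]]
        tokens.append(best_id)
        i += max(best_len, 1)
    return tokens

print(greedy_tokenize("hello world", {"hello": 1, " ": 2, "world": 3}))  # [1, 2, 3]
```
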
00:16:19.020 | And it's Python. It feels like Python. And that is really the point. And so, for you, the benefit
00:16:24.780 | of Mojo is, first of all, you can ignore it if you don't want to care about it. But if you do, you don't
00:16:29.820 | have to learn C, C++. You have lower cost by default versus Python because performance is cost. It means that,
00:16:37.900 | as a researcher, if you use this, you can actually have full stack hackability. And if you're a manager,
00:16:43.180 | it means that you don't have to have people that know Rust on your team and C++ and things like
00:16:47.740 | this. You can have a much more coherent engineering structure where you're able to scale into the
00:16:52.460 | problem no matter where it is. And so, if you want to see something super polarizing, go check the
00:16:56.780 | Modular blog, and we'll explain how it's actually faster than Rust, which many people consider to be the
00:17:00.860 | gold standard, even though it's, again, a 15-year-old language. So, I have to wrap things up.
00:17:05.740 | They'll get mad at me if I go over. The thing that I'm here to say is that many of you may want to
00:17:11.820 | go beyond the APIs. And they're fantastic. There's amazing technology out there. I'm very excited about
00:17:17.980 | them, too. But if you care about control over your data, you want to integrate into your security,
00:17:23.180 | you want customization, you want to save money, you want portability across hardware, then you need
00:17:26.860 | to get on to something else. And so, if you're interested in these things, then Max can be very
00:17:30.460 | interesting to you. Max is free. You can download it today. It's totally available. You can go nuts.
00:17:36.460 | We didn't talk about production or deployment or things like this, but if you want to do that,
00:17:40.940 | we can also help. We support production deployment on Kubernetes, SageMaker, and we can make it super
00:17:45.580 | easy for you. Our GPU support, like I said, is actually really hard. We're working really hard on
00:17:51.500 | this. We want to do this right. And so, it'll launch officially in September. If you join our Discord,
00:17:56.780 | you can get early access, and we'd be very happy to work with you ahead of that, too.
00:18:00.060 | We're cranking out new stuff all the time. And so, if you are interested in learning more, you can
00:18:06.700 | check out modular.com. Find us on GitHub. A lot of this is open source. And join our Discord. Thank you, everyone.
00:18:13.260 | We'll see you in the next one.