
Keynote: The AI developer experience doesn't have to suck – why and how we built Modal



00:00:00.320 | Hi, my name is Erik Bernhardsson. It's great to be here virtually.
00:00:04.140 | Who am I? I am the CEO of a company called Modal.
00:00:07.940 | We are based here in New York.
00:00:09.500 | Most of my background is in data, AI, machine learning.
00:00:13.140 | And in particular, I was at Spotify for many years
00:00:15.980 | and built a music recommendation system there.
00:00:17.940 | I did leave about 10 years ago and did all kinds of other stuff in between.
00:00:21.220 | But I started Modal about four or five years ago, during the pandemic.
00:00:25.700 | And the mission I had at that point was to build an infrastructure platform
00:00:32.100 | for data, AI, machine learning in a way that makes it fun again to write these applications.
00:00:40.220 | Basically to deploy models, scale them out, run large-scale batch jobs,
00:00:44.720 | making it possible to focus on writing code and not have to deal with infrastructure.
00:00:49.300 | As it turned out, GenAI was a perfect use case for this.
00:00:52.260 | We just didn't know it at that time.
00:00:54.860 | Modal is very much focused on high code use cases.
00:00:57.460 | What that means is we focus on people who want to write their own code,
00:01:00.220 | in particular, writing their own models,
00:01:01.880 | but also in many cases using existing models
00:01:04.080 | in a way where you want to have control over the workflow or other things.
00:01:08.120 | And so you can think of it more as like Kubernetes or AWS Lambda
00:01:11.900 | in the sense that we can run arbitrary containers or arbitrary code.
00:01:14.680 | We do focus on Python right now, might add other languages in the future.
00:01:18.300 | Unlike a system like Kubernetes, we're fully managed.
00:01:23.000 | So we run all the infrastructure.
00:01:24.500 | We have a big pool of thousands of GPUs and CPUs,
00:01:27.980 | but we let you run all kinds of applications in our cloud.
00:01:32.280 | And this could really be anything.
00:01:35.660 | In that sense, we're not an AI API.
00:01:37.260 | We don't have one model or 10 models that we put behind an API doing that token prediction.
00:01:42.940 | You can really run anything, which puts a little bit more onus on the developer to build this thing,
00:01:47.520 | but it also makes this a lot more powerful.
00:01:49.480 | In particular, when I think about platforms and how they make you productive
00:01:55.000 | and make it fun to write code, a lot of my experience is that it comes down to fast feedback loops.
00:02:01.860 | So in order to make engineers fast and more productive, you want to have this super fast feedback loop that lets you iterate on code very quickly.
00:02:08.580 | I think the cloud has been a phenomenal invention and lets us build far more powerful things.
00:02:15.400 | But it's arguably a step backwards in terms of developer experience.
00:02:18.720 | And thinking a lot about this problem, what I realized was in order to solve this,
00:02:23.520 | we had to build our own system to start containers in the cloud very fast.
00:02:28.480 | Because if you can start containers in the cloud very fast, you can take code that the user is building locally and execute in the cloud,
00:02:34.800 | maybe inside a custom image running on a GPU, whatever, and have that sort of fast feedback loop that you get when you run things locally.
00:02:42.440 | As it turns out, solving container cold start in a distributed system is a very, very deep rabbit hole.
00:02:49.500 | We had to build our own scheduler.
00:02:51.000 | We had to build our own file system and many other things.
00:02:55.080 | We set out on a multi-year journey that we still haven't completed, building a lot of this very, very core, very foundational infrastructure.
00:03:01.000 | Modal today, you can think of it as having two facets.
00:03:04.440 | One is a big resource pool.
00:03:05.980 | We run thousands of GPUs, different types, H100s, A100s, L4s, T4s, you name it.
00:03:12.560 | And the only way to access those is through a Python SDK.
00:03:15.980 | We might add other languages in the future, like I mentioned, but right now it's Python.
00:03:19.300 | And the reason we started with Python is obviously that Python is such a dominant language in AI,
00:03:24.660 | machine learning, and data applications.
00:03:27.300 | One way to think about Modal is that it's a serverless framework that basically lets you take any Python function and turn it into a serverless function.
00:03:36.180 | And so you do that by applying this decorator, as you can see in this code sample.
00:03:39.500 | I'll show you a little bit more examples in a second.
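The code sample on the slide isn't captured in the transcript, but the pattern being described looks roughly like this minimal sketch using Modal's public Python SDK (the app name is made up; older SDK versions call modal.App modal.Stub):

```python
import modal

app = modal.App("hello-modal")  # hypothetical app name

# The decorator is all it takes to make this function runnable
# as a serverless function in Modal's cloud.
@app.function()
def hello(name: str) -> str:
    return f"hello, {name}"
```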
00:03:41.680 | People use modal for very large-scale applications, but also small-scale applications.
00:03:47.120 | The biggest use case is most likely GenAI inference.
00:03:51.220 | We, in particular, have seen a lot of traction within diffusion models.
00:03:56.960 | So, for instance, AI-generated music, video, and images, but also a lot of batch jobs, for instance processing very large-scale medical images or doing computer vision on frames of videos.
00:04:12.460 | I'm seeing a lot of traction in computational bio, things like protein folding, both things running on GPUs, but also CPUs.
00:04:20.040 | Of course, LLMs.
00:04:21.140 | You can't talk about GenAI without mentioning LLMs.
00:04:23.520 | We have a lot of fine-tuning applications, batch embeddings, of course, inference as well.
00:04:29.100 | One of the customers I always think is incredibly cool is Suno.
00:04:35.460 | They do AI-generated music and run a lot of their inference on Modal, but we have many other use cases, some running at a very large scale, doing all kinds of different applications.
00:04:49.020 | It's a little bit abstract to talk about Modal without going into code.
00:04:53.780 | So I'm going to do some live coding. Let's jump into the terminal, and I'll try to give you an idea of what it looks like in code.
00:05:02.260 | So let's look at a very, very basic modal application.
00:05:06.100 | Modal, basically, one way to think about it is we take Python functions and turn them into things that run in the cloud.
00:05:14.060 | There's a very simple function called square, which returns the square of a number and also prints some stuff to standard error.
00:05:20.740 | And this decorator that we apply, app.function, takes that and turns that into a serverless function running in the cloud.
00:05:29.060 | And there's a few different ways to invoke this thing, but we have a little thing here that basically makes sure to trigger it from our laptop when we run it from the command line.
00:05:36.880 | So we're going to do that.
00:05:37.540 | So Modal has a little command line interface that basically lets you run things interactively.
00:05:42.940 | And what happens when we run this thing is we take the code, we stick it in a container, we execute it in the cloud.
00:05:48.540 | As it's executing, it streams the output back.
00:05:51.280 | And the whole point of this is, like, we want to make it fast and feel like we're almost developing things locally.
00:05:55.860 | It's almost as fast as running things locally.
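A hedged reconstruction of the kind of script being demoed here (the file name and app name are invented; the exact code on screen isn't in the transcript):

```python
# square.py
import modal

app = modal.App("square-demo")  # hypothetical app name

@app.function()
def square(x: int) -> int:
    print(f"computing the square of {x}")  # output is streamed back to your terminal
    return x * x

# Entry point that runs on your laptop and triggers the function in the cloud:
#   modal run square.py
@app.local_entrypoint()
def main():
    print(square.remote(42))
```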
00:05:58.080 | And this extends to things like, let's say we want to edit this thing and just, you know, print something else.
00:06:03.660 | And instead of having to rebuild a container, push up the container to the cloud, download logs, et cetera, with a slow feedback loop, it just picks up the latest code, right?
00:06:14.940 | And rebuilds the container automatically and all these things, right?
00:06:17.240 | And so while you're, like, building applications and rewriting code, you can always just run things in the cloud very, very fast.
00:06:23.980 | So far, this just showcases the sort of iteration speed.
00:06:28.320 | But also, let's look at the power of modal.
00:06:30.200 | Like, what can you do with modal?
00:06:31.300 | Like, what kinds of stuff can we run? Can we get to scale?
00:06:33.560 | Can we run things on other types of hardware?
00:06:36.560 | So let's actually run this on an H100.
00:06:39.940 | And the way you do that in Modal is just on the function decorator: you say gpu equals H100.
00:06:46.100 | We have a bunch of other types.
00:06:48.360 | As I mentioned, we have A100s and T4s and all kinds of other ones.
00:06:51.480 | But let's run this on an H100, which is NVIDIA's flagship.
00:06:55.320 | And we can get access to an H100 in a couple of seconds.
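In code, that one change looks something like this (same hypothetical sketch as before, with only the decorator argument changed; "H100" is the GPU type string):

```python
import modal

app = modal.App("square-demo")  # hypothetical app name

# Only change from before: request an H100 GPU for each container.
@app.function(gpu="H100")
def square(x: int) -> int:
    print(f"computing the square of {x}")
    return x * x
```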
00:07:00.680 | This is obviously not using the H100, but we're running it in a container that has access to an H100.
00:07:06.860 | So let's say we want to actually access it.
00:07:09.160 | Now we need to probably install some software, right?
00:07:11.400 | So we might want to install Torch in this case.
00:07:14.080 | There's a few different ways you can do that in modal.
00:07:16.540 | You can give us a Docker file.
00:07:18.140 | You can also point to a Docker image.
00:07:19.660 | But the easiest thing to do that is to basically define the entire compute environment in code.
00:07:25.820 | So we're going to define the container image using modal's Python SDK.
00:07:30.720 | So we're going to say image equals modal.Image.debian_slim as the base image.
00:07:36.120 | And we're going to pip install Torch.
00:07:38.380 | And then we're going to use this image on this function.
00:07:40.980 | And we're going to import Torch.
00:07:43.920 | And just to show that it works, we're going to print torch.cuda.get_device_name.
00:07:50.840 | Hopefully this works when I run this.
00:07:53.480 | We'll delete this line.
00:07:54.540 | And when I run this thing, hopefully it will print something like we're running on H100.
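Roughly what the demo script looks like at this point (a sketch using Modal's image-building API; the exact code on screen isn't in the transcript):

```python
import modal

app = modal.App("square-demo")  # hypothetical app name

# Define the container image in code: a slim Debian base with torch pip-installed.
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="H100", image=image)
def square(x: int) -> int:
    import torch  # imported inside the function, where the image (and torch) is available

    # Prints something like "NVIDIA H100 ..." to confirm the GPU is visible.
    print(torch.cuda.get_device_name())
    return x * x
```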
00:08:00.100 | And as you can see, it's still very fast, but slightly slower this time because loading Torch takes a little bit of extra overhead.
00:08:10.120 | And we'll talk about in a second what we've done to reduce that overhead.
00:08:13.220 | But it takes maybe about a second to initialize Torch.
00:08:17.580 | Okay, cool.
00:08:18.420 | So now we can run stuff on H100s.
00:08:21.560 | Let's try to run things on a lot of H100s.
00:08:24.360 | And so let's try to scale things out a little bit.
00:08:26.520 | In Modal, you can map over any function just in code.
00:08:32.560 | So instead of doing a single function invocation, we're going to fan out and do 1,000, or maybe let's do 10,000, function invocations.
00:08:40.240 | And you can do this in code by just saying we're going to map over 5,000.
00:08:45.440 | As I said, 10,000 actually.
00:08:46.820 | So let's do that.
00:08:47.560 | And we're going to unpack the iterator.
00:08:49.940 | And let's print X just to show some progress.
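The fan-out being typed here looks roughly like this (continuing the same hypothetical script; .map is Modal's parallel map over a function):

```python
import modal

app = modal.App("square-demo")  # hypothetical app name

@app.function(gpu="H100")
def square(x: int) -> int:
    return x * x

@app.local_entrypoint()
def main():
    # Fan out 10,000 invocations; Modal spins up containers to work through
    # them in parallel and yields results back as they complete.
    for x in square.map(range(10000)):
        print(x)
```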
00:08:56.680 | And what modal does when you fan out is that it's going to spin up as many containers as possible.
00:09:03.100 | And so you can see we're already running five containers, six containers, eight containers.
00:09:07.400 | It makes it very easy to fan out and start, you know, even hundreds of containers or even thousands of containers running on GPUs.
00:09:15.220 | If we keep this running for several minutes, we can easily scale up to a very large number.
00:09:20.820 | So this basically gives you the ability to take something that needs a lot of compute, like a batch job, and fan out,
00:09:31.180 | spin up thousands of containers, parallelize over it, and get results much faster.
00:09:36.160 | We're going to take a look at the UI for a second.
00:09:41.700 | Modal also has a UI that you can access.
00:09:45.180 | If you go to the website, the URL is printed in the console.
00:09:50.940 | So let's take a look at that.
00:09:55.020 | So we can see the app details in our UI.
00:09:58.000 | There's all kinds of interesting things here.
00:10:00.560 | Modal has a pretty rich UI that lets you see container metrics, logs, lets you set up users and many other things.
00:10:07.900 | So if you zoom in, for instance, on the number of containers, we can see here we spun up 18 containers at peak.
00:10:14.000 | As I mentioned, if we had kept going, we would have reached a much larger number, but we got to 18 containers at this point.
00:10:18.980 | We can look at the CPU utilization, GPU, et cetera.
00:10:22.460 | We can look at GPU temperature, 33 Celsius, even the watt consumption.
00:10:29.820 | So there's a lot of other things here.
00:10:31.980 | We can look at app logs and many other things.
00:10:33.820 | Okay, let's switch back to the terminal for a second and see some other stuff.
00:10:42.540 | There's a lot of stuff, so I'm not going to go into every single possibility of how to use Modal.
00:10:47.880 | But one thing I didn't show that I think is interesting and very valuable is you can also deploy these things.
00:10:54.380 | So far, we've only shown how to run things interactively, which means we run things from our laptop.
00:11:00.740 | But if I take this code and deploy this using Modal Deploy, we get this persistent endpoint.
00:11:08.440 | And what's nice about that is now we have this thing we can call from any other context in Python.
00:11:13.740 | And I'm just going to show this using my REPL.
00:11:15.680 | If we import modal and do a lookup like this, we get a handle to the remote function.
00:11:24.660 | So let's call this.
00:11:26.220 | And the first time we call it, we're going to have a cold start.
00:11:29.680 | So it's going to take a couple of seconds because the container has to start up.
00:11:32.760 | And remember, we're importing Torch and we're running this on an H100, so it takes a little bit of extra time.
00:11:36.440 | The container keeps running for 60 seconds by default and then it shuts down.
00:11:41.280 | So now it's actually idle.
00:11:42.640 | So if we call this again, typically it would be a little bit faster.
00:11:46.360 | And we're obviously wasting an enormous amount of flops using a GPU to calculate the square of a number.
00:11:52.560 | But this showcases how you can easily take things and deploy them, even on very powerful hardware, and build these serverless endpoints.
00:12:01.920 | For instance, doing inference.
00:12:03.080 | And Modal handles all the scaling.
00:12:05.780 | So when you invoke this function multiple times, we'll just scale up with more and more containers and then shut them down.
00:12:10.540 | There are many other things you can do with Modal.
00:12:13.740 | You can set up distributed file systems that you can mount to each container.
00:12:18.260 | So you can exchange information using the file system.
00:12:20.740 | You can set up web endpoints.
00:12:23.020 | You can set up cron jobs and many other things.
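Hedged sketches of what those three features look like in the SDK (the names are made up, and the web endpoint decorator has been renamed in newer SDK versions, so treat the exact decorators as approximate):

```python
import modal

app = modal.App("extras-demo")  # hypothetical app name

# A persistent, shared volume that can be mounted into containers.
volume = modal.Volume.from_name("shared-data", create_if_missing=True)

@app.function(volumes={"/data": volume})
def write_result():
    with open("/data/result.txt", "w") as f:
        f.write("hello from a Modal container\n")
    volume.commit()  # persist the change so other containers can see it

# A function exposed as an HTTP endpoint.
@app.function()
@modal.web_endpoint(method="GET")
def hello():
    return {"message": "hello"}

# A function run on a schedule (every day at 08:00 UTC).
@app.function(schedule=modal.Cron("0 8 * * *"))
def nightly_job():
    print("running scheduled work")
```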
00:12:27.340 | So this hopefully gives you a little bit more of an idea of what modal looks like from an engineering perspective.
00:12:31.960 | What does it look like when you're interacting through code with modal?
00:12:34.860 | Let's talk a bit about how modal works under the hood.
00:12:38.440 | And as I mentioned, in order to deliver on this developer experience that I always wanted to have,
00:12:45.100 | we had to go down this very deep rabbit hole and build a lot of custom infrastructure ourselves.
00:12:49.380 | That was the only way we felt we could make it fast enough.
00:12:52.520 | We couldn't use Kubernetes.
00:12:53.920 | We couldn't use Docker.
00:12:55.660 | So we had to build a lot of this stuff ourselves.
00:12:57.480 | And it should be pointed out that
00:12:59.080 | we're standing on the shoulders of giants here.
00:13:00.680 | We're using a fantastic container runtime called gVisor that gives us isolation,
00:13:06.700 | but we had to build a lot of stuff around it.
00:13:08.500 | We had to build our own scheduler and many other things,
00:13:11.700 | but we're obviously using a lot of the existing things in Linux and other systems.
00:13:16.020 | And we're using fantastic cloud tools as well.
00:13:18.220 | In order to deliver the developer experience that we wanted, as I mentioned,
00:13:23.600 | and the feedback loops that we wanted, we had to figure out container cold start.
00:13:26.900 | And container cold start, starting containers fast in a distributed system is a hard problem.
00:13:32.920 | So let's talk about what containers are to start with.
00:13:35.960 | Containers are, and this is my super crude, unfair generalization of what a container is,
00:13:41.020 | or container image.
00:13:42.320 | It's basically two things.
00:13:44.060 | It's a root file system, so that's like the slash that you have in Linux that contains all the data on your drive.
00:13:49.200 | And then it's a bunch of stuff to isolate processes, so they can't tamper with each other.
00:13:54.980 | There are many inefficiencies with how container images are stored and how container images are transferred.
00:14:04.280 | In particular, one of the issues is that there's a lot of junk.
00:14:08.020 | There's a lot of stuff we're never going to read.
00:14:10.100 | Like many container images have Perl installed by default, man pages, locale information, time zone information for Uzbekistan.
00:14:17.860 | You're never going to read this stuff.
00:14:19.380 | So we're sending all this data back and forth.
00:14:21.480 | And the core thing here is we want to start containers on a remote worker very, very fast.
00:14:26.760 | We want to minimize the amount of data that has to be transferred.
00:14:29.100 | We want to do as little as possible.
00:14:30.940 | The other inefficiency is that there's a lot of redundancy in this.
00:14:35.020 | A lot of the files that are being transferred back and forth are actually the same files.
00:14:38.920 | So if you grab just like three very different container images like I did in this case,
00:14:43.840 | and you look at the files, it actually turns out to be mostly the same files to a very large extent.
00:14:49.300 | So with those two tricks, with those two observations, there's a number of tricks we can do.
00:14:53.620 | And so we built what's called content-addressed storage.
00:14:58.020 | And this is not a new invention.
00:14:59.600 | This is not something we came up with.
00:15:01.220 | But it's rarely used in production systems.
00:15:03.560 | Notably, AWS Lambda actually uses the same technique.
00:15:06.700 | And the idea is that instead of storing the images directly, we store the images, the container images,
00:15:13.760 | as just a bunch of metadata that points to blobs.
00:15:18.400 | And for each blob, we compute a checksum or a hash value.
00:15:23.760 | And then we use that to deduplicate all the blobs because there's an enormous amount of redundancy in these blobs.
00:15:29.300 | And this means the container images themselves are actually just little pieces of metadata.
00:15:35.680 | And in many cases, we can cache a very large percentage of the container images and we can also avoid pulling data that we're not going to need by lazy loading a lot of the data on access.
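A toy illustration of the content-addressed idea (not Modal's actual implementation): the image becomes a small manifest of path-to-hash mappings, and identical blobs are stored, cached, and transferred only once.

```python
import hashlib
from pathlib import Path

def build_manifest(image_root: str) -> tuple[dict[str, str], dict[str, bytes]]:
    """Turn an unpacked image root into (manifest, blob store)."""
    manifest: dict[str, str] = {}   # relative path -> content hash
    blobs: dict[str, bytes] = {}    # content hash -> file bytes (deduplicated)
    for path in Path(image_root).rglob("*"):
        if path.is_file():
            data = path.read_bytes()
            digest = hashlib.sha256(data).hexdigest()
            manifest[str(path.relative_to(image_root))] = digest
            blobs.setdefault(digest, data)  # identical files stored only once
    return manifest, blobs
```

Two images that share most of their files then share most of their blobs, so pulling a second image mostly means fetching a small manifest plus whichever blobs aren't already cached on the worker.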
00:15:50.080 | This is tricky because container cold start in particular with Python is very latency sensitive because we end up doing a lot of very sequential file accesses.
00:16:01.780 | So in many cases, when a container starts up in Modal, or really in any Python process, it requires reading every single Python module, which in many cases ends up being several thousand modules.
00:16:16.640 | Each one of them requires accessing the file system, and we can't allow that to take several milliseconds, because if something takes several milliseconds and you do it a thousand times, it ends up taking several seconds, and we want to avoid that.
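A quick way to see the scale of this on your own machine (torch is just an example of a heavyweight dependency; the exact count varies by version and environment):

```python
import sys

before = len(sys.modules)
import torch  # a typical heavyweight ML dependency

# The jump is often well over a thousand modules, and each import
# touches the file system at least once.
print(f"modules loaded: {before} -> {len(sys.modules)}")
```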
00:16:31.480 | So there's a lot of tricks that we have to do in order to basically get this down below a second.
00:16:37.160 | We do a lot of prefetching, we do a lot of task tracing, we look at, you know, historical runs and see what files were accessed the last time it ran.
00:16:46.060 | And then building these containers is obviously also another whole challenge.
00:16:51.800 | We basically built our own container image builders.
00:16:53.900 | Another technique that we've also more recently started leveraging is that we can snapshot the CPU memory.
00:17:00.620 | So we talked about how we snapshot the container images and we cache a lot of the data, which means like when you're loading it, you don't have to fetch a lot of data.
00:17:10.080 | But what if we can avoid loading the data in the first place?
00:17:12.740 | What if we can just like revert to the memory state, the CPU memory, the RAM of a container?
00:17:19.460 | And as it turns out, gVisor actually supports this.
00:17:22.160 | And that's another way that arguably supersedes a lot of the previous stuff.
00:17:27.200 | In practice, the two techniques end up reinforcing each other in cutting down container cold start.
00:17:30.980 | But this lets us cut down even more dramatically.
00:17:33.560 | Things like stable diffusion, we can now start in a couple of seconds, even though it involves loading very, very large model weights, like, you know, five or 10 gigabytes.
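Opting in to memory snapshots looks roughly like this in the SDK at the time of writing (decorator and parameter names may have shifted across versions; the model here is a stand-in, not a real Stable Diffusion pipeline):

```python
import modal

app = modal.App("snapshot-demo")  # hypothetical app name
image = modal.Image.debian_slim().pip_install("torch")

@app.cls(image=image, gpu="H100", enable_memory_snapshot=True)
class Model:
    @modal.enter(snap=True)
    def load(self):
        # Work done here (imports, loading weights into CPU RAM) is captured
        # in the snapshot and restored on later cold starts instead of re-run.
        import torch
        self.weights = torch.zeros(1024, 1024)  # stand-in for real model weights

    @modal.enter(snap=False)
    def to_gpu(self):
        # GPU state isn't part of the CPU snapshot, so move to the GPU here.
        self.weights = self.weights.cuda()

    @modal.method()
    def predict(self, x: float) -> float:
        return float(self.weights.sum().item()) + x
```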
00:17:44.840 | We're also looking at GPU snapshotting, which will make things even faster, which is very exciting.
00:17:51.660 | And so doing all these things, you know, owning the entire stack, owning the file system, you know, building a storage system.
00:17:59.940 | I didn't talk about the storage system.
00:18:01.040 | We basically use R2 and we run in many different regions.
00:18:04.680 | And so we use both the CDN and the R2.
00:18:08.120 | And so all these optimizations together mean we kind of solve the problem of container cold start.
00:18:16.540 | And let's remember, why did we originally set out to start this thing?
00:18:20.840 | It's because we want to deliver good developer experience.
00:18:23.640 | As it turns out, it's good for other things, too.
00:18:28.300 | So container cold start is also good because it enables serverless.
00:18:32.080 | So what does serverless mean?
00:18:33.820 | It means a lot of different things.
00:18:35.560 | I think part of why sometimes I avoid the term serverless is that it has so many different definitions.
00:18:40.880 | But the promise of serverless was always don't provision more than you actually need.
00:18:46.440 | Just, you know, only pay for capacity you're actually using.
00:18:52.180 | And so, especially with GPUs, which are very expensive, as it turns out, you can take a lot of different users, pool them together, dynamically give people the resources they need, and get dramatically better utilization.
00:19:08.260 | And so that, in turn, means we can get lower cost.
00:19:13.940 | It means there's no capacity planning.
00:19:16.420 | It also means that, because we can pool a lot of these users, the relative variance of total demand goes down, which means we can run a much more predictable set of resource pools, in terms of the total capacity that we run.
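The statistical intuition: if you pool n roughly independent users, the mean of total demand grows like n but its standard deviation only grows like the square root of n, so the relative spread shrinks by about 1/sqrt(n). A toy simulation with made-up demand numbers shows the effect:

```python
import random
import statistics

def relative_spread(num_users: int, trials: int = 10_000) -> float:
    """Std dev of pooled demand divided by its mean, in a toy model where
    each user's GPU demand is an independent random draw."""
    totals = [
        sum(random.gauss(mu=10, sigma=5) for _ in range(num_users))
        for _ in range(trials)
    ]
    return statistics.stdev(totals) / statistics.mean(totals)

print(relative_spread(1))    # roughly 0.5 for a single user
print(relative_spread(100))  # roughly 0.05 once 100 users are pooled
```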
00:19:29.720 | Managing that total capacity is a whole other problem, by the way.
00:19:32.480 | So we need to run thousands of GPUs.
00:19:34.600 | We use a lot of different cloud vendors.
00:19:36.340 | We use a lot of different regions.
00:19:37.580 | We scale up and down continuously.
00:19:39.800 | In fact, we actually end up solving a mixed-integer programming problem to do this, minimizing the total spend.
00:19:46.680 | And this is some of the stuff we have to do for our customers so they don't have to think about it.
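A toy version of that kind of mixed-integer program (not Modal's actual formulation; the numbers and names are invented, and it assumes the PuLP package is installed):

```python
from pulp import LpInteger, LpMinimize, LpProblem, LpVariable, lpSum

# Pick how many GPUs to run in each (cloud, region) option so that forecast
# demand is covered at minimum hourly cost, respecting per-option capacity.
options = {
    "cloud_a_us_east": {"cost_per_gpu_hr": 3.2, "max_gpus": 256},
    "cloud_b_us_west": {"cost_per_gpu_hr": 2.9, "max_gpus": 128},
    "cloud_c_eu_west": {"cost_per_gpu_hr": 3.5, "max_gpus": 512},
}
demand_gpus = 300  # forecast concurrent GPU demand

prob = LpProblem("gpu_capacity", LpMinimize)
buy = {
    name: LpVariable(name, lowBound=0, upBound=spec["max_gpus"], cat=LpInteger)
    for name, spec in options.items()
}
prob += lpSum(buy[name] * spec["cost_per_gpu_hr"] for name, spec in options.items())
prob += lpSum(buy.values()) >= demand_gpus  # must cover forecast demand

prob.solve()
for name, var in buy.items():
    print(name, int(var.value()))
```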
00:19:51.480 | So through Modal, you can come in and request 100 GPUs.
00:19:54.600 | Under the hood, there's an enormous amount of work that we had to put in to get that capacity somewhere in the world, you know, spinning up GPUs if needed.
00:20:04.000 | But in many cases, it happens instantaneously because we can maintain a buffer that makes it very fast to get access to these compute resources for any customer.
00:20:10.720 | This was very technical, but just to kind of go back and look at a high level again, why do people like modal?
00:20:20.440 | People pick modal because they can run their own code.
00:20:25.240 | We're not an AI API, so to speak.
00:20:27.600 | You can run almost anything with modal.
00:20:29.160 | We make it possible to iterate very quickly.
00:20:31.820 | We're fully usage-based.
00:20:33.800 | So when you run things in modal, you only pay for the time the containers are actually active.
00:20:38.100 | You never have to think about capacity.
00:20:40.420 | You don't have to, you know, go out and buy, you know, hundreds of GPUs or thousands of GPUs.
00:20:45.080 | We can get you that within, you know, seconds or at least minutes.
00:20:48.880 | So there's this whole burden of infrastructure: building your own internal platform, setting up Kubernetes, setting up Docker, and all these things.
00:20:57.100 | You don't have to think about this with modal.
00:20:59.020 | How do you try modal?
00:21:01.480 | It's actually very simple.
00:21:02.940 | You go to your terminal and you do pip install modal.
00:21:05.120 | The Python client automatically, you know, configures itself to connect to modal.
00:21:11.160 | And you can immediately start running stuff because we give everyone $30 per month of free credits.
00:21:16.600 | If you are a startup, we can give you up to $50,000 in credits in order for you to get started.
00:21:23.520 | Thank you.
00:21:25.900 | And I really hope you enjoy this.
00:21:28.380 | And if you have any questions, feel free to reach out at erik@modal.com.
00:21:32.360 | You can also follow me on Twitter, Bernhardsson, or check out my blog, erikbern.com.