
State Space Models for Realtime Multimodal Intelligence: Karan Goel



00:00:00.000 | Maybe to set the stage a little bit, the last four or five years of AI have basically been
00:00:19.200 | really focused on this idea of batch intelligence, which is sort of pretty core to this idea of
00:00:25.920 | building like an AI system that can reason for long periods of time on a problem and then solve it.
00:00:31.600 | So you can think about like math problems or, you know, physics problems that are hard.
00:00:35.120 | There's a lot of applications where actually what you need are systems that are streaming.
00:00:40.320 | So they're real time. They work instantly. So imagine generating video, audio, or doing like
00:00:47.200 | understanding applications on sensor streams, et cetera. So it sort of bifurcates where there's
00:00:53.360 | these two different types of applications, similar to how there's, you know, generally this idea of
00:00:58.400 | having batch workloads and streaming workloads. And so a lot of what we've seen over the last few years
00:01:03.760 | has really been focused on batch APIs where you call a model in the cloud, it takes a few seconds,
00:01:09.120 | and then you get a pretty good response back. And now we're seeing some shift towards more real-time
00:01:15.040 | applications where you constantly will be querying a model and asking it to return responses at low
00:01:21.920 | latency and then using that to sort of interpret or generate information.
00:01:26.720 | And I think this, you know, this area is really exciting because it's going to be transformative to
00:01:33.680 | a lot of interesting applications that have so far actually not necessarily been the main focus for a
00:01:41.280 | lot of what we've seen over the last few years. So conversational voice is an example of this where you
00:01:46.240 | should be able to interact with the system and then talk to it, and it should be able to understand you
00:01:51.280 | and do all kinds of tasks on your behalf. This is similar to having assistants that are on device and run
00:01:58.000 | kind of really efficiently at low power at, you know, all times, regardless of whether you're on a phone or a laptop.
00:02:05.360 | And then things like world generation where, like, you can imagine actually playing a game that is
00:02:10.320 | generated in real time, similar to how the graphics are rendered on GPUs. And all of this, you know,
00:02:18.000 | should be able to happen in real time, on low power, on your phone, on your MacBook, et cetera. Robotics is
00:02:26.640 | another great example where it sort of culminates with all of these coming together on a single
00:02:31.760 | device that is trying to kind of interpret everything in the world. And so I think this is
00:02:38.400 | sort of the exciting intersection, which is, like, how do we make intelligence faster and cheaper so that
00:02:43.520 | we can put it everywhere, basically. And a couple of examples that are really powerful: real-time
00:02:49.840 | intelligence for conversational interfaces is going to be really interesting because you would be able to
00:02:55.040 | have an agent that can provide customer support for a problem, answer questions about health insurance,
00:03:00.320 | you know, call your vendor to pick up a shipment. All these coordination tasks that generally are
00:03:07.040 | annoying to do should be really automated and real-time intelligent agents should be doing them.
00:03:13.600 | And then humans can spend their time solving sort of harder problems that are more interesting. And
00:03:18.320 | in customer support, that could be dealing with, you know, the tail customers that are much more important
00:03:23.360 | because they're pissed off or they're more important because they have, you know, more customer value,
00:03:29.280 | et cetera. And similarly in robotics, there's this idea of, like, ingesting similar to humans, like audio,
00:03:35.360 | video, sensor data, and then responding instantly to a lot of these pieces of information. So I think
00:03:40.960 | this is sort of the world we should be living in where all of these intelligent models run super fast.
00:03:45.520 | They solve all these different problems and they're able to really kind of power these new experiences that are interactive at their core.
00:03:54.480 | So this is where we come in. We're building these real-time foundation models. So some of what I'll talk about is
00:04:02.800 | the work we've done in really building kind of new ideas for how you can create deep learning models. So
00:04:11.920 | I did my PhD before this. I was working with a lot of these folks for my PhD. Chris was our PhD advisor.
00:04:16.800 | And we were really focused on this idea that you should be able to have a model that can compress
00:04:22.400 | information as it comes into the model and use that to really kind of build powerful systems that are streaming at their core.
00:04:30.240 | And I'll talk a little bit about this, but that's really the technology that we've been working with for the last four or five years.
00:04:36.000 | We've been developing it in academia, and some of you might have heard of things like Mamba, which is sort of a
00:04:40.960 | more recent iteration of this technology. You know, I did my PhD working on some of the
00:04:46.000 | early iterations that nobody uses anymore, but are sort of the precursors to a lot of the modern stuff that
00:04:51.920 | is now more widely used. And now what we're doing at Cartesia is basically taking this and trying to
00:04:57.360 | understand how we can improve it, how we push the boundaries on what architectures can do.
00:05:02.400 | And I think it's an interesting question because, you know, we should not settle for
00:05:06.800 | having one way of doing things. I think that's sort of a poor way to kind of think about the future.
00:05:13.200 | So our approach is sort of like let's think about new ways of actually designing models that aren't
00:05:18.240 | necessarily built on, let's say, the transformer architecture and the standard recipe for deep learning
00:05:23.120 | that's, you know, prevalent today. And I think it boils down to this question of like efficiently
00:05:29.520 | modeling long context is a huge problem because, you know, a lot of practical data is really long
00:05:34.800 | sequence data. I think text is maybe the least interesting long sequence data because text is
00:05:39.840 | actually fairly compressed already, right? Like you have a lot of information that is embedded in two
00:05:46.400 | minutes of -- or two sentences of text. But there are all these other domains, you know, audio, video,
00:05:52.240 | et cetera, where there's so much information. You know, imagine looking at a security camera for a
00:05:56.880 | day. Like you would probably have just so much information coming into the system and just very
00:06:02.320 | little of that would be useful. So compression is kind of really fundamental to intelligence because
00:06:06.640 | we're able to do this where we can look at all this information and then sort of compress it down to
00:06:11.360 | whatever's necessary to remember or understand. And I think so far what we've seen is that the AI
00:06:17.360 | systems that we built have not necessarily exhibited that same behavior. So they're really kind of built
00:06:22.400 | not on the principles of compression, but more on this idea of retrieval, like keeping all the context
00:06:27.040 | around and then using it to reason over all the information that you've seen. So I think our kind of
00:06:32.640 | point of view is that multimodal AI will remain challenging as long as you're sort of working in
00:06:37.680 | this paradigm. Because if you try to think about what humans do in a year, you're basically processing
00:06:44.080 | and understanding about a billion text tokens, 10 billion audio tokens. These are, you know,
00:06:48.960 | back-of-the-envelope calculations that I did. And about a trillion video tokens, which probably underestimates
00:06:53.840 | how much video we process, and that's not including all the other sensory information that you're processing.
00:06:58.720 | And you're doing it simultaneously. And you're doing it on a computer that fits in your brain.
00:07:02.880 | And you, you know, sometimes don't eat and drink and, you know, you're still functioning fine. So,
00:07:08.320 | you know, you can have variable amounts of power in the system. So I think the idea that, like,
00:07:14.320 | intelligence is solved is sort of very far from the truth because humans just are an extremely amazing
00:07:20.320 | machine that does something very extraordinary in a very compressed way that our AI models can't do.
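[Editor's note: a quick, hedged sketch of how back-of-the-envelope estimates like the ones above can be put together. The per-second rates below are illustrative assumptions, not the speaker's figures; only the orders of magnitude matter.]

```python
# Hedged back-of-the-envelope sketch of yearly human "token" throughput.
# All per-second rates are illustrative assumptions, not figures from the talk.

SECONDS_AWAKE_PER_YEAR = 16 * 3600 * 365  # ~2.1e7 waking seconds per year

def yearly_tokens(tokens_per_second: float) -> float:
    """Tokens processed in a year at a constant rate during waking hours."""
    return tokens_per_second * SECONDS_AWAKE_PER_YEAR

# ~1e9 text tokens/year works out to roughly 50 tokens/s of language exposure.
print(f"text : {yearly_tokens(50):.1e}")
# A neural audio codec at a few hundred tokens/s lands around 1e10 per year.
print(f"audio: {yearly_tokens(500):.1e}")
# Video at ~30 fps with a few hundred tokens per frame approaches 1e11-1e12.
print(f"video: {yearly_tokens(30 * 256):.1e}")
```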
00:07:27.200 | So I think that's sort of our, you know, sort of the reason we get up in the morning is we think
00:07:32.400 | about this and we're like, yeah, we're very far away from where we should be. And the best models
00:07:37.840 | today are in the, you know, 10 million, 100 million sort of token range. So that's really good. A lot of
00:07:43.280 | progress has been made. But really, this is sort of what we aspire to is how do you kind of build these
00:07:47.200 | machines that are long lived that can actually understand information over very long periods of time.
00:07:52.480 | And I think the cool thing is, like, as a human, you can remember things that happened 30 years ago with very
00:07:57.200 | little effort. You don't need to do RAG or retrieval or anything. You just, you know, you remember it.
00:08:01.840 | It's gisted in your brain and then you figure it out, basically. So I think that's kind of an extraordinary
00:08:06.720 | capability that we should be able to put into our AI models as well. And so some of the big problems with
00:08:14.720 | models today are, you know, they're built on transformers, really optimized for data center. I think we see this
00:08:20.880 | with, like, a lot of the work we did, which was on sub-quadratic models. So quadratic scaling in context
00:08:25.600 | length really just means that, you know, the amount of computation you have to do to process long
00:08:31.120 | amounts of context is very large. And so right now the sort of predominant approach is to throw compute
00:08:36.400 | at that problem and then hope that that would scale. Obviously, compute is a very important piece of the
00:08:41.440 | puzzle because you do need more computation to be able to do more difficult things. But this type of
00:08:46.880 | approach, because of the quadratic scaling, actually has poor scaling with, you know, very large multimodal
00:08:51.280 | context. And text contexts tend to be shorter. Multimodal contexts will get larger because you
00:08:56.400 | have just way more tokens and information that's going into the system. So that's going to be a big
00:09:00.320 | challenge for these models, especially how do you do this inference efficiently so you're not, you know,
00:09:04.880 | burning down the data centers to, you know, do a fairly limited amount of inference. Like, you have to
00:09:10.160 | imagine that we're doing a thousand times or, you know, a hundred thousand times more inference. And then
00:09:15.040 | if the models are scaling the same way, it's going to be really, really, really expensive. So you're not going to be able to
00:09:19.840 | permeate all these applications that I talked about very easily. And so, you know, that's sort of a big
00:09:24.960 | challenge, I would say. And so, you know, again, our hypothesis is you need new architectures and
00:09:30.400 | that's kind of where we spend our time and we want to make these models more efficient, faster, more capable
00:09:34.640 | while being able to handle all these long context problems. This is a slide about, you know, transformers
00:09:40.160 | being somewhat inefficient at handling this, but obviously a very good recipe for scaling these models out.
00:09:48.640 | And so, you know, some of the work that we've been doing is new fundamentally efficient architectures.
00:09:52.880 | So they have compression at their core. So they sort of -- the way they operate -- I'll have a slide
00:09:58.000 | on this just to give you kind of a quick illustration. But they really scale more linearly in context length.
00:10:04.880 | So because of this, you should be able to have, like, more low-power implementations of these models.
00:10:10.320 | You can compress information as it comes into the system. You have low memory usage.
00:10:15.600 | And you can actually scale to much more massive context because of that.
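[Editor's note: to make the scaling point concrete, here is a minimal sketch, purely illustrative and not from the talk, of how per-sequence cost grows with context length L for an attention-style model versus a recurrent/SSM-style scan. Constants are ignored; only the growth rate matters.]

```python
# Illustrative sketch: attention-style generation touches every past token,
# so total work over a sequence of length L grows roughly like L^2; a
# recurrent/SSM-style model does constant work per token against a
# fixed-size state, so total work grows roughly like L.

def attention_cost(L: int) -> int:
    # Token t attends to t previous positions: 1 + 2 + ... + L = L(L+1)/2
    return L * (L + 1) // 2

def recurrent_cost(L: int) -> int:
    # Constant work per token to update a fixed-size state.
    return L

for L in (1_000, 100_000, 10_000_000):  # roughly text-, audio-, video-scale contexts
    print(f"L={L:>11,}  attention ~ {attention_cost(L):.1e}   recurrent ~ {recurrent_cost(L):.1e}")
```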
00:10:18.480 | And this is all the work around SSMs. I just threw this nice slide, which I thought was cool.
00:10:26.080 | Jensen had an interesting quote about SSMs in one of his Wired articles that I like to keep talking
00:10:31.440 | about. But I think it's a cool technology that has a lot of potential and sort of that's where we're
00:10:36.880 | spending a lot of our time. And if you folks are interested in reading more, there's lots of videos on
00:10:41.120 | YouTube and lots of sort of resources that try to make this more accessible to understand and kind
00:10:45.760 | of get into some of the details. But, you know, the working intuition is basically --
00:10:50.560 | transformers are generating quadratically by attending to every past token of information.
00:10:54.720 | So as tokens come into the system, you're sort of keeping them around
00:10:58.160 | and then looking at all the past tokens. So if you want to generate the word "jumped"
00:11:01.840 | from "the quick brown fox," you would actually look at the entire context,
00:11:04.960 | try to understand what the next word should be,
00:11:06.960 | and then generate it, push it into the context, do it again.
00:11:10.480 | With SSMs, you just have a streaming system. So you have a token stream in,
00:11:15.040 | they update an internal memory for the model, and then the token gets thrown away.
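[Editor's note: roughly what that streaming update looks like, as a toy sketch using a plain linear state-space recurrence; this illustrates the idea only, not Cartesia's or Mamba's actual parameterization.]

```python
import numpy as np

# Toy linear state-space recurrence: each incoming token updates a fixed-size
# state and is then discarded, so memory stays constant however long the
# stream gets. (Illustrative only; real SSMs like Mamba use a more elaborate,
# input-dependent parameterization.)

d_state, d_token = 16, 8
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d_state, d_state))  # state transition
B = rng.normal(scale=0.1, size=(d_state, d_token))  # input projection
C = rng.normal(scale=0.1, size=(d_token, d_state))  # readout from state

state = np.zeros(d_state)  # the "zipped file" summary of everything seen so far
for token in rng.normal(size=(100_000, d_token)):   # an arbitrarily long stream
    state = A @ state + B @ token   # fold the new token into the state
    output = C @ state              # generate from the state alone
    # Nothing else is kept: the token itself is thrown away after this step.
```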
00:11:19.520 | So that actually really simplifies the system. And that's why it's such a core
00:11:22.880 | sort of streaming interface, because you're just not keeping all this memory around about
00:11:26.800 | what happened in the past. You're compressing it into some sort of zipped file state inside the
00:11:31.840 | model that's going to be used to do a future generation. And so this is sort of taking
00:11:37.600 | advantage of this idea of recurrence, which is sort of core to how even humans
00:11:42.960 | do a lot of their reasoning. And, you know, the last few months, a lot of these models have been getting
00:11:48.240 | adopted. So it's great that, you know, a lot of folks are now excited about this, you know,
00:11:53.360 | alternate way of doing things that is much more sort of oriented around this idea of recurrence,
00:11:59.440 | rather than retrieval. And so I think, like, we'll see a lot more activity in this,
00:12:03.920 | especially with multimodal data becoming more important. And, you know, a lot of the challenges
00:12:08.480 | of multimodal data around efficiency will mean that I think that these models will have more of a role
00:12:13.120 | to play in the next three to five years, as we also do our work in scaling them up and making them
00:12:18.320 | more interesting. A lot of people ask me about quality. I only have a few minutes, so I'll go through
00:12:23.120 | the rest of the slide super fast. But, you know, SSMs generally have the right quality. Obviously,
00:12:29.680 | there's a tradeoff between compression and keeping all of the information around. But actually, like,
00:12:34.480 | compression can be helpful. So if you imagine the security camera example, if you're watching 24
00:12:39.280 | hours of footage, actually compressing all of that information on the fly would help you solve tasks
00:12:43.600 | and answer questions better rather than looking at all 24 hours every time. So I think that's sort
00:12:48.720 | of the rule of thumb to think about, which is: compression is super helpful for a large context,
00:12:52.400 | not as helpful for short context. And so we see that quality actually is very good for long context
00:12:58.560 | problems and multimodal problems. Let me talk quickly about some of the work we've been doing.
00:13:03.120 | So we've been starting to work on sort of multimodal data. And we did a release a few weeks ago
00:13:07.600 | for a voice generation model. So this is sort of text-to-speech and sort of in line with some of
00:13:12.320 | the work we're doing to bring more multimodal data into a single model and use SSMs to power the
00:13:19.280 | inference and the training and so on. So this is a model you can actually play with. I'll try to show
00:13:23.600 | you a demo. But one of the things we're proudest of with this model is that we really shrunk the
00:13:28.320 | latency down. So when you play with it on the playground, you get instant voice back generated from
00:13:32.720 | the data center. And there's some cool work we're doing to actually run these models on Macs
00:13:37.680 | and other devices so that you can basically have the same experience as you have in the data center,
00:13:41.680 | but on any device. And do that efficiently and at low power. How much time do I have?
00:13:46.000 | Okay. We're out of time, but I was also almost done. So go to the website, play.cartesia.ai. I
00:13:54.000 | unfortunately couldn't walk through the demo, but play with it and send us feedback. This is my email,
00:13:59.120 | in case you want to send me a note. I would love to hear feedback and anything that you folks find
00:14:04.080 | interesting. Thank you.
00:14:16.800 | Thank you.