
State Space Models for Realtime Multimodal Intelligence: Karan Goel



00:00:00.000 | Maybe to set the stage a little bit, the last four or five years of AI have basically been
00:00:19.200 | really focused on this idea of batch intelligence, which is sort of pretty core to this idea of
00:00:25.920 | building like an AI system that can reason for long periods of time on a problem and then solve it.
00:00:31.600 | So you can think about like math problems or, you know, physics problems that are hard.
00:00:35.120 | There's a lot of applications where actually what you need are systems that are streaming.
00:00:40.320 | So they're real time. They work instantly. So imagine generating video, audio, or doing like
00:00:47.200 | understanding applications on sensor streams, et cetera. So it sort of bifurcates where there's
00:00:53.360 | these two different types of applications, similar to how there's, you know, generally this idea of
00:00:58.400 | having batch workloads and streaming workloads. And so a lot of what we've seen over the last few years
00:01:03.760 | has really been focused on batch APIs where you call a model in the cloud, it takes a few seconds,
00:01:09.120 | and then you get a pretty good response back. And now we're seeing some shift towards more real-time
00:01:15.040 | applications where you constantly will be querying a model and asking it to return responses at low
00:01:21.920 | latency and then using that to sort of interpret or generate information.
00:01:26.720 | And I think this, you know, this area is really exciting because it's going to be transformative to
00:01:33.680 | a lot of interesting applications that have so far actually not necessarily been the main focus for a
00:01:41.280 | lot of what we've seen over the last few years. So conversational voice is an example of this where you
00:01:46.240 | should be able to interact with the system and then talk to it, and it should be able to understand you
00:01:51.280 | and do all kinds of tasks on your behalf. This is similar to having assistants that are on device and run
00:01:58.000 | kind of really efficiently at low power at, you know, all times, regardless of whether you're on a phone or a laptop.
00:02:05.360 | And then things like world generation where, like, you can imagine actually playing a game that is
00:02:10.320 | generated in real time, similar to how the graphics are rendered on GPUs. And all of this, you know,
00:02:18.000 | should be able to happen in real time, on low power, on your phone, on your MacBook, et cetera. Robotics is
00:02:26.640 | another great example where it sort of culminates with all of these coming together on a single
00:02:31.760 | device that is trying to kind of interpret everything in the world. And so I think this is
00:02:38.400 | sort of the exciting intersection, which is, like, how do we make intelligence faster and cheaper so that
00:02:43.520 | we can put it everywhere, basically. And a couple of examples that are really powerful: real-time
00:02:49.840 | intelligence for conversational interfaces is going to be really interesting because you would be able to
00:02:55.040 | have an agent that can provide customer support for a problem, answer questions about health insurance,
00:03:00.320 | you know, call your vendor to pick up a shipment. All these coordination tasks that generally are
00:03:07.040 | annoying to do should be really automated and real-time intelligent agents should be doing them.
00:03:13.600 | And then humans can spend their time solving sort of harder problems that are more interesting. And
00:03:18.320 | in customer support, that could be dealing with, you know, the tail customers that are much more important
00:03:23.360 | because they're pissed off or they're more important because they have, you know, more customer value,
00:03:29.280 | et cetera. And similarly in robotics, there's this idea of, like, ingesting similar to humans, like audio,
00:03:35.360 | video, sensor data, and then responding instantly to a lot of these pieces of information. So I think
00:03:40.960 | this is sort of the world we should be living in where all of these intelligent models run super fast.
00:03:45.520 | They solve all these different problems and they're able to really kind of power these new experiences that are interactive at their core.
00:03:54.480 | So this is where we come in. We're building these real-time foundation models. So some of what I'll talk about is
00:04:02.800 | the work we've done in really building kind of new ideas for how you can create deep learning models. So
00:04:11.920 | I did my PhD before this. I was working with a lot of these folks for my PhD. Chris was our PhD advisor.
00:04:16.800 | And we were really focused on this idea that you should be able to have a model that can compress
00:04:22.400 | information as it comes into the model and use that to really kind of build powerful systems that are streaming at their core.
00:04:30.240 | And I'll talk a little bit about this, but that's really the technology that we've been working with for the last four or five years.
00:04:36.000 | We've been developing it in academia, and some of you might have heard of things like Mamba, which is sort of a
00:04:40.960 | more recent iteration of this technology. You know, I did my PhD working on some of the
00:04:46.000 | early iterations that nobody uses anymore, but are sort of the precursors to a lot of the modern stuff that
00:04:51.920 | is now more widely used. And now what we're doing at Cartesia is basically taking this and trying to
00:04:57.360 | understand how we can improve it, how we push the boundaries on what architectures can do.
00:05:02.400 | And I think it's an interesting question because, you know, we should not settle for
00:05:06.800 | having one way of doing things. I think that's sort of a poor way to kind of think about the future.
00:05:13.200 | So our approach is sort of like let's think about new ways of actually designing models that aren't
00:05:18.240 | necessarily built on, let's say, the transformer architecture and the standard recipe for deep learning
00:05:23.120 | that's, you know, prevalent today. And I think it boils down to this question of like efficiently
00:05:29.520 | modeling long context is a huge problem because, you know, a lot of practical data is really long
00:05:34.800 | sequence data. I think text is maybe the least interesting long sequence data because text is
00:05:39.840 | actually fairly compressed already, right? Like you have a lot of information that is embedded in two
00:05:46.400 | minutes of -- or two sentences of text. But there are all these other domains, you know, audio, video,
00:05:52.240 | et cetera, where there's so much information. You know, imagine looking at a security camera for a
00:05:56.880 | day. Like you would probably have just so much information coming into the system and just very
00:06:02.320 | little of that would be useful. So compression is kind of really fundamental to intelligence because
00:06:06.640 | we're able to do this where we can look at all this information and then sort of compress it down to
00:06:11.360 | whatever's necessary to remember or understand. And I think so far what we've seen is that the AI
00:06:17.360 | systems that we built have not necessarily exhibited that same behavior. So they're really kind of built
00:06:22.400 | not on the principles of compression, but more on this idea of retrieval, like keeping all the context
00:06:27.040 | around and then using it to reason over all the information that you've seen. So I think our kind of
00:06:32.640 | point of view is that multimodal AI will remain challenging as long as you're sort of working in
00:06:37.680 | this paradigm. Because if you try to think about what humans do in a year, you're basically processing
00:06:44.080 | and understanding about a billion text tokens, 10 billion audio tokens. These are, you know,
00:06:48.960 | back-of-the-envelope calculations that I did. And about a trillion video tokens, which probably underestimates
00:06:53.840 | how much video we process, and that's not including all the other sensory information that you're processing.
00:06:58.720 | And you're doing it simultaneously. And you're doing it on a computer that fits in your brain.
00:07:02.880 | And you, you know, sometimes don't eat and drink and, you know, you're still functioning fine. So,
00:07:08.320 | you know, you can have variable amounts of power in the system. So I think the idea that, like,
00:07:14.320 | intelligence is solved is sort of very far from the truth because humans just are an extremely amazing
00:07:20.320 | machine that does something very extraordinary in a very compressed way that our AI models can't do.
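[Editor's note: a quick, hedged sketch of how back-of-the-envelope estimates like the ones above can be put together. The per-second rates below are illustrative assumptions, not the speaker's figures; only the orders of magnitude matter.]

```python
# Hedged back-of-the-envelope sketch of yearly human "token" throughput.
# All per-second rates are illustrative assumptions, not figures from the talk.

SECONDS_AWAKE_PER_YEAR = 16 * 3600 * 365  # ~2.1e7 waking seconds per year

def yearly_tokens(tokens_per_second: float) -> float:
    """Tokens processed in a year at a constant rate during waking hours."""
    return tokens_per_second * SECONDS_AWAKE_PER_YEAR

# ~1e9 text tokens/year works out to roughly 50 tokens/s of language exposure.
print(f"text : {yearly_tokens(50):.1e}")
# A neural audio codec at a few hundred tokens/s lands around 1e10 per year.
print(f"audio: {yearly_tokens(500):.1e}")
# Video at ~30 fps with a few hundred tokens per frame approaches 1e11-1e12.
print(f"video: {yearly_tokens(30 * 256):.1e}")
```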
00:07:27.200 | So I think that's sort of our, you know, sort of the reason we get up in the morning is we think
00:07:32.400 | about this and we're like, yeah, we're very far away from where we should be. And the best models
00:07:37.840 | today are in the, you know, 10 million, 100 million sort of token range. So that's really good. A lot of
00:07:43.280 | progress has been made. But really, this is sort of what we aspire to is how do you kind of build these
00:07:47.200 | machines that are long lived that can actually understand information over very long periods of time.
00:07:52.480 | And I think the cool thing is, like, as a human, you can remember things that happened 30 years ago with very
00:07:57.200 | little effort. You don't need to do RAG or retrieval or anything. You just, you know, you remember it.
00:08:01.840 | It's gisted in your brain and then you figure it out, basically. So I think that's kind of an extraordinary
00:08:06.720 | capability that we should be able to put into our AI models as well. And so some of the big problems with
00:08:14.720 | models today are, you know, they're built on transformers, really optimized for data center. I think we see this
00:08:20.880 | with, like, a lot of the work we did, which was on sub-quadratic models. So quadratic scaling in context
00:08:25.600 | length really just means that, you know, the amount of computation you have to do to process long
00:08:31.120 | amounts of context is very large. And so right now the sort of predominant approach is to throw compute
00:08:36.400 | at that problem and then hope that that would scale. Obviously, compute is a very important piece of the
00:08:41.440 | puzzle because you do need more computation to be able to do more difficult things. But this type of
00:08:46.880 | approach, because of the quadratic scaling, actually has poor scaling with, you know, very large multimodal
00:08:51.280 | context. And text contexts tend to be shorter. Multimodal contexts will get larger because you
00:08:56.400 | have just way more tokens and information that's going into the system. So that's going to be a big
00:09:00.320 | challenge for these models, especially how do you do this inference efficiently so you're not, you know,
00:09:04.880 | burning down the data centers to, you know, do a fairly limited amount of inference. Like, you have to
00:09:10.160 | imagine that we're doing a thousand times or, you know, a hundred thousand times more inference. And then
00:09:15.040 | if the models are scaling the same way, it's going to be really, really, really expensive. So you're not going to be able to
00:09:19.840 | permeate all these applications that I talked about very easily. And so, you know, that's sort of a big
00:09:24.960 | challenge, I would say. And so, you know, again, our hypothesis is you need new architectures and
00:09:30.400 | that's kind of where we spend our time and we want to make these models more efficient, faster, more capable
00:09:34.640 | while being able to handle all these long context problems. This is a slide about, you know, transformers
00:09:40.160 | being somewhat inefficient at handling this, but obviously a very good recipe for scaling these models out.
00:09:48.640 | And so, you know, some of the work that we've been doing is new fundamentally efficient architectures.
00:09:52.880 | So they have compression at their core. So they sort of -- the way they operate -- I'll have a slide
00:09:58.000 | on this just to give you kind of a quick illustration. But they really scale more linearly in context length.
00:10:04.880 | So because of this, you should be able to have, like, more low-power implementations of these models.
00:10:10.320 | You can compress information as it comes into the system. You have low memory usage.
00:10:15.600 | And you can actually scale to much more massive context because of that.
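[Editor's note: to make the scaling point concrete, here is a minimal sketch, purely illustrative and not from the talk, of how per-sequence cost grows with context length L for an attention-style model versus a recurrent/SSM-style scan. Constants are ignored; only the growth rate matters.]

```python
# Illustrative sketch: attention-style generation touches every past token,
# so total work over a sequence of length L grows roughly like L^2; a
# recurrent/SSM-style model does constant work per token against a
# fixed-size state, so total work grows roughly like L.

def attention_cost(L: int) -> int:
    # Token t attends to t previous positions: 1 + 2 + ... + L = L(L+1)/2
    return L * (L + 1) // 2

def recurrent_cost(L: int) -> int:
    # Constant work per token to update a fixed-size state.
    return L

for L in (1_000, 100_000, 10_000_000):  # roughly text-, audio-, video-scale contexts
    print(f"L={L:>11,}  attention ~ {attention_cost(L):.1e}   recurrent ~ {recurrent_cost(L):.1e}")
```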
00:10:18.480 | And this is all the work around SSMs. I just threw this nice slide, which I thought was cool.
00:10:26.080 | Jensen had an interesting quote about SSMs in one of his Wired articles that I like to keep talking
00:10:31.440 | about. But I think it's a cool technology that has a lot of potential and sort of that's where we're
00:10:36.880 | spending a lot of our time. And if you folks are interested in reading more, there's lots of videos on
00:10:41.120 | YouTube and lots of sort of resources that try to make this more accessible to understand and kind
00:10:45.760 | of get into some of the details. But, you know, the working intuition is basically --
00:10:50.560 | transformers are generating quadratically by attending to every past token of information.
00:10:54.720 | So as tokens come into the system, you're sort of keeping them around
00:10:58.160 | and then looking at all the past tokens. So if you want to generate the word "jumped"
00:11:01.840 | from "the quick brown fox," you would actually look at the entire context,
00:11:04.960 | try to understand what the next word should be,
00:11:06.960 | and then generate it, push it into the context, do it again.
00:11:10.480 | With SSMs, you just have a streaming system. So you have a token stream in,
00:11:15.040 | they update an internal memory for the model, and then the token gets thrown away.
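[Editor's note: roughly what that streaming update looks like, as a toy sketch using a plain linear state-space recurrence; this illustrates the idea only, not Cartesia's or Mamba's actual parameterization.]

```python
import numpy as np

# Toy linear state-space recurrence: each incoming token updates a fixed-size
# state and is then discarded, so memory stays constant however long the
# stream gets. (Illustrative only; real SSMs like Mamba use a more elaborate,
# input-dependent parameterization.)

d_state, d_token = 16, 8
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d_state, d_state))  # state transition
B = rng.normal(scale=0.1, size=(d_state, d_token))  # input projection
C = rng.normal(scale=0.1, size=(d_token, d_state))  # readout from state

state = np.zeros(d_state)  # the "zipped file" summary of everything seen so far
for token in rng.normal(size=(100_000, d_token)):   # an arbitrarily long stream
    state = A @ state + B @ token   # fold the new token into the state
    output = C @ state              # generate from the state alone
    # Nothing else is kept: the token itself is thrown away after this step.
```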
00:11:19.520 | So that actually really simplifies the system. And that's why it's such a core
00:11:22.880 | sort of streaming interface, because you're just not keeping all this memory around about
00:11:26.800 | what happened in the past. You're compressing it into some sort of zipped file state inside the
00:11:31.840 | model that's going to be used to do a future generation. And so this is sort of taking
00:11:37.600 | advantage of this idea of recurrence, which is sort of core to how even humans
00:11:42.960 | do a lot of their reasoning. And, you know, the last few months, a lot of these models have been getting
00:11:48.240 | adopted. So it's great that, you know, a lot of folks are now excited about this, you know,
00:11:53.360 | alternate way of doing things that is much more sort of oriented around this idea of recurrence,
00:11:59.440 | rather than retrieval. And so I think, like, we'll see a lot more activity in this,
00:12:03.920 | especially with multimodal data becoming more important. And, you know, a lot of the challenges
00:12:08.480 | of multimodal data around efficiency will mean that I think that these models will have more of a role
00:12:13.120 | to play in the next three to five years, as we also do our work in scaling them up and making them
00:12:18.320 | more interesting. A lot of people ask me about quality. I only have a few minutes, so I'll go through
00:12:23.120 | the rest of the slide super fast. But, you know, SSMs generally have the right quality. Obviously,
00:12:29.680 | there's a tradeoff between compression and keeping all of the information around. But actually, like,
00:12:34.480 | compression can be helpful. So if you imagine the security camera example, if you're watching 24
00:12:39.280 | hours of footage, actually compressing all of that information on the fly would help you solve tasks
00:12:43.600 | and answer questions better rather than looking at all 24 hours every time. So I think that's sort
00:12:48.720 | of the rule of thumb to think about, which is: compression is super helpful for a large context,
00:12:52.400 | not as helpful for short context. And so we see that quality actually is very good for long context
00:12:58.560 | problems and multimodal problems. Let me talk quickly about some of the work we've been doing.
00:13:03.120 | So we've been starting to work on sort of multimodal data. And we did a release a few weeks ago
00:13:07.600 | for a voice generation model. So this is sort of text-to-speech and sort of in line with some of
00:13:12.320 | the work we're doing to bring more multimodal data into a single model and use SSMs to power the
00:13:19.280 | inference and the training and so on. So this is a model you can actually play with. I'll try to show
00:13:23.600 | you a demo. But one of the things we're proudest of with this model is that we really shrunk the
00:13:28.320 | latency down. So when you play with it on the playground, you get instant voice back generated from
00:13:32.720 | the data center. And there's some cool work we're doing to actually run these models on Macs
00:13:37.680 | and other devices so that you can basically have the same experience as you have in the data center,
00:13:41.680 | but on any device. And do that efficiently and at low power. How much time do I have?
00:13:46.000 | Okay. We're out of time, but I was also almost done. So go to the website, play.cartesia.ai. I
00:13:54.000 | unfortunately couldn't walk through the demo, but play with it and send us feedback. This is my email,
00:13:59.120 | in case you want to send me a note. I would love to hear feedback and anything that you folks find
00:14:04.080 | interesting. Thank you.
00:14:16.800 | Thank you.