State Space Models for Realtime Multimodal Intelligence: Karan Goel

Maybe to set the stage a little bit: the last four or five years of AI have been focused on batch intelligence, which is core to the idea of building an AI system that can reason about a problem for a long time and then solve it. Think hard math or physics problems. But there are a lot of applications where what you actually need are systems that are streaming: they're real time, they respond instantly. Imagine generating video or audio, or running understanding applications on sensor streams. So the space bifurcates into two types of applications, much like the general split between batch workloads and streaming workloads. A lot of what we've seen over the last few years has been batch APIs: you call a model in the cloud, it takes a few seconds, and you get a pretty good response back. Now we're seeing a shift toward real-time applications, where you're constantly querying a model, asking it to return responses at low latency, and using that to interpret or generate information.
I think this area is really exciting because it's going to be transformative for a lot of interesting applications that so far haven't been the main focus of what we've seen over the last few years. Conversational voice is one example: you should be able to talk to a system, it should understand you, and it should do all kinds of tasks on your behalf. Similarly, assistants that live on device and run efficiently at low power, at all times, whether you're on a phone or a laptop. Then there's world generation, where you can imagine playing a game that is generated in real time, the same way graphics are rendered on GPUs today. All of this should happen in real time, at low power, on your phone, on your MacBook, and so on. Robotics is another great example, where all of these come together on a single device that's trying to interpret everything in the world. So that's the exciting intersection for us: how do we make intelligence faster and cheaper so that we can put it everywhere?
A couple of examples of why this is powerful. Real-time intelligence for conversational interfaces is going to be really interesting because you'd be able to have an agent that can provide customer support for a problem, answer questions about health insurance, or call your vendor to arrange a shipment pickup. All these coordination tasks that are generally annoying to do should be automated, with real-time intelligent agents doing them, so humans can spend their time on harder, more interesting problems. In customer support, that could mean dealing with the tail of customers who matter much more, either because they're upset or because they have higher customer value. Similarly, in robotics, the idea is to ingest audio, video, and sensor data the way humans do, and respond instantly to all of that information. That's the world we should be living in: all of these intelligent models running super fast, solving all these different problems, and powering new experiences that are interactive at their core.
So this is where we come in: we're building real-time foundation models. Some of what I'll talk about is the work we've done on new ideas for how you can create deep learning models. I did my PhD before this, working with a lot of these folks; Chris was our PhD advisor. We were focused on the idea that you should be able to have a model that compresses information as it comes in, and use that to build powerful systems that are streaming at their core. I'll talk a little about this, but that's really the technology we've been working on for the last four or five years. We developed it in academia, and some of you might have heard of things like Mamba, which is a more recent iteration of this line of work. I did my PhD working on some of the early iterations that nobody uses anymore but that are the precursors to a lot of the modern stuff now in wider use. What we're doing at Cartesia is taking this and trying to understand how we can improve it and push the boundaries of what architectures can do.
It's an interesting question, because we shouldn't settle for one way of doing things; I think that's a poor way to think about the future. So our approach is: let's think about new ways of designing models that aren't necessarily built on the transformer architecture and the standard deep learning recipe that's prevalent today. It boils down to the fact that efficiently modeling long context is a huge problem, because a lot of practical data is really long-sequence data. Text is maybe the least interesting long-sequence data, because text is already fairly compressed: a lot of information is embedded in two sentences of text. But in all these other domains, audio, video, and so on, there's just so much information. Imagine watching a security camera for a day. You'd have an enormous amount of information coming into the system, and very little of it would be useful.
Compression is really fundamental to intelligence, because that's what we do: we look at all this information and compress it down to whatever's necessary to remember or understand. So far, the AI systems we've built haven't exhibited that behavior. They're built not on the principle of compression but on the idea of retrieval: keep all the context around and use it to reason over everything you've seen. Our point of view is that multimodal AI will remain challenging as long as you're working in that paradigm.
Because if you think about what humans do in a year, you're processing and understanding roughly a billion text tokens and 10 billion audio tokens (these are back-of-the-envelope calculations I did), and about a trillion video tokens, which probably underestimates how much video we process, and that's not counting all the other sensory information coming in. You're doing it all simultaneously, on a computer that fits in your brain. Sometimes you don't eat or drink and you still function fine, so the system runs on variable amounts of power. The idea that intelligence is solved is very far from the truth, because humans are an extraordinary machine doing something remarkable in a very compressed way that our AI models can't.
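As a rough illustration of that back-of-the-envelope, here is a small Python sketch. The constants (waking hours, audio codec rate, video frame tokenization) are my own illustrative assumptions rather than figures from the talk, but they land near the same order of magnitude.

```python
# Back-of-envelope sketch of how many multimodal tokens a person "processes"
# in a year of waking hours. All constants here are illustrative assumptions
# (waking hours, codec rates, patch counts), not figures from the talk; the
# point is the order of magnitude, which lands near the ~10B audio / ~1T
# video tokens mentioned above.

SECONDS_PER_YEAR = 16 * 3600 * 365        # ~16 waking hours/day, about 2.1e7 seconds

# Audio: assume a neural codec emitting ~600 discrete tokens per second
# (e.g., several codebooks at ~75 Hz each).
AUDIO_TOKENS_PER_SEC = 600
audio_tokens = SECONDS_PER_YEAR * AUDIO_TOKENS_PER_SEC

# Video: assume ~24 frames/sec, each frame patchified into ~2,000 visual tokens
# (roughly a mid-resolution ViT-style tokenization).
VIDEO_TOKENS_PER_SEC = 24 * 2000
video_tokens = SECONDS_PER_YEAR * VIDEO_TOKENS_PER_SEC

print(f"audio tokens/year ~ {audio_tokens:.1e}")   # ~1e10
print(f"video tokens/year ~ {video_tokens:.1e}")   # ~1e12
```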
That's really the reason we get up in the morning: we think about this and realize we're very far from where we should be. The best models today are in the 10 million to 100 million token range. That's really good, and a lot of progress has been made, but what we aspire to is building machines that are long-lived and can actually understand information over very long periods of time. The cool thing is that, as a human, you can remember things that happened 30 years ago with very little effort. You don't need to do RAG or retrieval or anything; you just remember it. It's gisted in your brain, and you figure it out. That's an extraordinary capability we should be able to put into our AI models as well.
Some of the big problems with models today: they're built on transformers and really optimized for the data center. We saw this in a lot of our work, which was on sub-quadratic models. Quadratic scaling in context length means the amount of computation you have to do to process long contexts is very large. Right now the predominant approach is to throw compute at the problem and hope it scales. Compute is obviously an important piece of the puzzle, because you do need more computation to do more difficult things. But because of the quadratic scaling, this approach scales poorly with very large multimodal context. Text contexts tend to be shorter; multimodal contexts will get larger because there are just way more tokens and more information going into the system. That's going to be a big challenge for these models, especially doing inference efficiently so you're not burning down data centers to do a fairly limited amount of inference. You have to imagine we'll be doing a thousand or a hundred thousand times more inference, and if the models scale the same way, it's going to be really, really expensive, and you won't be able to permeate all the applications I talked about very easily. That's a big challenge. So, again, our hypothesis is that you need new architectures, and that's where we spend our time: we want to make these models more efficient, faster, and more capable while handling all these long-context problems. There's a slide here about transformers being somewhat inefficient at this, but obviously a very good recipe for scaling models out.
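To make the quadratic-versus-linear point concrete, here is a toy cost model in Python. The formulas are deliberately simplified (no attention heads, projections, or MLP terms, and the dimensions are made up for illustration); they only show how per-token decode cost and memory grow with context length for the two families of models.

```python
# Toy cost model (a simplification, not any particular production system):
# per-token decode cost and memory for a transformer that keeps a KV cache
# vs. a recurrent/SSM-style model that keeps a fixed-size state.

def transformer_decode_step(context_len: int, d_model: int = 4096) -> tuple[float, float]:
    """Attention over the whole cache: work and memory grow with context."""
    flops = 2 * context_len * d_model          # attend to every cached token
    memory = 2 * context_len * d_model         # keys + values kept around
    return flops, memory

def ssm_decode_step(context_len: int, d_state: int = 4096) -> tuple[float, float]:
    """Fixed-size recurrent state: work and memory are independent of context."""
    flops = 2 * d_state                        # one state update, regardless of history
    memory = d_state                           # the compressed state itself
    return flops, memory

for n in (1_000, 100_000, 10_000_000):
    t_flops, _ = transformer_decode_step(n)
    s_flops, _ = ssm_decode_step(n)
    print(f"context={n:>10,}  attention flops/token={t_flops:.1e}  ssm flops/token={s_flops:.1e}")
```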
Some of the work we've been doing is on fundamentally efficient new architectures that have compression at their core. I'll have a slide with a quick illustration of how they operate, but they scale roughly linearly in context length. Because of that, you can have lower-power implementations of these models: you compress information as it comes into the system, you have low memory usage, and you can scale to much more massive contexts.
This is all the work around SSMs. I threw in a nice slide I thought was cool: Jensen had an interesting quote about SSMs in one of his Wired articles that I like to keep bringing up. It's a cool technology with a lot of potential, and that's where we're spending a lot of our time. If you're interested in reading more, there are lots of videos on YouTube and plenty of resources that make it more accessible and get into the details. The working intuition is this: transformers get their quadratic cost from attending to every past token of information. As tokens come into the system, you keep them around and look back at all of them. If you want to generate the word "jumped" from "the quick brown fox", you look at the entire context, figure out what the next word should be, generate it, push it into the context, and do it again. With SSMs, you just have a streaming system: a token streams in, it updates an internal memory for the model, and then the token gets thrown away. That really simplifies the system, and it's why SSMs are such a natural streaming interface: you're not keeping all this memory around about what happened in the past. You're compressing it into something like a zipped-file state inside the model that gets used for future generation. It takes advantage of the idea of recurrence, which is core to how even humans do a lot of their reasoning.
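Here is a minimal sketch of that streaming intuition, using a plain linear recurrence (h_t = A h_{t-1} + B x_t, y_t = C h_t). This is not Cartesia's actual architecture, and real SSMs such as S4 or Mamba use carefully structured, often input-dependent parameters, but the interface is the point: each token folds into a fixed-size state and is then discarded.

```python
import numpy as np

# Minimal sketch of the streaming intuition described above: each incoming
# token updates a fixed-size state and is then thrown away. A plain linear
# recurrence stands in for the real model here, purely for illustration.

d_model, d_state = 8, 16
rng = np.random.default_rng(0)
A = 0.95 * np.eye(d_state)                     # decaying memory of the past
B = rng.normal(size=(d_state, d_model)) * 0.1
C = rng.normal(size=(d_model, d_state)) * 0.1

def stream(tokens):
    """Consume a token stream; only the state h persists between steps."""
    h = np.zeros(d_state)
    for x in tokens:                           # x: embedding of the current token
        h = A @ h + B @ x                      # fold the token into the state
        y = C @ h                              # output depends only on the state
        yield y                                # the token x is never stored

# Usage: memory stays O(d_state) no matter how long the stream is.
token_stream = (rng.normal(size=d_model) for _ in range(100_000))
for _ in stream(token_stream):
    pass
```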
Over the last few months, a lot of these models have been getting adopted, so it's great that a lot of folks are now excited about this alternate way of doing things, oriented around recurrence rather than retrieval. I think we'll see a lot more activity here, especially with multimodal data becoming more important. The efficiency challenges around multimodal data mean these models will have more of a role to play over the next three to five years, as we also do our work scaling them up and making them more interesting.
A lot of people ask me about quality. I only have a few minutes, so I'll go through the rest super fast, but SSMs generally have the right quality. Obviously there's a tradeoff between compression and keeping all of the information around, but compression can actually help. Take the security camera example: if you're watching 24 hours of footage, compressing all of that information on the fly helps you solve tasks and answer questions better than looking back over all 24 hours every time. So the rule of thumb is: compression is super helpful for long contexts, not as helpful for short contexts. And we see that quality is actually very good on long-context and multimodal problems. Let me talk quickly about some of the work we've been doing.
We've been starting to work on multimodal data, and a few weeks ago we did a release of a voice generation model. It's text-to-speech, in line with the work we're doing to bring more multimodal data into a single model and use SSMs to power inference and training. It's a model you can actually play with; I'll try to show you a demo. One of the things we're proudest of with this model is that we really shrunk the latency down: when you play with it on the playground, you get voice back instantly, generated from the data center. And there's some cool work we're doing to run these models on Macs and other devices, so you can have the same experience as in the data center, but on any device, efficiently and at low power. How much time do I have?
Okay, we're out of time, but I was almost done anyway. Go to the website, play.cartesia.ai. Unfortunately I couldn't walk through the demo, but play with it and send us feedback. This is my email, in case you want to send me a note. I'd love to hear feedback and anything you folks find.