Maybe to set the stage a little bit, the last four or five years of AI have basically been focused on this idea of batch intelligence: building an AI system that can reason for a long period of time on a problem and then solve it.
So you can think about math problems or, you know, physics problems that are hard. But there are a lot of applications where what you actually need are systems that are streaming. So they're real time. They work instantly. Imagine generating video or audio, or doing understanding applications on sensor streams, et cetera.
So it sort of bifurcates where there's these two different types of applications, similar to how there's, you know, generally this idea of having batch workloads and streaming workloads. And so a lot of what we've seen over the last few years has really been focused on batch APIs where you call a model in the cloud, it takes a few seconds, and then you get a pretty good response back.
And now we're seeing a shift towards more real-time applications where you're constantly querying a model, asking it to return responses at low latency, and then using that to interpret or generate information. And I think this area is really exciting because it's going to be transformative for a lot of interesting applications that so far haven't been the main focus of what we've seen over the last few years.
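To make the batch-versus-streaming distinction concrete, here's a minimal sketch in Python. The `Model` class and its methods are entirely hypothetical, not any particular API; the point is just the difference in interaction pattern: one blocking call that returns a finished answer versus a loop that consumes inputs and emits responses incrementally.

```python
# Hypothetical illustration of the two interaction patterns -- not a real API.

class Model:
    def generate(self, prompt: str) -> str:
        """Batch-style: think for a while, then return one finished answer."""
        return f"finished answer to {prompt!r}"

    def stream(self, chunks):
        """Streaming-style: yield an incremental response for each input chunk."""
        for chunk in chunks:
            yield f"low-latency response to {chunk!r}"

model = Model()

# Batch: fire one request, wait a few seconds, get a complete response back.
print(model.generate("a hard physics problem"))

# Streaming: inputs arrive continuously (audio frames, sensor readings, ...)
# and the model has to respond to each one almost instantly.
for update in model.stream(["frame-0", "frame-1", "frame-2"]):
    print(update)
```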
So conversational voice is an example of this where you should be able to interact with the system and then talk to it, and it should be able to understand you and do all kinds of tasks on your behalf. This is similar to having assistants that are on device and run kind of really efficiently at low power at, you know, all times, regardless of whether you're on a phone or a laptop.
And then things like world generation where, like, you can imagine actually playing a game that is generated in real time, similar to how the graphics are rendered on GPUs. And all of this, you know, should be able to happen in real time, on low power, on your phone, on your MacBook, et cetera.
Robotics is another great example where it sort of culminates with all of these coming together on a single device that is trying to kind of interpret everything in the world. And so I think this is sort of the exciting intersection, which is, like, how do we make intelligence faster and cheaper so that we can put it everywhere, basically.
To give a couple of examples of where this is really powerful: real-time intelligence for conversational interfaces is going to be really interesting, because you'd be able to have an agent that can provide customer support for a problem, answer questions about health insurance, or, you know, call your vendor to pick up a shipment.
All these coordination tasks that are generally annoying to do should really be automated, and real-time intelligent agents should be doing them. Then humans can spend their time solving harder, more interesting problems. In customer support, that could mean dealing with the tail customers who matter more, because they're pissed off or because they have, you know, more customer value, et cetera.
And similarly in robotics, there's this idea of ingesting, the way humans do, audio, video, and sensor data, and then responding instantly to all of these pieces of information. So I think this is the world we should be living in, where all of these intelligent models run super fast, they solve all these different problems, and they're able to really power these new experiences that are interactive at their core. So this is where we come in. We're building these real-time foundation models. So some of what I'll talk about is the work we've done in building new ideas for how you can create deep learning models.
So I did my PhD before this. I was working with a lot of these folks for my PhD. Chris was our PhD advisor. And we were really focused on this idea that you should be able to have a model that can compress information as it comes into the model and use that to really kind of build powerful systems that are streaming at their core.
And I'll talk a little bit about this, but that's really the technology that we've been working on for the last four or five years. We've been developing it in academia, and some of you might have heard of things like Mamba, which is sort of a more recent iteration of this technology.
You know, I did my PhD working on some of the early iterations that nobody uses anymore, but are sort of the precursors to a lot of the modern stuff that is now more widely used. And now what we're doing at Cartesia is basically taking this and trying to understand how we can improve it, how we push the boundaries on what architectures can do.
And I think it's an interesting question because, you know, we should not settle for having one way of doing things. I think that's sort of a poor way to kind of think about the future. So our approach is sort of like let's think about new ways of actually designing models that aren't necessarily built on, let's say, the transformer architecture and the standard recipe for deep learning that's, you know, prevalent today.
And I think it boils down to this: efficiently modeling long context is a huge problem because, you know, a lot of practical data is really long-sequence data. I think text is maybe the least interesting long-sequence data because text is actually fairly compressed already, right?
Like you have a lot of information embedded in two sentences of text. But there are all these other domains, you know, audio, video, et cetera, where there's so much information. Imagine looking at a security camera for a day: you would have just so much information coming into the system, and very little of it would be useful.
So compression is really fundamental to intelligence, because we're able to look at all this information and then compress it down to whatever's necessary to remember or understand. And I think so far, the AI systems we've built have not necessarily exhibited that same behavior.
So they're really kind of built not on the principles of compression, but more on this idea of retrieval, like keeping all the context around and then using it to reason over all the information that you've seen. So I think our kind of point of view is that multimodal AI will remain challenging as long as you're sort of working in this paradigm.
Because if you think about what humans do in a year, you're basically processing and understanding about a billion text tokens and 10 billion audio tokens. These are, you know, back-of-the-envelope calculations that I did. And about a trillion video tokens, which probably underestimates how much video we process, and that's not including all the other sensory information you're taking in.
And you're doing it simultaneously. And you're doing it on a computer that fits in your brain. And you sometimes don't eat or drink and you're still functioning fine, so the system can run on variable amounts of power. So I think the idea that intelligence is solved is very far from the truth, because humans are just extremely amazing machines that do something very extraordinary, in a very compressed way, that our AI models can't do.
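These are rough, back-of-the-envelope numbers, and the specific rates in the sketch below (text seen per day, audio codec token rates, video frame and patch counts) are assumptions I'm adding purely for illustration, chosen only so the orders of magnitude line up with the figures above.

```python
# Back-of-the-envelope sketch of the yearly token counts mentioned above.
# Every rate here is an assumption made for illustration; only the orders
# of magnitude matter.

WAKING_SECONDS_PER_YEAR = 16 * 3600 * 365  # ~16 waking hours a day

# Text: call it a few million tokens of text seen/read/heard per day.
text_tokens_per_year = 3_000_000 * 365                        # ~1e9

# Audio: a multi-codebook neural codec might emit a few hundred tokens/sec.
audio_tokens_per_year = 600 * WAKING_SECONDS_PER_YEAR         # ~1e10

# Video: ~30 fps, each frame patchified into a few thousand visual tokens.
video_tokens_per_year = 30 * 2_000 * WAKING_SECONDS_PER_YEAR  # ~1e12

for name, n in [("text", text_tokens_per_year),
                ("audio", audio_tokens_per_year),
                ("video", video_tokens_per_year)]:
    print(f"{name:>5}: ~{n:.1e} tokens/year")
```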
So I think that's sort of, you know, the reason we get up in the morning: we think about this and we're like, yeah, we're very far away from where we should be. And the best models today are in the, you know, 10 million to 100 million token range.
So that's really good. A lot of progress has been made. But really, this is what we aspire to: how do you build these machines that are long-lived and can actually understand information over very long periods of time? And I think the cool thing is, as a human, you can remember things that happened 30 years ago with very little effort.
You don't need to do RAG or retrieval or anything. You just, you know, remember it. It's gisted in your brain and then you figure it out, basically. So I think that's an extraordinary capability that we should be able to put into our AI models as well.
And so some of the big problems with models today are that, you know, they're built on transformers, really optimized for the data center. I think we see this with a lot of the work we did, which was on sub-quadratic models. Quadratic scaling in context length really just means that, you know, the amount of computation you have to do to process a long context is very large.
And so right now the predominant approach is to throw compute at that problem and hope that it scales. Obviously, compute is a very important piece of the puzzle, because you do need more computation to do more difficult things. But this type of approach, because of the quadratic scaling, actually scales poorly to, you know, very large multimodal contexts.
And text contexts tend to be shorter. Multimodal contexts will get larger because you have just way more tokens and information that's going into the system. So that's going to be a big challenge for these models, especially how do you do this inference efficiently so you're not, you know, burning down the data centers to, you know, do a fairly limited amount of inference.
Like, you have to imagine that we're doing a thousand times or, you know, a hundred thousand times more inference. And if the models scale the same way, it's going to be really, really expensive. So this kind of intelligence isn't going to be able to permeate all the applications I talked about very easily.
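To make the scaling argument concrete, here's a toy comparison (a sketch that ignores constants, model width, and real kernel details): the total work to process a length-n sequence when every token attends over all past tokens, versus when each token just updates a fixed-size state.

```python
# Toy scaling comparison: how the cost of processing a sequence grows with
# context length n for a model that attends over all past tokens versus one
# that keeps a fixed-size state. Constants and hardware details are ignored.

def attention_cost(n: int, d: int = 1) -> int:
    """Total pairwise-attention work over a length-n sequence: O(n^2 * d)."""
    return n * n * d

def recurrent_cost(n: int, state: int = 1) -> int:
    """Total work for a fixed-state streaming model: O(n * state)."""
    return n * state

for n in (10_000, 1_000_000, 100_000_000):  # text -> multimodal -> "a day of video"
    print(f"n={n:>11,}  attention~{attention_cost(n):.1e}  recurrent~{recurrent_cost(n):.1e}")
```

The same gap shows up in memory: a transformer's KV cache grows with the context, while a fixed-state streaming model uses the same memory no matter how long it has been running.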
And so, you know, that's sort of a big challenge, I would say. And so, you know, again, our hypothesis is you need new architectures and that's kind of where we spend our time and we want to make these models more efficient, faster, more capable while being able to handle all these long context problems.
This is a slide about, you know, transformers being somewhat inefficient at handling this, but obviously a very good recipe for scaling these models out. And so some of the work we've been doing is on fundamentally more efficient architectures that have compression at their core. The way they operate -- I'll have a slide on this just to give you a quick illustration.
But they really scale more linearly in context length. So because of this, you should be able to have lower-power implementations of these models: you can compress information as it comes into the system, you have low memory usage, and you can actually scale to much more massive contexts because of that.
And this is all the work around SSMs. I just threw in this nice slide, which I thought was cool. Jensen had an interesting quote about SSMs in one of his Wired articles that I like to keep talking about. But I think it's a cool technology that has a lot of potential, and that's where we're spending a lot of our time.
And if you folks are interested in reading more, there are lots of videos on YouTube and lots of resources that try to make this more accessible and get into some of the details. But, you know, the working intuition is basically this: transformers generate with quadratic cost by attending to every past token of information.
So as tokens come into the system, you're keeping them around and then looking at all the past tokens. So if you want to generate the word "jumped" from "the quick brown fox," you would actually look at the entire context, try to understand what the next word should be, and then generate it, push it into the context, and do it again.
With SSMs, you just have a streaming system. Tokens stream in, each one updates an internal memory for the model, and then the token gets thrown away. That actually really simplifies the system. And that's why it's such a natural streaming interface, because you're just not keeping all this memory around about what happened in the past.
You're compressing it into some sort of zipped-file state inside the model that's going to be used to do future generation. And so this is taking advantage of this idea of recurrence, which is sort of core to how even humans do a lot of their reasoning.
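Here's a deliberately simplified sketch of that recurrence, assuming fixed random matrices purely for illustration; it's not Cartesia's architecture or the full selective-SSM formulation (Mamba, for instance, makes the transition depend on the input), but it shows the key property: every incoming token folds into a fixed-size state and is then thrown away.

```python
import numpy as np

# Minimal linear-recurrence sketch of the SSM idea described above:
#   h_t = A @ h_{t-1} + B @ x_t,    y_t = C @ h_t
# A toy with random matrices, not a real trained model.

rng = np.random.default_rng(0)
d_state, d_in = 16, 8
A = rng.normal(scale=0.1, size=(d_state, d_state))  # state transition
B = rng.normal(size=(d_state, d_in))                 # input projection
C = rng.normal(size=(d_in, d_state))                 # output projection

h = np.zeros(d_state)                 # the whole "memory": fixed size
for t in range(1000):                 # stream as many tokens as you like
    x_t = rng.normal(size=d_in)       # next incoming token embedding
    h = A @ h + B @ x_t               # fold the token into the state...
    y_t = C @ h                       # ...emit an output from the state
    # x_t is now discarded; memory use never grows with sequence length.

print(h.shape)                        # (16,) regardless of stream length
```

Contrast that with the transformer loop described above, where every generated token gets appended to a context that every subsequent step has to attend over again.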
And, you know, over the last few months, a lot of these models have been getting adopted. So it's great that a lot of folks are now excited about this alternate way of doing things that is much more oriented around this idea of recurrence, rather than retrieval.
And so I think, like, we'll see a lot more activity in this, especially with multimodal data becoming more important. And, you know, a lot of the challenges of multimodal data around efficiency will mean that I think that these models will have more of a role to play in the next three to five years, as we also do our work in scaling them up and making them more interesting.
A lot of people ask me about quality. I only have a few minutes, so I'll go through the rest of the slide super fast. But, you know, SSMs generally have the right quality. Obviously, there's a tradeoff between compression and keeping all of the information around. But actually, like, compression can be helpful.
So if you imagine the security camera example, where you're watching 24 hours of footage, actually compressing all of that information on the fly would help you solve tasks and answer questions better than looking at all 24 hours every time. So I think that's the rule of thumb to think about: compression is super helpful for long context, not as helpful for short context.
And so we see that quality actually is very good for long context problems and multimodal problems. Let me talk quickly about some of the work we've been doing. So we've been starting to work on sort of multimodal data. And we did a release a few weeks ago for a voice generation model.
So this is sort of text-to-speech and sort of in line with some of the work we're doing to bring more multimodal data into a single model and use SSMs to power the inference and the training and so on. So this is a model you can actually play with. I'll try to show you a demo.
But one of the things we're proudest of with this model is that we really shrunk the latency down. So when you play with it on the playground, you get instant voice back, generated from the data center. And there's some cool work we're doing to actually run these models on Macs and other devices, so that you can basically have the same experience as you have in the data center, but on any device, and do that efficiently and at low power. How much time do I have? Okay. We're out of time, but I was also almost done. So go to the website, play.cartesia.ai.
I unfortunately couldn't walk through the demo, but play with it and send us feedback. This is my email, in case you want to send me a note. I would love to hear feedback and anything that you folks find interesting. Thank you.