
4 Reasons AI in 2024 is On An Exponential: Data, Mamba, and More


Transcript

I hope everyone watching had an excellent 2023 and is looking to get 2024 off to a rambunctious start. This video though has been made to show that we are on the steep part of the exponential and will be for a while yet. I'm going to give you 4 clear reasons why, though I could have easily given 8 or 16 depending on how you categorise them and how much time you've got.

We will look at how data quality will change everything according to the famed authors of Mamba and Mixtral, and then how models will start to think before answering, and, according to this fascinating new paper, just how much low-hanging fruit there is out there in AI that doesn't even require more compute.

And even setting all of that aside, we'll end with the explosion in multimodal progress that is occurring around us. That by the way will include listening to a version of my voice that you might find hard to distinguish from my real one. I'll also finish with some predictions for the year ahead.

But I'm going to start with Mamba, though not yet with the new architecture that is causing shockwaves. I'll cover that in a minute. I'm going to start with one of its co-authors, Tri Dao, and despite all the buzz about his new architecture, here's what he said about data quality: "All the architecture stuff is fun, making that hardware efficient is fun, but I think ultimately it's about data."

"If you look at the scaling law curve, different model architectures would generally have the same slope; they're just a different offset. It seems like the only thing that changes the slope is the data quality." Yes, we're going to cover Mamba in a minute, but for language modeling it does perform better than the Transformer++, basically the best version of a transformer.

But according to this graph, with 5 or 10 times as much compute, you could replicate the performance of Mamba with a transformer. And all of this, remember, is for a small 3 billion parameter model. We don't know what it will be like at bigger sizes. So for language modeling, potentially a big breakthrough, but data quality still means more.
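To unpack that "same slope, different offset" idea, here is a toy sketch with entirely made-up constants; none of these numbers come from the Mamba paper. On a scaling-law curve of the rough form loss = a * compute^(-b), swapping architectures mostly shifts the offset a, which a fixed compute multiplier can compensate for, whereas a change in the slope b, as better data might give, compounds as compute grows.

```python
# Toy illustration of "same slope, different offset" on a scaling-law curve
# (made-up constants, not fitted to any real model): loss = a * compute**(-b).
# Architecture roughly shifts the offset a; data quality can change the slope b.
def loss(compute: float, a: float, b: float) -> float:
    return a * compute ** (-b)

for c in (1e20, 1e22, 1e24):
    transformer = loss(c, a=12.0, b=0.05)   # higher offset, same slope
    mamba_like  = loss(c, a=10.0, b=0.05)   # lower offset, same slope
    better_data = loss(c, a=12.0, b=0.06)   # same offset, steeper slope
    print(f"compute {c:.0e}: transformer {transformer:.3f}, "
          f"mamba-like {mamba_like:.3f}, better data {better_data:.3f}")
```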

We are not even close to maximizing the quality of the data fed into our models. This is Arthur Mensch, co-founder of Mistral, the creators of some of the most cutting-edge open-source AI models: "To increase that efficiency, you do need to work on coming up with very high-quality data, filtering things; many new techniques still need to be invented, but that's really where the lock is."

"Data is the one important thing, and the ability of the model to decide how much compute it wants to allocate to a certain problem is definitely on the frontier as well. So these are things that we're actively looking at." I'll speak more about letting models think for longer and inference time compute later on in this video.

But if you still weren't convinced about the importance of data quality, here's Sébastien Bubeck, who might be able to persuade you, in an interview for AI Insiders on Patreon. That last slide that you did for your channel, amazing channel, where it was like a thousand-x increase in effective compute and parameters and data, it seems massive to me, and I don't think enough people are appreciating that point.

Yeah, I think it's pretty massive, to be honest. You know, before working on this line of work, I was thinking about improving the optimization algorithm, I was thinking about improving the architecture, and I worked on this for a few years, and we could get, you know, a 2%, 3% improvement, small gains around the edges.

It's nice, but it's tiny. But then suddenly, when we started to focus on the data and really trying to craft data in a way that is more digestible by the LLM at training time, suddenly we saw these incredible, you know, thousand-x gains. So yes, I think it is massive, and it's really pointing to where the gold is, and the gold is in the data.

Now, at this point, I know many of you will be saying, we get it, Philip, data quality is important, but tell us about the new architectures that might be processing that data. In other words, tell us about Mamba. That's a new architecture that has been generating a lot of buzz in AI circles, if not for the general public.

I've been hoping to speak to one of the only two authors of this paper. In the meantime, I want to try to translate its significance for the layperson. I'm going to try to convey what Mamba is and what it means in just two or three minutes. But before we touch on the contender, let's talk about the king: the transformer.

That's the architecture behind everything from DALL-E 3 to GPT-4. And transformers are famous in part because they're great at paying attention. In this diagram from the famous Illustrated Transformer, we see that as we encode or process the word "it", the transformer architecture pays attention to all of the previous words or tokens.

Some, of course, like "animal", more than others, and it will do this for all of the words it's going to encode. And the truth is that paying attention is great, but the kind of attention in transformers, where every element must attend to every other element, is complex at big scales.

In fact, it's called quadratic complexity. And whenever you hear that word quadratic, think square. You may remember the word quadratic from school and the Y equals X squared graph. And hopefully that makes sense because as you double the number of elements, you far more than double the number of pairwise connections.

In a very rough analogy, imagine everyone shaking hands in a room. If you triple the number of people in that large room, you multiply the number of handshakes by roughly 9, that is, 3 squared. But now imagine sequences of 1 million interconnected tokens. No one has that much attention to give.
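If it helps to see the arithmetic behind that analogy, here is a back-of-the-envelope sketch; the numbers are my own illustration, not a benchmark of any real model.

```python
# Rough sketch of the handshake analogy: full self-attention compares every
# token with every other token, so the pairwise connections grow with the
# square of the sequence length.
def attention_pairs(seq_len: int) -> int:
    return seq_len * seq_len  # every token attends to every token

for n in (1_000, 2_000, 3_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} pairwise connections")

# Doubling the tokens roughly 4x's the work; tripling roughly 9x's it, which is
# why a million-token sequence is so punishing for vanilla attention.
```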

But there are other known ways to process a sequence of inputs. How about a state of fixed size being updated by inputs step by step? It seems simpler, right? Although it's a lot less parallelizable, that is a hard word to say. Now there have been attempts for quite a while to get the best of both.
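Before we get to those attempts, here is a minimal sketch of that fixed-size-state idea. To be clear, this is a generic linear recurrence of my own, not Mamba's selective update; the point is only that the state stays the same size however long the sequence gets.

```python
import numpy as np

# Minimal sketch of a fixed-size state updated one input at a time. This is a
# plain linear recurrence, NOT Mamba's actual selective state-space update;
# it just shows that memory never grows with the sequence and the work per
# step stays constant.
state_size, input_size = 16, 8
A = np.random.randn(state_size, state_size) * 0.1  # how the old state carries over
B = np.random.randn(state_size, input_size) * 0.1  # how each new input is folded in

state = np.zeros(state_size)
for x_t in np.random.randn(1_000, input_size):     # a sequence of 1,000 inputs
    state = A @ state + B @ x_t                    # constant-size update per step

print(state.shape)  # still (16,), no matter how long the sequence was
```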

Here's Albert Gu, one of the lead authors of Mamba; his paper from 2021 was called Structured State Spaces for Sequence Modeling. Indeed, that four-S name had so much sibilance that it inspired the Mamba snake name. But now let's get to the Mamba paper itself and the key diagram therein.

There's that hidden state on the left, merrily processing its way across to the right, updated in turn by the inputs coming in. This long-lasting state needs to be a rich but compressed expression of all of the data seen so far. But if this approach can be made to work, it would mean far fewer connections to compute and therefore faster inference.

Indeed, the paper claims 5x faster inference on NVIDIA's A100. And without that quadratic complexity, it would mean that even extremely long sequences - think vast code bases, DNA sequences, and even the video input from a long YouTube explainer - could now be handled without a mental breakdown. But remember, that state needs to compress all the data it's seen, including therefore ignoring certain inputs.

And that is where the selection mechanism comes in. It decides which inputs to ignore and which to concentrate on. The trouble is, this rich and expressive hidden state, with its distilled, selected inputs, is slow to process and computationally demanding. So how did Tri Dao and Albert Gu keep this hidden state rich and expressive with the distilled wisdom of previous inputs?

How did they expand that state without bringing everything to a standstill? Well, by painting it orange. No, but seriously, what orange means in the diagram is that it's processed in the GPU SRAM. Think of that as the super fast part of the GPU's memory with the shortest commute to the processing chip.

In contrast, all the model parameters in green, which won't change, and the inputs can be handled by the slower high bandwidth memory. All of this is where we get the term hardware aware state expansion. It's an architecture built with an awareness of the kind of GPUs it's going to run on.

Let's try to make this more tangible with an example of what we can achieve with all of this freed-up complexity. Take this induction head comparison. First, take the word "explained", which is made up of two tokens, "expl" and "ained". An induction head is a circuit that might be attending to the token "expl", and its function is to scan the existing sequence for previous examples of the token it's attending to.

Then it needs to find the token that came after it in that earlier occurrence, which in our case will be "ained", and then forecast that the same thing will happen once more, to give us "explained". Obviously, you need to be great at recall to do this, especially if the sequence involves thousands, hundreds of thousands, or even millions of tokens.
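As a toy illustration of that recall pattern (my own simplified sketch, not what the circuit literally computes inside a network), here is the lookup an induction head effectively performs.

```python
# Toy sketch of the recall pattern just described (not a real induction head,
# just the lookup it implements): find the most recent earlier occurrence of
# the current token and predict whatever followed it last time.
def induction_predict(tokens: list[str]) -> str | None:
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards through the context
        if tokens[i] == current:
            return tokens[i + 1]               # forecast the same continuation again
    return None

sequence = ["the", "model", "expl", "ained", "the", "answer", "and", "then", "expl"]
print(induction_predict(sequence))  # -> "ained", completing "explained"
```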

Other architectures fall apart as the sequence length gets longer than that found in training, but not Mamba. Look at the top line: accuracy staying at one, even up to a million tokens. And remember that architectures like Mamba need not necessarily hunt alone. Take this announcement of StripedHyena from Together AI.

They showed that we don't necessarily have to choose, with a hybrid architecture involving attention performing well. But there's one more thing before we put the Mamba snake back into the basket. On the left, you can see its superior performance at great-ape DNA classification, in other words, deciding whether a sequence of DNA was human, chimpanzee, gorilla, orangutan, or bonobo.

Notice that it's on the longest sequence lengths that its performance starts to really shine. This task, by the way, they made artificially hard for themselves because it was originally about distinguishing between a human, lemur, mouse, pig, and hippo. But you could easily imagine other use cases like healthcare. You could rapidly analyze a patient's genetic data and come up with personalized medical treatment for them.

More speculatively, you could imagine a chat bot that remembers a conversation you had months or years ago. And then of course, I'll leave it to you to think of all the other long sequences out there like stock market data or weather data. And as I mentioned before, video data from those long video explainers that annoy everyone.

But wait, is this video becoming one of those? I really hope not. So I'm going to move on to the next reason that AI is not slowing down anytime soon. That is inference time compute, or the ability of the model, as Arthur Mensch said, to decide how much compute to allocate to certain problems.

Now I touched on this in my Q* video, but here's Łukasz Kaiser, one of the authors of the transformer architecture and indeed a senior figure at OpenAI: For reasoning, you also need these chains of thought. You need to give the model the ability to think longer than it has layers.

But it can be combined with multimodal, especially when you have -- nowadays, models, they can generate, you say, "Okay, how does it look when a boy kicks a ball?" And you can generate a few frames of the video. And now more and more, there are models that will generate the whole video.

And then there start to be models of the world that say, "Well, if you drive a car like this, how will the street look? How will people look? What will happen? Will you crash into something?" So in the future, the models will have this knowledge of the world and this generation, which we call chain of thought and text.

But with multimodality, this just means a chain of frames of what's going to happen in the world, which is basically how we sometimes think. So it will be multimodality and this ability to generate sequences of things before you give an answer that will resemble much more what we call reasoning.

For more on that, check out my Q* video, but here's another taster. This is from Noam Brown, also of OpenAI. He admitted that letting models think for longer would occasionally have drawbacks. Inference at times may be 1,000x slower and more costly. And he said, "What inference cost would we pay for a new cancer drug, or for a proof of the Riemann hypothesis?" AlphaCode2, which I've also done a video on, gives us a foretaste of the kind of results that we might expect.

And remember, this is separate from data quality or dynamic new architectures. But we can't mention AlphaCode2 and inference time compute without also mentioning Let's Verify Step-by-Step, aka process-based verification. But I know what you're wondering, what is this graph and where does it come from? Well, it comes from a fascinating new paper entitled "AI Capabilities Can Be Significantly Improved Without Expensive Retraining".

In a way, this paper sums up the message of this video. We are not even close to being done with the exponential gains in AI. And the way that this paper measured it was fascinating. It basically asked how much extra computing power would we have to provide to get the equivalent gain that these methods provide.

As you can see, the methods are quite diverse, and almost all of them have been covered before on this channel: things like prompting and scaffolding, a bit like SmartGPT, tool use, or indeed data quality, as we saw with Orca, and self-consistency. If you're not familiar with either Orca or self-consistency and majority voting, check out the videos that I've done before, linked in the description.

The x-axis, by the way, is the one-time compute cost that these methods entail. Yes, some of them go up to 1% or even 10% as a fraction of the training compute used to create the models, but look at the returns on the y-axis. Having a verifier check the steps of a model does cost 0.001% as a fraction of the training compute, but the compute equivalent gain is around 9.

In other words, you would have had to use 9 times as much compute on the base model to achieve similar results. And yes, many of these methods can indeed be combined; that's what SmartGPT was all about. It was combining chain of thought, few-shotting, and majority voting. In 2024, we may see the Phi-2 data quality approach combined with, say, the Mamba architecture.
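For anyone who hasn't seen those earlier videos, here is a minimal sketch of the majority-voting piece of that. The ask_model function is a hypothetical stand-in for whatever LLM API you would actually call, faked here so the snippet runs on its own.

```python
import random
from collections import Counter

# Minimal sketch of majority voting / self-consistency. `ask_model` is a
# hypothetical stand-in for a real LLM call that returns one sampled
# chain-of-thought answer per invocation.
def ask_model(prompt: str) -> str:
    return random.choice(["42", "42", "42", "41"])  # usually, not always, right

def majority_vote(prompt: str, samples: int = 9) -> str:
    answers = [ask_model(prompt) for _ in range(samples)]   # several independent attempts
    return Counter(answers).most_common(1)[0][0]            # keep the most common final answer

print(majority_vote("What is 6 * 7?"))  # almost always "42"
```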

And all of this is before we scale the models up. As the paper says, "Gains from enhancements and gains from scaling might interact and compound in non-trivial ways." And they give the example that chain of thought prompting didn't even work on smaller models. I remember saying toward the end of my SmartGPT video in August that we need to find out the ceiling of the models, not just the floor.

And the authors concur, saying researchers could also study whether there is a ceiling to the total improvement you can get from post-training enhancements. That will enable the AI labs to better predict how much more capable their model might become in the future. Now, I know what some veteran researchers will be thinking while looking at these charts.

Surely it depends on the task and the benchmark and the models, and all of these numbers must be very approximate. And of course, you are right. But I have one more critique in addition to that, and that is the third, or is it the fourth (to be honest, I've lost count), reason why AI is going to continue to improve dramatically.

And that is prompt optimization. I spoke to Tim Rocktäschel of Google DeepMind about this for Patreon. But this is what I mean in a nutshell: language models can optimize their own prompts. There are many techniques for doing this, but the manual methods we're coming up with, heuristics like "this is important for my career" and "my grandma wants me to do this", are feeble approaches.

Once we deploy LLMs to help us optimize the prompts going into LLMs, we might see dramatically better performance. Indeed, we already have for things like high school mathematics and movie recommendations. Anyway, this is the simplified version; do check out that video. But the point is this: even if we weren't going for dramatically better data quality, dynamic new architectures, and getting models to reason and think for longer, prompt optimization would allow us to squeeze out significantly better results, even from existing models.
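In a nutshell, the loop looks something like the sketch below. The propose_prompts and score_prompt helpers are my own placeholders rather than any published API; the idea is simply that one model writes candidate prompts while a score on a small evaluation set decides which ones survive, round after round.

```python
# Minimal sketch of LLM-driven prompt optimisation in the spirit described
# above (not any lab's actual implementation). `propose_prompts` asks an
# optimiser LLM for new candidate prompts given the best ones so far;
# `score_prompt` measures how well a task LLM does with that prompt on a
# small evaluation set. Both are hypothetical helpers you would supply.
def optimise_prompt(seed_prompt, propose_prompts, score_prompt, rounds=10, keep=5):
    scored = [(score_prompt(seed_prompt), seed_prompt)]
    for _ in range(rounds):
        best = sorted(scored, reverse=True)[:keep]       # show the optimiser the leaders so far
        for candidate in propose_prompts(best):          # optimiser LLM writes new prompts
            scored.append((score_prompt(candidate), candidate))
    return max(scored)[1]                                # best-scoring prompt found
```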

And notice I've got through the whole video without yet mentioning, of course, scaling the models up to 10 trillion parameters or indeed a hundred trillion parameters as promised by Etched.ai. I've got much more information about them coming soon, but let's say you're someone who doesn't care about stats, benchmarks, or indeed AGI countdowns.

Maybe you're someone who is skeptical about tweets like this from an OpenAI employee who said, "Brace yourselves, AGI is coming." Well, even for you, 2024 looks set to be a dramatic year. These outputs from the W.A.L.T team at Google may be low resolution, but they're quite high consistency. I use Pika Labs and Runway's Gen-2, and the progress season on season is quite something to watch.

And that would be a prediction I'd make for before the end of 2024. I think we will get a 3-5 second photorealistic text to video output that could fool most humans. Now you might say, "Oh, I'd never be fooled," but let's test you on that with text to image.

Which of these, then, is the real Roman arch? One of them is from Midjourney version 6, upscaled with Magnific, and the other is a real arch. The answer is that the one on the left is the real Roman arch. Now, in my Christmas video, while misspelling some prompts, I did show off the --style raw parameter.

This is for anyone using Midjourney version 6. But since then, from Reddit, I found an even better tip. Use the phrase phone photo. You can get, as you can see, strikingly realistic images. And of course, you can further upscale any of these too. You can definitely get some quite interesting results this way.

And at this point, I want to show you this quite fascinating prediction made a hundred years ago. This is what the cartoonist Harold Tucker Webster thought the world of drawing would be like in 2023. Now yes, we're a day into 2024, but I still think this is interesting. Notice the drawing is being done automatically.

And the caption is, "In the year 2023, when all our work is done by electricity." He called it the cartoon dynamo; we call it Midjourney version 6, but tomato, tomahto. It's time now to draw the video to an end, but I'm going to end where I started. And no, I don't just mean this tweet from Jim Fan.

I also mean the words I used in the introduction. And here, via ElevenLabs, is "me": AI Explained's Philip, who is a real person and not GPT-5 or 6 as some assume. Here's the AI version of my voice delivering the intro: I hope everyone watching had an excellent 2023 and is looking to get 2024 off to a rambunctious start.

This video has been made to show that we are on the steep part of the exponential and will be for a while yet. Just for fun, for all of my legendary supporters on Patreon, I'm going to process this entire video so you can hear how good AI is getting at imitating the human voice.

Soon it may be impossible to tell who's human and who's not using audio alone and then soon thereafter even video. I know you guys know that I'm real, so thank you so much for watching all the way to the end and, as always, have a wonderful day and a wonderful 2024.