4 Reasons AI in 2024 is On An Exponential: Data, Mamba, and More
00:00:00.000 |
I hope everyone watching had an excellent 2023 and is looking to get 2024 off to a rambunctious start. 00:00:09.520 |
This video, though, has been made to show that we are on the steep part of the exponential. 00:00:16.980 |
I'm going to give you 4 clear reasons why, though I could have easily given 8 or 16 depending 00:00:23.560 |
on how you categorise them and how much time you've got. 00:00:26.600 |
We will look at how data quality will change everything according to the famed authors 00:00:32.040 |
of Mamba and Mixtral, and then how models will start to think before answering, and 00:00:37.600 |
according to this fascinating new paper, just how much low hanging fruit there is out there 00:00:42.760 |
in AI that doesn't even require more compute. 00:00:46.040 |
And even setting all of that aside, we'll end with the explosion in multimodal progress. 00:00:53.640 |
That, by the way, will include listening to an AI version of my voice that you might find surprisingly convincing. 00:01:00.760 |
I'll also finish with some predictions for the year ahead. 00:01:04.080 |
I'm going to start with Mamba, but not yet with the new architecture that is causing shockwaves. 00:01:09.980 |
I'm going to start with one of the co-authors, Tri Dao, and despite all the buzz about his 00:01:15.120 |
new architecture, here's what he said about data quality. 00:01:18.720 |
All the architecture stuff is fun, making that hardware efficient is fun, but I think, 00:01:25.960 |
if you look at the scaling law curve, different model architectures would generally have the same slope. 00:01:34.760 |
It seems like the only thing that changes the slope is the data quality. 00:01:39.040 |
Yes, we're going to cover Mamba in a minute, but for language modeling, it does perform 00:01:44.200 |
better than the Transformer++, basically the best version of a transformer. 00:01:49.440 |
But according to this graph, with 5 or 10 times as much compute, you could replicate that performance with a transformer. 00:01:56.760 |
And all of this, remember, is for a small 3 billion parameter model. 00:02:00.160 |
We don't know what it will be like at bigger sizes. 00:02:02.720 |
So for language modeling, potentially a big breakthrough, but data quality still means everything. 00:02:08.280 |
We are not even close to maximizing the quality of data fed into our models. 00:02:13.520 |
This is Arthur Mensch, co-founder of Mistral, the creators of some of the most cutting-edge open-source models. 00:02:19.600 |
To increase that efficiency, you do need to work on coming up with very high-quality data, 00:02:24.840 |
filtering things, many new techniques that need to be invented still, but that's really the key. 00:02:32.160 |
Data is the one important thing, and the ability of the model to decide how much compute it 00:02:38.200 |
wants to allocate to a certain problem is definitely on the frontier as well. 00:02:42.760 |
So these are things that we're actively looking at. 00:02:45.840 |
I'll speak more about letting models think for longer and inference-time compute later in the video. 00:02:52.120 |
But if you still weren't convinced about the importance of data quality, here's Sébastien 00:02:57.800 |
Bubeck, who might be able to persuade you, in an interview for AI Insiders on Patreon. 00:03:04.600 |
That last slide that you did for your channel, amazing channel, where it was like 00:03:08.880 |
a thousand X increase in effective compute and parameters and data, it seems massive 00:03:14.760 |
to me, and I don't think enough people are appreciating that point. 00:03:19.800 |
Yeah, I think it's pretty massive, to be honest, you know, before working on this line of work, 00:03:25.560 |
I was thinking about improving the optimization algorithm. 00:03:28.640 |
I was thinking about improving the architecture, and I worked on this for a few years, and 00:03:32.320 |
we could get, you know, 2%, 3% improvement, like this small type of gain. 00:03:40.800 |
But then suddenly when we started to focus on the data and really trying to craft data 00:03:46.240 |
in a way that is more digestible by the LLM at training time, suddenly we saw this incredible jump. 00:03:55.560 |
So yes, I think it is massive, and it's really pointing to where the gold is, and the gold is in the data. 00:04:02.380 |
Now, at this point, I know many of you will be saying, we get it, Philip, data quality 00:04:05.880 |
is important, but tell us about the new architectures that might be processing that data. 00:04:13.060 |
Enter Mamba: a new architecture that has been generating a lot of buzz in AI circles, if not yet among the general public. 00:04:19.480 |
I've been hoping to speak to one of the only two authors on this paper. 00:04:24.420 |
In the meantime, I want to try to translate its significance to the lay person. 00:04:29.840 |
I'm going to try to convey what Mamba is and what it means in just two or three minutes. 00:04:35.720 |
Before we touch on the contender, let's talk about the king: transformers. 00:04:40.100 |
That's the architecture behind everything from DALL-E 3 to GPT-4. 00:04:44.580 |
And transformers are famous in part because they're great at paying attention. 00:04:49.400 |
In this diagram from the famous Illustrated Transformer, we see that as we encode or process 00:04:54.100 |
the word 'it', the transformer architecture pays attention to all of the previous words. 00:05:00.140 |
Some, of course, like 'animal', more than others, and it will do this for all of the words it's processing. 00:05:05.800 |
And the truth is that paying attention is great, but the kind of attention in transformers 00:05:10.580 |
where every element must attend to every other element becomes quadratically complex at big scales. 00:05:19.240 |
And whenever you hear that word quadratic, think square. 00:05:22.080 |
You may remember the word quadratic from school and the Y equals X squared graph. 00:05:27.680 |
And hopefully that makes sense because as you double the number of elements, you far 00:05:31.620 |
more than double the number of pairwise connections. 00:05:35.000 |
In a very rough analogy, imagine everyone shaking hands in a room. 00:05:39.360 |
And if you triple the number of people in that large room, you are roughly 9x-ing, 3 squared, the number of handshakes. 00:05:47.520 |
But now imagine sequences of 1 million interconnected tokens. 00:05:55.000 |
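To make that growth concrete, here is a minimal Python sketch (purely illustrative, mine rather than anything from a paper) counting the pairwise attention scores a full self-attention layer has to compute:

```python
# Full attention scores every pair of tokens, so the work grows with the
# square of the sequence length.

def attention_scores(n_tokens: int) -> int:
    """Each of n tokens attends to all n tokens: n * n pairwise scores."""
    return n_tokens * n_tokens

for n in [1_000, 3_000, 1_000_000]:
    print(f"{n:>9,} tokens -> {attention_scores(n):>18,} scores per layer")

# 1,000 tokens -> 1,000,000 scores; tripling to 3,000 gives 9,000,000 (the 9x
# of the handshake analogy); 1,000,000 tokens needs on the order of 10^12.
```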
But there are other known ways to process a sequence of inputs. 00:05:58.960 |
How about a state of fixed size being updated by inputs step by step? 00:06:05.320 |
Although it's a lot less parallelizable, that is a hard word to say. 00:06:09.260 |
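As a minimal sketch of that alternative (toy code of my own, not the actual Mamba recurrence), here is a fixed-size state updated once per token, so the cost grows linearly with sequence length:

```python
import numpy as np

state_size = 16
A = 0.9 * np.eye(state_size)       # how the state decays/evolves at each step
B = np.random.randn(state_size)    # how each input is written into the state

h = np.zeros(state_size)           # the fixed-size hidden state
for x in np.random.randn(100_000): # one cheap update per token, in order
    h = A @ h + B * x              # the state never grows with the sequence

print(h.shape)  # (16,) regardless of how many tokens were processed
```

And because each update depends on the previous one, the loop runs strictly in order, which is exactly the parallelization problem just mentioned.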
Now there have been attempts for quite a while to get the best of both. 00:06:13.160 |
Here's Albert Gu, one of the lead authors of Mamba, and his paper from 2021 was called 00:06:18.680 |
Structured State Spaces for Sequence Modeling. 00:06:21.440 |
Indeed, that four-S name had so much S-sound sibilance that it inspired the Mamba snake name. 00:06:29.840 |
But now let's get to the Mamba paper itself and the key diagram therein. 00:06:34.920 |
There's that hidden state on the left, merrily processing its way across to the right. 00:06:40.740 |
Updated in turn by the inputs coming in from here. 00:06:44.040 |
This long-lasting state needs to be a rich but compressed expression of all of the data it has seen so far. 00:06:51.760 |
But if this approach can be made to work, it would mean far fewer connections to compute. 00:06:58.560 |
Indeed, the paper claims 5x faster inference on NVIDIA's A100. 00:07:03.160 |
And without that quadratic complexity, it would mean that even extremely long sequences 00:07:07.960 |
- think vast code bases, DNA sequences, and even the video input from a long YouTube explainer 00:07:15.400 |
- could now be handled without a mental breakdown. 00:07:19.000 |
But remember, that state needs to compress all the data it's seen, including, therefore, plenty of irrelevant data. 00:07:25.800 |
And that is where the selection mechanism comes in. 00:07:29.080 |
It decides which inputs to ignore and which to concentrate on. 00:07:33.360 |
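As a cartoon of that idea (my own sketch; Mamba's actual selective state-space parameterization is more involved), imagine the state update itself being gated by the input, so an uninformative token barely touches the state while an important one overwrites part of it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

state_size = 16
W_gate = np.random.randn(state_size)  # hypothetical learned gating weights
B = np.random.randn(state_size)

h = np.zeros(state_size)
for x in np.random.randn(1_000):
    gate = sigmoid(W_gate * x)            # input-dependent, per-channel gate
    h = (1 - gate) * h + gate * (B * x)   # gate near 0: ignore x; near 1: absorb x
```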
The trouble is, this rich and expressive hidden state with the distilled selected inputs is 00:07:38.880 |
slow to process and computationally demanding. 00:07:42.300 |
So how did Tri Dao and Albert Gu keep this hidden state rich and expressive with the distilled wisdom of the selected inputs? 00:07:48.600 |
How did they expand that state without bringing everything to a standstill? 00:07:54.400 |
No, but seriously, what orange means in the diagram is that it's processed in the GPU's SRAM. 00:08:00.680 |
Think of that as the super fast part of the GPU's memory with the shortest commute to the processing cores. 00:08:07.280 |
In contrast, all the model parameters in green, which won't change, and the inputs can be kept in the GPU's larger but slower main memory, the HBM. 00:08:15.160 |
All of this is where we get the term hardware aware state expansion. 00:08:19.600 |
It's an architecture built with an awareness of the kind of GPUs it's going to run on. 00:08:25.080 |
Let's try to make this more tangible with an example of what we can achieve with all of this. 00:08:32.880 |
First, take the word 'explained', which is made up of two tokens, 'expl' and 'ained'. 00:08:38.200 |
An induction head is a circuit that might be attending to the token 'expl', and its function 00:08:43.720 |
is to scan the existing sequence for previous examples of that token that it's attending to. 00:08:49.600 |
Then it needs to find the token that, in that survey, came after it, which in our 00:08:54.400 |
case will be 'ained', and then forecast that that will happen once more to give us 'explained'. 00:08:59.480 |
Obviously, you need to be great at recall to do this, especially if the sequence involves 00:09:04.240 |
thousands, hundreds of thousands, or even millions of tokens. 00:09:09.160 |
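Rendered as literal code (an illustration of mine; a real induction head is a learned attention circuit, not a lookup), the behaviour is roughly:

```python
def induction_head_predict(tokens: list[str]) -> str | None:
    """Find the most recent earlier occurrence of the current token and
    forecast a repeat of whatever followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, 0, -1):  # scan backwards over the past
        if tokens[i - 1] == current:
            return tokens[i]                 # the token that came after it
    return None

sequence = ["expl", "ained", "means", "made", "clear", ",", "so", "expl"]
print(induction_head_predict(sequence))  # -> 'ained', completing 'explained'
```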
Other architectures fall apart as the sequence length gets longer than that found in training, but not Mamba, 00:09:17.280 |
its accuracy staying at one, even up to a million tokens. 00:09:21.960 |
And remember that architectures like Mamba need not necessarily hunt alone. 00:09:26.360 |
Take this announcement of striped hyena from Together AI. 00:09:29.920 |
They showed that we need not necessarily choose, with a hybrid architecture involving both attention and state-space-style layers. 00:09:36.640 |
But there's one more thing before we put the Mamba snake back into the basket. 00:09:41.360 |
On the left, you can see its superior performance at great apes DNA classification. 00:09:46.440 |
In other words, deciding whether a sequence of DNA was human, chimpanzee, gorilla, orangutan, or bonobo. 00:09:52.560 |
Notice that it's on the longest sequence lengths that its performance starts to really shine. 00:09:57.800 |
This task, by the way, they made artificially hard for themselves because it was originally 00:10:01.360 |
about distinguishing between a human, lemur, mouse, pig, and hippo. 00:10:05.480 |
But you could easily imagine other use cases like healthcare. 00:10:08.880 |
You could rapidly analyze a patient's genetic data and come up with personalized medical recommendations. 00:10:14.840 |
More speculatively, you could imagine a chatbot that remembers a conversation you had with it long ago. 00:10:21.120 |
And then of course, I'll leave it to you to think of all the other long sequences out 00:10:25.520 |
there like stock market data or weather data. 00:10:28.980 |
And as I mentioned before, video data from those long video explainers that annoy everyone. 00:10:34.720 |
But wait, is this video becoming one of those? 00:10:38.320 |
So I'm going to move on to the next reason that AI is not slowing down anytime soon. 00:10:44.000 |
That is inference-time compute, or the ability of the model, as Arthur Mensch said, to decide 00:10:49.640 |
how much compute to allocate to certain problems. 00:10:52.920 |
Now I touched on this in my Q* video, but here's Łukasz Kaiser, one of the authors 00:10:58.080 |
of the transformer architecture and indeed a senior figure at OpenAI. 00:11:03.120 |
For reasoning, you also need these chains of thought. 00:11:07.200 |
You need to give the model the ability to think longer than it has layers. 00:11:14.720 |
But it can be combined with multimodal, especially when you have -- nowadays, models, they can 00:11:20.920 |
generate, you say, "Okay, how does it look when a boy kicks a ball?" 00:11:25.240 |
And you can generate a few frames of the video. 00:11:27.520 |
And now more and more, there are models that will generate the whole video. 00:11:32.200 |
And then there start to be models of the world that say, "Well, if you drive a car like this, this is what will happen." 00:11:41.840 |
So in the future, the models will have this knowledge of the world and this generation ability. 00:11:49.360 |
But multimodality, this means it's just a chain of frames of what's going to happen 00:11:54.080 |
to the world, which is basically how we sometimes think. 00:11:57.440 |
So it will be multimodality and this ability to generate sequences of things before you 00:12:03.000 |
give an answer that will resemble much more what we call reasoning. 00:12:08.760 |
For more on that, check out my Q* video, but here's another taster. 00:12:15.600 |
He admitted that letting models think for longer would occasionally have drawbacks. 00:12:20.560 |
Inference at times may be 1,000x slower and more costly. 00:12:24.440 |
And he said, "What inference cost would we pay for a new cancer drug, or for a proof of the Riemann hypothesis?" 00:12:31.240 |
AlphaCode 2, which I've also done a video on, gives us a foretaste of the kind of results this approach can deliver. 00:12:38.120 |
And remember, this is separate from data quality or dynamic new architectures. 00:12:42.000 |
But we can't mention AlphaCode 2 and inference-time compute without also mentioning Let's 00:12:47.920 |
Verify Step-by-Step, aka process-based verification. 00:12:51.360 |
But I know what you're wondering, what is this graph and where does it come from? 00:12:54.360 |
Well, it comes from a fascinating new paper entitled "AI Capabilities Can Be Significantly Improved Without Expensive Retraining." 00:13:02.040 |
In a way, this paper sums up the message of this video. 00:13:05.460 |
We are not even close to being done with the exponential gains in AI. 00:13:10.400 |
And the way that this paper measured it was fascinating. 00:13:13.560 |
It basically asked how much extra computing power we would have to provide to get the same performance boost that each enhancement delivers. 00:13:22.400 |
As you can see, the methods are quite diverse, and almost all of them have been covered before on this channel. 00:13:28.680 |
Things like prompting and scaffolding, a bit like SmartGPT, tool use, or indeed data quality. 00:13:37.480 |
If you're not familiar with either Orca or self-consistency and majority voting, check 00:13:41.920 |
out the videos that I've done before linked in the description. 00:13:45.240 |
The x-axis, by the way, is the one-time compute cost that these methods entail. 00:13:50.300 |
Yes, some of them go up to 1% or even 10% as a fraction of the training compute used 00:13:56.680 |
to create the models, but look at the returns on the y-axis. 00:14:01.360 |
Having a verifier check the steps of a model does cost 0.001% as a fraction of the training 00:14:09.860 |
compute, but the compute equivalent gain is around 9. 00:14:13.520 |
In other words, you would have had to use 9 times as much compute on the base model to match that performance. 00:14:19.880 |
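As a back-of-the-envelope rendering of that trade-off (my arithmetic, just plugging in the figures quoted above):

```python
# Compute equivalent gain (CEG): how much extra training compute would have
# bought the same improvement that the enhancement delivers.
training_compute = 1.0                      # normalize the training run to 1

verifier_cost = 0.00001 * training_compute  # one-time cost: 0.001% of training
ceg = 9                                     # the gain quoted for verification

print(f"enhancement cost: {verifier_cost:.3%} of the training run")
print(f"equivalent scale-up: {ceg}x the training compute")
# Paying one part in 100,000 of the training bill buys what a roughly 9x
# bigger training run would have.
```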
And yes, many of these methods can indeed be combined. 00:14:25.600 |
One of the standout entries combined chain of thought, few-shotting, and majority voting. 00:14:29.640 |
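For the mechanics of that last technique, here is a minimal sketch of self-consistency, or majority voting, assuming a hypothetical sample_answer(question) that runs one chain-of-thought completion and returns only its final answer:

```python
from collections import Counter

def majority_vote(question: str, sample_answer, n_samples: int = 16) -> str:
    """Sample several independent reasoning chains and keep the modal answer."""
    # sample_answer: hypothetical caller-supplied function, one LLM call each
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# n_samples trades extra inference compute for reliability on reasoning tasks.
```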
In 2024, we may see the Phi-2 data quality approach combined with, say, the Mamba architecture. 00:14:35.760 |
And all of this is before we scale the models up as the paper says, "Gains from enhancements 00:14:41.640 |
and gains from scaling might interact and compound in non-trivial ways." 00:14:46.840 |
And they give the example that chain of thought prompting didn't even work on smaller models. 00:14:51.160 |
I remember saying toward the end of my SmartGPT video in August that we need to find out 00:14:56.920 |
the ceiling of the models, not just the floor. 00:15:00.040 |
And the authors concur saying researchers could also study whether there is a ceiling 00:15:04.320 |
to the total improvement you can get from post-training enhancements. 00:15:08.300 |
That will enable the AI labs to better predict how much more capable their model might become. 00:15:14.920 |
Now, I know what some veteran researchers will be thinking while looking at these charts. 00:15:19.800 |
Surely it depends on the task and the benchmark and the models, and all of these numbers must be taken with a pinch of salt. 00:15:28.400 |
But I have one more critique in addition to that. 00:15:31.120 |
And that is the third, or is it fourth (to be honest, I've lost count), reason why AI 00:15:35.880 |
is going to continue to improve dramatically. 00:15:40.720 |
I spoke to Tim Rocktäschel of Google DeepMind about this for Patreon. 00:15:47.380 |
Language models can optimize their own prompts. 00:15:49.880 |
There are many techniques for doing this, but the manual methods we're coming up with, 00:15:54.720 |
the heuristics like "this is important for my career" and "my grandma wants me to do this", are crude by comparison. 00:16:01.920 |
Once we deploy LLMs to help us optimize the prompts going into LLMs, we might see dramatically better performance. 00:16:09.480 |
Indeed, we already have for things like high school mathematics and movie recommendations. 00:16:19.520 |
Even if we weren't going for dramatically better data quality, dynamic new architectures, 00:16:24.160 |
and getting models to reason and think for longer, prompt optimization would allow 00:16:29.680 |
us to squeeze out significantly better results, even from existing models. 00:16:34.760 |
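As a minimal sketch of what such a loop can look like (a toy of mine, loosely in the spirit of published optimize-by-prompting methods; llm and score_on_dev_set are hypothetical stand-ins):

```python
def optimize_prompt(llm, score_on_dev_set, seed_prompt: str, rounds: int = 10) -> str:
    """Let a model propose better prompts, keeping whichever scores best."""
    # llm, score_on_dev_set: hypothetical stand-ins supplied by the caller
    best_prompt, best_score = seed_prompt, score_on_dev_set(seed_prompt)
    for _ in range(rounds):
        # Show the model the current champion and its score, and ask it
        # to write an improved instruction.
        candidate = llm(
            f"This prompt scores {best_score:.2f} on our task:\n"
            f"{best_prompt}\n"
            "Write a better prompt. Reply with the prompt only."
        )
        score = score_on_dev_set(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt
```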
And notice I've got through the whole video without yet mentioning, of course, scaling 00:16:38.520 |
the models up to 10 trillion parameters, or indeed a hundred trillion parameters, as some have promised. 00:16:45.160 |
I've got much more information about them coming soon, but let's say you're someone 00:16:49.640 |
who doesn't care about stats, benchmarks, or indeed AGI countdowns. 00:16:54.520 |
Maybe you're someone who is skeptical about tweets like this from an OpenAI employee. 00:17:02.120 |
Well, even for you, 2024 looks set to be a dramatic year. 00:17:06.880 |
These outputs from the W.A.L.T team at Google may be low resolution, but they're quite something. 00:17:13.200 |
I use Pika Labs and Runway's Gen-2, and the progress, season on season, is quite something to watch. 00:17:20.240 |
And that would be a prediction I'd make for before the end of 2024. 00:17:24.400 |
I think we will get a 3-5 second photorealistic text to video output that could fool most 00:17:31.280 |
Now you might say, "Oh, I'd never be fooled," but let's test you on that with text-to-image. 00:17:39.280 |
One of them is from Midjourney version 6 upscaled with Magnific and the other is a real arch. 00:17:45.520 |
The answer is that the one on the left is the real Roman arch. 00:17:50.720 |
Now in my Christmas video, while misspelling some prompts, I did show off the '--style raw' parameter. 00:17:57.080 |
This is for anyone using Midjourney version 6. 00:17:59.560 |
But since then, from Reddit, I found an even better tip. 00:18:05.200 |
You can get, as you can see, strikingly realistic images. 00:18:08.480 |
And of course, you can further upscale any of these too. 00:18:11.600 |
You can definitely get some quite interesting results this way. 00:18:14.680 |
And at this point, I want to show you this quite fascinating prediction made a hundred years ago. 00:18:20.120 |
This is what the cartoonist Harold Tucker Webster thought the world of drawing would look like in 2023. 00:18:27.520 |
Now yes, we're a day into 2024, but I still think this is interesting. 00:18:31.960 |
Notice the drawing is being done automatically. 00:18:34.800 |
And the caption is, "In the year 2023, when all our work is done by electricity." 00:18:42.040 |
We call it Midjourney version 6, but tomato, tomato. 00:18:45.400 |
It's time now to draw the video to an end, but I'm going to end where I started. 00:18:49.880 |
And no, I don't just mean this tweet from Jim Fan. 00:18:52.840 |
I also mean the words I used in the introduction. 00:18:56.560 |
And here, via ElevenLabs, is "me": AI Explained Philip, who is a real person and 00:19:03.360 |
not GPT-5 or 6 as some assume. Here's the AI version of my voice delivering the introduction again. 00:19:10.280 |
I hope everyone watching had an excellent 2023 and is looking to get 2024 off to a rambunctious start. 00:19:18.760 |
This video has been made to show that we are on the steep part of the exponential, and will be for some time. 00:19:23.820 |
Just for fun, for all of my legendary supporters on Patreon, I'm going to process this entire 00:19:29.280 |
video so you can hear how good AI is getting at imitating the human voice. 00:19:35.040 |
Soon it may be impossible to tell who's human and who's not using audio alone and 00:19:41.640 |
I know you guys know that I'm real, so thank you so much for watching all the way to the 00:19:46.680 |
end and, as always, have a wonderful day and a wonderful 2024.