
4 Reasons AI in 2024 is On An Exponential: Data, Mamba, and More



00:00:00.000 | I hope everyone watching had an excellent 2023 and is looking to get 2024 off to a rambunctious
00:00:08.520 | start.
00:00:09.520 | This video though has been made to show that we are on the steep part of the exponential
00:00:14.880 | and will be for a while yet.
00:00:16.980 | I'm going to give you 4 clear reasons why, though I could have easily given 8 or 16 depending
00:00:23.560 | on how you categorise them and how much time you've got.
00:00:26.600 | We will look at how data quality will change everything according to the famed authors
00:00:32.040 | of Mamba and Mixtral, and then how models will start to think before answering, and
00:00:37.600 | according to this fascinating new paper, just how much low hanging fruit there is out there
00:00:42.760 | in AI that doesn't even require more compute.
00:00:46.040 | And even setting all of that aside, we'll end with the explosion in multimodal progress
00:00:51.760 | that is occurring around us.
00:00:53.640 | That by the way will include listening to a version of my voice that you might find
00:00:58.200 | hard to distinguish from my real one.
00:01:00.760 | I'll also finish with some predictions for the year ahead.
00:01:04.080 | But I'm going to start with Mamba, but not yet the new architecture that is causing shockwaves.
00:01:08.760 | I'll cover that in a minute.
00:01:09.980 | I'm going to start with one of the co-authors, Tri Dao, and despite all the buzz about his
00:01:15.120 | new architecture, here's what he said about data quality.
00:01:18.720 | All the architecture stuff is fun, making that hardware efficient is fun, but I think
00:01:24.360 | ultimately it's about data.
00:01:25.960 | If you look at the scaling law curve, different model architectures would generally have the
00:01:32.760 | same slope, they're just at different offsets.
00:01:34.760 | Seems like the only thing that changes the slope is the data quality.
00:01:39.040 | Yes, we're going to cover Mamba in a minute, but for language modeling, it does perform
00:01:44.200 | better than the Transformer++, basically the best version of a transformer.
00:01:49.440 | But according to this graph, with 5 or 10 times as much compute, you could replicate
00:01:54.240 | the performance of Mamba with a transformer.
00:01:56.760 | And all of this, remember, is for a small 3 billion parameter model.
00:02:00.160 | We don't know what it will be like at bigger sizes.
00:02:02.720 | So for language modeling, potentially a big breakthrough, but data quality still means
00:02:07.280 | more.
00:02:08.280 | We are not even close to maximizing the quality of data fed into our models.
00:02:13.520 | This is Arthur Mensch, co-founder of Mistral, the creators of some of the most cutting-edge
00:02:17.880 | open source AI models.
00:02:19.600 | To increase that efficiency, you do need to work on coming up with very high-quality data,
00:02:24.840 | featuring things, many new techniques that need to be invented still, but that's really
00:02:29.920 | where the lock is.
00:02:32.160 | Data is the one important thing, and the ability of the model to decide how much compute it
00:02:38.200 | wants to allocate to a certain problem is definitely on the frontier as well.
00:02:42.760 | So these are things that we're actively looking at.
00:02:45.840 | I'll speak more about letting models think for longer and inference time compute later
00:02:51.120 | on in this video.
00:02:52.120 | But if you still weren't convinced about the importance of data quality, here's Sébastien
00:02:57.800 | Bubeck, who might be able to persuade you in an interview for AI Insiders on Patreon.
00:03:04.600 | That last slide that you did on, for your channel, amazing channel, where it was like
00:03:08.880 | a thousand X increase in effective compute and parameters and data, it seems massive
00:03:14.760 | to me, and I don't think enough people are appreciating that point.
00:03:19.800 | Yeah, I think it's pretty massive, to be honest, you know, before working on this line of work,
00:03:25.560 | I was thinking about improving the optimization algorithm.
00:03:28.640 | I was thinking about improving the architecture, and I worked on this for a few years, and
00:03:32.320 | we could get, you know, 2%, 3% improvement, like this small type of, you know, around
00:03:37.720 | the edges.
00:03:38.720 | It's nice, but it's tiny.
00:03:40.800 | But then suddenly when we started to focus on the data and really trying to craft data
00:03:46.240 | in a way that is more digestible by the LLM at training time, suddenly we saw this incredible
00:03:53.120 | like, you know, thousand X gains.
00:03:55.560 | So yes, I think it is massive, and it's really pointing to where the gold is, and the gold
00:04:01.380 | is in the data.
00:04:02.380 | Now, at this point, I know many of you will be saying, we get it, Philip, data quality
00:04:05.880 | is important, but tell us about the new architectures that might be processing that data.
00:04:10.880 | In other words, tell us about Mamba.
00:04:13.060 | That's a new architecture that has been generating a lot of buzz in AI circles, if not for the
00:04:18.480 | general public.
00:04:19.480 | I've been hoping to speak to one of the only two authors on this paper.
00:04:24.420 | In the meantime, I want to try to translate its significance to the lay person.
00:04:29.840 | I'm going to try to convey what Mamba is and what it means in just two or three minutes.
00:04:35.720 | Before we touch on the contender, let's talk about the king: Transformers.
00:04:40.100 | That's the architecture behind everything from DALL-E 3 to GPT-4.
00:04:44.580 | And transformers are famous in part because they're great at paying attention.
00:04:49.400 | In this diagram from the famous Illustrated Transformer, we see that as we encode or process
00:04:54.100 | the word "it", the transformer architecture pays attention to all of the previous words
00:04:59.140 | or tokens.
00:05:00.140 | Some, of course, such as "animal", more than others, and it will do this for all of the words it's
00:05:04.480 | going to encode.
00:05:05.800 | And the truth is that paying attention is great, but the kind of attention in transformers
00:05:10.580 | where every element must attend to every other element is complex at big scales.
00:05:16.800 | In fact, it's called quadratic complexity.
00:05:19.240 | And whenever you hear that word quadratic, think square.
00:05:22.080 | You may remember the word quadratic from school and the Y equals X squared graph.
00:05:27.680 | And hopefully that makes sense because as you double the number of elements, you far
00:05:31.620 | more than double the number of pairwise connections.
00:05:35.000 | In a very rough analogy, imagine everyone shaking hands in a room.
00:05:39.360 | And if you triple the number of people in that large room, you are roughly 9x-ing (3 squared)
00:05:46.080 | the number of handshakes.
00:05:47.520 | But now imagine sequences of 1 million interconnected tokens.
00:05:52.200 | No one has that much attention to give.
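
To make that handshake arithmetic concrete, here is a minimal Python sketch, not from the video, that counts the pairwise attention scores a causal transformer computes; the token counts are purely illustrative.

```python
# Minimal sketch (not from the video): counting the pairwise attention
# scores a causal transformer computes. Real implementations batch this
# into an n x n matrix rather than looping; the token counts are illustrative.

def causal_attention_pairs(n_tokens: int) -> int:
    # Each token attends to itself and every earlier token: 1 + 2 + ... + n.
    return n_tokens * (n_tokens + 1) // 2

for n in (1_000, 3_000, 1_000_000):
    print(f"{n:>9,} tokens -> {causal_attention_pairs(n):>18,} attention scores")

# Tripling the tokens (1,000 -> 3,000) gives roughly 9x the scores:
# the "handshakes in a room" analogy, in numbers.
```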
00:05:55.000 | But there are other known ways to process a sequence of inputs.
00:05:58.960 | How about a state of fixed size being updated by inputs step by step?
00:06:04.120 | It seems simpler, right?
00:06:05.320 | Although it's a lot less parallelizable, that is a hard word to say.
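
For contrast, here is an equally rough sketch of the recurrent alternative: one fixed-size state updated step by step. The matrices and sizes below are made up; the point is that per-step cost is constant, but each step depends on the previous one.

```python
import numpy as np

# Rough sketch of the recurrent alternative: one fixed-size state, updated
# step by step. Per-step cost is constant, so total work grows linearly with
# sequence length, but each step depends on the previous one, which is why
# this is harder to parallelise. Sizes and matrices are made up.

state_size, input_size, seq_len = 16, 8, 1_000
rng = np.random.default_rng(0)

A = rng.normal(scale=0.1, size=(state_size, state_size))  # state transition
B = rng.normal(scale=0.1, size=(state_size, input_size))  # input projection

h = np.zeros(state_size)                  # the fixed-size hidden state
for x_t in rng.normal(size=(seq_len, input_size)):
    h = np.tanh(A @ h + B @ x_t)          # fold each input into the same state

print(h.shape)  # (16,) -- the state never grows with the sequence
```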
00:06:09.260 | Now there have been attempts for quite a while to get the best of both.
00:06:13.160 | Here's Albert Gu, one of the lead authors of Mamba, and his paper from 2021 was called
00:06:18.680 | Structured State Spaces for Sequence Modeling.
00:06:21.440 | Indeed, that four-S name, S4, had so much sibilance that it inspired the Mamba snake name.
00:06:29.840 | But now let's get to the Mamba paper itself and the key diagram therein.
00:06:34.920 | There's that hidden state on the left, merrily processing its way across to the right.
00:06:40.740 | Updated in turn by the inputs coming in from here.
00:06:44.040 | This long lasting state needs to be a rich but compressed expression of all of the data
00:06:50.480 | seen so far.
00:06:51.760 | But if this approach can be made to work, it would mean far fewer connections to compute
00:06:57.000 | and therefore faster inference.
00:06:58.560 | Indeed, the paper claims 5x faster inference on NVIDIA's A100.
00:07:03.160 | And without that quadratic complexity, it would mean that even extremely long sequences
00:07:07.960 | - think vast code bases, DNA sequences, and even the video input from a long YouTube explainer
00:07:15.400 | - could now be handled without a mental breakdown.
00:07:19.000 | But remember, that state needs to compress all the data it's seen, which therefore means
00:07:23.960 | ignoring certain inputs.
00:07:25.800 | And that is where the selection mechanism comes in.
00:07:29.080 | It decides what inputs to ignore and which to concentrate on.
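
Here is a toy sketch of what that selection means, under my own simplifying assumptions: the strength with which an input is written into the state is itself computed from that input, so ignored tokens barely touch the state. This is a gating toy, not the actual Mamba discretisation or its hardware-aware kernel.

```python
import numpy as np

# Toy illustration of the selection idea: the write strength for each input
# is computed from that input, so tokens the model wants to ignore barely
# touch the state. NOT the actual Mamba discretisation or kernel.

state_size, input_size = 16, 8
rng = np.random.default_rng(0)

A = np.eye(state_size) * 0.95                        # slowly decaying state
B = rng.normal(scale=0.1, size=(state_size, input_size))
w_gate = rng.normal(scale=0.1, size=(input_size,))   # hypothetical gate weights

def selective_step(h, x):
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ x)))       # input-dependent: how much to let in
    return A @ h + gate * (B @ x)                    # ignored inputs are written weakly

h = np.zeros(state_size)
for x in rng.normal(size=(100, input_size)):
    h = selective_step(h, x)

print(h[:4])  # a compressed summary of the inputs the gate chose to keep
```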
00:07:33.360 | The trouble is, this rich and expressive hidden state with the distilled selected inputs is
00:07:38.880 | slow to process and computationally demanding.
00:07:42.300 | So how did Tri Dao keep this hidden state rich and expressive with the distilled wisdom of
00:07:47.400 | previous inputs?
00:07:48.600 | How did they expand that state without bringing everything to a standstill?
00:07:52.640 | Well, by painting it orange.
00:07:54.400 | No, but seriously, what orange means in the diagram is that it's processed in the GPU
00:07:59.680 | SRAM.
00:08:00.680 | Think of that as the super fast part of the GPU's memory with the shortest commute to
00:08:05.800 | the processing chip.
00:08:07.280 | In contrast, all the model parameters in green, which won't change, and the inputs can be
00:08:12.200 | handled by the slower high-bandwidth memory (HBM).
00:08:15.160 | All of this is where we get the term hardware-aware state expansion.
00:08:19.600 | It's an architecture built with an awareness of the kind of GPUs it's going to run on.
00:08:25.080 | Let's try to make this more tangible with an example of what we can achieve with all
00:08:28.520 | of this freed up complexity.
00:08:30.560 | Take this induction head comparison.
00:08:32.880 | First take the word "explained", which might be split into two tokens, "expl" and "ained".
00:08:38.200 | An induction head is a circuit that might be attending to the token "expl", and its function
00:08:43.720 | is to scan the existing sequence for previous examples of the token that it's attending to.
00:08:49.600 | Then it needs to find the token that, in that survey, came after it, which in our
00:08:54.400 | case will be "ained", and then forecast that that will happen once more to give us "explained".
00:08:59.480 | Obviously, you need to be great at recall to do this, especially if the sequence involves
00:09:04.240 | thousands, hundreds of thousands, or even millions of tokens.
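
Here is a small sketch of the induction-head behaviour being tested, with illustrative tokens: look back for the previous occurrence of the current token and forecast whatever followed it last time.

```python
# Small sketch of the induction-head pattern the benchmark tests, with
# illustrative tokens: look back for the previous occurrence of the current
# token and forecast whatever followed it last time.

def induction_predict(tokens: list[str]) -> str | None:
    current = tokens[-1]
    # Scan backwards for the last earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]      # predict what followed it before
    return None

sequence = ["the", "word", "expl", "ained", "was", "split", "so", "expl"]
print(induction_predict(sequence))    # -> "ained"
```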
00:09:09.160 | Other architectures fall apart as the sequence length gets longer than that found in training,
00:09:14.440 | but not Mamba.
00:09:15.880 | Look at the top line.
00:09:17.280 | Accuracy staying at one, even up to a million tokens.
00:09:21.960 | And remember that architectures like Mamba need not necessarily hunt alone.
00:09:26.360 | Take this announcement of StripedHyena from Together AI.
00:09:29.920 | They showed that we need not necessarily choose between the two, with a hybrid architecture involving attention
00:09:35.640 | performing well.
00:09:36.640 | But there's one more thing before we put the Mamba snake back into the basket.
00:09:41.360 | On the left, you can see its superior performance at great ape DNA classification.
00:09:46.440 | In other words, deciding whether a sequence of DNA was human, chimpanzee, gorilla, orangutan,
00:09:51.560 | or bonobo.
00:09:52.560 | Notice that it's on the longest sequence lengths that its performance starts to really
00:09:56.800 | shine.
00:09:57.800 | This task, by the way, they made artificially hard for themselves because it was originally
00:10:01.360 | about distinguishing between a human, lemur, mouse, pig, and hippo.
00:10:05.480 | But you could easily imagine other use cases like healthcare.
00:10:08.880 | You could rapidly analyze a patient's genetic data and come up with personalized medical
00:10:13.560 | treatment for them.
00:10:14.840 | More speculatively, you could imagine a chat bot that remembers a conversation you had
00:10:19.560 | months or years ago.
00:10:21.120 | And then of course, I'll leave it to you to think of all the other long sequences out
00:10:25.520 | there like stock market data or weather data.
00:10:28.980 | And as I mentioned before, video data from those long video explainers that annoy everyone.
00:10:34.720 | But wait, is this video becoming one of those?
00:10:37.120 | I really hope not.
00:10:38.320 | So I'm going to move on to the next reason that AI is not slowing down anytime soon.
00:10:44.000 | That is inference time compute, or the ability of the model, as Arthur Mensch said, to decide
00:10:49.640 | how much compute to allocate to certain problems.
00:10:52.920 | Now I touched on this in my Q* video, but here's Łukasz Kaiser, one of the authors
00:10:58.080 | of the transformer architecture and indeed a senior figure at OpenAI.
00:11:03.120 | For reasoning, you also need these chains of thought.
00:11:07.200 | You need to give the model the ability to think longer than it has layers.
00:11:14.720 | But it can be combined with multimodal, especially when you have -- nowadays, models, they can
00:11:20.920 | generate, you say, "Okay, how does it look when a boy kicks a ball?"
00:11:25.240 | And you can generate a few frames of the video.
00:11:27.520 | And now more and more, there are models that will generate the whole video.
00:11:32.200 | And then there start to be models of the world that say, "Well, if you drive a car like this,
00:11:36.040 | how will the street look?
00:11:37.120 | How will people look?
00:11:38.240 | What will happen?
00:11:39.240 | Will you crash into something?"
00:11:41.840 | So in the future, the models will have this knowledge of the world and this generation,
00:11:46.720 | which we call chain of thought and text.
00:11:49.360 | But multimodality, this means just it's a chain of frames of what's going to happen
00:11:54.080 | to the world, which is basically how we sometimes think.
00:11:57.440 | So it will be multimodality and this ability to generate sequences of things before you
00:12:03.000 | give an answer that will resemble much more what we call reasoning.
00:12:08.760 | For more on that, check out my Q* video, but here's another taster.
00:12:12.760 | This is from Noam Brown, also of OpenAI.
00:12:15.600 | He admitted that letting models think for longer would occasionally have drawbacks.
00:12:20.560 | Inference at times may be 1,000x slower and more costly.
00:12:24.440 | And he said, "What inference cost would we pay for a new cancer drug, or for a proof
00:12:29.440 | of the Riemann hypothesis?"
00:12:31.240 | AlphaCode2, which I've also done a video on, gives us a foretaste of the kind of results
00:12:36.720 | that we might expect.
00:12:38.120 | And remember, this is separate from data quality or dynamic new architectures.
00:12:42.000 | But we can't mention AlphaCode2 and inference time compute without also mentioning Let's
00:12:47.920 | Verify Step-by-Step, aka process-based verification.
00:12:51.360 | But I know what you're wondering, what is this graph and where does it come from?
00:12:54.360 | Well, it comes from a fascinating new paper entitled "AI Capabilities Can Be Significantly
00:12:59.520 | Improved Without Expensive Retraining".
00:13:02.040 | In a way, this paper sums up the message of this video.
00:13:05.460 | We are not even close to being done with the exponential gains in AI.
00:13:10.400 | And the way that this paper measured it was fascinating.
00:13:13.560 | It basically asked how much extra computing power would we have to provide to get the
00:13:19.240 | equivalent gain that these methods provide.
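
One way to picture that compute-equivalent gain, under my own illustrative assumptions rather than the paper's exact procedure: assume a score-versus-training-compute curve for the base model, then ask how far along that curve you would need to go to match the enhanced score.

```python
import numpy as np

# Illustrative picture of "compute-equivalent gain" (CEG), not the paper's
# exact procedure: assume a benchmark-score-versus-training-compute curve
# for the base model, then find the compute multiplier needed to match the
# score the cheap post-training enhancement achieved. All numbers are made up.

compute = np.array([1e21, 3e21, 1e22, 3e22, 1e23])   # training FLOP (hypothetical)
score   = np.array([0.40, 0.48, 0.56, 0.63, 0.70])   # benchmark accuracy (hypothetical)

base_compute   = 1e22      # where the base model sits on the curve
enhanced_score = 0.64      # what the enhancement reached without retraining

# Interpolate in log-compute space to find the compute that matches that score.
matching_compute = 10 ** np.interp(enhanced_score, score, np.log10(compute))
print(f"compute-equivalent gain ~ {matching_compute / base_compute:.1f}x")
```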
00:13:22.400 | As you can see, the methods are quite diverse and almost all of them have been covered before
00:13:27.680 | on this channel.
00:13:28.680 | Things like prompting and scaffolding, a bit like SmartGPT, tool use, or indeed data quality
00:13:34.240 | as we saw with Orca and self-consistency.
00:13:37.480 | If you're not familiar with either Orca or self-consistency and majority voting, check
00:13:41.920 | out the videos that I've done before linked in the description.
00:13:45.240 | The x-axis, by the way, is the one-time compute cost that these methods entail.
00:13:50.300 | Yes, some of them go up to 1% or even 10% as a fraction of the training compute used
00:13:56.680 | to create the models, but look at the returns on the y-axis.
00:14:01.360 | Having a verifier check the steps of a model does cost 0.001% as a fraction of the training
00:14:09.860 | compute, but the compute equivalent gain is around 9.
00:14:13.520 | In other words, you would have had to use 9 times as much compute on the base model
00:14:18.000 | to achieve similar results.
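
Here is a rough sketch of that verifier idea, in the spirit of Let's Verify Step-by-Step: sample several solutions, score every reasoning step with a verifier, and keep the best-scoring solution. The scoring function is a stand-in for a trained process reward model, not a real API.

```python
# Rough sketch of process-based verification in the spirit of Let's Verify
# Step-by-Step: sample several candidate solutions, score each reasoning
# step with a verifier, and keep the candidate whose steps score best.
# `score_step` stands in for a trained process reward model (a placeholder).

from typing import Callable

def best_by_process_verifier(
    candidates: list[list[str]],             # each candidate is a list of steps
    score_step: Callable[[str], float],      # verifier: 0..1 score per step
) -> list[str]:
    def solution_score(steps: list[str]) -> float:
        total = 1.0
        for step in steps:
            total *= score_step(step)        # one bad step drags the product down
        return total
    return max(candidates, key=solution_score)

# Toy usage with a fake verifier that dislikes steps containing "guess".
fake_verifier = lambda step: 0.2 if "guess" in step else 0.9
candidates = [
    ["let x = 3", "then 2x = 6", "guess the answer is 7"],
    ["let x = 3", "then 2x = 6", "so the answer is 6"],
]
print(best_by_process_verifier(candidates, fake_verifier))
```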
00:14:19.880 | And yes, many of these methods can indeed be combined.
00:14:23.440 | Indeed, that's what SmartGPT was all about.
00:14:25.600 | It was combining chain of thought, few-shotting, and majority voting.
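
A minimal sketch of that combination might look like this: few-shot examples plus a chain-of-thought instruction, several samples, then a majority vote over the final answers. The `ask_model` call is a placeholder for whatever LLM API you use; the prompt format and parsing are illustrative only.

```python
# Minimal sketch of combining few-shot prompting, chain of thought, and
# self-consistency (majority voting). `ask_model` is a placeholder for an
# actual LLM call; the prompt format and answer parsing are illustrative only.

from collections import Counter

FEW_SHOT = (
    "Q: 2 + 2 * 3 = ?\n"
    "Reasoning: multiplication first, 2 * 3 = 6, then 2 + 6 = 8.\n"
    "Answer: 8\n\n"
)

def ask_model(prompt: str, temperature: float) -> str:
    raise NotImplementedError("swap in your own LLM call here")

def majority_vote_answer(question: str, n_samples: int = 5) -> str:
    prompt = FEW_SHOT + f"Q: {question}\nLet's think step by step.\n"
    answers = []
    for _ in range(n_samples):
        completion = ask_model(prompt, temperature=0.8)   # diverse samples
        answers.append(completion.rsplit("Answer:", 1)[-1].strip())
    return Counter(answers).most_common(1)[0][0]          # most frequent answer wins
```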
00:14:29.640 | In 2024, we may see the Phi-2 data quality approach combined with, say, the Mamba architecture.
00:14:35.760 | And all of this is before we scale the models up. As the paper says, "Gains from enhancements
00:14:41.640 | and gains from scaling might interact and compound in non-trivial ways."
00:14:46.840 | And they give the example that chain of thought prompting didn't even work on smaller models.
00:14:51.160 | I remember saying toward the end of my SmartGPT video in August that we need to find out
00:14:56.920 | the ceiling of the models, not just the floor.
00:15:00.040 | And the authors concur saying researchers could also study whether there is a ceiling
00:15:04.320 | to the total improvement you can get from post-training enhancements.
00:15:08.300 | That will enable the AI labs to better predict how much more capable their model might become
00:15:13.920 | in the future.
00:15:14.920 | Now, I know what some veteran researchers will be thinking while looking at these charts.
00:15:19.800 | Surely it depends on the task and the benchmark and the models, and all of these numbers must
00:15:25.280 | be very approximate.
00:15:26.280 | And of course, you are right.
00:15:28.400 | But I have one more critique in addition to that.
00:15:31.120 | And that is the third, or is it fourth, to be honest, I've lost count, reason why AI
00:15:35.880 | is going to continue to improve dramatically.
00:15:38.760 | And that is prompt optimization.
00:15:40.720 | I spoke to Tim Rocktäschel of Google DeepMind about this for Patreon.
00:15:44.960 | But this is what I mean in a nutshell.
00:15:47.380 | Language models can optimize their own prompts.
00:15:49.880 | There are many techniques for doing this, but the manual methods we're coming up with,
00:15:54.720 | heuristics like "this is important for my career" and "my grandma wants me to do this",
00:15:59.480 | are our feeble manual approaches.
00:16:01.920 | Once we deploy LLMs to help us optimize the prompts going into LLMs, we might see dramatically
00:16:08.480 | better performance.
00:16:09.480 | Indeed, we already have for things like high school mathematics and movie recommendations.
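
In code, such a loop looks roughly like this OPRO-style sketch, under my own assumptions: `propose_prompts` asks an optimiser LLM for new candidate instructions given the best so far, and `evaluate` scores a prompt on a small dev set; both are placeholders, not real APIs.

```python
# Rough sketch of LLM-driven prompt optimisation, in the spirit of methods
# like OPRO: keep a scored list of candidate instructions, show the best of
# them to an "optimiser" LLM, ask it to propose better ones, evaluate, and
# repeat. `propose_prompts` and `evaluate` are placeholders, not real APIs.

from typing import Callable

def optimise_prompt(
    seed_prompt: str,
    propose_prompts: Callable[[list[tuple[str, float]]], list[str]],
    evaluate: Callable[[str], float],        # e.g. accuracy on a small dev set
    n_rounds: int = 10,
) -> str:
    scored = [(seed_prompt, evaluate(seed_prompt))]
    for _ in range(n_rounds):
        top = sorted(scored, key=lambda p: p[1], reverse=True)[:5]
        for candidate in propose_prompts(top):          # optimiser LLM writes new prompts
            scored.append((candidate, evaluate(candidate)))
    return max(scored, key=lambda p: p[1])[0]           # best-scoring prompt found
```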
00:16:15.160 | Anyway, this is the simplified version.
00:16:17.160 | Do check out that video.
00:16:18.360 | But the point is this.
00:16:19.520 | Even if we weren't going for dramatically better data quality, dynamic new architectures
00:16:24.160 | and getting models to reason and think for longer, then prompt optimization would allow
00:16:29.680 | us to squeeze out significantly better results, even from existing models.
00:16:34.760 | And notice I've got through the whole video without yet mentioning, of course, scaling
00:16:38.520 | the models up to 10 trillion parameters or indeed a hundred trillion parameters as promised
00:16:44.160 | by Etched.ai.
00:16:45.160 | I've got much more information about them coming soon, but let's say you're someone
00:16:49.640 | who doesn't care about stats, benchmarks, or indeed AGI countdowns.
00:16:54.520 | Maybe you're someone who is skeptical about tweets like this from an OpenAI employee who
00:16:59.240 | said, "Brace yourselves, AGI is coming."
00:17:02.120 | Well, even for you, 2024 looks set to be a dramatic year.
00:17:06.880 | These outputs from the WALT team at Google may be low resolution, but they're quite
00:17:11.920 | high consistency.
00:17:13.200 | I use Pika Labs and Runway's Gen-2, and the progress season on season is quite something to watch.
00:17:20.240 | And that would be a prediction I'd make for before the end of 2024.
00:17:24.400 | I think we will get a 3-5 second photorealistic text to video output that could fool most
00:17:30.280 | humans.
00:17:31.280 | Now you might say, "Oh, I'd never be fooled," but let's test you on that with text to
00:17:35.320 | image.
00:17:36.320 | Which of these then is the real Roman arch?
00:17:39.280 | One of them is from Midjourney version 6 upscaled with Magnific and the other is a real arch.
00:17:45.520 | The answer is that the one on the left is the real Roman arch.
00:17:50.720 | Now in my Christmas video, while misspelling some prompts, I did show off the "--style raw"
00:17:56.080 | input.
00:17:57.080 | This is for anyone using Midjourney version 6.
00:17:59.560 | But since then, from Reddit, I found an even better tip.
00:18:03.080 | Use the phrase "phone photo".
00:18:05.200 | You can get, as you can see, strikingly realistic images.
00:18:08.480 | And of course, you can further upscale any of these too.
00:18:11.600 | You can definitely get some quite interesting results this way.
00:18:14.680 | And at this point, I want to show you this quite fascinating prediction made a hundred
00:18:19.120 | years ago.
00:18:20.120 | This is what the cartoonist Harold Tucker Webster thought the world of drawing would
00:18:25.120 | be like in 2023.
00:18:27.520 | Now yes, we're a day into 2024, but I still think this is interesting.
00:18:31.960 | Notice the drawing is being done automatically.
00:18:34.800 | And the caption is, "In the year 2023, when all our work is done by electricity."
00:18:40.080 | He called it the cartoon dynamo.
00:18:42.040 | We call it Midjourney version 6, but tomato, tomahto.
00:18:45.400 | It's time now to draw the video to an end, but I'm going to end where I started.
00:18:49.880 | And no, I don't just mean this tweet from Jim Fan.
00:18:52.840 | I also mean the words I used in the introduction.
00:18:56.560 | And here, via ElevenLabs, is "me", in quotes. AI Explained's Philip is a real person and
00:19:03.360 | not GPT-5 or 6 as some assume, but here's the AI version of my voice delivering the
00:19:09.280 | intro.
00:19:10.280 | I hope everyone watching had an excellent 2023 and is looking to get 2024 off to a rambunctious
00:19:17.760 | start.
00:19:18.760 | This video has been made to show that we are on the steep part of the exponential and will
00:19:22.400 | be for a while yet.
00:19:23.820 | Just for fun, for all of my legendary supporters on Patreon, I'm going to process this entire
00:19:29.280 | video so you can hear how good AI is getting at imitating the human voice.
00:19:35.040 | Soon it may be impossible to tell who's human and who's not using audio alone and
00:19:39.600 | then soon thereafter even video.
00:19:41.640 | I know you guys know that I'm real, so thank you so much for watching all the way to the
00:19:46.680 | end and, as always, have a wonderful day and a wonderful 2024.