‘Advanced Voice’ ChatGPT Just Happened … But There's 3 Other Stories You Probably Shouldn’t Ignore
Chapters
0:00 Intro
0:40 Voice Tips
1:47 Altman Predictions
4:10 Story 1
7:36 Story 2
13:33 Story 3
Just a few minutes ago, the rollout of advanced voice mode for ChatGPT was completed, and apparently it was done "early," to quote Sam Altman. I've been playing with it, and it's as amazing as expected, but that's not actually the main focus of this video. Yes, I will quickly give some tips on how literally anyone can access these super-responsive and realistic voices that can do all sorts of verbal feats, but then I'll cover three other stories from the last few days that you might have missed, and I am very, very confident you will be fascinated by at least one of them, if not every one.

But first, as you may have gathered from my accent, I am actually from the UK, which is geographically part of Europe, and you may be somewhat scratching your head as to how I've gained access to ChatGPT advanced voice mode. Officially, at least, advanced voice mode is not released in Europe. But what I did was, first, use a VPN. Second, and this has apparently helped many people, I uninstalled and reinstalled the app. Thirdly, you could add that I am a $20-a-month subscriber to ChatGPT. I'm not going to linger on this story, though, because you can draw your own conclusions about whether you enjoy the app, but for me it was quite fun getting it to reply in various accents. Personally, I think the biggest impact will be to bring potentially hundreds of millions more people into engaging every day with large language models. And the natural, not-too-distant endpoint for all of this is for ChatGPT to gain a photorealistic set of video avatars. Let me put one prediction on the record: in 2025, I think we will effectively be having a Zoom call with ChatGPT.

But just for now, what are these three other stories that I'm talking about? And no, one of them isn't the Intelligence Age essay by Sam Altman. It does, though, introduce a story I'm going to be talking about, so let me spend just a minute on it. The essay came out around 36 hours ago, and it basically describes the imminent arrival of superintelligence. He describes us all having virtual tutors, though the role for formal education is, at the very least, unclear in an age in which we would have superintelligence. Sam Altman did, though, kind of give us a date for when he thinks superintelligence will come, or at least a range.
He said it's coming in a few thousand days. Now, it's probably not going to be terribly fruitful to analyze this prediction too closely, but if we define "a few" as, say, between two and five, that's between 2030 and 2038. The story of how we get there is, according to Sam Altman, quite simple: deep learning worked. It's going to gradually understand the rules of reality that produce its training data, and any remaining problems will be solved. And, if you will, let me try to summarize that declarative statement in a sentiment that I think pretty much everyone can agree on: if there's just a 10 or 20% chance he's correct, is this not the biggest news story of the century? Pretty hard to see how it wouldn't be, but that's not going to be the focus of this video.

No, it's a remark he made further on in the essay. You might think I'm going to focus on how he described AI systems that are so good that they can help us make the next generation of AI systems, or how AI is going to help us fix the climate, establish a space colony, and discover all of physics. Many will, of course, focus on how he no longer describes superintelligence as a risk of "lights out for all of us" and instead as a risk for the labor market. But I actually want to focus on this sentence. He said: "If we don't build enough infrastructure, AI will be a very limited resource that wars get fought over." That is a quite fascinating framing that will make more sense when you see the articles that I'm about to link to.

It was reported just yesterday that OpenAI thinks we're going to need more power than it was widely speculated even they were aiming for just six months ago. The figures in this article are quite extraordinary, and I'm going to put them in context. But don't forget that framing from the essay we just saw. If someone were to genuinely believe, and have evidence for, the claim that superintelligence could arrive within five to ten years, then this would make some sense. If progress in AI were bottlenecked by power, as I've described in other videos, it wouldn't just be harder to train such a superintelligence, but harder to spread it out to everyone. The cost of inference, a.k.a. the cost of actually getting outputs from the model, would be prohibitive to many around the world, and there is a real scenario where that leaves us in quite an awkward situation where, essentially, rich people can get the answers from a superintelligence and poor people can't.

But anyway, let's put some quick context on these numbers, like five gigawatts, before getting to the next interesting story. Five gigawatts is roughly the equivalent of five nuclear reactors, or enough power for almost three million homes. Now, I know what you might be thinking: that sounds like a lot, but not completely crazy. And I would almost agree with that if they were proposing just one such five-gigawatt data center. After all, I've already done a video a few months back on the hundred-billion-dollar Stargate AI supercomputer. That system, which could be launched as soon as 2028, will by 2030 need as much as five gigawatts of power. So nothing too new in that Bloomberg article, right? Well, except that now OpenAI are talking about building five to seven data centers that are each five gigawatts. That's enough to power New York City and London combined. And it must be added, of course, that many think that's so ambitious it's just not feasible. What does it say, though, about the scale of confidence of OpenAI, and more importantly Microsoft, who are funding much of this, that they are even reaching for these figures? And the moment you start looking out for these stories, they're everywhere, like this article just from yesterday in Wired: Microsoft have done a deal to bring back the Three Mile Island nuclear reactor.

Of course, many of you will be thinking there is a 50% chance, even an 80% chance, that all of this just ends in a puff of smoke. Maybe these five-gigawatt data centers don't happen, or they do happen and it turns out you need far more than just compute to get superintelligence. But for me, after the release of o1-preview, I'm a little bit less confident that compute isn't all we need. I'm not saying we don't need immense talent, tricks, and data, but it could be that compute is the current big bottleneck. And I do wonder if even Yann LeCun might be starting to agree with that sentiment. For a deep dive on that, do check out the new $9 AI Insiders on my Patreon. For years now, and as recently as just two weeks ago, Yann LeCun has been citing PlanBench as establishing a discrepancy between human planning ability and that of LLMs. Suffice to say that after I go through a newly released paper in that video, you may no longer believe that such a distinction exists.

But my second story actually involves an announcement from yesterday by Google, though I will be bringing in a comparison to o1. The TL;DR is that they improved the benchmark performance of Gemini 1.5 Pro while also reducing the price and increasing the speed. They did, however, give it the very awkward name of Gemini 1.5 Pro 002.
Do you remember we originally had Gemini Pro and also Gemini Ultra? Ultra was the biggest and best model, and Pro was like the middle version. That was generation 1, but then we got 1.5 Pro and no 1.5 Ultra. So both the number and the name imply that there's much more to come; we're just not seeing it. It's 1.5, not 2. It's the Pro version, not the Ultra version. It's this constant tantalizing promise, and all of them do it, that the next version is just around the corner. It's Claude 3.5 Sonnet, not Claude 4. Oh, and it's the Sonnet, not the Opus, the biggest edition from Anthropic. And now, by the way, it's Gemini 1.5 Pro 002. So will the next version be Gemini 1.5 Pro 003, or maybe Gemini 2 Ultra 007?

Anyway, let's get to the performance, which is the main thing, not the name. The amount of content that you can feed into the model at any one time remains amazing at two million tokens. As they said, imagine 1,000-page PDFs, or answering questions about repos containing 10,000 lines of code.
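Those examples can be sanity-checked with a quick back-of-envelope calculation. The words-per-page and tokens-per-line figures below are my own rough assumptions, not numbers from Google's announcement, but under any reasonable choice the quoted examples fit with plenty of room to spare:

```python
# Rough sanity check of the 2M-token context window claim.
# Per-page and per-line densities are assumptions, not official figures.

CONTEXT_WINDOW = 2_000_000    # tokens (Gemini 1.5 Pro)

WORDS_PER_PAGE = 500          # a dense PDF page (assumption)
TOKENS_PER_WORD = 1.3         # common English rule of thumb (assumption)
pages = CONTEXT_WINDOW / (WORDS_PER_PAGE * TOKENS_PER_WORD)
print(f"~{pages:,.0f} PDF pages fit")        # ~3,077 pages

TOKENS_PER_CODE_LINE = 12     # rough average for source code (assumption)
lines = CONTEXT_WINDOW / TOKENS_PER_CODE_LINE
print(f"~{lines:,.0f} lines of code fit")    # ~166,667 lines
```

So a 1,000-page PDF or a 10,000-line repo is a comfortable example rather than the ceiling.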
Moreover, on traditional benchmarks, as you might expect, there is a significant upgrade. If I zoom in, you can see the significant upgrade in mathematics performance, as well as in vision and translation. On the incredibly challenging biology, physics, and chemistry benchmark known as GPQA (Google-Proof Question and Answer), it got 59%, up 13 points from where it was before. It should be noted that the o1 family gets up to around 80%.

I, of course, ran it on SimpleBench, like I do for all new models, and while I am so close to being able to publish all the results from all the models, let me give you a vivid example to explain the difference between 1.5 Pro and o1-preview. I'm going to use a slightly tweaked version of an example given by OpenAI itself in its release videos for the o1 family. The example they gave involved putting a strawberry into a cup, placing the cup upside down on a table, then picking up the cup and putting it in a microwave, and asking about the strawberry. The vast majority of humans will realize that the strawberry is still on the table, and o1-preview is the first LLM to also realize that fact. But I want to illustrate, through comparison also to Gemini 1.5 Pro, how o1's world model is still far from complete. That's why its performance on SimpleBench still lags dramatically behind humans. Here is my tweaked version of that question, which is not found in the benchmark, because that data will remain private. I used the same intro and outro as OpenAI but just changed a few things. Let's see if you notice.
Jerry is standing as he puts a small strawberry into a normal cup and places the cup upside down on a normal table. Just the same. The table, though, is made of beautiful mahogany wood. Its ornate top left corner is positioned to nudge Jerry's shoulder. Now try to picture that: its top left corner is nudging his shoulder. Its intricately carved bottom right top surface digs into his outstretched right ankle. So: top left corner nudging his shoulder, bottom right top surface nudging his right ankle. Jerry then lifts the cup (what will happen?), drops anything he is holding aside from the cup (another hint), puts the cup inside the microwave, and turns the microwave on. Where is the strawberry now?

The model thought for 46 seconds, but I tried to make it abundantly obvious that the table is tilted. If you imagine someone standing up with one top left corner of a normal table against their shoulder and the opposite bottom right corner against their ankle, it is almost inconceivable that that table is not tilted, in fact tilted quite dramatically. So when Jerry lifts up the cup, let alone before he even drops everything else he's holding, i.e. the table, the strawberry would roll off the table. o1-preview, with that incomplete world model, misses that completely. Well, I should correct myself: it actually kind of notices, it just doesn't follow through. It says this suggests the table is at an angle. Well done. Possibly tilted or leaning, with one corner higher than the other. Yeah, shoulder and ankle, tell me about it. However, this description "serves more as a red herring and does not impact the strawberry's position." Again, I want to emphasize this is not actually a SimpleBench question, which would have a more clear-cut answer; some of you might say it gets trapped in the carving or something like that. SimpleBench questions now have clear correct answers with six multiple-choice options. Anyway, as you can see, o1 says nothing about getting stuck on the table. It addresses the tilt but says that will have no effect, and it says the strawberry will stay on the table.

Okay, you're thinking, but wasn't this second story supposed to be about Gemini? Yes, and I of course tested this exact question on Gemini 1.5 Pro 002. What a mouthful.
And the strawberry is apparently inside the cup, inside the microwave. Now, yes, I could have given you a more clear-cut mathematical question, but I thought this one illustrates that difference, that differential, between the o1 family and Gemini 1.5 Pro. I'm not in any way saying that Google won't at some point catch up; they have the resources and talent to do so, just that their current frontier model is a step behind. If you really care about costs, though, their new proposition is pretty compelling.

Now for the final story, which is actually powered by Gemini 1.5 Pro, and it's Google's NotebookLM. Some of you might be surprised that I'm giving it that high a prominence, but it's actually an amazing free tool, and Google should be celebrated for it. In fact, let me go one step further and defy anyone to not find at least one use case, for personal use or work, for NotebookLM. I might have just caught your curiosity. So what is NotebookLM? How does it work, and what does it do? It's very simple; anyone can use it. You just upload a source, like a PDF or text file. In fact, I'm going to do that again here just so you see the process quickly. Once you have chosen your file, this screen pops up and you'll have the option to generate a deep-dive conversation with, intriguingly, two hosts. You can use other sources and chat with a document, but I'm going to focus on the key feature: that audio overview. After you click generate, it takes between a minute and a few minutes, depending, of course, on the number and length of sources you're using.

In about 30 seconds, I'm going to give you a sample of its output, and it will be worth the wait. But very quickly before that: what did I actually upload? Well, it was a transcript of my Q-Star video from last November. But how did I get such a good transcript? Many of you will know where I'm going from here. I used Assembly AI's Universal-1, which is the state-of-the-art multilingual speech-to-text model. I am grateful that Assembly AI is sponsoring this video, and they have the industry's lowest word error rate.
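For anyone curious what "word error rate" actually measures: it is the minimum number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of reference words. Here is a minimal dynamic-programming sketch of the metric itself (my own illustration, not Assembly AI's implementation):

```python
# Word error rate (WER): edit distance over words, normalized by
# the length of the reference transcript. Lower is better; 0.0 is perfect.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One mis-heard word in two: WER of 0.5.
print(wer("hello world", "hello word"))  # 0.5
```

Note how sensitive the metric is: mis-transcribing a single token like "GPT-4o" as several wrong words racks up one substitution plus extra insertions.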
And by the way, it's not just about words, it's about catching those characters, like when I say "GPT-4o". Not many models, I can tell you, capture that accurately. I've only worked with three companies in the history of this channel, and you can start to see why Assembly AI is one of them. Even better, of course, if you're interested, you can click the link in the description to try it yourself. So, a couple of minutes later, using that transcript, Google produced this.
It's essentially an AI-generated conversation, or podcast, between two hosts about the document or PDF you provide. Here is a 20-second snippet: OpenAI. Seems like they're always making headlines, right? Every day there's a new story about how they're on the edge of some huge AI breakthrough, or maybe a total meltdown. But you've been digging deeper than the headlines, and you've found some really interesting stuff. We're talking potential game changers they've been working on. So let's try to connect the dots together and see what's really going on.
I am always down for a good deep dive. Now, I know some of you will be thinking that I'm getting too excited about it, but I think this is a tool that could be used by almost anyone. Obviously, this isn't for high-stakes settings where every detail is crucial, but if you're trying to make any material engaging, this is a great way of doing it. It's very easy to get caught up in the ups and downs of AI, but this tool is a genuine step forward.

Those were my three stories, and I didn't even get to Kling AI's motion brush, where you can control text-to-video in unprecedented ways. I am genuinely curious which of these four stories in total you found the most important or interesting. And even if you somehow found none of them interesting, thank you so much for watching to the end. I personally found all of them interesting, but regardless, thank you so much for watching, and have a wonderful day.