
‘Advanced Voice’ ChatGPT Just Happened … But There's 3 Other Stories You Probably Shouldn’t Ignore


Chapters

0:00 Intro
0:40 Voice Tips
1:47 Altman Predictions
4:10 Story 1
7:36 Story 2
13:33 Story 3

Transcript

Just a few minutes ago, the rollout of advanced voice mode for ChatGPT was completed, and apparently it was done "early", to quote Sam Altman. I've been playing with it, and it's as amazing as expected, but that's not actually the main focus of this video. Yes, I will quickly give some tips on how literally anyone can access these super responsive and realistic voices that can do all sorts of verbal feats, but then I'll cover three other stories from the last few days that you might have missed, and I am very, very confident you will be fascinated by at least one of them, if not every one.

But first, as you may have gathered from my accent, I am actually from the UK, which is geographically part of Europe, and you may be somewhat scratching your head as to how I've gained access to ChatGPT advanced voice mode. Officially, at least, advanced voice mode is not released in Europe, but here's what I did: first, I used a VPN.

Second, and this has apparently helped many people, I uninstalled and reinstalled the app. Third, you could add that I am a $20-a-month subscriber to ChatGPT. I'm not going to linger on this story, though, because you can draw your own conclusions about whether you enjoy the app, but for me, it was quite fun getting it to reply in various accents.

Personally, I think the biggest impact will be to bring potentially hundreds of millions more people into engaging every day with large language models. And the natural and not too distant endpoint for all of this is for ChatGPT to gain a photorealistic set of video avatars. Let me put one prediction on the record, which is that in 2025, I think we will be having effectively a Zoom call with ChatGPT.

But just for now, what are these three other stories that I'm talking about? And no, one of them isn't the Intelligence Age essay by Sam Altman. It does, though, introduce a story I'm going to be talking about, so let me spend just a minute on it. The essay came out around 36 hours ago, and it basically describes the imminent arrival of superintelligence.

He describes us all having virtual tutors, but the role for formal education is at the very least unclear in an age in which we would have superintelligence. Sam Altman did, though, kind of give us a date for when he thinks superintelligence will come, or at least a range: he said it's coming in a few thousand days.

Now, it's probably not going to be terribly fruitful to analyze this prediction too closely, but if we define "a few" as, say, between two and five thousand, that's between roughly 2030 and 2038. The story of how we get there is, according to Sam Altman, quite simple: deep learning worked. It's going to gradually understand the rules of reality that produce its training data, and any remaining problems will be solved.
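
As a quick back-of-the-envelope check on that range, here is a minimal sketch, assuming "a few" means two to five thousand days and counting forward from roughly when the essay appeared (late September 2024 is my assumption):

```python
# Rough sanity check of "a few thousand days", assuming "few" = 2 to 5
# and counting from the essay's approximate publication date.
from datetime import date, timedelta

essay_date = date(2024, 9, 23)  # approximate; an assumption
for thousands in (2, 5):
    days = thousands * 1000
    print(f"{days} days -> {essay_date + timedelta(days=days)}")
# 2000 days lands in early 2030; 5000 days lands in mid-2038.
```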

And if you will, let me try to summarize that declarative statement in a sentiment that I think pretty much everyone can agree on. If there's just a 10 or 20% chance he's correct, is this not the biggest news story of the century? Pretty hard to see how it wouldn't be, but that's not going to be the focus of this video.

No, it's a remark he made further on in the essay. You might think I'm going to focus on how he described AI systems that are so good that they can help us make the next generation of AI systems or how AI is going to help us fix the climate, establish a space colony and help us discover all of physics.

No, many will of course focus on how he no longer describes superintelligence as a risk of "lights out for all of us" and instead as a risk to the labor market. But I actually want to focus on this sentence. He said, "If we don't build enough infrastructure, AI will be a very limited resource that wars get fought over." That is a quite fascinating framing that will make more sense when you see the articles that I'm about to link to.

It was reported just yesterday that OpenAI thinks we're going to need even more power than the wild speculation about their plans from just six months ago suggested. The figures in this article are quite extraordinary, and I'm going to put them in context. But don't forget that framing from the essay we just saw.

If someone were to genuinely believe, and have evidence for, the idea that superintelligence could arrive within five to ten years, then this would make some sense. If progress in AI were bottlenecked by power, as I've described in other videos, it wouldn't just be harder to train such a superintelligence, but also to make it available to everyone.

The cost of inference, aka the cost of actually getting outputs from the model, would be prohibitive for many around the world, and there is a real scenario where that leaves us in quite an awkward situation where, essentially, rich people can get answers from a superintelligence and poor people can't.

But anyway, let's put some quick context on these numbers, like five gigawatts, before getting to the next interesting story. Five gigawatts is roughly the equivalent of five nuclear reactors, or enough power for almost three million homes. Now, I know what you might be thinking: that sounds like a lot, but not completely crazy. And I would almost agree with that if they were proposing just one such five-gigawatt data center.

After all, I've already done a video a few months back on the hundred-billion-dollar Stargate AI supercomputer. That system, which could launch as soon as 2028, will by 2030 need as much as five gigawatts of power. So, nothing too new in that Bloomberg article, right? Well, except that now OpenAI is talking about building five to seven data centers that are each five gigawatts.

That's enough to power New York City and London combined. And it must be added, of course, that many think that's so ambitious it's just not feasible. What does it say, though, about the scale of confidence of OpenAI, and more importantly Microsoft, who are funding much of this, that they are even reaching for these figures?
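
To make those figures a little more concrete, here is a quick back-of-the-envelope sketch; the roughly 1.7 kW of average demand per home is my own assumption, chosen so that one five-gigawatt site works out to the "almost three million homes" mentioned above:

```python
# Back-of-the-envelope arithmetic on the data-center power figures above.
AVG_HOME_KW = 1.7          # assumed average household demand, in kilowatts
SITE_GW = 5                # power per proposed data center, in gigawatts
SITES = (5, 7)             # the reported range of planned sites

homes_per_site = SITE_GW * 1e6 / AVG_HOME_KW   # GW -> kW, then divide
print(f"one site: ~{homes_per_site / 1e6:.1f} million homes")  # ~2.9 million

for n in SITES:
    total_gw = n * SITE_GW
    total_homes_m = n * homes_per_site / 1e6
    print(f"{n} sites: {total_gw} GW, roughly {total_homes_m:.0f} million homes")
```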

And the moment you start looking out for these stories, they're everywhere, like this article from just yesterday in Wired: Microsoft has done a deal to bring back the Three Mile Island nuclear reactor. Of course, many of you will be thinking there is a 50% chance, even an 80% chance, that all of this just ends in a puff of smoke.

Maybe these five-gigawatt data centers don't happen, or they do happen and it turns out you need far more than just compute to get superintelligence. But for me, after the release of o1-preview, I'm a little bit less confident that compute isn't all we need. I'm not saying we don't need immense talent, tricks, and data, but it could be that compute is the current big bottleneck.

And I do wonder if even Yann LeCun might be starting to agree with that sentiment. For a deep dive on that, do check out the new $9 AI Insiders on my Patreon. For years now, and as recently as just two weeks ago, Yann LeCun has been citing PlanBench to establish a discrepancy between human planning ability and that of LLMs.

Suffice to say that after I go through a newly released paper in this video, you may no longer believe that such a distinction exists. But my second story actually involves an announcement from yesterday by Google, though I will be bringing in a comparison to o1. The TL;DR is that they improved the benchmark performance of Gemini 1.5 Pro while also reducing the price and increasing the speed.

They did, however, give it the very awkward name of Gemini 1.5 Pro 002. Do you remember how we originally had Gemini Pro and also Gemini Ultra? Ultra was the biggest and best model, and Pro was like the middle version. That was generation 1, but then we got 1.5 Pro and no 1.5 Ultra.

So both the number and the name imply that there's much more to come that we're just not seeing. It's 1.5, not 2. It's the Pro version, not the Ultra version. It's this constant, tantalizing promise, and all of them do it, that the next version is just around the corner.

It's Claude 3.5 Sonnet, not Claude 4. Oh, and it's the Sonnet, not the Opus, the biggest edition from Anthropic. And now, by the way, we have Gemini 1.5 Pro 002. So will the next version be Gemini 1.5 Pro 003, or maybe Gemini 2 Ultra 007? Anyway, let's get to the performance, which is the main thing, not the name.

The amount of content that you can feed into the model at any one time remains amazing at 2 million tokens. As they said, imagine 1,000-page PDFs or answering questions about repos containing 10,000 lines of code. Moreover, on traditional benchmarks, as you might expect, there is a significant upgrade.
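
If you want to try that long-context workflow yourself, here is a minimal sketch using the google-generativeai Python SDK; the model name string, API key, and file path are assumptions for illustration, so check the current docs for the exact identifiers:

```python
# Minimal sketch: asking Gemini 1.5 Pro 002 questions about a long document.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key
model = genai.GenerativeModel("gemini-1.5-pro-002")

# Read a large source file; a 2M-token context window means even a
# repo-sized dump or the text of a 1,000-page PDF can fit in one prompt.
with open("big_codebase_dump.txt", "r", encoding="utf-8") as f:
    source = f.read()

response = model.generate_content(
    [source, "Summarize the main modules and how they interact."]
)
print(response.text)
```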

If I zoom in, you can see the significant upgrade in mathematics performance, as well as in vision and translation. In the incredibly challenging biology, physics, and chemistry benchmark known as GPQA (Google-Proof Question and Answer), it got 59%, up 13 percentage points from where it was before. It should be noted that the o1 family gets up to around 80%.

I of course ran it on SimpleBench, like I do for all new models, and while I am so close to being able to publish all the results from all the models, let me give you a vivid example to explain the difference between 1.5 Pro and o1-preview. I'm going to use a slightly tweaked version of an example given by OpenAI itself in its release videos for the o1 family.

The example they gave involved putting a strawberry into a cup, placing the cup upside down on a table, then picking up the cup, putting it in a microwave, and asking about the strawberry. The vast majority of humans will realize that the strawberry is still on the table, and o1-preview is the first LLM to also realize that fact.

But I want to illustrate, through comparison also to Gemini 1.5 Pro, how o1's world model is still far from complete. That's why its performance on SimpleBench still lags dramatically behind humans. Here is my tweaked version of that question, which is not found in the benchmark, because that data will remain private.

I used the same intro and outro as OpenAI but just changed a few things. Let's see if you notice. Jerry is standing as he puts a small strawberry into a normal cup and places the cup upside down on a normal table. Just the same. The table, though, is made of beautiful mahogany wood.

Its ornate top-left corner is positioned to nudge Jerry's shoulder. Now try to picture that: its top-left corner is nudging his shoulder. Its intricately carved bottom-right top surface digs into his outstretched right ankle. So, top-left corner nudging his shoulder; bottom-right top surface digging into his right ankle.

Jerry then lifts the cup (what will happen?), drops anything he is holding aside from the cup (another hint), and puts the cup inside the microwave and turns the microwave on. Where is the strawberry now? The model thought for 46 seconds, but I tried to make it abundantly obvious that the table is tilted.

If you imagine someone standing up with one top-left corner of a normal table against their shoulder and the opposite bottom-right corner against their ankle, it is almost inconceivable that that table is not tilted, in fact tilted quite dramatically. So when Jerry lifts up the cup, even before he drops everything else he's holding, i.e. the table, the strawberry would roll off the table.

o1-preview, with that incomplete world model, misses that completely. Well, I should correct myself: it actually kind of notices, it just doesn't follow through. It says this suggests the table is at an angle. Well done. Possibly tilted or leaning, with one corner higher than the other. Yeah, shoulder and ankle.

Tell me about it. However, this description serves more as a red herring and does not impact the strawberry's position. Again, I want to emphasize this is not actually a SimpleBench question which would have a more clear-cut answer. Some of you might say it gets trapped in the carving or something like that.

SimpleBench questions have clear correct answers, now with six multiple-choice options. Anyway, as you can see, o1 says nothing about the strawberry getting stuck on the table. It addresses the tilt but says that will have no effect, and it says the strawberry will stay on the table. Okay, you're thinking, but wasn't this second story supposed to be about Gemini?

Yes, and I of course tested this exact question on Gemini 1.5 Pro 002. What a mouthful. And the strawberry is apparently inside the cup, inside the microwave. Now, yes, I could have given you a clearer-cut mathematical question, but I thought this one nicely illustrates that difference, that gap between the o1 family and Gemini 1.5 Pro.

I'm not in any way saying that Google won't at some point catch up; they have the resources and talent to do so, just that their current frontier model is a step behind. If you really care about costs, though, their new proposition is pretty compelling. Now for the final story, which is actually powered by Gemini 1.5 Pro: it's Google's NotebookLM.

And some of you might be surprised that I'm giving it that higher prominence, but it's actually an amazing free tool, and Google should be celebrated for it. In fact, let me go one step further and defy anyone not to find at least one personal or work use case for NotebookLM.

I might have just piqued your curiosity. So what is NotebookLM? How does it work, and what does it do? It's very simple, and anyone can use it. You just upload a source, like a PDF or text file. In fact, I'm going to do that again here, just so you see the process quickly.

Once you have chosen your file, this screen pops up and you'll have the option to generate a deep-dive conversation with, intriguingly, two hosts. You can add other sources and chat with a document, but I'm going to focus on the key feature: that audio overview. After you click generate, it takes between a minute and a few minutes, depending, of course, on the number and length of the sources you're using.

In about 30 seconds, I'm going to give you a sample of its output and it will be worth the wait. But very quickly before that, what did I actually upload? Well, it was a transcript of my Q-Star video from last November. But how did I get such a good transcript?

And many of you will know where I'm going from here. I used AssemblyAI's Universal-1, which is the state-of-the-art multilingual speech-to-text model. I am grateful that AssemblyAI is sponsoring this video, and they have the industry's lowest word error rate. And by the way, it's not just about words.

It's about catching those characters, like when I say GPT-4o. Not many models, I can tell you, capture that accurately. I've only worked with three companies in the history of this channel, and you can start to see why AssemblyAI is one of them. Even better, of course, if you're interested, you can click on the link in the description to try it yourself.
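
For the curious, producing a transcript like that looks roughly like the following. This is a minimal sketch using AssemblyAI's Python SDK, with the API key and file names as placeholder assumptions; check their docs for the exact options, including how to select a specific model such as Universal-1:

```python
# Minimal sketch: turning a video's audio into a text transcript with
# AssemblyAI's Python SDK. Key and file names are placeholder assumptions.
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("q_star_video_audio.mp3")  # local path or URL

# The plain text can then be saved and uploaded to NotebookLM as a source.
with open("q_star_transcript.txt", "w", encoding="utf-8") as f:
    f.write(transcript.text)
```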

So a couple of minutes later, using that transcript, Google produced this. It's essentially an AI-generated conversation, or podcast, between two hosts about the document or PDF you provide. Here is a 20-second snippet: "OpenAI. Seems like they're always making headlines, right? Every day there's a new story about how they're on the edge of some huge AI breakthrough, or maybe a total meltdown.

But you've been digging deeper than the headlines and you've found some really interesting stuff. We're talking potential game changers they've been working on. So let's try to connect the dots together and see what's really going on. I am always down for a good deep dive." Now, I know some of you will be thinking that I'm getting too excited about it, but I think this is a tool that could be used by almost anyone.

Obviously, this isn't for high-stakes settings where every detail is crucial, but if you're trying to make any material engaging, this is a great way of doing it. It's very easy to get caught up in the ups and downs of AI, but this tool is a genuine step forward.

Those were my three stories, and I didn't even get to Kling AI's Motion Brush, where you can control text-to-video in unprecedented ways. I am genuinely curious which of these four stories in total you found the most important or interesting. And even if you somehow found none of them interesting, thank you so much for watching to the end.

I personally found all of them interesting, but regardless, thank you so much for watching and have a wonderful day.