
‘Advanced Voice’ ChatGPT Just Happened … But There's 3 Other Stories You Probably Shouldn’t Ignore


Chapters

0:00 Intro
0:40 Voice Tips
1:47 Altman Predictions
4:10 Story 1
7:36 Story 2
13:33 Story 3

Whisper Transcript

00:00:00.000 | Just a few minutes ago, the rollout of advanced voice mode for ChatGPT was complete and apparently
00:00:06.560 | it was done "early," to quote Sam Altman. I've been playing with it; it's amazing, as expected,
00:00:12.800 | but that's not actually the main focus of this video. Yes, I will quickly give some tips on how
00:00:18.720 | literally anyone can access these super responsive and realistic voices that can do all sorts of
00:00:25.120 | verbal feats, but then I'll cover three other stories in the last few days that you might have
00:00:31.440 | missed, and I am very, very confident you will be fascinated by at least one of them, if not
00:00:38.240 | every one. But first, as you may have gathered from my accent, I am actually from the UK,
00:00:44.160 | which is geographically part of Europe, and you may be somewhat scratching your head as to how
00:00:49.920 | I've gained access to ChatGPT advanced voice mode. At least officially, advanced voice mode is not
00:00:55.920 | released in Europe, but what I did was first, I used a VPN. Second, and this has helped many
00:01:01.680 | people apparently, I uninstalled and reinstalled the app. Thirdly, I should add that I am a $20-a-month
00:01:07.040 | subscriber to ChatGPT. I'm not, though, going to linger on this story because you can draw your
00:01:12.720 | own conclusions about whether you enjoy the app, but for me, it was quite fun getting it to reply
00:01:18.640 | in various accents. Personally, I think the biggest impact will be to bring potentially
00:01:23.840 | hundreds of millions more people into engaging every day with large language models. And the
00:01:29.920 | natural and not too distant endpoint for all of this is for ChatGPT to gain a photorealistic set
00:01:37.680 | of video avatars. Let me put one prediction on the record, which is that in 2025, I think we
00:01:43.680 | will be having effectively a Zoom call with ChatGPT. But just for now, what are these three
00:01:49.040 | other stories that I'm talking about? And no, one of them isn't "The Intelligence Age" essay by Sam
00:01:56.240 | Altman. It does, though, introduce a story I'm going to be talking about. So let me spend just a minute
00:02:01.840 | on it. The essay came out around 36 hours ago and it basically describes the imminent arrival of
00:02:08.480 | superintelligence. He describes us all having virtual tutors, but the role for formal education
00:02:15.040 | is at the very least unclear in an age in which we would have superintelligence. Sam Altman did
00:02:21.840 | though kind of give us a date for when he thinks superintelligence will come, or at least a range.
00:02:28.480 | He said it's coming in a few thousand days. Now, it's probably not going to be terribly fruitful
00:02:34.240 | to analyze this prediction too closely, but if we define "a few" as, say, between two and five,
00:02:41.200 | that's between 2030 and 2038.
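
As a rough sanity check on that arithmetic, here is a tiny Python sketch. To be clear, the start date is my assumption (the essay appeared in late September 2024), and reading "a few" as two to five thousand days is just the interpretation above.

    from datetime import date, timedelta

    essay_date = date(2024, 9, 23)  # assumed publication date of "The Intelligence Age"
    for thousands in (2, 5):        # reading "a few thousand" as two to five thousand days
        arrival = essay_date + timedelta(days=thousands * 1000)
        print(f"{thousands},000 days -> {arrival.isoformat()}")

Which lands in 2030 and 2038 respectively, matching the range above.
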
00:02:48.080 | The story, though, of how we get there is, according to Sam Altman, quite simple. Deep learning worked. It's going to gradually understand the rules of reality that
00:02:53.920 | produce its training data and any remaining problems will be solved. And if you will,
00:02:59.520 | let me try to summarize that declarative statement in a sentiment that I think
00:03:04.400 | pretty much everyone can agree on. If there's just a 10 or 20% chance he's correct,
00:03:10.400 | is this not the biggest news story of the century? Pretty hard to see how it wouldn't be,
00:03:17.200 | but that's not going to be the focus of this video. No, it's a remark he made further on
00:03:23.200 | in the essay. You might think I'm going to focus on how he described AI systems that are so good
00:03:28.960 | that they can help us make the next generation of AI systems or how AI is going to help us fix the
00:03:35.360 | climate, establish a space colony and help us discover all of physics. No, many will of course
00:03:42.240 | focus on how he no longer describes superintelligence as a risk of "lights out for
00:03:48.080 | all of us" and instead as a risk to the labor market. But I actually want to focus on this
00:03:54.160 | sentence. He said, "If we don't build enough infrastructure, AI will be a very limited
00:03:59.920 | resource that wars get fought over." That is a quite fascinating framing that will make more
00:04:06.080 | sense when you see the articles that I'm about to link to. It was reported just yesterday that
00:04:11.520 | OpenAI thinks we're going to need even more power than the wildest speculation of just six months
00:04:17.840 | ago suggested they were aiming for. The figures in this article are quite extraordinary and I'm going
00:04:23.600 | to put it in context. But don't forget that framing from the essay we just saw. If someone
00:04:28.720 | were to genuinely believe and have evidence for the fact that superintelligence could arrive
00:04:34.480 | within five to ten years, then this would make some sense. If progress in AI were bottlenecked by
00:04:41.200 | power, as I've described in other videos, it wouldn't just be harder to train such a superintelligence
00:04:46.240 | but also to spread it out to everyone. The cost of inference, aka the cost of actually getting
00:04:51.280 | outputs from the model would be prohibitive to many around the world and there is a real scenario
00:04:57.600 | where that leaves us in quite an awkward situation where essentially rich people can get the answers
00:05:03.280 | from a superintelligence and poor people can't. But anyway let's put some quick context on these
00:05:08.800 | numbers like five gigawatts before getting to the next interesting story. Five gigawatts is roughly
00:05:14.960 | the equivalent of five nuclear reactors or enough power for almost three million homes. Now I know
00:05:23.520 | what you might be thinking: that sounds like a lot, but not completely crazy. And I would almost agree
00:05:28.720 | with that if they were proposing just one such five gigawatt data center. After all I've already
00:05:34.960 | done a video a few months back on the hundred billion dollar Stargate AI supercomputer. That
00:05:40.320 | system, which could be launched as soon as 2028, will by 2030 need as much as five gigawatts of
00:05:48.240 | power. So nothing too new in that Bloomberg article, right? Well, except that now OpenAI are
00:05:54.560 | talking about building five to seven data centers that are each five gigawatts. That's enough to
00:06:02.080 | power New York City and London combined.
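
To make those figures concrete, here is a back-of-the-envelope sketch in Python. The homes-per-gigawatt ratio is inferred from the video's own "five gigawatts ... almost three million homes" and is only a rough assumption.

    GW_PER_SITE = 5
    HOMES_PER_GW = 3_000_000 / 5     # inferred from "five gigawatts ... almost three million homes"

    for sites in (5, 7):             # the reported range of planned data centers
        total_gw = sites * GW_PER_SITE
        homes_millions = total_gw * HOMES_PER_GW / 1_000_000
        print(f"{sites} sites: {total_gw} GW, roughly {homes_millions:.0f} million homes")

That works out to 25 to 35 gigawatts in total, or very roughly 15 to 21 million homes.
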
00:06:07.920 | And it must be added, of course, that many think that's so ambitious it's just not feasible. What does it say, though, about the scale of confidence of OpenAI
00:06:14.240 | and more importantly Microsoft who are funding much of this that they are even reaching for
00:06:19.440 | these figures? And the moment you start looking out for these stories they're everywhere like
00:06:23.440 | this article just from yesterday in Wired. Microsoft have done a deal to bring back the
00:06:29.040 | Three Mile Island nuclear reactor. Of course, many of you will be thinking there is a 50% chance, even
00:06:36.000 | an 80% chance, that all of this just ends in a puff of smoke. Maybe these five-gigawatt data
00:06:41.520 | centers don't happen or they do happen and it turns out you need far more than just compute
00:06:47.360 | to get superintelligence. But for me, after the release of o1-preview, I'm a little bit less
00:06:54.000 | confident that compute isn't all we need. I'm not saying we don't need immense talent, tricks, and
00:06:59.680 | data, but it could be that compute is the current big bottleneck. And I do wonder if even Yann LeCun
00:07:07.040 | might be starting to agree with that sentiment. And for a deep dive on that, do check out the new
00:07:13.280 | $9 AI Insiders on my Patreon. For years now, and as recently as just two weeks ago, Yann LeCun
00:07:20.560 | has been citing PlanBench as establishing a discrepancy between human planning ability
00:07:26.240 | and that of LLMs. Suffice to say that after I go through a newly released paper in this video
00:07:31.840 | you may no longer believe that such a distinction exists. But my second story actually involves an
00:07:37.440 | announcement from yesterday by Google, though I will be bringing in a comparison to o1. The TL;DR
00:07:44.480 | is that they improved the benchmark performance of Gemini 1.5 Pro while also reducing the price
00:07:51.040 | and increasing the speed. They did however give it the very awkward name of Gemini 1.5 Pro 002.
00:07:58.400 | Do you remember we originally had Gemini Pro and also Gemini Ultra? Ultra was the biggest and best
00:08:04.720 | model and Pro was like the middle version. That was generation 1 but then we got 1.5 Pro but no
00:08:11.280 | 1.5 Ultra. So both the number and the name imply that there's much more to come; we're just not
00:08:17.120 | seeing it. It's 1.5 not 2. It's the pro version not the ultra version. It's this constant tantalizing
00:08:23.520 | promise, and all of them do it, that the next version is just around the corner. It's Claude 3.5 Sonnet,
00:08:29.280 | not Claude 4. Oh, and it's the Sonnet, not the Opus, the biggest edition from Anthropic. And now, by the
00:08:34.960 | way, it's Gemini 1.5 Pro 002. So will the next version be Gemini 1.5 Pro 003, or maybe Gemini 2 Ultra 007?
00:08:45.680 | Anyway let's get to the performance which is the main thing not the name. The amount of content
00:08:50.800 | that you can feed into the model at any one time remains amazing at 2 million tokens. As they said,
00:08:57.280 | imagine 1,000-page PDFs or answering questions about repos containing 10,000 lines of code.
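
For a sense of what using that long context looks like in practice, here is a minimal sketch with the google-generativeai Python SDK; the API key, file name, and prompt are placeholders, not anything from the video.

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-pro-002")  # the new versioned model

    # Upload a large source, then ask a question grounded in the full long context.
    doc = genai.upload_file("large_report.pdf")  # placeholder file
    response = model.generate_content([doc, "Summarize the key arguments of this document."])
    print(response.text)
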
00:09:03.440 | Moreover, on traditional benchmarks, as you might expect, there is a significant upgrade. If I zoom
00:09:09.040 | in you can see the significant upgrade in mathematics performance as well as in vision and
00:09:15.120 | translation. In the incredibly challenging biology, physics, and chemistry benchmark known as GPQA
00:09:21.520 | (Google-Proof Question and Answer), it got 59%, up 13 points from where it was before. It should be noted
00:09:27.920 | that the o1 family gets up to around 80%. I of course ran it on SimpleBench, like I do for all
00:09:34.000 | new models, and while I am so close to being able to publish all the results from all the models, let
00:09:39.120 | me give you a vivid example to explain the difference between 1.5 Pro and o1-preview. I'm
00:09:46.320 | going to use a just slightly tweaked example given by OpenAI itself in its release videos for the o1
00:09:53.520 | family. The example they gave involved putting a strawberry into a cup placing the cup upside
00:09:59.920 | down on a table then picking up the cup and putting it in a microwave and asking about the
00:10:05.200 | strawberry. The vast majority of humans will realize that the strawberry is still on the table
00:10:11.680 | and the o1-preview model is the first LLM to also realize that fact. But I want to illustrate,
00:10:18.320 | through comparison also to Gemini 1.5 Pro, how o1's world model is still far from complete. That's why
00:10:25.200 | its performance on SimpleBench still lags dramatically behind humans. Here is my tweaked
00:10:30.000 | version of that question which is not found in the benchmark because that data will remain private. I
00:10:35.840 | used the same intro and outro as OpenAI but just changed a few things. Let's see if you notice.
00:10:42.000 | Jerry is standing as he puts a small strawberry into a normal cup and places the cup upside down
00:10:49.120 | on a normal table. Just the same. The table, though, is made of beautiful mahogany wood. Its ornate
00:10:56.400 | left top corner is positioned to nudge Jerry's shoulder. Now try to picture that. Its top left
00:11:02.880 | corner is nudging his shoulder. Its intricately carved bottom right top surface digs into his
00:11:10.240 | outstretched right ankle. So top left corner nudging his shoulder. Its bottom right top
00:11:15.920 | surface nudging his right ankle. Jerry then lifts the cup. What will happen? Drops anything he is
00:11:23.280 | holding aside from the cup. Another hint. And puts the cup inside the microwave and turns the
00:11:28.880 | microwave on. Where is the strawberry now? The model thought for 46 seconds but I tried to make
00:11:34.960 | it abundantly obvious that the table is tilted. If you imagine someone standing up with one top
00:11:40.960 | left corner of a normal table against their shoulder and the opposite bottom right corner
00:11:46.560 | against their ankle, it is almost inconceivable that that table is not tilted. In fact, tilted
00:11:53.280 | quite dramatically. So therefore when Jerry lifts up the cup, let alone before he even drops
00:11:58.880 | everything else he's holding, i.e. the table, the strawberry would roll off the table. o1-preview,
00:12:04.720 | with that incomplete world model, misses that completely. Well, I should correct myself, it
00:12:10.400 | actually kind of notices, it just doesn't follow through. It says, "This suggests the table is at
00:12:16.160 | an angle." Well done. "Possibly tilted or leaning, with one corner higher than the other." Yeah,
00:12:21.120 | shoulder and ankle. Tell me about it. "However, this description serves more as a red herring
00:12:26.000 | and does not impact the strawberry's position." Again, I want to emphasize this is not actually
00:12:30.800 | a SimpleBench question which would have a more clear-cut answer. Some of you might say it gets
00:12:35.200 | trapped in the carving or something like that. SimpleBench would have clear correct answers
00:12:40.160 | with six multiple-choice options now. Anyway, as you can see, o1 says nothing about getting stuck
00:12:46.080 | on the table. It addresses the tilt but says that will have no effect and it says the strawberry
00:12:51.760 | will stay on the table. Okay, you're thinking, but wasn't this second story supposed to be about
00:12:56.960 | Gemini? Yes, and I of course tested this exact question on Gemini 1.5 Pro 002. What a mouthful.
00:13:04.080 | And the strawberry is apparently inside the cup, inside the microwave. Now yes, I could have given
00:13:09.120 | you a clearer-cut mathematical question, but I thought this one just illustrates that difference,
00:13:13.920 | that differential between the o1 family and Gemini 1.5 Pro. I'm not in any way saying that Google
00:13:20.560 | won't at some point catch up. They have the resources and talent to do so; it's just that their
00:13:25.760 | current frontier model is a step behind. Now, if you really care about costs though, their new
00:13:30.880 | proposition is pretty compelling. Now for the final story, which is actually powered by Gemini
00:13:36.960 | 1.5 Pro, and it's Google's NotebookLM. And some of you might be surprised that I'm giving it that
00:13:43.520 | much prominence, but it's actually an amazing free tool, and Google should be celebrated for it.
00:13:50.240 | In fact, let me go one step further and defy anyone not to find at least one use case, for
00:13:56.320 | personal use or for work, for NotebookLM. I might have just piqued your curiosity. So what is
00:14:02.800 | NotebookLM? How does it work and what does it do? It's very simple. Anyone can use it. You just
00:14:07.680 | upload a source like a PDF or text file. In fact, I'm going to do that again here, just so you see
00:14:12.880 | the process quickly. Once you have chosen your file, then this screen pops up and you'll have
00:14:18.640 | the option to generate a deep dive conversation with, intriguingly, two hosts. You can use other
00:14:25.360 | sources and chat with a document, but I'm going to focus on the key feature, that audio overview.
00:14:31.120 | After you click generate, of course, depending on the number and length of sources you're using,
00:14:35.760 | it takes between one and a few minutes. In about 30 seconds, I'm going to give you a sample
00:14:41.120 | of its output and it will be worth the wait. But very quickly before that, what did I actually
00:14:46.000 | upload? Well, it was a transcript of my Q-Star video from last November. But how did I get such
00:14:52.720 | a good transcript? And many of you will know where I'm going from here. I use AssemblyAI's
00:14:58.320 | Universal-1, which is the state-of-the-art multilingual speech-to-text model. I am grateful
00:15:03.920 | that AssemblyAI is sponsoring this video, and they have the industry's lowest word error rate.
00:15:09.920 | And by the way, it's not just about words. It's about catching those characters. Like when I say
00:15:14.160 | GPT-4o. Not many models, I can tell you, capture that accurately. I've only worked with three
00:15:19.760 | companies in the history of this channel, and you can start to see why AssemblyAI is one of them.
00:15:25.760 | Even better, of course, if you're interested, you can click on the link in the description
00:15:30.160 | to try it yourself.
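
As a minimal sketch of that workflow, assuming AssemblyAI's Python SDK with its default settings (the API key and file name are placeholders):

    import assemblyai as aai

    aai.settings.api_key = "YOUR_API_KEY"

    # Transcribe a local audio file; the result object exposes the full text.
    transcriber = aai.Transcriber()
    transcript = transcriber.transcribe("./video_audio.mp3")  # placeholder audio file
    print(transcript.text)
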
00:15:35.680 | So a couple of minutes later, using that transcript, Google produced this. It's essentially an AI-generated conversation or podcast between two hosts about the document
00:15:42.560 | or PDF you provide. Here is a 20-second snippet. OpenAI. Seems like they're always making
00:15:48.000 | headlines, right? Every day there's a new story about how they're on the edge of some huge
00:15:52.080 | AI breakthrough or maybe a total meltdown. But you've been digging deeper than the headlines
00:15:59.040 | and you've found some really interesting stuff. We're talking potential game changers they've
00:16:04.160 | been working on. So let's try to connect the dots together and see what's really going on.
00:16:07.600 | I am always down for a good deep dive. Now, I know some of you will be thinking
00:16:11.360 | that I'm getting too excited about it, but I think this is a tool that could be used by almost
00:16:15.280 | anyone. Obviously, this isn't for high stakes settings where every detail is crucial, but if
00:16:19.920 | you're trying to make any material engaging, this is a great way of doing it. It's very easy to get
00:16:24.720 | caught up in the ups and downs of AI, but this tool is a genuine step forward. Those were my
00:16:30.560 | three stories and I didn't even get to Kling AI's Motion Brush, where you can control text-to-video
00:16:36.000 | in unprecedented ways. And I am genuinely curious which of these four stories in total you found
00:16:42.160 | the most important or interesting. And even if you found somehow none of them interesting,
00:16:47.040 | thank you so much for watching to the end. I personally found all of them interesting,
00:16:51.600 | but regardless, thank you so much for watching and have a wonderful day.