back to indexGPT 4 Turbo-Charged? Plus Custom GPTS, Grok, AGI Tier List, Vision Demos, Whisper V3 and more
00:00:00.000 |
As you may have noticed, there have been one or two things happening in the world of AI this week. 00:00:06.700 |
We've had a bit of time to take it in, but of course it will take weeks and months to test 00:00:12.340 |
everything out. Let's bring you some of the best bits now though, from a custom GPT I made, 00:00:18.260 |
to the interestingly named new Grok, Gauss and Olympus models, Whisper V3, better text-to-video, 00:00:25.500 |
an AGI tier list from DeepMind and much more. But I'm going to start kind of randomly with the 00:00:32.460 |
update to how current ChatGPT is. We're also updating the knowledge cutoff. We are just as 00:00:38.580 |
annoyed as all of you, probably more, that GPT-4's knowledge about the world ended in 2021. We will 00:00:43.800 |
try to never let it get that out of date again. GPT-4 Turbo has knowledge about the world up to 00:00:48.540 |
April of 2023. And my only comment on that is that this means that GPT-4 has knowledge of itself. 00:00:55.480 |
For the first time. While GPT-4 was trained in August of 2022, it wasn't released until March of 00:01:03.400 |
2023. So GPT-4 should know a lot more about how it's trained and some of the latest advances in 00:01:10.300 |
AI. But how much can the new GPT-4 Turbo process at once? Well, 128,000 tokens, which is approximately 00:01:21.560 |
GPT-4 supported up to 8K and in some cases up to 32K context length. But we know that isn't enough 00:01:28.700 |
for many of you and what you want to do. GPT-4 Turbo supports up to 128,000 tokens of context. 00:01:35.060 |
That's 300 pages of a standard book, 16 times longer than our 8K context. 00:01:40.580 |
Don't assume though that just because it can process that many words that it will be equally 00:01:45.560 |
accurate at comprehending all of them. This analysis from yesterday showed that when you're 00:01:51.440 |
fact from a document, particularly one of more than about 80,000 words or 100,000 tokens, if you 00:01:57.620 |
place that fact at between 10 to 50% through the document, retrieval started to get pretty bad. If 00:02:05.120 |
the fact was towards the end of a document or you submitted fewer words or tokens, performance was 00:02:11.000 |
much higher. The good news though is that GPT-4 Turbo, the 11.06 preview, is much better than 00:02:21.140 |
So, if you're trying to retrieve a document, you're going to need to have a lot of experience. 00:02:21.140 |
So, if you're trying to retrieve a document, you're going to need to have a lot of experience. 00:02:21.140 |
So, if you're trying to retrieve a document, you're going to need to have a lot of experience. 00:02:21.140 |
So, if you're trying to retrieve a document, you're going to need to have a lot of experience. 00:02:21.140 |
So, if you're trying to retrieve a document, you're going to need to have a lot of experience. So, yes, even though performance does degrade, 00:02:26.660 |
the more tokens or words you submit, it's better than before. But again, better than before still 00:02:32.240 |
isn't even close to perfect. Surprising no one, DALI 3, GPT-4 Turbo with vision, 00:02:39.320 |
and the new text-to-speech model are all going into the API today. And I'm going to give you a 00:02:44.720 |
taste of the kind of things these new APIs unlock. We have this implementation by Robert Lukoschko, 00:02:50.840 |
integrating GPT Vision API and allowing you just to clip out bits of a screen and ask questions about 00:02:56.900 |
it. Part I'm interested in, and the GPT-4 is going to answer me, what is it? It's hip joint region. 00:03:02.720 |
And what about this part? What is it? I'm not giving it even any context. It just knows. This 00:03:10.400 |
is Schrodinger's equation. Let's try this part. What is it? Potential energy term. And let's say I'm really into the cards, but I'm not going to be able to do it. 00:03:20.540 |
But I just don't know. What is this? Your orange stick. What is this orange stick? 00:03:26.600 |
And an oil dipstick. What about webcam GPT? I think the kind of people who will use this are 00:03:34.220 |
those who want to maximize their productivity and they're going to get their webcam to check 00:03:39.200 |
if they're working. Of course, let's hope that companies don't use it, as they have been known 00:03:43.580 |
to do, to monitor their employees. And we also got a new text-to-speech model, which people are already 00:03:50.240 |
integrating with GPT Vision. Can you believe this? He's taken on the whole defense. He's a one-man 00:03:56.300 |
show, ladies and gentlemen. He shoots. Girl! Messy, messy, messy. Unbelievable. What a goal. 00:04:02.900 |
What a goal. Glorious. Absolutely glorious. The stadium explodes in joy. This is football 00:04:08.360 |
magic at its finest. And then, of course, we have Whisper version 3, which can understand 00:04:12.860 |
speech and convert it into text. Models like Whisper V3 and Conformer 2 will change how we use the internet 00:04:19.940 |
in the not too distant future. Speaking of new modalities, we're also releasing the next version 00:04:26.120 |
of our open source speech recognition model, Whisper V3, today. And it'll be coming soon to the 00:04:31.400 |
API. It features improved performance across many languages, and we think you're really going to like 00:04:35.840 |
it. This chart shows the word error rate across a range of languages for Whisper V3 and V2. A drop 00:04:43.760 |
means an improvement, and surprisingly, English isn't the most accurate language. If you happen to speak 00:04:49.640 |
much Spanish or Korean, you might be in for a better time. Of course, if you speak one of the 00:04:55.340 |
lower resource languages like Bengali, it might be a bit harder to use. Just quickly, though, 00:05:00.860 |
I did pick up on this somewhat brief remark Sam Altman made about GPT-4 Turbo being smarter. 00:05:06.860 |
GPT-4 Turbo is the industry leading model. It delivers a lot of improvements that we just 00:05:13.940 |
covered, and it's a smarter model than GPT-4. I've started my own investigations, as have 00:05:19.340 |
others, and the results aren't uniformly better. And in fact, for some use cases, GPT-4 Turbo performs 00:05:25.700 |
worse. This was a test for SAT reading by Jeffrey Wang. There does need to be far more thorough 00:05:31.460 |
investigations, but I wish they released some benchmarks along with that claim. And there were 00:05:36.440 |
two moments from the guest appearance of the CEO of Microsoft, Satya Nadella, that I found 00:05:41.780 |
particularly interesting. First, he promised OpenAI the most compute. The systems that are needed as 00:05:49.040 |
you aggressively push forward on your roadmap requires us to be on the top of our game. And 00:05:55.460 |
we intend fully to commit ourselves deeply to making sure you all, as builders of these 00:06:01.820 |
foundation models, have not only the best systems for training and inference, but the most compute 00:06:08.240 |
so that you can keep pushing forward on the frontiers. Because I think that's the way we're 00:06:13.700 |
going to make progress. Why is that relevant? Because it seems at the moment every mega corporation in the 00:06:18.740 |
world is going to come out with their own language model. Many of you may have seen the publicity over 00:06:23.420 |
Grok from XAI. It's been trained on Twitter and apparently it replies to more spicy questions. They 00:06:29.780 |
tested the new Grok 1 across a range of benchmarks, although some people have complained about the 00:06:34.880 |
settings they used. Either way, on these benchmarks, it did seem to be somewhat firmly beating GPT 3.5, 00:06:41.060 |
the original ChatGPT. But notice that on these benchmarks, it was falling way behind CLAWD 2. That's on the MMLU 00:06:48.440 |
testing for general knowledge, human eval for coding, and GSM 8K for mathematics. They also gave 00:06:55.220 |
this human graded evaluation comparison on an exam produced after all of these models were trained. 00:07:01.400 |
And on that, Grok 1 outperforms even CLAWD 2 and way outperforms GPT 3.5. Now many have complained 00:07:08.600 |
about the way that the models were evaluated. But I've talked before with SmartGPT about the 00:07:14.600 |
discrepancies that can arise with human grading and automatic grading. 00:07:18.140 |
This just reaffirms the need for me for new authoritative hand graded benchmarks. Pricy, 00:07:24.620 |
but well worth it. And they're not the only one with Amazon building a model called Olympus, 00:07:29.840 |
which apparently is going to have 2 trillion parameters. This article says that's twice the 00:07:35.000 |
size of GPT 4. But according to what I've seen, it's more like about 10% bigger. Problem is, 00:07:39.740 |
of course, it's not all about the parameter size. It's about how you use those parameters. 00:07:44.540 |
So I don't personally predict that the Amazon model will be more 00:07:47.840 |
powerful than GPT 4. And even Samsung are getting into the game with a model that can also produce 00:07:54.020 |
text, code and images. That's apparently going to be incorporated into their next round of phones, 00:07:59.480 |
and I will be there to test it, hopefully. But back to the OpenAI Dev Day, where they confirmed 00:08:05.180 |
that we can now use different modalities of GPT 4 all in the same window. ChatGPT now uses GPT 4 00:08:12.800 |
Turbo with all the latest improvements, including the latest knowledge cutoff, which will continue 00:08:16.880 |
to update. That's all I have to say. I'm going to go back to the chat. I'm going to go back to the 00:08:17.540 |
chat. That's all live today. It can now browse the web when it needs to, write and run code, 00:08:22.700 |
analyze data, take and generate images, and much more. And we heard your feedback, 00:08:27.020 |
that model picker, extremely annoying. That is gone starting today. You will not have to click 00:08:31.640 |
around the dropdown menu. All of this will just work together. I decided to test that out and 00:08:36.560 |
ask three things in one. All in one prompt, I asked, "Use Bing to get the live market cap 00:08:42.380 |
of Microsoft and Apple. Then use Code Interpreter to calculate the percent difference." 00:08:47.240 |
I definitely don't trust the base GPT 4 to get that calculation correct. And then I asked, "Output an 00:08:53.420 |
image that captures which market cap is bigger." Now, of course, market capitalization changes all 00:08:58.580 |
the time, but these figures are broadly accurate. And this Apple figure is indeed 5.32% bigger than 00:09:05.420 |
a Microsoft figure. But it was the final part incorporating DALI 3 that I found the most 00:09:10.940 |
impressive. It created the image you can see here with the Apple skyscraper being slightly bigger. I 00:09:16.940 |
would argue around 5% bigger than the Microsoft tower, albeit with the hallucination of an A in 00:09:23.900 |
Microsoft. It was around this point in the presentation that Sam Altman mentioned AGI. 00:09:29.060 |
But before I take a look at that comment and a new AGI tier list paper from Google DeepMind, 00:09:35.360 |
I thought we'd take a brief interlude and enjoy this incredible update to Gen 2 from Runway ML. 00:09:46.640 |
At the end of the video, you're going to see an even more impressive demo, 00:10:02.660 |
I think, of this technology. But just before we get back to Sam, what do you think about 00:10:07.280 |
a physical device for video editing generated by AI? This is called the First AI machine. 00:10:14.600 |
"I'm going to start with the first AI machine." 00:10:52.340 |
Now here's Sam Altman's quick comment to Satya Nadella on AGI. 00:10:57.340 |
Well I think we have the best partnership in tech. I'm excited for us to build AGI together. 00:11:00.340 |
I'm really excited. Have a fantastic time. Thank you very much for coming. 00:11:03.340 |
That brought to mind a paper I read from Google DeepMind 00:11:05.340 |
that said, "The most important thing in the world is to build a company that is capable of doing everything you want to do." 00:11:06.340 |
And I think that's what I brought to mind this week. 00:11:07.340 |
It's called "Levels of AGI" and it's very much based on the idea of the levels of autonomous driving. 00:11:13.340 |
And I do admire their intent to create a more clear definition of AGI. 00:11:17.340 |
For example, Level 2: Competent AGI. Level 4: Virtuoso AGI. 00:11:22.340 |
With so many AGI predictions out there, as shown in my previous video, clear definitions are quite important. 00:11:28.340 |
But I do have some questions for the authors that I would want clarified. 00:11:32.340 |
For example, they say, "Competent AGI, which has not yet been achieved, 00:11:35.340 |
is when the AI is at least 50th percentile of skilled adults. 00:11:41.340 |
That compares to emerging AGI, which is equal or somewhat better than an unskilled human." 00:11:47.340 |
And in one of their later examples, they talk about mathematics. 00:11:50.340 |
But what counts as being skilled in mathematics? 00:11:53.340 |
Are we talking about being better than the median high school graduate in mathematics? 00:11:57.340 |
Being better than the median college student in mathematics? 00:12:01.340 |
Or are we talking being better than the median professional mathematician? 00:12:04.340 |
To me, they never made that clear and it makes a big difference to the benchmark. 00:12:09.340 |
If we're just talking about those graduating high school or secondary school, 00:12:12.340 |
GPT-4 is already better than the median person there in mathematics. 00:12:16.340 |
If we're talking about elite professional mathematicians, 00:12:19.340 |
when it's better than the 50th percentile of those, 00:12:22.340 |
we are getting really close to superintelligence. 00:12:26.340 |
Also, the authors confused me somewhat on page 7 when they said this: 00:12:30.340 |
"A competent AGI must have performance at least at the 50th percentile 00:12:33.340 |
for skilled adult humans on most cognitive tasks." 00:12:39.340 |
So it's not about being average at all tasks, 00:12:42.340 |
it's apparently about being average at most tasks. 00:12:45.340 |
But I could well imagine a situation where, say, GPT-5 is well above average for 90% of tasks, 00:12:51.340 |
but its capacity for purely abstract reasoning is down at the 10th or 20th percentile. 00:12:57.340 |
And even though that would meet this definition of a competent AGI, 00:13:04.340 |
In that scenario, it wouldn't fundamentally be understanding what it's performing so well in. 00:13:13.340 |
but doesn't go all the way to a solution for me. 00:13:18.340 |
"The rate of progression between levels of performance and/or generality may be non-linear, 00:13:26.340 |
Acquiring the capability to learn new skills may particularly accelerate progress toward the next level." 00:13:31.340 |
Going back to that table, I think that's something really worth noting. 00:13:34.340 |
Imagine a virtuoso AGI that's better than 99% of skilled people across a range of domains. 00:13:42.340 |
But those skills would include machine learning, chip design, curriculum learning. 00:13:46.340 |
And of course, that virtuoso AGI can just be duplicated, so there's a billion of them running. 00:13:51.340 |
And surely a billion virtuoso AGIs would be able to create the next level, ASI, much quicker than they themselves were created. 00:14:00.340 |
It seems like others are getting that sense of acceleration too, 00:14:04.340 |
with the co-founder and CEO of RunwayML, as we've seen, saying: 00:14:09.340 |
"Decades worth of progress happened this year. 00:14:12.340 |
A year's worth of progress is now happening in months. 00:14:15.340 |
Months worth of progress will start to happen in days." 00:14:18.340 |
I personally don't think we have quite reached that point in AI, 00:14:24.340 |
Going back to the paper, toward the end, it read a bit like a screenplay for X-Men, 00:14:29.340 |
but it's not quite as good as it was in the first place. 00:14:32.340 |
In the paper, the author of "Artificial Superintelligence", the paper said: 00:14:35.340 |
"Non-human skills that an ASI might have could include capabilities such as neural interfaces, 00:14:41.340 |
perhaps through mechanisms such as analysing brain signals to decode thoughts, 00:14:45.340 |
oracular abilities, being an oracle, perhaps through mechanisms such as analysing large volumes of data to make high quality predictions, 00:14:52.340 |
or the ability to communicate with animals, that would be amazing, 00:14:55.340 |
perhaps by mechanisms such as analysing patterns in their vocalisations, brainwaves, or body language." 00:14:58.340 |
The paper also talked about the level of autonomy that we might give to AI, 00:15:04.340 |
But I bet David Shapiro has got something to say about the possible societal scale on "we", 00:15:10.340 |
that's boredom or lethargy, that might accompany expert AGI that drives interaction, 00:15:15.340 |
with humans only providing guidance and feedback. 00:15:18.340 |
Anyway, let's get back to what you can do now, which is create your own chatbot, your own GPT. 00:15:24.340 |
I luckily got early access to this and have been playing about a bit with it. 00:15:27.340 |
First thing I did was create a bot of Ivan Ilyich, the star of an amazing short story by Tolstoy, The Death of Ivan Ilyich. 00:15:36.340 |
All I did was create a description and instructions for GPT-4 Turbo and created an image of Ivan Ilyich with DALI-3. 00:15:44.340 |
I gave it knowledge of the entire short story by just uploading a document with the short story. 00:15:49.340 |
I think I'm allowed to do that in terms of copyright, but copyright doesn't seem to mean much anymore. 00:15:54.340 |
Anyway, all I wanted to really demonstrate was that 00:15:56.340 |
this Ivan Ilyich bot would know the story better than base GPT-4. 00:16:05.340 |
For how many days did Ivan Ilyich scream in the famous short story The Death of Ivan Ilyich? 00:16:10.340 |
It seems to think I'm quoting a document and says "error reading documents", but anyway can't answer the question. 00:16:17.340 |
That short story probably is in its pre-training data, but it can't retrieve the information well enough. 00:16:25.340 |
Even with such a direct question as "For how many days did you scream?" 00:16:30.340 |
Acting in character it said "For the last three days of my life". 00:16:33.340 |
And that is correct based on the short story. 00:16:36.340 |
What I'm now working on, probably with thousands of other creators, is an AI explained bot. 00:16:42.340 |
Getting the transcripts of my video and turning it into an imitative chatbot. 00:16:47.340 |
I asked, well myself, what sort of timelines are people using for AGI? 00:16:56.340 |
As you might expect, I've been testing out many of the other GPTs, including those made by third parties. 00:17:02.340 |
But honestly, the OpenAI GPTs don't seem that much better than the generic version. 00:17:08.340 |
For example, I tried out the MathMentor, and it couldn't seem to answer math questions any better than generic GPT-4. 00:17:15.340 |
I also tested the Meme GPT, and it didn't seem to understand any more memes than the base GPT. 00:17:23.340 |
we are going to get a GPT store, so there should be many more GPTs to test out soon. 00:17:29.340 |
According to Sam Altman, bringing in just a smidgen of hype, 00:17:32.340 |
all of these developments are just the start. 00:17:35.340 |
As intelligence gets integrated everywhere, we will all have superpowers on demand. 00:17:42.340 |
What we launch today is going to look very quaint relative to what we're busy creating for you now. 00:17:47.340 |
Thank you so much for watching to the end of the video, 00:17:50.340 |
and I'm going to leave you with this creation 00:17:54.340 |
They use Midjourney, The New Runway, and Stable Audio.