
GPT-4 Turbo-Charged? Plus Custom GPTs, Grok, AGI Tier List, Vision Demos, Whisper V3 and more



00:00:00.000 | As you may have noticed, there have been one or two things happening in the world of AI this week.
00:00:06.700 | We've had a bit of time to take it in, but of course it will take weeks and months to test
00:00:12.340 | everything out. Let's bring you some of the best bits now though, from a custom GPT I made,
00:00:18.260 | to the interestingly named new Grok, Gauss and Olympus models, Whisper V3, better text-to-video,
00:00:25.500 | an AGI tier list from DeepMind and much more. But I'm going to start kind of randomly with the
00:00:32.460 | update to how current ChatGPT is. We're also updating the knowledge cutoff. We are just as
00:00:38.580 | annoyed as all of you, probably more, that GPT-4's knowledge about the world ended in 2021. We will
00:00:43.800 | try to never let it get that out of date again. GPT-4 Turbo has knowledge about the world up to
00:00:48.540 | April of 2023. And my only comment on that is that this means that GPT-4 has knowledge of itself.
00:00:55.480 | For the first time. While GPT-4 was trained in August of 2022, it wasn't released until March of
00:01:03.400 | 2023. So GPT-4 should know a lot more about how it's trained and some of the latest advances in
00:01:10.300 | AI. But how much can the new GPT-4 Turbo process at once? Well, 128,000 tokens, which is approximately
00:01:19.180 | 100,000 words in English.
00:01:21.560 | GPT-4 supported up to 8K and in some cases up to 32K context length. But we know that isn't enough
00:01:28.700 | for many of you and what you want to do. GPT-4 Turbo supports up to 128,000 tokens of context.
00:01:35.060 | That's 300 pages of a standard book, 16 times longer than our 8K context.
00:01:40.580 | Don't assume though that just because it can process that many words that it will be equally
00:01:45.560 | accurate at comprehending all of them. This analysis from yesterday showed that when you're
00:01:50.000 | trying to retrieve a
00:01:51.440 | fact from a document, particularly one of more than about 80,000 words or 100,000 tokens, if you
00:01:57.620 | place that fact between 10% and 50% of the way through the document, retrieval started to get pretty bad. If
00:02:05.120 | the fact was towards the end of a document or you submitted fewer words or tokens, performance was
00:02:11.000 | much higher. The good news though is that GPT-4 Turbo, the 1106 preview, is much better than
00:02:18.380 | previous iterations of GPT-4 at
00:02:21.140 | retrieval.
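For anyone wanting to reproduce that analysis, it follows a simple "needle in a haystack" recipe. Below is a minimal sketch, assuming the openai Python package (v1.x); the filler text is a placeholder standing in for a real document, and grading the answers is left out.

```python
# Minimal "needle in a haystack" retrieval test of the kind described above.
# Assumes the openai Python package (v1.x) and an OPENAI_API_KEY in the
# environment; the filler text is a placeholder for a real long document.
from openai import OpenAI

client = OpenAI()

NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
FILLER = "The quick brown fox jumps over the lazy dog. " * 4000  # long haystack

def retrieval_test(depth: float) -> str:
    """Plant NEEDLE at `depth` (0.0 = start, 1.0 = end) and ask for it back."""
    cut = int(len(FILLER) * depth)
    document = FILLER[:cut] + NEEDLE + FILLER[cut:]
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{
            "role": "user",
            "content": document + "\n\nWhat is the best thing to do in San Francisco?",
        }],
    )
    return response.choices[0].message.content

for depth in (0.1, 0.3, 0.5, 0.9):
    print(f"depth {depth:.0%}:", retrieval_test(depth))
```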
00:02:21.140 | So, yes, even though performance does degrade,
00:02:26.660 | the more tokens or words you submit, it's better than before. But again, better than before still
00:02:32.240 | isn't even close to perfect. Surprising no one, DALL-E 3, GPT-4 Turbo with vision,
00:02:39.320 | and the new text-to-speech model are all going into the API today. And I'm going to give you a
00:02:44.720 | taste of the kind of things these new APIs unlock. We have this implementation by Robert Lukoschko,
00:02:50.840 | integrating GPT Vision API and allowing you just to clip out bits of a screen and ask questions about
00:02:56.900 | it. Part I'm interested in, and GPT-4 is going to answer me, what is it? It's the hip joint region.
00:03:02.720 | And what about this part? What is it? I'm not even giving it any context. It just knows. This
00:03:10.400 | is Schrödinger's equation. Let's try this part. What is it? The potential energy term. And let's say I'm really into cars,
00:03:20.540 | but I just don't know what this is. What is this? An orange stick. What is this orange stick?
00:03:26.600 | An oil dipstick.
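Under the hood, demos like this one pass a cropped image plus a question to the new vision model. Here's a minimal sketch of that call, assuming the openai Python package (v1.x); "screenshot.png" is a hypothetical file standing in for the clipped region.

```python
# Minimal sketch of asking GPT-4 with vision about a clipped screenshot.
# Assumes the openai Python package (v1.x); "screenshot.png" is a
# hypothetical local file standing in for the clipped region.
import base64
from openai import OpenAI

client = OpenAI()

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is this?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```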
00:03:34.220 | What about webcam GPT? I think the kind of people who will use this are those who want to maximize their productivity and they're going to get their webcam to check
00:03:39.200 | if they're working. Of course, let's hope that companies don't use it, as they have been known
00:03:43.580 | to do, to monitor their employees. And we also got a new text-to-speech model, which people are already
00:03:50.240 | integrating with GPT Vision. Can you believe this? He's taken on the whole defense. He's a one-man
00:03:56.300 | show, ladies and gentlemen. He shoots. Goal! Messi, Messi, Messi. Unbelievable. What a goal.
00:04:02.900 | What a goal. Glorious. Absolutely glorious. The stadium explodes in joy. This is football
00:04:08.360 | magic at its finest. And then, of course, we have Whisper version 3, which can understand
00:04:12.860 | speech and convert it into text. Models like Whisper V3 and Conformer 2 will change how we use the internet
00:04:19.940 | in the not too distant future. Speaking of new modalities, we're also releasing the next version
00:04:26.120 | of our open source speech recognition model, Whisper V3, today. And it'll be coming soon to the
00:04:31.400 | API. It features improved performance across many languages, and we think you're really going to like
00:04:35.840 | it. This chart shows the word error rate across a range of languages for Whisper V3 and V2. A drop
00:04:43.760 | means an improvement, and surprisingly, English isn't the most accurate language. If you happen to speak
00:04:49.640 | much Spanish or Korean, you might be in for a better time. Of course, if you speak one of the
00:04:55.340 | lower-resource languages like Bengali, it might be a bit harder to use.
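Until it reaches the API, the open-source release can be run locally. A minimal sketch, assuming the openai-whisper package; "speech.mp3" is a hypothetical local audio file.

```python
# Minimal sketch of running the open-source Whisper large-v3 locally.
# Assumes the openai-whisper package (pip install -U openai-whisper);
# "speech.mp3" is a hypothetical local audio file.
import whisper

model = whisper.load_model("large-v3")  # downloads weights on first run
result = model.transcribe("speech.mp3")
print(result["text"])      # the transcription
print(result["language"])  # auto-detected language code
```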
00:05:00.860 | Just quickly, though, I did pick up on this somewhat brief remark Sam Altman made about GPT-4 Turbo being smarter.
00:05:06.860 | GPT-4 Turbo is the industry leading model. It delivers a lot of improvements that we just
00:05:13.940 | covered, and it's a smarter model than GPT-4. I've started my own investigations, as have
00:05:19.340 | others, and the results aren't uniformly better. And in fact, for some use cases, GPT-4 Turbo performs
00:05:25.700 | worse. This was a test for SAT reading by Jeffrey Wang. There need to be far more thorough
00:05:31.460 | investigations, but I wish they had released some benchmarks along with that claim. And there were
00:05:36.440 | two moments from the guest appearance of the CEO of Microsoft, Satya Nadella, that I found
00:05:41.780 | particularly interesting. First, he promised OpenAI the most compute. The systems that are needed as
00:05:49.040 | you aggressively push forward on your roadmap requires us to be on the top of our game. And
00:05:55.460 | we intend fully to commit ourselves deeply to making sure you all, as builders of these
00:06:01.820 | foundation models, have not only the best systems for training and inference, but the most compute
00:06:08.240 | so that you can keep pushing forward on the frontiers. Because I think that's the way we're
00:06:13.700 | going to make progress. Why is that relevant? Because it seems at the moment every mega corporation in the
00:06:18.740 | world is going to come out with their own language model. Many of you may have seen the publicity over
00:06:23.420 | Grok from XAI. It's been trained on Twitter and apparently it replies to more spicy questions. They
00:06:29.780 | tested the new Grok 1 across a range of benchmarks, although some people have complained about the
00:06:34.880 | settings they used. Either way, on these benchmarks, it did seem to be somewhat firmly beating GPT-3.5,
00:06:41.060 | the original ChatGPT. But notice that on these benchmarks, it was falling way behind Claude 2. That's on MMLU,
00:06:48.440 | testing general knowledge, HumanEval for coding, and GSM8K for mathematics. They also gave
00:06:55.220 | this human graded evaluation comparison on an exam produced after all of these models were trained.
00:07:01.400 | And on that, Grok 1 outperforms even Claude 2 and way outperforms GPT-3.5. Now many have complained
00:07:08.600 | about the way that the models were evaluated. But I've talked before with SmartGPT about the
00:07:14.600 | discrepancies that can arise with human grading and automatic grading.
00:07:18.140 | This just reaffirms, for me, the need for new authoritative hand-graded benchmarks. Pricey,
00:07:24.620 | but well worth it. And they're not the only one with Amazon building a model called Olympus,
00:07:29.840 | which apparently is going to have 2 trillion parameters. This article says that's twice the
00:07:35.000 | size of GPT-4. But according to what I've seen, it's more like about 10% bigger. Problem is,
00:07:39.740 | of course, it's not all about the parameter size. It's about how you use those parameters.
00:07:44.540 | So I don't personally predict that the Amazon model will be more
00:07:47.840 | powerful than GPT-4. And even Samsung are getting into the game with a model that can also produce
00:07:54.020 | text, code and images. That's apparently going to be incorporated into their next round of phones,
00:07:59.480 | and I will be there to test it, hopefully. But back to the OpenAI Dev Day, where they confirmed
00:08:05.180 | that we can now use different modalities of GPT-4 all in the same window. ChatGPT now uses GPT-4
00:08:12.800 | Turbo with all the latest improvements, including the latest knowledge cutoff, which will continue
00:08:16.880 | to update. That's all live today. It can now browse the web when it needs to, write and run code,
00:08:22.700 | analyze data, take and generate images, and much more. And we heard your feedback,
00:08:27.020 | that model picker, extremely annoying. That is gone starting today. You will not have to click
00:08:31.640 | around the dropdown menu. All of this will just work together. I decided to test that out and
00:08:36.560 | ask three things in one. All in one prompt, I asked, "Use Bing to get the live market cap
00:08:42.380 | of Microsoft and Apple. Then use Code Interpreter to calculate the percent difference."
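For reference, the calculation being delegated to Code Interpreter is just a percent difference. Here's a tiny sketch with illustrative placeholder figures; live market caps will differ.

```python
# The percent-difference step delegated to Code Interpreter, with
# illustrative placeholder figures; live market caps will differ.
apple_cap = 2.85e12      # hypothetical ~$2.85T
microsoft_cap = 2.71e12  # hypothetical ~$2.71T

pct_bigger = (apple_cap - microsoft_cap) / microsoft_cap * 100
print(f"Apple's market cap is {pct_bigger:.2f}% bigger")  # ~5.17% with these figures
```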
00:08:47.240 | I definitely don't trust the base GPT-4 to get that calculation correct. And then I asked, "Output an
00:08:53.420 | image that captures which market cap is bigger." Now, of course, market capitalization changes all
00:08:58.580 | the time, but these figures are broadly accurate. And this Apple figure is indeed 5.32% bigger than
00:09:05.420 | the Microsoft figure. But it was the final part incorporating DALL-E 3 that I found the most
00:09:10.940 | impressive. It created the image you can see here with the Apple skyscraper being slightly bigger. I
00:09:16.940 | would argue around 5% bigger than the Microsoft tower, albeit with the hallucination of an A in
00:09:23.900 | Microsoft. It was around this point in the presentation that Sam Altman mentioned AGI.
00:09:29.060 | But before I take a look at that comment and a new AGI tier list paper from Google DeepMind,
00:09:35.360 | I thought we'd take a brief interlude and enjoy this incredible update to Gen 2 from Runway ML.
00:09:46.640 | At the end of the video, you're going to see an even more impressive demo,
00:10:02.660 | I think, of this technology. But just before we get back to Sam, what do you think about
00:10:07.280 | a physical device for video editing generated by AI? This is called the First AI Machine.
00:10:14.600 | "I'm going to start with the first AI machine."
00:10:16.340 | ♪♪
00:10:26.340 | ♪♪
00:10:36.340 | ♪♪
00:10:46.340 | ♪♪
00:10:52.340 | Now here's Sam Altman's quick comment to Satya Nadella on AGI.
00:10:57.340 | Well I think we have the best partnership in tech. I'm excited for us to build AGI together.
00:11:00.340 | I'm really excited. Have a fantastic time. Thank you very much for coming.
00:11:03.340 | That brought to mind a paper from Google DeepMind that I read this week.
00:11:07.340 | It's called "Levels of AGI" and it's very much based on the idea of the levels of autonomous driving.
00:11:13.340 | And I do admire their intent to create a more clear definition of AGI.
00:11:17.340 | For example, Level 2: Competent AGI. Level 4: Virtuoso AGI.
00:11:22.340 | With so many AGI predictions out there, as shown in my previous video, clear definitions are quite important.
00:11:28.340 | But I do have some questions for the authors that I would want clarified.
00:11:32.340 | For example, they say, "Competent AGI, which has not yet been achieved,
00:11:35.340 | is when the AI is at least at the 50th percentile of skilled adults.
00:11:41.340 | That compares to emerging AGI, which is equal to or somewhat better than an unskilled human."
00:11:47.340 | And in one of their later examples, they talk about mathematics.
00:11:50.340 | But what counts as being skilled in mathematics?
00:11:53.340 | Are we talking about being better than the median high school graduate in mathematics?
00:11:57.340 | Being better than the median college student in mathematics?
00:12:01.340 | Or are we talking about being better than the median professional mathematician?
00:12:04.340 | To me, they never made that clear and it makes a big difference to the benchmark.
00:12:09.340 | If we're just talking about those graduating high school or secondary school,
00:12:12.340 | GPT-4 is already better than the median person there in mathematics.
00:12:16.340 | If we're talking about elite professional mathematicians,
00:12:19.340 | when it's better than the 50th percentile of those,
00:12:22.340 | we are getting really close to superintelligence.
00:12:26.340 | Also, the authors confused me somewhat on page 7 when they said this:
00:12:30.340 | "A competent AGI must have performance at least at the 50th percentile
00:12:33.340 | for skilled adult humans on most cognitive tasks."
00:12:39.340 | So it's not about being average at all tasks,
00:12:42.340 | it's apparently about being average at most tasks.
00:12:45.340 | But I could well imagine a situation where, say, GPT-5 is well above average for 90% of tasks,
00:12:51.340 | but its capacity for purely abstract reasoning is down at the 10th or 20th percentile.
00:12:57.340 | And even though that would meet this definition of a competent AGI,
00:13:00.340 | many people would say,
00:13:02.340 | "It's not really AGI."
00:13:04.340 | In that scenario, it wouldn't fundamentally be understanding what it's performing so well in.
00:13:09.340 | So again, definitions really do matter,
00:13:11.340 | and this is a step forward,
00:13:13.340 | but doesn't go all the way to a solution for me.
00:13:16.340 | I did also note this on page 7:
00:13:18.340 | "The rate of progression between levels of performance and/or generality may be non-linear,
00:13:24.340 | not in a straight line.
00:13:26.340 | Acquiring the capability to learn new skills may particularly accelerate progress toward the next level."
00:13:31.340 | Going back to that table, I think that's something really worth noting.
00:13:34.340 | Imagine a virtuoso AGI that's better than 99% of skilled people across a range of domains.
00:13:42.340 | But those skills would include machine learning, chip design, curriculum learning.
00:13:46.340 | And of course, that virtuoso AGI can just be duplicated, so there's a billion of them running.
00:13:51.340 | And surely a billion virtuoso AGIs would be able to create the next level, ASI, much quicker than they themselves were created.
00:14:00.340 | It seems like others are getting that sense of acceleration too,
00:14:04.340 | with the co-founder and CEO of RunwayML, as we've seen, saying:
00:14:09.340 | "Decades worth of progress happened this year.
00:14:12.340 | A year's worth of progress is now happening in months.
00:14:15.340 | Months worth of progress will start to happen in days."
00:14:18.340 | I personally don't think we have quite reached that point in AI,
00:14:21.340 | but the momentum is definitely building.
00:14:24.340 | Going back to the paper, toward the end, it read a bit like a screenplay for X-Men.
00:14:32.340 | When describing artificial superintelligence, or ASI, the paper said:
00:14:35.340 | "Non-human skills that an ASI might have could include capabilities such as neural interfaces,
00:14:41.340 | perhaps through mechanisms such as analysing brain signals to decode thoughts,
00:14:45.340 | oracular abilities, being an oracle, perhaps through mechanisms such as analysing large volumes of data to make high quality predictions,
00:14:52.340 | or the ability to communicate with animals, that would be amazing,
00:14:55.340 | perhaps by mechanisms such as analysing patterns in their vocalisations, brainwaves, or body language."
00:14:58.340 | The paper also talked about the level of autonomy that we might give to AI,
00:15:02.340 | again graded across five levels.
00:15:04.340 | But I bet David Shapiro has got something to say about the possible societal-scale ennui,
00:15:10.340 | that's boredom or lethargy, that might accompany an expert AGI that drives the interaction,
00:15:15.340 | with humans only providing guidance and feedback.
00:15:18.340 | Anyway, let's get back to what you can do now, which is create your own chatbot, your own GPT.
00:15:24.340 | I luckily got early access to this and have been playing about a bit with it.
00:15:27.340 | First thing I did was create a bot of Ivan Ilyich, the star of an amazing short story by Tolstoy, The Death of Ivan Ilyich.
00:15:36.340 | All I did was write a description and instructions for GPT-4 Turbo and create an image of Ivan Ilyich with DALL-E 3.
00:15:44.340 | I gave it knowledge of the entire short story by just uploading a document with the short story.
00:15:49.340 | I think I'm allowed to do that in terms of copyright, but copyright doesn't seem to mean much anymore.
00:15:54.340 | Anyway, all I wanted to really demonstrate was that
00:15:56.340 | this Ivan Ilyich bot would know the story better than base GPT-4.
00:16:02.340 | For example, here is normal GPT-4.
00:16:05.340 | For how many days did Ivan Ilyich scream in the famous short story The Death of Ivan Ilyich?
00:16:10.340 | It seems to think I'm quoting a document and says "error reading documents", but anyway it can't answer the question.
00:16:17.340 | That short story probably is in its pre-training data, but it can't retrieve the information well enough.
00:16:23.340 | But my Ivan Ilyich bot can.
00:16:25.340 | Even with such a direct question as "For how many days did you scream?"
00:16:30.340 | Acting in character it said "For the last three days of my life".
00:16:33.340 | And that is correct based on the short story.
00:16:36.340 | What I'm now working on, probably with thousands of other creators, is an AI explained bot.
00:16:42.340 | Getting the transcripts of my videos and turning them into an imitative chatbot.
00:16:47.340 | I asked, well myself, what sort of timelines are people using for AGI?
00:16:52.340 | And drawing on one of my recent videos,
00:16:54.340 | it outputted a fairly decent answer.
00:16:56.340 | As you might expect, I've been testing out many of the other GPTs, including those made by third parties.
00:17:02.340 | But honestly, the OpenAI GPTs don't seem that much better than the generic version.
00:17:08.340 | For example, I tried out the MathMentor, and it couldn't seem to answer math questions any better than generic GPT-4.
00:17:15.340 | I also tested the Meme GPT, and it didn't seem to understand any more memes than the base GPT.
00:17:22.340 | Apparently, later this month,
00:17:23.340 | we are going to get a GPT store, so there should be many more GPTs to test out soon.
00:17:29.340 | According to Sam Altman, bringing in just a smidgen of hype,
00:17:32.340 | all of these developments are just the start.
00:17:35.340 | As intelligence gets integrated everywhere, we will all have superpowers on demand.
00:17:40.340 | We hope that you'll come back next year.
00:17:42.340 | What we launch today is going to look very quaint relative to what we're busy creating for you now.
00:17:47.340 | Thank you so much for watching to the end of the video,
00:17:50.340 | and I'm going to leave you with this creation
00:17:52.340 | from InfiniteYe.
00:17:54.340 | They used Midjourney, the new Runway, and Stable Audio.
00:17:58.340 | Have a wonderful day.
00:18:00.340 | See you next time.