
GPT 4 Turbo-Charged? Plus Custom GPTs, Grok, AGI Tier List, Vision Demos, Whisper V3 and more


Transcript

As you may have noticed, there have been one or two things happening in the world of AI this week. We've had a bit of time to take it in, but of course it will take weeks and months to test everything out. Let's bring you some of the best bits now though, from a custom GPT I made, to the interestingly named new Grok, Gauss and Olympus models, Whisper V3, better text-to-video, an AGI tier list from DeepMind and much more.

But I'm going to start, kind of randomly, with the update to how up to date ChatGPT is. We're also updating the knowledge cutoff. We are just as annoyed as all of you, probably more, that GPT-4's knowledge about the world ended in 2021. We will try to never let it get that out of date again.

GPT-4 Turbo has knowledge about the world up to April of 2023. And my only comment on that is that this means that, for the first time, GPT-4 has knowledge of itself. While GPT-4 finished training in August of 2022, it wasn't released until March of 2023, so the new model should know a lot more about how it was trained and some of the latest advances in AI.

But how much can the new GPT-4 Turbo process at once? Well, 128,000 tokens, which is approximately 100,000 words in English. GPT-4 supported up to 8K and in some cases up to 32K context length. But we know that isn't enough for many of you and what you want to do.

GPT-4 Turbo supports up to 128,000 tokens of context. That's 300 pages of a standard book, 16 times longer than our 8K context. Don't assume, though, that just because it can process that many words it will be equally accurate at comprehending all of them. This analysis from yesterday showed that when you're trying to retrieve a fact from a document, particularly one of more than about 80,000 words or 100,000 tokens, placing that fact between 10% and 50% of the way through the document made retrieval pretty bad.
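
For anyone who wants to reproduce that kind of check, here is a rough sketch of a needle-in-a-haystack test against the new model via the standard chat completions API. The planted fact, the filler text, and the depth fractions are my own illustrative assumptions, not the setup from that analysis.

```python
# A rough "needle in a haystack" sketch: plant a fact at different depths in a
# long filler document and ask gpt-4-1106-preview to retrieve it. The fact,
# the filler text, and the depth fractions are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

FACT = "The secret passphrase for this experiment is 'blue giraffe 7421'."
FILLER = "This sentence is neutral filler text about nothing in particular. " * 5000

def build_document(depth: float) -> str:
    """Insert FACT at a fractional depth through the filler (0.0 = start, 1.0 = end)."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + "\n" + FACT + "\n" + FILLER[cut:]

for depth in (0.1, 0.3, 0.5, 0.9):
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "Answer using only the provided document."},
            {"role": "user", "content": build_document(depth)
                + "\n\nWhat is the secret passphrase for this experiment?"},
        ],
    )
    print(depth, response.choices[0].message.content)
```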

If the fact was towards the end of the document, or if you submitted fewer words or tokens, performance was much higher. The good news, though, is that GPT-4 Turbo, the 1106 preview, is much better than previous iterations of GPT-4 at retrieval.

So, yes, even though performance does degrade the more tokens or words you submit, it's better than before. But again, better than before still isn't even close to perfect. Surprising no one, DALL-E 3, GPT-4 Turbo with Vision, and the new text-to-speech model are all going into the API today.

And I'm going to give you a taste of the kind of things these new APIs unlock. We have this implementation by Robert Lukoschko, integrating the GPT Vision API and allowing you to just clip out bits of a screen and ask questions about them. Here's the part I'm interested in, and GPT-4 is going to answer me: what is it?

It's the hip joint region. And what about this part? What is it? I'm not even giving it any context; it just knows. This is Schrödinger's equation. Let's try this part. What is it? The potential energy term.

And let's say I'm really into cars, but I just don't know: what is this orange stick? An oil dipstick. What about webcam GPT? I think the kind of people who will use this are those who want to maximize their productivity, and they're going to get their webcam to check if they're working.
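
Both the screen-clipping demo and webcam GPT are essentially thin wrappers around the new vision endpoint. As a rough sketch, assuming a cropped image saved locally and an illustrative question, a call might look like this:

```python
# A minimal sketch of asking GPT-4 with vision about a cropped screenshot.
# The filename "clip.png" and the question are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()

with open("clip.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this clipped region of my screen?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=300,  # the vision preview defaults to a very low completion limit
)
print(response.choices[0].message.content)
```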

Of course, let's hope that companies don't use tools like that, as they have been known to do, to monitor their employees. And we also got a new text-to-speech model, which people are already integrating with GPT Vision. Can you believe this? He's taken on the whole defense. He's a one-man show, ladies and gentlemen.

He shoots. Goal! Messi, Messi, Messi. Unbelievable. What a goal. What a goal. Glorious. Absolutely glorious. The stadium explodes in joy. This is football magic at its finest. And then, of course, we have Whisper version 3, which can understand speech and convert it into text. Models like Whisper V3 and Conformer 2 will change how we use the internet in the not-too-distant future.
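
Before we get to Whisper, the speech half of that commentary mash-up is worth a quick sketch. It uses the new text-to-speech endpoint; the model and voice names are as documented at launch, the commentary line is a placeholder, and the exact helper for saving the audio may differ by SDK version.

```python
# A minimal sketch of the new text-to-speech endpoint: turn one line of
# generated commentary into audio. The input line is a placeholder.
from openai import OpenAI

client = OpenAI()

speech = client.audio.speech.create(
    model="tts-1",   # "tts-1-hd" trades latency for higher quality
    voice="alloy",   # one of the six launch voices
    input="He shoots... goal! Absolutely glorious. The stadium explodes in joy.",
)
speech.stream_to_file("commentary.mp3")  # write the returned audio to disk
```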

Speaking of new modalities, we're also releasing the next version of our open source speech recognition model, Whisper V3, today. And it'll be coming soon to the API. It features improved performance across many languages, and we think you're really going to like it. This chart shows the word error rate across a range of languages for Whisper V3 and V2.

A drop means an improvement, and, surprisingly, English isn't the most accurately transcribed language. If you happen to speak Spanish or Korean, you might be in for a better time. Of course, if you speak one of the lower-resource languages like Bengali, it might be a bit harder to use.
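
If you want to try it yourself before it reaches the API, the open-source checkpoint can already be run locally with the openai-whisper package. A minimal sketch, with the audio filename as a placeholder:

```python
# A minimal sketch of running the open-source Whisper large-v3 checkpoint
# locally (pip install -U openai-whisper). "interview.mp3" is a placeholder.
import whisper

model = whisper.load_model("large-v3")       # downloads the weights on first use
result = model.transcribe("interview.mp3")   # language is auto-detected
print(result["language"])                    # detected language code, e.g. "es"
print(result["text"])                        # the full transcript
```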

Just quickly, though, I did pick up on this somewhat brief remark Sam Altman made about GPT-4 Turbo being smarter. GPT-4 Turbo is the industry leading model. It delivers a lot of improvements that we just covered, and it's a smarter model than GPT-4. I've started my own investigations, as have others, and the results aren't uniformly better.

And in fact, for some use cases, GPT-4 Turbo performs worse. This was a test for SAT reading by Jeffrey Wang. There does need to be far more thorough investigation, but I wish they had released some benchmarks along with that claim. And there were two moments from the guest appearance of the CEO of Microsoft, Satya Nadella, that I found particularly interesting.

First, he promised OpenAI the most compute. The systems that are needed as you aggressively push forward on your roadmap requires us to be on the top of our game. And we intend fully to commit ourselves deeply to making sure you all, as builders of these foundation models, have not only the best systems for training and inference, but the most compute so that you can keep pushing forward on the frontiers.

Because I think that's the way we're going to make progress. Why is that relevant? Because it seems at the moment every mega corporation in the world is going to come out with their own language model. Many of you may have seen the publicity over Grok from xAI. It's been trained on Twitter data and apparently it replies to spicier questions.

They tested the new Grok-1 across a range of benchmarks, although some people have complained about the settings they used. Either way, on these benchmarks it did seem to be fairly firmly beating GPT-3.5, the original ChatGPT. But notice that on these benchmarks it was falling way behind Claude 2.

That's on MMLU, testing general knowledge, HumanEval for coding, and GSM8K for mathematics. They also gave this human-graded evaluation comparison on an exam produced after all of these models were trained. And on that, Grok-1 outperforms even Claude 2 and way outperforms GPT-3.5. Now, many have complained about the way that the models were evaluated.

But I've talked before, with SmartGPT, about the discrepancies that can arise between human grading and automatic grading. This just reaffirms, for me, the need for new, authoritative, hand-graded benchmarks. Pricey, but well worth it. And they're not the only ones, with Amazon building a model called Olympus, which apparently is going to have 2 trillion parameters.

This article says that's twice the size of GPT-4. But according to the rumoured figures I've seen, which put GPT-4 at roughly 1.8 trillion parameters, it's more like 10% bigger. The problem is, of course, that it's not all about the parameter count; it's about how you use those parameters. So I don't personally predict that the Amazon model will be more powerful than GPT-4.

And even Samsung are getting into the game with a model that can also produce text, code and images. That's apparently going to be incorporated into their next round of phones, and I will be there to test it, hopefully. But back to the OpenAI Dev Day, where they confirmed that we can now use different modalities of GPT-4 all in the same window.

ChatGPT now uses GPT-4 Turbo with all the latest improvements, including the latest knowledge cutoff, which will continue to update. That's all live today. It can now browse the web when it needs to, write and run code, analyze data, take and generate images, and much more.

And we heard your feedback, that model picker, extremely annoying. That is gone starting today. You will not have to click around the dropdown menu. All of this will just work together. I decided to test that out and ask three things in one. All in one prompt, I asked, "Use Bing to get the live market cap of Microsoft and Apple.

Then use Code Interpreter to calculate the percent difference." I definitely don't trust the base GPT-4 to get that calculation correct. And then I asked, "Output an image that captures which market cap is bigger." Now, of course, market capitalization changes all the time, but these figures are broadly accurate.

And this Apple figure is indeed 5.32% bigger than the Microsoft figure. But it was the final part, incorporating DALL-E 3, that I found the most impressive. It created the image you can see here, with the Apple skyscraper being slightly bigger, I would argue around 5% bigger, than the Microsoft tower, albeit with the hallucination of an 'A' in Microsoft.
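
For what it's worth, the Code Interpreter step reduces to a one-line percent-difference calculation; the market caps below are placeholder values, not the live figures from the demo:

```python
# The percent-difference step as Code Interpreter would compute it. The market
# caps here are placeholder values, not the live figures retrieved in the demo.
apple_mcap = 2.75e12      # hypothetical Apple market cap in USD
microsoft_mcap = 2.61e12  # hypothetical Microsoft market cap in USD

percent_bigger = (apple_mcap - microsoft_mcap) / microsoft_mcap * 100
print(f"Apple's market cap is {percent_bigger:.2f}% bigger")  # about 5.4% with these numbers
```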

It was around this point in the presentation that Sam Altman mentioned AGI. But before I take a look at that comment and a new AGI tier list paper from Google DeepMind, I thought we'd take a brief interlude and enjoy this incredible update to Gen 2 from Runway ML. At the end of the video, you're going to see an even more impressive demo, I think, of this technology.

But just before we get back to Sam, what do you think about a physical device for video editing generated by AI? This is called the First AI Machine. Now here's Sam Altman's quick comment to Satya Nadella on AGI.

Well I think we have the best partnership in tech. I'm excited for us to build AGI together. I'm really excited. Have a fantastic time. Thank you very much for coming. That brought to mind a paper I read this week from Google DeepMind.

It's called "Levels of AGI" and it's very much based on the idea of the levels of autonomous driving. And I do admire their intent to create a more clear definition of AGI. For example, Level 2: Competent AGI. Level 4: Virtuoso AGI. With so many AGI predictions out there, as shown in my previous video, clear definitions are quite important.

But I do have some questions for the authors that I would want clarified. For example, they say, "Competent AGI, which has not yet been achieved, is when the AI is at least at the 50th percentile of skilled adults. That compares to emerging AGI, which is equal to or somewhat better than an unskilled human." And in one of their later examples, they talk about mathematics.

But what counts as being skilled in mathematics? Are we talking about being better than the median high school graduate in mathematics? Being better than the median college student in mathematics? Or are we talking about being better than the median professional mathematician? To me, they never made that clear, and it makes a big difference to the benchmark.

If we're just talking about those graduating high school or secondary school, GPT-4 is already better than the median person there in mathematics. If we're talking about elite professional mathematicians, when it's better than the 50th percentile of those, we are getting really close to superintelligence. Also, the authors confused me somewhat on page 7 when they said this: "A competent AGI must have performance at least at the 50th percentile for skilled adult humans on most cognitive tasks." So it's not about being average at all tasks, it's apparently about being average at most tasks.

But I could well imagine a situation where, say, GPT-5 is well above average for 90% of tasks, but its capacity for purely abstract reasoning is down at the 10th or 20th percentile. And even though that would meet this definition of a competent AGI, many people would say, "It's not really AGI." In that scenario, it wouldn't fundamentally be understanding what it's performing so well in.

So again, definitions really do matter, and this is a step forward, but doesn't go all the way to a solution for me. I did also note this on page 7: "The rate of progression between levels of performance and/or generality may be non-linear, not in a straight line. Acquiring the capability to learn new skills may particularly accelerate progress toward the next level." Going back to that table, I think that's something really worth noting.

Imagine a virtuoso AGI that's better than 99% of skilled people across a range of domains. But those skills would include machine learning, chip design, curriculum learning. And of course, that virtuoso AGI can just be duplicated, so there's a billion of them running. And surely a billion virtuoso AGIs would be able to create the next level, ASI, much quicker than they themselves were created.

It seems like others are getting that sense of acceleration too, with the co-founder and CEO of RunwayML, as we've seen, saying: "Decades worth of progress happened this year. A year's worth of progress is now happening in months. Months worth of progress will start to happen in days." I personally don't think we have quite reached that point in AI, but the momentum is definitely building.

Going back to the paper, toward the end it read a bit like a screenplay for X-Men. On the topic of artificial superintelligence, the paper said: "Non-human skills that an ASI might have could include capabilities such as neural interfaces, perhaps through mechanisms such as analysing brain signals to decode thoughts; oracular abilities, being an oracle, perhaps through mechanisms such as analysing large volumes of data to make high quality predictions; or the ability to communicate with animals, that would be amazing, perhaps by mechanisms such as analysing patterns in their vocalisations, brainwaves, or body language." The paper also talked about the level of autonomy that we might give to AI, again graded across five levels.

But I bet David Shapiro has got something to say about the possible societal-scale ennui, that's boredom or lethargy, that might accompany expert AGI that drives interaction, with humans only providing guidance and feedback. Anyway, let's get back to what you can do now, which is create your own chatbot, your own GPT.

I luckily got early access to this and have been playing about a bit with it. First thing I did was create a bot of Ivan Ilyich, the star of an amazing short story by Tolstoy, The Death of Ivan Ilyich. All I did was create a description and instructions for GPT-4 Turbo and created an image of Ivan Ilyich with DALL-E 3.

I gave it knowledge of the entire short story just by uploading a document containing it. I think I'm allowed to do that in terms of copyright, but copyright doesn't seem to mean much anymore. Anyway, all I really wanted to demonstrate was that this Ivan Ilyich bot would know the story better than base GPT-4.
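
Custom GPTs are configured in the ChatGPT interface, but roughly the same recipe, instructions plus an uploaded knowledge file, is available through the Assistants API announced at the same Dev Day. Here is a sketch under that assumption, with an illustrative filename and instructions:

```python
# A sketch of the same recipe via the Assistants API as documented at Dev Day:
# custom instructions plus an uploaded knowledge file with the retrieval tool.
# The filename and instruction text are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

# Upload the story so the assistant can retrieve from it.
story = client.files.create(file=open("ivan_ilyich.txt", "rb"), purpose="assistants")

assistant = client.beta.assistants.create(
    name="Ivan Ilyich",
    instructions="You are Ivan Ilyich. Answer in character, using only the uploaded story.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[story.id],
)

# Ask a question on a new thread and start a run over it.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="For how many days did you scream?"
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
# In practice you would poll the run until it completes, then read the thread's messages.
```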

For example, here is normal GPT-4. For how many days did Ivan Ilyich scream in the famous short story The Death of Ivan Ilyich? It seems to think I'm quoting a document and says "error reading documents", but either way it can't answer the question. That short story probably is in its pre-training data, but it can't retrieve the information well enough.

But my Ivan Ilyich bot can, even with as direct a question as "For how many days did you scream?" Acting in character, it said "For the last three days of my life", and that is correct based on the short story. What I'm now working on, probably along with thousands of other creators, is an AI Explained bot.

Getting the transcripts of my videos and turning them into an imitative chatbot. I asked, well, myself, what sort of timelines people are using for AGI, and, drawing on one of my recent videos, it outputted a fairly decent answer. As you might expect, I've been testing out many of the other GPTs, including those made by third parties.

But honestly, the OpenAI GPTs don't seem that much better than the generic version. For example, I tried out the MathMentor, and it couldn't seem to answer math questions any better than generic GPT-4. I also tested the Meme GPT, and it didn't seem to understand any more memes than the base GPT.

Apparently, later this month, we are going to get a GPT store, so there should be many more GPTs to test out soon. According to Sam Altman, bringing in just a smidgen of hype, all of these developments are just the start. As intelligence gets integrated everywhere, we will all have superpowers on demand.

We hope that you'll come back next year. What we launch today is going to look very quaint relative to what we're busy creating for you now. Thank you so much for watching to the end of the video, and I'm going to leave you with this creation from InfiniteYe. They used Midjourney, the new Runway, and Stable Audio.

Have a wonderful day. See you next time.