
GPT-4o - Full Breakdown + Bonus Details


Transcript

It's smarter in most ways, cheaper, faster, better at coding, multimodal in and out, and perfectly timed to steal the spotlight from Google. It's GPT-4 Omni. I've gone through all the benchmarks and the release videos to give you the highlights. My first reaction was that it's more flirtatious 'Her' than AGI, but a notable step forward nonetheless.

First things first, GPT-4o, the 'o' standing for omni, meaning all or everywhere, referencing the different modalities it's got, is free. By making GPT-4o free, they are either crazy committed to scaling up from 100 million users to hundreds of millions of users, or they have an even smarter model coming soon, and they did hint at that.

Of course, it could be both, but it has to be one of them. Just giving paid users five times the message limits doesn't seem enough to me. Next, OpenAI branded this as GPT-4 level intelligence, although in a way I think they slightly underplayed it. So before we get to the video demos, some of which you may have already seen, let me get to some more under-the-radar announcements.

Take text-to-image and look at the accuracy of the text generated from this prompt. Now, I know it's not perfect. There aren't two question marks on the 'now'. There are others you can spot, like the 'I' being capitalized. But overall, I've never seen text generated with that much accuracy.

It wasn't even in the demo. Or take this other example, where two OpenAI researchers submitted their photos. Then they asked GPT-4o to design a movie poster, and they gave the requirements in text. Now, when you see the first output, you're going to say, well, that isn't that good.

But then they asked GPT-4o something fascinating. It seemed almost like reverse psychology, because they said: here is the same poster, but cleaned up. The text is crisper, the colors bolder and more dramatic, and the whole image is now improved. That was the input, don't forget. The final result, in terms of the accuracy of the photos and of the text, was really quite impressive.

I can imagine millions of children and adults playing about with this functionality. Of course, they can't do so immediately because OpenAI said this will be released in the next few weeks. As another bonus, here is a video that OpenAI didn't put on their YouTube channel. It mimics a demo that Google made years ago, but never followed up with.

The OpenAI employee asked GPT-4o to call customer service and ask for something. I've skipped ahead, and the customer service agent in this case is another AI. But here is the conclusion. Could you provide Joe's email address for me? Sure, it's joe@example.com. Awesome. All right, I've just sent the email.

Can you check if Joe received it? We'll check right now. Please hold. Sure thing. Hey, Joe, could you please check your email to see if the shipping label and return instructions have arrived? Fingers crossed. Yes, I got the instructions. Perfect. Joe has received the email. They call it a proof of concept, but it is a hint toward the agents that are coming.

Here are five more quick things that didn't make it into the demo. How about a replacement for Lensa? Submit your photo and get a caricature of yourself. Or what about text to new font? You just ask for a new style of font and it will generate one. Or what about meeting transcription?

The meeting in this case had four speakers, and it was transcribed. Or video summaries. Remember, this model is multimodal in and out. Now, it doesn't have video out, but I'll get to that in a moment. Here, though, was a demonstration of a 45-minute video submitted to GPT-4o and a summary of that video.

We also got character consistency across both the woman and the dog, almost like an entire cartoon strip. If those were the quick bonuses, what about the actual intelligence and performance of the model? Before I get to official benchmarks, here is a human-graded leaderboard pitting one model against another. And yes, 'im-also-a-good-gpt2-chatbot' is indeed GPT-4o.

So it turns out I've actually been testing the model for days. Overall, you can see the preference for GPT-4o compared to all other models. In coding specifically, the difference is quite stark. I would say even here, though, we're not looking at an entirely new tier of intelligence. Remember that a 100-point Elo gap translates to a win rate of around two thirds.
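If you want to sanity-check that conversion yourself, here is a minimal sketch using the standard Elo expected-score formula; the specific gap values are just illustrative, not figures from the leaderboard.

```python
# Standard Elo expected-score formula: the probability that the higher-rated
# model's output is preferred, given a rating gap of `diff` points.
def win_rate(diff: float) -> float:
    return 1 / (1 + 10 ** (-diff / 400))

for diff in (50, 100, 150):
    print(f"{diff:>3}-point gap -> {win_rate(diff):.0%} preferred")
# A 100-point gap works out to roughly 64%, i.e. about two thirds.
```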

So about a third of the time, GPT-4 Turbo's outputs would still be preferred. That's roughly the same gap as between GPT-4 Turbo and last year's original GPT-4. A huge step forward, but not completely night and day. I think one underrated announcement was the desktop app, a live coding copilot. OK, so I'm going to open the ChatGPT desktop app, like Mira was talking about before.

OK, and to give a bit of background on what's going on: here we have a computer, and on the screen we have some code, and the ChatGPT voice app is on the right. So ChatGPT will be able to hear me, but it can't see anything on the screen.

So I'm going to highlight the code, Command-C it, and that will send it to ChatGPT. And then I'm going to talk about the code to ChatGPT. OK, so I just shared some code with you. Could you give me a really brief one-sentence description of what's going on in the code?

This code fetches daily weather data for a specific location and time period, smooths the temperature data using a rolling average, annotates a significant weather event on the resulting plot, and then displays the plot with the average minimum and maximum temperatures over the year. I've delayed long enough, so the benchmarks are next.
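But first, for reference, a script matching that one-sentence description might look roughly like the sketch below. This is my own reconstruction of the kind of code shown, not OpenAI's actual demo code; the synthetic data, the event date, and the "heat wave" label are all placeholders.

```python
# My reconstruction of the kind of script shown in the demo (not OpenAI's code):
# daily temperatures for one location and year, smoothed with a rolling average,
# one "significant weather event" annotated, and the average daily min/max
# temperatures over the year shown in the title.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic daily data standing in for a real weather-API call.
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
rng = np.random.default_rng(0)
seasonal = 12 * np.sin(2 * np.pi * (dates.dayofyear - 80) / 365)
df = pd.DataFrame({
    "t_min": 6 + seasonal + rng.normal(0, 2, len(dates)),
    "t_max": 14 + seasonal + rng.normal(0, 2, len(dates)),
}, index=dates)

# Smooth both series with a 7-day centred rolling average.
smooth = df.rolling(window=7, center=True).mean()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(smooth.index, smooth["t_min"], label="min (7-day avg)")
ax.plot(smooth.index, smooth["t_max"], label="max (7-day avg)")

# Annotate a significant weather event (hypothetical date and label).
event = pd.Timestamp("2023-07-15")
ax.annotate("heat wave", xy=(event, smooth.loc[event, "t_max"]),
            xytext=(event, smooth["t_max"].max() + 3),
            arrowprops=dict(arrowstyle="->"))

ax.set_title(f"Avg min {df['t_min'].mean():.1f}°C, avg max {df['t_max'].mean():.1f}°C over the year")
ax.set_ylabel("°C")
ax.legend()
plt.show()
```

With that aside out of the way, here are the benchmarks.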

I was most impressed by GPT-4o's performance on the math benchmark. Even though it fails pretty much all of my own math prompts, that is still a stark improvement over the original GPT-4. On the Google-proof graduate test (GPQA), it beats Claude 3 Opus. And remember, that was the headline benchmark for Anthropic.

In fact, speaking of Anthropic, they are somewhat challenged by this release. GPT-4o costs $5 per million input tokens and $15 per million output tokens. As a quick aside, it also has a 128k-token context window and an October knowledge cutoff. But remember the pricing: $5 and $15. Claude 3 Opus is $15 and $75.
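To make that concrete, here is a quick back-of-envelope comparison using the per-million-token prices just quoted; the example request size is my own, purely illustrative.

```python
# Back-of-envelope cost comparison at the list prices quoted above.
PRICES = {                      # USD per 1M tokens: (input, output)
    "gpt-4o": (5.00, 15.00),
    "claude-3-opus": (15.00, 75.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a 10k-token prompt with a 1k-token reply.
for model in PRICES:
    print(model, f"${request_cost(model, 10_000, 1_000):.4f}")
# gpt-4o $0.0650
# claude-3-opus $0.2250
```

So at those list prices, the same request comes out roughly three and a half times cheaper on GPT-4o.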

And remember, for Claude 3 Opus on the web, you have to sign up with a subscription, whereas GPT-4o will be free. So for Claude 3 Opus to be beaten on its headline benchmark is a concern for Anthropic. In fact, I think the results are clear enough to say that GPT-4o is the new smartest AI.

However, just before you get carried away and type on Twitter that AGI is here, there are some more mixed benchmarks. Take the DROP benchmark. I dug into it, and it's about adversarial reading comprehension questions designed to really test the reasoning capabilities of models. If you give models difficult passages and they have to sort through references, do some counting and other operations, how do they fare?

DROP, by the way, stands for Discrete Reasoning Over Paragraphs. GPT-4o does slightly better than the original GPT-4, but slightly worse than Llama 3 400B. And as they note, Llama 3 400B is still training. So it's just about the new smartest model, but by a hair's breadth. However, we're not done yet.

It's better at translation than the Gemini models. Quick caveat there: Gemini 2 might be announced tomorrow, and that could regain the lead. Then there are the vision understanding evaluations. It was a real step forward on the MMMU, as you can see, a clear 10 points better than Claude 3 Opus. Again, I'm curious whether Google Gemini can exceed it, though.

The improvements to the tokenizer could be revolutionary for non-English speakers. Dramatically fewer tokens needed for languages like Gujarati, Hindi, Arabic, and more doesn't just mean conversations are cheaper; they're also quicker. And what about multilingual performance? Well, this time they didn't compare it to other models, but to the original GPT-4.
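A quick aside on that tokenizer point: if you want to see the effect yourself, the tiktoken library exposes both the older encoding and the new one. The sketch below assumes a tiktoken version recent enough to ship the o200k_base encoding used by GPT-4o, and the Hindi sentence is just an example.

```python
# Compare token counts between the GPT-4 / GPT-4 Turbo encoding and the GPT-4o one.
# Assumes: pip install tiktoken (a version that includes o200k_base).
import tiktoken

old = tiktoken.get_encoding("cl100k_base")   # used by GPT-4 / GPT-4 Turbo
new = tiktoken.get_encoding("o200k_base")    # used by GPT-4o

text = "नमस्ते, आप कैसे हैं?"  # Hindi: "Hello, how are you?"
print(len(old.encode(text)), "tokens with cl100k_base")
print(len(new.encode(text)), "tokens with o200k_base")
```

Back to the multilingual comparison.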

Definitely a step up across languages, but English is still by far the best-served language. Indeed, here is a video of some of the model's mistakes, ending with some dodgy language tuition. I know enough Mandarin to say it wasn't perfect at tutoring. Let's root, root, root for the home team.

Oh no, what was that? Sorry guys, I got carried away. Ready for another round? I have a feeling that went very wrong. Hello, my name is Nacho, I'm here with my co-worker. Hi, I'm Shil, I'm trying to teach my co-worker how to speak Mandarin. We want to start with simple words like "Nǐ hǎo".

Can you teach him how to pronounce that? Of course. Hey, Nacho, nice to meet you. "Nǐ hǎo" is pretty straightforward to pronounce. It sounds like "Nǐ hǎo". Just make sure to keep the "Nǐ" part high and then go down in pitch on the "hǎo". Give it a try. "Nǐ hǎo"?

That was great. Really good first try, Nacho. You're a natural. It really wasn't; he needs to work on his tones, and her face was the giveaway. There were a lot of other interesting video demonstrations, but before them, the GPT-4o blog post that Sam Altman put out tonight. He made the argument that putting this capable AI tool in the hands of everyone for free is what justifies the "open" part of OpenAI.

Less about open weights or open source, more about open to everyone, without ads or anything like that. He also draws our attention to the latency of the models. In many ways, that is the key innovation with GPT-4o. By dialing down the latency, you dial up the realism. As he said, it feels like AI from the movies; getting to human-level response times and expressiveness turns out to be a big change.

Indeed, I think I should get a little credit for predicting 'Her'-like AI in a video from a month ago. But now I want to get to those demos. The response times were amazing, and the model was clearly designed to be flirtatious. Note that in Senate testimony last year, Sam Altman said, "We try to design systems that do not maximize for engagement." So let me know in the comments whether you think these demos are designed to maximize engagement.

Okay, so this is what I wrote down. What do you see? Oh, I see "I love ChatGPT". That's so sweet of you. Yeah, well, I really appreciate all the help. So yeah, as we can see. Wow, that's quite the outfit you've got on. Yeah, as we can see, you know, we can chat in real time.

I was impressed though that it could speed up its talking on demand. Of course. One, two, three. Hey, actually, that's a little slow. Could you count faster? Sure thing. One, two, three, four, five, six, seven, eight, nine, ten. Okay, that was a little bit too fast. Of course, OpenAI must be given credit because for blind people, this could be revolutionary.

...right now. The Royal Standard flag is flying above Buckingham Palace, which is the signal that the monarch is present. How exciting to be there while the king is home. Try and tell me exactly what they're doing right now, please. Right now, the ducks are gently gliding across the water.

It will also help you with interview prep by analyzing your appearance. How about you? I'm awesome. Listen, I got some huge news. Oh, do tell. I'm all ears. Well, in a few minutes, I'm going to be interviewing at OpenAI. Have you heard of them? OpenAI? Huh? Sounds vaguely familiar.

Kidding. Of course. That's incredible, Rocky. What kind of interview? Well, it's for a software engineering role, but I just need to know, do I look presentable? Professional? Well, Rocky, you definitely have the "I've been coding all night" look down, which could actually work in your favor. Maybe just run a hand through your hair or lean into the mad genius vibe.

Your enthusiasm is what's really going to shine through. I don't have a lot of time, so I'm just going to throw this on. What do you think? Oh, Rocky, that's quite a statement piece. I mean, you'll definitely stand out, though maybe not in the way you're hoping for an interview.

Okay, I got it. I got it. I could definitely see GPT-4o being used for mathematics tutoring. Here is a 30-second clip, but notice also the slight glitchiness at certain moments. Relative to angle alpha. All right, so I'm pretty sure this is the angle alpha right here, right? Perfect. Correct.

Now, looking at the triangle, which side do you think is the hypotenuse? Um, remember the hypotenuse? I'm not totally sure. I think it might be this one, but I really am not sure. This side AC? You're close. I also noticed quite a few glitches in the demo. Sounds cutting in and out.

The one feature, though, that OpenAI wants you to really focus on is the video-in capability. You can live-stream video straight to the transformer architecture behind GPT-4o, and the video-in functionality is certainly impressive. Note, though, that the reaction time for video isn't quite as immediate as for audio.

Greg Brockman, the co-founder of OpenAI, has to prompt it to react to the intruder. Overall stylish feel. They might be getting ready for a conversation or presentation, given their direct engagement with the camera. Sounds like we're in for an intriguing interaction. I'm curious about the lighting you mentioned. Could you describe it a bit more?

Is it natural light or are we talking unique fixtures or colors that add to the atmosphere? The lighting here seems to be a mix of natural and artificial. There's a noticeable bright light overhead, probably a fixture, casting a focused beam downwards, which creates a spotlight effect. This adds a dramatic and modern feel to the scene.

The rest of the room is lit more softly, possibly by natural light filtering... Did anything unusual happen recently? Yes, actually. Just as you asked, another person came into view behind the first person. They playfully made bunny ears behind the first person's head and then quickly left the frame. It added a lighthearted and unexpected moment to the scene.

Now the focus is back on the original person with the leather jacket. Nevertheless, GPT-4o can produce multiple voices that can sing almost in harmony. And really try to harmonize here. But maybe make it more dramatic and make the soprano higher. Great, thank you! And I suspect this real-time translation could soon be coming to Siri.

Translate for us. So every time I say something in English, can you repeat it back in Spanish? And every time he says something in Spanish, can you repeat it back in English? Sure, I can do that. Let's get this translation train rolling. Hey, how's it been going? Have you been up to anything interesting recently?

Hey, I've been good. Just a bit busy here preparing for an event next week. Why do I say that? Because Bloomberg reported two days ago that Apple is nearing a deal with OpenAI to put ChatGPT on the iPhone. And in case you're wondering about GPT-4.5 or even GPT-5, Sam Altman said we'll have more stuff to share soon.

And Mira Murati, in the official presentation, said that they would soon be updating us on progress towards the next big thing. Whether that's empty hype or real, you can decide. No word, of course, about OpenAI co-founder Ilya Sutskever, although he was listed as a contributor under "additional leadership". Overall, I think this model will be massively more popular, even if it isn't massively more intelligent.

Note that all the demos you saw were in real time at 1x speed; that, I think, was a nod to Google's botched Gemini demo. Of course, let's see tomorrow what Google replies with. In the meantime, you can already prompt the model with text and images in the OpenAI playground and via the API; all the links will be in the description.
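For anyone who prefers code to the playground, a minimal text-plus-image call might look roughly like this. It's a sketch, assuming the official openai Python SDK (v1+), an OPENAI_API_KEY in the environment, and a placeholder image URL.

```python
# Minimal sketch of a text + image prompt to GPT-4o via the Chat Completions API.
# Assumes: pip install openai (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is happening in this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/some-image.jpg"}},  # placeholder URL
        ],
    }],
)

print(response.choices[0].message.content)
```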

To those who think that GPT-4o is a huge stride towards AGI, I would point them to the somewhat mixed results on the reasoning benchmarks. Expect GPT-4o to still suffer from a massive amount of hallucinations. To those, though, who think that GPT-4o will change nothing, I would say this: look at what ChatGPT did to the popularity of the underlying GPT series.

It being a free and chatty model brought 100 million people into testing AI. GPT-4o, being the smartest model currently available, free on the web, and multimodal, could I think unlock AI for hundreds of millions more people. But of course, only time will tell. If you want to analyse the announcement even more, do join me on the AI Insiders Discord via Patreon.

We have live meetups around the world and professional best practice sharing. So let me know what you think and as always have a wonderful day.