
How Well Can GPT-4 See? And the 5 Upgrades That Are Next


Transcript

We all saw that GPT-4 is able to create a website from handwriting on a napkin. But with all the news since, the focus on vision has been lost. Meanwhile, in the last few hours and days, a select few with full access to multimodal GPT-4 have been releasing snapshots of what it can do.

I want to show you not only what is imminent with GPT-4 vision, but with releases this week in text to 3D, text inside 3D, speech to text, and even embodiment, we're going to see how language and visual model innovations are complementing each other and beginning to snowball. But let's start with images.

Do you remember from the GPT-4 technical report when the model, prompted to do so, was able to manipulate a human into solving CAPTCHAs for it? Well, that may no longer be needed. It solves this one pretty easily, so no, CAPTCHAs are not going to slow down GPT-4. Next, medical imagery. It was able to interpret this complex image and spot elements of a brain tumor.

Now, it did not spot the full diagnosis, but I want to point something out. This paper from OpenAI was released only a few days ago, and it tested GPT-4 on medical questions. They found that GPT-4 can attain outstanding results exceeding human performance levels, and that was without vision.

The images and graphs were not passed to the model. And as you can see, when the questions did have a lot of media in them, it brought down GPT-4's average. It will be very interesting to see GPT-4's results when its multimodal capabilities are accounted for. Next is humor, and I'm not showing these to say that they're necessarily going to change the world, but it does demonstrate the raw intellect of GPT-4.

To suss out why these images are funny, you have to have quite a nuanced understanding of humanity. Let's just say that it probably understood this meme quicker than I did. Quick thing to point out, by the way, it won't do faces. For pretty obvious privacy reasons, they won't allow the model to recognize faces.

Whether that ability gets jailbroken, only time will tell. Meanwhile, it can read menus and interpret the physical world, which is an amazing asset for visually impaired people. I want to move on to another fascinating ability that the vision model inside GPT-4 possesses, and that is reading graphs and text from images.

Its ability to interpret complex diagrams and captions is going to change the world. Here it is understanding a complex diagram and caption from the PaLM-E paper released only about three weeks ago, which I have done a video on, by the way. But just how good is it at reading text from an image?

Well, let's take a look at GPT-4's score on the TextVQA benchmark. Now, I've covered quite a few of the other benchmarks in other videos, but I want to focus on this one here. Notice how GPT-4 got 78%, which is better than the previous state-of-the-art model, which got 72%.

Now, try to remember that 78% figure. What exactly is this testing, you ask? Well, reading text from complex images. This is the original TextVQA academic paper, and you can see some of the sample questions above. To be honest, if you want to test your own eyesight, you can try them yourself.
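As an aside, for anyone curious how answers get scored: TextVQA reports the standard VQA-style accuracy metric, where each question comes with ten human answers and, in its commonly quoted form, a prediction earns full credit if at least three humans gave the same answer. Here's a minimal sketch of that formula; the example answers are invented purely for illustration.

```python
# A minimal sketch of the VQA-style accuracy metric reported by TextVQA:
# full credit if at least 3 of the 10 human annotators gave the same answer.
# The example answers below are made up for illustration.
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    predicted = predicted.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == predicted)
    return min(matches / 3.0, 1.0)

human_answers = ["coca cola", "coca cola", "coke", "coca cola", "cola",
                 "coca cola", "coke", "coca cola", "coca cola", "coca cola"]
print(vqa_accuracy("coca cola", human_answers))  # 1.0  -> full credit
print(vqa_accuracy("coke", human_answers))       # 0.67 -> partial credit
```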

So how does the average human perform? Well, on page seven, we have this table and we get this figure for humans: 85%. You don't need me to tell you that's just 7 percentage points better than GPT-4. The thing is, though, these models aren't slowing down. As the Vision co-lead at OpenAI put it, "Scale is all you need until everyone else realizes it too." But the point of this video is to show you that improvements in one area are starting to bleed into improvements in other areas.

We already saw that an image of bad handwriting could be translated into a website. As you can see here, even badly written natural language can now be translated directly into code in Blender, creating detailed 3D models with fascinating physics. The borders of text, image, 3D and embodiment are beginning to be broken down.
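To give a flavour of what that looks like in practice, here is a minimal sketch of the kind of Blender Python (bpy) script a language model might emit for a prompt like "drop a ball onto the ground with physics." The prompt, object choices and parameter values are my own illustrative assumptions, not actual GPT-4 output.

```python
# Illustrative sketch: Blender Python (bpy) a language model might generate for
# "drop a ball onto the ground with physics". Run inside Blender's scripting tab.
import bpy

# Ground: a large plane set as a passive rigid body (it collides but doesn't move)
bpy.ops.mesh.primitive_plane_add(size=10, location=(0, 0, 0))
bpy.ops.rigidbody.object_add()
bpy.context.object.rigid_body.type = 'PASSIVE'

# Ball: a sphere above the plane set as an active rigid body (it falls under gravity)
bpy.ops.mesh.primitive_uv_sphere_add(radius=0.5, location=(0, 0, 5))
bpy.ops.rigidbody.object_add()
bpy.context.object.rigid_body.type = 'ACTIVE'

# Playing the animation will now show the ball dropping and bouncing on the plane
```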

And of course, other companies are jumping in too. Here's Adobe showing how you can edit 3D images using text. And how long will it really be before we go direct from text to physical models, all mediated through natural language? And it's not just about creating 3D, it's about interacting with it through text.

Notice how we can pick out both text and higher-level concepts like objects. This advanced 3D field was captured using 2D images from a phone. The paper behind it was released only 10 days ago. But notice how we now have language embedded inside the model. We can search and scan for more abstract concepts like yellow, or even utensils or electricity.
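The general recipe behind this kind of text query, as I understand it, is to attach CLIP-style language features to the 3D scene and rank points by how well they match the embedded text. Here's a rough sketch of that idea, assuming OpenAI's open-source clip package; the per-point features are random stand-in data, not the paper's actual pipeline.

```python
# Rough sketch of querying a language-embedded 3D scene with text:
# rank per-point CLIP-style features by cosine similarity to the query embedding.
# The point features here are random placeholders, purely for illustration.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Stand-in for language features distilled into the 3D field (N points x 512 dims)
point_features = torch.randn(10_000, 512, device=device)
point_features = point_features / point_features.norm(dim=-1, keepdim=True)

# Embed and normalize the text query
tokens = clip.tokenize(["utensils"]).to(device)
with torch.no_grad():
    text_feature = model.encode_text(tokens).float()
text_feature = text_feature / text_feature.norm(dim=-1, keepdim=True)

# Cosine similarity between every point and the query; keep the top matches
relevance = (point_features @ text_feature.T).squeeze(1)   # shape: (N,)
top_points = relevance.topk(100).indices                   # most relevant 3D points
```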

It's not perfect. And for some reason, it really struggled with recognizing ramen. But it does represent the state of the art in turning images into 3D and interpreting them through text. And maybe you don't even want to type; you just want to use your voice. Just three weeks ago, I did a video on how voice recognition will change everything.

And I was talking about OpenAI's Whisper API. But now we have Conformer, which is better than Whisper, and here is the chart to prove it: Conformer makes even fewer errors than Whisper at recognizing speech. The cool thing is you can test it for yourself, and the link is in the description.

And while you're passing by the description, don't forget to leave a like and a comment to let me know if you've learned anything from this video. As you'd expect, I tested it myself and it did amazingly at transcribing my recent video on GPT-4. There were only a handful of mistakes in a 12-minute transcript.
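If you want to try this kind of transcription yourself, here's a minimal sketch using OpenAI's open-source whisper package (Conformer models are accessed through their own hosted APIs instead). The filename is a placeholder.

```python
# Minimal sketch: transcribing an audio file with the open-source whisper package.
# "video_audio.mp3" is a placeholder; swap in your own recording.
import whisper

model = whisper.load_model("base")            # small and fast; larger models are more accurate
result = model.transcribe("video_audio.mp3")  # returns a dict with full text plus timed segments
print(result["text"])
```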

At this point, you're probably thinking, what's next? Well, look at the route sketched out two years ago by Sam Altman. He said in the next five years, computer programs that can think will read legal documents and give medical advice. With GPT-4 passing the bar, I would say so far he's two for two.

He goes on: in the next decade, they will do assembly line work and maybe even become companions. He's talking about the physical embodiment of language models. Back then, OpenAI had a robotics team themselves that could do things like this. Here is a robotic hand solving a Rubik's Cube despite being interrupted by a giraffe and by someone prodding it with a pen.

It still solved the cube. But then that team got disbanded, and it seems like they've moved into investing in startups instead. They are leading a $23 million investment in 1X, a startup developing a human-like robot. Here is the 1X website, which features this rather startling image and says Summer 2023.

Our newest android iteration, NEO, will explore how artificial intelligence can take form in a human-like body. Now, of course, for many of you, a humanoid robot won't be that surprising. Here is the obligatory clip from Boston Dynamics. And of course, these models don't have to be humanoid. Here is a demonstration from a paper published just four days ago.

This is not just walking. It's climbing up, balancing, pressing and operating buttons. And before you think all of this is really far away, these assembly line robots are now commercially available. I still think there's a long way to go before embodiment becomes mainstream. But my point is this. All these improvements that we're seeing in text, audio, 3D and embodiment, they're starting to merge into each other, complement each other.

On their own, they're cool and a bit nerdy. But once they start synergizing, fusing together, they could be revolutionary. As Sam Altman said on the Lex Fridman podcast released yesterday, embodiment might not be needed for AGI, but it's coming anyway. Let me know what you think in the comments and have a wonderful day.