How Well Can GPT-4 See? And the 5 Upgrades That Are Next
00:00:00.000 |
We all saw that GPT-4 is able to create a website from handwriting on a napkin. 00:00:05.160 |
But with all the news since, the focus on vision has been lost. 00:00:08.940 |
Meanwhile, in the last few hours and days, a select few with full access to multimodal GPT-4 00:00:15.500 |
have been releasing snapshots of what it can do. 00:00:18.340 |
I want to show you not only what is imminent with GPT-4 vision, 00:00:21.680 |
but with releases this week in text to 3D, text inside 3D, speech to text, 00:00:28.120 |
and even embodiment, we're going to see how language and visual model innovations 00:00:33.020 |
are complementing each other and beginning to snowball. 00:00:38.980 |
Do you remember from the GPT-4 technical report when the model was able to manipulate, 00:00:44.120 |
when prompted, a human into solving CAPTCHAs for it? 00:00:49.460 |
It solves this one pretty easily, so no, CAPTCHAs are not going to slow down GPT-4. 00:00:56.240 |
It was able to interpret this complex image and spot elements of a brain tumor. 00:01:01.920 |
Now, it did not spot the full diagnosis, but I want to point something out. 00:01:05.900 |
This paper from OpenAI was released only a few days ago. 00:01:12.680 |
They found that GPT-4 can attain outstanding results, exceeding human performance levels. 00:01:20.400 |
But the images and graphs were not passed to the model. 00:01:23.780 |
And as you can see, when the questions did have 00:01:26.040 |
a lot of media in them, it brought down GPT-4's average. 00:01:29.320 |
It will be very interesting to see GPT-4's results 00:01:32.320 |
when its multimodal capabilities are accounted for. 00:01:35.600 |
Next is humor, and I'm not showing these to say that they're necessarily going to change the world, 00:01:40.360 |
but it does demonstrate the raw intellect of GPT-4. 00:01:45.760 |
To understand a joke, you have to have quite a nuanced understanding of humanity. 00:01:49.080 |
Let's just say that it probably understood this meme quicker than I did. 00:01:52.440 |
Quick thing to point out, by the way, it won't do faces. 00:01:55.840 |
For pretty obvious privacy reasons, they won't allow the model to recognize faces. 00:02:00.400 |
Whether that ability gets jailbroken, only time will tell. 00:02:03.720 |
Meanwhile, it can read menus and interpret the physical world, 00:02:07.720 |
which is an amazing asset for visually impaired people. 00:02:11.080 |
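By the way, if you're curious what that looks like in practice, here's a rough sketch of sending a menu photo to a vision-capable GPT-4 model. It uses the image-input format OpenAI later documented for its chat API; the file name and model name are just placeholders, not anything shown in the demos.

```python
# Hypothetical sketch: sending a photo of a menu to a vision-capable GPT-4 model.
# Assumes the chat-completions image-input format; file and model names are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("menu.jpg", "rb") as f:  # placeholder image file
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable model name works here
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Read this menu and flag anything containing nuts."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```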
I want to move on to another fascinating ability that the vision model inside GPT-4 possesses, 00:02:17.440 |
and that is reading graphs and text from images. 00:02:20.400 |
Its ability to interpret complex diagrams and captions is going to change the world. 00:02:25.640 |
Here it is understanding a complex diagram and caption from the PaLM-E paper released only about three weeks ago. 00:02:34.600 |
But just how good is it at reading text from an image? 00:02:37.240 |
Well, let's take a look at GPT-4's score on the TextVQA benchmark. 00:02:42.400 |
Now, I've covered quite a few of the other benchmarks in other videos. 00:02:47.360 |
Notice how GPT-4 got 78%, which is better than the previous state-of-the-art model, which got 72%. 00:03:02.160 |
This is the original TextVQA academic paper, and you can see some of the sample questions above. 00:03:07.840 |
To be honest, if you want to test your own eyesight, you can try them yourself. 00:03:13.440 |
Well, on page seven, we have this table and we get this figure for humans, 85%. 00:03:19.680 |
You don't need me to tell you that's just seven percentage points better than GPT-4. 00:03:29.560 |
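Quick aside on what those percentages actually measure: TextVQA collects ten human answers per question and uses the standard VQA soft-accuracy rule, where a prediction only gets full credit if at least three annotators gave that same answer. Here's a minimal sketch of that rule, my own illustration rather than the benchmark's official code.

```python
# Sketch of the VQA soft-accuracy rule used by benchmarks like TextVQA:
# a predicted answer scores min(matches / 3, 1), where `matches` is how many
# of the ten human annotators gave exactly that answer.
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    matches = sum(1 for a in human_answers if a.strip().lower() == prediction.strip().lower())
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "stop", so "stop" earns full credit (1.0).
answers = ["stop", "stop sign", "stop", "stop", "sign",
           "stop", "red", "stop sign", "sign", "halt"]
print(vqa_accuracy("stop", answers))
```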
"Scale is all you need until everyone else realizes it too." 00:03:34.000 |
But the point of this video is to show you that improvements in one area are 00:03:37.840 |
starting to bleed into improvements in other areas. 00:03:40.720 |
We already saw that an image of bad handwriting could be translated into a website. 00:03:45.240 |
As you can see here, even badly written natural language can now be 00:03:49.000 |
translated directly into code in Blender, creating detailed 3D models with fascinating physics. 00:03:55.040 |
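For anyone who hasn't scripted Blender before, what GPT-4 is emitting in these demos is ordinary Blender Python. Here's a tiny hand-written sketch of the kind of script such a prompt could produce; the objects and parameters are my own, not taken from the demo.

```python
# Sketch of the kind of Blender Python a "describe a scene, get a 3D model" prompt can produce.
# Run inside Blender's scripting workspace; the bpy module is only available there.
import bpy

# Add a ground plane and make it a passive rigid body so objects can land on it.
bpy.ops.mesh.primitive_plane_add(size=10, location=(0, 0, 0))
bpy.ops.rigidbody.object_add(type='PASSIVE')

# Add a sphere above the plane and make it an active rigid body so it falls under gravity.
bpy.ops.mesh.primitive_uv_sphere_add(radius=0.5, location=(0, 0, 5))
bpy.ops.rigidbody.object_add(type='ACTIVE')

# Jump to frame 1; playing the timeline from here runs the physics simulation.
bpy.context.scene.frame_set(1)
```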
The borders of text, image, 3D and embodiment are beginning to be broken down. 00:04:00.160 |
And of course, other companies are jumping in too. 00:04:02.640 |
Here's Adobe showing how you can edit 3D images using text. 00:04:06.960 |
And how long will it really be before we go direct from text to physical models? 00:04:14.760 |
And it's not just about creating 3D, it's about interacting with it through text. 00:04:19.120 |
Notice how we can pick out both text and higher level concepts like objects. 00:04:24.840 |
The advanced 3D field was captured using 2D images from a phone. 00:04:31.400 |
But notice how now we have language embedded inside the model, 00:04:38.280 |
letting us query concepts like yellow, or even utensils or electricity. 00:04:44.840 |
And for some reason, it really struggled with recognizing ramen. 00:04:48.120 |
But it does represent the state of the art in turning images into 3D that can then be queried through text. 00:04:57.080 |
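Under the hood, the idea is that every point in the reconstructed scene carries a CLIP-style language feature, so a text query can be embedded and matched against the scene by similarity. Here's a toy sketch of just that querying step, with a made-up feature array standing in for the real 3D field; it illustrates the concept, not the paper's actual method.

```python
# Toy sketch of matching a text query against language features embedded in a 3D scene:
# embed the query with CLIP, then rank scene points by cosine similarity.
# The per-point feature array is randomly generated purely for illustration.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Pretend each of 10,000 points in the scene carries a 512-d CLIP-space feature.
point_features = torch.nn.functional.normalize(torch.randn(10_000, 512), dim=-1)

inputs = tokenizer(["utensils"], padding=True, return_tensors="pt")
with torch.no_grad():
    text_feature = model.get_text_features(**inputs)
text_feature = torch.nn.functional.normalize(text_feature, dim=-1)

# Cosine similarity of every point to the query; keep the best-matching points.
similarity = point_features @ text_feature.T            # shape: (10000, 1)
top_points = similarity.squeeze(-1).topk(100).indices
print(f"{len(top_points)} points most relevant to the query 'utensils'")
```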
Just three weeks ago, I did a video on how voice recognition will change everything. 00:05:02.320 |
And I was talking about OpenAI's Whisper API. 00:05:05.760 |
But now we have Conformer, which is better than Whisper. 00:05:11.000 |
And look how Conformer makes fewer errors even than Whisper at recognizing speech. 00:05:16.920 |
The cool thing is you can test it for yourself and the link is in the description. 00:05:20.600 |
And while you're passing by the description, don't forget to leave a like and a 00:05:24.440 |
comment to let me know if you've learned anything from this video. 00:05:29.400 |
It did amazingly at transcribing my recent video on GPT-4. 00:05:34.400 |
There were only a handful of mistakes in a 12 minute transcript. 00:05:38.360 |
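If you want to run the same test on your own audio, transcribing with OpenAI's Whisper API is a single call. Here's a minimal sketch using the current Python client; the file name is a placeholder.

```python
# Minimal sketch of transcribing an audio file with OpenAI's Whisper API.
# The file name is a placeholder; set OPENAI_API_KEY in your environment first.
from openai import OpenAI

client = OpenAI()

with open("my_recent_video.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```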
At this point, you're probably thinking, what's next? 00:05:40.880 |
Well, look at the route sketched out two years ago by Sam Altman. 00:05:45.560 |
He said in the next five years, computer programs that can think will read 00:05:50.000 |
legal documents and give medical advice. With GPT-4 passing the bar, that prediction is already on its way. 00:05:58.200 |
He also said they will do assembly line work and maybe even become companions. 00:06:02.480 |
He's talking about the physical embodiment of language models. 00:06:06.080 |
Back then, OpenAI had a robotics team of their own that could do things like this. 00:06:10.840 |
Here is a robotic hand solving a Rubik's Cube despite interruptions from a giraffe 00:06:17.520 |
and someone poking it with a pen. 00:06:26.240 |
Now it seems like they've moved into investing in startups instead, 00:06:32.800 |
such as 1X, a startup developing a human-like robot. 00:06:39.600 |
On their website, you can find this rather startling image, and it says Summer 2023. 00:06:43.720 |
Our newest android iteration, Neo, will explore how artificial 00:06:47.600 |
intelligence can take form in a human-like body. 00:06:50.240 |
Now, of course, for many of you, a humanoid robot won't be anything new. 00:06:54.840 |
Here is the obligatory clip from Boston Dynamics. 00:07:00.040 |
And of course, these models don't have to be humanoid. 00:07:18.960 |
Here is a demonstration from a paper published just four days ago. 00:07:23.640 |
It's climbing up, balancing, pressing and operating buttons. 00:07:27.520 |
And before you think all of this is really far away, 00:07:30.520 |
these assembly line robots are now commercially available. 00:07:34.240 |
I still think there's a long way to go before embodiment becomes mainstream. 00:07:39.520 |
All these improvements that we're seeing in text, audio, 3D and embodiment, 00:07:44.320 |
they're starting to merge into each other, complement each other. 00:07:49.520 |
But once they start synergizing, fusing together, they could be far more than the sum of their parts. 00:07:54.320 |
As Sam Altman said on the Lex Fridman podcast released yesterday, 00:07:58.360 |
embodiment might not be needed for AGI, but it's coming anyway. 00:08:02.960 |
Let me know what you think in the comments and have a wonderful day.