How Well Can GPT-4 See? And the 5 Upgrades That Are Next


00:00:00.000 | We all saw that GPT-4 is able to create a website from handwriting on a napkin.
00:00:05.160 | But with all the news since, the focus on vision has been lost.
00:00:08.940 | Meanwhile, in the last few hours and days, a select few with full access to multimodal GPT-4
00:00:15.500 | have been releasing snapshots of what it can do.
00:00:18.340 | I want to show you not only what is imminent with GPT-4 vision,
00:00:21.680 | but with releases this week in text to 3D, text inside 3D, speech to text,
00:00:28.120 | and even embodiment, we're going to see how language and visual model innovations
00:00:33.020 | are complementing each other and beginning to snowball.
00:00:36.940 | But let's start with images.
00:00:38.980 | Do you remember from the GPT-4 technical report when the model was able to manipulate,
00:00:44.120 | when prompted, a human into solving CAPTCHAs for it?
00:00:47.480 | Well, that may no longer be needed.
00:00:49.460 | It solves this one pretty easily, so no, CAPTCHAs are not going to slow down GPT-4.
00:00:54.440 | Next, medical imagery.
00:00:56.240 | It was able to interpret this complex image and spot elements of a brain tumor.
00:01:01.920 | Now, it did not spot the full diagnosis, but I want to point something out.
00:01:05.900 | This paper from OpenAI was released only a few days ago,
00:01:09.660 | and it tested GPT-4 on medical questions.
00:01:12.680 | They found that GPT-4 can attain outstanding results exceeding human performance levels,
00:01:18.320 | and that was without vision.
00:01:20.400 | The images and graphs were not passed to the model.
00:01:23.780 | And as you can see, when the questions did have
00:01:26.040 | a lot of media in them, it brought down GPT-4's average.
00:01:29.320 | It will be very interesting to see GPT-4's results
00:01:32.320 | when its multimodal capabilities are accounted for.
00:01:35.600 | Next is humor, and I'm not showing these to say that they're necessarily going to change the world,
00:01:40.360 | but it does demonstrate the raw intellect of GPT-4.
00:01:43.680 | To suss out why these images are funny,
00:01:45.760 | you have to have quite a nuanced understanding of humanity.
00:01:49.080 | Let's just say that it probably understood this meme quicker than I did.
00:01:52.440 | Quick thing to point out, by the way, it won't do faces.
00:01:55.840 | For pretty obvious privacy reasons, they won't allow the model to recognize faces.
00:02:00.400 | Whether that ability gets jailbroken, only time will tell.
00:02:03.720 | Meanwhile, it can read menus and interpret the physical world,
00:02:07.720 | which is an amazing asset for visually impaired people.
00:02:11.080 | I want to move on to another fascinating ability that the vision model inside GPT-4 possesses,
00:02:17.440 | and that is reading graphs and text from images.
00:02:20.400 | Its ability to interpret complex diagrams and captions is going to change the world.
00:02:25.640 | Here it is understanding a complex diagram and caption from the PaLM-E paper released only about three weeks ago,
00:02:32.840 | which I have done a video on, by the way.
00:02:34.600 | But just how good is it at reading text from an image?
00:02:37.240 | Well, let's take a look at GPT-4's score on the TextVQA benchmark.
00:02:42.400 | Now, I've covered quite a few of the other benchmarks in other videos,
00:02:45.560 | but I want to focus on this one here.
00:02:47.360 | Notice how GPT-4 got 78%, which is better than the previous state of the art model, which got 72%.
00:02:54.240 | Now, try to remember that
00:02:55.440 | 78% figure.
00:02:56.760 | What exactly is this testing, you ask?
00:02:59.040 | Well, reading text from complex images.
00:03:02.160 | This is the original TextVQA academic paper, and you can see some of the sample questions above.
00:03:07.840 | To be honest, if you want to test your own eyesight, you can try them yourself.
00:03:11.240 | So how does the average human perform?
00:03:13.440 | Well, on page seven, we have this table and we get this figure for humans, 85%.
00:03:19.680 | You don't need me to tell you that's just 7 percentage points better than GPT-4.
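
For context on where that 78% figure comes from: TextVQA scores a model's answer against ten human answers using the standard VQA accuracy formula. Here is a minimal sketch of that scoring in Python, simplified from the official evaluation (which additionally normalizes answer strings and averages over subsets of the human answers):

```python
# Simplified VQA-style accuracy as used by TextVQA: an answer counts as
# fully correct if at least 3 of the 10 human annotators gave it.
def vqa_accuracy(model_answer: str, human_answers: list[str]) -> float:
    normalized = model_answer.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)

def benchmark_score(predictions: list[str], annotations: list[list[str]]) -> float:
    # Mean per-question accuracy, reported as a percentage (e.g. 78.0).
    scores = [vqa_accuracy(p, humans) for p, humans in zip(predictions, annotations)]
    return 100.0 * sum(scores) / len(scores)
```
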
00:03:24.120 | The thing is, though,
00:03:25.240 | these models aren't slowing down.
00:03:26.880 | As the Vision co-lead at OpenAI put it,
00:03:29.560 | "Scale is all you need until everyone else realizes it too."
00:03:34.000 | But the point of this video is to show you that improvements in one area are
00:03:37.840 | starting to bleed into improvements in other areas.
00:03:40.720 | We already saw that an image of bad handwriting could be translated into a website.
00:03:45.240 | As you can see here, even badly written natural language can now be
00:03:49.000 | translated directly into code in Blender, creating detailed 3D models with fascinating physics.
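
To make that concrete, here is a hypothetical sketch of the kind of Blender Python (bpy) script a model might emit for a rough prompt like "drop a shiny ball onto the floor"; the objects and parameters are illustrative, not taken from the demo shown in the video:

```python
# Hypothetical model-generated Blender script (run from Blender's scripting tab):
# a passive ground plane plus an active rigid-body sphere that falls onto it.
import bpy

# Ground plane for the ball to land on.
bpy.ops.mesh.primitive_plane_add(size=10, location=(0, 0, 0))
ground = bpy.context.active_object
bpy.ops.rigidbody.object_add()
ground.rigid_body.type = 'PASSIVE'

# Sphere a few metres up, simulated as an active rigid body so it drops.
bpy.ops.mesh.primitive_uv_sphere_add(radius=0.5, location=(0, 0, 5))
ball = bpy.context.active_object
bpy.ops.rigidbody.object_add()
ball.rigid_body.type = 'ACTIVE'
```
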
00:03:55.040 | The borders of text, image, 3D and embodiment are beginning to be broken down.
00:04:00.160 | And of course, other companies are jumping in too.
00:04:02.640 | Here's Adobe showing how you can edit 3D images using text.
00:04:06.960 | And how long will it really be before we go direct from text to physical models,
00:04:12.240 | all mediated through natural language?
00:04:14.760 | And it's not just about creating 3D, it's about interacting with it through text.
00:04:19.120 | Notice how we can pick out both text and higher level concepts like objects.
00:04:24.840 | The advanced 3D field was captured using 2D images from a phone.
00:04:29.160 | This paper was released only 10 days ago.
00:04:31.400 | But notice how now we have language embedded inside the model.
00:04:35.320 | We can search and scan for more abstract
00:04:38.280 | concepts like yellow or even utensils or electricity.
00:04:43.680 | It's not perfect.
00:04:44.840 | And for some reason, it really struggled with recognizing ramen.
00:04:48.120 | But it does represent the state of the art in image-to-3D, interpreted through text.
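
The underlying idea behind this kind of language-embedded 3D field, roughly, is that every point in the reconstructed scene carries a language feature vector, and a text query is matched against those features by embedding similarity. Here is a toy sketch with random stand-in features; a real system would use a CLIP-style text encoder and features rendered from the 3D field:

```python
# Toy sketch of querying a language-embedded 3D scene: score every point's
# feature vector against a text query embedding by cosine similarity.
# The vectors here are random placeholders, not real CLIP features.
import numpy as np

rng = np.random.default_rng(0)
point_features = rng.standard_normal((10_000, 512))  # one vector per 3D point
query_embedding = rng.standard_normal(512)           # stand-in for encode_text("utensils")

scores = point_features @ query_embedding / (
    np.linalg.norm(point_features, axis=1) * np.linalg.norm(query_embedding) + 1e-8
)
top_points = np.argsort(scores)[-10:]  # indices of the points most relevant to the query
```
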
00:04:54.640 | And maybe you don't even want to type.
00:04:55.720 | You just want to use your voice.
00:04:57.080 | Just three weeks ago, I did a video on how voice recognition will change everything.
00:05:02.320 | And I was talking about OpenAI's Whisper API.
00:05:05.760 | But now we have Conformer, which is better than Whisper.
00:05:09.280 | Here is the chart to prove it.
00:05:11.000 | And look how Conformer makes fewer errors than even Whisper at recognizing speech.
00:05:16.920 | The cool thing is you can test it for yourself and the link is in the description.
00:05:20.600 | And while you're passing by the description, don't forget to leave a like and a
00:05:24.440 | comment to let me know if you've learned anything from this video.
00:05:26.880 | As you'd expect, I tested it myself and it
00:05:29.400 | did amazingly at transcribing my recent video on GPT-4.
00:05:34.400 | There were only a handful of mistakes in a 12 minute transcript.
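
For reference, the Whisper route mentioned earlier takes only a few lines of code. This is a minimal sketch against OpenAI's audio transcription endpoint as it was exposed in the pre-1.0 openai Python library; the file name is a placeholder, and Conformer is accessed through its own provider rather than this API:

```python
# Minimal sketch: transcribing an audio file with OpenAI's Whisper API
# using the pre-1.0 openai Python library. The file name is a placeholder.
import openai

openai.api_key = "sk-..."  # your API key

with open("my_gpt4_video.mp3", "rb") as audio_file:
    result = openai.Audio.transcribe("whisper-1", audio_file)

print(result["text"])  # the full transcript as plain text
```
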
00:05:38.360 | At this point, you're probably thinking, what's next?
00:05:40.880 | Well, look at the route sketched out two years ago by Sam Altman.
00:05:45.560 | He said in the next five years, computer programs that can think will read
00:05:50.000 | legal documents and give medical advice. With GPT-4 passing the bar,
00:05:54.240 | I would say so far he's two for two.
00:05:56.400 | He goes on, in the next decade,
00:05:58.200 | they will do assembly line work and maybe even become companions.
00:06:02.480 | He's talking about the physical embodiment of language models.
00:06:06.080 | Back then, OpenAI had a robotics team themselves that could do things like this.
00:06:10.840 | Here is a robotic hand solving a Rubik's Cube despite interruptions from a giraffe toy
00:06:17.520 | and someone using a pen to interfere with the model.
00:06:21.600 | It still solved the cube.
00:06:24.040 | But then that team got disbanded and it
00:06:26.240 | seems like they've moved into investing in startups.
00:06:29.040 | They are leading a $23 million investment
00:06:32.800 | in 1X, a startup developing a human-like robot.
00:06:37.280 | Here is the 1X website and it features
00:06:39.600 | this rather startling image and it says Summer 2023.
00:06:43.720 | Our newest android iteration, NEO, will explore how artificial
00:06:47.600 | intelligence can take form in a human-like body.
00:06:50.240 | Now, of course, for many of you, a humanoid robot won't
00:06:53.840 | be that surprising.
00:06:54.840 | Here is the obligatory clip from Boston Dynamics.
00:07:00.040 | And of course, these models don't have to be humanoid.
00:07:18.960 | Here is a demonstration from a paper published just four days ago.
00:07:22.440 | This is not just walking.
00:07:23.640 | It's climbing up, balancing, pressing and operating buttons.
00:07:27.520 | And before you think all of this is really far away,
00:07:30.520 | these assembly line robots are now commercially available.
00:07:34.240 | I still think there's a long way to go before embodiment becomes mainstream.
00:07:38.240 | But my point is this.
00:07:39.520 | All these improvements that we're seeing in text, audio, 3D and embodiment,
00:07:44.320 | they're starting to merge into each other, complement each other.
00:07:47.520 | On their own, they're cool and a bit nerdy.
00:07:49.520 | But once they start synergizing, fusing together, they could be
00:07:53.440 | revolutionary.
00:07:54.320 | As Sam Altman said on the Lex Fridman podcast released yesterday,
00:07:58.360 | embodiment might not be needed for AGI, but it's coming anyway.
00:08:02.960 | Let me know what you think in the comments and have a wonderful day.