RT-X and the Dawn of Large Multimodal Models: Google Breakthrough and 160-page Report Highlights
00:00:00.000 |
Just as I was reaching the finishing pages of this epic 168 page report on GPT Vision, 00:00:07.020 |
which showcased unexpected abilities, novel use cases and predictions for the future, 00:00:11.880 |
Google dropped their colossal RT-X endeavor. And the data Google used, with over 500 skills 00:00:19.180 |
and 150,000 tasks is open source. I've picked out over 75 highlights from both papers, 00:00:25.980 |
which I read in full, and I'll also bring in an exclusive report from The Information to give us 00:00:31.520 |
an idea of what is next. By the end of the video, I want you to have a great sense for what GPT 00:00:37.940 |
with Vision can do today and a more acute awareness of what is coming in Vision, Video 00:00:43.320 |
and Robotics tomorrow. But let's start with this Google DeepMind report released just a few hours 00:00:49.460 |
ago. Essentially, what it showed is that you could create a general purpose robot from data 00:00:55.720 |
drawn from diverse robotic tasks. These were datasets from different 00:00:59.960 |
universities on different continents. And Google wanted to see if this diverse data could improve 00:01:06.100 |
their famous RT2 model. I talked a lot more about RT2 in the video you can see on the screen, 00:01:11.560 |
but essentially it was trained on web data as well as robotics data. That meant that it could 00:01:16.620 |
understand questions like "pick up the extinct animal." But the RT-X series is another step up, 00:01:22.020 |
even though it comes just two months later. The report highlighted that, 00:01:25.940 |
conventionally, robotic learning methods train a separate model for every application, 00:01:30.340 |
every robot and even every environment. Their big finding was that training a single model 00:01:35.620 |
on that diverse data enabled that robot to outperform even the specialist robots. 00:01:41.600 |
The improved version of RT1 became RT1X and RT2 became RT2X. But here you can see RT1X 00:01:50.440 |
out-competing specialist models in a range of domains, weirdly aside from this wiping robot at Berkeley. 00:01:55.920 |
You've got kitchen manipulation, cable routing, door opening, 00:02:00.800 |
and many more tasks that I'll get to in a second. The paper even demonstrated applicability to 00:02:05.960 |
robot arms and even quadrupeds. Think four-legged dog-like robots. And here you can see how it 00:02:12.260 |
improved RT2, which was already a big improvement on RT1. And as Google says, 00:02:18.080 |
our results suggest that co-training with data from other platforms imbues RT2X with additional skills 00:02:25.900 |
that were not present in the original dataset, enabling it to perform novel tasks. 00:02:30.540 |
Apparently it couldn't do things like this before. Move the apple between the can and the orange, 00:02:36.580 |
or move the apple near but not on top of the cloth, or move the apple on top of the pot. 00:02:43.440 |
This gives you a taste of the kind of skills they incorporated. Picking, moving, pushing, placing, sliding, 00:02:49.480 |
putting, navigating, separating, pointing, and on and on. I like that at the end we have assembling and turning on. 00:02:55.880 |
The paper draws the analogy with large language models that are trained on massive web-scale text 00:03:01.880 |
data. And those models tend to outperform systems that are only trained on narrow task-specific 00:03:07.500 |
datasets. Well, same thing now with robotics. That's why I call this the GPT-2 moment for 00:03:13.740 |
robotics. And the paper says even if such robotic datasets in their current size and coverage are 00:03:19.240 |
insufficient to attain the impressive generalization results that have been demonstrated by LLMs, 00:03:25.860 |
in the future the union of such data can potentially provide this kind of coverage. 00:03:32.080 |
One way to put it would be to think of how general and multi-skilled GPT-4 is in language. 00:03:37.840 |
From coding to poetry and mathematics and more. Now imagine that, but with robotic skills. 00:03:43.540 |
Now one question you might have is how would data on folding clothes help with pushing an apple? 00:03:49.620 |
Well they say that unlike most prior works, we directly train our policy on all of this X-embodiment 00:03:55.840 |
data without any mechanisms to reduce the embodiment gap. They didn't try any big 00:04:01.020 |
translation between domains or breaking down the problem into sub-aspects. They did however put the 00:04:06.720 |
input image data into a common resolution and unified the action vectors across these seven 00:04:12.920 |
dimensions. 00:04:18.280 |
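The audio doesn't spell out what the seven dimensions are, but a common convention for this kind of end-effector action space is x/y/z translation, roll/pitch/yaw rotation, and a gripper command. Here is a minimal sketch, under that assumption, of what "unifying action vectors" across datasets with different native ranges could look like; the layout and ranges are illustrative, not the paper's spec.

```python
import numpy as np

# Assumed 7-D action layout: (dx, dy, dz, droll, dpitch, dyaw, gripper).
# The exact layout and ranges below are illustrative, not the paper's spec.
ACTION_DIM = 7

def unify_action(raw: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Rescale one robot's native action range into a shared [-1, 1] convention."""
    assert raw.shape == low.shape == high.shape == (ACTION_DIM,)
    return 2.0 * (raw - low) / (high - low) - 1.0

# Example: a dataset whose pose deltas run -0.05..0.05 and whose gripper command runs 0..255.
low = np.array([-0.05] * 6 + [0.0])
high = np.array([0.05] * 6 + [255.0])
print(unify_action(np.array([0.01, 0.0, -0.02, 0.0, 0.0, 0.005, 200.0]), low, high))
```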
Now that 55 billion parameter number, and the fact that it comes from the PaLI model that's undergirding RT2X, is the perfect segue to GPT Vision. After all, OpenAI directly 00:04:25.820 |
compared their GPT-4 vision model to PaLI 17 billion, which was a precursor model to PaLI-X. 00:04:33.580 |
And you might notice that at least for visual question answering, the PaLI model, that's the 00:04:38.780 |
precursor 17 billion parameter model, outperformed GPT-4 vision. So everything you're about to see 00:04:45.100 |
from the 168 page GPT-4 vision report actually represents the lower bound of current frontier capability, 00:04:55.800 |
since GPT-4 vision is outperformed, at least on that benchmark, by PaLI 17 billion parameters, which is beaten by PaLI-X 55 billion parameters, which has now been 00:05:03.300 |
incorporated into RT2 the robot. And all of that is not even to bring in OpenAI's Gobi model or Google 00:05:10.620 |
Gemini, which I'll talk about later. So the main focus of the video, the dawn of large multimodal 00:05:17.160 |
models. A huge 160 plus page report from Microsoft. It was released just a few days ago and to be 00:05:25.780 |
honest, it deserves an entire video on its own. 00:05:27.780 |
I'm going to give you all of the highlights here in this video. So please do leave a like or subscribe 00:05:32.420 |
if you find it helpful. For all of the fascinating demos you're about to see, Microsoft says that they 00:05:38.100 |
carefully controlled both the images and text to prevent them from being seen during GPT-4V training. 00:05:44.900 |
The images were either not accessible online or had a timestamp beyond April 2023. Their headline 00:05:51.380 |
finding was that GPT vision shows impressive human level capabilities across a wide range of domains, 00:05:55.760 |
and by the end they're proposing agent structures and testing for self-consistency it gets pretty 00:06:16.800 |
wild but it's time for the first demo they showed gpt vision this table with the drinks that they 00:06:22.080 |
had ordered and they took a photo of the menu they asked how much should i pay for the beer 00:06:26.980 |
on the table according to the price on the menu and gpt vision got it we're starting a little slow 00:06:31.940 |
here but imagine you're drunk on the beach and you don't even know what you've ordered this could be 00:06:36.140 |
useful next was gpt vision putting the information from a driver's license into json format now first 00:06:42.800 |
time round it wasn't perfect listing his hair color as non-applicable when the license says brown 00:06:48.060 |
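For context, structured extraction like this usually means pinning the model to a fixed schema up front. A rough sketch of the kind of prompt and target fields you might use for the license demo; the field names are my own illustration, not the report's exact schema.

```python
# Hypothetical field list for the driver's-license demo; keys are illustrative only.
expected_fields = {
    "first_name": None,
    "last_name": None,
    "date_of_birth": None,   # e.g. "1990-01-31"
    "license_number": None,
    "hair_color": None,      # the field GPT-4V initially returned as "N/A" instead of "brown"
    "eye_color": None,
    "expiration_date": None,
}

prompt = (
    "Extract the following fields from the driver's license in the image and return "
    f"valid JSON with exactly these keys: {sorted(expected_fields)}. "
    "Use null for anything you cannot read."
)
print(prompt)
```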
but there are ways to improve performance as we'll see later in fact the first way of 00:06:52.800 |
improving performance is right now that's using chain of thought chain of thought is basically a 00:06:57.840 |
way of getting the model to put out its intermediate reasoning often by using a phrase like let's think 00:07:03.620 |
step by step as you can see here the first time around the model couldn't identify 00:07:06.120 |
that there's 11 apples and actually even when they used let's 00:07:11.220 |
think step by step it still got it wrong what about let's count the apples row by row nope 00:07:16.140 |
it got the right final answer but got the rows mixed up they tried some other prompts but finally 00:07:21.480 |
settled on this one you are an expert in counting things in the image and this time it got it right 00:07:26.700 |
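For reference, here's roughly what that "expert" plus step-by-step prompt looks like as an API call. This is a sketch using the OpenAI Python SDK; the model name, image URL and exact wording are placeholders rather than the report's verbatim prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Sketch of the "expert" + step-by-step counting prompt described above.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "You are an expert in counting things in the image. "
                    "Count the apples row by row, then give the total."
                )},
                {"type": "image_url", "image_url": {"url": "https://example.com/apples.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```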
now all those new methods you've been seeing like llms as prompt optimizers or promptbreeder it 00:07:32.100 |
looks like they are going to be equally applicable to gpt vision in fact after 00:07:36.100 |
that example they say that throughout the paper we're going to employ that technique of calling 00:07:40.240 |
it an expert across various scenarios for better performance but here is one of the big revelations 00:07:46.040 |
from the paper for me the paper says that this particular ability isn't seen in existing models 00:07:51.520 |
and that is being able to follow pointers that might be circles squares or even arrows drawn 00:07:57.960 |
on a diagram and the amazing thing is this seems to work better than giving gpt vision coordinates 00:08:03.920 |
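To make that concrete, this kind of visual referring prompting just means editing the pointer into the pixels before the image is sent, rather than describing coordinates in text. A minimal sketch with Pillow; the file names, coordinates and label are placeholders.

```python
from PIL import Image, ImageDraw

# Draw the pointer onto the image itself instead of passing numeric coordinates.
img = Image.open("scene.jpg").convert("RGB")
draw = ImageDraw.Draw(img)

# Circle the object of interest and tag it, roughly as in the report's figures.
x, y, r = 420, 310, 60
draw.ellipse((x - r, y - r, x + r, y + r), outline=(255, 0, 0), width=6)
draw.text((x - r, y - r - 28), "object 1", fill=(255, 0, 0))

img.save("scene_annotated.jpg")  # this annotated image is what you'd send to GPT-4V
```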
the researchers from microsoft drew these 00:08:06.080 |
arrows onto the photo putting the label object one 00:08:09.260 |
and object two and gpt vision analyzes those objects perfectly but there was something that 00:08:14.100 |
i noticed which is that technically the green arrow is pointing to the pavement and the red 00:08:20.220 |
arrow is pointing to the table i should say that the arrows end at those places and instead of 00:08:26.020 |
interpreting the literal end of the arrows it sussed what the human meant it meant the nearest 00:08:32.060 |
big object the glass bottle and the beer i know that's something small but i did 00:08:36.060 |
find that really impressive. The next big finding was that in context few shot learning is still 00:08:41.440 |
really crucial even for vision models. In context means as part of the prompt not in the pre-training 00:08:47.320 |
and few shot just means giving a few examples before you ask your key question. And the proof 00:08:52.580 |
of concept examples that you're about to see they say vividly demonstrate the rising significance 00:08:57.580 |
of in context few shot learning for achieving improved performance with large multimodal 00:09:03.600 |
models. I'm going to speed through this example just so you get the gist. Essentially they asked 00:09:08.080 |
it to read the speed on this speed meter. It gets it wrong and even when they ask it to think step 00:09:13.400 |
by step it still gets it wrong. Then they gave it instructions and it still got it wrong. They 00:09:18.240 |
tried loads of things but it just couldn't seem to do it. Even when they gave it one example one 00:09:23.480 |
shot it still got it wrong. You can see in the prompt they gave a correct worked example and 00:09:29.000 |
then asked the question again but no it said just passing 70 miles an hour. 00:09:33.440 |
But finally, two-shot with two worked examples, it then gets it right. 00:09:39.640 |
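Mechanically, "two-shot" here just means interleaving two solved image/answer pairs ahead of the real query. A sketch of that message structure, which would be passed to the same chat-completions call as in the earlier snippet; the URLs, wording and readings are invented for illustration, not the report's actual examples.

```python
# Two worked examples (image + assistant answer), then the query image.
def image_msg(url, text):
    return {"role": "user", "content": [
        {"type": "text", "text": text},
        {"type": "image_url", "image_url": {"url": url}},
    ]}

messages = [
    image_msg("https://example.com/speedo_a.jpg", "What speed is shown?"),
    {"role": "assistant", "content": "The needle is just past 80, so about 82 mph."},
    image_msg("https://example.com/speedo_b.jpg", "What speed is shown?"),
    {"role": "assistant", "content": "The needle sits between 40 and 60, closer to 40: about 45 mph."},
    image_msg("https://example.com/speedo_query.jpg", "What speed is shown?"),
]
```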
Next they showed that the model could recognize celebrities and there was one particularly interesting example of Jensen 00:09:44.940 |
Huang. He's the CEO of Nvidia which produces the GPUs that went into training GPT-4 vision. Anyway 00:09:51.600 |
it could apparently recognize its own ingredients saying he's likely holding a GPU. Next they had it 00:09:57.720 |
recognizing landmarks even if they were at weird angles or at night. It could recognize dishes 00:10:03.180 |
even if they had toppings and condiments. It also did really pretty well in medical image 00:10:08.900 |
understanding identifying what was wrong with this particular foot. You can see it also working 00:10:13.660 |
with a CT scan. Of course before we get too excited our old friend hallucination is still 00:10:19.420 |
there. It described a bridge overpass that I frankly can't see at all. It skipped over North 00:10:24.920 |
Carolina entirely when asked about the states shown on this map. And it also gets seemingly 00:10:30.360 |
random numbers wrong. Take this table where it noted 00:10:33.400 |
down almost every number correctly down to three decimal places but then for some reason the 15 00:10:39.640 |
million 971 880 became 15 971 421 actually i've just noticed while filming that that's the same 00:10:50.200 |
ending as the profit in the next column so maybe there was a reason but it's still pretty random 00:10:55.400 |
point is you still can't fully rely on the outputs and it seems to me that in figure 36 there was a 00:11:01.000 |
mistake that even the researchers didn't notice if i'm right that shows how pernicious these mistakes 00:11:06.120 |
can be the researchers say that the model not only understands the program in the given flowchart but 00:11:11.720 |
also translates the details to a python code if you go to figure 36 you see this flowchart and it 00:11:18.040 |
was asked can you translate the flowchart to a python code you can see the flowchart and you can 00:11:23.640 |
see the code now obviously it's impressive that it can even attempt to do this but that code is dodgy 00:11:29.000 |
taking a string as input, 00:11:30.600 |
essentially a bunch of letters instead of a floating point number now the goal was to print 00:11:35.640 |
the larger number and that's what gpt vision says in the explanation that it will do but that input 00:11:40.920 |
problem that i mentioned means that it returned three when comparing three with twenty 00:11:46.680 |
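To make the bug concrete: if the generated code reads its values with input() and never converts them, Python compares them as strings, character by character, so "3" beats "20". A hypothetical reproduction of that failure mode, not the exact code from figure 36.

```python
# Comparing the raw strings instead of numbers makes "3" > "20",
# because '3' > '2' in the first character comparison.
a = "3"    # what input() would return, left unconverted
b = "20"

print(max(a, b))                # "3"  -> the wrong answer the generated code gave
print(max(float(a), float(b)))  # 20.0 -> the intended numeric comparison
```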
the paper also called this answer correct when averaging these two numbers which comes out to 76.555 00:11:52.840 |
that rounds to 76 dollars and 56 cents not 76 dollars and 55 cents now you might say all of this 00:12:00.200 |
is pedantic but the errors keep coming the paper says that in the bottom row of figure 37 gpt vision 00:12:06.600 |
shows a clear understanding of both x and y axis and explains the key insight presented in the 00:12:12.600 |
chart go to figure 37 and you get this chart the key insight to me from this chart is that 00:12:18.840 |
publishing bad okay or pretty good papers makes almost no difference it's only when they get 00:12:24.280 |
very creative original and good that it makes lots of impact on your career now gpt vision 00:12:29.800 |
does say that publishing a bad paper has little impact on your career and a creative paper has 00:12:35.800 |
significant impact correct but then it says the impact of the paper on a person's career increases 00:12:41.000 |
as the quality of the paper improves now while that's technically correct it misses the key 00:12:46.120 |
insight basically a flat line and then a sudden upward turn anyway loads of errors but let's focus 00:12:52.360 |
on the potential use cases because we must remember that gpt vision is still capable of things 00:12:58.760 |
like this 00:12:59.400 |
a guy on twitter or x daniel litt said this i've been told gpt4 with code interpreter is good at 00:13:05.800 |
math he was taking the mickey because the output is this can you compute the seventh root of three 00:13:11.640 |
to the power of seven now the seventh root is the opposite of raising a number to the power of seven 00:13:16.440 |
so the answer should be three 00:13:21.240 |
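As a quick sanity check (my own snippet, not from the thread):

```python
# The seventh root undoes raising to the seventh power: (3**7) ** (1/7) == 3.
print(3 ** 7)               # 2187
print((3 ** 7) ** (1 / 7))  # ~3.0 (2.9999999999999996 after float rounding), certainly not 4.2
```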
but it said the seventh root of three to the seven is approximately 4.2 but then someone else put that image into gpt vision and said why is this tweet funny and gpt 00:13:29.000 |
vision was able to pick up on the humor the humor in this tweet arises from the mathematical 00:13:35.080 |
inconsistency the question posed to gpt4 with code interpreter asked for the seventh root of three to 00:13:40.600 |
the seven mathematically the seventh root of three to the seven is simply three it corrected its own 00:13:45.560 |
error the incongruity between the question and the answer creates a comedic effect so it was not only 00:13:51.480 |
able to correct its error it was able to see the humor in someone pointing that out but then 00:13:56.520 |
someone went even more meta pasting this entire thing into gpt vision saying why is this analysis 00:14:03.560 |
funny and then gpt vision is able to summarize the entire situation describing gpt4 itself 00:14:10.520 |
as analyzing its own mistake the irony lies in gpt4 critiquing its own incorrect answer 00:14:16.680 |
i would have given it bonus points if it said and here i am talking about it but let's not 00:14:20.920 |
ask for too much this is already impressive speaking of gpt vision being a bit more critical 00:14:26.200 |
it was asked this question how many families are earning more than 13 000 and owns more than two 00:14:32.440 |
cars the question is very ambiguous it gave no time period earning more than 13 000 a month a 00:14:38.600 |
year and it talked about owning the cars when the table was just about vehicles per family of course 00:14:44.200 |
a family might have a vehicle without owning it i'd have loved it if gpt vision picked up on this 00:14:49.720 |
ambiguity in 00:14:50.520 |
the question and asked clarifying questions instead it outputted a reasonable answer based 00:14:55.960 |
on a few assumptions and the paper marked it as correct they did show it analyzing a full academic 00:15:02.120 |
paper and making only a few mistakes though and to me that shows some pretty crazy potential 00:15:08.120 |
especially for the next model down the line imagine a model being able to read all ai related 00:15:13.480 |
papers in any language and synthesize some of the findings that's when things might get a little out 00:15:20.120 |
control i do think the paper gets a little bit over eager in places though for example here it 00:15:25.320 |
fed gpt vision a series of frames depicting a player taking a penalty as you can see in the 00:15:31.320 |
last frame the ball is in the net gpt vision correctly said the ball was not blocked by the 00:15:37.640 |
goalkeeper the conclusion of the paper is that this demonstrates cause and effect reasoning by 00:15:42.760 |
determining whether the ball was blocked based on the goalkeeper ball interaction but to me it could 00:15:47.640 |
be simple memorization based on the 00:15:49.720 |
web scale of data it was trained on for example it might have seen many many images of a ball in the 00:15:55.160 |
back of a net and it understands that those images correspond to a penalty not being saved you can let 00:16:00.760 |
me know in the comments if you think this demonstrates a considerable level of sophistication 00:16:06.120 |
in the model's reasoning abilities i was really impressed by this though they sent 00:16:10.360 |
gpt vision a series of photos and highlighted one of the guys in the photo and it was able to deduce 00:16:16.840 |
that he is playfully pretending to throw a punch 00:16:23.640 |
now i am sure that many models and even quite a few humans might think that these images depict 00:16:29.400 |
a real punch but if you look at it carefully it does seem like he's playing so that was really 00:16:34.200 |
impressive to me it could also identify south park characters just from ascii art that's despite it 00:16:39.800 |
not being able to generate good ascii art currently itself or maybe it can but the reinforcement 00:16:45.320 |
learning has drained it of that ability anyway it is able to read emotions of people from their faces 00:16:48.920 |
so if you one day approach a gpt vision model looking like this it's going to know what you're 00:16:54.600 |
thinking i don't know if this quite counts as emotional intelligence or empathy though those 00:16:59.880 |
were some of the words used by the paper i did find it interesting though that they said that 00:17:04.840 |
understanding anger, awe and fear will be essential in use cases such as home robots i'm not sure if 00:17:11.720 |
they're anticipating many people being angry in awe or in fear of their home robot which 00:17:17.160 |
presumably they bought 00:17:18.520 |
maybe it finds faces easier to read than helmets because it says there are eight people in this 00:17:23.560 |
image wearing helmets and as i speculated previously it is able to iterate on those prompts 00:17:29.240 |
it noticed how this output didn't match the original request have it look like a graphic 00:17:34.680 |
novel and then it suggested improvements to the prompt as i've said before imagine this combined 00:17:39.800 |
with dall-e 3 with constant iterations it might take a bit longer but only the output that gets 10 out 00:17:45.640 |
of 10 from gpt vision would be handed to you 00:17:50.280 |
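None of that iterative loop is in the Microsoft report; it's the speculative combination described above. A rough sketch of how such a generate-critique-regenerate loop could be wired up with the OpenAI SDK; the model names, wording and the naive "feed the critique back as the next prompt" step are all assumptions.

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set; model names are placeholders

client = OpenAI()

def critique(image_url: str, request: str) -> str:
    """Ask the vision model to score the image against the request and suggest a better prompt."""
    r = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": (
                f"Rate this image 1-10 against the request '{request}'. "
                "If it scores below 10, reply with an improved DALL-E prompt."
            )},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
        max_tokens=200,
    )
    return r.choices[0].message.content

request = "a street scene that looks like a graphic novel"
prompt = request
for _ in range(3):  # a few refinement rounds
    img = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    feedback = critique(img.data[0].url, request)
    print(feedback)
    prompt = feedback  # naive: feed the critique straight back as the next prompt
```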
some of you may know that steve wozniak proposed a somewhat peculiar test for agi could a machine 00:17:56.440 |
enter the average american home and figure out how to make coffee as wikipedia says this has 00:18:01.400 |
not yet been completed but it might not be far away after all gpt vision was able to figure out 00:18:08.040 |
the buttons on a coffee machine and then it could work its way through a house via images 00:18:14.280 |
to enact a plan for example it wanted to go to the fridge and it 00:18:17.720 |
proposed a series of actions turn right and move forward toward the hallway then next when it was 00:18:23.080 |
in a different position it said i would now turn right and move toward the kitchen it goes on that 00:18:28.680 |
it would head toward the fridge and finally in this example it would now open the fridge door 00:18:33.720 |
and retrieve the requested item now you might say oh that's all well and good having the plan 00:18:39.000 |
and being able to use vision to propose a plan but that's still not the same as being 00:18:43.480 |
dexterous enough to actually pour the coffee let alone get out the cups from the fridge and then 00:18:47.320 |
handle them but of course we started the video with the rtx series we're getting close to that 00:18:53.240 |
level of manipulation i could honestly see that task being achieved in the next three years or 00:18:59.240 |
perhaps even sooner if a team went straight out to achieve it next they showed gpt vision being 00:19:05.000 |
able to handle a computer screen it knew at least the general direction of where to click and what 00:19:10.840 |
to do next you can see it here via the researchers navigating google search and it's a very simple 00:19:16.920 |
search it does still have problems with exact coordinates though so its clicks might be a little 00:19:21.960 |
inaccurate also of course it still hallucinates here it is trying to read the news and it decides 00:19:27.880 |
to close the tab by clicking the x in the top right corner of course that would not just close 00:19:33.880 |
the current tab it can also handle phone screens and even phone notifications though it does make 00:19:39.400 |
one key mistake the sender yyk hahaha has sent the message i see you are in seattle let's meet 00:19:46.520 |
up and gpt vision proposes this let's move my finger to the maps app icon and that will allow 00:19:52.440 |
me to search for a location in seattle and plan a meetup with the user clearly though here the 00:19:57.800 |
correct answer was to simply delete the message ain't nobody got time for that kind of thing it 00:20:03.160 |
can also watch videos if they're broken down frame by frame it correctly identified here that this is 00:20:08.920 |
a recipe tutorial for strawberry stuffed french toast 00:20:15.320 |
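For context, "broken down frame by frame" just means sampling stills from the video and sending them as ordinary images. A minimal sketch with OpenCV; the file name and frame count are placeholders.

```python
import cv2  # opencv-python

def sample_frames(path: str, n: int = 8):
    """Grab n evenly spaced frames so they can be sent to the model as images."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

frames = sample_frames("recipe_tutorial.mp4")
for i, f in enumerate(frames):
    cv2.imwrite(f"frame_{i}.jpg", f)  # these stills are what the vision model actually sees
```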
however with gemini being trained on youtube according to the information which i've linked in 00:20:16.120 |
the description below if you're interested and openai already planning to follow up gpt vision with a model called gobi that 00:20:22.840 |
model by the way would be designed as multimodal from the start at that point when it can properly 00:20:28.440 |
ingest video data that's when image and video capabilities might really take off i can imagine 00:20:34.360 |
today with what we have already teachers having a self-monitored camera facing their whiteboard as 00:20:39.800 |
they write out their questions and answers and explanations gpt vision could be monitoring for 00:20:45.160 |
errors this could apply to any of the gpt vision models that we have already seen in the past 00:20:45.720 |
this could apply particularly to primary education where teachers have to sometimes cover topics that 00:20:50.600 |
they're not fully familiar with with one click gpt vision could check for any mistakes and give you 00:20:56.200 |
feedback anyway i've just thought of that you let me know in the comments other use cases not covered 00:21:01.160 |
so far this is certainly a wild time and thank you so much for watching all the way to the end 00:21:07.480 |
if you've learned anything like i say please do leave a like 00:21:10.600 |
do check out my patreon if you're feeling extra generous and have a wonderful day