RT-X and the Dawn of Large Multimodal Models: Google Breakthrough and 160-page Report Highlights
00:00:00.000 |
Just as I was reaching the finishing pages of this epic 168 page report on GPT Vision, 00:00:07.020 |
which showcased unexpected abilities, novel use cases and predictions for the future, 00:00:11.880 |
Google dropped their colossal RT-X endeavor. And the data Google used, with over 500 skills 00:00:19.180 |
and 150,000 tasks is open source. I've picked out over 75 highlights from both papers, 00:00:25.980 |
which I read in full, and I'll also bring in an exclusive report from The Information to give us 00:00:31.520 |
an idea of what is next. By the end of the video, I want you to have a great sense for what GPT 00:00:37.940 |
with Vision can do today and a more acute awareness of what is coming in Vision, Video 00:00:43.320 |
and Robotics tomorrow. But let's start with this Google DeepMind report released just a few hours 00:00:49.460 |
ago. Essentially, what it showed is that you could create a general purpose robot from data 00:00:55.720 |
drawn from diverse robotic tasks. These were datasets from different 00:00:59.960 |
universities on different continents. And Google wanted to see if this diverse data could improve 00:01:06.100 |
their famous RT2 model. I talked a lot more about RT2 in the video you can see on the screen, 00:01:11.560 |
but essentially it was trained on web data as well as robotics data. That meant that it could 00:01:16.620 |
understand questions like "pick up the extinct animal." But the RT-X series is another step up, 00:01:22.020 |
even though it comes just two months later. The report highlighted that, 00:01:25.940 |
conventionally, robotic learning methods train a separate model for every application, 00:01:30.340 |
every robot and even every environment. Their big finding was that training a single model 00:01:35.620 |
on that diverse data enabled that robot to outperform even the specialist robots. 00:01:41.600 |
The improved version of RT1 became RT1X and RT2 became RT2X. But here you can see RT1X 00:01:50.440 |
out-competing specialist models in a range of domains, weirdly aside from this wiping robot at Berkeley. 00:01:55.920 |
You've got kitchen manipulation, cable routing, door opening, 00:02:00.800 |
and many more tasks that I'll get to in a second. The paper even demonstrated applicability to 00:02:05.960 |
robot arms and even quadrupeds. Think four-legged dog-like robots. And here you can see how it 00:02:12.260 |
improved RT2, which was already a big improvement on RT1. And as Google says, 00:02:18.080 |
our results suggest that co-training with data from other platforms imbues RT2X with additional skills 00:02:25.900 |
that were not present in the original dataset, enabling it to perform novel tasks. 00:02:30.540 |
Apparently it couldn't do things like this before. Move the apple between the can and the orange, 00:02:36.580 |
or move the apple near but not on top of the cloth, or move the apple on top of the pot. 00:02:43.440 |
This gives you a taste of the kind of skills they incorporated. Picking, moving, pushing, placing, sliding, 00:02:49.480 |
putting, navigating, separating, pointing, and on and on. I like that at the end we have assembling and turning on. 00:02:55.880 |
The paper draws the analogy with large language models that are trained on massive web-scale text 00:03:01.880 |
data. And those models tend to outperform systems that are only trained on narrow task-specific 00:03:07.500 |
datasets. Well, same thing now with robotics. That's why I call this the GPT-2 moment for 00:03:13.740 |
robotics. And the paper says even if such robotic datasets in their current size and coverage are 00:03:19.240 |
insufficient to attain the impressive generalization results that have been demonstrated by LLMs, 00:03:25.860 |
in the future the union of such data can potentially provide this kind of coverage. 00:03:32.080 |
One way to put it would be to think of how general and multi-skilled GPT-4 is in language. 00:03:37.840 |
From coding to poetry and mathematics and more. Now imagine that, but with robotic skills. 00:03:43.540 |
Now one question you might have is how would data on folding clothes help with pushing an apple? 00:03:49.620 |
Well they say that unlike most prior works, we directly train our policy on all of this X-embodiment 00:03:55.840 |
data without any mechanisms to reduce the embodiment gap. They didn't try any big 00:04:01.020 |
translation between domains or breaking down the problem into sub-aspects. They did however put the 00:04:06.720 |
input image data into a common resolution and unified the action vectors across these seven 00:04:12.920 |
dimensions. 00:04:18.280 |
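The audio doesn't spell out what the seven dimensions are, but a common convention for this kind of end-effector action space is x/y/z translation, roll/pitch/yaw rotation, and a gripper command. Here is a minimal sketch, under that assumption, of what "unifying action vectors" across datasets with different native ranges could look like; the layout and ranges are illustrative, not the paper's spec.

```python
import numpy as np

# Assumed 7-D action layout: (dx, dy, dz, droll, dpitch, dyaw, gripper).
# The exact layout and ranges below are illustrative, not the paper's spec.
ACTION_DIM = 7

def unify_action(raw: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Rescale one robot's native action range into a shared [-1, 1] convention."""
    assert raw.shape == low.shape == high.shape == (ACTION_DIM,)
    return 2.0 * (raw - low) / (high - low) - 1.0

# Example: a dataset whose pose deltas run -0.05..0.05 and whose gripper command runs 0..255.
low = np.array([-0.05] * 6 + [0.0])
high = np.array([0.05] * 6 + [255.0])
print(unify_action(np.array([0.01, 0.0, -0.02, 0.0, 0.0, 0.005, 200.0]), low, high))
```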
Now that 55 billion parameter number, and the fact that it comes from the PaLI model that's undergirding RT2X, is the perfect segue to GPT Vision. After all, OpenAI directly 00:04:25.820 |
compared their GPT-4 vision model to PaLI 17 billion, which was a precursor model to PaLI-X. 00:04:33.580 |
And you might notice that at least for visual question answering, the PaLI model, that's the 00:04:38.780 |
precursor 17 billion parameter model, outperformed GPT-4 vision. So everything you're about to see 00:04:45.100 |
from the 168 page GPT-4 vision report actually represents the lower bound of current frontier capability, 00:04:55.800 |
since GPT-4 vision is outperformed, at least on that benchmark, by PaLI 17 billion parameters, which is beaten by PaLI-X 55 billion parameters, which has now been 00:05:03.300 |
incorporated into RT2 the robot. And all of that is not even to bring in OpenAI's Gobi model or Google 00:05:10.620 |
Gemini, which I'll talk about later. So the main focus of the video, the dawn of large multimodal 00:05:17.160 |
models. A huge 160 plus page report from Microsoft. It was released just a few days ago and to be 00:05:25.780 |
honest, it deserves an entire video on its own. 00:05:27.780 |
I'm going to give you all of the highlights here in this video. So please do leave a like or subscribe 00:05:32.420 |
if you find it helpful. For all of the fascinating demos you're about to see, Microsoft says that they 00:05:38.100 |
carefully controlled both the images and text to prevent them from being seen during GPT-4V training. 00:05:44.900 |
The images were either not accessible online or had a timestamp beyond April 2023. Their headline 00:05:51.380 |
finding was that GPT vision shows impressive human level capabilities across a wide range of domains, 00:05:55.760 |
and by the end they're proposing agent structures and testing for self-consistency it gets pretty 00:06:16.800 |
wild but it's time for the first demo they showed gpt vision this table with the drinks that they 00:06:22.080 |
had ordered and they took a photo of the menu they asked how much should i pay for the beer 00:06:26.980 |
on the table according to the price on the menu and gpt vision got it we're starting a little slow 00:06:31.940 |
here but imagine you're drunk on the beach and you don't even know what you've ordered this could be 00:06:36.140 |
useful next was gpt vision putting the information from a driver's license into json format now first 00:06:42.800 |
time round it wasn't perfect listing his hair color as non-applicable when the license says brown 00:06:48.060 |
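For context, structured extraction like this usually means pinning the model to a fixed schema up front. A rough sketch of the kind of prompt and target fields you might use for the license demo; the field names are my own illustration, not the report's exact schema.

```python
# Hypothetical field list for the driver's-license demo; keys are illustrative only.
expected_fields = {
    "first_name": None,
    "last_name": None,
    "date_of_birth": None,   # e.g. "1990-01-31"
    "license_number": None,
    "hair_color": None,      # the field GPT-4V initially returned as "N/A" instead of "brown"
    "eye_color": None,
    "expiration_date": None,
}

prompt = (
    "Extract the following fields from the driver's license in the image and return "
    f"valid JSON with exactly these keys: {sorted(expected_fields)}. "
    "Use null for anything you cannot read."
)
print(prompt)
```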
but there are ways to improve performance as we'll see later in fact the first way of 00:06:52.800 |
improving performance is right now that's using chain of thought chain of thought is basically a 00:06:57.840 |
way of getting the model to put out its intermediate reasoning often by using a phrase like let's think 00:07:03.620 |
step by step as you can see here the first time around the model couldn't identify 00:07:06.120 |
that there's 11 apples and actually even when they used let's 00:07:11.220 |
think step by step it still got it wrong what about let's count the apples row by row nope 00:07:16.140 |
it got the right final answer but got the rows mixed up they tried some other prompts but finally 00:07:21.480 |
settled on this one you are an expert in counting things in the image and this time it got it right 00:07:26.700 |
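For reference, here's roughly what that "expert" plus step-by-step prompt looks like as an API call. This is a sketch using the OpenAI Python SDK; the model name, image URL and exact wording are placeholders rather than the report's verbatim prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Sketch of the "expert" + step-by-step counting prompt described above.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "You are an expert in counting things in the image. "
                    "Count the apples row by row, then give the total."
                )},
                {"type": "image_url", "image_url": {"url": "https://example.com/apples.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```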
now all those new methods you've been seeing like llms as prompt optimizers or promptbreeder it 00:07:32.100 |
looks like they are going to be equally applicable to gpt vision in fact after 00:07:36.100 |
that example they say that throughout the paper we're going to employ that technique of calling 00:07:40.240 |
it an expert across various scenarios for better performance but here is one of the big revelations 00:07:46.040 |
from the paper for me the paper says that this particular ability isn't seen in existing models 00:07:51.520 |
and that is being able to follow pointers that might be circles squares or even arrows drawn 00:07:57.960 |
on a diagram and the amazing thing is this seems to work better than giving gpt vision coordinates 00:08:03.920 |
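To make that concrete, this kind of visual referring prompting just means editing the pointer into the pixels before the image is sent, rather than describing coordinates in text. A minimal sketch with Pillow; the file names, coordinates and label are placeholders.

```python
from PIL import Image, ImageDraw

# Draw the pointer onto the image itself instead of passing numeric coordinates.
img = Image.open("scene.jpg").convert("RGB")
draw = ImageDraw.Draw(img)

# Circle the object of interest and tag it, roughly as in the report's figures.
x, y, r = 420, 310, 60
draw.ellipse((x - r, y - r, x + r, y + r), outline=(255, 0, 0), width=6)
draw.text((x - r, y - r - 28), "object 1", fill=(255, 0, 0))

img.save("scene_annotated.jpg")  # this annotated image is what you'd send to GPT-4V
```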
the researchers from microsoft drew these 00:08:06.080 |
arrows onto the photo putting the label object one 00:08:09.260 |
and object two and gpt vision analyzes those objects perfectly but there was something that 00:08:14.100 |
i noticed which is that technically the green arrow is pointing to the pavement and the red 00:08:20.220 |
arrow is pointing to the table i should say that the arrows end at those places and instead of 00:08:26.020 |
interpreting the literal end of the arrows it sussed what the human meant it meant the nearest 00:08:32.060 |
big object the glass bottle and the beer i know that's something small but i did 00:08:36.060 |
find that really impressive. The next big finding was that in context few shot learning is still 00:08:41.440 |
really crucial even for vision models. In context means as part of the prompt not in the pre-training 00:08:47.320 |
and few shot just means giving a few examples before you ask your key question. And the proof 00:08:52.580 |
of concept examples that you're about to see they say vividly demonstrate the rising significance 00:08:57.580 |
of in context few shot learning for achieving improved performance with large multimodal 00:09:03.600 |
models. I'm going to speed through this example just so you get the gist. Essentially they asked 00:09:08.080 |
it to read the speed on this speed meter. It gets it wrong and even when they ask it to think step 00:09:13.400 |
by step it still gets it wrong. Then they gave it instructions and it still got it wrong. They 00:09:18.240 |
tried loads of things but it just couldn't seem to do it. Even when they gave it one example one 00:09:23.480 |
shot it still got it wrong. You can see in the prompt they gave a correct worked example and 00:09:29.000 |
then asked the question again but no it said just passing 70 miles an hour. 00:09:33.440 |
But finally, two-shot with two worked examples, it then gets it right. 00:09:39.640 |
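Mechanically, "two-shot" here just means interleaving two solved image/answer pairs ahead of the real query. A sketch of that message structure, which would be passed to the same chat-completions call as in the earlier snippet; the URLs, wording and readings are invented for illustration, not the report's actual examples.

```python
# Two worked examples (image + assistant answer), then the query image.
def image_msg(url, text):
    return {"role": "user", "content": [
        {"type": "text", "text": text},
        {"type": "image_url", "image_url": {"url": url}},
    ]}

messages = [
    image_msg("https://example.com/speedo_a.jpg", "What speed is shown?"),
    {"role": "assistant", "content": "The needle is just past 80, so about 82 mph."},
    image_msg("https://example.com/speedo_b.jpg", "What speed is shown?"),
    {"role": "assistant", "content": "The needle sits between 40 and 60, closer to 40: about 45 mph."},
    image_msg("https://example.com/speedo_query.jpg", "What speed is shown?"),
]
```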
Next they showed that the model could recognize celebrities and there was one particularly interesting example of Jensen 00:09:44.940 |
Huang. He's the CEO of Nvidia which produces the GPUs that went into training GPT-4 vision. Anyway 00:09:51.600 |
it could apparently recognize its own ingredients saying he's likely holding a GPU. Next they had it 00:09:57.720 |
recognizing landmarks even if they were at weird angles or at night. It could recognize dishes 00:10:03.180 |
even if they had toppings and condiments. It also did really pretty well in medical image 00:10:08.900 |
understanding identifying what was wrong with this particular foot. You can see it also working 00:10:13.660 |
with a CT scan. Of course before we get too excited our old friend hallucination is still 00:10:19.420 |
there. It described a bridge overpass that I frankly can't see at all. It skipped over North 00:10:24.920 |
Carolina entirely when asked about the states shown on this map. And it also gets seemingly 00:10:30.360 |
random numbers wrong. Take this table where it noted 00:10:33.400 |
down almost every number correctly down to three decimal places but then for some reason the 15 00:10:39.640 |
million 971 880 became 15 971 421 actually i've just noticed while filming that that's the same 00:10:50.200 |
ending as the profit in the next column so maybe there was a reason but it's still pretty random 00:10:55.400 |
point is you still can't fully rely on the outputs and it seems to me that in figure 36 there was a 00:11:01.000 |
mistake that even the researchers didn't notice if i'm right that shows how pernicious these mistakes 00:11:06.120 |
can be the researchers say that the model not only understands the program in the given flowchart but 00:11:11.720 |
also translates the details to a python code if you go to figure 36 you see this flowchart and it 00:11:18.040 |
was asked can you translate the flowchart to a python code you can see the flowchart and you can 00:11:23.640 |
see the code now obviously it's impressive that it can even attempt to do this but that code is dodgy 00:11:29.000 |
taking a string as input, 00:11:30.600 |
essentially a bunch of letters instead of a floating point number now the goal was to print 00:11:35.640 |
the larger number and that's what gpt vision says in the explanation that it will do but that input 00:11:40.920 |
problem that i mentioned means that it returned three when comparing three with twenty 00:11:46.680 |
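To make the bug concrete: if the generated code reads its values with input() and never converts them, Python compares them as strings, character by character, so "3" beats "20". A hypothetical reproduction of that failure mode, not the exact code from figure 36.

```python
# Comparing the raw strings instead of numbers makes "3" > "20",
# because '3' > '2' in the first character comparison.
a = "3"    # what input() would return, left unconverted
b = "20"

print(max(a, b))                # "3"  -> the wrong answer the generated code gave
print(max(float(a), float(b)))  # 20.0 -> the intended numeric comparison
```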
the paper also called this answer correct when averaging these two numbers which comes out to 76.555 00:11:52.840 |
that rounds to 76 dollars and 56 cents not 76 dollars and 55 cents now you might say all of this 00:12:00.200 |
is pedantic but the errors keep coming the paper says that in the bottom row of figure 37 gpt vision 00:12:06.600 |
shows a clear understanding of both x and y axis and explains the key insight presented in the 00:12:12.600 |
chart go to figure 37 and you get this chart the key insight to me from this chart is that 00:12:18.840 |
publishing bad okay or pretty good papers makes almost no difference it's only when they get 00:12:24.280 |
very creative original and good that it makes lots of impact on your career now gpt vision 00:12:29.800 |
does say that publishing a bad paper has little impact on your career and a creative paper has 00:12:35.800 |
significant impact correct but then it says the impact of the paper on a person's career increases 00:12:41.000 |
as the quality of the paper improves now while that's technically correct it misses the key 00:12:46.120 |
insight basically a flat line and then a sudden upward turn anyway loads of errors but let's focus 00:12:52.360 |
on the potential use cases because we must remember that gpt vision is still capable of things 00:12:58.760 |
like this 00:12:59.400 |
a guy on twitter or x daniel litt said this i've been told gpt4 with code interpreter is good at 00:13:05.800 |
math he was taking the mickey because the output is this can you compute the seventh root of three 00:13:11.640 |
to the power of seven now the seventh root is the opposite of raising a number to the power of seven 00:13:16.440 |
so the answer should be three 00:13:21.240 |
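As a quick sanity check (my own snippet, not from the thread):

```python
# The seventh root undoes raising to the seventh power: (3**7) ** (1/7) == 3.
print(3 ** 7)               # 2187
print((3 ** 7) ** (1 / 7))  # ~3.0 (2.9999999999999996 after float rounding), certainly not 4.2
```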
but it said the seventh root of three to the seven is approximately 4.2 but then someone else put that image into gpt vision and said why is this tweet funny and gpt 00:13:29.000 |
vision was able to pick up on the humor the humor in this tweet arises from the mathematical 00:13:35.080 |
inconsistency the question posed to gpt4 with code interpreter asked for the seventh root of three to 00:13:40.600 |
the seven mathematically the seventh root of three to the seven is simply three it corrected its own 00:13:45.560 |
error the incongruity between the question and the answer creates a comedic effect so it was not only 00:13:51.480 |
able to correct its error it was able to see the humor in someone pointing that out but then 00:13:56.520 |
someone went even more meta pasting this entire thing into gpt vision saying why is this analysis 00:14:03.560 |
funny and then gpt vision is able to summarize the entire situation describing gpt4 itself 00:14:10.520 |
as analyzing its own mistake the irony lies in gpt4 critiquing its own incorrect answer 00:14:16.680 |
i would have given it bonus points if it said and here i am talking about it but let's not 00:14:20.920 |
ask for too much this is already impressive speaking of gpt vision being a bit more critical 00:14:26.200 |
it was asked this question how many families are earning more than 13 000 and owns more than two 00:14:32.440 |
cars the question is very ambiguous it gave no time period earning more than 13 000 a month a 00:14:38.600 |
year and it talked about owning the cars when the table was just about vehicles per family of course 00:14:44.200 |
a family might have a vehicle without owning it i'd have loved it if gpt vision picked up on this 00:14:49.720 |
ambiguity in 00:14:50.520 |
the question and asked clarifying questions instead it outputted a reasonable answer based 00:14:55.960 |
on a few assumptions and the paper marked it as correct they did show it analyzing a full academic 00:15:02.120 |
paper and making only a few mistakes though and to me that shows some pretty crazy potential 00:15:08.120 |
especially for the next model down the line imagine a model being able to read all ai related 00:15:13.480 |
papers in any language and synthesize some of the findings that's when things might get a little out 00:15:20.120 |
control i do think the paper gets a little bit over eager in places though for example here it 00:15:25.320 |
fed gpt vision a series of frames depicting a player taking a penalty as you can see in the 00:15:31.320 |
last frame the ball is in the net gpt vision correctly said the ball was not blocked by the 00:15:37.640 |
goalkeeper the conclusion of the paper is that this demonstrates cause and effect reasoning by 00:15:42.760 |
determining whether the ball was blocked based on the goalkeeper ball interaction but to me it could 00:15:47.640 |
be simple memorization based on the 00:15:49.720 |
web scale of data it was trained on for example it might have seen many many images of a ball in the 00:15:55.160 |
back of a net and it understands that those images correspond to a penalty not being saved you can let 00:16:00.760 |
me know in the comments if you think this demonstrates a considerable level of sophistication 00:16:06.120 |
in the model's reasoning abilities i was really impressed by this though they sent 00:16:10.360 |
gpt vision a series of photos and highlighted one of the guys in the photo and it was able to deduce 00:16:16.840 |
that he is playfully pretending to throw a punch 00:16:23.640 |
now i am sure that many models and even quite a few humans might think that these images depict 00:16:29.400 |
a real punch but if you look at it carefully it does seem like he's playing so that was really 00:16:34.200 |
impressive to me it could also identify south park characters just from ascii art that's despite it 00:16:39.800 |
not being able to generate good ascii art currently itself or maybe it can but the reinforcement 00:16:45.320 |
learning has drained it of that ability anyway it is able to read emotions of people from their faces 00:16:48.920 |
so if you one day approach a gpt vision model looking like this it's going to know what you're 00:16:54.600 |
thinking i don't know if this quite counts as emotional intelligence or empathy though those 00:16:59.880 |
were some of the words used by the paper i did find it interesting though that they said that 00:17:04.840 |
understanding anger, awe and fear will be essential in use cases such as home robots i'm not sure if 00:17:11.720 |
they're anticipating many people being angry in awe or in fear of their home robot which 00:17:17.160 |
presumably they bought 00:17:18.520 |
maybe it finds faces easier to read than helmets because it says there are eight people in this 00:17:23.560 |
image wearing helmets and as i speculated previously it is able to iterate on those prompts 00:17:29.240 |
it noticed how this output didn't match the original request have it look like a graphic 00:17:34.680 |
novel and then it suggested improvements to the prompt as i've said before imagine this combined 00:17:39.800 |
with dall-e 3 with constant iterations it might take a bit longer but only the output that gets 10 out 00:17:45.640 |
of 10 from gpt vision would be handed to you 00:17:50.280 |
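None of that iterative loop is in the Microsoft report; it's the speculative combination described above. A rough sketch of how such a generate-critique-regenerate loop could be wired up with the OpenAI SDK; the model names, wording and the naive "feed the critique back as the next prompt" step are all assumptions.

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set; model names are placeholders

client = OpenAI()

def critique(image_url: str, request: str) -> str:
    """Ask the vision model to score the image against the request and suggest a better prompt."""
    r = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": (
                f"Rate this image 1-10 against the request '{request}'. "
                "If it scores below 10, reply with an improved DALL-E prompt."
            )},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
        max_tokens=200,
    )
    return r.choices[0].message.content

request = "a street scene that looks like a graphic novel"
prompt = request
for _ in range(3):  # a few refinement rounds
    img = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    feedback = critique(img.data[0].url, request)
    print(feedback)
    prompt = feedback  # naive: feed the critique straight back as the next prompt
```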
some of you may know that steve wozniak proposed a somewhat peculiar test for agi could a machine 00:17:56.440 |
enter the average american home and figure out how to make coffee as wikipedia says this has 00:18:01.400 |
not yet been completed but it might not be far away after all gpt vision was able to figure out 00:18:08.040 |
the buttons on a coffee machine and then it could work its way through a house via images 00:18:14.280 |
to enact a plan for example it wanted to go to the fridge and it 00:18:17.720 |
proposed a series of actions turn right and move forward toward the hallway then next when it was 00:18:23.080 |
in a different position it said i would now turn right and move toward the kitchen it goes on that 00:18:28.680 |
it would head toward the fridge and finally in this example it would now open the fridge door 00:18:33.720 |
and retrieve the requested item now you might say oh that's all well and good having the plan 00:18:39.000 |
and being able to use vision to propose a plan but that's still not the same as being 00:18:43.480 |
dexterous enough to actually pour the coffee let alone get out the cups from the fridge and then 00:18:47.320 |
handle them but of course we started the video with the rtx series we're getting close to that 00:18:53.240 |
level of manipulation i could honestly see that task being achieved in the next three years or 00:18:59.240 |
perhaps even sooner if a team went straight out to achieve it next they showed gpt vision being 00:19:05.000 |
able to handle a computer screen it knew at least the general direction of where to click and what 00:19:10.840 |
to do next you can see it here via the researchers navigating google search and it's a very simple 00:19:16.920 |
search it does still have problems with exact coordinates though so its clicks might be a little 00:19:21.960 |
inaccurate also of course it still hallucinates here it is trying to read the news and it decides 00:19:27.880 |
to close the tab by clicking the x in the top right corner of course that would not just close 00:19:33.880 |
the current tab it can also handle phone screens and even phone notifications though it does make 00:19:39.400 |
one key mistake the sender yyk hahaha has sent the message i see you are in seattle let's meet 00:19:46.520 |
up and gpt vision proposes this let's move my finger to the maps app icon and that will allow 00:19:52.440 |
me to search for a location in seattle and plan a meetup with the user clearly though here the 00:19:57.800 |
correct answer was to simply delete the message ain't nobody got time for that kind of thing it 00:20:03.160 |
can also watch videos if they're broken down frame by frame it correctly identified here that this is 00:20:08.920 |
a recipe tutorial for strawberry stuffed french toast 00:20:15.320 |
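For context, "broken down frame by frame" just means sampling stills from the video and sending them as ordinary images. A minimal sketch with OpenCV; the file name and frame count are placeholders.

```python
import cv2  # opencv-python

def sample_frames(path: str, n: int = 8):
    """Grab n evenly spaced frames so they can be sent to the model as images."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

frames = sample_frames("recipe_tutorial.mp4")
for i, f in enumerate(frames):
    cv2.imwrite(f"frame_{i}.jpg", f)  # these stills are what the vision model actually sees
```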
however with gemini being trained on youtube according to the information which i've linked in 00:20:16.120 |
the description below if you're interested and openai already planning to follow up gpt vision with a model called gobi that 00:20:22.840 |
model by the way would be designed as multimodal from the start at that point when it can properly 00:20:28.440 |
ingest video data that's when image and video capabilities might really take off i can imagine 00:20:34.360 |
today with what we have already teachers having a self-monitored camera facing their whiteboard as 00:20:39.800 |
they write out their questions and answers and explanations gpt vision could be monitoring for 00:20:45.160 |
errors this could apply to any of the gpt vision models that we have already seen in the past 00:20:45.720 |
this could apply particularly to primary education where teachers have to sometimes cover topics that 00:20:50.600 |
they're not fully familiar with with one click gpt vision could check for any mistakes and give you 00:20:56.200 |
feedback anyway i've just thought of that you let me know in the comments other use cases not covered 00:21:01.160 |
so far this is certainly a wild time and thank you so much for watching all the way to the end 00:21:07.480 |
if you've learned anything like i say please do leave a like 00:21:10.600 |
do check out my patreon if you're feeling extra generous and have a wonderful day