
RT-X and the Dawn of Large Multimodal Models: Google Breakthrough and 160-page Report Highlights



00:00:00.000 | Just as I was reaching the finishing pages of this epic 168 page report on GPT Vision,
00:00:07.020 | which showcased unexpected abilities, novel use cases and predictions for the future,
00:00:11.880 | Google dropped their colossal RT-X endeavor. And the data Google used with over 500 skills
00:00:19.180 | and 150,000 tasks is open source. I've picked out over 75 highlights from both papers,
00:00:25.980 | which I read in full, and I'll also bring in an exclusive report from The Information to give us
00:00:31.520 | an idea of what is next. By the end of the video, I want you to have a great sense for what GPT
00:00:37.940 | with Vision can do today and a more acute awareness of what is coming in Vision, Video
00:00:43.320 | and Robotics tomorrow. But let's start with this Google DeepMind report released just a few hours
00:00:49.460 | ago. Essentially, what it showed is that you could create a general purpose robot from data
00:00:55.960 | from diverse robotic tasks. These were data sets from different
00:00:59.960 | universities on different continents. And Google wanted to see if this diverse data could improve
00:01:06.100 | their famous RT2 model. I talked a lot more about RT2 in the video you can see on the screen,
00:01:11.560 | but essentially it was trained on web data as well as robotics data. That meant that it could
00:01:16.620 | understand questions like pick up the extinct animal. But the RT-X series is another step up,
00:01:22.020 | even though it comes just two months later. The report highlighted that,
00:01:25.940 | conventionally, robotic learning methods train a separate model for every application,
00:01:30.340 | every robot and even every environment. Their big finding was that training a single model
00:01:35.620 | on that diverse data enabled that robot to outperform even the specialist robots.
00:01:41.600 | The improved version of RT1 became RT1X and RT2 became RT2X. But here you can see RT1X
00:01:50.440 | out-competing specialist models, weirdly aside from this wiping robot at Berkeley.
00:01:55.920 | In a range of domains. You've got kitchen manipulation, cable routing, door opening,
00:02:00.800 | and many more tasks that I'll get to in a second. The paper even demonstrated applicability to
00:02:05.960 | robot arms and even quadrupeds. Think four-legged dog-like robots. And here you can see how it
00:02:12.260 | improved RT2, which was already a big improvement on RT1. And as Google says,
00:02:18.080 | our results suggest that co-training with data from other platforms imbues RT2X with additional skills
00:02:25.900 | that were not present in the original dataset, enabling it to perform novel tasks.
00:02:30.540 | Apparently it couldn't do things like this before. Move the apple between the can and the orange,
00:02:36.580 | or move the apple near but not on top of the cloth, or move the apple on top of the pot.
00:02:43.440 | This gives you a taste of the kind of skills they incorporated. Picking, moving, pushing, placing, sliding,
00:02:49.480 | putting, navigating, separating, pointing, and on and on. I like that at the end we have assembling and turning on.
00:02:55.880 | The paper draws the analogy with large language models that are trained on massive web-scale text
00:03:01.880 | data. And those models tend to outperform systems that are only trained on narrow task-specific
00:03:07.500 | datasets. Well, same thing now with robotics. That's why I call this the GPT-2 moment for
00:03:13.740 | robotics. And the paper says even if such robotic datasets in their current size and coverage are
00:03:19.240 | insufficient to attain the impressive generalization results that have been demonstrated by LLMs,
00:03:25.860 | in the future the union of such data can potentially provide this kind of coverage.
00:03:32.080 | One way to put it would be to think of how general and multi-skilled GPT-4 is in language.
00:03:37.840 | From coding to poetry and mathematics and more. Now imagine that, but with robotic skills.
00:03:43.540 | Now one question you might have is how would data on folding clothes help with pushing an apple?
00:03:49.620 | Well, they say that unlike most prior works, we directly train our policy on all of this X-embodiment
00:03:55.840 | data without any mechanisms to reduce the embodiment gap. They didn't try any big
00:04:01.020 | translation between domains or breaking down the problem into sub-aspects. They did however put the
00:04:06.720 | input image data into a common resolution and unified the action vectors across these seven
00:04:12.920 | dimensions.
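To make that data unification concrete, here's a minimal sketch of the idea: resize every dataset's camera frames to one shared resolution and map each dataset's actions onto a single seven-dimensional end-effector vector. This is my own illustration, not code from the RT-X paper; the exact resolution, the field names, and the 7-D layout (xyz translation, rotation, gripper) are assumptions based on the RT-1/RT-2 convention.

```python
# Hedged sketch: one way to put images at a common resolution and unify actions
# into a 7-D vector (dx, dy, dz, droll, dpitch, dyaw, gripper). Illustrative
# only -- not the actual RT-X preprocessing pipeline.
import numpy as np
from PIL import Image

COMMON_RESOLUTION = (300, 300)  # assumed target size; the papers vary

def standardize_image(image: Image.Image) -> np.ndarray:
    """Resize any dataset's camera frame to the shared input resolution."""
    return np.asarray(image.resize(COMMON_RESOLUTION), dtype=np.float32) / 255.0

def unify_action(raw_action: dict) -> np.ndarray:
    """Map a dataset-specific action dict onto one 7-D end-effector vector."""
    return np.array([
        raw_action.get("dx", 0.0),
        raw_action.get("dy", 0.0),
        raw_action.get("dz", 0.0),
        raw_action.get("droll", 0.0),
        raw_action.get("dpitch", 0.0),
        raw_action.get("dyaw", 0.0),
        raw_action.get("gripper", 0.0),  # e.g. 0 = closed, 1 = open
    ], dtype=np.float32)

# A dataset that only records planar motion still lands in the same action space:
print(unify_action({"dx": 0.05, "dyaw": 0.1}))
```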
00:04:18.280 | Now that 55 billion parameter number, and the fact that it comes from the PaLI model that's undergirding RT2X, is the perfect segue to GPT vision. After all, OpenAI directly
00:04:25.820 | compared their GPT-4 vision model to PaLI 17 billion, which was a precursor model to PaLI-X.
00:04:33.580 | And you might notice that at least for visual question answering, the PaLI model, that's the
00:04:38.780 | precursor 17 billion parameter model, outperformed GPT-4 vision. So everything you're about to see
00:04:45.100 | from the 168 page GPT-4 vision report actually represents the lower bound of current frontier capability.
00:04:53.100 | To recap, that's GPT-4 vision being beaten
00:04:55.800 | by PaLI 17 billion parameters, which is beaten by PaLI-X 55 billion parameters, which has now been
00:05:03.300 | incorporated into RT2 the robot. And all of that is not even to bring in OpenAI Gobi model or Google
00:05:10.620 | Gemini, which I'll talk about later. So the main focus of the video, the dawn of large multimodal
00:05:17.160 | models. A huge 160 plus page report from Microsoft. It was released just a few days ago and to be
00:05:25.780 | honest, it deserves an entire video on its own.
00:05:27.780 | I'm going to give you all of the highlights here in this video. So please do leave a like or subscribe
00:05:32.420 | if you find it helpful. For all of the fascinating demos you're about to see, Microsoft says that they
00:05:38.100 | carefully controlled both the images and text to prevent them from being seen during GPT-4V training.
00:05:44.900 | The images were either not accessible online or had a timestamp beyond April 2023. Their headline
00:05:51.380 | finding was that GPT vision shows impressive human-level capabilities across a wide range of domains,
00:06:11.760 | and by the end they're proposing agent structures and testing for self-consistency it gets pretty
00:06:16.800 | wild but it's time for the first demo they showed gpt vision this table with the drinks that they
00:06:22.080 | had ordered and they took a photo of the menu they asked how much should i pay for the beer
00:06:26.980 | on the table according to the price on the menu and gpt vision got it we're starting a little slow
00:06:31.940 | here but imagine you're drunk on the beach and you don't even know what you've ordered this could be
00:06:36.140 | useful next was gpt vision putting the information from a driver's license into json format now first
00:06:42.800 | time round it wasn't perfect listing his hair color as non-applicable when the license says
00:06:48.060 | brown but there are ways to improve performance as we'll see later.
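For a concrete picture of the kind of structured output being described, here's a minimal sketch of driver's-license fields as JSON. The field names and values are hypothetical, not the report's actual example; the point is just the shape of the output, including the hair colour field the model got wrong.

```python
# Hedged sketch of license-to-JSON extraction output. All values are made up;
# the report notes GPT-4V returned "not applicable" for hair color when the
# license actually said brown.
import json

license_fields = {
    "name": "John Doe",               # hypothetical
    "date_of_birth": "1985-06-14",    # hypothetical
    "hair_color": "BRN",              # the field the model mis-read
    "eye_color": "BLU",               # hypothetical
    "expiration_date": "2027-06-14",  # hypothetical
}

print(json.dumps(license_fields, indent=2))
```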
00:06:52.800 | in fact the first way of improving performance is right now that's using chain of thought. chain of thought is basically a
00:06:57.840 | way of getting the model to put out its intermediate reasoning often by using a phrase like let's think
00:07:03.620 | step by step. as you can see here, the first time around, without chain of thought,
00:07:06.120 | it couldn't identify that there's 11 apples and actually even when they used let's
00:07:11.220 | think step by step it still got it wrong what about let's count the apples row by row nope
00:07:16.140 | it got the right final answer but got the rows mixed up they tried some other prompts but finally
00:07:21.480 | settled on this one you are an expert in counting things in the image and this time it got it right
00:07:26.700 | now all those new methods you've been seeing like llms as prompt optimizers or promptbreeder it
00:07:32.100 | looks like they are going to be equally applicable to gpt vision in fact after
00:07:36.100 | that example they say that throughout the paper we're going to employ that technique of calling
00:07:40.240 | it an expert across various scenarios for better performance.
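To make that prompting trick concrete, here's a minimal sketch of how the chain-of-thought and "expert" persona prompts from the counting example might be laid out as chat messages. The system-prompt wording comes from the example above; the roles and the surrounding structure are my own illustration, not the paper's exact prompt.

```python
# Hedged sketch of the prompting tricks described above: a chain-of-thought
# style suffix ("count row by row") plus the "expert" persona as a system
# message. Illustrative only.
def build_counting_messages(question: str, chain_of_thought: bool = True) -> list[dict]:
    if chain_of_thought:
        question += " Let's count the apples row by row."
    return [
        {"role": "system",
         "content": "You are an expert in counting things in the image."},
        {"role": "user", "content": question},
    ]

print(build_counting_messages("Count the number of apples in the image."))
```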
00:07:46.040 | but here is one of the big revelations from the paper for me: the paper says that this particular ability isn't seen in existing models
00:07:51.520 | and that is being able to follow pointers that might be circles squares or even arrows drawn
00:07:57.960 | on a diagram and the amazing thing is this seems to work better than giving gpt vision coordinates
00:08:03.920 | the researchers from microsoft drew these arrows onto the photo, putting the labels object one
00:08:09.260 | and object two and gpt vision analyzes those objects perfectly but there was something that
00:08:14.100 | i noticed which is that technically the green arrow is pointing to the pavement and the red
00:08:20.220 | arrow is pointing to the table i should say that the arrows end at those places and instead of
00:08:26.020 | interpreting the literal end of the arrows it sussed what the human meant it meant the nearest
00:08:32.060 | big object the glass bottle and the beer. i know that's something small but i
00:08:36.060 | found that really impressive. The next big finding was that in context few shot learning is still
00:08:41.440 | really crucial even for vision models. In context means as part of the prompt not in the pre-training
00:08:47.320 | and few shot just means giving a few examples before you ask your key question. And the proof
00:08:52.580 | of concept examples that you're about to see they say vividly demonstrate the rising significance
00:08:57.580 | of in context few shot learning for achieving improved performance with large multimodal
00:09:03.600 | models. I'm going to speed through this example just so you get the gist. Essentially they asked
00:09:08.080 | it to read the speed on this speedometer. It gets it wrong and even when they ask it to think step
00:09:13.400 | by step it still gets it wrong. Then they gave it instructions and it still got it wrong. They
00:09:18.240 | tried loads of things but it just couldn't seem to do it. Even when they gave it one example one
00:09:23.480 | shot it still got it wrong. You can see in the prompt they gave a correct worked example and
00:09:29.000 | then asked the question again but no it said just passing 70 miles an hour.
00:09:33.440 | But finally two shot, with two worked examples, it then gets it right.
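Here's a minimal sketch of what a two-shot, in-context prompt like that might look like when written out as chat messages: two worked speedometer readings, then the real question. The image URLs, the example readings and the commented-out API call are placeholders and assumptions, not the paper's actual prompt.

```python
# Hedged sketch of a two-shot prompt for a vision model: two worked examples
# precede the question we actually want answered. Illustrative only.
def image_turn(text: str, image_url: str) -> dict:
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

messages = [
    # worked example 1
    image_turn("What speed is shown on this speedometer?", "https://example.com/speedo_1.jpg"),
    {"role": "assistant", "content": "The needle is just past 20, so roughly 22 mph."},
    # worked example 2
    image_turn("What speed is shown on this speedometer?", "https://example.com/speedo_2.jpg"),
    {"role": "assistant", "content": "The needle sits a little past 40, so about 45 mph."},
    # the real question
    image_turn("What speed is shown on this speedometer?", "https://example.com/speedo_target.jpg"),
]

# These could then be sent to a multimodal chat endpoint, for example:
# client.chat.completions.create(model="gpt-4-vision-preview", messages=messages)
print(len(messages), "messages in the two-shot prompt")
```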
00:09:39.640 | Next they showed that the model could recognize celebrities and there was one particularly interesting example of Jensen
00:09:44.940 | Huang. He's the CEO of Nvidia which produces the GPUs that went into training GPT-4 vision. Anyway
00:09:51.600 | it could apparently recognize its own ingredients saying he's likely holding a GPU. Next they had it
00:09:57.720 | recognizing landmarks even if they were at weird angles or at night. It could recognize dishes
00:10:03.420 | even if they had toppings and condiments. It also did really pretty well in medical image
00:10:08.900 | understanding identifying what was wrong with this particular foot. You can see it also working
00:10:13.660 | with a CT scan. Of course before we get too excited our old friend hallucination is still
00:10:19.420 | there. It described a bridge overpass that I frankly can't see at all. It skipped over North
00:10:24.920 | Carolina entirely when asked about the states shown on this map. And it also gets seemingly
00:10:30.360 | random numbers wrong. Take this table where it noted
00:10:33.400 | down almost every number correctly, down to three decimal places, but then for some reason the
00:10:39.640 | 15,971,880 became 15,971,421. Actually I've just noticed while filming that that's the same
00:10:50.200 | ending as the profit in the next column so maybe there was a reason but it's still pretty random
00:10:55.400 | point is you still can't fully rely on the outputs and it seems to me that in figure 36 there was a
00:11:01.000 | mistake that even the researchers didn't notice if i'm right that shows how pernicious these mistakes
00:11:06.120 | can be the researchers say that the model not only understands the program in the given flowchart but
00:11:11.720 | also translates the details to a python code if you go to figure 36 you see this flowchart and it
00:11:18.040 | was asked can you translate the flowchart to a python code you can see the flowchart and you can
00:11:23.640 | see the code now obviously it's impressive that it can even attempt to do this but that code is dodgy
00:11:29.000 | taking a string as input,
00:11:30.600 | essentially a bunch of letters, instead of a floating point number. now the goal was to print
00:11:35.640 | the larger number and that's what gpt vision says in the explanation that it will do but that input
00:11:40.920 | problem that i mentioned means that it returned three when comparing three with twenty.
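Since that bug is easy to miss, here's a minimal reconstruction of the failure mode being described, not the paper's exact figure 36 code: reading the inputs with input() leaves them as strings, so the comparison is lexicographic and "3" beats "20".

```python
# Hedged reconstruction of the kind of bug described above (illustrative only).
def print_larger_buggy():
    a = input("First number: ")    # returns a str, e.g. "3"
    b = input("Second number: ")   # returns a str, e.g. "20"
    print(a if a > b else b)       # string comparison: "3" > "20", so it prints "3"

def print_larger_fixed():
    a = float(input("First number: "))   # convert to floating point numbers
    b = float(input("Second number: "))
    print(a if a > b else b)             # numeric comparison: prints 20.0
```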
00:11:46.680 | the paper also called this answer correct when averaging these two numbers which comes out to 76.555
00:11:52.840 | that rounds to 76 dollars and 56 cents not 76 dollars and 55 cents now you might say all of this
00:12:00.200 | is pedantic but the errors keep coming the paper says that in the bottom row of figure 37 gpt vision
00:12:06.600 | shows a clear understanding of both x and y axis and explains the key insight presented in the
00:12:12.600 | chart go to figure 37 and you get this chart the key insight to me from this chart is that
00:12:18.840 | publishing bad okay or pretty good papers makes almost no difference it's only when they get
00:12:24.280 | very creative original and good that it makes lots of impact on your career now gpt vision
00:12:29.800 | does say that publishing a bad paper has little impact on your career and a creative paper has
00:12:35.800 | significant impact correct but then it says the impact of the paper on a person's career increases
00:12:41.000 | as the quality of the paper improves now while that's technically correct it misses the key
00:12:46.120 | insight basically a flat line and then a sudden upward turn anyway loads of errors but let's focus
00:12:52.360 | on the potential use cases because we must remember that gpt vision is still capable of things
00:12:58.760 | like this.
00:12:59.400 | a guy on twitter or x, daniel litt, said this: i've been told gpt4 with code interpreter is good at
00:13:05.800 | math he was taking the mickey because the output is this can you compute the seventh root of three
00:13:11.640 | to the power of seven now the seventh root is the opposite of raising a number to the power of seven
00:13:16.440 | so the answer should be three but it said the seventh root of three to the seven is approximately
00:13:21.240 | 4.2.
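The arithmetic is a one-liner to check for yourself:

```python
# quick check: the seventh root of 3**7 is just 3, not 4.2
import math
value = (3 ** 7) ** (1 / 7)
print(value, math.isclose(value, 3))  # ~3.0 True (up to floating-point error)
```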
00:13:29.000 | but then someone else put that image into gpt vision and said why is this tweet funny, and gpt vision was able to pick up on the humor. the humor in this tweet arises from the mathematical
00:13:35.080 | inconsistency the question posed to gpt4 with code interpreter asked for the seventh root of three to
00:13:40.600 | the seven mathematically the seventh root of three to the seven is simply three it corrected its own
00:13:45.560 | error the incongruity between the question and the answer creates a comedic effect so it was not only
00:13:51.480 | able to correct its error it was able to see the humor in someone pointing that out but then
00:13:56.520 | someone went even more meta pasting this entire thing into gpt vision saying why is this analysis
00:14:03.560 | funny and then gpt vision is able to summarize the entire situation describing gpt4 itself
00:14:10.520 | as analyzing its own mistake the irony lies in gpt4 critiquing its own incorrect answer
00:14:16.680 | i would have given it bonus points if it said and here i am talking about it but let's not
00:14:20.920 | ask for too much this is already impressive speaking of gpt vision being a bit more critical
00:14:26.200 | it was asked this question how many families are earning more than 13 000 and owns more than two
00:14:32.440 | cars the question is very ambiguous it gave no time period earning more than 13 000 a month a
00:14:38.600 | year and it talked about owning the cars when the table was just about vehicles per family of course
00:14:44.200 | a family might have a vehicle without owning it i'd have loved it if gpt vision picked up on this
00:14:49.720 | ambiguity in the question and asked clarifying questions. instead it outputted a reasonable answer based
00:14:55.960 | on a few assumptions and the paper marked it as correct they did show it analyzing a full academic
00:15:02.120 | paper and making only a few mistakes though and to me that shows some pretty crazy potential
00:15:08.120 | especially for the next model down the line imagine a model being able to read all ai related
00:15:13.480 | papers in any language and synthesize some of the findings that's when things might get a little out
00:15:20.120 | of control. i do think the paper gets a little bit over eager in places though for example here it
00:15:25.320 | fed gpt vision a series of frames depicting a player taking a penalty as you can see in the
00:15:31.320 | last frame the ball is in the net gpt vision correctly said the ball was not blocked by the
00:15:37.640 | goalkeeper the conclusion of the paper is that this demonstrates cause and effect reasoning by
00:15:42.760 | determining whether the ball was blocked based on the goalkeeper ball interaction but to me it could
00:15:47.640 | be simple memorization based on the
00:15:49.720 | web scale of data it was trained on for example it might have seen many many images of a ball in the
00:15:55.160 | back of a net and it understands that those images correspond to a penalty not being saved you can let
00:16:00.760 | me know in the comments if you think this demonstrates a considerable level of sophistication
00:16:06.120 | in the model's reasoning abilities i was really impressed by this though they sent
00:16:10.360 | gpt vision a series of photos and highlighted one of the guys in the photo and it was able to deduce
00:16:16.840 | that he is playfully pretending to punch rather than throwing a real punch.
00:16:23.640 | now i am sure that many models and even quite a few humans might think that these images depict
00:16:29.400 | a real punch but if you look at it carefully it does seem like he's playing so that was really
00:16:34.200 | impressive to me it could also identify south park characters just from ascii art that's despite it
00:16:39.800 | not being able to generate good ascii art currently itself or maybe it can but the reinforcement
00:16:45.320 | learning has drained it of that ability anyway it is able to read emotions of people from their faces
00:16:48.920 | so if you one day approach a gpt vision model looking like this it's going to know what you're
00:16:54.600 | thinking i don't know if this quite counts as emotional intelligence or empathy though those
00:16:59.880 | were some of the words used by the paper i did find it interesting though that they said that
00:17:04.840 | understanding anger, awe and fear will be essential in use cases such as home robots i'm not sure if
00:17:11.720 | they're anticipating many people being angry in awe or in fear of their home robot which
00:17:17.160 | presumably they bought.
00:17:18.520 | maybe it finds faces easier to read than helmets because it says there are eight people in this
00:17:23.560 | image wearing helmets and as i speculated previously it is able to iterate on those prompts
00:17:29.240 | it noticed how this output didn't match the original request have it look like a graphic
00:17:34.680 | novel and then it suggested improvements to the prompt as i've said before imagine this combined
00:17:39.800 | with dall-e 3 with constant iterations it might take a bit longer but only the output that gets 10 out
00:17:45.640 | of 10 from gpt vision would be handed to you.
00:17:50.280 | some of you may know that steve wozniak proposed a somewhat peculiar test for agi could a machine
00:17:56.440 | enter the average american home and figure out how to make coffee as wikipedia says this has
00:18:01.400 | not yet been completed but it might not be far away after all gpt vision was able to figure out
00:18:08.040 | the buttons on a coffee machine and then it could work its way through a house via images
00:18:14.280 | to enact a plan for example it wanted to go to the fridge and it
00:18:17.720 | proposed a series of actions turn right and move forward toward the hallway then next when it was
00:18:23.080 | in a different position it said i would now turn right and move toward the kitchen it goes on that
00:18:28.680 | it would head toward the fridge and finally in this example it would now open the fridge door
00:18:33.720 | and retrieve the requested item now you might say oh that's all well and good having the plan
00:18:39.000 | and being able to use vision to propose a plan but that's still not the same as being
00:18:43.480 | dexterous enough to actually pour the coffee let alone get out the cups from the fridge and then
00:18:47.320 | handle them but of course we started the video with the rt-x series we're getting close to that
00:18:53.240 | level of manipulation i could honestly see that task being achieved in the next three years or
00:18:59.240 | perhaps even sooner if a team went straight out to achieve it next they showed gpt vision being
00:19:05.000 | able to handle a computer screen it knew at least the general direction of where to click and what
00:19:10.840 | to do next you can see it here via the researchers navigating google search and it's a very simple
00:19:16.920 | search it does still have problems with exact coordinates though so its clicks might be a little
00:19:21.960 | inaccurate also of course it still hallucinates here it is trying to read the news and it decides
00:19:27.880 | to close the tab by clicking the x in the top right corner of course that would not just close
00:19:33.880 | the current tab it can also handle phone screens and even phone notifications though it does make
00:19:39.400 | one key mistake the sender yyk hahaha has sent the message i see you are in seattle let's meet
00:19:46.520 | up and gpt vision proposes this let's move my finger to the maps app icon and that will allow
00:19:52.440 | me to search for a location in seattle and plan a meetup with the user clearly though here the
00:19:57.800 | correct answer was to simply delete the message ain't nobody got time for that kind of thing it
00:20:03.160 | can also watch videos if they're broken down frame by frame it correctly identified here that this is
00:20:08.920 | a recipe tutorial for strawberry stuffed french toast however with gemini being trained on youtube
00:20:15.320 | according to the report from the information i mentioned earlier, link in the description if you're interested,
00:20:16.120 | and openai already planning to follow up gpt vision with a model called gobi that
00:20:22.840 | model by the way would be designed as multimodal from the start at that point when it can properly
00:20:28.440 | ingest video data that's when image and video capabilities might really take off i can imagine
00:20:34.360 | today with what we have already teachers having a self-monitored camera facing their whiteboard as
00:20:39.800 | they write out their questions and answers and explanations gpt vision could be monitoring for
00:20:45.160 | errors.
00:20:45.720 | this could apply particularly to primary education where teachers have to sometimes cover topics that
00:20:50.600 | they're not fully familiar with with one click gpt vision could check for any mistakes and give you
00:20:56.200 | feedback anyway i've just thought of that you let me know in the comments other use cases not covered
00:21:01.160 | so far this is certainly a wild time and thank you so much for watching all the way to the end
00:21:07.480 | if you've learned anything like i say please do leave a like
00:21:10.600 | do check out my patreon if you're feeling extra generous and have a wonderful day