
RT-X and the Dawn of Large Multimodal Models: Google Breakthrough and 160-page Report Highlights


Transcript

Just as I was reaching the final pages of this epic 168-page report on GPT Vision, which showcased unexpected abilities, novel use cases, and predictions for the future, Google dropped their colossal RT-X endeavor. And the data Google used, with over 500 skills and 150,000 tasks, is open source. I've picked out over 75 highlights from both papers, which I read in full, and I'll also bring in an exclusive report from The Information to give us an idea of what is next.

By the end of the video, I want you to have a great sense for what GPT with Vision can do today and a more acute awareness of what is coming in Vision, Video and Robotics tomorrow. But let's start with this Google DeepMind report released just a few hours ago.

Essentially, what it showed is that you could create a general-purpose robot by combining data from diverse robotic tasks. These were datasets from different universities on different continents.

And Google wanted to see if this diverse data could improve their famous RT-2 model. I talked a lot more about RT-2 in the video you can see on screen, but essentially it was trained on web data as well as robotics data. That meant it could understand instructions like "pick up the extinct animal".

But the RT-X series is another step up, even though it comes just two months later. The report highlighted that, conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Their big finding was that training a single model on that diverse data enabled it to outperform even the specialist models.

The improved version of RT-1 became RT-1-X, and RT-2 became RT-2-X. Here you can see RT-1-X out-competing the specialist models across a range of domains (weirdly, aside from this wiping robot at Berkeley): kitchen manipulation, cable routing, door opening, and many more tasks that I'll get to in a second.

The paper even demonstrated applicability not just to robot arms but to quadrupeds; think four-legged, dog-like robots. And here you can see how it improved RT-2, which was already a big improvement on RT-1. As Google says, "our results suggest that co-training with data from other platforms imbues RT-2-X with additional skills that were not present in the original dataset, enabling it to perform novel tasks."

Apparently it couldn't do things like this before: "move the apple between the can and the orange", "move the apple near but not on top of the cloth", or "move the apple on top of the pot". This gives you a taste of the kinds of skills they incorporated: picking, moving, pushing, placing, sliding, putting, navigating, separating, pointing, and on and on.

I like that at the end we have assembling and turning on. The paper draws the analogy with large language models that are trained on massive web-scale text data. And those models tend to outperform systems that are only trained on narrow task-specific datasets. Well, same thing now with robotics. That's why I call this the GPT-2 moment for robotics.

And the paper says even if such robotic datasets in their current size and coverage are insufficient to attain the impressive generalization results that have been demonstrated by LLMs, in the future the union of such data can potentially provide this kind of coverage. One way to put it would be to think of how general and multi-skilled GPT-4 is in language.

From coding to poetry and mathematics and more. Now imagine that, but with robotic skills. One question you might have is: how would data on folding clothes help with pushing an apple? Well, they say that, unlike most prior works, "we directly train our policy on all of the X-embodiment data without any mechanisms to reduce the embodiment gap."

They didn't try any big translation between domains or any breaking down of the problem into sub-aspects. They did, however, put the input image data into a common resolution and unify the action vectors across seven dimensions.
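
To make "unifying the action vectors across seven dimensions" a bit more concrete, here is a toy sketch; the field names and the end-effector parameterization (x, y, z, roll, pitch, yaw, gripper), plus the resize target, are my own assumptions for illustration, not Google's actual data pipeline.

```python
from dataclasses import dataclass

import numpy as np
from PIL import Image


# Assumed unified 7-D action format: end-effector deltas plus a gripper command.
# The exact parameterization is a guess at the convention, not Google's published schema.
@dataclass
class UnifiedAction:
    dx: float       # translation deltas
    dy: float
    dz: float
    droll: float    # rotation deltas
    dpitch: float
    dyaw: float
    gripper: float  # e.g. 0 = closed, 1 = open

    def to_vector(self) -> np.ndarray:
        return np.array(
            [self.dx, self.dy, self.dz, self.droll, self.dpitch, self.dyaw, self.gripper],
            dtype=np.float32,
        )


def to_common_resolution(image: Image.Image, size: tuple[int, int] = (300, 300)) -> Image.Image:
    """Resize camera frames from any robot dataset to one shared resolution."""
    return image.resize(size)


# Hypothetical usage for one (image, action) pair from some source dataset:
# frame = to_common_resolution(Image.open("frame_0001.png"))
# action = UnifiedAction(0.02, -0.01, 0.0, 0.0, 0.0, 0.1, 1.0).to_vector()  # shape (7,)
```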

Now, that 55-billion-parameter number, and the fact that it comes from the PaLI-X model undergirding RT-2-X, is the perfect segue to GPT Vision. After all, OpenAI directly compared their GPT-4 Vision model to PaLI 17 billion, which was a precursor model to PaLI-X. And you might notice that, at least for visual question answering, the PaLI model, that precursor 17-billion-parameter model, outperformed GPT-4 Vision. So everything you're about to see from the 168-page GPT-4 Vision report actually represents the lower bound of current frontier capability.

To recap: that's GPT-4 Vision being beaten by PaLI at 17 billion parameters, which is beaten by PaLI-X at 55 billion parameters, which has now been incorporated into the RT-2-X robot model. And all of that is before even bringing in OpenAI's Gobi model or Google's Gemini, which I'll talk about later.

So, the main focus of the video: the Dawn of Large Multimodal Models, a huge 160-plus-page report from Microsoft. It was released just a few days ago and, to be honest, it could merit an entire video on its own.

I'm going to give you all of the highlights here in this video. So please do leave a like or subscribe if you find it helpful. For all of the fascinating demos you're about to see, Microsoft says that they carefully controlled both the images and text to prevent them from being seen during GPT-4V training.

The images were either not accessible online or had a timestamp beyond April 2023. Their headline finding was that GPT-4 Vision shows impressive, sometimes human-level, capabilities across a wide range of domains.

And by the end, they're proposing agent structures and testing for self-consistency. It gets pretty wild. But it's time for the first demo.

They showed GPT Vision this table with the drinks that they had ordered, and they took a photo of the menu. They asked: how much should I pay for the beer on the table, according to the price on the menu? And GPT Vision got it. We're starting a little slow here, but imagine you're drunk on the beach and you don't even know what you've ordered; this could be useful.

Next was GPT Vision putting the information from a driver's license into JSON format. First time round it wasn't perfect, listing his hair color as not applicable when the license says brown, but there are ways to improve performance, as we'll see later. In fact, the first way of improving performance comes right now, and that's chain of thought. Chain of thought is basically a way of getting the model to put out its intermediate reasoning, often by using a phrase like "let's think step by step". As you can see here, the first time around the model couldn't identify that there are 11 apples, and even when they used "let's think step by step", it still got it wrong. What about "let's count the apples row by row"? Nope: it got the right final answer, but got the rows mixed up. They tried some other prompts, but finally settled on this one: "You are an expert in counting things in the image." And this time it got it right (there's a quick sketch of that prompting pattern at the end of this section). Now, all those new methods you've been seeing, like LLMs as prompt optimizers or Promptbreeder, it looks like they are going to be equally applicable to GPT Vision. In fact, after that example, they say that throughout the paper they're going to employ that technique of calling the model an expert, across various scenarios, for better performance.

But here is one of the big revelations from the paper for me. The paper says that this particular ability isn't seen in existing models, and that is being able to follow pointers: circles, squares, or even arrows drawn on an image. And the amazing thing is, this seems to work better than giving GPT Vision coordinates. The researchers from Microsoft drew these arrows onto the photo themselves, putting on the labels "object 1" and "object 2", and GPT Vision analyzed those objects perfectly. But there was something I noticed, which is that technically the green arrow is pointing to the pavement and the red arrow is pointing to the table; I should say that the arrows end at those places. Instead of interpreting the literal end of the arrows, it sussed what the human meant: the nearest big object, the glass bottle and the beer. I know that's something small, but I thought that little detail was really impressive.
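
To make that "expert" prompt plus chain-of-thought concrete, here is a minimal sketch of how you might send an image and that kind of prompt to a GPT-4 Vision-style model, assuming the OpenAI Python client (v1+); the model name, file path, and exact wording are illustrative choices of mine, not the paper's actual setup.

```python
import base64

from openai import OpenAI  # assumes the openai Python package, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_about_image(image_path: str, question: str) -> str:
    # Encode the local image as a base64 data URL, which the vision endpoint accepts.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # any vision-capable model you have access to
        messages=[
            # The paper's trick: frame the model as an expert at the specific task.
            {"role": "system", "content": "You are an expert in counting things in the image."},
            {
                "role": "user",
                "content": [
                    # Append a chain-of-thought cue to elicit intermediate reasoning.
                    {"type": "text", "text": question + " Let's think step by step."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            },
        ],
    )
    return response.choices[0].message.content


# Hypothetical usage: counting the apples in a photo.
# print(ask_about_image("apples.jpg", "How many apples are in this image?"))
```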

The next big finding was that in-context few-shot learning is still really crucial, even for vision models. In-context means as part of the prompt, not in the pre-training, and few-shot just means giving a few examples before you ask your key question. And the proof-of-concept examples you're about to see, they say, "vividly demonstrate the rising significance of in-context few-shot learning for achieving improved performance with large multimodal models".

I'm going to speed through this example just so you get the gist. Essentially, they asked it to read the speed on this speedometer. It gets it wrong, and even when they ask it to think step by step, it still gets it wrong. Then they gave it instructions, and it still got it wrong.

They tried loads of things, but it just couldn't seem to do it. Even when they gave it one example, one-shot, it still got it wrong. You can see in the prompt that they gave a correct worked example and then asked the question again, but no, it said the needle was just passing 70 miles an hour. But finally, two-shot, with two worked examples, it then got it right.
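
For a sense of how those worked examples can be packaged as in-context demonstrations, here is a rough sketch in the OpenAI chat format; the image file names, readings, and wording are placeholders I've invented, not the prompts Microsoft actually used.

```python
import base64


def image_part(path: str) -> dict:
    # Wrap a local image as the base64 data-URL content block the vision endpoint expects.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}


def few_shot_messages(examples: list[tuple[str, str]], query_image: str, question: str) -> list[dict]:
    """Build a chat message list with worked (image, answer) examples before the real query."""
    messages = [{"role": "system", "content": "You read analogue speedometers precisely."}]
    for example_image, worked_answer in examples:
        messages.append({"role": "user", "content": [{"type": "text", "text": question}, image_part(example_image)]})
        messages.append({"role": "assistant", "content": worked_answer})  # the demonstration answer
    messages.append({"role": "user", "content": [{"type": "text", "text": question}, image_part(query_image)]})
    return messages


# Hypothetical two-shot setup, mirroring the paper's two worked examples:
# messages = few_shot_messages(
#     [("speedo_a.jpg", "The needle is just past 40, so roughly 42 mph."),
#      ("speedo_b.jpg", "The needle sits a little under 90, so about 85 mph.")],
#     "speedo_query.jpg",
#     "What speed does this speedometer show?",
# )
# These messages can then be passed to client.chat.completions.create(...) as in the earlier sketch.
```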

Next, they showed that the model could recognize celebrities, and there was one particularly interesting example: Jensen Huang. He's the CEO of Nvidia, which produces the GPUs that went into training GPT-4 Vision. Anyway, it could apparently recognize its own ingredients, saying he's likely holding a GPU.

Next, they had it recognizing landmarks, even if they were at weird angles or at night. It could recognize dishes, even if they had unusual toppings and condiments. It also did pretty well at medical image understanding, identifying what was wrong with this particular foot.

You can see it also working with a CT scan. Of course before we get too excited our old friend hallucination is still there. It described a bridge overpass that I frankly can't see at all. It skipped over North Carolina entirely when asked about the states shown on this map.

And it also gets seemingly random numbers wrong. Take this table, where it noted down almost every number correctly, to three decimal places, but then for some reason the 15,971,880 became 15,971,421. Actually, I've just noticed while filming that that's the same ending as the profit in the next column, so maybe there was a reason, but it's still pretty random. The point is, you still can't fully rely on the outputs.

And it seems to me that in Figure 36 there was a mistake that even the researchers didn't notice; if I'm right, that shows how pernicious these mistakes can be. The researchers say that the model "not only understands the program in the given flowchart, but also translates the details to a Python code". If you go to Figure 36, you see this flowchart, and the model was asked: can you translate the flowchart to Python code? You can see the flowchart, and you can see the code. Now, obviously it's impressive that it can even attempt this, but that code is dodgy: it takes a string as input, essentially a bunch of characters rather than a floating-point number. The goal was to print the larger number, and that's what GPT Vision says in its explanation it will do, but that input problem means it returns 3 when comparing 3 with 20, because Python compares strings character by character rather than numerically.
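
To see why string input breaks that flowchart translation, here is a minimal reconstruction of the failure mode; this is my own illustration, not the exact code GPT Vision produced.

```python
# Buggy version, in the spirit of what Figure 36's generated code does: input() returns
# strings, and Python compares strings character by character, so "3" > "20" because '3' > '2'.
def larger_of_two_buggy() -> str:
    a = input("First number: ")
    b = input("Second number: ")
    return a if a > b else b  # lexicographic comparison, not numeric


# Fixed version: convert the inputs to floating-point numbers before comparing.
def larger_of_two_fixed() -> float:
    a = float(input("First number: "))
    b = float(input("Second number: "))
    return a if a > b else b


# With inputs 3 and 20, the buggy version returns "3"; the fixed version returns 20.0.
```
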
The paper also called this answer correct when averaging these two numbers, which comes out to 76.555; that rounds to 76 dollars and 56 cents, not 76 dollars and 55 cents. Now, you might say all of this is pedantic, but the errors keep coming. The paper says that in the bottom row of Figure 37, GPT Vision "shows a clear understanding of both the x and y axis and explains the key insight presented in the chart". Go to Figure 37 and you get this chart. The key insight from this chart, to me, is that publishing bad, okay, or pretty good papers makes almost no difference; it's only when they get very creative, original, and good that they make lots of impact on your career. Now, GPT Vision does say that publishing a bad paper has little impact on your career and a creative paper has significant impact. Correct. But then it says the impact of the paper on a person's career increases as the quality of the paper improves. While that's technically correct, it misses the key insight: basically a flat line, and then a sudden upward turn.

Anyway, loads of errors, but let's focus on the potential use cases, because we must remember that GPT Vision is still capable of things like this. A guy on Twitter, or X, Daniel Litt, said this: "I've been told GPT-4 with Code Interpreter is good at math." He was taking the mickey, because the output is this: can you compute the seventh root of 3 to the power of 7? Now, the seventh root is the inverse of raising a number to the power of seven, so the answer should be 3, but it said the seventh root of 3 to the 7 is approximately 4.2. But then someone else put that image into GPT Vision and asked: why is this tweet funny? And GPT Vision was able to pick up on the humor: the humor in this tweet arises from the mathematical inconsistency; the question posed to GPT-4 with Code Interpreter asked for the seventh root of 3 to the 7, and mathematically the seventh root of 3 to the 7 is simply 3. It corrected its own error. The incongruity between the question and the answer creates a comedic effect. So it was not only able to correct the error, it was able to see the humor in someone pointing that out. But then someone went even more meta, pasting this entire exchange into GPT Vision and asking: why is this analysis funny? And GPT Vision was able to summarize the entire situation, describing GPT-4 itself as analyzing its own mistake: the irony lies in GPT-4 critiquing its own incorrect answer. I would have given it bonus points if it had said "and here I am talking about it", but let's not ask for too much; this is already impressive.

Speaking of GPT Vision being a bit more critical, it was asked this question: how many families are earning more than 13,000 and own more than two cars? The question is very ambiguous. It gave no time period: earning more than 13,000 a month? A year? And it talked about owning the cars, when the table was just about vehicles per family; of course, a family might have a vehicle without owning it. I'd have loved it if GPT Vision had picked up on this ambiguity and asked clarifying questions. Instead, it outputted a reasonable answer based on a few assumptions, and the paper marked it as correct.

They did show it analyzing a full academic paper and making only a few mistakes, though, and to me that shows some pretty crazy potential, especially for the next model down the line. Imagine a model being able to read all AI-related papers in any language and synthesize some of the findings. That's when things might get a little out of control.

I do think the paper gets a little over-eager in places, though. For example, here they fed GPT Vision a series of frames depicting a player taking a penalty. As you can see in the last frame, the ball is in the net. GPT Vision correctly said the ball was not blocked by the goalkeeper. The conclusion of the paper is that this demonstrates cause-and-effect reasoning, determining whether the ball was blocked based on the goalkeeper-ball interaction. But to me, it could be simple memorization based on the web-scale data it was trained on: for example, it might have seen many, many images of a ball in the back of a net, and it understands that those images correspond to a penalty not being saved. You can let me know in the comments if you think this demonstrates "a considerable level of sophistication in the model's reasoning abilities".

I was really impressed by this, though. They sent GPT Vision a series of photos and highlighted one of the guys in the photo, and it was able to deduce that he is playfully pretending to throw a punch rather than actually fighting. Now, I'm sure that many models, and even quite a few humans, might think these images depict a real punch, but if you look at it carefully, it does seem like he's playing. So that was really impressive to me.

It could also identify South Park characters just from ASCII art, and that's despite it not currently being able to generate good ASCII art itself. Or maybe it can, but the reinforcement learning has drained it of that ability. Anyway, it is able to read the emotions of people from their faces, so if you one day approach a GPT Vision model looking like this, it's going to know what you're thinking. I don't know if this quite counts as emotional intelligence or empathy, though; those were some of the words used by the paper.
I did find it interesting, though, that they said understanding anger, awe, and fear will be essential in use cases such as home robots. I'm not sure they're anticipating many people being angry at, in awe of, or in fear of their home robot, which presumably they bought. Maybe the model finds faces easier to read than helmets, because it says there are eight people in this image wearing helmets.

And, as I speculated previously, it is able to iterate on image-generation prompts. It noticed how this output didn't match the original request, to have it look like a graphic novel, and then it suggested improvements to the prompt. As I've said before, imagine this combined with DALL-E 3, with constant iterations: it might take a bit longer, but only the output that gets 10 out of 10 from GPT Vision would be handed to you.

Some of you may know that Steve Wozniak proposed a somewhat peculiar test for AGI: could a machine enter the average American home and figure out how to make coffee? As Wikipedia says, this has not yet been completed, but it might not be far away. After all, GPT Vision was able to figure out the buttons on a coffee machine, and then it could work its way through a house via images to enact a plan. For example, it wanted to get to the fridge, and it proposed a series of actions: turn right and move forward toward the hallway. Then, from a different position, it said it would now turn right and move toward the kitchen. It goes on that it would head toward the fridge, and finally, in this example, it would open the fridge door and retrieve the requested item. Now, you might say that's all well and good, having the plan and being able to use vision to propose a plan, but that's still not the same as being dexterous enough to actually pour the coffee, let alone get the cups out of the fridge and then handle them. But of course, we started the video with the RT-X series; we're getting close to that level of manipulation. I could honestly see that task being achieved in the next three years, or perhaps even sooner if a team went straight out to achieve it.

Next, they showed GPT Vision being able to handle a computer screen. It knew at least the general direction of where to click and what to do next. You can see it here, via the researchers, navigating Google Search, and it's a very simple search. It does still have problems with exact coordinates, though, so its clicks might be a little inaccurate. Also, of course, it still hallucinates: here it is trying to read the news, and it decides to close the tab by clicking the X in the top right corner; of course, that would close the whole window, not just the current tab. It can also handle phone screens and even phone notifications, though it does make one key mistake. The sender "yyk hahaha" has sent the message "I see you are in Seattle, let's meet up", and GPT Vision proposes this: let's move my finger to the Maps app icon, which will allow me to search for a location in Seattle and plan a meetup with the user. Clearly, though, the correct answer here was to simply delete the message; ain't nobody got time for that kind of thing.

It can also watch videos if they're broken down frame by frame. It correctly identified here that this is a recipe tutorial for strawberry-stuffed French toast.
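
Since GPT Vision can only "watch" a video as a series of still frames, here is a rough sketch of how frames might be sampled and packed into a single prompt; it assumes OpenCV and the OpenAI Python client (v1+), and the clip name, sampling interval, and question are placeholders of mine.

```python
import base64

import cv2  # OpenCV, for decoding video frames
from openai import OpenAI


def sample_frames(video_path: str, every_n: int = 30, max_frames: int = 8) -> list[str]:
    """Grab every Nth frame from the clip, JPEG-encode it, and return base64 strings."""
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok_enc, buf = cv2.imencode(".jpg", frame)
            if ok_enc:
                frames.append(base64.b64encode(buf.tobytes()).decode("utf-8"))
        i += 1
    cap.release()
    return frames


client = OpenAI()
frames = sample_frames("cooking_clip.mp4")  # hypothetical clip
content = [{"type": "text", "text": "These are frames from a video, in order. What is the video showing?"}]
content += [{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}} for f in frames]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # any vision-capable model
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```
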
However, Gemini is being trained on YouTube, according to the report from The Information I mentioned (linked in the description below if you're interested), and OpenAI is already planning to follow up GPT Vision with a model called Gobi; that model, by the way, would be designed to be multimodal from the start. At that point, when a model can properly ingest video data, that's when image and video capabilities might really take off.

I can imagine, even today, with what we already have, teachers having a camera facing their whiteboard as they write out their questions, answers, and explanations, with GPT Vision monitoring for errors. This could apply particularly to primary education, where teachers sometimes have to cover topics they're not fully familiar with. With one click, GPT Vision could check for any mistakes and give feedback. Anyway, I've just thought of that; you let me know in the comments about other use cases not covered so far.

This is certainly a wild time, and thank you so much for watching all the way to the end. If you've learned anything, like I say, please do leave a like, do check out my Patreon if you're feeling extra generous, and have a wonderful day.