Stanford CS25: V3 I Low-level Embodied Intelligence w/ Foundation Models
00:00:00.000 |
So, hey guys, thanks for coming to our second class. 00:00:10.000 |
Today we have the pleasure of welcoming Fei Xia. He's a senior research scientist at 00:00:15.080 |
Google DeepMind, where he works on the robotics team. 00:00:18.800 |
He received his PhD here, actually, working with Silvio Savarese in the Stanford Vision and Learning Lab. 00:00:27.960 |
And his mission is to build intelligent embodied agents that can interact with complex and 00:00:33.380 |
unstructured real-world environments, with applications to home robotics. 00:00:40.120 |
Recently he has been exploring the use of foundation models for robot decision-making 00:00:49.080 |
Hi everyone, I'm super happy to be here and happy to be back. 00:00:53.400 |
I graduated from here two years ago, and now I'm a research scientist at Google DeepMind. 00:00:58.800 |
I work on the robotics team, and today I will be talking about low-level embodied intelligence 00:01:05.600 |
So it's definitely an interesting topic, and I will introduce what is embodied intelligence 00:01:11.320 |
and what is low-level embodied intelligence, and how we can accelerate the building of embodied intelligence with foundation models. 00:01:19.800 |
All right, so why are we working on embodied intelligence? 00:01:24.340 |
So embodied intelligence is an integral part of artificial intelligence, and it's an important 00:01:31.760 |
milestone to artificial general intelligence. 00:01:35.760 |
And it has a lot of use cases, like for example, we all hope we have a home robot that can 00:01:41.000 |
be in our home 24/7 and clean the home for us, or clean up our messy room, or cook for 00:01:48.720 |
us, or take care of our aging family members. But we are not there yet. 00:01:56.360 |
That is because the intelligence we have built is currently mostly in the virtual world. 00:02:00.840 |
So we have AI agents that can help us draft emails or write eloquent essays, but they 00:02:07.480 |
are not super good at interacting with the messy, unstructured, complex real-world environment. 00:02:16.600 |
So just to give you guys a couple of examples of how messy the real world can be, and how 00:02:22.280 |
hostile it could be to robotics, I want to show you a curious error made by one of our robots. 00:02:30.720 |
So the task is to put the Coke can in the sink, and watch what the robot does. 00:02:36.380 |
The robot grabs the Coke can and opens the tap. 00:02:39.400 |
So this is kind of dangerous, but it's kind of interesting, right? 00:02:45.820 |
Because we never expect it would do something like that. 00:02:48.800 |
It's just from random noise, it starts to open the tap, and the water starts to come out. 00:02:54.520 |
So for an agent to have this type of physical intelligence, it needs to understand the effect 00:03:01.000 |
of its actions, and what is so-called a world model. 00:03:04.640 |
So people have been complaining that language models so far don't have a world model. 00:03:08.780 |
So they don't understand geometry, they don't understand the spatial relationships of objects, 00:03:15.120 |
or the effects of actions, basically how objects will move according to physical laws. 00:03:24.040 |
In another case, this is our robot that is ready to deliver a can, or actually throw away a can. 00:03:31.480 |
But as you can see, we have this pre-programmed behavior of tucking the arm behind. 00:03:39.800 |
So if there's any liquid in the can, it will spill and damage the robot. 00:03:44.800 |
So it's another example of how the real world is really complex, and there are a lot of things to take care of. 00:03:50.560 |
And in order for our robots to have this sort of embodied intelligence, they really need 00:03:56.620 |
to understand a lot of very nuanced details of the environment, understand the 00:04:02.120 |
physics, the physical laws, and understand the effects of their actions. 00:04:10.280 |
There are many ways to achieve embodied intelligence. 00:04:12.560 |
Actually, throughout my PhD study, I've been fascinated by this idea of creating interactive 00:04:18.840 |
environments, basically letting agents explore in these interactive environments, 00:04:28.520 |
so that if an agent needs to survive in such an environment, it must develop intelligence. 00:04:33.920 |
So it's an ecological view of perception and agency, popularized by the American psychologist James J. Gibson. 00:04:42.320 |
So he has a famous quote, "Ask not what is inside your head, but what your head is inside of." 00:04:49.200 |
So humans learned this type of embodied intelligence. 00:04:51.960 |
Humans are able to manipulate objects effortlessly, one, because of evolution, and second, because 00:04:59.400 |
we have been playing with toys, we have been interacting with them, and watching the effects of our actions. 00:05:06.080 |
And similarly, we can give robots a safe playpen, so they can explore in those environments and 00:05:13.240 |
interact with the environment and play, and watch the effects of their actions, and effectively understand how the world works. 00:05:22.460 |
So I have been developing these simulation environments, one of which is called Gibson Env. 00:05:32.520 |
It's mainly aiming at simulating the visual world faithfully, and also simulating physics. 00:05:40.960 |
So we built this environment, which is a scanned environment from a lot of houses. 00:05:46.680 |
And then we can spawn an agent in that, in this case, a humanoid agent, and 00:05:51.840 |
the agent can learn to walk or to run in this environment, and we simulate all the perception for it. 00:05:59.120 |
So we can create a perception action loop for this agent. 00:06:03.560 |
And similarly, we can put other types of agents in this environment, in this case, a little 00:06:09.200 |
cart, and we can also put a quadruped or this ant into this environment. 00:06:16.040 |
So essentially, we create an environment where we can simulate perception for the agent, 00:06:23.000 |
and then we can create a neural network to map the perception to action. 00:06:26.840 |
And this way, we achieve some sort of physical intelligence. 00:06:37.560 |
So in this case, the environment is one monolithic piece of mesh. 00:06:42.500 |
As you can see, the agent runs into the wall and it bounces back. 00:06:46.900 |
So there is no articulation in this environment. 00:06:50.080 |
So it's not simulating the full complexity of the environment. 00:06:53.800 |
So the things that we can do with our agent is rather limited. 00:06:59.080 |
So that's why we created other simulation environments, one of which is the iGibson environment, which is fully interactive. 00:07:06.140 |
So what we do is, again, scan a lot of real-world houses, and then we convert 00:07:13.520 |
them to CAD assets, basically mesh assets that are interactable. 00:07:18.600 |
In this case, we have a simulated agent that goes into the environment and then closes all the doors. 00:07:25.260 |
So we are able to do that because we model the complexity of the world a little bit more. 00:07:33.000 |
We start to model physics a little bit more, basically modeling the degrees of freedom in the environment. 00:07:39.640 |
And our agent can do more than just navigating around. 00:07:50.200 |
And our agent can develop more complicated behavior, such as unloading a dishwasher: 00:07:55.040 |
finding a bowl, taking out the bowl, and putting it on the table. 00:07:59.700 |
So as we scale up the complexity of the environment, we are able to learn much more complicated behaviors. 00:08:08.920 |
And that's one way to achieve embodied intelligence, which is to build complex enough simulation environments. 00:08:19.080 |
Not just my research, but the entire field of computer vision is undergoing a paradigm shift. 00:08:24.740 |
So previously, we were focusing on internet AI. 00:08:27.780 |
We curate a lot of internet data sets to study problems like classification, segmentation, 00:08:35.180 |
Basically all these computer vision problems. 00:08:37.460 |
Now we focus a lot more on embodied AI, which adds the action dimension to the problems 00:08:44.700 |
we study: problems like visual navigation, manipulation, rearrangement, embodied 00:08:49.340 |
question answering, instruction following, and the simulators in some sense replace the static datasets. 00:08:59.900 |
One thing that doesn't change is that data is still super important. 00:09:04.420 |
We are still relying on a large amount of data to learn this intelligent behavior, no 00:09:11.220 |
matter if it's from a static data set or from a simulator. 00:09:16.780 |
So learning in simulation can take a lot of interactions. 00:09:22.180 |
So just to give you an example, we create this iGibson environment and we want to learn 00:09:27.980 |
a behavior called go into a room through a closed door. 00:09:31.820 |
So this is a rather simple behavior, which I can show on the top right of the screen. 00:09:37.320 |
So the agent needs to stop in front of the door, it needs to stop at the right distance. 00:09:42.300 |
If it stops too close to the door, it cannot extend its arm to 00:09:50.940 |
open the door; when there is enough clearance, it will go through the door. 00:09:55.100 |
However, it takes about 50,000 episodes or 1.25 million environment interactions to learn this behavior. 00:10:04.100 |
This is because we are using model-free reinforcement learning, the agent is exploring this environment. 00:10:10.100 |
It could push at any point, it could stop at any point. 00:10:14.100 |
So we give it a reward function to go into the room, but it's very rare that it will stumble upon the desired behavior by chance. 00:10:23.340 |
I would like to argue that with foundation models, we can do this a lot differently. 00:10:29.700 |
You just ask ChatGPT, how do you go into a room through a closed door? 00:10:33.980 |
And it will say, open the door, walk through the door. 00:10:36.120 |
So this is a gross simplification of the problem. 00:10:43.660 |
But what I'm just saying is that we can leverage a lot of semantic priors from the foundation models. 00:10:51.460 |
So if we really need a lot of data, the foundation model is a compressed 00:10:56.900 |
version of the entire internet data, and it's a knowledge base that you can query to accelerate learning. 00:11:03.580 |
Of course, simulation and real world data is still super, super important, but maybe 00:11:10.580 |
We can use foundation models plus a limited amount of simulation or real world data. 00:11:16.980 |
So that's what I'm going to talk about today. 00:11:20.460 |
So where are we in terms of foundation models plus robotics? 00:11:24.300 |
So our team at Google DeepMind has been pioneering foundation models plus robotics. 00:11:29.940 |
So we developed SayCan, a high-level planning algorithm. 00:11:37.480 |
It is an algorithm that can parse a user command, 00:11:45.540 |
for example: I spilled my drink, how would you throw it away and bring me something to help clean? 00:11:48.740 |
And it queries a large language model, which gives each candidate action a score, highlighted in blue. 00:11:56.180 |
The affordance will tell you whether an action at a given state is possible. 00:12:00.580 |
It's augmenting the language model to give you only possible things. 00:12:04.880 |
So essentially, it is doing the semantic planning with a language model. 00:12:09.860 |
But it's also taking into consideration what it can do. 00:12:13.080 |
So it's not just outputting anything -- language models tend to hallucinate. 00:12:22.320 |
It only gives you what is possible for the robot to do and what is actionable for the robot. 00:12:26.760 |
And the robot is doing the thing that is advancing the long horizon task progress. 00:12:31.860 |
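To make the scoring mechanism concrete, here is a minimal sketch of SayCan-style action selection. The function names and scoring details are hypothetical placeholders for illustration, not the actual SayCan implementation:

```python
# Minimal sketch of SayCan-style action selection (hypothetical helpers, not the real API).
# The LLM scores how useful each candidate skill is for the instruction (semantic score),
# and an affordance model scores how likely the skill is to succeed from the current state.

def select_skill(instruction, state, candidate_skills, llm_score, affordance_score):
    best_skill, best_score = None, float("-inf")
    for skill in candidate_skills:
        # e.g. llm_score = log-likelihood of the skill text given the instruction prompt
        semantic = llm_score(instruction, skill)
        # e.g. affordance_score = value-function estimate of success from this state
        feasible = affordance_score(state, skill)
        combined = semantic + feasible   # combine the two scores in log space
        if combined > best_score:
            best_skill, best_score = skill, combined
    return best_skill
```

The combined score is what keeps the plan both semantically sensible (from the language model) and actually executable (from the affordance model).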
And also, each task is executed by a low-level policy. 00:12:36.560 |
Here it doesn't quite clean the table, because we haven't added this to the low-level skill. 00:12:42.880 |
But imagine there is a low-level skill to clean the table. 00:12:50.960 |
The low-level policy used here is Robotic Transformer 1, RT1. 00:12:59.200 |
Essentially, we collect a large data set of human demonstrations. 00:13:04.280 |
We put a transformer, and we train it on this large data set of expert trajectories. 00:13:12.440 |
It is able to do about 700 tasks with 97% success rate. 00:13:18.440 |
And it has interesting generalization behavior. 00:13:21.280 |
It can operate in a new kitchen it has never seen before, which is showing there is a successful 00:13:27.780 |
recipe to apply foundation models in robotics. 00:13:31.760 |
So that's roughly where are we in terms of foundation model plus robotics. 00:13:36.400 |
And I will talk about a few new works that are bringing this to the next level. 00:13:44.600 |
So actually, my teammate, Ted, gave a talk on foundation models plus robotics in a previous iteration of this course. 00:13:58.680 |
I actually watched it last night so that I don't repeat some of the contents. 00:14:04.280 |
But what he basically mentioned is that he reviewed our team's progress in terms of building robot foundation models. 00:14:14.840 |
And we have had somewhat of a detour, and now we have sort of figured out a recipe. 00:14:21.200 |
So 2021 to 2022 was about how we scale to many tasks with demonstrations. 00:14:38.160 |
We tried imitation learning plus reinforcement learning, and some other ways of combining the two. 00:14:46.520 |
In 2022 to 2023, it's about how we can leverage foundation models to accelerate robotics. 00:14:51.880 |
We really see a proliferation of using foundation models to accelerate robotics, both on the 00:14:58.680 |
high-level planning and low-level control, probably leaning more towards the high-level side. 00:15:04.520 |
So the recipe is essentially to combine a large-scale, diverse 00:15:11.760 |
offline dataset with a high-capacity architecture, such as a transformer, and use language conditioning. 00:15:20.000 |
So this will be the recipe to build foundation models for robotics. 00:15:29.000 |
Essentially, let's just scale everything by orders of magnitude and be done with it. 00:15:47.600 |
So we are still on our way, on our quest to solve low-level embodied intelligence. 00:15:53.680 |
When I talk to people about using foundation models to do robotics, their reaction would 00:16:04.760 |
often be that it doesn't do the low-level manipulation really well. 00:16:10.480 |
One of the reasons is Moravec's paradox. 00:16:14.000 |
Moravec's paradox is the observation that in artificial intelligence and robotics, contrary 00:16:18.600 |
to traditional assumptions or our intuitions, reasoning requires very little computation. 00:16:24.000 |
But sensory motor control and perception skills require enormous compute resources. 00:16:29.720 |
That is because as biological creatures, we acquire the sensory motor skills through evolution. 00:16:39.680 |
So we might not be that good at abstract reasoning or large-scale computation. 00:16:46.720 |
But this sensorimotor control is integral to our survival. 00:16:51.580 |
So it's essentially already encoded in our DNA. 00:16:55.080 |
But in robotics, it's a little bit different. 00:16:57.480 |
So the chips are very good at doing reasoning and computation. 00:17:04.960 |
They haven't acquired the sensorimotor skills that are necessary for them to do tasks in the physical world. 00:17:13.680 |
When the computer beat Kasparov, the human champion in chess, there was another observation. 00:17:23.880 |
It can beat the human champion in chess, but there still needs to be someone to move the chess pieces for it. 00:17:30.360 |
Similarly, in the AlphaGo moment, when Lee Sedol was beaten by AlphaGo, there was still 00:17:35.000 |
someone who was moving the pieces for it. 00:17:39.260 |
So this is showing that, for AI, the hard things are easy, and the easy things are hard. 00:17:45.640 |
There's another thing that prevents us from using foundation models more prevalently, 00:17:51.600 |
more in a larger scale in robotics, which is the training data bias. 00:17:57.160 |
The training data of foundation models or large language models is mostly language data from the internet. 00:18:02.440 |
So it's perhaps not that surprising that it knows how to clean up a kitchen, because maybe there 00:18:07.960 |
are wikiHow articles teaching you how to clean up a kitchen or to do something in a procedural way. 00:18:14.040 |
But there are no wikiHow articles teaching you how to move your finger five centimeters 00:18:18.260 |
to the left because people just don't say that. 00:18:23.040 |
So there is a very limited amount of this low-level control data in large language model training corpora. 00:18:29.720 |
So we do have a lot of challenges in bringing the foundation models to a lower level. 00:18:34.000 |
So that's what I mean by low-level embodied intelligence. 00:18:42.040 |
So if there is any questions, feel free to interrupt me any time. 00:18:53.120 |
So there are a couple of challenges of using large language models for low-level control. 00:18:57.240 |
As I just mentioned, the first thing is lack of data. 00:19:01.680 |
So we only have perhaps 100,000 episodes of human demonstration data, and it took about 17 months with 13 robots to collect. 00:19:15.920 |
In contrast, large language models are trained on the order of 1,000 billion tokens. 00:19:21.680 |
A smaller PaLM was trained on 780 billion tokens, and for the larger one, following 00:19:31.640 |
the Chinchilla rule, you would need to train it on 1.35 trillion tokens. 00:19:36.600 |
So there is a huge discrepancy between how much data we can get in robotics and 00:19:43.320 |
how much we can get in large language models. 00:19:48.120 |
So we will always be bounded by robotic data. 00:19:53.560 |
Maybe we can keep the robotics data the same, and then we can scale on other fronts. 00:19:58.200 |
Like, maybe we can scale the pre-training mix of text and image, or maybe image and 00:20:03.520 |
Maybe we can build this cake, and the robotics data is just a cherry on top of it. 00:20:10.020 |
And we can scale the foundation really, really well. 00:20:14.260 |
Some of the work that I'm going to talk about today actually reuses the RT-1 data. 00:20:18.800 |
We didn't collect new data for RT-2, but we want to do more things with the same amount of data. 00:20:26.000 |
The second challenge is kind of related to the first challenge. 00:20:30.400 |
Language models lack an interface for low-level control. 00:20:34.680 |
If you ask a language model, how do you make a robot dog stand up on two feet, it will 00:20:39.360 |
tell you a lot of things that sound reasonable, sound plausible. 00:20:43.120 |
It will tell you the robot dog's torso should be upright, balanced over the two hind feet, and so on. 00:20:55.920 |
On the other hand, maybe we can ask a language model to write control code to directly control the robot. 00:21:01.120 |
But usually, that requires you to curate an API that is friendly to the language model. 00:21:06.840 |
But if you directly ask it to give you the joint angles to make the robot stand upright, 00:21:12.520 |
it will not give you the right thing, because it doesn't have enough context. 00:21:15.600 |
So essentially, large language models don't speak robot language. 00:21:20.700 |
Can we actually find the right robot language? 00:21:24.120 |
Can we find the interface between large language models and robot control? 00:21:28.240 |
Or can we just treat robot action as another language? 00:21:36.200 |
In today's agenda, I will be talking about low-level embodied intelligence with foundation 00:21:41.280 |
It's separated into two parts, and it's addressing the two challenges that I've just mentioned. 00:21:47.480 |
Part one is about model consolidation, joint scaling, and positive transfer. 00:21:51.760 |
So I have to put them in one part because they are somewhat related. 00:21:56.400 |
And part two is developing new interface of large language models. 00:22:04.840 |
Yeah, I was going to ask, why couldn't you just fine-tune an LLM for generating low-level actions? 00:22:19.880 |
So the question is, why can't we fine-tune a language model to directly output low-level actions? 00:22:29.880 |
So I will be talking about RT-2, which does something similar to that. 00:22:33.800 |
It fine-tunes a language model to output actions as a language, to output our action representation. 00:22:42.600 |
There are some caveats; for example, you would need to collect additional data to fine-tune a language model. 00:22:48.720 |
So either we can fine-tune that, or we can use the language model zero-shot if you find 00:22:53.160 |
the right interface, which I will talk about a little bit in the part two. 00:23:01.220 |
So model consolidation is, essentially, that we can do the high-level reasoning and low-level control in one model. 00:23:07.240 |
And joint scaling is that we not only scale the robot data, which is expensive, 00:23:15.880 |
but we also start from a pre-trained vision language model and scale its pre-training. 00:23:19.720 |
And positive transfer is the model benefiting from diverse joint training across internet- 00:23:24.560 |
scale language, vision, and vision language domains, combined with robotics. 00:23:31.720 |
So this is a continuation of the axes that Ted drew in his previous talk. 00:23:42.560 |
So this visualization basically highlights some of the work on our team. 00:23:47.820 |
And each work, each column, is basically a robotic system that is able to do both high-level planning and low-level control. 00:23:57.260 |
So previously, we needed to have separate models for each thing. 00:24:03.080 |
Previously, in the initial release of SayCan, the planning is done by a large language model. 00:24:09.040 |
And the affordance is done by a QT-Opt-like policy trained with sim-to-real. 00:24:20.040 |
And the low-level policy is Robotic Transformer 1. 00:24:23.720 |
So it's each model doing its dedicated thing. 00:24:28.160 |
And we need to train each model differently, and perhaps with different type of data. 00:24:34.480 |
And later, we have Q-Transformer, which is kind of an offline RL method. 00:24:46.820 |
It can train on both positive data and negative data. 00:24:49.520 |
And with that, we are able to get a policy that also understands affordances. 00:24:56.180 |
So we can unify the low-level policy and affordances. 00:24:58.880 |
But the planning is still a large language model. 00:25:01.280 |
And then we have PaLM-E, which is a vision language model, which is a large language 00:25:06.580 |
model also trained on vision language domains. 00:25:09.980 |
So PaLM-E can do planning and affordance in just one model. 00:25:18.620 |
Then there is RT-2, which I'm going to talk about today, that can do high-level planning 00:25:23.760 |
to some extent, generate affordances, and act as the low-level policy. 00:25:28.320 |
So behind the model consolidation is the consolidation of tasks. 00:25:33.640 |
We can represent every task as a vision plus text to text task. 00:25:39.160 |
So it's a really universal representation of the task. 00:25:42.840 |
And then with that, you can train it on a lot of data. 00:25:49.960 |
Basically, learning affordance can also tell you how to achieve a task. 00:25:56.800 |
There is transfer between tasks when you pool all the tasks together. 00:26:03.800 |
So to understand this joint scaling and to understand the model consolidation, we first need to understand PaLM-E. 00:26:12.280 |
So PaLM-E is an embodied multimodal language model. 00:26:19.440 |
We made some adaptation on the architecture so it can understand multimodal input. 00:26:25.900 |
So it is basically one model that is able to take in multimodal input. 00:26:34.760 |
So in large language models, each word is tokenized and mapped to an embedding vector. 00:26:45.800 |
And then that is fed into a large language model. 00:26:49.240 |
So in PaLM-E, what we do is, instead of using only words, we can use multimodal tokens. 00:26:56.120 |
So the multimodal tokens can come from a vision transformer, a ViT, or they can come from robot state vectors. 00:27:06.100 |
So every multimodal token is then mapped into the text embedding space. 00:27:14.400 |
We basically train an affine transform between the multimodal tokens and the text embedding space. 00:27:24.380 |
And then we can treat the multimodal token as words as well. 00:27:30.200 |
So essentially, we have a language model as a solid base, and then we start to adapt it to multimodal input. 00:27:39.760 |
So this is quite interesting, because it doesn't require a ton of adaptation or fine-tuning. 00:27:50.000 |
It just aligns naturally to the multimodal input, such as images. 00:27:54.840 |
I will show a couple of examples of what it can do. 00:27:58.400 |
And we can train in the same way as training large language models. 00:28:01.600 |
So essentially, we can reuse the same infrastructure and training algorithm and everything to train PaLM-E. 00:28:10.320 |
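Here is a minimal sketch of the projection idea just described: multimodal tokens are mapped into the word-embedding space with a learned affine transform and then treated like words. The dimensions and module names are illustrative assumptions, not the actual PaLM-E implementation:

```python
import torch
import torch.nn as nn

# Sketch of the PaLM-E-style idea: multimodal tokens (e.g. ViT patch features) are mapped
# into the language model's word-embedding space with a learned affine projection, then
# interleaved with ordinary text token embeddings. Sizes and names are illustrative.

class MultimodalProjector(nn.Module):
    def __init__(self, vision_dim=1024, text_embed_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_embed_dim)  # learned affine transform

    def forward(self, vision_tokens):
        # vision_tokens: (num_patches, vision_dim) coming from a ViT encoder
        return self.proj(vision_tokens)  # (num_patches, text_embed_dim)

def build_input_sequence(text_embeds, image_embeds):
    # Treat the projected image tokens as if they were words: concatenate (or interleave)
    # them with the text embeddings before feeding the decoder-only language model.
    return torch.cat([image_embeds, text_embeds], dim=0)
```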
A couple of other things we found along the way is positive transfer, which I will share later. 00:28:17.400 |
So I guess here, I also want to mention PaLM-E is one of the largest models we have explored. 00:28:25.400 |
It has 562 billion parameters, which comes from concatenating the 540-billion-parameter PaLM with a 22-billion-parameter ViT. 00:28:34.400 |
And we find a lot of emergent capabilities of these models. 00:28:39.040 |
That is, capabilities we hadn't expected during training time, but we can prompt these models to elicit them. 00:28:48.920 |
We have also explored using a neural scene representation, basically an object-centric representation of the scene. 00:28:59.440 |
So object-centric representation assigns one token to each object. 00:29:07.160 |
And we find that this representation is super helpful for robot planning tasks, because 00:29:13.040 |
traditional ViT representation is based on a grid, and it doesn't have a full understanding of objects. 00:29:20.640 |
We have done an extensive study on the scaling performance and the catastrophic forgetting 00:29:27.720 |
performance and all other interesting experiments in the paper. 00:29:34.640 |
So here, I'm just showing some interesting qualitative examples or some emergent capability 00:29:44.400 |
So first, we found this model has some reasoning capability. 00:29:48.000 |
You can give it an image and ask it questions that require a little bit of reasoning. 00:29:52.960 |
And you can prompt this with, let's think step-by-step, which is a technique used to elicit reasoning in language models. 00:30:01.400 |
But here, in multi-modal language models, you can do the same. 00:30:04.760 |
I guess people are also experimenting these days with GPT-4V. 00:30:09.440 |
You can also prompt it to think step-by-step or count row-by-row. 00:30:13.480 |
But here, this is before GPT-4V, and we were able to elicit reasoning using some 00:30:19.760 |
interesting prompts, such as asking it: in this photo, are there more cats or more dogs? 00:30:28.040 |
And PaLM-E found out there are an equal number of dogs and cats. 00:30:32.200 |
And on the right, given an image: can I go down the street on a bicycle, yes or no? 00:30:39.240 |
And the reply is: first, do not enter; second, except bicycles; so, yes. 00:30:45.200 |
So it's doing this modest reasoning, and it's mixing this understanding of symbols with visual understanding. 00:30:55.240 |
So this is quite amazing to me, to be honest, when I first saw this. 00:31:00.440 |
I didn't expect a multi-modal language model would be able to do that. 00:31:04.880 |
And we also tried one thing which is traditionally very difficult for language models, which is telling jokes. 00:31:11.920 |
Language models can understand jokes, but sometimes they're just not able 00:31:17.400 |
to land the punchline when telling you a joke. 00:31:21.600 |
Because it's just trying to make something that is plausible and sounds like a joke. 00:31:27.040 |
And when it comes to the punchline, it doesn't really know what to say. 00:31:30.760 |
So here, I give it an image, and I ask it to come up with a description, and then come up with a joke. 00:31:37.780 |
So this guides the language model to think step-by-step. 00:31:40.920 |
And the description is a donkey is carrying a dog, cat, and rooster. 00:31:45.160 |
And the joke is, what do you call a donkey with a rooster on his back? A rooster booster. 00:31:50.200 |
Like, when I saw this, I was pleasantly surprised. 00:31:58.840 |
And finally, we see some math reasoning with this model. 00:32:03.280 |
Basically, I give it a messy menu from a pizza store, and I ask it: I'm just buying a pizza, how much do I need to pay? 00:32:16.040 |
And it's figuring out there is a pizza, and it is $9.99, and it tells you the price. 00:32:23.520 |
In some of the answers, it even calculates tax, but the tax rate is hallucinated. 00:32:29.520 |
All right, let's talk about positive transfer. 00:32:32.520 |
So apart from the amazing things that PaLM-E can do, it also has interesting positive transfer properties. 00:32:43.100 |
So when we train PaLM-E on a single domain, when we train it on just a single robotics domain, the performance is worse. 00:32:52.480 |
But when we pool all the data together, and we also include internet-scale visual language 00:32:59.280 |
tasks, such as captioning or visual question answering, it is able to do much better. 00:33:05.400 |
So this shows that it's important to mix all the data together and train it jointly. 00:33:12.520 |
The internet-scale data can act as a regularizer for you to not forget the representations. 00:33:20.960 |
And those representations are, in turn, very useful for robotics. 00:33:28.300 |
And we start to see more and more positive transfer in other of our studies. 00:33:32.660 |
So how much data did you have to collect, like in simulation or in the real world? 00:33:37.480 |
I think the sorting of stuff on the table is very impressive. 00:33:50.640 |
So these are all planning data, like high-level planning. 00:34:00.000 |
So first of all, for the sorting results, the low-level policy is still a traditionally trained policy. 00:34:10.680 |
And that policy is trained on 68,000 episodes. 00:34:16.080 |
The high-level planning is probably easier than you think, because it's giving commands to the low-level policy. 00:34:25.800 |
So it basically only needs to say, put the red block into the top-left corner, put another block next to it, and so on. 00:34:32.960 |
So it's a rather standard autoregressive language modeling task. 00:34:39.840 |
The only thing it needs to do is to determine which tasks are not finished yet. 00:34:45.260 |
So for example, if the block is already in the corner, it shouldn't call the low-level policy to move it again. 00:34:50.780 |
So it's rather like parsing the states and understanding the states. 00:34:55.720 |
So this high-level policy only requires about 50 to 100 demonstrations to learn. 00:35:02.800 |
And in the future -- that's a very good question, actually -- a lot of these tasks could be done with in-context learning. 00:35:09.240 |
So maybe we just demonstrate it once to the large language model, and then it knows how to do it. 00:35:15.880 |
Yeah, this is through human demonstration as well. 00:35:27.400 |
So a human can demonstrate a low-level policy by tele-operating a robot. 00:35:34.200 |
But on a high level, a human could also just give commands to the low-level policy -- imagine you control the robot through language. 00:35:44.460 |
And then as a human, you can also guide a low-level policy to accomplish a task. 00:35:49.600 |
And then that data can be used to train a large language model. 00:35:57.280 |
The SayCan data is a little bit more interesting, because the planning steps are actually generated by a model. 00:36:04.460 |
So we essentially distilled PaLM plus this affordance model into PaLM-E. 00:36:13.020 |
It's like using the AI data to bootstrap itself. 00:36:16.840 |
That one has about 3,000 episodes, which is also not a lot. 00:36:22.080 |
But it's able to learn complex planning behavior, replanning behavior, and error recovery, which I will show in a moment. 00:36:30.000 |
So with the POM-e as a high-level planner, we are able to take the rice chips out of 00:36:38.960 |
the drawer, and there is a twist, which is I will be messing with the robot. 00:36:47.440 |
So as it puts the chips onto the counter, I put them back into the drawer. 00:36:52.040 |
And it picks them up again, and then I put them back again. 00:36:58.400 |
It's able to understand that its task is not finished. 00:37:03.240 |
Now, after I stop messing with it, it's able to close the drawer and pick up the bag of chips. 00:37:11.560 |
So PaLM-E is able to combine affordance and planning in one model and do complex reasoning. 00:37:22.760 |
And interestingly, we can use the exact same model checkpoint to do block sorting as well. 00:37:30.840 |
It can not only reason about how to bring a bag of chips to a user, it can also sort blocks. 00:37:38.000 |
So it's also responding to adversarial perturbation: if the user puts 00:37:46.060 |
the block back in the middle, it's able to recover from that. 00:37:57.360 |
So yeah, this is the power of vision language models. 00:38:06.160 |
These are all vision language models that are used for planning or high-level reasoning. 00:38:15.280 |
And then there's the RT-2 work, which is a vision-language-action model that transfers web knowledge to robotic control. 00:38:28.480 |
Here the instruction is to pick up the extinct animal, and there is a whole range of objects on the table. 00:38:32.860 |
So it can link the extinct animal to the dinosaur and to the action of picking the dinosaur up. 00:38:40.960 |
So it's really doing this emergent reasoning and also the manipulation in just one model. 00:38:47.760 |
And by the way, this robot hasn't seen any of these objects before, at least in the robot training data. 00:38:55.440 |
It might have seen them in internet data, but it has never seen them in the robotics data. 00:39:03.080 |
So it's quite interesting how we need to evaluate these robots nowadays. 00:39:10.680 |
So when we evaluate language models, to prevent data contamination, every time you need to 00:39:16.480 |
give them new questions, because otherwise they might have already memorized them from their training data. 00:39:22.000 |
When we evaluate these robots, we actually go to the dollar store to buy all these toys to test them. 00:39:29.520 |
And as we run more evaluation, maybe there will be some replication as well. 00:39:33.840 |
But as you can see, it is able to understand to pick up this dinosaur toy. 00:39:42.960 |
So we start from a vision language model that is trained on internet-scale data. 00:39:49.000 |
And then we also combine it with robotics action data, which is the RT-1 data, and we co-fine-tune on both. 00:39:55.600 |
And we can dive deeper, a little bit deeper into RT2. 00:40:00.280 |
So first of all, what is a visual language model? 00:40:02.520 |
A vision language model is a transformer that takes in images and text and outputs text. 00:40:11.380 |
So within Google, there is a vision language model called PaLI, which is an encoder-decoder architecture. 00:40:22.440 |
It basically has a ViT to understand images, and then a transformer encoder and decoder. 00:40:30.960 |
Vision language models encompass both the visual and the semantic understanding of the world. 00:40:35.800 |
And in robotics, we have to deal with a lot of both of these. 00:40:40.500 |
And the question is, can we leverage the knowledge in the vision language models and apply it to robotics? 00:40:51.320 |
If you want to learn more about RT-1, you can listen to the previous episode of this CS25 course. 00:40:59.320 |
So Ted gave a detailed introduction to RT-1 there. 00:41:02.080 |
But RT-1, if you stand far enough back, is also a vision-language-to-action model of some sort. 00:41:16.220 |
The camera image passes through a FiLM EfficientNet, which tokenizes it into 81 tokens, and 00:41:21.580 |
then goes into a TokenLearner, which compresses everything into eight tokens. 00:41:26.480 |
And then there is a transformer block, leveraging a lot of self-attention layers, which then generates the action tokens. 00:41:41.040 |
The end-effector has six degrees of freedom, its position and rotation, plus the gripper opening. 00:41:49.140 |
And there is another dimension representing whether to terminate the episode or not. 00:41:57.500 |
And we discretize every dimension into 256 bins. 00:42:03.020 |
And then we do a cross-entropy loss on those bins. 00:42:05.780 |
So that's the RT-1 architecture in a nutshell. 00:42:10.020 |
It's quite similar to a vision language model, just with different output tokens. 00:42:13.820 |
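To make the action discretization concrete, here is a small sketch of mapping the 8-dimensional continuous action into 256 bins per dimension, and back. The per-dimension value ranges below are illustrative assumptions, not the actual RT-1 values:

```python
import numpy as np

# Sketch of RT-1-style action discretization: each of the 8 action dimensions
# (terminate flag, 3 position deltas, 3 rotation deltas, gripper) is mapped to one of
# 256 bins, so the policy can be trained with a cross-entropy loss over bin indices.
# The ranges below are placeholders for illustration only.

NUM_BINS = 256
ACTION_LOW = np.array([0.0, -0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
ACTION_HIGH = np.array([1.0,  0.1,  0.1,  0.1,  0.5,  0.5,  0.5, 1.0])

def discretize(action):
    """Map a continuous 8-dim action to 8 integer bin indices in [0, 255]."""
    normalized = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    bins = np.clip((normalized * (NUM_BINS - 1)).round(), 0, NUM_BINS - 1)
    return bins.astype(np.int64)

def undiscretize(bins):
    """Map bin indices back to the continuous action that is executed on the robot."""
    normalized = bins.astype(np.float64) / (NUM_BINS - 1)
    return ACTION_LOW + normalized * (ACTION_HIGH - ACTION_LOW)
```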
So it's rather natural that we just use a large pre-trained vision language model directly as the policy. 00:42:24.440 |
And one question is, how do we deal with actions when using pre-trained vision language models? 00:42:30.120 |
And here is action representation that we use. 00:42:33.460 |
The robot actions here are the eight dimensions. 00:42:38.100 |
And as I mentioned, there is termination, position change, rotation change, and the gripper. 00:42:47.120 |
We also tried other alternative representations, but they are not as good as this naive string of discretized numbers. 00:42:58.660 |
Oh, the FiLM EfficientNet is a pre-trained convolutional neural network. 00:43:07.180 |
So the reason that we do this is, through some ablation study, we decided how we want to tokenize the image. 00:43:17.100 |
And we can tokenize using the FiLM EfficientNet. 00:43:19.940 |
FiLM means it also takes in the language embedding and appends it to the intermediate layers of the network. 00:43:28.300 |
So we basically have some combination of features from language and image, encoded with the images. 00:43:50.820 |
The FiLM EfficientNet is about how we tokenize the images and how we combine vision information with language information. 00:44:22.100 |
You can basically tokenize your image just by itself. 00:44:25.220 |
And then you can have language and use cross-attention to combine the image and text representation. 00:44:35.540 |
So we do have a lot of considerations, such as latency. 00:44:38.660 |
That's why we use this FiLM EfficientNet, because it's super fast. 00:44:42.020 |
And it can output a limited amount of tokens, which we can further compress with Token Learner. 00:44:50.620 |
Like, for every single image it sees, does it attend to the previous ones? 00:44:57.740 |
Yes, and each time, we use a history of up to six steps. 00:45:04.540 |
So it sees about two seconds of history before it. 00:45:12.940 |
Again, if you have more questions about RT1, I recommend watching the previous episode. 00:45:26.660 |
This will be the output of our transformer, which is a vision language model. 00:45:31.420 |
We tried other alternatives, such as floating-point numbers. 00:45:35.300 |
Floating-point numbers are not super friendly to the language model tokenizer, because of how they get split into tokens. 00:45:42.340 |
We also tried human language, such as left or right. 00:45:46.820 |
But that cannot be directly executed on a robot, which is a limitation of that representation. 00:45:53.440 |
So if we commit to this action representation, which is just a string of numbers, we essentially can fine-tune any vision language model for control. 00:46:01.420 |
We tried different variants, including PaLI-X. 00:46:11.700 |
There is a 5-billion-parameter variant and a 55-billion-parameter variant. 00:46:16.020 |
And we also tried PaLM-E, which is 12 billion parameters. 00:46:20.300 |
The procedure that we use to train RT-2 is co-fine-tuning. 00:46:26.700 |
Co-fine-tuning is to put the internet-scale data and the robotic data together. 00:46:32.880 |
And then we fine-tune it on this mixture of data so that it retains the internet-scale knowledge. 00:46:42.940 |
Maybe that's also an artifact of our robot data being too small and not diverse enough. 00:46:46.620 |
So if you just fine-tune on robotics data, it will quickly overfit and forget about the internet pre-training. 00:47:02.260 |
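Here is a small sketch of what a co-fine-tuning data mixture can look like: each batch mixes web vision-language examples with robot trajectory examples. The 50/50 weighting and the dataset objects are illustrative assumptions, not the actual RT-2 mixture:

```python
import random

# Sketch of co-fine-tuning: each training batch mixes internet-scale vision-language
# examples (e.g. captioning, VQA) with robot trajectory examples, so the model keeps its
# web knowledge while learning to output action strings. Names and ratio are assumptions.

def sample_cofinetuning_batch(web_vlm_dataset, robot_dataset, batch_size=64, robot_frac=0.5):
    batch = []
    for _ in range(batch_size):
        if random.random() < robot_frac:
            batch.append(robot_dataset.sample())    # (image, instruction) -> action string
        else:
            batch.append(web_vlm_dataset.sample())  # (image, question) -> text answer
    return batch
```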
We basically-- again, we do this autoregressively. 00:47:10.180 |
And we format this as a question and answering task. 00:47:13.260 |
What should the robot do to achieve a certain task? 00:47:15.860 |
And the task is a string that human give the robot for the robot to achieve. 00:47:20.620 |
And it also has the current observation, which is the robot's camera image. 00:47:30.600 |
That passes through a ViT, and then through the large language model, which then outputs a string of numbers. 00:47:38.380 |
So we leverage constrained decoding to make sure it always outputs eight numbers, 00:47:45.680 |
because otherwise, we cannot de-tokenize it. 00:47:49.640 |
It's very easy for a language model to just miss one number. 00:47:52.860 |
So we do have some mechanisms, such as constrained decoding and beam search, to make sure the output is valid. 00:47:59.320 |
After we get the string of eight numbers, we de-tokenize it to a delta T and delta R, the translation and rotation changes of the end-effector. 00:48:07.280 |
And the robot can just directly run this on the robots. 00:48:10.120 |
After they run on the robots, we repeat this process. 00:48:13.280 |
We get another new image, run through this process, and get a new action. 00:48:17.160 |
And we repeat this process until a termination is decoded. 00:48:21.240 |
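Putting the loop just described together, here is a minimal sketch of the closed-loop control procedure. The callables passed in are hypothetical placeholders, and constrained decoding is abstracted as the assumption that exactly eight integers come back:

```python
# Sketch of the RT-2-style control loop described above. The callables passed in
# (get_image, query_vlm, detokenize, execute) are hypothetical placeholders, not real APIs.

PROMPT = "What should the robot do to achieve the task: {task}?"

def run_episode(task, get_image, query_vlm, detokenize, execute, max_steps=200):
    for _ in range(max_steps):
        image = get_image()
        # Constrained decoding ensures exactly 8 integer tokens come back:
        # [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper], each in [0, 255].
        tokens = query_vlm(PROMPT.format(task=task), image, num_numbers=8)
        if tokens[0] == 1:  # termination decoded: stop the episode
            break
        delta_t, delta_r, gripper = detokenize(tokens[1:])  # undo the 256-bin discretization
        execute(delta_t, delta_r, gripper)                  # send the deltas to the robot
        # then loop: grab a new image and query the model again
```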
So some people might be concerned that this is rather slow. 00:48:27.040 |
It's in fact quite slow, because it's 12 billion parameters, or 5 billion parameters. 00:48:34.920 |
So we run on a TPU cluster, and the robot is querying the TPU cluster to get the action numbers. 00:48:42.980 |
So for the 12 billion parameters, we can actually run at 10 hertz. 00:48:49.280 |
For all those models, we can run at least three hertz. 00:48:52.060 |
So that is sufficient for controlling a robot. 00:48:57.960 |
And we see a lot of emergent skills that are not in the training set. 00:49:04.880 |
Essentially, as I just mentioned, we are probing what this RT2 can do. 00:49:10.280 |
So we are trying to figure out what RT2 can do. 00:49:12.440 |
So we test it with a lot of new tasks, such as put a strawberry into the correct bowl, 00:49:18.720 |
or move a banana to Germany, just to test its understanding of symbols or flags. 00:49:29.360 |
So basically, test its semantic reasoning and also low-level manipulation skills. 00:49:36.240 |
And we divide the tasks into symbol understanding, reasoning, and human recognition. 00:49:45.280 |
And we found that with RT-1, which is not trained on internet-scale data, we do quite poorly on these tasks. 00:49:56.200 |
The RT-2 variants, which are co-fine-tuned on the internet data and our robotics data, do much better. 00:50:11.040 |
So RT-2 with the 55-billion-parameter backbone performs better than the 12-billion-parameter one, although 00:50:17.920 |
they perform quite similarly on in-domain tasks. 00:50:20.840 |
But the generalization is kind of interesting. 00:50:23.200 |
It seems with larger scale, you can generalize better. 00:50:27.920 |
And here are some videos of the robot achieving these tasks, like moving the banana to a number, 00:50:35.840 |
putting the strawberry into the correct bowl, moving a Rubik's cube to the water bottle 00:50:41.880 |
with the instruction given in Chinese, and moving the banana to the German flag. 00:50:46.840 |
So it's able to do all of these very interesting tasks. 00:50:51.960 |
In terms of the quantitative evaluations, we also found that the RT2 policy is quite 00:50:57.880 |
robust to unseen objects, unseen backgrounds, and unseen environments. 00:51:03.960 |
And here is another evidence of positive transfer. 00:51:07.040 |
So co-fine-tuning with VQA data outperforms fine-tuning on robotics data only. 00:51:12.520 |
And if you're trained on robot data from scratch, it barely works. 00:51:16.680 |
It almost doesn't work, because it overfits to robot data. 00:51:21.880 |
So we do need to do co-fine-tuning, or at least fine-tuning, so it retains its internet-scale knowledge. 00:51:30.660 |
This is also a recipe for how people could develop a domain-specific vision language model. 00:51:36.960 |
So you start from a very general vision language model, and you fine-tune on your domain. 00:51:41.240 |
Or you can co-fine-tune with your specific domain data. 00:51:45.540 |
This is likely a problem that each vertical of artificial intelligence will encounter someday. 00:51:56.800 |
This shows some cross-embodiment results: RT-2 with a 3-billion-parameter PaLI outperforms previous models in 00:52:02.120 |
terms of moving blocks around in a 2D environment. 00:52:09.240 |
And in large-language models, we have this chain-of-thought reasoning, which is a method 00:52:14.640 |
to elicit reasoning in large-language models. 00:52:18.120 |
You can either do zero-shot chain-of-thought reasoning by saying, let's think step by step. 00:52:23.800 |
It's basically decoding more things and then coming to the conclusion. 00:52:27.840 |
We can use a similar procedure for the RT2 as well. 00:52:32.120 |
So in RT-2 with PaLM-E, instead of directly decoding the actions, we can actually decode a plan first and then the actions. 00:52:40.280 |
So this gives the language model an opportunity to understand or parse a question first. 00:52:45.940 |
It also gives us the opportunity to reason about things a little bit. 00:52:50.200 |
For example, if you say, "Bring me a drink," it will say, "Pick up 7up can," because that's the drink it sees on the table. 00:52:57.720 |
So we synthesized a couple hundred such examples using a large-language model just by augmenting 00:53:02.960 |
the instruction and then fine-tuned the RT2 just for a couple hundred steps. 00:53:07.080 |
So it's between full fine-tuning and in-context learning, and it is able to do some reasoning. 00:53:13.480 |
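To give a feel for what such chain-of-thought-augmented training data might look like, here is a sketch of an example where a short plan is decoded before the action string. The field names, wording, and the action numbers are hypothetical, not the paper's verbatim format:

```python
# Illustrative example of a chain-of-thought-augmented training string: the model
# first decodes a short natural-language plan, then the 8-number action string.
# The exact format and numbers here are hypothetical placeholders.

example = {
    "prompt": "Instruction: Bring me a drink. What should the robot do?",
    "target": "Plan: pick up the 7up can. Action: 0 132 114 128 25 156 255 101",
}

def split_plan_and_action(target: str):
    plan_part, action_part = target.split("Action:")
    plan = plan_part.replace("Plan:", "").strip()
    action_tokens = [int(x) for x in action_part.split()]
    return plan, action_tokens

plan, action = split_plan_and_action(example["target"])
print(plan)    # "pick up the 7up can."
print(action)  # [0, 132, 114, 128, 25, 156, 255, 101]
```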
And some of the interesting reasoning tasks include, "I need to hammer a nail. 00:53:17.560 |
Which object from the scene might be useful?" 00:53:19.440 |
And in the scene, there is a headphone, there is a rock, and there is a sticky note. 00:53:24.600 |
And the robot will say, "Rocks," and then generate actions to pick up the rock. 00:53:28.460 |
So it's interesting that it's able to do this sort of reasoning with RT2. 00:53:34.040 |
And here is a demonstration of some of the chain-of-thought reasoning with RT-2 PaLM-E. 00:53:39.880 |
And the task is, "Pick up the thing that is different from all other objects." 00:53:44.280 |
And it picks up the chocolate, because that is a snack and the other things are drinks. 00:53:49.120 |
And I can also give the instruction in a different language, and the plan would be to translate it into 00:53:53.920 |
a language that it's familiar with, which is English, and then do the task. 00:54:01.280 |
There are also potential failure cases of the chain-of-thought reasoning. 00:54:04.700 |
So here I say, "Move the green objects together." 00:54:06.720 |
And as you can see, the robot oscillates between the two green objects, because there are two equally valid plans. 00:54:11.720 |
It could move the can to the bag of chips, or it could move the bag of chips to the can. 00:54:16.820 |
It oscillates between the two plans until one action brings it closer to one object, and then it will 00:54:22.220 |
commit to that plan rather than the other. 00:54:26.460 |
It's not always guaranteed to work, but it's quite interesting. 00:54:29.920 |
And it's also interesting that, again, we are testing the manipulation policy the way 00:54:34.460 |
we test the intelligence of humans or animals or kids, because the robots are getting more and more capable. 00:54:41.980 |
As a summary, we have a vision language action model that is able to improve generalization. 00:54:50.560 |
It can do new tasks and operate on new objects. 00:54:53.520 |
It can also do chain-of-thought reasoning, and by improving the underlying model, such as 00:54:59.440 |
the vision language model itself, by scaling it up and training it with internet-scale 00:55:06.320 |
data or training it with larger or higher-quality internet-scale data, we can achieve better 00:55:11.040 |
robot control, which is quite amazing, because the robotics field has traditionally developed 00:55:16.280 |
quite slowly, bounded by hardware, bounded by a lot of different things, bounded by operations. 00:55:21.120 |
But now it seems we can piggyback on the development of the foundation model field, and whatever 00:55:27.920 |
they do will trickle down to our field as well. 00:55:30.560 |
And the future will be to increase the motion diversity and extend the chain-of-thought reasoning. 00:55:40.040 |
And so there is another example of positive transfer, which you might have seen recently. 00:55:46.520 |
So far, I've been talking about scaling differently. 00:55:49.520 |
I've been talking about not scaling robotics data and scaling other data instead. 00:55:54.120 |
That's because robotics data is so hard to collect, and the purpose is not to avoid collecting robot data altogether. 00:56:00.160 |
It's to develop a recipe that you can do more with limited robotics data. 00:56:05.560 |
However, there's also an effort from our team and the entire robotics field to scale up 00:56:12.640 |
the robot data collection, which is called Open X-Embodiment. 00:56:16.840 |
And the model trained on it is called RT-X, Robotics Transformer X. 00:56:20.440 |
It's basically 22 types of embodiments, 527 skills, and 60 datasets pooled all together. 00:56:28.520 |
So this will be the ultimate dataset we can use to study positive transfer and to study cross-embodiment learning. 00:56:37.280 |
And there is already evidence of positive transfer. 00:56:42.080 |
So we pooled all the data together from all these labs and found a common action representation 00:56:50.020 |
that we can use to train a robotics transformer. 00:56:53.080 |
And we have already found this jointly trained model can outperform the task-specific models developed in each lab. 00:57:02.560 |
So there are some benefits in pooling all the data together. 00:57:05.840 |
So scaling robot data is also quite important. 00:57:12.560 |
So the summary for this part is that we are seeing model consolidation. 00:57:16.640 |
We can now do the high-level reasoning and low-level control in one model. 00:57:21.240 |
And the low-level control part is what excites me, because it's so far away from the traditional 00:57:26.720 |
language model domain, it's so different, and it shows signs of life that knowledge can trickle 00:57:33.640 |
down a lot more than we used to think possible. 00:57:37.880 |
And we can scale the pre-training of vision language models as well as scaling robotics data. 00:57:42.720 |
And we observe more and more positive transfer model benefiting from diverse joint training 00:57:47.560 |
across internet-scale language, vision, and vision language domains. 00:57:52.200 |
All right, so I noticed that we are close to running out of time, so I will just very 00:58:00.120 |
quickly go through the second part, which I think is also interesting, which is finding new 00:58:04.680 |
interfaces for language models, but I will only talk about it at a very high level. 00:58:10.040 |
So language models, as we have seen, can directly output action tokens if we find the right action representation. 00:58:15.800 |
So we can treat action as yet another language to the language model. 00:58:20.240 |
So language model can do translation, so it should be able to generate action as well. 00:58:27.920 |
But can we generate more expressive actions, beyond the scope of fine-tuning? 00:58:34.640 |
So that is about finding the right interface. 00:58:38.040 |
So previously, we have already established that language models don't have an action interface. 00:58:43.880 |
And if they do have an action interface, it's not as effective. 00:58:48.480 |
So what is the best interface between language models and low-level actions? 00:58:51.840 |
I would argue the best interface between language models and low-level actions is reward functions. 00:59:06.500 |
And it's also a reparameterization of actions. 00:59:17.720 |
A skill is a mapping between my observation and my action. 00:59:21.480 |
So the mapping between observation and action can be seen as a skill. 00:59:25.360 |
But a skill can have an alternative definition, which is a set of constraints and a set of objectives. 00:59:31.500 |
So picking up the bottle means the bottle is in my right hand, and the bottle is off the table. 00:59:40.180 |
And how I pick it up doesn't really matter. 00:59:42.960 |
That's a broader definition of a skill. 00:59:47.120 |
And it's more transferable between different embodiments. 00:59:51.600 |
And the constraints and objectives can be represented as rewards. 00:59:57.560 |
So we can ask a language model to generate these reward functions. 01:00:04.360 |
Then an optimizer, which could be reinforcement learning, or could be model predictive control, optimizes 01:00:09.520 |
for those rewards, and then we run it on the robot. 01:00:19.560 |
So the reward translator basically is a two-stage process. 01:00:23.360 |
It's using the same language model, and it is using two different prompts. 01:00:28.160 |
So the motion descriptor basically describes the motion. 01:00:32.360 |
So just now we found that the language model can output a description of how a robot dog 01:00:38.640 |
should stand up, but it's not able to achieve that. 01:00:42.240 |
But the motion description is still sensible. 01:00:46.820 |
So we just generate this motion description, and then we have a reward coder 01:00:52.600 |
that translates this motion description into a piece of code representing the reward functions. 01:01:02.640 |
And these reward functions cannot be directly executed on the robot, but they can go through 01:01:08.520 |
an optimization process to learn how to achieve those reward functions. 01:01:13.480 |
So we're using reward as the interface between language model and a low-level controller. 01:01:20.020 |
And for the low-level controller, we're using MuJoCo MPC, which is a model predictive control tool. 01:01:29.660 |
It samples a lot of trajectories and finds the one that optimizes your reward. 01:01:36.180 |
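Here is a minimal sketch of the two-stage pipeline just described, with the compile-error feedback mentioned later in the Q&A. The prompts, reward-API names, and callables are hypothetical placeholders; llm() stands in for a language model call and mpc_optimize() stands in for MuJoCo MPC:

```python
# Minimal sketch of the two-stage "language to rewards" pipeline described above.
# Prompts and helper names are hypothetical placeholders, not the actual system.

MOTION_DESCRIPTOR_PROMPT = """You describe robot motions.
Describe, in plain language, how the robot should move to accomplish: {instruction}"""

REWARD_CODER_PROMPT = """You write reward code using functions like
set_torso_height(target), set_feet_on_ground(feet), set_base_velocity(vx).
Translate this motion description into reward-setting code:
{motion_description}"""

def language_to_rewards(instruction, llm, compile_rewards, mpc_optimize):
    # Stage 1: describe the motion in natural language (easier for the LLM to get right).
    motion_description = llm(MOTION_DESCRIPTOR_PROMPT.format(instruction=instruction))
    # Stage 2: translate that description into reward-function code.
    reward_code = llm(REWARD_CODER_PROMPT.format(motion_description=motion_description))
    try:
        reward_fns = compile_rewards(reward_code)
    except SyntaxError as err:
        # If the code doesn't compile, feed the error back to the reward coder only;
        # it doesn't have to propagate to the motion descriptor stage.
        reward_code = llm(REWARD_CODER_PROMPT.format(motion_description=motion_description)
                          + f"\nYour previous code failed with: {err}. Fix it.")
        reward_fns = compile_rewards(reward_code)
    # The optimizer (e.g. model predictive control) finds actions that maximize the rewards.
    return mpc_optimize(reward_fns)
```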
And we tested on a robot dog, a quadruped robot essentially, and a dexterous manipulator. 01:01:41.540 |
So the dexterous manipulator has an arm with six or seven degrees of freedom and a hand. 01:01:49.380 |
It's very hard to control manually because it has so many degrees of freedom. 01:01:56.600 |
So just to showcase some of the examples, I omitted the motion description part. 01:02:07.380 |
So it seems that the language model is able to generate the right reward functions to 01:02:13.340 |
make the robot stand up on two back feet like a human. 01:02:18.140 |
And then now we are a little bit more ambitious. 01:02:22.020 |
Can we make the robot do a moonwalk while standing up like this? 01:02:25.380 |
So a moonwalk is from Michael Jackson, and it's very challenging. 01:02:29.980 |
So it generates the motion description and generates the reward code. 01:02:35.260 |
But the motion is not so correct, not exactly what we want. 01:02:41.060 |
The nice thing about using a language model and using the reward function as the interface is that 01:02:45.820 |
you can go back and explain what went wrong and ask the language model to fix it. 01:02:51.300 |
So now we can, being very patient, explain it to the model. 01:02:54.540 |
You say: moonwalk means the robot should walk backward while the feet swing as if they are moving forward. 01:03:02.860 |
Such a great explanation, kudos to my colleague, and then we ask it to correct its answer and make the robot walk that way. 01:03:11.380 |
And after being very patient and giving it the right instruction, it's able to modify 01:03:16.740 |
the motion descriptor and also generate the right set of rewards to make this happen. 01:03:22.940 |
And now you can teach a robot to do a moonwalk just by using the language as an interface. 01:03:29.980 |
And one day we'll be able to do this on the real robot as well. 01:03:33.900 |
So in the previous section, you showed how the language model decoded numbers, and 01:03:39.820 |
you constrained it to only output numbers. 01:03:42.140 |
Here, how do you prevent it from just hallucinating some program? 01:03:49.940 |
In this work, we are not preventing hallucination in a programmatic way. 01:03:56.580 |
We have a set of system prompts or a set of rules that is explaining the API. 01:04:02.820 |
After all, the reward functions need to be able to be compiled by the optimizer. 01:04:14.100 |
What's more, if it doesn't compile, we can just give the error message back to the language model. 01:04:18.900 |
It doesn't have to propagate all the way to the motion descriptor; it can stay at the reward coder stage. 01:04:32.100 |
Using this framework, we can say, open a drawer, take the apple, put it into the drawer, and 01:04:38.540 |
close the drawer, and it will be able to do that. 01:04:43.740 |
Just using the reward coder alone is not good enough. 01:04:46.100 |
Rather, our two-stage prompting is really, really helpful. 01:04:51.100 |
I think that's another inspiration for other fields: when your domain is too different 01:04:56.380 |
from the language domain, maybe it would be good to find an intermediate representation and 01:05:00.700 |
ask the language model to explain in that intermediate representation before directly producing the final output. 01:05:07.620 |
Finally, we want to transfer this to the real world, but there is a challenge. 01:05:14.220 |
Using simulation, it might generate actions that are too dexterous, which are hard to reproduce on a real robot. 01:05:23.780 |
So we add a few more regularizer terms to stabilize the motion, and we also run some 01:05:30.220 |
state estimation on the real robots so that they understand where the cubes are, and then 01:05:37.020 |
we can take the motion from simulation and achieve it in the real world. 01:05:41.660 |
So here are some of the execution in the real world. 01:05:45.100 |
So you can say, pick up the Rubik's cube, and it will generate the motion to pick up the Rubik's cube. 01:06:02.100 |
So here, it can do 10 hertz or even 30 hertz. 01:06:15.420 |
There's one last thing that I want to talk about in terms of finding a new interface. 01:06:20.580 |
So a lot of the time, we have been thinking about the language model as a semantic engine, a semantic reasoning machine. 01:06:27.860 |
So, for example, you say the student takes out the book, 01:06:34.340 |
and the language model is able to reason about such a sequence. 01:06:37.780 |
But if you give it low-level patterns, like if you just give it obscure numbers, what can it do? 01:06:45.540 |
And we can open up the low-level interface of a language model and ask it to do robotics tasks. 01:06:52.380 |
So in this paper, "Large Language Models as General Pattern Machines," we explore using 01:06:56.980 |
the low-level interface of a large language model, essentially asking it to reason about sequences of numbers. 01:07:06.020 |
And it can solve tasks like the ARC challenge and PCFG tasks. 01:07:14.020 |
So I will dig a little bit into sequence improvement because that's quite relevant to robotics. 01:07:19.180 |
So sequence improvement is that you prompt the language model with state, action, and reward sequences. 01:07:25.220 |
And you just prompt it with a higher reward and see if it can generate actions that achieve that reward. 01:07:32.820 |
So it's doing reinforcement learning, or a reinforcement-learning-like thing, but in context. 01:07:39.460 |
So previously, you would need a dedicated algorithm and a replay buffer of collected data to do this. 01:07:46.940 |
But now you can just build everything in the language model context by leveraging the low-level interface. 01:07:53.820 |
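A minimal sketch of what such a sequence-improvement prompt could look like follows; the serialization format and the query_llm helper are assumptions for illustration, not the paper's exact format:

def build_improvement_prompt(episodes, target_reward):
    # episodes: list of (states, actions, reward), with states and actions as numbers.
    lines = []
    for states, actions, reward in episodes:
        traj = " ".join(f"{s} {a}" for s, a in zip(states, actions))
        lines.append(f"reward {reward}: {traj}")
    # Condition on a reward higher than anything seen and ask for the trajectory.
    lines.append(f"reward {target_reward}:")
    return "\n".join(lines)

def improve(episodes, query_llm):
    best = max(reward for _, _, reward in episodes)
    return query_llm(build_improvement_prompt(episodes, best + 1))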
And with that, we can actually do something like clicker training. 01:07:57.100 |
So if you are not very familiar with clicker training, it's how you train a dog. 01:08:02.500 |
You can have a dog, and when it does the right thing, you give it a reward by clicking. 01:08:09.020 |
So clicker training is just giving the agent a reward when you click. 01:08:16.380 |
And we can now use clicker training to train robots as well. 01:08:20.020 |
So here, the robot is exploring, but I give a click when it does the right thing or moves in the right direction. 01:08:26.300 |
And over time, it will be able to push the bag of chips, which is the objective of this task. 01:08:33.500 |
So you can do this entire decision-transformer-like operation, but purely in context, by just 01:08:40.060 |
giving a language model a bunch of patterns and asking it to figure out what the regularity is. 01:08:47.260 |
And this way, it can generate new actions to improve the previous sequence. 01:08:54.980 |
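Very roughly, that in-context clicker loop might look like the following sketch, where env, propose_action, and query_llm are all hypothetical placeholders:

def clicker_training(env, query_llm, propose_action, num_steps=50):
    # context holds (state, action, click) triples and stays entirely in the prompt.
    context = []
    state = env.reset()
    for _ in range(num_steps):
        action = propose_action(query_llm, context, state)  # pattern completion by the LLM
        state = env.step(action)
        click = 1 if input("click? [y/N] ").strip().lower() == "y" else 0
        context.append((state, action, click))
    return context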
So for the language model, we can find new interfaces that are more suitable for teaching robots low-level skills. 01:09:04.300 |
Reward is a bridge between the language model and low-level control, and we can fully leverage it as a 01:09:09.460 |
universal interface, and we can optimize it in real time. 01:09:16.060 |
Sometimes it outperforms generating actions directly. 01:09:18.420 |
So it really motivates using reward functions as the interface. 01:09:23.640 |
And in "Large Language Models as General Pattern Machines," we show we can use the language model beyond semantic reasoning. 01:09:30.500 |
And also, robotics is a domain rich in sequence transformation, sequence completion, and sequence improvement tasks. 01:09:37.540 |
So we can really study the lower-level mechanisms of language models. 01:09:43.380 |
And the key takeaway for this talk is that we are seeing more and more use of foundation 01:09:51.100 |
models, not only on the semantic reasoning side of robotics, but also on the dexterous side, 01:09:57.220 |
on the generating actions, on the lower-level embodied intelligence side of robotics. 01:10:03.340 |
And we need to rethink the scaling law of robotics and transformer. 01:10:07.300 |
How do we scale it with limited amount of data? 01:10:10.260 |
We have a new recipe for scaling robot models and data in RT2, which shows that you can 01:10:14.500 |
do more with the same data: with essentially RT1 data plus internet data, you can generalize much further. 01:10:20.460 |
And RTX shows that you can do a lot more with more data. 01:10:24.660 |
There are also benefits to collecting more robotics data. 01:10:29.260 |
And part two, in terms of new interfaces for language models: I think it's worth it for the 01:10:35.160 |
robotics field to think about developing new, lower-level interfaces to language models. 01:10:45.540 |
And if you find this interesting, there are a lot of references for you to look into. 01:10:50.540 |
And special thanks to my team, Google DeepMind Robotics team. 01:10:55.740 |
So we are at the forefront of developing foundation models for robotics. 01:11:03.580 |
You mentioned that raw numbers are difficult for a lot of language models, but if you're 01:11:15.760 |
just generating the action tokens themselves, like in the example you showed, 01:11:21.820 |
why don't you just have a linear layer appended to the transformer that would just generate 01:11:28.860 |
the numbers you need directly? 01:11:33.580 |
The question is: if large language models have difficulty understanding numbers, 01:11:40.140 |
why don't we use a linear layer to output the actions directly? 01:11:43.460 |
I think it is difficult for language models to understand numbers. 01:11:47.760 |
But sometimes we still want it to bring in knowledge from the pre-training mixture. 01:11:57.140 |
If I have a new layer, that new layer is not present in the pre-training. 01:12:06.420 |
But at the same time, I don't necessarily think using the raw numbers is the right interface. 01:12:12.300 |
We probably could do some action representation learning to learn a representation. 01:12:16.660 |
And the language model can output that representation. 01:12:19.660 |
So we're still trying to figure out what is the right representation. 01:12:24.180 |
So among the representations that we have tried, like decimal numbers, floating-point numbers, and 01:12:30.300 |
action tokens, we find that just using numbers or action tokens is good enough. 01:12:39.500 |
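For a sense of what treating numbers as action tokens can mean in practice, here is an illustrative discretization sketch; the bin count and the ranges are assumptions rather than the exact setup used:

import numpy as np

def action_to_tokens(action, low, high, num_bins=256):
    # Map each continuous action dimension to an integer bin index the model can emit.
    action = np.clip(np.asarray(action, dtype=float), low, high)
    bins = np.round((action - low) / (high - low) * (num_bins - 1))
    return bins.astype(int).tolist()  # e.g. [132, 7, 255]

def tokens_to_action(tokens, low, high, num_bins=256):
    # Invert the discretization back to approximate continuous values.
    frac = np.asarray(tokens, dtype=float) / (num_bins - 1)
    return low + frac * (high - low)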
Yeah, I think both directions are worth exploring. 01:13:01.980 |
There are different advantages of generating actions directly. 01:13:06.940 |
I think it borrows the autoregressive nature of language modeling. 01:13:12.660 |
And it aligns really well with a lot of other tasks, like visual question answering. 01:13:18.540 |
The limitation is that, when you are generating actions this way, the output is heavily regularized. 01:13:23.900 |
Generating dexterous actions that are far out of distribution is kind of difficult. 01:13:29.380 |
The language-to-reward approach actually takes a page from the book of traditional robotics, this 01:13:35.140 |
optimization-based or model-predictive control. 01:13:40.020 |
And you can also take into account, let's say, safety constraints more easily. 01:13:48.820 |
Maybe one recipe is to generate a lot of data with the language-to-reward system and distill it into a policy. 01:13:56.900 |
Because then you are imbuing your large language model with all these other desirable behaviors. The 01:14:02.780 |
language-to-reward approach itself, I don't know how scalable it is. 01:14:08.900 |
So maybe you are limited to what-- you are at the mercy of the training data of the language model. 01:14:15.620 |
The language model can do the moonwalk because it knows what a moonwalk is. 01:14:24.180 |
But if you want to scale to completely new things, maybe you can use language to 01:14:28.100 |
reward to bootstrap your data generation and then distill it into the other policy. 01:14:33.460 |
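Sketched roughly, that bootstrapping recipe could look like the following, where generate_episode and policy.fit are hypothetical placeholders for optimizing the LLM-written reward and for behavior cloning:

def bootstrap_and_distill(instructions, generate_episode, policy, rollouts_per_task=100):
    # generate_episode(text) -> list of (observation, action) pairs produced by
    # optimizing the LLM-written reward for that instruction (hypothetical helper).
    dataset = []
    for text in instructions:
        for _ in range(rollouts_per_task):
            dataset.extend(generate_episode(text))
    policy.fit(dataset)  # plain behavior cloning on the generated data
    return policy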
So can you tell us what's the next direction Google is pursuing? 01:14:39.940 |
So it's like, is language to reward the right direction, in terms of scaling it up? 01:14:46.860 |
So the scaling being the end of the lecture, that is a joke. 01:14:57.040 |
So I think everybody believes in the power of the scaling law. 01:15:04.860 |
So just by giving it more data and more compute, you will see interesting capabilities emerge. 01:15:12.740 |
Yeah, I still think we don't quite have enough data. 01:15:30.360 |
I think that's still probably the biggest bottleneck. 01:15:33.660 |
So we are trying to find ways to do more with limited data. 01:15:40.500 |
And I think it needs some time for us to accumulate enough data. 01:15:45.200 |
And currently, I'd say, we have signs of life for positive transfer. 01:15:50.320 |
But in language models, people don't talk about positive transfer anymore because it's taken for granted. 01:16:01.840 |
Yeah, how much has your team been thinking about safety and alignment? 01:16:07.440 |
And are you just, right now, relying on the ethics that emerge from the large language model? 01:16:14.400 |
Like, it won't tell you to kill someone to achieve a goal. 01:16:18.800 |
Actually, we take safety very, very seriously, because in all of the other domains of developing 01:16:24.520 |
language models, there is no direct impact on the physical world. 01:16:31.800 |
But here, it could have potential harm to humans and to the environment. 01:16:37.600 |
And Gary Marcus actually gave a comment previously to our work that, what if you say, bring out 01:16:45.560 |
a bowl, feed a cat, and put it in the dishwasher? 01:16:47.440 |
Well, it might put the cat in the dishwasher, right? 01:16:50.040 |
If it misunderstands, it will actually have a catastrophic failure. 01:16:57.720 |
We handle safety carefully by designing hardware and software safety layers. 01:17:03.520 |
And there is also some constitutional safety work that is coming out sometime soon. 01:17:11.280 |
I cannot tell much details right now, but sometime soon, we'll release some work. 01:17:17.240 |
Is it something like, if there's a human, just don't interact? 01:17:23.440 |
I think it's a little bit more nuanced and more detailed than that. 01:17:29.880 |
And in some of our experiments, actually, the robot's fingers would break off so that 01:17:34.040 |
it cannot apply too much force to the environment. 01:17:36.280 |
So that's just yet another way of ensuring safety. 01:17:39.600 |
Can we have some visual language model and a synthesizer or something to stop the problem 01:17:49.120 |
And maybe, this is kind of like interpretable, but also in some logical way.