
Stanford CS25: V3 I Low-level Embodied Intelligence w/ Foundation Models


Transcript

So, hey guys, thanks for coming to our second class. Today we have the pleasure of welcoming Fei Xia. He's a senior research scientist at Google DeepMind, where he works on the robotics team. He received his PhD here, actually, working with Silvio Savarese in the Stanford Vision and Learning Lab, as well as Leonidas Guibas.

And his mission is to build intelligent embodied agents that can interact with complex and unstructured real-world environments, with applications to home robotics. Recently he has been exploring the use of foundation models for robot decision-making and action generation. So now I'll hand it off to Fei. Hi everyone, I'm super happy to be here and happy to be back.

I graduated from here two years ago, and now I'm a research scientist at Google DeepMind. I work on the robotics team, and today I will be talking about low-level embodied intelligence with foundation models. So it's definitely an interesting topic, and I will introduce what is embodied intelligence and what is low-level embodied intelligence, and how we can accelerate the building of them with foundation models.

All right, so why are we working on embodied intelligence? Embodied intelligence is an integral part of artificial intelligence, and it's an important milestone toward artificial general intelligence. And it has a lot of use cases. For example, we all hope to have a home robot that can be in our home 24/7 and clean the home for us, or clean up our messy room, or cook for us, or take care of our aging family members.

So we are not quite there yet. In fact, we are quite far from it. That is because our intelligence is currently mostly in the virtual world. We have AI agents that can help us draft emails or write eloquent essays, but they are not super good at interacting with the messy, unstructured, complex real-world environments that humans reside in.

So just to give you guys a couple of examples of how messy the real world can be, and how hostile it can be to robotics, I want to show you a curious mistake, a curious error, from one of our robots. The task is to put the Coke can in the sink, and watch what the robot does.

The robot grabs the Coke can and opens the tap. So this is kind of dangerous, but it's kind of interesting, right? Because we never expected it would do something like that. Just from random noise, it starts to open the tap, and the water starts to come out. So for an agent to have this type of physical intelligence, it needs to understand the effect of its actions, what is called a world model.

So people have been complaining that language models so far don't have a world model. They don't understand geometry, they don't understand the spatial relationships of objects, or the effects of actions, basically how objects will move according to physical laws. So we are not quite there yet. In another case, this is our robot that is ready to deliver a can, or actually throw away a can.

But as you can see, we have this pre-programmed behavior of tucking the arm behind. And in doing that, the can is upside down. So if there's any liquid in the can, it will spill and damage the robot. It's another example of how the real world is really complex, and there are a lot of things to model.

And in order for our robots to have this sort of ambient intelligence, they really need to understand a lot of very nuanced details of the environment, understand the physical laws, and understand the effects of their actions. How do we do that? There are many ways to achieve embodied intelligence.

Actually, throughout my PhD study, I've been fascinated by this idea of creating interactive environments: basically, let the agent explore in interactive environments that are complex enough, so that if the agent needs to survive in such an environment, it must develop intelligence. It's an ecological view of perception and agency, popularized by the American psychologist James J. Gibson.

He has a famous quote, "Ask not what is inside your head, but what your head is inside of." So humans learned this type of embodied intelligence. Humans are able to manipulate objects effortlessly, first, because of evolution, and second, because of childhood experience. We have been playing with toys, interacting with them, and watching the physical effects, so that we learn.

And similarly, we can give robots a safe playpen, so they can explore those environments, interact with them, play, and watch the effects of their actions, and effectively understand how to manipulate those objects. So I have been developing these simulation environments, one of which is called the Gibson environment, which was published at CVPR.

It's mainly aimed at simulating the visual world faithfully, and also simulating the physical world to some extent. So we built this environment, which consists of scanned environments from a lot of houses. Then we can spawn an agent in it, in this case a humanoid agent, and the agent can learn to walk or to run in this environment while we simulate all of its perception information.

So we can create a perception action loop for this agent. And similarly, we can put other types of agents in this environment, in this case, a little cart, and we can also put a quadruped or this ant into this environment. So essentially, we create an environment where we can simulate perception for the agent, and then we can create a neural network to map the perception to action.

And this way, we achieve some sort of physical intelligence. It's mostly for navigation and locomotion. This is not enough. In this case, the environment is one monolithic piece of mesh. As you can see, the agent runs into the wall and bounces back. So there is no articulation in this environment.

So it's not simulating the full complexity of the environment, and the things that we can do with our agent are rather limited. That's why we created other simulation environments, one of which is the iGibson environment, which stands for Interactive Gibson. What we do is, again, scan a lot of real-world houses, and then we convert them to CAD assets, basically mesh assets that are interactable.

In this case, we have a simulated agent that goes into the environment and then closes all the drawers. We are able to do that because we model the complexity of the world a little bit more. We go beyond just modeling the visual world. We start to model physics a little bit more, basically modeling the degrees of freedom in the environment.

And our agent can do more than just navigate around. We can go even further. We can model even more degrees of freedom, and our agent can develop more complicated behavior, such as unloading a dishwasher: finding a bowl, taking out the bowl, and putting it on the table.

So as we scale up the complexity of the environment, we are able to learn much more complicated skills in simulation. And that's one way to achieve embodied intelligence, which is to build complex enough simulation environments. Not just my research, but the entire field of computer vision is undergoing a paradigm shift.

Previously, we were focusing on internet AI. We curated a lot of internet datasets to study problems like classification, segmentation, and detection, basically all these computer vision problems. Now we focus a lot more on embodied AI, which adds the action dimension to the problems we are studying: visual navigation, manipulation, rearrangement, embodied question answering, instruction following. And the simulators, in some sense, replace the original role of datasets.

One thing that doesn't change is that data is still super important. We are still relying on a large amount of data to learn this intelligent behavior, no matter if it's from a static dataset or from a simulator. So learning in simulation can take a lot of interactions.

So just to give you an example, we create this iGibson environment and we want to learn a behavior called go into a room through a closed door. So this is a rather simple behavior, which I can show on the top right of the screen. So the agent needs to stop in front of the door, it needs to stop at the right distance.

If it stops too close to the door, it cannot extend its arm. If it's too far, it cannot open the door. And then it basically opens the door. Let me play this again. It opens the door, and when there is enough clearance, it goes through the door. However, it takes about 50,000 episodes, or 1.25 million environment interactions, to learn this type of behavior.

This is because we are using model-free reinforcement learning, and the agent is exploring this environment. It could push at any point, it could stop at any point. We give it a reward function for going into the room, but it's very rare that it will stumble upon this behavior.

I would like to argue that with foundation models, we can do this a lot differently. So what do you do nowadays? You just ask ChatGPT, how do you go into a room through a closed door? And it will say, open the door, walk through the door. So this is a gross simplification of the problem.

Of course, the problem is not that simple. But what I'm saying is that we can leverage a lot of semantic priors from the foundation models. If we really need a lot of data, the foundation model is a compressed version of the internet data, and it's a knowledge base that you can query to accelerate the development of robotics.

Of course, simulation and real world data is still super, super important, but maybe we can get the best of both worlds. We can use foundation models plus a limited amount of simulation or real world data. So that's what I'm going to talk about today. So where are we in terms of foundation models plus robotics?

So our team at Google DeepMind has been pioneering foundation models plus robotics. We developed high-level planning algorithms. One of the first is called PaLM-SayCan. It is an algorithm that can parse a user command. So here is a demo. Here is a scenario. Here is a user command.

I spill my Coke on the table. How would you throw it away and bring me something to help clean? And it's querying a large language model, which is given a score highlighted in blue. And there is also an affordance score. The affordance will tell you whether an action at a given state is possible.

It's augmenting the language model to give you only possible things. So essentially, it is doing semantic planning with a language model, but it's also taking into consideration what it can do. So it's not just outputting the... like, language models tend to hallucinate. It doesn't hallucinate here. It only gives you what is possible for the robot to do and what is actionable for the robot.
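
To make the scoring idea concrete, here is a minimal sketch of SayCan-style skill selection, combining a language-model score with an affordance score. The function names, signatures, and the exact way the scores are combined are illustrative assumptions, not the actual implementation.

```python
import math

def select_next_skill(instruction, history, state, skills, llm_logprob, affordance):
    """Hypothetical sketch of SayCan-style skill selection.

    llm_logprob(instruction, history, skill): log-probability of the skill text
        under the language model (the semantic score highlighted in blue).
    affordance(skill, state): estimated probability that the skill can succeed
        from the current state (the value-function-like affordance score).
    """
    def combined(skill):
        # Adding log-probabilities is equivalent to multiplying probabilities,
        # so a skill with near-zero affordance is effectively never chosen.
        return llm_logprob(instruction, history, skill) + math.log(affordance(skill, state) + 1e-9)

    return max(skills, key=combined)
```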

And the robot is doing the thing that advances the long-horizon task progress. And each step is executed by a low-level policy. Here it doesn't quite clean the table, because we haven't added that low-level skill. But imagine there is a low-level skill to clean the table.

It will finish the entire thing. What is the low-level policy used here? The low-level policy used here is Robotics Transformer 1, RT-1. It's our team's homegrown transformer. Essentially, we collect a large dataset of human demonstrations, we take a transformer, and we train it on this large dataset of expert trajectories.

It is able to do about 700 tasks with a 97% success rate. And it has interesting generalization behavior. It can operate in a new kitchen it has never seen before, which shows there is a successful recipe to apply foundation models in robotics. So that's roughly where we are in terms of foundation models plus robotics.

And I will talk about a few new works that are bringing this to the next level. So actually, my teammate Ted gave a talk on foundation models plus robotics at the beginning of this year. It was also in this class, CS25. I highly recommend it. It's available on YouTube. I actually watched it last night so that I don't repeat some of the contents.

What he basically did is reveal our team's progress in terms of building these robotic foundation models. We have had somewhat of a detour, and now we have sort of figured out a recipe. So 2021 to 2022 was about how we scale to many tasks with demonstrations.

How do we collect a large amount of data? In fact, about 100,000 demonstrations. And we tried different ways to do it. We tried behavior cloning. We tried imitation learning plus reinforcement learning, and some other ways, or combining them with language models, such as SayCan. 2022 to 2023 is about how we can leverage foundation models to accelerate robotics.

We really see a proliferation of using foundation models to accelerate robotics, both on high-level planning and low-level control, probably leaning more towards high-level planning. So the recipe is essentially to combine a large-scale, diverse offline dataset with a high-capacity architecture, such as a transformer, and use language as a universal glue.

So this will be the recipe to build foundation models for robotics. So if this recipe works, what do we do? What do we do next? Essentially, we're just-- let's just scale everything to orders of magnitude and be done with it and solve robotics. And guess what? That's what we did.

So that's the end of the lecture. I'm going to cut this a little bit short. And that's a joke. That's not happening. So we are still on our way, on our quest to solve low-level embodied intelligence. When I tell people that you can use foundation models to do robotics, their reaction would be that it's mostly doing high-level reasoning.

It doesn't do the low-level manipulation really well. And that's for a reason. One of the reasons is Moravec's paradox. Moravec's paradox is the observation that in artificial intelligence and robotics, contrary to traditional assumptions or our intuitions, reasoning requires very little computation, but sensorimotor control and perception skills require enormous compute resources.

That is because as biological creatures, we acquired the sensorimotor skills through evolution. This is very different. We might not be able to reason or do large-scale computation, but this sensorimotor control is integral to our survival, so it's essentially already in our DNA. But in robotics, it's a little bit different.

So the chips are very good at doing reasoning and computation. But they are not super good at the rest. They haven't experienced the world. They haven't acquired the sensorimotor skills that are necessary for them to do tasks in the real world. Here is an example. When the computer beat Kasparov, the human champion in chess, it wasn't a robot arm moving the chess pieces.

It can beat the human champion in chess, but someone still needs to move the chess pieces. Similarly, in the AlphaGo moment, when Lee Sedol was beaten by AlphaGo, there was still someone moving the stones for it. It's not a robot doing that. So this shows that the hard things are easy, and the easy things are hard.

There's another thing that prevents us from using foundation models more prevalently, more in a larger scale in robotics, which is the training data bias. The training data of foundation models or large language models are mostly language tasks. So it's perhaps not that surprising it knows how to clean up a kitchen because maybe there are wikiHow articles teaching you how to clean up a kitchen or to do something in a procedural way.

But there are no wikiHow articles teaching you how to move your finger five centimeters to the left, because people just don't say that. People don't write that down. So there is a very limited amount of this low-level control data in large language model training corpora. So we do have a lot of challenges in bringing the foundation models to a lower level.

So that's what I mean by low-level embodied intelligence. So any questions so far? Also, I want to make this quite interactive. So if there is any questions, feel free to interrupt me any time. All right, if not, we can continue. So there are a couple of challenges of using large language models for low-level control.

As I just mentioned, the first thing is lack of data. We only have perhaps 100,000 episodes of human demonstration data, and it took about 13 robots 17 months to collect. So it's a huge amount of effort. In contrast, large language models are trained on the order of 1,000 billion tokens.

The smaller PaLM was trained on 780 billion tokens, and for the larger one, following the Chinchilla rule, you would need to train it on 1.35 trillion tokens. So there's a huge discrepancy between how much data we can collect in robotics and how much we can get for large language models.

So we will always be bounded by robotic data. So maybe we can scale on other fronts. Maybe we can keep the robotics data the same, and then we can scale on other fronts. Like, maybe we can scale the pre-training mix of text and image, or maybe image and text pairs.

Maybe we can build this cake, and the robotics data is just the cherry on top of it. And we can scale the foundation really, really well. Some of the work that I'm going to talk about today actually reuses the RT-1 data. We didn't collect new data for RT-2, but we want to do more things with the same amount of data.

The second challenge is kind of related to the first challenge. Language models lack an interface for low-level control. If you ask a language model how to make a robot dog stand up on two feet, it will tell you a lot of things that sound reasonable and plausible. It will tell you the robot dog's torso should be upright, balanced over its two hind feet, with the feet standing shoulder-width apart.

This is great. This is all great. But we cannot put it on the robot. On the other hand, maybe we can ask a language model to write control code to directly control the robot. But usually, that requires you to curate an API that is friendly to the language model.

If you directly ask it to give you the joint angles to make the robot stand upright, it will not give you the right thing, because it doesn't have enough context. So essentially, large language models don't speak robot language. Can we actually find the right robot language? Can we find the interface between large language models and robot control?

Or can we just treat robot action as another language? So that's what we want to find out. In today's agenda, I will be talking about low-level embodied intelligence with foundation models. It's separated into two parts, and it's addressing the two challenges that I've just mentioned. Part one is about model consolidation, joint scaling, and positive transfer.

So I have to put them in one part because they are somewhat related. And part two is about developing new interfaces for large language models. So what do I mean by model consolidation? Yes, question. Yeah, I was going to ask, why couldn't you just fine-tune an LLM for generating low-level code?

Yeah. Yeah. Yeah, that's a great question. So the question is, why can't we fine-tune a language model to directly output low-level code or robot actions? I will be talking about RT-2, which does something similar to that. It fine-tunes a language model to output actions as a language, to output our action representation.

There are certain downsides to that. Like, for example, you would need to collect additional data to fine-tune a language model. So either we can fine-tune that, or we can use the language model zero-shot if you find the right interface, which I will talk about a little bit in the part two.

Zero-shot and without fine-tuning? Without fine-tuning, yeah. So model consolidation means, essentially, we can do the high-level reasoning and low-level control in one model. And joint scaling means that not only do we scale the robot data, which is expensive, we also scale the pre-training data, or we start from a pre-trained vision language model.

And positive transfer means the model benefits from diverse joint training across internet-scale language, vision, and vision-language domains, combined with robotics. So this is a continuation of the axes that Ted drew in his previous talk. We can see there is a trend. This visualization basically highlights some of the work from our team.

And each work, each column, is basically a robotic system that is able to do both high-level reasoning and low-level control. Previously, we needed to have separate models for each thing. In the initial release of SayCan, the planning is done by a large language model, and the affordance is done by a QT-Opt-like policy trained with sim-to-real.

And the low-level policy is Robotics Transformer 1. So each model is doing its dedicated thing, and we need to train each model differently, and perhaps with different types of data. Later, we have Q-Transformer, which is kind of an offline RL method that leverages a transformer architecture.

So it's a high-capacity architecture. It can train on both positive data and negative data. And with that, we are able to get a policy that also understands affordances. So we can unify the low-level policy and affordances, but the planning is still a large language model. And then we have PaLM-E, which is a vision language model, a large language model also trained on vision-language domains.

So PaLM-E can do planning and affordance in just one model, but the low level is still using RT-1. And finally, we unify everything together. There is RT-2, which I'm going to talk about today, that can do high-level planning to some extent, generate affordances, and serve as the low-level policy.

So behind the model consolidation is the consolidation of tasks. We can represent every task as a vision-plus-text-to-text task. It's a really universal representation of tasks. With that, you can train on a lot of data, and you can see positive transfer.

Basically, learning affordance can also tell you how to achieve a task. There is transfer between tasks when you pool all the tasks together. So to understand this joint scaling and the model consolidation, we need to understand PaLM-E a little bit. PaLM-E is an embodied multimodal language model.

It's based on the PaLM architecture. PaLM is a large language model. We made some adaptations to the architecture so it can understand multimodal input. So it is basically one model that is able to take in multimodal input. In large language models, each word is tokenized and mapped to an embedding of that word.

And then that is fed into a large language model. In PaLM-E, what we do is, instead of using only words, we can use multimodal tokens. The multimodal tokens can come from a vision transformer, a ViT, or they can come from robot sensory data. Then we map every multimodal token to the text embedding space.

We basically train a linear affine transform between the multimodal tokens and the text embedding space. And then we can treat the multimodal tokens as words as well. So essentially, we have a language model as a solid base, and then we start to adapt it to understand multimodal tokens. This is quite interesting because it doesn't require a ton of adaptation or fine-tuning for it to understand multimodal input.
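
To illustrate the idea of mapping multimodal tokens into the word-embedding space, here is a minimal sketch. The dimensions, names, and module structure are assumptions for illustration, not the actual PaLM-E code.

```python
import torch
import torch.nn as nn

class MultimodalPrefix(nn.Module):
    """Illustrative sketch only: project ViT patch features into the text
    embedding space so they can be interleaved with word embeddings.
    Dimensions and names are assumptions, not the actual PaLM-E code."""

    def __init__(self, vit_dim=1024, text_embed_dim=4096):
        super().__init__()
        # The learned affine map from image-token space to word-embedding space.
        self.project = nn.Linear(vit_dim, text_embed_dim)

    def forward(self, vit_tokens, word_embeddings):
        # vit_tokens: (batch, num_patches, vit_dim) from the vision transformer
        # word_embeddings: (batch, seq_len, text_embed_dim) from the LLM embedder
        image_as_words = self.project(vit_tokens)
        # The language model then consumes the concatenated sequence as if the
        # image tokens were ordinary word tokens.
        return torch.cat([image_as_words, word_embeddings], dim=1)
```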

It just aligns naturally to the multimodal input, such as images. I will show a couple of examples of what it can do. And we can train it in the same way as training large language models. So essentially, we can reuse the same infrastructure, training algorithm, and everything to train PaLM-E.

A couple of other things we found along the way include positive transfer, which I will share in a little bit. I also want to mention that PaLM-E is one of the largest models we have explored so far. It has 562 billion parameters, which comes from combining the 540-billion-parameter PaLM and the 22-billion-parameter ViT.

And we find a lot of emergent capabilities of these models, ones that we hadn't expected during training time; we can prompt these models and ask them to do interesting things. We have also explored using a neural scene representation, basically an object-centric representation, fed into PaLM-E. An object-centric representation assigns one token to each object.

And we find that this representation is super helpful for robot planning tasks, because the traditional ViT representation is based on a grid, and it doesn't have a full understanding of objects and their relationships. We have done an extensive study on the scaling performance, the catastrophic forgetting performance, and other interesting experiments in the paper.

So please refer to the paper for more. Here, I'm just showing some interesting qualitative examples, some emergent capabilities of PaLM-E that we found. First, we found this model has some reasoning capability. You can give it an image and ask it questions that require a little bit of reasoning.

And you can prompt this with, let's think step-by-step, which is a technique used to elicit reasoning in large language models. But here, in multi-modal language models, you can do the same. I guess people are also experimenting these days with GPT-4V. You can also prompt it to think step-by-step or count row-by-row.

But here, this is before GPT-4V, and we were able to elicit reasoning using some interesting prompts. For example, we can ask it, in this photo, are there more cats or more dogs? Let's think step-by-step. And PaLM-E found out there are equal amounts of dogs and cats.

And on the right, given an image: can I go down the street on a bicycle, yes or no? Let's think step-by-step. And the reply is: first, do not enter; second, except bicycles; do not enter except bicycles, so yes. So it's doing this step-by-step reasoning, and it's mixing the understanding of symbols with the understanding of text.

This is quite amazing to me, to be honest; when I first saw this, I didn't expect a multimodal language model would be able to do that. And we also tried one thing which is traditionally very difficult for language models, which is to tell a joke. Language models can understand jokes, but sometimes they're not able to deliver when it comes to the punchline.

Because it's just trying to make something that is plausible and sounds like a joke, and when it comes to the punchline, it doesn't really know what to say. So here, I give it an image, and I ask it to come up with a description, and then come up with a joke.

So this guides the language model to think step-by-step. And the description is: a donkey is carrying a dog, cat, and rooster. And the joke is: what do you call a donkey with a rooster on his back? A rooster booster. It's so creative. When I saw this, I was pleasantly surprised.

And I searched online. I couldn't find another joke like that. So it's actually an original joke by PaLM-E. And finally, we see some math reasoning with this model. Basically, I give it a messy menu from a pizza store, and I ask it: I'm just buying a pizza for me and my friend.

How much should I pay? Let's think step-by-step. And it figures out there is a pizza, it is $9.99, and it tells you the price. In some of the answers, it even calculates tax, but the tax is hallucinated. So that doesn't work. All right, let's talk about positive transfer.

So apart from the amazing things that PaLM-E can do, it also has interesting positive transfer behavior. When we train PaLM-E on a single domain, when we train it on just a single robotics task, the performance is not super great. But when we pool all the data together, and we also include internet-scale vision-language tasks, such as captioning or visual question answering, it is able to do much better.

So this shows that it's important to mix all the data together and train jointly. The internet-scale data can act as a regularizer so that you don't forget the representations. And those representations are, in turn, very useful for robotics. So that's a positive transfer result. And we start to see more and more positive transfer in our other studies.

Yes? So how much data did you have to collect, in simulation or in the real world? I think the sorting of stuff on the table is very impressive. Right. Yeah, that's a very good point. So these are all planning data, high-level planning. So maybe let's just talk about two things.

First of all, for the sorting results, the low-level policy is still a traditional controller. It's using a policy called LAVA, and that policy is trained on 68,000 episodes. The high-level planning is probably easier than you think, because it's giving commands to the low-level policy. It basically only needs to say, put the red block into the top-left corner, put another red block into the top-left corner.

So it's a rather standard autoregressive language modeling task. The only thing it needs to do is determine which task is not finished yet. For example, if the block is already in the corner, it shouldn't call the low-level policy to move it to the corner again. So it's more about parsing the state and understanding the state.

So this high-level policy only requires about 50 to 100 demonstrations to learn. It's quite data efficient. And in the future, that's a very good question actually, a lot of these tasks can be taught in context. Maybe we just demonstrate a task once to the large language model, and then it knows how to do it.

Yeah, this is through human demonstration as well. So at the low level, a human can demonstrate the low-level policy by teleoperating a robot to do a certain task. At the high level, imagine your control interface is through text; then as a human, you can also guide the low-level policy to accomplish a task.

And then that data can be used to train a large language model. So that's for the sorting blocks. The SayCan one is a little bit more interesting, because the planning steps are actually generated by PaLM. We essentially distilled PaLM plus this affordance model into PaLM-E. So that's a little bit more interesting.

It's like using the AI data to bootstrap itself. That one has about 3,000 episodes, also not a lot. But it's able to learn complex planning behavior, replanning behavior, and error recovery, which I will show in a slide. So with PaLM-E as a high-level planner, we are able to take the rice chips out of the drawer, and there is a twist, which is that I will be messing with the robot.

So as it puts the chips on the counter, I put them back into the drawer. And as it picks them up again, I put them back again. So it's able to understand the state, to understand that my task is not finished, that it cannot proceed with the next task. Once I stop messing with it, it's able to close the drawer and pick up the bag of chips.

So PaLM-E is able to combine affordance and planning in one model and do complex reasoning about a scene and environment. And interestingly, we can use the exact same model checkpoint to do block sorting as well. This is the same model checkpoint. It can not only reason about how to bring a bag of chips to a user, it can also sort blocks.

And it also responds to adversarial perturbation; if the user puts the block in the middle again, it's able to recover from that. These are all coming from the same model. And it can also tell a joke. So yeah, this is the power of vision language models.

Now we want to go a level deeper. These are all vision language models that are used for planning or high-level reasoning. Can we use them for low-level control? It turns out we can. And that's the RT-2 work, which is a vision-language-action model that transfers web knowledge to robotic control.

What can it do? When asked to pick up the extinct animal, with a whole range of objects on the table, it will pick up the dinosaur. So it can link the extinct animal to the dinosaur, and to the action of picking the dinosaur up. So it's really doing this emergent reasoning and also the manipulation in just one model.

And by the way, this robot hasn't seen any of these before, at least in the robot training data. It might have seen them in the internet data, but it has never seen them in the robotics training data. So it's quite interesting how we need to evaluate these robots nowadays.

When we evaluate language models, to prevent data contamination, you need to give them new questions every time, because otherwise they might have already memorized them in training. When we evaluate these robots, we actually go to the dollar store to buy all these toys to make sure the robot hasn't seen them before.

And as we run more evaluations, maybe there will be some repetition as well. But as you can see, it is able to understand to pick up this dinosaur toy. How did we do that? We start from a vision language model that is trained on internet-scale data, and then we combine it with robotics action data, which is the RT-1 data, and we get RT-2.

And we can dive a little bit deeper into RT-2. First of all, what is a vision language model? A vision language model is a transformer that takes in images and text and outputs text. Within Google, there is a vision language model called PaLI, which is an encoder-decoder type of architecture.

It basically has a ViT to understand images, and then a transformer encoder and a transformer decoder. Together they encompass both the visual and the semantic understanding of the world. And in robotics, we have to deal with a lot of both. The question is, can we leverage the knowledge in the vision language models and apply it to robotics?

On the other hand, we have RT-1. If you want to learn more about RT-1, you can listen to the previous episode of CS25 by Ted, where he gave a detailed introduction to RT-1. But RT-1, if you stand far enough away, is also a vision-language-to-action model of sorts.

It takes in a human instruction. It takes in the current camera image. The camera image passes through a FiLM EfficientNet, which tokenizes it into 81 tokens, and then goes to a TokenLearner, which compresses everything into eight tokens. And then there is a transformer block, leveraging a lot of self-attention layers, that generates actions.

The action is also tokenized. The robot arm has seven degrees of freedom. The end-effector has six degrees of freedom, its position and rotation; the gripper can open and close; and there is another dimension representing whether to terminate the episode or not. Terminating means my task is already done. And we discretize every dimension into 256 bins.

And then we do a cross-entropy loss on those bins. So that's the RT-1 architecture in a nutshell. It's quite similar to a vision language model with different output tokens. So it's rather natural to just use a large pre-trained vision language model directly as the policy. We can use PaLI or PaLM-E as a policy.

And one question is, how do we deal with actions when using pre-trained vision language models? Here is the action representation that we use. The robot actions here are the eight dimensions. As I mentioned, there is termination, position change, and rotation change. And we discretize everything into 256 bins.
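
As a concrete illustration of this discretization, here is a small sketch; the bounds and the exact binning scheme are placeholder assumptions, not the real RT-1/RT-2 values.

```python
import numpy as np

NUM_BINS = 256

def discretize_action(action, low, high):
    """Map each continuous action dimension to an integer bin in [0, 255].

    action, low, high: arrays of shape (8,), covering termination, end-effector
    position delta, rotation delta, and gripper. The bounds are placeholders,
    not the real limits. The bin indices serve as cross-entropy targets (RT-1)
    or as the literal number tokens in the output string (RT-2).
    """
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)              # in [0, 1]
    return np.floor(normalized * (NUM_BINS - 1)).astype(np.int64)

def undiscretize_action(bins, low, high):
    """Inverse map: bin indices back to approximate continuous values."""
    normalized = bins.astype(np.float64) / (NUM_BINS - 1)
    return low + normalized * (high - low)
```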

We have also tried other alternative representations, but they are not as good as this naive representation. Yes? Yeah. Yeah. Oh, the FiLM EfficientNet is a pre-trained convolutional neural network. It's used to tokenize the images. The reason that we do this is that, through some ablation studies, we found we can tokenize the image in different ways.

We can tokenize with a ResNet, or we can tokenize using the FiLM EfficientNet. What FiLM means is that it also takes in the language embedding and appends it to the intermediate layers of the network. So we basically have some combination of features, and the language is encoded together with the images.
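
Here is a rough sketch of what a FiLM-style conditioning layer looks like: the instruction embedding produces a per-channel scale and shift that are applied to an intermediate convolutional feature map. The dimensions and placement are illustrative assumptions, not the exact RT-1 configuration.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Sketch of feature-wise linear modulation: the instruction embedding
    produces a per-channel scale and shift applied to an intermediate
    convolutional feature map. Dimensions here are illustrative assumptions."""

    def __init__(self, text_dim=512, num_channels=256):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, num_channels)
        self.to_shift = nn.Linear(text_dim, num_channels)

    def forward(self, features, text_embedding):
        # features: (batch, channels, H, W) from an intermediate EfficientNet block
        # text_embedding: (batch, text_dim), e.g. an embedding of the instruction
        gamma = self.to_scale(text_embedding)[:, :, None, None]
        beta = self.to_shift(text_embedding)[:, :, None, None]
        return (1 + gamma) * features + beta
```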

Yeah. That's right. That's right. The action is not encoded. The action is in text. It's basically what is shown here. This is the action. It's eight numbers. Each number ranges from 0 to 255. Yeah. And maybe another note on the FiLM EfficientNet: it's about how we tokenize the images and how we combine vision information and language information.

There are many ways to do that. This is not the only way. There is early fusion and late fusion. And there is also cross-attention. You can basically tokenize your image just by itself. And then you can have language and use cross-attention to combine the image and text representation. So here, we are using this model.

This is RT-1 for robotics. So we do have a lot of considerations, such as latency. That's why we use this FiLM EfficientNet, because it's super fast, and it outputs a limited number of tokens, which we can further compress with TokenLearner. Yeah. Yeah. So is this autoregressive? Like, for every single image it sees, it then reacts?

Right. So it is autoregressive. Yeah. And every time, we use a history of up to six steps. So every time, you see this image right now, and you see about two seconds of history before it, and this will be your input. Yeah. Again, if you have more questions about RT-1, I recommend watching the previous episode.

And here, it's all about RT-2. So we can convert actions to a string of numbers. This will be the output of our transformer, which is a vision language model. We tried other alternatives, such as floating-point numbers. Floating-point numbers are not super friendly to the language model tokenizer, because they have these decimal points.

We also tried human language, such as left or right. It's a more semantic representation, but it cannot be directly executed on a robot, which is a limitation of that method. So if we commit to this action representation, which is just a string of numbers, we essentially get a vision-language-action model.

We tried different variants, including PaLI-X. This is a Pathways language-image model. There is a 5-billion-parameter variant and a 55-billion-parameter variant. And we also tried PaLM-E, which is 12 billion parameters. The procedure that we used to train RT-2 is co-fine-tuning. Co-fine-tuning means putting the internet-scale data and the robotic data together.

And then we fine-tune it on this mixture of data so that it retains the internet-scale knowledge. Maybe that's also an artifact of our data being too small and not diverse enough. If you just fine-tune on robotics data, it will quickly overfit and forget about all of the pre-training mixture.

Maybe it's a dynamic of scale. So we'll see. At inference time, how do we do this? Again, we do this autoregressively. We have an instruction for a task, and we format this as a question answering task: what should the robot do to achieve a certain task?

And the task is a string that a human gives the robot to achieve. We also have the current observation, which is the robot's camera RGB image. It passes through a ViT, and then through the large language model, and then it outputs a list of tokens.

So we leverage constrained decoding to make sure it always outputs eight numbers, because otherwise we cannot de-tokenize it. It's very easy for a language model to just miss one number. So we do have some mechanisms, such as constrained decoding and beam search, to make sure the format is correct.

After we get the string of eight numbers, we de-tokenize it to a delta T and delta R, which is the end-effector delta pose. And the robot can directly execute this. After it runs on the robot, we repeat this process: we get another new image, run it through this process, and get a new action.
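
Putting the pieces together, here is an illustrative sketch of such a closed control loop. The robot and model interfaces are hypothetical names, and the action ordering and bounds are assumptions, not the production system.

```python
import numpy as np

def run_episode(task, robot, vla_model, low, high, max_steps=200):
    """Illustrative closed-loop sketch of this style of inference. The robot and
    model interfaces (get_camera_image, generate, apply_delta) are hypothetical
    names; low/high are per-dimension action bounds of shape (8,)."""
    for _ in range(max_steps):
        image = robot.get_camera_image()
        prompt = f"What should the robot do to {task}?"
        # Constrained decoding: force exactly eight integer tokens in [0, 255]
        # so the output can always be de-tokenized into an action.
        token_string = vla_model.generate(image, prompt, num_action_tokens=8)
        bins = np.array([int(t) for t in token_string.split()], dtype=np.float64)
        action = low + (bins / 255.0) * (high - low)   # de-tokenize back to continuous values
        # Assumed ordering: [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper].
        if action[0] > 0.5:
            break                                       # termination decoded, task done
        robot.apply_delta(position=action[1:4], rotation=action[4:7], gripper=action[7])
```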

And we repeat this process until a termination is decoded. Some people might be concerned that this is rather slow. It's in fact quite slow, because it's 12 billion parameters, or 5 billion parameters. We cannot run it on the robot. So we run it on a TPU cluster, and the robot queries the TPU cluster to get the numbers and applies them on the robot.

For the 12-billion-parameter model, we can actually run at 10 hertz, so it's quite fast. For all of these models, we can run at at least 3 hertz, which is sufficient for controlling a robot. And we see a lot of emergent skills that are not in the training set.

Essentially, as I just mentioned, we are probing what this RT2 can do. We actually don't know. So we are trying to figure out what RT2 can do. So we test it with a lot of new tasks, such as put a strawberry into the correct bowl, or move a banana to Germany, just to test its understanding of symbols or flags.

Pick up a land animal. There's a horse, there's an octopus. So basically, we test its semantic reasoning and also its low-level manipulation skills. We divide the tasks into symbol understanding, reasoning, and human recognition, plus an average. And we found that with RT-1, which is not trained on internet-scale data, we do quite poorly on these emergent evaluation tasks.

And with the RT-2 variants, which are co-fine-tuned on the internet data and our robotics data, we do much better on these tasks. There is also an effect of scale. The RT-2 with the 55-billion-parameter backbone performs better than the one with the 12-billion-parameter backbone, although they perform quite similarly on in-domain tasks.

But the generalization is kind of interesting. It seems that with larger scale, you can generalize better. And here are some videos of the robot achieving these tasks, like moving the banana to a number, putting the strawberry into the correct bowl, moving a Rubik's cube to the water bottle (with the instruction given in Chinese), and moving the banana to the German flag.

So it's able to do all of these very interesting tasks. In terms of the quantitative evaluations, we also found that the RT2 policy is quite robust to unseen objects, unseen backgrounds, and unseen environments. And here is another evidence of positive transfer. So co-fine-tuned with VQA data outperforms fine-tuning on robotics only.

And if you train on robot data from scratch, it barely works. It almost doesn't work, because it overfits to the robot data, and our robot data is just too small. So we do need to do co-fine-tuning, or at least fine-tuning, so it retains its internet-scale knowledge. This is also a recipe for how people could develop a domain-specific vision language model.

So you start from a very general vision language model, and you fine-tune on your domain, or you co-fine-tune with your specific domain data. This is likely a problem that each vertical of artificial intelligence will encounter someday. We can also test on other platforms. This shows some cross-embodiment: RT-2 with PaLI-3B outperforms previous models at moving blocks around in a 2D environment.

And in large language models, we have chain-of-thought reasoning, which is a method to elicit reasoning. You can either do zero-shot chain-of-thought reasoning, by saying something like "let's think step by step," or give it examples of reasoning. It's basically decoding more things before coming to the conclusion. We can use a similar procedure for RT-2 as well.

So in RT-2 with PaLM-E, instead of directly decoding the actions, we can actually decode a plan and then append the actions. This gives the language model an opportunity to understand or parse a question differently. It also gives us the opportunity to reason about things a little bit.

For example, if you say, "Bring me a drink," it will say, "Pick up the 7-Up can," because there's a 7-Up can on the table. So we synthesized a couple hundred such examples using a large language model, just by augmenting the instruction, and then fine-tuned RT-2 for just a couple hundred steps.
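
Hypothetically, one such augmented training example might be formatted like this (the strings and numbers are made up for illustration; the model learns to decode the short plan before the action tokens):

```python
# Hypothetical format of one chain-of-thought-augmented example.
augmented_example = {
    "instruction": "Bring me a drink.",
    "target": "Plan: pick up the 7-Up can. Action: 0 132 114 91 5 25 156 255",
}
```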

So it's between full fine-tuning and in-context learning, and it is able to do some reasoning. And some of the interesting reasoning tasks include, "I need to hammer a nail. Which object from the scene might be useful?" And in the scene, there is a headphone, there is a rock, and there is a sticky note.

And the robot will say, "Rock," and then generate actions to pick up the rock. So it's interesting that it's able to do this sort of reasoning with RT-2. And here is a demonstration of some of the chain-of-thought reasoning with RT-2 PaLM-E. The task is, "Pick up the thing that is different from all other objects." And it picks up the chocolate, because that is a snack and the other things are drinks.

And I can also speak a different language, and the plan will be to translate it into a language that it's familiar with, which is English, and then do the task. There are also potential failure cases of the chain-of-thought reasoning. Here I say, "Move the green objects together." And as you can see, the robot oscillates between the two green objects, because there are two possible plans.

It could move the can to the bag of chips, or it could move the bag of chips to the can. It oscillates between the two plans until one action brings it close to an object, and then it commits to one of the plans rather than the other. It's not always guaranteed to work, but it's quite interesting.

And it's also interesting that, again, we are testing the manipulation policy the way we test the intelligence of humans or animals or kids, because these policies are getting more and more advanced. As a summary, we have a vision-language-action model that is able to improve generalization. It can do new tasks and operate on new objects.

It can also do chain-of-thought reasoning. And by improving the underlying model, such as the vision language model itself, by scaling it up and training it with larger or higher-quality internet-scale data, we can achieve better robot control. That is quite amazing, because the robotics field has traditionally developed quite slowly and is bounded by hardware, bounded by a lot of different things, bounded by operations.

But now it seems we can piggyback on the development of the foundation model field, and whatever they do will trickle down to our field as well. And the future will be to increase the motion diversity and extend on the chain-of-thought reasoning capability and many more. And so there is another example of positive transfer, which you might have seen recently.

So far, I've been talking about scaling differently. I've been talking about not scaling the robotics data and scaling other data instead. That's because robotics data is so hard to collect. The purpose is not to avoid collecting robotics data; it's to develop a recipe where you can do more with limited robotics data.

However, there's also an effort from our team and the entire robotics field to scale up robot data collection, which is called Open X-Embodiment. And the model trained on it is called RT-X, Robotics Transformer X. It's basically 22 types of embodiments, 527 skills, and 60 datasets pooled all together. This will be the ultimate dataset we can use to study positive transfer and this joint scaling.

And there is already evidence of positive transfer. We pooled all the data together from all these labs and found a common action representation that we can use to train a robotics transformer. And we have already found that this jointly trained model can outperform the task-specific models developed in each lab.

So there are benefits to pooling all the data together, and scaling robot data is also quite important. The summary for this part is that we are seeing model consolidation: we can now do high-level reasoning and low-level control in one model. And the low-level control part is what excites me, because it's so far away from the traditional language model domain, it's so different, and it shows signs of life that a lot more can trickle down than we used to think possible.

We can scale the pre-training of vision language models as well as scale robotics data, and we observe more and more positive transfer, with the model benefiting from diverse joint training across internet-scale language, vision, and vision-language domains. All right, I notice that we are close to running out of time, so I will go through the second part very quickly. It's also interesting: finding new interfaces for language models. But I will only talk about it at a very high level.

So language models, as we can see, can directly output action tokens if we find the right action representation. We can treat action as yet another language for the language model: a language model can do translation, so it should be able to generate actions as well. But that requires fine-tuning. Can we do it without fine-tuning?

Or can we generate more expressive actions that are beyond the scope of fine-tuning? So that is about finding the right interface. Previously, we established that the language model doesn't have an action interface, and if it does have one, it's not as effective. So what is the best interface between language and low-level actions?

I would argue the best interface between the language model and low-level actions is reward functions. Reward functions are universal; they have been used in reinforcement learning, and they are also a reparameterization of actions. What is an action? Let's say I want to pick up this bottle. And I can ask, well, what is a skill?

A skill is a mapping between my observation and my action. So the mapping between my observation and action can be seen as a skill. But a skill can have an alternative definition, which is a set of constraints and a set of objectives. So picking up the bottle means the bottle is in my right hand, and the bottle is off a supporting surface.

That means picking it up, and how I pick it up doesn't really matter. That's a definition of skills in a broader sense, and it's more transferable. And the constraints and objectives can be represented as rewards. So we can ask the language model to generate these reward functions.
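
As a concrete, entirely hypothetical example of expressing a skill as rewards rather than actions, the "pick up the bottle" skill could be written as a couple of reward terms; the state fields and weights below are made up, and how the motion is executed is left to the optimizer.

```python
import numpy as np

def pick_up_bottle_reward(state):
    """Hypothetical reward terms encoding 'pick up the bottle' as objectives:
    the bottle should be in the gripper and off the supporting surface.
    The state fields and weights are illustrative assumptions."""
    in_hand = -np.linalg.norm(state["gripper_pos"] - state["bottle_pos"])
    lifted = state["bottle_height"] - state["table_height"]   # > 0 once off the table
    return 1.0 * in_hand + 0.5 * lifted
```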

And then there is an optimizer. It could be reinforcement learning, or it could be model predictive control, that optimizes for those rewards and then runs the result on the robot. So what is in the reward translator? Let's open the box. The reward translator is basically a two-stage process. It uses the same language model with two different prompts.

The motion descriptor basically describes the motion. Just now we found that the language model can output a description of how a robot dog should stand up, but it's not able to achieve that. Still, the motion description is sensible. It makes sense. It gives you the right thing.

So we just generate this motion description, and then we have a reward coder that translates this motion description into a piece of code representing reward functions. These reward functions cannot be directly executed on the robot, but they can go through our optimization process, which learns how to achieve them.
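
Here is a minimal sketch of that two-stage pipeline, with hypothetical llm, optimizer, and compile_reward_code stand-ins passed in as parameters; the prompts and error handling are illustrative assumptions, not the actual system.

```python
def language_to_rewards(user_instruction, llm, motion_prompt, coder_prompt,
                        compile_reward_code, optimizer):
    """Sketch of the two-stage idea: the same LLM is prompted twice, first to
    describe the motion, then to write reward code, which an optimizer (for
    example an MPC controller) turns into motion. All interfaces here are
    hypothetical stand-ins."""
    # Stage 1: motion descriptor produces a natural-language description of the motion.
    motion_description = llm.complete(motion_prompt + "\nUser instruction: " + user_instruction)
    # Stage 2: reward coder turns that description into reward-function code
    # written against a fixed, documented API.
    reward_code = llm.complete(coder_prompt + "\nMotion description: " + motion_description)
    try:
        reward_fns = compile_reward_code(reward_code)
    except Exception as err:
        # If the code does not compile, feed the error back to the coder stage
        # only; the motion description stays untouched.
        reward_code = llm.complete(coder_prompt + "\nMotion description: " + motion_description
                                   + "\nYour previous code failed with: " + str(err) + ". Please fix it.")
        reward_fns = compile_reward_code(reward_code)
    # The optimizer searches for robot actions that maximize these rewards.
    return optimizer.optimize(reward_fns)
```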

So we're using reward as the interface between the language model and a low-level controller. And for the low-level controller, we're using MuJoCo MPC, which is a model predictive control algorithm. It's basically a black-box controller. It samples a lot of trajectories and finds one that optimizes your reward. And we tested on a robot dog, a quadruped robot essentially, and a dexterous manipulator.

The dexterous manipulator has an arm with six or seven degrees of freedom and a hand. It's very hard to control because it has so many degrees of freedom, so it's highly challenging. Just to showcase some of the examples, I've omitted the motion description part and only show the reward code part.

So it seems that the language model is able to generate the right reward functions to make the robot stand up on two back feet like a human. And then now we are a little bit more ambitious. We know it can stand up. Can we make the robot do a moonwalk while standing up like this?

So a moonwalk is from Michael Jackson, and it's very challenging. How do we make the robot do it? It generates the motion description and generates the reward code, but the motion is not correct, not exactly what we want. The nice thing about using a language model and the reward functions is that you can coach the robot.

You can go back and explain what went wrong and ask the language model to fix it. So now we can be very patient and say: moonwalk means the robot should walk backward while the feet swing as if they are moving forward. Such a great explanation, kudos to my colleague. Correct your answer, and also make it walk at a speed of 0.5 meters per second.

And after you are very patient and give it the right instruction, it's able to modify the motion description and also generate the right set of rewards to make this happen. And now you can teach a robot to do a moonwalk just by using language as an interface. One day we'll be able to do this on the real robot as well.

Yes. So in the previous section, you showed how the language model decodes numbers and you constrain it to only output numbers. Here, how do you prevent it from just hallucinating some program? Right. So that's a great question. In this work, we are not preventing hallucination in a programmatic way.

We have a set of system prompts, a set of rules, that explains the API. After all, the reward functions need to be compilable by the optimizer, so we do need to have some checks. What's more, if it doesn't compile, we can just give the error message back to the language model.

It doesn't have to propagate all the way to the motion descriptor; it can stay at the reward coder: if there are errors, please fix them. After that, it should be able to fix them. We can also chain multiple tasks together. Using this framework, we can say, open a drawer, take the apple, put it into the drawer, and close the drawer, and it will be able to do that.

So we tried that. Just using the reward coder alone is not good enough; our two-stage prompting is really, really helpful. I think that's another inspiration for other fields: when your domain is too different from the language domain, maybe it would be good to find an intermediate representation and ask the language model to explain things in that intermediate representation before going directly to a more obscure representation.

Finally, we want to transfer this to the real world, but there is a challenge. In simulation, it might generate actions that are too dexterous, things that are not possible to do in the real world. So we add a few more regularizer terms to stabilize the motion, and we also run some state estimation on the real robots so that they understand where the cubes are, and then we can take the motion from simulation and achieve it in the real world.

So here are some of the executions in the real world. You can say, pick up the Rubik's cube, and it will generate the motion to pick up the Rubik's cube and grab it. This is quite different from RT-2. The motions are quite smooth, and it's quite fast, much faster than 3 hertz.

Here, it can run at 10 hertz or even 30 hertz, so it's comparable with human beings. So that's Language to Rewards. There's one last thing that I want to talk about in terms of finding a new interface. A lot of the time, we have been thinking about the language model as a semantic engine, a semantic machine.

It understands semantics. For example, if you say, "the student takes out the...", it will say "book." The language model is able to reason about such a sequence. But what about low-level patterns? If you just give it obscure numbers, what can it do? That's actually a low-level interface.

And we can open up the low-level interface of a large language model and ask it to do robotics tasks. In this paper, "Large Language Models as General Pattern Machines," we explore using the low-level interface of a large language model, essentially asking it to reason about different sequences. And it's surprisingly quite effective.

It can solve tasks like the ARC challenge and PCFG tasks, and it can even do sequence improvement. I will dig a little bit into sequence improvement because that's quite relevant to robotics. Sequence improvement means that you prompt the language model with state, action, and reward tuples.

And you just prompt it with a higher reward and see if it can generate actions that achieve the higher reward. So it's doing reinforcement learning, or something reinforcement-learning-like, but in context. This is quite amazing: previously, you would need a dedicated algorithm collecting data into a replay buffer to do this kind of reinforcement learning.
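
Here is a rough sketch of that prompting pattern, assuming states, actions, and rewards have already been discretized to integers: serialize the past episodes as plain number sequences, append a reward target higher than anything observed, and let the model complete the actions. The serialization format and the `llm` callable are assumptions for illustration, not the paper's exact prompt.

```python
from typing import Callable, List, Tuple

Episode = List[Tuple[int, int, int]]  # discretized (state, action, reward) steps

def improvement_prompt(episodes: List[Episode], target_reward: int) -> str:
    """Serialize past episodes as number sequences, then ask for a continuation
    conditioned on a reward higher than anything observed so far."""
    lines = [" ".join(f"{s} {a} {r}" for s, a, r in ep) for ep in episodes]
    lines.append(f"{target_reward}:")  # the model completes the actions for this reward
    return "\n".join(lines)

def propose_actions(episodes: List[Episode], llm: Callable[[str], str]) -> str:
    best = max(r for ep in episodes for _, _, r in ep)
    return llm(improvement_prompt(episodes, target_reward=best + 1))
```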

But now you can just do everything in the language model's context by leveraging the low-level interface of a language model. And with that, we can actually do something like clicker training. If you are not familiar with clicker training, it's how you train a dog: when it does the right thing, you give it a reward by clicking.

So clicker training is giving the agent a reward, and we can now use clicker training to train robots as well. Here, the robot is exploring, and I give a click when it does the right thing or moves in the right direction. Over time, it will be able to push the bag of chips, which is the objective of this training.
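
A minimal sketch of that human-in-the-loop loop follows: the click becomes a binary reward appended to the in-context history, and the next action comes from the model's completion of that history. All of the callables (`observe`, `act`, `wait_for_click`, `propose_action`) are hypothetical placeholders.

```python
from typing import Callable, List, Tuple

def clicker_training(
    observe: Callable[[], int],                        # placeholder: discretized robot state
    act: Callable[[int], None],                        # placeholder: execute a discretized action
    wait_for_click: Callable[[], bool],                # placeholder: did the human click?
    propose_action: Callable[[List[Tuple[int, int, int]], int], int],  # LLM pattern completion
    steps: int = 100,
) -> List[Tuple[int, int, int]]:
    history: List[Tuple[int, int, int]] = []           # in-context (state, action, reward) history
    for _ in range(steps):
        state = observe()
        action = propose_action(history, state)        # next action from the pattern machine
        act(action)
        reward = 1 if wait_for_click() else 0          # the click is the reward, as in dog training
        history.append((state, action, reward))
    return history
```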

So you can do this entire decision-transformer-like operation purely in context, by just giving a language model a bunch of patterns and asking it to figure out the regularity of the sequence. This way, it can generate new actions that improve on the previous sequence. So for the language model, we can find new interfaces that are more suitable for teaching it low-level skills.

Reward is a bridge between the language model and low-level control, and we can leverage it as a universal interface and optimize in real time. Sometimes it outperforms generating actions directly, so it really motivates using reward functions as the interface. And with the language model as a general pattern machine, we can use language models beyond semantic tasks.

We can ask it to reason about low-level things. And robotics, as a domain, is rich in sequence transformation, sequence completion, and sequence improvement tasks, so we can really study the lower-level mechanisms of language models. The key takeaway for this talk is that we are seeing more and more use of foundation models, not only on the semantic reasoning side of robotics, but also on the dexterous side, on generating actions, on the lower-level embodied intelligence side of robotics.

And we need to rethink the scaling law of robotics and transformers. How do we scale with a limited amount of data? We have a new recipe for scaling robot models and data in RT-2, which shows that you can do more with the same data: with essentially RT-1 data plus internet data, you can generalize to a lot more things.

And RT-X shows that you can do a lot more with more data, so there are also benefits to collecting more robotics data, and there are positive transfers everywhere. And for part two, in terms of new interfaces for language models, I think it's worthwhile for the robotics field to think about developing new, lower-level interfaces to language models that facilitate learning low-level skills.

With that, I would like to conclude my talk. If you find it interesting, there are a lot of references for you to look into. And special thanks to my team, the Google DeepMind Robotics team. We are at the forefront of developing foundation models for robotics, so stay tuned for more in the future.

Thank you. Yes. You mentioned that raw numbers are difficult for a lot of large language models, but if you're just generating the action tokens themselves, like in the example you had, why don't you just have a linear layer appended to the transformer that would generate whatever numbers you need?

Yeah. The question is: if large language models have difficulty understanding numbers, why don't we use a linear layer to output the action directly? I think language models do have difficulty understanding numbers, but sometimes we still want to bring in knowledge from the pre-training mixture. If I have a new layer, that new layer is not present in pre-training.

So how do I expect it to transfer? I think that's an interesting question. But at the same time, I don't necessarily think using the raw numbers is the right interface. We could probably do some action representation learning to learn a representation, and the language model could output that representation.

So we're still trying to figure out what the right representation is. Among the representations that we have tried, like decimal numbers, float numbers, and action tokens, we find that just using numbers or action tokens is good enough. Yeah. Yes. Yeah, I think both directions are worth exploring.
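
For context, one common way to turn continuous actions into tokens is simple per-dimension binning; the sketch below assumes 256 bins and known action ranges, which mirrors the general RT-style setup but is not the exact implementation.

```python
import numpy as np
from typing import List

def actions_to_tokens(action: np.ndarray, low: np.ndarray, high: np.ndarray,
                      n_bins: int = 256) -> List[int]:
    """Map each continuous action dimension to an integer token in [0, n_bins)."""
    normalized = (action - low) / (high - low)               # scale each dimension to [0, 1]
    bins = np.clip((normalized * n_bins).astype(int), 0, n_bins - 1)
    return bins.tolist()

def tokens_to_actions(tokens: List[int], low: np.ndarray, high: np.ndarray,
                      n_bins: int = 256) -> np.ndarray:
    """Invert the discretization (up to bin resolution) using bin centers."""
    centers = (np.asarray(tokens, dtype=float) + 0.5) / n_bins
    return low + centers * (high - low)
```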

There are different advantages to generating actions directly. It borrows the autoregressive nature of language modeling, and it aligns really well with a lot of other tasks, like visual question answering. The limitation is that when you generate actions this way, it's heavily regularized, so generating dexterous actions that are far out of distribution is kind of difficult.

The language-to-reward approach actually takes a page from the book of traditional robotics, this optimization-based or model-predictive control. You can also take safety constraints into account more easily, and it can generate more diverse actions. Maybe one recipe is to generate a lot of data with the language-to-reward system and distill it into a transformer.

Because then you are imbuing your model with all these other desirable properties. The language-to-reward system itself, I don't know how scalable it is. We're not fine-tuning the language model, so maybe you are limited to, you are at the mercy of, the training data of the language model.

The language model can do a moonwalk because it knows what a moonwalk is; it roughly knows how to do that. But if you want to scale to completely new things, maybe you can use language-to-reward to bootstrap your data generation and then distill it into the policy. So can you tell us what's the next direction Google is pursuing?

Is it that language-to-reward is the right direction, like scaling out of the room, out of the lab, and so on? Yeah, I think that's a good question. So putting scaling at the end of the lecture, that is a joke, but I'm being quite serious. It's actually a promising recipe.

So I think everybody believes in the power of the scaling law: just by giving it more data and more compute, you will see interesting capabilities coming out. Yeah, I still think we don't quite have enough data; I think that's still probably the biggest bottleneck. So we are trying to find ways to do more with limited data.

And we are trying to collect more data, and I think it will take some time for us to accumulate enough data. Currently, I'd say we have signs of life for positive transfer. But in language models, people don't talk about positive transfer anymore because it's so commonplace, right? You see it everywhere.

And robotics is not at that stage yet. Yeah, how much has your team been thinking about safety and alignment? And are you just, right now, relying on the ethics that emerge from the large language model, so that it won't tell you to kill someone to achieve a goal? Yeah, that's a very good question.

Actually, we take safety very, very seriously, because in all the other domains of developing language models, there isn't a direct impact on the physical world, but here it could cause potential harm to humans and to the environment. And Gary Marcus actually gave a comment previously about our work: what if you say, bring out a bowl, feed the cat, and put it in the dishwasher?

Will it put the cat in the dishwasher? If it misunderstands, it will have a catastrophic failure case. We take safety seriously by designing hardware and software safety layers. There is also some constitutional-safety work coming out sometime soon. I cannot share many details right now, but sometime soon we'll release some work.

Is it something like, if there's a human, just don't interact? Well, no, no, no. I think it's a little bit more nuanced and more detailed than that. But we do take safety quite seriously. In some of our experiments, the robot's fingers would actually break off rather than apply too much force to the environment.

So that's just yet another way of ensuring safety. Could we have some visual language model and a synthesizer or something to address that problem, grounding both the internet knowledge and the robot, maybe, this is kind of interpretive, combining both in some logical way? Right, right. So I think it would be possible.

Thank you for the great talk. Thank you.