Stanford CS25: V3 I Low-level Embodied Intelligence w/ Foundation Models
00:00:00.000 |
So, hey guys, thanks for coming to our second class. 00:00:10.000 |
Today we have the pleasure of welcoming Fei Xia. He's a senior research scientist at 00:00:15.080 |
Google DeepMind, where he works on the robotics team. 00:00:18.800 |
He received his PhD here, actually, working with Silvio Savarese in the Stanford Vision and Learning Lab. 00:00:27.960 |
And his mission is to build intelligent embodied agents that can interact with complex and 00:00:33.380 |
unstructured real-world environments, with applications to home robotics. 00:00:40.120 |
Recently he has been exploring the use of foundation models for robot decision-making 00:00:49.080 |
Hi everyone, I'm super happy to be here and happy to be back. 00:00:53.400 |
I graduated from here two years ago, and now I'm a research scientist at Google DeepMind. 00:00:58.800 |
I work on the robotics team, and today I will be talking about low-level embodied intelligence 00:01:05.600 |
So it's definitely an interesting topic, and I will introduce what is embodied intelligence 00:01:11.320 |
and what is low-level embodied intelligence, and how we can accelerate the building of embodied intelligence with foundation models. 00:01:19.800 |
All right, so why are we working on embodied intelligence? 00:01:24.340 |
So embodied intelligence is an integral part of artificial intelligence, and it's an important 00:01:31.760 |
milestone to artificial general intelligence. 00:01:35.760 |
And it has a lot of use cases, like for example, we all hope we have a home robot that can 00:01:41.000 |
be in our home 24/7 and clean the home for us, or clean up our messy room, or cook for 00:01:48.720 |
us, or take care of our aging family members. But we are not there yet. 00:01:56.360 |
That is because the intelligence we have built is currently mostly in the virtual world. 00:02:00.840 |
So we have AI agents that can help us draft emails or write eloquent essays, but they 00:02:07.480 |
are not super good at interacting with the messy, unstructured, complex real-world environment. 00:02:16.600 |
So just to give you guys a couple of examples of how messy the real world can be, and how 00:02:22.280 |
hostile it could be to robotics, I want to show you a curious error made by one of our robots. 00:02:30.720 |
So the task is to put the Coke can in the sink, and watch what the robot does. 00:02:36.380 |
The robot grabs the Coke can and opens the tap. 00:02:39.400 |
So this is kind of dangerous, but it's kind of interesting, right? 00:02:45.820 |
Because we never expect it would do something like that. 00:02:48.800 |
It's just from random noise, it starts to open the tap, and the water starts to come out. 00:02:54.520 |
So for an agent to have this type of physical intelligence, it needs to understand the effect 00:03:01.000 |
of its actions, and what is so-called a world model. 00:03:04.640 |
So people have been complaining that language models so far don't have a world model. 00:03:08.780 |
So they don't understand geometry, they don't understand the spatial relationships of objects, 00:03:15.120 |
or the effects of actions, basically how objects will move according to physical laws. 00:03:24.040 |
In another case, this is our robot that is ready to deliver a can, or actually throw away a can. 00:03:31.480 |
But as you can see, we have this pre-programmed behavior of tucking the arm behind. 00:03:39.800 |
So if there's any liquid in the can, it will spill and damage the robot. 00:03:44.800 |
So it's another example of how the real world is really complex, and there are a lot of things to take care of. 00:03:50.560 |
And in order for our robots to have this sort of embodied intelligence, they really need 00:03:56.620 |
to understand a lot of very nuanced details of the environment, understand the 00:04:02.120 |
physics, the physical laws, and understand the effects of their actions. 00:04:10.280 |
There are many ways to achieve embodied intelligence. 00:04:12.560 |
Actually, throughout my PhD study, I've been fascinated by this idea of creating interactive 00:04:18.840 |
environments, basically letting agents explore in these interactive environments, 00:04:28.520 |
so that if an agent needs to survive in such an environment, it must develop intelligence. 00:04:33.920 |
So it's an ecological view of perception and agency, popularized by the American psychologist James J. Gibson. 00:04:42.320 |
So he has a famous quote, "Ask not what is inside your head, but what your head is inside of." 00:04:49.200 |
So humans learned this type of embodied intelligence. 00:04:51.960 |
Humans are able to manipulate objects effortlessly, one, because of evolution, and second, because 00:04:59.400 |
we have been playing with toys, we have been interacting with them, and watching the effects of our actions. 00:05:06.080 |
And similarly, we can give robots a safe playpen, so they can explore in those environments and 00:05:13.240 |
interact with the environment and play, and watch the effects of their actions, and effectively understand how the world works. 00:05:22.460 |
So I have been developing these simulation environments, one of which is called Gibson Env. 00:05:32.520 |
It's mainly aiming at simulating the visual world faithfully, and also simulating physics. 00:05:40.960 |
So we built this environment, which is a scanned environment from a lot of houses. 00:05:46.680 |
And then we can spawn an agent in that, in this case, a humanoid agent, and 00:05:51.840 |
the agent can learn to walk or to run in this environment, and we simulate all the perception for it. 00:05:59.120 |
So we can create a perception action loop for this agent. 00:06:03.560 |
And similarly, we can put other types of agents in this environment, in this case, a little 00:06:09.200 |
cart, and we can also put a quadruped or this ant into this environment. 00:06:16.040 |
So essentially, we create an environment where we can simulate perception for the agent, 00:06:23.000 |
and then we can create a neural network to map the perception to action. 00:06:26.840 |
And this way, we achieve some sort of physical intelligence. 00:06:37.560 |
So in this case, the environment is one monolithic piece of mesh. 00:06:42.500 |
As you can see, the agent runs into the wall and it bounces back. 00:06:46.900 |
So there is no articulation in this environment. 00:06:50.080 |
So it's not simulating the full complexity of the environment. 00:06:53.800 |
So the things that we can do with our agent is rather limited. 00:06:59.080 |
So that's why we created other simulation environments, one of which is the iGibson environment, which is fully interactive. 00:07:06.140 |
So what we do is, again, scan a lot of real-world houses, and then we convert 00:07:13.520 |
them to CAD assets, basically mesh assets that are interactable. 00:07:18.600 |
In this case, we have a simulated agent that goes into the environment and then closes all the doors. 00:07:25.260 |
So we are able to do that because we model the complexity of the world a little bit more. 00:07:33.000 |
We start to model physics a little bit more, basically modeling the degrees of freedom in the environment. 00:07:39.640 |
And our agent can do more than just navigating around. 00:07:50.200 |
And our agent can develop more complicated behavior, such as unloading a dishwasher: 00:07:55.040 |
finding a bowl, taking out the bowl, and putting it on the table. 00:07:59.700 |
So as we scale up the complexity of the environment, we are able to learn much more complicated behaviors. 00:08:08.920 |
And that's one way to achieve embodied intelligence, which is to build complex enough simulation environments. 00:08:19.080 |
Not just my research, but the entire field of computer vision is undergoing a paradigm shift. 00:08:24.740 |
So previously, we were focusing on internet AI. 00:08:27.780 |
We curate a lot of internet data sets to study problems like classification, segmentation, 00:08:35.180 |
Basically all these computer vision problems. 00:08:37.460 |
Now we focus a lot more on embodied AI, which adds the action dimension to the problems 00:08:44.700 |
we study: problems like visual navigation, manipulation, rearrangement, embodied 00:08:49.340 |
question answering, instruction following, and the simulators in some sense replace the static datasets. 00:08:59.900 |
One thing that doesn't change is that data is still super important. 00:09:04.420 |
We are still relying on a large amount of data to learn this intelligent behavior, no 00:09:11.220 |
matter if it's from a static data set or from a simulator. 00:09:16.780 |
So learning in simulation can take a lot of interactions. 00:09:22.180 |
So just to give you an example, we create this iGibson environment and we want to learn 00:09:27.980 |
a behavior called go into a room through a closed door. 00:09:31.820 |
So this is a rather simple behavior, which I can show on the top right of the screen. 00:09:37.320 |
So the agent needs to stop in front of the door, it needs to stop at the right distance. 00:09:42.300 |
If it stops too close to the door, it cannot extend its arm to 00:09:50.940 |
open the door; when there is enough clearance, it will go through the door. 00:09:55.100 |
However, it takes about 50,000 episodes or 1.25 million environment interactions to learn this behavior. 00:10:04.100 |
This is because we are using model-free reinforcement learning, the agent is exploring this environment. 00:10:10.100 |
It could push at any point, it could stop at any point. 00:10:14.100 |
So we give it a reward function to go into the room, but it's very rare that it will stumble upon the desired behavior by chance. 00:10:23.340 |
I would like to argue that with foundation models, we can do this a lot differently. 00:10:29.700 |
You just ask ChatGPT, how do you go into a room through a closed door? 00:10:33.980 |
And it will say, open the door, walk through the door. 00:10:36.120 |
So this is a gross simplification of the problem. 00:10:43.660 |
But what I'm just saying is that we can leverage a lot of semantic priors from the foundation models. 00:10:51.460 |
So if we really need a lot of data, the foundation model is a compressed 00:10:56.900 |
version of the entire internet data, and it's a knowledge base that you can query to accelerate learning. 00:11:03.580 |
Of course, simulation and real world data is still super, super important, but maybe 00:11:10.580 |
We can use foundation models plus a limited amount of simulation or real world data. 00:11:16.980 |
So that's what I'm going to talk about today. 00:11:20.460 |
So where are we in terms of foundation models plus robotics? 00:11:24.300 |
So our team at Google DeepMind has been pioneering foundation models plus robotics. 00:11:29.940 |
So we developed SayCan, a high-level planning algorithm. 00:11:37.480 |
It is an algorithm that can parse a user command, 00:11:45.540 |
for example: I spilled my drink, how would you throw it away and bring me something to help clean? 00:11:48.740 |
And it queries a large language model, which gives each candidate action a score, highlighted in blue. 00:11:56.180 |
The affordance will tell you whether an action at a given state is possible. 00:12:00.580 |
It's augmenting the language model to give you only possible things. 00:12:04.880 |
So essentially, it is doing the semantic planning with a language model. 00:12:09.860 |
But it's also taking into consideration what it can do. 00:12:13.080 |
So it's not just outputting anything -- language models tend to hallucinate. 00:12:22.320 |
It only gives you what is possible for the robot to do and what is actionable for the robot. 00:12:26.760 |
And the robot is doing the thing that is advancing the long horizon task progress. 00:12:31.860 |
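To make the scoring mechanism concrete, here is a minimal sketch of SayCan-style action selection. The function names and scoring details are hypothetical placeholders for illustration, not the actual SayCan implementation:

```python
# Minimal sketch of SayCan-style action selection (hypothetical helpers, not the real API).
# The LLM scores how useful each candidate skill is for the instruction (semantic score),
# and an affordance model scores how likely the skill is to succeed from the current state.

def select_skill(instruction, state, candidate_skills, llm_score, affordance_score):
    best_skill, best_score = None, float("-inf")
    for skill in candidate_skills:
        # e.g. llm_score = log-likelihood of the skill text given the instruction prompt
        semantic = llm_score(instruction, skill)
        # e.g. affordance_score = value-function estimate of success from this state
        feasible = affordance_score(state, skill)
        combined = semantic + feasible   # combine the two scores in log space
        if combined > best_score:
            best_skill, best_score = skill, combined
    return best_skill
```

The combined score is what keeps the plan both semantically sensible (from the language model) and actually executable (from the affordance model).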
And also, each task is executed by a low-level policy. 00:12:36.560 |
Here it doesn't quite clean the table, because we haven't added this to the low-level skill. 00:12:42.880 |
But imagine there is a low-level skill to clean the table. 00:12:50.960 |
The low-level policy used here is Robotic Transformer 1, RT1. 00:12:59.200 |
Essentially, we collect a large data set of human demonstrations. 00:13:04.280 |
We put a transformer, and we train it on this large data set of expert trajectories. 00:13:12.440 |
It is able to do about 700 tasks with 97% success rate. 00:13:18.440 |
And it has interesting generalization behavior. 00:13:21.280 |
It can operate in a new kitchen it has never seen before, which is showing there is a successful 00:13:27.780 |
recipe to apply foundation models in robotics. 00:13:31.760 |
So that's roughly where are we in terms of foundation model plus robotics. 00:13:36.400 |
And I will talk about a few new works that are bringing this to the next level. 00:13:44.600 |
So actually, my teammate, Ted, gave a talk on foundation models plus robotics in a previous iteration of this course. 00:13:58.680 |
I actually watched it last night so that I don't repeat some of the contents. 00:14:04.280 |
But what he basically mentioned is that he reviewed our team's progress in terms of building robot foundation models. 00:14:14.840 |
And we have had somewhat of a detour, and now we have sort of figured out a recipe. 00:14:21.200 |
So 2021 to 2022 was about how we scale to many tasks with demonstrations. 00:14:38.160 |
We tried imitation learning plus reinforcement learning, and some other ways of combining the two. 00:14:46.520 |
In 2022 to 2023, it's about how we can leverage foundation models to accelerate robotics. 00:14:51.880 |
We really see a proliferation of using foundation models to accelerate robotics, both on the 00:14:58.680 |
high-level planning and low-level control, probably leaning more towards the high-level side. 00:15:04.520 |
So the recipe is essentially to combine a large-scale, diverse 00:15:11.760 |
offline dataset with a high-capacity architecture, such as a transformer, and use language conditioning. 00:15:20.000 |
So this will be the recipe to build foundation models for robotics. 00:15:29.000 |
Essentially, let's just scale everything by orders of magnitude and be done with it. 00:15:47.600 |
So we are still on our way, on our quest to solve low-level embodied intelligence. 00:15:53.680 |
When I talk to people about using foundation models to do robotics, their reaction would 00:16:04.760 |
often be that it doesn't do the low-level manipulation really well. 00:16:10.480 |
One of the reasons is Moravec's paradox. 00:16:14.000 |
Moravec's paradox is the observation that in artificial intelligence and robotics, contrary 00:16:18.600 |
to traditional assumptions or our intuitions, reasoning requires very little computation. 00:16:24.000 |
But sensory motor control and perception skills require enormous compute resources. 00:16:29.720 |
That is because as biological creatures, we acquire the sensory motor skills through evolution. 00:16:39.680 |
So we might not be that good at abstract reasoning or large-scale computation. 00:16:46.720 |
But this sensorimotor control is integral to our survival. 00:16:51.580 |
So it's essentially already encoded in our DNA. 00:16:55.080 |
But in robotics, it's a little bit different. 00:16:57.480 |
So the chips are very good at doing reasoning and computation. 00:17:04.960 |
They haven't acquired the sensorimotor skills that are necessary for them to do tasks in the physical world. 00:17:13.680 |
When the computer beat Kasparov, the human champion in chess, there was another observation. 00:17:23.880 |
It can beat the human champion in chess, but there still needs to be someone to move the chess pieces for it. 00:17:30.360 |
Similarly, in the AlphaGo moment, when Lee Sedol was beaten by AlphaGo, there was still 00:17:35.000 |
someone who was moving the pieces for it. 00:17:39.260 |
So this is showing that, for AI, the hard things are easy, and the easy things are hard. 00:17:45.640 |
There's another thing that prevents us from using foundation models more prevalently, 00:17:51.600 |
more in a larger scale in robotics, which is the training data bias. 00:17:57.160 |
The training data of foundation models or large language models is mostly language data from the internet. 00:18:02.440 |
So it's perhaps not that surprising that it knows how to clean up a kitchen, because maybe there 00:18:07.960 |
are wikiHow articles teaching you how to clean up a kitchen or to do something in a procedural way. 00:18:14.040 |
But there are no wikiHow articles teaching you how to move your finger five centimeters 00:18:18.260 |
to the left because people just don't say that. 00:18:23.040 |
So there is a very limited amount of this low-level control data in large language model training corpora. 00:18:29.720 |
So we do have a lot of challenges in bringing the foundation models to a lower level. 00:18:34.000 |
So that's what I mean by low-level embodied intelligence. 00:18:42.040 |
So if there is any questions, feel free to interrupt me any time. 00:18:53.120 |
So there are a couple of challenges of using large language models for low-level control. 00:18:57.240 |
As I just mentioned, the first thing is lack of data. 00:19:01.680 |
So we only have perhaps 100,000 episodes of human demonstration data, and it took about 17 months with 13 robots to collect. 00:19:15.920 |
In contrast, large language models are trained on the order of 1,000 billion tokens. 00:19:21.680 |
A smaller PaLM was trained on 780 billion tokens, and for the larger one, following 00:19:31.640 |
the Chinchilla rule, you would need to train it on 1.35 trillion tokens. 00:19:36.600 |
So there is a huge discrepancy between how much data we can get in robotics and 00:19:43.320 |
how much we can get in large language models. 00:19:48.120 |
So we will always be bounded by robotic data. 00:19:53.560 |
Maybe we can keep the robotics data the same, and then we can scale on other fronts. 00:19:58.200 |
Like, maybe we can scale the pre-training mix of text and image, or maybe image and 00:20:03.520 |
Maybe we can build this cake, and the robotics data is just a cherry on top of it. 00:20:10.020 |
And we can scale the foundation really, really well. 00:20:14.260 |
Some of the work that I'm going to talk about today actually reuses the RT-1 data. 00:20:18.800 |
We didn't collect new data for RT-2, but we want to do more things with the same amount of data. 00:20:26.000 |
The second challenge is kind of related to the first challenge. 00:20:30.400 |
Language models lack an interface for low-level control. 00:20:34.680 |
If you ask a language model, how do you make a robot dog stand up on two feet, it will 00:20:39.360 |
tell you a lot of things that sound reasonable, sound plausible. 00:20:43.120 |
It will tell you the robot dog's torso should be upright, balanced over the two hind feet, and so on. 00:20:55.920 |
On the other hand, maybe we can ask a language model to write control code to directly control the robot. 00:21:01.120 |
But usually, that requires you to curate an API that is friendly to the language model. 00:21:06.840 |
But if you directly ask it to give you the joint angles to make the robot stand upright, 00:21:12.520 |
it will not give you the right thing, because it doesn't have enough context. 00:21:15.600 |
So essentially, large language models don't speak robot language. 00:21:20.700 |
Can we actually find the right robot language? 00:21:24.120 |
Can we find the interface between large language models and robot control? 00:21:28.240 |
Or can we just treat robot action as another language? 00:21:36.200 |
In today's agenda, I will be talking about low-level embodied intelligence with foundation 00:21:41.280 |
It's separated into two parts, and it's addressing the two challenges that I've just mentioned. 00:21:47.480 |
Part one is about model consolidation, joint scaling, and positive transfer. 00:21:51.760 |
So I have to put them in one part because they are somewhat related. 00:21:56.400 |
And part two is developing new interface of large language models. 00:22:04.840 |
Yeah, I was going to ask, why couldn't you just fine-tune an LLM for generating low-level actions? 00:22:19.880 |
So the question is, why can't we fine-tune a language model to directly output low-level actions? 00:22:29.880 |
So I will be talking about RT-2, which does something similar to that. 00:22:33.800 |
It fine-tunes a language model to output actions as a language, to output our action representation. 00:22:42.600 |
There are some caveats; for example, you would need to collect additional data to fine-tune a language model. 00:22:48.720 |
So either we can fine-tune that, or we can use the language model zero-shot if you find 00:22:53.160 |
the right interface, which I will talk about a little bit in the part two. 00:23:01.220 |
So model consolidation is, essentially, that we can do the high-level reasoning and low-level control in one model. 00:23:07.240 |
And joint scaling is that we not only scale the robot data, which is expensive, 00:23:15.880 |
but we also start from a pre-trained vision language model and scale its pre-training. 00:23:19.720 |
And positive transfer is the model benefiting from diverse joint training across internet- 00:23:24.560 |
scale language, vision, and vision language domains, combined with robotics. 00:23:31.720 |
So this is a continuation of the axes that Ted drew in his previous talk. 00:23:42.560 |
So this visualization basically highlights some of the work on our team. 00:23:47.820 |
And each work, each column, is basically a robotic system that is able to do both high-level planning and low-level control. 00:23:57.260 |
So previously, we needed to have separate models for each thing. 00:24:03.080 |
Previously, in the initial release of SayCan, the planning is done by a large language model. 00:24:09.040 |
And the affordance is done by a QT-Opt-like policy trained with sim-to-real. 00:24:20.040 |
And the low-level policy is Robotic Transformer 1. 00:24:23.720 |
So it's each model doing its dedicated thing. 00:24:28.160 |
And we need to train each model differently, and perhaps with different type of data. 00:24:34.480 |
And later, we have Q-Transformer, which is kind of an offline RL method. 00:24:46.820 |
It can train on both positive data and negative data. 00:24:49.520 |
And with that, we are able to get a policy that also understands affordances. 00:24:56.180 |
So we can unify the low-level policy and affordances. 00:24:58.880 |
But the planning is still a large language model. 00:25:01.280 |
And then we have PaLM-E, which is a vision language model, which is a large language 00:25:06.580 |
model also trained on vision language domains. 00:25:09.980 |
So PaLM-E can do planning and affordance in just one model. 00:25:18.620 |
Then there is RT-2, which I'm going to talk about today, that can do high-level planning 00:25:23.760 |
to some extent, generate affordances, and act as the low-level policy. 00:25:28.320 |
So behind the model consolidation is the consolidation of tasks. 00:25:33.640 |
We can represent every task as a vision plus text to text task. 00:25:39.160 |
So it's a really universal representation of the task. 00:25:42.840 |
And then with that, you can train it on a lot of data. 00:25:49.960 |
Basically, learning affordance can also tell you how to achieve a task. 00:25:56.800 |
There is transfer between tasks when you pool all the tasks together. 00:26:03.800 |
So to understand this joint scaling and to understand the model consolidation, we first need to understand PaLM-E. 00:26:12.280 |
So PaLM-E is an embodied multimodal language model. 00:26:19.440 |
We made some adaptation on the architecture so it can understand multimodal input. 00:26:25.900 |
So it is basically one model that is able to take in multimodal input. 00:26:34.760 |
So in large language models, each word is tokenized and mapped to an embedding vector. 00:26:45.800 |
And then that is fed into a large language model. 00:26:49.240 |
So in PaLM-E, what we do is, instead of using only words, we can use multimodal tokens. 00:26:56.120 |
So the multimodal tokens can come from a vision transformer, a ViT, or they can come from robot state vectors. 00:27:06.100 |
So every multimodal token is then mapped into the text embedding space. 00:27:14.400 |
We basically train an affine transform between the multimodal tokens and the text embedding space. 00:27:24.380 |
And then we can treat the multimodal token as words as well. 00:27:30.200 |
So essentially, we have a language model as a solid base, and then we start to adapt it to multimodal input. 00:27:39.760 |
So this is quite interesting, because it doesn't require a ton of adaptation or fine-tuning. 00:27:50.000 |
It just aligns naturally to the multimodal input, such as images. 00:27:54.840 |
I will show a couple of examples of what it can do. 00:27:58.400 |
And we can train in the same way as training large language models. 00:28:01.600 |
So essentially, we can reuse the same infrastructure and training algorithm and everything to train PaLM-E. 00:28:10.320 |
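Here is a minimal sketch of the projection idea just described: multimodal tokens are mapped into the word-embedding space with a learned affine transform and then treated like words. The dimensions and module names are illustrative assumptions, not the actual PaLM-E implementation:

```python
import torch
import torch.nn as nn

# Sketch of the PaLM-E-style idea: multimodal tokens (e.g. ViT patch features) are mapped
# into the language model's word-embedding space with a learned affine projection, then
# interleaved with ordinary text token embeddings. Sizes and names are illustrative.

class MultimodalProjector(nn.Module):
    def __init__(self, vision_dim=1024, text_embed_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_embed_dim)  # learned affine transform

    def forward(self, vision_tokens):
        # vision_tokens: (num_patches, vision_dim) coming from a ViT encoder
        return self.proj(vision_tokens)  # (num_patches, text_embed_dim)

def build_input_sequence(text_embeds, image_embeds):
    # Treat the projected image tokens as if they were words: concatenate (or interleave)
    # them with the text embeddings before feeding the decoder-only language model.
    return torch.cat([image_embeds, text_embeds], dim=0)
```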
A couple of other things we found along the way is positive transfer, which I will share later. 00:28:17.400 |
So I guess here, I also want to mention PaLM-E is one of the largest models we have explored. 00:28:25.400 |
It has 562 billion parameters, which comes from concatenating the 540-billion-parameter PaLM with a 22-billion-parameter ViT. 00:28:34.400 |
And we find a lot of emergent capabilities of these models. 00:28:39.040 |
That is, capabilities we hadn't expected during training time, but we can prompt these models to elicit them. 00:28:48.920 |
We have also explored using a neural scene representation, basically an object-centric representation of the scene. 00:28:59.440 |
So object-centric representation assigns one token to each object. 00:29:07.160 |
And we find that this representation is super helpful for robot planning tasks, because 00:29:13.040 |
traditional ViT representation is based on a grid, and it doesn't have a full understanding of objects. 00:29:20.640 |
We have done an extensive study on the scaling performance and the catastrophic forgetting 00:29:27.720 |
performance and all other interesting experiments in the paper. 00:29:34.640 |
So here, I'm just showing some interesting qualitative examples or some emergent capability 00:29:44.400 |
So first, we found this model has some reasoning capability. 00:29:48.000 |
You can give it an image and ask it questions that require a little bit of reasoning. 00:29:52.960 |
And you can prompt this with, let's think step-by-step, which is a technique used to elicit reasoning in language models. 00:30:01.400 |
But here, in multi-modal language models, you can do the same. 00:30:04.760 |
I guess people are also experimenting these days with GPT-4V. 00:30:09.440 |
You can also prompt it to think step-by-step or count row-by-row. 00:30:13.480 |
But here, this is before GPT-4V, and we were able to elicit reasoning using some 00:30:19.760 |
interesting prompts, such as asking it: in this photo, are there more cats or more dogs? 00:30:28.040 |
And PaLM-E found out there are an equal number of dogs and cats. 00:30:32.200 |
And on the right, given an image: can I go down the street on a bicycle, yes or no? 00:30:39.240 |
And the reply is: first, do not enter; second, except bicycles; so, yes. 00:30:45.200 |
So it's doing this modest reasoning, and it's mixing this understanding of symbols with visual understanding. 00:30:55.240 |
So this is quite amazing to me, to be honest, when I first saw this. 00:31:00.440 |
I didn't expect a multi-modal language model would be able to do that. 00:31:04.880 |
And we also tried one thing which is traditionally very difficult for language models, which is telling jokes. 00:31:11.920 |
Language models can understand jokes, but sometimes they're just not able 00:31:17.400 |
to land the punchline when telling you a joke. 00:31:21.600 |
Because it's just trying to make something that is plausible and sounds like a joke. 00:31:27.040 |
And when it comes to the punchline, it doesn't really know what to say. 00:31:30.760 |
So here, I give it an image, and I ask it to come up with a description, and then come up with a joke. 00:31:37.780 |
So this guides the language model to think step-by-step. 00:31:40.920 |
And the description is a donkey is carrying a dog, cat, and rooster. 00:31:45.160 |
And the joke is, what do you call a donkey with a rooster on his back? A rooster booster. 00:31:50.200 |
Like, when I saw this, I was pleasantly surprised. 00:31:58.840 |
And finally, we see some math reasoning with this model. 00:32:03.280 |
Basically, I give it a messy menu from a pizza store, and I ask it: I'm just buying a pizza, how much do I need to pay? 00:32:16.040 |
And it's figuring out there is a pizza, and it is $9.99, and it tells you the price. 00:32:23.520 |
In some of the answers, it even calculates tax, but the tax rate is hallucinated. 00:32:29.520 |
All right, let's talk about positive transfer. 00:32:32.520 |
So apart from the amazing things that PaLM-E can do, it also has interesting positive transfer properties. 00:32:43.100 |
So when we train PaLM-E on a single domain, when we train it on just a single robotics domain, the performance is worse. 00:32:52.480 |
But when we pool all the data together, and we also include internet-scale visual language 00:32:59.280 |
tasks, such as captioning or visual question answering, it is able to do much better. 00:33:05.400 |
So this shows that it's important to mix all the data together and train it jointly. 00:33:12.520 |
The internet-scale data can act as a regularizer for you to not forget the representations. 00:33:20.960 |
And those representations are, in turn, very useful for robotics. 00:33:28.300 |
And we start to see more and more positive transfer in other of our studies. 00:33:32.660 |
So how much data did you have to collect, like in simulation or in the real world? 00:33:37.480 |
I think the sorting of stuff on the table is very impressive. 00:33:50.640 |
So these are all planning data, like high-level planning. 00:34:00.000 |
So first of all, for the sorting results, the low-level policy is still a traditionally trained policy. 00:34:10.680 |
And that policy is trained on 68,000 episodes. 00:34:16.080 |
The high-level planning is probably easier than you think, because it's giving commands to the low-level policy. 00:34:25.800 |
So it basically only needs to say, put the red block into the top-left corner, put another block next to it, and so on. 00:34:32.960 |
So it's a rather standard autoregressive language modeling task. 00:34:39.840 |
The only thing it needs to do is to determine which tasks are not finished yet. 00:34:45.260 |
So for example, if the block is already in the corner, it shouldn't call the low-level policy to move it again. 00:34:50.780 |
So it's rather like parsing the states and understanding the states. 00:34:55.720 |
So this high-level policy only requires about 50 to 100 demonstrations to learn. 00:35:02.800 |
And in the future -- that's a very good question, actually -- a lot of these tasks could be done with in-context learning. 00:35:09.240 |
So maybe we just demonstrate it once to the large language model, and then it knows how to do it. 00:35:15.880 |
Yeah, this is through human demonstration as well. 00:35:27.400 |
So a human can demonstrate a low-level policy by tele-operating a robot. 00:35:34.200 |
But on a high level, a human could also just give commands to the low-level policy -- imagine you control the robot through language. 00:35:44.460 |
And then as a human, you can also guide a low-level policy to accomplish a task. 00:35:49.600 |
And then that data can be used to train a large language model. 00:35:57.280 |
The SayCan data is a little bit more interesting, because the planning steps are actually generated by a model. 00:36:04.460 |
So we essentially distilled PaLM plus this affordance model into PaLM-E. 00:36:13.020 |
It's like using the AI data to bootstrap itself. 00:36:16.840 |
That one has about 3,000 episodes, which is also not a lot. 00:36:22.080 |
But it's able to learn complex planning behavior, replanning behavior, and error recovery, which I will show in a moment. 00:36:30.000 |
So with the POM-e as a high-level planner, we are able to take the rice chips out of 00:36:38.960 |
the drawer, and there is a twist, which is I will be messing with the robot. 00:36:47.440 |
So as it puts the chips onto the counter, I put them back into the drawer. 00:36:52.040 |
And it picks them up again, and then I put them back again. 00:36:58.400 |
It's able to understand that its task is not finished. 00:37:03.240 |
Now, after I stop messing with it, it's able to close the drawer and pick up the bag of chips. 00:37:11.560 |
So PaLM-E is able to combine affordance and planning in one model and do complex reasoning. 00:37:22.760 |
And interestingly, we can use the exact same model checkpoint to do block sorting as well. 00:37:30.840 |
It can not only reason about how to bring a bag of chips to a user, it can also sort blocks. 00:37:38.000 |
So it's also responding to adversarial perturbation: if the user puts 00:37:46.060 |
the block back in the middle, it's able to recover from that. 00:37:57.360 |
So yeah, this is the power of vision language models. 00:38:06.160 |
These are all vision language models that are used for planning or high-level reasoning. 00:38:15.280 |
And then there's the RT-2 work, which is a vision-language-action model that transfers web knowledge to robotic control. 00:38:28.480 |
Here the instruction is to pick up the extinct animal, and there is a whole range of objects on the table. 00:38:32.860 |
So it can link the extinct animal to the dinosaur and to the action of picking the dinosaur up. 00:38:40.960 |
So it's really doing this emergent reasoning and also the manipulation in just one model. 00:38:47.760 |
And by the way, this robot hasn't seen any of these objects before, at least in the robot training data. 00:38:55.440 |
It might have seen them in internet data, but it has never seen them in the robotics data. 00:39:03.080 |
So it's quite interesting how we need to evaluate these robots nowadays. 00:39:10.680 |
So when we evaluate language models, to prevent data contamination, every time you need to 00:39:16.480 |
give them new questions, because otherwise they might have already memorized them from their training data. 00:39:22.000 |
When we evaluate these robots, we actually go to the dollar store to buy all these toys to test them. 00:39:29.520 |
And as we run more evaluation, maybe there will be some replication as well. 00:39:33.840 |
But as you can see, it is able to understand to pick up this dinosaur toy. 00:39:42.960 |
So we start from a vision language model that is trained on internet-scale data. 00:39:49.000 |
And then we also combine it with robotics action data, which is the RT-1 data, and we co-fine-tune on both. 00:39:55.600 |
And we can dive deeper, a little bit deeper into RT2. 00:40:00.280 |
So first of all, what is a visual language model? 00:40:02.520 |
A vision language model is a transformer that takes in images and text and outputs text. 00:40:11.380 |
So within Google, there is a vision language model called PaLI, which is an encoder-decoder architecture. 00:40:22.440 |
It basically has a ViT to understand images, and then a transformer encoder and decoder. 00:40:30.960 |
Vision language models encompass both the visual and the semantic understanding of the world. 00:40:35.800 |
And in robotics, we have to deal with a lot of both of these. 00:40:40.500 |
And the question is, can we leverage the knowledge in the vision language models and apply it to robotics? 00:40:51.320 |
If you want to learn more about RT-1, you can listen to the previous episode of this CS25 course. 00:40:59.320 |
So Ted gave a detailed introduction to RT-1 there. 00:41:02.080 |
But RT-1, if you stand far enough back, is also a vision-language-to-action model of some sort. 00:41:16.220 |
The camera image passes through a FiLM EfficientNet, which tokenizes it into 81 tokens, and 00:41:21.580 |
then goes into a TokenLearner, which compresses everything into eight tokens. 00:41:26.480 |
And then there is a transformer block, leveraging a lot of self-attention layers, which then generates the action tokens. 00:41:41.040 |
The end-effector has six degrees of freedom, its position and rotation, plus the gripper opening. 00:41:49.140 |
And there is another dimension representing whether to terminate the episode or not. 00:41:57.500 |
And we discretize every dimension into 256 bins. 00:42:03.020 |
And then we do a cross-entropy loss on those bins. 00:42:05.780 |
So that's the RT-1 architecture in a nutshell. 00:42:10.020 |
It's quite similar to a vision language model, just with different output tokens. 00:42:13.820 |
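To make the action discretization concrete, here is a small sketch of mapping the 8-dimensional continuous action into 256 bins per dimension, and back. The per-dimension value ranges below are illustrative assumptions, not the actual RT-1 values:

```python
import numpy as np

# Sketch of RT-1-style action discretization: each of the 8 action dimensions
# (terminate flag, 3 position deltas, 3 rotation deltas, gripper) is mapped to one of
# 256 bins, so the policy can be trained with a cross-entropy loss over bin indices.
# The ranges below are placeholders for illustration only.

NUM_BINS = 256
ACTION_LOW = np.array([0.0, -0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
ACTION_HIGH = np.array([1.0,  0.1,  0.1,  0.1,  0.5,  0.5,  0.5, 1.0])

def discretize(action):
    """Map a continuous 8-dim action to 8 integer bin indices in [0, 255]."""
    normalized = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    bins = np.clip((normalized * (NUM_BINS - 1)).round(), 0, NUM_BINS - 1)
    return bins.astype(np.int64)

def undiscretize(bins):
    """Map bin indices back to the continuous action that is executed on the robot."""
    normalized = bins.astype(np.float64) / (NUM_BINS - 1)
    return ACTION_LOW + normalized * (ACTION_HIGH - ACTION_LOW)
```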
So it's rather natural that we just use a large pre-trained vision language model directly as the policy. 00:42:24.440 |
And one question is, how do we deal with actions when using pre-trained vision language models? 00:42:30.120 |
And here is action representation that we use. 00:42:33.460 |
The robot actions here are the eight dimensions. 00:42:38.100 |
And as I mentioned, there is termination, position change, rotation change, and the gripper. 00:42:47.120 |
We also tried other alternative representations, but they are not as good as this naive string of discretized numbers. 00:42:58.660 |
Oh, the FiLM EfficientNet is a pre-trained convolutional neural network. 00:43:07.180 |
So the reason that we do this is, through some ablation study, we decided how we want to tokenize the image. 00:43:17.100 |
And we can tokenize using the FiLM EfficientNet. 00:43:19.940 |
FiLM means it also takes in the language embedding and appends it to the intermediate layers of the network. 00:43:28.300 |
So we basically have some combination of features from language and image, encoded with the images. 00:43:50.820 |
The FiLM EfficientNet is about how we tokenize the images and how we combine vision information with language information. 00:44:22.100 |
You can basically tokenize your image just by itself. 00:44:25.220 |
And then you can have language and use cross-attention to combine the image and text representation. 00:44:35.540 |
So we do have a lot of considerations, such as latency. 00:44:38.660 |
That's why we use this FiLM EfficientNet, because it's super fast. 00:44:42.020 |
And it can output a limited amount of tokens, which we can further compress with Token Learner. 00:44:50.620 |
Like, for every single image it sees, does it attend to the previous ones? 00:44:57.740 |
Yes, and each time, we use a history of up to six steps. 00:45:04.540 |
So it sees about two seconds of history before it. 00:45:12.940 |
Again, if you have more questions about RT1, I recommend watching the previous episode. 00:45:26.660 |
This will be the output of our transformer, which is a vision language model. 00:45:31.420 |
We tried other alternatives, such as floating-point numbers. 00:45:35.300 |
Floating-point numbers are not super friendly to the language model tokenizer, because of how they get split into tokens. 00:45:42.340 |
We also tried human language, such as left or right. 00:45:46.820 |
But that cannot be directly executed on a robot, which is a limitation of that representation. 00:45:53.440 |
So if we commit to this action representation, which is just a string of numbers, we essentially can fine-tune any vision language model for control. 00:46:01.420 |
We tried different variants, including PaLI-X. 00:46:11.700 |
There is a 5-billion-parameter variant and a 55-billion-parameter variant. 00:46:16.020 |
And we also tried PaLM-E, which is 12 billion parameters. 00:46:20.300 |
The procedure that we use to train RT-2 is co-fine-tuning. 00:46:26.700 |
Co-fine-tuning is to put the internet-scale data and the robotic data together. 00:46:32.880 |
And then we fine-tune it on this mixture of data so that it retains the internet-scale knowledge. 00:46:42.940 |
Maybe that's also an artifact of our robot data being too small and not diverse enough. 00:46:46.620 |
So if you just fine-tune on robotics data, it will quickly overfit and forget about the internet pre-training. 00:47:02.260 |
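Here is a small sketch of what a co-fine-tuning data mixture can look like: each batch mixes web vision-language examples with robot trajectory examples. The 50/50 weighting and the dataset objects are illustrative assumptions, not the actual RT-2 mixture:

```python
import random

# Sketch of co-fine-tuning: each training batch mixes internet-scale vision-language
# examples (e.g. captioning, VQA) with robot trajectory examples, so the model keeps its
# web knowledge while learning to output action strings. Names and ratio are assumptions.

def sample_cofinetuning_batch(web_vlm_dataset, robot_dataset, batch_size=64, robot_frac=0.5):
    batch = []
    for _ in range(batch_size):
        if random.random() < robot_frac:
            batch.append(robot_dataset.sample())    # (image, instruction) -> action string
        else:
            batch.append(web_vlm_dataset.sample())  # (image, question) -> text answer
    return batch
```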
We basically-- again, we do this autoregressively. 00:47:10.180 |
And we format this as a question and answering task. 00:47:13.260 |
What should the robot do to achieve a certain task? 00:47:15.860 |
And the task is a string that human give the robot for the robot to achieve. 00:47:20.620 |
And it also has the current observation, which is the robot's camera image. 00:47:30.600 |
That passes through a ViT, and then through the large language model, which then outputs a string of numbers. 00:47:38.380 |
So we leverage constrained decoding to make sure it always outputs eight numbers, 00:47:45.680 |
because otherwise, we cannot de-tokenize it. 00:47:49.640 |
It's very easy for a language model to just miss one number. 00:47:52.860 |
So we do have some mechanisms, such as constrained decoding and beam search, to make sure the output is valid. 00:47:59.320 |
After we get the string of eight numbers, we de-tokenize it to a delta T and delta R, the translation and rotation changes of the end-effector. 00:48:07.280 |
And the robot can just directly run this on the robots. 00:48:10.120 |
After they run on the robots, we repeat this process. 00:48:13.280 |
We get another new image, run through this process, and get a new action. 00:48:17.160 |
And we repeat this process until a termination is decoded. 00:48:21.240 |
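Putting the loop just described together, here is a minimal sketch of the closed-loop control procedure. The callables passed in are hypothetical placeholders, and constrained decoding is abstracted as the assumption that exactly eight integers come back:

```python
# Sketch of the RT-2-style control loop described above. The callables passed in
# (get_image, query_vlm, detokenize, execute) are hypothetical placeholders, not real APIs.

PROMPT = "What should the robot do to achieve the task: {task}?"

def run_episode(task, get_image, query_vlm, detokenize, execute, max_steps=200):
    for _ in range(max_steps):
        image = get_image()
        # Constrained decoding ensures exactly 8 integer tokens come back:
        # [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper], each in [0, 255].
        tokens = query_vlm(PROMPT.format(task=task), image, num_numbers=8)
        if tokens[0] == 1:  # termination decoded: stop the episode
            break
        delta_t, delta_r, gripper = detokenize(tokens[1:])  # undo the 256-bin discretization
        execute(delta_t, delta_r, gripper)                  # send the deltas to the robot
        # then loop: grab a new image and query the model again
```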
So some people might be concerned that this is rather slow. 00:48:27.040 |
It's in fact quite slow, because it's 12 billion parameters, or 5 billion parameters. 00:48:34.920 |
So we run on a TPU cluster, and the robot is querying the TPU cluster to get the action numbers. 00:48:42.980 |
So for the 12 billion parameters, we can actually run at 10 hertz. 00:48:49.280 |
For all those models, we can run at least three hertz. 00:48:52.060 |
So that is sufficient for controlling a robot. 00:48:57.960 |
And we see a lot of emergent skills that are not in the training set. 00:49:04.880 |
Essentially, as I just mentioned, we are probing what this RT2 can do. 00:49:10.280 |
So we are trying to figure out what RT2 can do. 00:49:12.440 |
So we test it with a lot of new tasks, such as put a strawberry into the correct bowl, 00:49:18.720 |
or move a banana to Germany, just to test its understanding of symbols or flags. 00:49:29.360 |
So basically, test its semantic reasoning and also low-level manipulation skills. 00:49:36.240 |
And we divide the tasks into symbol understanding, reasoning, and human recognition. 00:49:45.280 |
And we found that with RT-1, which is not trained on internet-scale data, we do quite poorly on these tasks. 00:49:56.200 |
The RT-2 variants, which are co-fine-tuned on the internet data and our robotics data, do much better. 00:50:11.040 |
So RT-2 with the 55-billion-parameter backbone performs better than the 12-billion-parameter one, although 00:50:17.920 |
they perform quite similarly on in-domain tasks. 00:50:20.840 |
But the generalization is kind of interesting. 00:50:23.200 |
It seems with larger scale, you can generalize better. 00:50:27.920 |
And here are some videos of the robot achieving these tasks, like moving the banana to a number, 00:50:35.840 |
putting the strawberry into the correct bowl, moving a Rubik's cube to the water bottle 00:50:41.880 |
with the instruction given in Chinese, and moving the banana to the German flag. 00:50:46.840 |
So it's able to do all of these very interesting tasks. 00:50:51.960 |
In terms of the quantitative evaluations, we also found that the RT2 policy is quite 00:50:57.880 |
robust to unseen objects, unseen backgrounds, and unseen environments. 00:51:03.960 |
And here is another evidence of positive transfer. 00:51:07.040 |
So co-fine-tuning with VQA data outperforms fine-tuning on robotics data only. 00:51:12.520 |
And if you're trained on robot data from scratch, it barely works. 00:51:16.680 |
It almost doesn't work, because it overfits to robot data. 00:51:21.880 |
So we do need to do co-fine-tuning, or at least fine-tuning, so it retains its internet-scale knowledge. 00:51:30.660 |
This is also a recipe for how people could develop a domain-specific vision language model. 00:51:36.960 |
So you start from a very general vision language model, and you fine-tune on your domain. 00:51:41.240 |
Or you can co-fine-tune with your specific domain data. 00:51:45.540 |
This is likely a problem that each vertical of artificial intelligence will encounter someday. 00:51:56.800 |
This shows some cross-embodiment results: RT-2 with a 3-billion-parameter PaLI outperforms previous models in 00:52:02.120 |
terms of moving blocks around in a 2D environment. 00:52:09.240 |
And in large-language models, we have this chain-of-thought reasoning, which is a method 00:52:14.640 |
to elicit reasoning in large-language models. 00:52:18.120 |
You can either do zero-shot chain-of-thought reasoning by saying, let's think step by step. 00:52:23.800 |
It's basically decoding more things and then coming to the conclusion. 00:52:27.840 |
We can use a similar procedure for the RT2 as well. 00:52:32.120 |
So in RT-2 with PaLM-E, instead of directly decoding the actions, we can actually decode a plan first and then the actions. 00:52:40.280 |
So this gives the language model an opportunity to understand or parse a question first. 00:52:45.940 |
It also gives us the opportunity to reason about things a little bit. 00:52:50.200 |
For example, if you say, "Bring me a drink," it will say, "Pick up 7up can," because that's the drink it sees on the table. 00:52:57.720 |
So we synthesized a couple hundred such examples using a large-language model just by augmenting 00:53:02.960 |
the instruction and then fine-tuned the RT2 just for a couple hundred steps. 00:53:07.080 |
So it's between full fine-tuning and in-context learning, and it is able to do some reasoning. 00:53:13.480 |
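To give a feel for what such chain-of-thought-augmented training data might look like, here is a sketch of an example where a short plan is decoded before the action string. The field names, wording, and the action numbers are hypothetical, not the paper's verbatim format:

```python
# Illustrative example of a chain-of-thought-augmented training string: the model
# first decodes a short natural-language plan, then the 8-number action string.
# The exact format and numbers here are hypothetical placeholders.

example = {
    "prompt": "Instruction: Bring me a drink. What should the robot do?",
    "target": "Plan: pick up the 7up can. Action: 0 132 114 128 25 156 255 101",
}

def split_plan_and_action(target: str):
    plan_part, action_part = target.split("Action:")
    plan = plan_part.replace("Plan:", "").strip()
    action_tokens = [int(x) for x in action_part.split()]
    return plan, action_tokens

plan, action = split_plan_and_action(example["target"])
print(plan)    # "pick up the 7up can."
print(action)  # [0, 132, 114, 128, 25, 156, 255, 101]
```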
And some of the interesting reasoning tasks include, "I need to hammer a nail. 00:53:17.560 |
Which object from the scene might be useful?" 00:53:19.440 |
And in the scene, there is a headphone, there is a rock, and there is a sticky note. 00:53:24.600 |
And the robot will say, "Rocks," and then generate actions to pick up the rock. 00:53:28.460 |
So it's interesting that it's able to do this sort of reasoning with RT2. 00:53:34.040 |
And here is a demonstration of some of the chain-of-thought reasoning with RT-2 PaLM-E. 00:53:39.880 |
And the task is, "Pick up the thing that is different from all other objects." 00:53:44.280 |
And it picks up the chocolate, because that is a snack and the other things are drinks. 00:53:49.120 |
And I can also give the instruction in a different language, and the plan would be to translate it into 00:53:53.920 |
a language that it's familiar with, which is English, and then do the task. 00:54:01.280 |
There are also potential failure cases of the chain-of-thought reasoning. 00:54:04.700 |
So here I say, "Move the green objects together." 00:54:06.720 |
And as you can see, the robot oscillates between the two green objects, because there are two equally valid plans. 00:54:11.720 |
It could move the can to the bag of chips, or it could move the bag of chips to the can. 00:54:16.820 |
It oscillates between the two plans until one action brings it closer to one object, and then it will 00:54:22.220 |
commit to that plan rather than the other. 00:54:26.460 |
It's not always guaranteed to work, but it's quite interesting. 00:54:29.920 |
And it's also interesting that, again, we are testing the manipulation policy the way 00:54:34.460 |
we test the intelligence of humans or animals or kids, because the robots are getting more and more capable. 00:54:41.980 |
As a summary, we have a vision language action model that is able to improve generalization. 00:54:50.560 |
It can do new tasks and operate on new objects. 00:54:53.520 |
It can also do chain-of-thought reasoning, and by improving the underlying model, such as 00:54:59.440 |
the vision language model itself, by scaling it up and training it with internet-scale 00:55:06.320 |
data or training it with larger or higher-quality internet-scale data, we can achieve better 00:55:11.040 |
robot control, which is quite amazing, because the robotics field has traditionally developed 00:55:16.280 |
quite slowly, bounded by hardware, bounded by a lot of different things, bounded by operations. 00:55:21.120 |
But now it seems we can piggyback on the development of the foundation model field, and whatever 00:55:27.920 |
they do will trickle down to our field as well. 00:55:30.560 |
And the future will be to increase the motion diversity and extend the chain-of-thought reasoning. 00:55:40.040 |
And so there is another example of positive transfer, which you might have seen recently. 00:55:46.520 |
So far, I've been talking about scaling differently. 00:55:49.520 |
I've been talking about not scaling robotics data and scaling other data instead. 00:55:54.120 |
That's because robotics data is so hard to collect, and the purpose is not to avoid collecting robot data altogether. 00:56:00.160 |
It's to develop a recipe that you can do more with limited robotics data. 00:56:05.560 |
However, there's also an effort from our team and the entire robotics field to scale up 00:56:12.640 |
the robot data collection, which is called Open X-Embodiment. 00:56:16.840 |
And the model trained on it is called RT-X, Robotics Transformer X. 00:56:20.440 |
It's basically 22 types of embodiments, 527 skills, and 60 datasets pooled all together. 00:56:28.520 |
So this will be the ultimate dataset we can use to study positive transfer and to study cross-embodiment learning. 00:56:37.280 |
And there is already evidence of positive transfer. 00:56:42.080 |
So we pooled all the data together from all these labs and found a common action representation 00:56:50.020 |
that we can use to train a robotics transformer. 00:56:53.080 |
And we have already found this jointly trained model can outperform the task-specific models developed in each lab. 00:57:02.560 |
So there are some benefits in pooling all the data together. 00:57:05.840 |
So scaling robot data is also quite important. 00:57:12.560 |
So the summary for this part is that we are seeing model consolidation. 00:57:16.640 |
We can now do the high-level reasoning and low-level control in one model. 00:57:21.240 |
And the low-level control part is what excites me, because it's so far away from the traditional 00:57:26.720 |
language model domain, it's so different, and it shows signs of life that knowledge can trickle 00:57:33.640 |
down a lot more than we used to think possible. 00:57:37.880 |
And we can scale the pre-training of vision language models as well as scaling robotics data. 00:57:42.720 |
And we observe more and more positive transfer model benefiting from diverse joint training 00:57:47.560 |
across internet-scale language, vision, and vision language domains. 00:57:52.200 |
All right, so I noticed that we are close to running out of time, so I will just very 00:58:00.120 |
quickly go through the second part, which I think is also interesting, which is finding new 00:58:04.680 |
interfaces for language models, but I will only talk about it at a very high level. 00:58:10.040 |
So language models, as we have seen, can directly output action tokens if we find the right action representation. 00:58:15.800 |
So we can treat action as yet another language to the language model. 00:58:20.240 |
So language model can do translation, so it should be able to generate action as well. 00:58:27.920 |
But can we generate more expressive actions, beyond the scope of fine-tuning? 00:58:34.640 |
So that is about finding the right interface. 00:58:38.040 |
So previously, we have already established that language models don't have an action interface. 00:58:43.880 |
And if they do have an action interface, it's not as effective. 00:58:48.480 |
So what is the best interface between language models and low-level actions? 00:58:51.840 |
I would argue the best interface between language models and low-level actions is reward functions. 00:59:06.500 |
And it's also a reparameterization of actions. 00:59:17.720 |
A skill is a mapping between my observation and my action. 00:59:21.480 |
So the mapping between observation and action can be seen as a skill. 00:59:25.360 |
But a skill can have an alternative definition, which is a set of constraints and a set of objectives. 00:59:31.500 |
So picking up the bottle means the bottle is in my right hand, and the bottle is off the table. 00:59:40.180 |
And how I pick it up doesn't really matter. 00:59:42.960 |
That's a broader definition of a skill. 00:59:47.120 |
And it's more transferable between different embodiments. 00:59:51.600 |
And the constraints and objectives can be represented as rewards. 00:59:57.560 |
So we can ask a language model to generate these reward functions. 01:00:04.360 |
Then an optimizer, which could be reinforcement learning, or could be model predictive control, optimizes 01:00:09.520 |
for those rewards, and then we run it on the robot. 01:00:19.560 |
So the reward translator basically is a two-stage process. 01:00:23.360 |
It's using the same language model, and it is using two different prompts. 01:00:28.160 |
So the motion descriptor basically describes the motion. 01:00:32.360 |
So just now we found that the language model can output a description of how a robot dog 01:00:38.640 |
should stand up, but it's not able to achieve that. 01:00:42.240 |
But the motion description is still sensible. 01:00:46.820 |
So we just generate this motion description, and then we have a reward coder 01:00:52.600 |
that translates this motion description into a piece of code representing the reward functions. 01:01:02.640 |
And these reward functions cannot be directly executed on the robot, but they can go through 01:01:08.520 |
an optimization process to learn how to achieve those reward functions. 01:01:13.480 |
So we're using reward as the interface between language model and a low-level controller. 01:01:20.020 |
And for the low-level controller, we're using MuJoCo MPC, which is a model predictive control tool. 01:01:29.660 |
It samples a lot of trajectories and finds the one that optimizes your reward. 01:01:36.180 |
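Here is a minimal sketch of the two-stage pipeline just described, with the compile-error feedback mentioned later in the Q&A. The prompts, reward-API names, and callables are hypothetical placeholders; llm() stands in for a language model call and mpc_optimize() stands in for MuJoCo MPC:

```python
# Minimal sketch of the two-stage "language to rewards" pipeline described above.
# Prompts and helper names are hypothetical placeholders, not the actual system.

MOTION_DESCRIPTOR_PROMPT = """You describe robot motions.
Describe, in plain language, how the robot should move to accomplish: {instruction}"""

REWARD_CODER_PROMPT = """You write reward code using functions like
set_torso_height(target), set_feet_on_ground(feet), set_base_velocity(vx).
Translate this motion description into reward-setting code:
{motion_description}"""

def language_to_rewards(instruction, llm, compile_rewards, mpc_optimize):
    # Stage 1: describe the motion in natural language (easier for the LLM to get right).
    motion_description = llm(MOTION_DESCRIPTOR_PROMPT.format(instruction=instruction))
    # Stage 2: translate that description into reward-function code.
    reward_code = llm(REWARD_CODER_PROMPT.format(motion_description=motion_description))
    try:
        reward_fns = compile_rewards(reward_code)
    except SyntaxError as err:
        # If the code doesn't compile, feed the error back to the reward coder only;
        # it doesn't have to propagate to the motion descriptor stage.
        reward_code = llm(REWARD_CODER_PROMPT.format(motion_description=motion_description)
                          + f"\nYour previous code failed with: {err}. Fix it.")
        reward_fns = compile_rewards(reward_code)
    # The optimizer (e.g. model predictive control) finds actions that maximize the rewards.
    return mpc_optimize(reward_fns)
```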
And we tested on a robot dog, a quadruped robot essentially, and a dexterous manipulator. 01:01:41.540 |
So the dexterous manipulator has an arm with six or seven degrees of freedom and a hand. 01:01:49.380 |
It's very hard to control manually because it has so many degrees of freedom. 01:01:56.600 |
So just to showcase some of the examples, I omitted the motion description part. 01:02:07.380 |
So it seems that the language model is able to generate the right reward functions to 01:02:13.340 |
make the robot stand up on two back feet like a human. 01:02:18.140 |
And then now we are a little bit more ambitious. 01:02:22.020 |
Can we make the robot do a moonwalk while standing up like this? 01:02:25.380 |
So a moonwalk is from Michael Jackson, and it's very challenging. 01:02:29.980 |
So it generates the motion description and generates the reward code. 01:02:35.260 |
But the motion is not so correct, not exactly what we want. 01:02:41.060 |
The nice thing about using a language model and using the reward function as the interface is that 01:02:45.820 |
you can go back and explain what went wrong and ask the language model to fix it. 01:02:51.300 |
So now we can, being very patient, explain it to the model. 01:02:54.540 |
You say: moonwalk means the robot should walk backward while the feet swing as if they are moving forward. 01:03:02.860 |
Such a great explanation, kudos to my colleague, and then we ask it to correct its answer and make the robot walk that way. 01:03:11.380 |
And after being very patient and giving it the right instruction, it's able to modify 01:03:16.740 |
the motion descriptor and also generate the right set of rewards to make this happen. 01:03:22.940 |
And now you can teach a robot to do a moonwalk just by using the language as an interface. 01:03:29.980 |
And one day we'll be able to do this on the real robot as well. 01:03:33.900 |
So in the previous section, you showed how the language model decoded numbers, and 01:03:39.820 |
you constrained it to only output numbers. 01:03:42.140 |
Here, how do you prevent it from just hallucinating some program? 01:03:49.940 |
In this work, we are not preventing hallucination in a programmatic way. 01:03:56.580 |
We have a set of system prompts or a set of rules that is explaining the API. 01:04:02.820 |
After all, the reward functions need to be able to be compiled by the optimizer. 01:04:14.100 |
What's more, if it doesn't compile, we can just give the error message back to the language model. 01:04:18.900 |
It doesn't have to propagate all the way to the motion descriptor; it can stay at the reward coder stage. 01:04:32.100 |
Using this framework, we can say, open a drawer, take the apple, put it into the drawer, and 01:04:38.540 |
close the drawer, and it will be able to do that. 01:04:43.740 |
Just using the reward coder alone is not good enough. 01:04:46.100 |
Rather, our two-stage prompting is really, really helpful. 01:04:51.100 |
I think that's another inspiration for other fields: when your domain is too different 01:04:56.380 |
from the language domain, maybe it would be good to find an intermediate representation and 01:05:00.700 |
ask the language model to explain in that intermediate representation before directly producing the final output. 01:05:07.620 |
Finally, we want to transfer this to the real world, but there is a challenge. 01:05:14.220 |
Using simulation, it might generate actions that are too dexterous, which are hard to reproduce on a real robot. 01:05:23.780 |
So we add a few more regularizer terms to stabilize the motion, and we also run some 01:05:30.220 |
state estimation on the real robots so that they understand where the cubes are, and then 01:05:37.020 |
we can take the motion from simulation and achieve it in the real world. 01:05:41.660 |
So here are some of the execution in the real world. 01:05:45.100 |
So you can say, pick up the Rubik's cube, and it will generate the motion to pick up the Rubik's cube. 01:06:02.100 |
So here, it can do 10 hertz or even 30 hertz. 01:06:15.420 |
There's one last thing that I want to talk about in terms of finding a new interface. 01:06:20.580 |
So a lot of the time, we have been thinking about the language model as a semantic engine, a semantic reasoning machine. 01:06:27.860 |
So, for example, you say the student takes out the book, 01:06:34.340 |
and the language model is able to reason about such a sequence. 01:06:37.780 |
But if you give it low-level patterns, like if you just give it obscure numbers, what can it do? 01:06:45.540 |
And we can open up the low-level interface of a language model and ask it to do robotics tasks. 01:06:52.380 |
So in this paper, "Large Language Models as General Pattern Machines," we explore using 01:06:56.980 |
the low-level interface of a large language model, essentially asking it to reason about sequences of numbers. 01:07:06.020 |
And it can solve tasks like the ARC challenge and PCFG tasks. 01:07:14.020 |
So I will dig a little bit into sequence improvement because that's quite relevant to robotics. 01:07:19.180 |
So sequence improvement is that you prompt the language model with state, action, and reward sequences. 01:07:25.220 |
And you just prompt it with a higher reward and see if it can generate actions that achieve that reward. 01:07:32.820 |
So it's doing reinforcement learning, or a reinforcement-learning-like thing, but in context. 01:07:39.460 |
So previously, you would need a dedicated algorithm and a replay buffer of collected data to do this. 01:07:46.940 |
But now you can just build everything in the language model context by leveraging the low-level interface. 01:07:53.820 |
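A minimal sketch of what such a sequence-improvement prompt could look like follows; the serialization format and the query_llm helper are assumptions for illustration, not the paper's exact format:

def build_improvement_prompt(episodes, target_reward):
    # episodes: list of (states, actions, reward), with states and actions as numbers.
    lines = []
    for states, actions, reward in episodes:
        traj = " ".join(f"{s} {a}" for s, a in zip(states, actions))
        lines.append(f"reward {reward}: {traj}")
    # Condition on a reward higher than anything seen and ask for the trajectory.
    lines.append(f"reward {target_reward}:")
    return "\n".join(lines)

def improve(episodes, query_llm):
    best = max(reward for _, _, reward in episodes)
    return query_llm(build_improvement_prompt(episodes, best + 1))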
And with that, we can actually do something like clicker training. 01:07:57.100 |
So if you are not very familiar with clicker training, it's how you train a dog. 01:08:02.500 |
You can have a dog, and when it does the right thing, you give it a reward by clicking. 01:08:09.020 |
So clicker training is just giving the agent a reward when you click. 01:08:16.380 |
And we can now use clicker training to train robots as well. 01:08:20.020 |
So here, the robot is exploring, but I give a click when it does the right thing or moves in the right direction. 01:08:26.300 |
And over time, it will be able to push the bag of chips, which is the objective of this task. 01:08:33.500 |
So you can do this entire decision-transformer-like operation, but purely in context, by just 01:08:40.060 |
giving a language model a bunch of patterns and asking it to figure out what the regularity is. 01:08:47.260 |
And this way, it can generate new actions to improve the previous sequence. 01:08:54.980 |
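Very roughly, that in-context clicker loop might look like the following sketch, where env, propose_action, and query_llm are all hypothetical placeholders:

def clicker_training(env, query_llm, propose_action, num_steps=50):
    # context holds (state, action, click) triples and stays entirely in the prompt.
    context = []
    state = env.reset()
    for _ in range(num_steps):
        action = propose_action(query_llm, context, state)  # pattern completion by the LLM
        state = env.step(action)
        click = 1 if input("click? [y/N] ").strip().lower() == "y" else 0
        context.append((state, action, click))
    return context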
So for the language model, we can find new interfaces that are more suitable for teaching robots low-level skills. 01:09:04.300 |
Reward is a bridge between the language model and low-level control, and we can fully leverage it as a 01:09:09.460 |
universal interface, and we can optimize it in real time. 01:09:16.060 |
Sometimes it outperforms generating actions directly. 01:09:18.420 |
So it really motivates using reward functions as the interface. 01:09:23.640 |
And in "Large Language Models as General Pattern Machines," we show we can use the language model beyond semantic reasoning. 01:09:30.500 |
And also, robotics is a domain rich in sequence transformation, sequence completion, and sequence improvement tasks. 01:09:37.540 |
So we can really study the lower-level mechanisms of language models. 01:09:43.380 |
And the key takeaway for this talk is that we are seeing more and more use of foundation 01:09:51.100 |
models, not only on the semantic reasoning side of robotics, but also on the dexterous side, 01:09:57.220 |
on the generating actions, on the lower-level embodied intelligence side of robotics. 01:10:03.340 |
And we need to rethink the scaling law of robotics and transformer. 01:10:07.300 |
How do we scale it with limited amount of data? 01:10:10.260 |
We have a new recipe for scaling robot models and data in RT2, which shows that you can 01:10:14.500 |
do more with the same data: with essentially RT1 data plus internet data, you can generalize much further. 01:10:20.460 |
And RTX shows that you can do a lot more with more data. 01:10:24.660 |
There are also benefits to collecting more robotics data. 01:10:29.260 |
And part two, in terms of new interfaces for language models: I think it's worth it for the 01:10:35.160 |
robotics field to think about developing new, lower-level interfaces to language models. 01:10:45.540 |
And if you find this interesting, there are a lot of references for you to look into. 01:10:50.540 |
And special thanks to my team, Google DeepMind Robotics team. 01:10:55.740 |
So we are at the forefront of developing foundation models for robotics. 01:11:03.580 |
You mentioned that raw numbers are difficult for a lot of language models, but if you're 01:11:15.760 |
just generating the action tokens themselves, like in the example you showed, 01:11:21.820 |
why don't you just have a linear layer appended to the transformer that would just generate 01:11:28.860 |
the numbers you need directly? 01:11:33.580 |
The question is: if large language models have difficulty understanding numbers, 01:11:40.140 |
why don't we use a linear layer to output the actions directly? 01:11:43.460 |
I think it is difficult for language models to understand numbers. 01:11:47.760 |
But sometimes we still want it to bring in knowledge from the pre-training mixture. 01:11:57.140 |
If I have a new layer, that new layer is not present in the pre-training. 01:12:06.420 |
But at the same time, I don't necessarily think using the raw numbers is the right interface. 01:12:12.300 |
We probably could do some action representation learning to learn a representation. 01:12:16.660 |
And the language model can output that representation. 01:12:19.660 |
So we're still trying to figure out what is the right representation. 01:12:24.180 |
So among the representations that we have tried, like decimal numbers, floating-point numbers, and 01:12:30.300 |
action tokens, we find that just using numbers or action tokens is good enough. 01:12:39.500 |
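For a sense of what treating numbers as action tokens can mean in practice, here is an illustrative discretization sketch; the bin count and the ranges are assumptions rather than the exact setup used:

import numpy as np

def action_to_tokens(action, low, high, num_bins=256):
    # Map each continuous action dimension to an integer bin index the model can emit.
    action = np.clip(np.asarray(action, dtype=float), low, high)
    bins = np.round((action - low) / (high - low) * (num_bins - 1))
    return bins.astype(int).tolist()  # e.g. [132, 7, 255]

def tokens_to_action(tokens, low, high, num_bins=256):
    # Invert the discretization back to approximate continuous values.
    frac = np.asarray(tokens, dtype=float) / (num_bins - 1)
    return low + frac * (high - low)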
Yeah, I think both directions are worth exploring. 01:13:01.980 |
There are different advantages of generating actions directly. 01:13:06.940 |
I think it borrows the autoregressive nature of language modeling. 01:13:12.660 |
And it aligns really well with a lot of other tasks, like visual question answering. 01:13:18.540 |
The limitation is that, when you are generating actions this way, the output is heavily regularized. 01:13:23.900 |
Generating dexterous actions that are far out of distribution is kind of difficult. 01:13:29.380 |
The language-to-reward approach actually takes a page from the book of traditional robotics, this 01:13:35.140 |
optimization-based or model-predictive control. 01:13:40.020 |
And you can also take into account, let's say, safety constraints more easily. 01:13:48.820 |
Maybe one recipe is to generate a lot of data with the language-to-reward system and distill it into a policy. 01:13:56.900 |
Because then you are imbuing your large language model with all these other desirable behaviors. The 01:14:02.780 |
language-to-reward approach itself, I don't know how scalable it is. 01:14:08.900 |
So maybe you are limited to what-- you are at the mercy of the training data of the language model. 01:14:15.620 |
The language model can do the moonwalk because it knows what a moonwalk is. 01:14:24.180 |
But if you want to scale to completely new things, maybe you can use language to 01:14:28.100 |
reward to bootstrap your data generation and then distill it into the other policy. 01:14:33.460 |
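Sketched roughly, that bootstrapping recipe could look like the following, where generate_episode and policy.fit are hypothetical placeholders for optimizing the LLM-written reward and for behavior cloning:

def bootstrap_and_distill(instructions, generate_episode, policy, rollouts_per_task=100):
    # generate_episode(text) -> list of (observation, action) pairs produced by
    # optimizing the LLM-written reward for that instruction (hypothetical helper).
    dataset = []
    for text in instructions:
        for _ in range(rollouts_per_task):
            dataset.extend(generate_episode(text))
    policy.fit(dataset)  # plain behavior cloning on the generated data
    return policy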
So can you tell us what's the next direction Google is pursuing? 01:14:39.940 |
So it's like, is language to reward the right direction, in terms of scaling it up? 01:14:46.860 |
So the scaling being the end of the lecture, that is a joke. 01:14:57.040 |
So I think everybody believes in the power of the scaling law. 01:15:04.860 |
So just by giving it more data and more compute, you will see interesting capabilities emerge. 01:15:12.740 |
Yeah, I still think we don't quite have enough data. 01:15:30.360 |
I think that's still probably the biggest bottleneck. 01:15:33.660 |
So we are trying to find ways to do more with limited data. 01:15:40.500 |
And I think it needs some time for us to accumulate enough data. 01:15:45.200 |
And currently, I'd say, we have signs of life for positive transfer. 01:15:50.320 |
But in language models, people don't talk about positive transfer anymore because it's taken for granted. 01:16:01.840 |
Yeah, how much has your team been thinking about safety and alignment? 01:16:07.440 |
And are you just, right now, relying on the ethics that emerge from the large language model? 01:16:14.400 |
Like, it won't tell you to kill someone to achieve a goal. 01:16:18.800 |
Actually, we take safety very, very seriously, because in all of the other domains of developing 01:16:24.520 |
language models, there is no direct impact on the physical world. 01:16:31.800 |
But here, it could have potential harm to humans and to the environment. 01:16:37.600 |
And Gary Marcus actually gave a comment previously to our work that, what if you say, bring out 01:16:45.560 |
a bowl, feed a cat, and put it in the dishwasher? 01:16:47.440 |
Well, it might put the cat in the dishwasher, right? 01:16:50.040 |
If it misunderstands, it will actually have a catastrophic failure. 01:16:57.720 |
We handle safety carefully by designing hardware and software safety layers. 01:17:03.520 |
And there is also some constitutional safety work that is coming out sometime soon. 01:17:11.280 |
I cannot tell much details right now, but sometime soon, we'll release some work. 01:17:17.240 |
Is it something like, if there's a human, just don't interact? 01:17:23.440 |
I think it's a little bit more nuanced and more detailed than that. 01:17:29.880 |
And in some of our experiments, actually, the robot's fingers would break off so that 01:17:34.040 |
it cannot apply too much force to the environment. 01:17:36.280 |
So that's just yet another way of ensuring safety. 01:17:39.600 |
Can we have some visual language model and a synthesizer or something to stop the problem 01:17:49.120 |
And maybe, this is kind of like interpretable, but also in some logical way.