
Stanford CS25: V3 I Low-level Embodied Intelligence w/ Foundation Models



00:00:00.000 | So, hey guys, thanks for coming to our second class.
00:00:10.000 | Today we have the pleasure of welcoming Fei Xia. He's a senior research scientist at
00:00:15.080 | Google DeepMind, where he works on the robotics team.
00:00:18.800 | He received his PhD here, actually, working with Silvio Savarese in the Stanford Vision and
00:00:24.720 | Learning Lab, as well as Leonidas Guibas.
00:00:27.960 | And his mission is to build intelligent embodied agents that can interact with complex and
00:00:33.380 | unstructured real-world environments, with applications in home robotics.
00:00:40.120 | Recently he has been exploring the use of foundation models for robot decision-making
00:00:44.880 | and action generation.
00:00:47.040 | So now I'll hand it off to Fei.
00:00:49.080 | Hi everyone, I'm super happy to be here and happy to be back.
00:00:53.400 | I graduated from here two years ago, and now I'm a research scientist at Google DeepMind.
00:00:58.800 | I work on the robotics team, and today I will be talking about low-level embodied intelligence
00:01:03.880 | with foundation models.
00:01:05.600 | So it's definitely an interesting topic, and I will introduce what is embodied intelligence
00:01:11.320 | and what is low-level embodied intelligence, and how we can accelerate the building of
00:01:16.960 | them with foundation models.
00:01:19.800 | All right, so why are we working on embodied intelligence?
00:01:24.340 | So embodied intelligence is an integral part of artificial intelligence, and it's an important
00:01:31.760 | milestone to artificial general intelligence.
00:01:35.760 | And it has a lot of use cases. For example, we all hope to have a home robot that can
00:01:41.000 | be in our home 24/7 and clean the home for us, or clean up our messy room, or cook for
00:01:48.720 | us, or take care of our aging family members.
00:01:52.400 | So we are not quite there yet.
00:01:53.920 | In fact, we are quite far from it.
00:01:56.360 | That is because our intelligence is currently mostly in the virtual world.
00:02:00.840 | So we have AI agents that can help us draft emails or write eloquent essays, but they
00:02:07.480 | are not super good at interacting with the messy real world, unstructured, complex environment
00:02:13.720 | that humans reside in.
00:02:16.600 | So just to give you guys a couple of examples of how messy the real world can be, and how
00:02:22.280 | hostile it could be to robotics, I want to show you a curious mistake or curious error
00:02:28.840 | from one of our robots.
00:02:30.720 | So the task is to put the Coke can in the sink, and watch what the robot does.
00:02:36.380 | The robot grabs the Coke can and opens the tap.
00:02:39.400 | So this is kind of dangerous, but it's kind of interesting, right?
00:02:45.820 | Because we never expect it would do something like that.
00:02:48.800 | It's just from random noise, it starts to open the tap, and the water starts to come out.
00:02:54.520 | So for an agent to have this type of physical intelligence, it needs to understand the effect
00:03:01.000 | of its actions, and what is so-called a world model.
00:03:04.640 | So people have been complaining that language models so far don't have a world model.
00:03:08.780 | So it doesn't understand geometry, it doesn't understand the spatial relationship of objects,
00:03:15.120 | or the effect of actions, basically how objects will move according to physical laws.
00:03:21.800 | So we are not quite there yet.
00:03:24.040 | In another case, so this is our robot that is ready to deliver a can, or actually throw
00:03:30.120 | away a can.
00:03:31.480 | But as you can see, we have this pre-programmed behavior of tucking the arm behind.
00:03:37.300 | And in doing that, the can is upside down.
00:03:39.800 | So if there's any liquid in the can, it will spill and damage the robot.
00:03:44.800 | So it's another example that the real world is really complex, and there are a lot of things to
00:03:49.560 | model.
00:03:50.560 | And in order for our robots to have this sort of ambient intelligence, they really need
00:03:56.620 | to understand a lot of very nuanced details of the environment, and understand the
00:04:02.120 | physics, the physical laws, and the effects of their actions.
00:04:09.280 | How do we do that?
00:04:10.280 | There are many ways to achieve embodied intelligence.
00:04:12.560 | Actually, throughout my PhD study, I've been fascinated by this idea of creating interactive
00:04:18.840 | environments: basically, letting agents explore in interactive environments, and
00:04:24.720 | creating environments that are complex enough
00:04:28.520 | so that if the agent needs to survive in such an environment, it must develop intelligence.
00:04:33.920 | So it's an ecological view of perception and agency, popularized by the American psychologist
00:04:40.320 | James J. Gibson.
00:04:42.320 | So he has a famous quote: "Ask not what's inside your head, but what your head is inside of."
00:04:49.200 | So humans learned this type of embodied intelligence.
00:04:51.960 | Humans are able to manipulate objects effortlessly, first because of evolution, and second because
00:04:58.160 | of childhood experience.
00:04:59.400 | We have been playing with toys, we have been interacting with toys, and watching
00:05:03.160 | the physical effects, so that we learn.
00:05:06.080 | And similarly, we can give robots a safe playpen, so they can explore in those environments and
00:05:13.240 | interact with the environment and play, and watch the effects of their actions, and effectively understand
00:05:19.080 | how to manipulate those objects.
00:05:22.460 | So I have been developing these simulation environments, one of which is called the Gibson
00:05:29.760 | environment, which was published at CVPR.
00:05:32.520 | It's mainly aimed at simulating the visual world faithfully, and also simulating the physical
00:05:39.280 | world to some extent.
00:05:40.960 | So we built this environment, which is a scanned environment from a lot of houses.
00:05:46.680 | And then we can spawn an agent in it, in this case, a humanoid agent, and
00:05:51.840 | the agent can learn to walk or to run in this environment, and we simulate all this perception
00:05:58.120 | information.
00:05:59.120 | So we can create a perception action loop for this agent.
00:06:03.560 | And similarly, we can put other types of agents in this environment, in this case, a little
00:06:09.200 | cart, and we can also put a quadruped or this ant into this environment.
00:06:16.040 | So essentially, we create an environment where we can simulate perception for the agent,
00:06:23.000 | and then we can create a neural network to map the perception to action.
00:06:26.840 | And this way, we achieve some sort of physical intelligence.
00:06:30.780 | It's mostly for navigation and locomotion.
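A minimal sketch of the perception-action loop described here, assuming a hypothetical Gym-style simulator interface; the class and method names are illustrative, not the actual Gibson/iGibson API:

```python
import numpy as np

class RandomPolicy:
    """Stand-in for the neural network mapping perception to action."""
    def __init__(self, action_dim):
        self.action_dim = action_dim

    def act(self, observation):
        # A real policy would be a neural network consuming RGB/depth images.
        return np.random.uniform(-1.0, 1.0, size=self.action_dim)

def run_episode(env, policy, max_steps=500):
    obs = env.reset()                 # simulated perception (e.g. RGB, depth)
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy.act(obs)      # perception -> action
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

The point of the sketch is just the closed loop: the simulator produces observations, the policy maps them to actions, and the simulator applies those actions and produces the next observation.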
00:06:35.960 | This is not enough.
00:06:37.560 | So in this case, the environment is one monolithic piece of mesh.
00:06:42.500 | As you can see, the agent runs into the wall and bounces back.
00:06:46.900 | So there is no articulation in this environment.
00:06:50.080 | So it's not simulating the full complexity of the environment.
00:06:53.800 | So the things that we can do with our agent are rather limited.
00:06:59.080 | So that's why we created other simulation environments, one of which is the iGibson environment, which
00:07:04.280 | stands for Interactive Gibson.
00:07:06.140 | So what we do is we, again, scan a lot of real-world houses, and then we convert
00:07:13.520 | them to CAD assets, basically mesh assets that are interactable.
00:07:18.600 | In this case, we have a simulated agent that goes into the environment and then closes all
00:07:23.680 | the drawers.
00:07:25.260 | So we are able to do that because we model the complexity of the world a little bit more.
00:07:30.440 | We go beyond just modeling the visual world.
00:07:33.000 | We start to model physics a little bit more, basically modeling the degree of freedom in
00:07:38.640 | the environment.
00:07:39.640 | And our agent can do more than just navigating around.
00:07:44.040 | So we can go even further.
00:07:47.040 | So we can even model more degrees of freedom.
00:07:50.200 | And our agents can develop more complicated behaviors, such as unloading a dishwasher:
00:07:55.040 | finding a bowl, taking it out, and putting it on the table.
00:07:59.700 | So as we scale up the complexity of the environment, we are able to learn much more complicated
00:08:06.120 | skills in simulation.
00:08:08.920 | And that's one way to achieve embodied intelligence, which is to build complex enough simulation
00:08:14.840 | environment.
00:08:19.080 | Not just my research, but the entire field of computer vision is undergoing a paradigm
00:08:23.720 | shift.
00:08:24.740 | So previously, we were focusing on internet AI.
00:08:27.780 | We curated a lot of internet datasets to study problems like classification, segmentation,
00:08:34.180 | and detection.
00:08:35.180 | Basically all these computer vision problems.
00:08:37.460 | Now we focus a lot more on embodied AI, which adds the action dimension to the problems
00:08:44.700 | we study: problems like visual navigation, manipulation, rearrangement, embodied
00:08:49.340 | question answering, and instruction following. And the simulators in some sense replace the
00:08:57.260 | original role of datasets.
00:08:59.900 | One thing that doesn't change is that data is still super important.
00:09:04.420 | We still rely on a large amount of data to learn this intelligent behavior, no
00:09:11.220 | matter if it's from a static data set or from a simulator.
00:09:16.780 | So learning in simulation can take a lot of interactions.
00:09:22.180 | So just to give you an example, we create this iGibson environment and we want to learn
00:09:27.980 | a behavior called go into a room through a closed door.
00:09:31.820 | So this is a rather simple behavior, which I can show on the top right of the screen.
00:09:37.320 | So the agent needs to stop in front of the door; it needs to stop at the right distance.
00:09:42.300 | If it stops too close to the door, it cannot extend its arm.
00:09:45.540 | If it's too far, it cannot open the door.
00:09:47.940 | And then it basically opens the door.
00:09:49.940 | Let me play this again.
00:09:50.940 | Open this door, when there is enough clearance, it will go into the door.
00:09:55.100 | However, it takes about 50,000 episodes or 1.25 million environment interactions to learn
00:10:02.100 | this type of behavior.
00:10:04.100 | This is because we are using model-free reinforcement learning; the agent is exploring this environment.
00:10:10.100 | It could push at any point, it could stop at any point.
00:10:14.100 | So we give it a reward function for going into the room, but it's very rare that it will
00:10:20.220 | stumble upon this behavior.
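A minimal sketch of the model-free RL setup described above: a sparse reward for ending up inside the target room, collected over many episodes. The environment interface, room-bound check, and episode lengths are assumptions for illustration, not the actual iGibson training code:

```python
def sparse_room_reward(agent_position, room_bounds):
    """+1 only when the agent is inside the target room, else 0."""
    (x, y) = agent_position
    (x_min, y_min, x_max, y_max) = room_bounds
    return 1.0 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0.0

def collect_experience(env, policy, num_episodes=50_000, steps_per_episode=25):
    # 50,000 episodes at a few dozen steps each is on the order of the
    # 1.25 million interactions quoted in the talk; with a sparse reward the
    # agent only rarely stumbles on the door-opening behavior.
    buffer = []
    for _ in range(num_episodes):
        obs = env.reset()
        for _ in range(steps_per_episode):
            action = policy.act(obs)
            next_obs, reward, done, info = env.step(action)
            buffer.append((obs, action, reward, next_obs, done))
            obs = next_obs
            if done:
                break
    return buffer
```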
00:10:23.340 | I would like to argue that with foundation models, we can do things a lot differently.
00:10:28.100 | So what do you do nowadays?
00:10:29.700 | You just ask ChatGPT, how do you go into a room through a closed door?
00:10:33.980 | And it will say, open the door, walk through the door.
00:10:36.120 | So this is a gross simplification of the problem.
00:10:39.420 | Of course, the problem is not that simple.
00:10:43.660 | But what I'm just saying is that we can leverage a lot of semantic prior from the foundation
00:10:50.460 | models.
00:10:51.460 | So if we really need a lot of data, the foundation model is a compressed
00:10:56.900 | version of that entire data, and it's a knowledge base that you can query to accelerate
00:11:01.900 | the development of robotics.
00:11:03.580 | Of course, simulation and real world data is still super, super important, but maybe
00:11:08.460 | we can get the best of both worlds.
00:11:10.580 | We can use foundation models plus a limited amount of simulation or real world data.
00:11:16.980 | So that's what I'm going to talk about today.
00:11:20.460 | So where are we in terms of foundation models plus robotics?
00:11:24.300 | So our team at Google DeepMind has been pioneering foundation models plus robotics.
00:11:29.940 | So we developed advanced, high-level planning algorithms.
00:11:34.700 | One of the first is called PaLM-SayCan.
00:11:37.480 | It is an algorithm that can parse a user command.
00:11:41.540 | So here is a demo.
00:11:42.540 | Here is a scenario.
00:11:43.540 | Here is a user command.
00:11:44.540 | I spilled my Coke on the table.
00:11:45.540 | How would you throw it away and bring me something to help clean?
00:11:48.740 | And it's querying a large language model, which gives a score, highlighted in blue.
00:11:54.620 | And there is also an affordance score.
00:11:56.180 | The affordance will tell you whether an action at a given state is possible.
00:12:00.580 | It's augmenting the language model to give you only possible things.
00:12:04.880 | So essentially, it is doing the semantic planning with a language model.
00:12:09.860 | But it's also taking into consideration what it can do.
00:12:13.080 | So it's not just outputting the-- like, language models tend to hallucinate.
00:12:21.320 | It doesn't hallucinate.
00:12:22.320 | It only gives you what is possible for the robot to do and what is actionable for the
00:12:25.760 | robot.
00:12:26.760 | And the robot is doing the thing that is advancing the long horizon task progress.
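A minimal sketch of the SayCan-style scoring just described: each candidate skill gets a language-model likelihood (how useful is this step?) multiplied by an affordance score (can the robot actually do this here?), and the robot executes the argmax. The two scoring functions below are placeholders, not the real PaLM or affordance models:

```python
import math

def llm_log_likelihood(instruction, history, skill):
    # Placeholder: a real system scores the skill string under a large LM.
    return -len(skill)  # dummy value for illustration only

def affordance(skill, state):
    # Placeholder: a real system uses a learned affordance/value model.
    return 1.0 if skill in state["feasible_skills"] else 0.0

def choose_next_skill(instruction, history, candidate_skills, state):
    scores = {}
    for skill in candidate_skills:
        lm_score = math.exp(llm_log_likelihood(instruction, history, skill))
        scores[skill] = lm_score * affordance(skill, state)  # combined score
    return max(scores, key=scores.get)

state = {"feasible_skills": ["pick up coke can", "go to trash can"]}
skills = ["pick up coke can", "go to trash can", "clean the table"]
print(choose_next_skill("I spilled my coke, throw it away", [], skills, state))
```

The multiplication is what keeps the planner from picking a step the language model likes but the robot cannot actually perform in the current state.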
00:12:31.860 | And also, each task is executed by a low-level policy.
00:12:36.560 | Here it doesn't quite clean the table, because we haven't added this to the low-level skill.
00:12:42.880 | But imagine there is a low-level skill to clean the table.
00:12:44.960 | It will finish the entire thing.
00:12:48.640 | What is a low-level policy used here?
00:12:50.960 | The low-level policy used here is Robotic Transformer 1, RT1.
00:12:56.100 | It's our team's homegrown transformer.
00:12:59.200 | Essentially, we collect a large data set of human demonstrations.
00:13:04.280 | We put a transformer, and we train it on this large data set of expert trajectories.
00:13:12.440 | It is able to do about 700 tasks with 97% success rate.
00:13:18.440 | And it has interesting generalization behavior.
00:13:21.280 | It can operate in a new kitchen it has never seen before, which shows there is a successful
00:13:27.780 | recipe to apply foundation models in robotics.
00:13:31.760 | So that's roughly where we are in terms of foundation models plus robotics.
00:13:36.400 | And I will talk about a few new works that are bringing this to the next level.
00:13:44.600 | So actually, my teammate, Ted, gave a talk on foundation models plus robotics at the
00:13:51.000 | beginning of this year.
00:13:53.440 | It's also this class, CS25.
00:13:55.760 | I highly recommend it.
00:13:57.680 | It's available on YouTube.
00:13:58.680 | I actually watched it last night so that I don't repeat some of the contents.
00:14:04.280 | But what he basically mentioned is that he revealed our team's progress in terms of building
00:14:12.040 | these robotic foundation models.
00:14:14.840 | And we have had somewhat of a detour, and now we have sort of figured out a recipe.
00:14:21.200 | So 2021 to 2022 was about how we scale to many tasks with demonstrations.
00:14:27.040 | How do we collect a large amount of data?
00:14:29.080 | In fact, about 100,000 demonstrations.
00:14:33.680 | And we tried different ways to do it.
00:14:36.440 | We tried behavior cloning.
00:14:38.160 | We tried imitation learning plus reinforcement learning, and some other ways, or combining
00:14:43.060 | them with language models such as SayCan.
00:14:46.520 | In 2022 to 2023, it's about how we can leverage foundation models to accelerate robotics.
00:14:51.880 | We really see a proliferation of using foundation models to accelerate robotics, both on the
00:14:58.680 | high-level planning and low-level control, probably leaning more towards a high-level
00:15:02.780 | planning.
00:15:04.520 | So if the recipe works-- so the recipe is essentially to combine a large-scale, diverse
00:15:11.760 | offline dataset with a high-capacity architecture, such as a transformer, and use language
00:15:18.480 | as a universal glue.
00:15:20.000 | So this will be the recipe to build foundation models for robotics.
00:15:25.440 | So if this recipe works, what do we do?
00:15:28.000 | What do we do next?
00:15:29.000 | Essentially, we're just-- let's just scale everything by orders of magnitude and be done
00:15:34.600 | with it and solve robotics.
00:15:38.080 | And guess what?
00:15:39.080 | That's what we did.
00:15:40.080 | So that's the end of the lecture.
00:15:41.080 | I'm going to cut this a little bit short.
00:15:44.120 | And that's a joke.
00:15:45.800 | That's not happening.
00:15:47.600 | So we are still on our way, on our quest to solve low-level embodied intelligence.
00:15:53.680 | When I tell people that you can use foundation models to do robotics, their reaction would
00:16:01.160 | be that it's mostly doing high-level reasoning.
00:16:04.760 | It doesn't do the low-level manipulation really well.
00:16:08.840 | And that's for a reason.
00:16:10.480 | One of the reasons is there is a Moravec's paradox.
00:16:14.000 | Moravec's paradox is the observation that, in artificial intelligence and robotics, contrary
00:16:18.600 | to traditional assumptions or our intuitions, reasoning requires very little computation,
00:16:24.000 | but sensorimotor control and perception skills require enormous compute resources.
00:16:29.720 | That is because, as biological creatures, we acquired our sensorimotor skills through evolution.
00:16:38.400 | This is very different.
00:16:39.680 | So we might not be able to reason or do large-scale computation.
00:16:46.720 | But this sensorimotor control is integral to our survival.
00:16:51.580 | So it's essentially already learned, in our DNA.
00:16:55.080 | But in robotics, it's a little bit different.
00:16:57.480 | So the chips are very good at doing reasoning and computation.
00:17:02.360 | But they are not super good.
00:17:03.360 | They haven't experienced the world.
00:17:04.960 | They haven't acquired the sensorimotor skills that are necessary for them to do tasks in
00:17:11.040 | the real world.
00:17:12.040 | Here is an example.
00:17:13.680 | When the computer beat Kasparov, basically the human champion in chess, there wasn't a
00:17:22.280 | robot arm moving the chess pieces.
00:17:23.880 | It could beat the human champion in chess, but there still needed to be someone to move the chess
00:17:29.360 | pieces.
00:17:30.360 | Similarly, in the AlphaGo moment, when Lee Sedol was beaten by AlphaGo, there was still
00:17:35.000 | someone moving the pieces for it.
00:17:37.840 | It's not a robot doing that.
00:17:39.260 | So this is showing that the hard things are easy, and the easy things
00:17:43.600 | are hard.
00:17:45.640 | There's another thing that prevents us from using foundation models more prevalently,
00:17:51.600 | at a larger scale in robotics, which is the training data bias.
00:17:57.160 | The training data of foundation models or large language models are mostly language
00:18:01.440 | tasks.
00:18:02.440 | So it's perhaps not that surprising it knows how to clean up a kitchen because maybe there
00:18:07.960 | are wikiHow articles teaching you how to clean up a kitchen or to do something in a procedural way.
00:18:14.040 | But there are no wikiHow articles teaching you how to move your finger five centimeters
00:18:18.260 | to the left because people just don't say that.
00:18:21.880 | People don't write that down.
00:18:23.040 | So there is a very limited amount of this low-level control data in large language model
00:18:28.720 | training corpora.
00:18:29.720 | So we do have a lot of challenges in bringing the foundation models to a lower level.
00:18:34.000 | So that's what I mean by low-level embodied intelligence.
00:18:37.600 | So any questions so far?
00:18:39.680 | Also, I want to make this quite interactive.
00:18:42.040 | So if there is any questions, feel free to interrupt me any time.
00:18:47.040 | All right, if not, we can continue.
00:18:53.120 | So there are a couple of challenges of using large language models for low-level control.
00:18:57.240 | As I just mentioned, the first thing is lack of data.
00:19:01.680 | So we only have perhaps 100,000 episodes of human demonstration data, and it took about 13
00:19:10.960 | robots 17 months to collect.
00:19:13.200 | So it's a huge amount of effort.
00:19:15.920 | In contrast, large language models are trained on the order of 1,000 billion tokens.
00:19:21.680 | A smaller PaLM was trained on 780 billion tokens, and for the larger one, following
00:19:31.640 | the Chinchilla rule, you would need to train it on 1.35 trillion tokens.
00:19:36.600 | So there is a huge discrepancy between how much data we can get in robotics and
00:19:43.320 | how much we can get in large language models.
00:19:48.120 | So we will always be bounded by robotic data.
00:19:50.360 | So maybe we can scale on other fronts.
00:19:53.560 | Maybe we can keep the robotics data the same, and then we can scale on other fronts.
00:19:58.200 | Like, maybe we can scale the pre-training mix of text and image, or maybe image and
00:20:02.320 | text pairs.
00:20:03.520 | Maybe we can build this cake, and the robotics data is just a cherry on top of it.
00:20:10.020 | And we can scale the foundation really, really well.
00:20:14.260 | Some of my work that I'm going to talk about today actually reuses the RT1 data.
00:20:18.800 | We didn't collect new data for RT2, but we want to do more things with the same amount
00:20:23.560 | of data.
00:20:26.000 | The second challenge is kind of related to the first challenge.
00:20:30.400 | Language models lacks an interface for low-level control.
00:20:34.680 | If you ask a language model, how do you make a robot dog stand up on two feet, it will
00:20:39.360 | tell you a lot of things that sound reasonable, sounds plausible.
00:20:43.120 | It will tell you the robot dog's torso is upright, balance over two hind feet, and standing
00:20:48.280 | shoulder-width apart.
00:20:49.760 | This is great.
00:20:50.760 | This is all great.
00:20:51.760 | But we cannot put it on the robot.
00:20:55.920 | On the other hand, maybe we can ask a language model to write control code to directly control
00:21:00.120 | the robot.
00:21:01.120 | But usually, that requires you to curate an API that is friendly to the language model.
00:21:06.840 | If you directly ask it to give you the joint angles to make the robot stand upright,
00:21:12.520 | it will not give you the right thing, because it doesn't have enough context.
00:21:15.600 | So essentially, large language models don't speak robot language.
00:21:20.700 | Can we actually find the right robot language?
00:21:24.120 | Can we find the interface between large language models and robot control?
00:21:28.240 | Or can we just treat robot action as another language?
00:21:31.960 | So that's what we want to find out.
00:21:36.200 | In today's agenda, I will be talking about low-level embodied intelligence with foundation
00:21:40.280 | models.
00:21:41.280 | It's separated into two parts, and it's addressing the two challenges that I've just mentioned.
00:21:47.480 | Part one is about model consolidation, joint scaling, and positive transfer.
00:21:51.760 | So I have to put them in one part because they are somewhat related.
00:21:56.400 | And part two is developing new interface of large language models.
00:22:00.960 | So what do I mean by model consolidation?
00:22:03.840 | Yes, question.
00:22:04.840 | Yeah, I was going to ask, why couldn't you just fine-tune an LLM for generating low-level
00:22:12.480 | code?
00:22:13.480 | [INAUDIBLE]
00:22:14.480 | Yeah.
00:22:15.480 | Yeah.
00:22:16.480 | Yeah, that's a great question.
00:22:19.880 | So the question is, why can't we fine-tune a language model to directly output low-level
00:22:25.600 | code or robot actions?
00:22:29.880 | So I will be talking about RT2, which does something somewhat similar to that.
00:22:33.800 | It fine-tunes a language model to output actions as a language, to output our action representation.
00:22:41.160 | There are certain downsides to that.
00:22:42.600 | Like, for example, you would need to collect additional data to fine-tune a language model.
00:22:48.720 | So either we can fine-tune that, or we can use the language model zero-shot if you find
00:22:53.160 | the right interface, which I will talk about a little bit in the part two.
00:22:56.360 | Zero-shot and without fine-tuning?
00:22:58.560 | Without fine-tuning, yeah.
00:23:01.220 | So model consolidation means, essentially, that we can do the high-level reasoning and low-level
00:23:05.680 | control in one model.
00:23:07.240 | And joint scaling means that not only do we scale the robot data, which is expensive.
00:23:12.000 | We also scale the pre-training data.
00:23:15.880 | Or we already start from a pre-trained vision language model.
00:23:19.720 | And positive transfer means the model benefits from diverse joint training across internet-
00:23:24.560 | scale language, vision, and vision-language domains, combined with robotics.
00:23:31.720 | So this is a continuation of the axes that Ted drew in his previous talk.
00:23:40.080 | So we can see there is a trend.
00:23:42.560 | So this visualization basically highlights some of the work on our team.
00:23:47.820 | And each work, each column, is basically a robotic system that is able to do both high-level
00:23:55.160 | reasoning and low-level control.
00:23:57.260 | So previously, we needed to have separate models for each thing.
00:24:03.080 | Previously, in the initial release of SayCan, the planning is done by a large language model.
00:24:09.040 | And the affordance is done by a QT-Opt-like policy trained with sim-to-real.
00:24:20.040 | And the low-level policy is Robotic Transformer 1.
00:24:23.720 | So it's each model doing its dedicated thing.
00:24:28.160 | And we need to train each model differently, and perhaps with different type of data.
00:24:34.480 | And later, we have Q-Transformer, which is kind of an offline RL method that
00:24:42.360 | leverages the transformer architecture.
00:24:44.560 | So it's a high-capacity architecture.
00:24:46.820 | It can train on both positive data and negative data.
00:24:49.520 | And with that, we are able to get a policy that also understands affordances.
00:24:56.180 | So we can unify the low-level policy and affordances.
00:24:58.880 | But the planning is still a large language model.
00:25:01.280 | And then we have PaLM-E, which is a vision language model, which is a large language
00:25:06.580 | model also trained on vision-language data.
00:25:09.980 | So PaLM-E can do planning and affordance in just one model.
00:25:13.880 | But the low-level is still using RT1.
00:25:16.280 | And finally, we unify everything together.
00:25:18.620 | Like there is RT2, which I'm going to talk about today, that can do both high-level planning
00:25:23.760 | to some extent, generating affordance, and do low-level policies.
00:25:28.320 | So behind the model consolidation is the consolidation of tasks.
00:25:33.640 | We can represent every task as a vision plus text to text task.
00:25:39.160 | So it's a really universal representation of the task.
00:25:42.840 | And then with that, you can really train it using a lot of data.
00:25:48.040 | And you can see positive transfer.
00:25:49.960 | Basically, learning affordance can also tell you how to achieve a task.
00:25:56.800 | There is transfer between tasks when you pool all the tasks together.
00:26:03.800 | So to understand this joint scaling and to understand the model consolidation, we need
00:26:09.200 | to understand PaLM-E a little bit.
00:26:12.280 | So PaLM-E is an embodied multimodal language model.
00:26:15.400 | It's based on the PaLM architecture.
00:26:17.520 | So PaLM is a large language model.
00:26:19.440 | We made some adaptations to the architecture so it can understand multimodal input.
00:26:25.900 | So it is basically one model that is able to take in multimodal input.
00:26:34.760 | So in large language models, each word is tokenized and mapped to an embedding
00:26:43.280 | of that word.
00:26:45.800 | And then that is fed into a large language model.
00:26:49.240 | So in PaLM-E, what we do is, instead of using words, we can use multimodal tokens.
00:26:56.120 | So the multimodal tokens can come from a vision transformer, a ViT, or they can come from robot
00:27:04.560 | sensory data.
00:27:06.100 | So for every multimodal token, we map it to the text embedding space.
00:27:14.400 | We basically train a linear affine transform between the multimodal token and the text
00:27:23.120 | embedding space.
00:27:24.380 | And then we can treat the multimodal token as words as well.
00:27:30.200 | So essentially, we have a language model as a solid base, and then we start to adapt it
00:27:37.600 | to understand multimodal tokens.
00:27:39.760 | So this is quite interesting because it doesn't require a ton of adaptation or fine tuning
00:27:46.480 | for it to understand multimodal input.
00:27:50.000 | It just aligns naturally to the multimodal input, such as images.
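A minimal sketch of the idea just described: non-text tokens (for example, ViT patch features) are mapped with a learned affine projection into the language model's word-embedding space and placed alongside ordinary word embeddings. The dimensions, module names, and the simple "prepend the image tokens" layout are illustrative assumptions, not the actual PaLM-E code:

```python
import torch
import torch.nn as nn

class MultimodalPrefix(nn.Module):
    def __init__(self, vit_dim=768, lm_dim=1024, vocab_size=1000):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, lm_dim)  # LM word embeddings
        self.project = nn.Linear(vit_dim, lm_dim)           # learned affine map

    def forward(self, text_token_ids, image_tokens):
        # image_tokens: (batch, num_patches, vit_dim) from a vision transformer
        text_embeds = self.word_embed(text_token_ids)        # (B, T, lm_dim)
        image_embeds = self.project(image_tokens)            # (B, P, lm_dim)
        # Place the projected image tokens before the text tokens; the language
        # model then treats them like ordinary "words" in its input sequence.
        return torch.cat([image_embeds, text_embeds], dim=1)

prefix = MultimodalPrefix()
sequence = prefix(torch.randint(0, 1000, (1, 12)), torch.randn(1, 64, 768))
print(sequence.shape)  # torch.Size([1, 76, 1024])
```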
00:27:54.840 | I will show a couple of examples of what it can do.
00:27:58.400 | And we can train in the same way as training large language models.
00:28:01.600 | So essentially, we can reuse the same infrastructure and training algorithm and everything to train
00:28:07.880 | this PAL-ME.
00:28:10.320 | A couple of other things we find along the way is positive transfer, which I will share
00:28:15.600 | in a little bit.
00:28:17.400 | So I guess here, I also want to mention PaLM-E is one of the largest models we have explored
00:28:24.400 | so far.
00:28:25.400 | It has 562 billion parameters, obtained by combining the 540-billion-parameter PaLM
00:28:32.360 | and the 22-billion-parameter ViT.
00:28:34.400 | And we find a lot of emergent capabilities of these models.
00:28:39.040 | That is, we haven't expected during training time, but really, we can prompt these models
00:28:46.080 | and ask it to do interesting things.
00:28:48.920 | We have also explored using neural scene representation, basically an object-centric representation
00:28:57.800 | and fed into PAL-ME.
00:28:59.440 | So object-centric representation assigns one token to each object.
00:29:07.160 | And we find that this representation is super helpful for robot planning tasks, because
00:29:13.040 | the traditional ViT representation is grid-based, and it doesn't have a full understanding
00:29:17.760 | of objects and their relationships.
00:29:20.640 | We have done an extensive study on the scaling performance and the catastrophic forgetting
00:29:27.720 | performance and all other interesting experiments in the paper.
00:29:32.640 | So please refer to the paper for more.
00:29:34.640 | So here, I'm just showing some interesting qualitative examples or some emergent capability
00:29:41.360 | of PAL-ME that we found out.
00:29:44.400 | So first, we found this model has some reasoning capability.
00:29:48.000 | You can give it an image and ask it questions that require a little bit of reasoning.
00:29:52.960 | And you can prompt this with, let's think step-by-step, which is a technique used to
00:29:58.600 | elicit reasoning in large language models.
00:30:01.400 | But here, in multi-modal language models, you can do the same.
00:30:04.760 | I guess people are also experimenting these days with GPT-4V.
00:30:09.440 | You can also prompt it to think step-by-step or count row-by-row.
00:30:13.480 | But here, this is before GPT-4V, and we were able to elicit reasoning using some of the
00:30:19.760 | interesting prompts, such as we can ask it, in this photo, are there more cats or more
00:30:26.040 | dogs?
00:30:27.040 | Let's think step-by-step.
00:30:28.040 | And PaLM-E found out there is an equal number of dogs and cats.
00:30:32.200 | And on the right, give an image, can I go down the street on a bicycle, yes or no?
00:30:37.880 | Let's think step-by-step.
00:30:39.240 | And the reply is: first, do not enter; second, except bicycles.
00:30:42.680 | So, do not enter, except bicycles: yes.
00:30:45.200 | So it's doing this modest reasoning, and it's mixing this understanding of symbols and also
00:30:52.800 | mixing the understanding of text.
00:30:55.240 | So this is quite amazing to me, to be honest, when I first saw this.
00:31:00.440 | I didn't expect a multi-modal language model would be able to do that.
00:31:04.880 | And we also tried one thing, which is traditionally very difficult to language models, which is
00:31:10.920 | to tell a joke.
00:31:11.920 | Language models can understand joke, but sometimes it just doesn't-- it's not able
00:31:17.400 | to tell you a joke when it comes to the punchline.
00:31:21.600 | Because it's just trying to make something that is plausible and sounds like a joke.
00:31:27.040 | And when it comes to the punchline, it doesn't really know what to say.
00:31:30.760 | So here, I give it an image, and I ask it to come up with a description, and then comes
00:31:36.560 | up with a joke.
00:31:37.780 | So this guides the language model to think step-by-step.
00:31:40.920 | And the description is a donkey is carrying a dog, cat, and rooster.
00:31:45.160 | And the joke is, what do you call a donkey with a rooster on his back?
00:31:47.760 | A rooster booster.
00:31:48.760 | It's so creative.
00:31:50.200 | Like when I saw this, I'm pleasantly surprised.
00:31:53.240 | And I searched online.
00:31:54.240 | I couldn't find another joke like that.
00:31:56.360 | So it's actually an original joke by PaLM-E.
00:31:58.840 | And finally, we see some math reasoning with this model.
00:32:03.280 | Basically, I give it a messy menu from a pizza store, and I ask it, I'm just buying
00:32:12.440 | a pizza for me and my friend.
00:32:13.840 | How much should I pay?
00:32:14.840 | Let's think step-by-step.
00:32:16.040 | And it's figuring out there is a pizza, and there is $9.99, and it tells you the price.
00:32:23.520 | In some of the answers, it even calculates tax, but the tax is hallucinated.
00:32:28.060 | So that doesn't work.
00:32:29.520 | All right, let's talk about positive transfer.
00:32:32.520 | So apart from the amazing things that PaLM-E can do, it also has interesting positive transfer
00:32:41.080 | behavior.
00:32:43.100 | So when we train PaLM-E on a single domain, when we train it on just a single robotics
00:32:49.840 | task, the performance is not super great.
00:32:52.480 | But when we pool all the data together, and we also include internet-scale visual language
00:32:59.280 | tasks, such as captioning or visual question answering, it is able to do much better.
00:33:05.400 | So this shows that it's important to mix all the data together and train it jointly.
00:33:12.520 | The internet-scale data can act as a regularizer for you to not forget the representations.
00:33:20.960 | And those representations are, in turn, very useful for robotics.
00:33:26.400 | So that's a positive transfer result.
00:33:28.300 | And we start to see more and more positive transfer in other of our studies.
00:33:32.660 | So how much data did you have to collect, like in simulation or in the real world?
00:33:37.480 | I think the playing with sorting stuff on the table is very impressive.
00:33:44.520 | Right.
00:33:45.520 | Yeah, that's a very good point.
00:33:50.640 | So these are all planning data, like high-level planning.
00:33:57.040 | So maybe let's just talk about two things.
00:34:00.000 | So first of all, the sorting results, the low-level policy is still using a traditional
00:34:07.340 | controller.
00:34:08.600 | So it's using a policy called LAVA.
00:34:10.680 | And that policy is trained on 68,000 episodes.
00:34:16.080 | The high-level planning is probably easier than you think, because it's giving command
00:34:24.600 | to the low-level policy.
00:34:25.800 | So it's basically only need to say, put the red block into top-left corner, put another
00:34:31.480 | red block into top-left corner.
00:34:32.960 | So it's a rather standard autoregressive language modeling task.
00:34:39.840 | The only thing it needs to do is to determine what task is not finished yet.
00:34:45.260 | So for example, if the block is already in the corner, it shouldn't call low-level policy
00:34:49.120 | to move it to the corner again.
00:34:50.780 | So it's rather like parsing the states and understanding the states.
00:34:55.720 | So this high-level policy only requires about 50 to 100 demonstrations to learn.
00:35:00.520 | So it's quite data efficient.
00:35:02.800 | And in the future-- that's a very good question, actually-- in the future, a lot of these tasks
00:35:07.280 | can be taught in context.
00:35:09.240 | So maybe we just demonstrate it once to the large-language model, then it knows how to
00:35:13.880 | do that.
00:35:14.880 | [INAUDIBLE]
00:35:15.880 | Yeah, this is through human demonstration as well.
00:35:27.400 | So at the low level, a human can demonstrate the low-level policy by teleoperating a robot
00:35:32.640 | to do a certain task.
00:35:34.200 | But at the high level, a human could also just command the low-level policy-- imagine your control
00:35:42.440 | interface is through text.
00:35:44.460 | And then as a human, you can also guide a low-level policy to accomplish a task.
00:35:49.600 | And then that thing can then be used to train a large-language model.
00:35:54.840 | So that's for the sorting block.
00:35:57.280 | The SayCan data is a little bit more interesting because the planning steps are actually generated
00:36:02.740 | by PaLM.
00:36:04.460 | So we essentially distilled PaLM plus this affordance model into PaLM-E.
00:36:10.880 | So that's a little bit more interesting.
00:36:13.020 | It's like using the AI data to bootstrap itself.
00:36:16.840 | That one has about 3,000 episodes, also not quite a lot.
00:36:22.080 | But it's able to learn complex planning behavior, replanning behavior, error recovery, which
00:36:28.680 | I will show in a slide.
00:36:30.000 | So with PaLM-E as a high-level planner, we are able to take the rice chips out of
00:36:38.960 | the drawer, and there is a twist, which is I will be messing with the robot.
00:36:47.440 | So as it puts it onto the counter, I put it back into the drawer.
00:36:52.040 | And as it picks it up again, I put it back again.
00:36:56.880 | So it's able to understand the state.
00:36:58.400 | It's able to understand my task is not finished.
00:37:01.120 | I cannot proceed with the next task.
00:37:03.240 | Now, after I stop messing with it, it's able to close the drawer and pick up
00:37:08.760 | the bag of chips.
00:37:11.560 | So PaLM-E is able to combine affordance and planning in one model and do complex reasoning
00:37:19.480 | of a scene and environment.
00:37:22.760 | And interestingly, we can use the exact same model checkpoint to do block sorting as well.
00:37:28.960 | So this is the same model checkpoint.
00:37:30.840 | It can not only reason about how to bring a bag of chips to a user, it can also sort
00:37:37.000 | blocks.
00:37:38.000 | And it's also responding to adversarial perturbations: if the user puts
00:37:46.060 | the block in the middle again, it's able to recover from that.
00:37:50.120 | So these are all coming from the same model.
00:37:53.040 | And it can also tell a joke.
00:37:57.360 | So yeah, this is the power of vision language models.
00:38:03.440 | Now we want to go a level deeper.
00:38:06.160 | These are all vision language models that are used for planning or high-level reasoning.
00:38:10.520 | Can we use them for low-level control?
00:38:12.800 | It turns out we can.
00:38:15.280 | And that's the RT2 work, which is a vision-language-action model that transfers web knowledge to
00:38:20.120 | robotic control.
00:38:21.120 | What can it do?
00:38:23.200 | When asked, pick up the extinct animal.
00:38:28.480 | And it has a whole range of objects on the table.
00:38:31.480 | It will pick up the dinosaur.
00:38:32.860 | So it can link the extinct animal to dinosaur and to the action that pick the dinosaur up.
00:38:40.960 | So it's really doing this emergent reasoning and also the manipulation in just the one
00:38:46.760 | model.
00:38:47.760 | And by the way, this robot hasn't seen any of these before, at least in the robot training
00:38:54.440 | data.
00:38:55.440 | It might have seen this in their internet catalog, but it has never seen it in the robotics
00:39:01.600 | training data.
00:39:03.080 | So it's quite interesting how we need to evaluate these robots nowadays.
00:39:10.680 | So when we evaluate language models to prevent data contamination, every time you need to
00:39:16.480 | give it new questions because otherwise it might already memorize it in its training.
00:39:22.000 | When we evaluate these robots, we actually go to dollar store to buy all these toys to
00:39:27.200 | make sure it hasn't seen that before.
00:39:29.520 | And as we run more evaluation, maybe there will be some replication as well.
00:39:33.840 | But as you can see, it is able to understand to pick up this dinosaur toy.
00:39:40.880 | How did we do that?
00:39:42.960 | So we start from a visual language model that is trained on internet-scale data.
00:39:49.000 | And then we also combine it with robotics action data, which is the RT1 data and we
00:39:53.920 | get RT2.
00:39:55.600 | And we can dive deeper, a little bit deeper into RT2.
00:40:00.280 | So first of all, what is a visual language model?
00:40:02.520 | A visual language model is a transformer that takes in image and text and output text.
00:40:11.380 | So within Google, there is a visual language model called PaLI, which is an encoder-decoder
00:40:21.040 | type of architecture.
00:40:22.440 | It basically has a ViT to understand images, and then a transformer encoder and
00:40:27.840 | a transformer decoder.
00:40:30.960 | It encompasses both visual and semantic understanding of the world.
00:40:35.800 | And in robotics, we have to deal with a lot of both of these.
00:40:40.500 | And the question is, can we leverage the knowledge in the visual language models and apply them
00:40:46.460 | to robotics?
00:40:48.720 | On the other hand, we have the RT1.
00:40:51.320 | If you want to learn more about RT1, you can listen to the previous episode of this CS25
00:40:57.840 | by Ted.
00:40:59.320 | So he gave a detailed introduction on the RT1.
00:41:02.080 | But RT1 is, if you stand far enough back, also a vision-and-language-to-action
00:41:11.840 | model.
00:41:12.840 | It takes in human instruction.
00:41:14.400 | It takes in the current camera image.
00:41:16.220 | The camera image is passed through a FiLM EfficientNet, which tokenizes it into 81 tokens, and
00:41:21.580 | then goes into a TokenLearner, which compresses everything into eight tokens.
00:41:26.480 | And then there is a transformer block, leveraging a lot of self-attention layers, which then generates
00:41:31.680 | actions.
00:41:32.680 | The action is also tokenized.
00:41:34.880 | The robot has seven degrees of freedom.
00:41:41.040 | The end-effector has six degrees of freedom, its position and its rotation, and the gripper
00:41:47.660 | can open and close.
00:41:49.140 | And there is another dimension representing terminate the episode or not.
00:41:54.580 | Terminating means my task is already done.
00:41:57.500 | And we discretize every dimension into 256 bins.
00:42:03.020 | And then we do cross-entropy loss on those bins.
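A minimal sketch of the action discretization just described: each of the eight action dimensions (end-effector position and rotation deltas, gripper, terminate) is binned into 256 values, giving one integer token per dimension, and training uses cross-entropy over the bins. The normalized action range below is an assumed placeholder, not the real robot limits:

```python
import numpy as np

NUM_BINS = 256
LOW, HIGH = -1.0, 1.0   # assumed normalized range for each action dimension

def discretize(action):
    """Map a continuous 8-dim action to 8 integer tokens in [0, 255]."""
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1))
    return bins.astype(int)

def undiscretize(tokens):
    """Map 8 integer tokens back to a continuous action."""
    return LOW + (np.asarray(tokens, dtype=float) / (NUM_BINS - 1)) * (HIGH - LOW)

action = np.array([0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0, 0.0])
tokens = discretize(action)
print(tokens, undiscretize(tokens))
```

At training time the model's per-dimension logits are compared against these integer bin indices with a cross-entropy loss, exactly like predicting words.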
00:42:05.780 | So that's the RT1 architecture in a nutshell.
00:42:10.020 | It's quite similar to a vision language model with different output tokens.
00:42:13.820 | So it's rather natural that we just use a large pre-trained vision language model directly
00:42:19.180 | as policy.
00:42:20.180 | We can use PaLI or PaLM-E as the policy.
00:42:24.440 | And one question is, how do we deal with actions when using pre-trained vision language models?
00:42:30.120 | And here is action representation that we use.
00:42:33.460 | The robot actions here are the eight dimensions.
00:42:38.100 | And as I mentioned, there is termination, position change, and rotation change.
00:42:42.560 | And we discretize everything into 256 bins.
00:42:47.120 | We also have tried other alternative representations, but they are not as good as just this naive
00:42:52.660 | representation.
00:42:54.660 | [INAUDIBLE]
00:42:55.660 | Yeah.
00:42:56.660 | Yeah.
00:42:57.660 | [INAUDIBLE]
00:42:58.660 | Oh, the FiLM EfficientNet is a pre-trained convolutional neural network.
00:43:04.620 | It's used to tokenize the images.
00:43:07.180 | So the reason that we do this is, through some ablation study, we can tokenize the image
00:43:11.740 | in different ways.
00:43:12.740 | We can tokenize with a ResNet.
00:43:14.740 | We can tokenize everything with a ResNet,
00:43:17.100 | or we can tokenize using a FiLM EfficientNet.
00:43:19.940 | FiLM means it also takes in the language embedding and appends it to the intermediate
00:43:26.300 | layers of the ResNet.
00:43:28.300 | So we basically have some combination of features, and it's encoded in the image representation.
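A minimal sketch of FiLM-style conditioning as described here: the language embedding produces per-channel scale and shift parameters that modulate an intermediate image feature map. The shapes and layer sizes are illustrative, not the actual RT1 network:

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    def __init__(self, channels=64, text_dim=512):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, channels)  # per-channel scale
        self.to_beta = nn.Linear(text_dim, channels)   # per-channel shift

    def forward(self, feature_map, text_embedding):
        # feature_map: (B, C, H, W); text_embedding: (B, text_dim)
        gamma = self.to_gamma(text_embedding)[:, :, None, None]
        beta = self.to_beta(text_embedding)[:, :, None, None]
        return gamma * feature_map + beta  # feature-wise linear modulation

block = FiLMBlock()
out = block(torch.randn(2, 64, 16, 16), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 64, 16, 16])
```

This is one way of fusing language into the image encoder early; as mentioned below, cross-attention or late fusion are alternative designs.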
00:43:33.820 | Yeah.
00:43:34.820 | [INAUDIBLE]
00:43:35.820 | That's right.
00:43:36.820 | That's right.
00:43:37.820 | [INAUDIBLE]
00:43:38.820 | That's right.
00:43:39.820 | [INAUDIBLE]
00:43:40.820 | That's right.
00:43:41.820 | [INAUDIBLE]
00:43:42.820 | The action is not encoded.
00:43:43.820 | The action is in text.
00:43:44.820 | It's basically what is shown here.
00:43:45.820 | This is the action.
00:43:46.820 | It's eight numbers.
00:43:47.820 | Each number range from 0 to 255.
00:43:48.820 | Yeah.
00:43:49.820 | And maybe another note.
00:43:50.820 | On the film ResNet, it's about how we tokenize the images and how we combine vision information
00:44:13.020 | and language information.
00:44:14.860 | There are many ways to do that.
00:44:16.340 | This is not the only way.
00:44:17.580 | There is early fusion and late fusion.
00:44:20.260 | And there is also cross-attention.
00:44:22.100 | You can basically tokenize your image just by itself.
00:44:25.220 | And then you can have language and use cross-attention to combine the image and text representation.
00:44:31.140 | So here, we are using this model.
00:44:33.540 | This is RT1 for robotics.
00:44:35.540 | So we do have a lot of considerations, such as latency.
00:44:38.660 | That's why we use this film ResNet, because it's super fast.
00:44:42.020 | And it can output a limited amount of tokens, which we can further compress with Token Learner.
00:44:47.620 | Yeah.
00:44:48.620 | Yeah.
00:44:49.620 | So is this autoregressive?
00:44:50.620 | Like, every single image it sees, it then reacts with each other?
00:44:54.740 | Right.
00:44:55.740 | So it is autoregressive.
00:44:56.740 | Yeah.
00:44:57.740 | And every time, we use a history of up to six steps.
00:45:02.100 | So every time, you see this image right now.
00:45:04.540 | And you see about two seconds of history before it.
00:45:09.180 | And this will be your input.
00:45:11.940 | Yeah.
00:45:12.940 | Again, if you have more questions about RT1, I recommend watching the previous episode.
00:45:18.500 | And here, it's all about RT2.
00:45:22.740 | So we can convert the action to a string of numbers.
00:45:26.660 | This will be the output of our transformer, which is a visual language model.
00:45:31.420 | We tried other alternatives, such as floating numbers.
00:45:35.300 | Floating-point numbers are not super friendly to the language model tokenizer, because they have these
00:45:41.340 | decimal points.
00:45:42.340 | We also tried the human language, such as left or right.
00:45:45.100 | It's more a semantic representation.
00:45:46.820 | But they cannot be directly executed on a robot, which is a limitation of this method.
00:45:53.440 | So if we commit to this action representation, which is just a string of numbers, we essentially
00:45:59.180 | get a visual language action model.
00:46:01.420 | We tried different variants, including PaLI-X.
00:46:05.380 | This is the Pathways Language and Image model.
00:46:11.700 | There is a 5-billion-parameter variant and a 55-billion-parameter variant.
00:46:16.020 | And we also tried PaLM-E, which is 12 billion parameters.
00:46:20.300 | The procedure that we did to train this RT2 is via co-fine tuning.
00:46:26.700 | Co-fine-tuning means putting the internet-scale data and the robotic data together.
00:46:32.880 | And then we fine-tune on this mixture of data so that it retains the internet-
00:46:39.100 | scale knowledge.
00:46:42.940 | Maybe that's also an artifact of our data being too small and not diverse enough.
00:46:46.620 | So if you just fine-tune on robotics data, it will quickly overfit and forget about
00:46:52.180 | all this pre-training knowledge.
00:46:54.480 | Maybe it's a dynamic of scale.
00:46:56.660 | So we'll see.
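A minimal sketch of co-fine-tuning as just described: each training batch is drawn from a weighted mixture of web-scale vision-language examples and robot trajectories, so the model keeps its internet-scale knowledge while learning actions. The sampling weight and dataset objects are assumptions for illustration:

```python
import random

def mixture_batches(web_vqa_data, robot_data, web_weight=0.8, batch_size=32):
    """Yield batches that mix the two data sources at a fixed ratio."""
    while True:
        batch = []
        for _ in range(batch_size):
            source = web_vqa_data if random.random() < web_weight else robot_data
            batch.append(random.choice(source))
        yield batch

# Usage sketch: next(mixture_batches(web_examples, robot_episodes)) gives one mixed batch.
```

Keeping some fraction of every batch on web data is what acts as the regularizer against forgetting the pre-trained representations.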
00:46:59.580 | At inference time, how do we do this?
00:47:02.260 | We basically-- again, we do this autoregressively.
00:47:05.860 | We have an instruction of a task.
00:47:10.180 | And we format this as a question and answering task.
00:47:13.260 | What should the robot do to achieve a certain task?
00:47:15.860 | And the task is a string that the human gives the robot to achieve.
00:47:20.620 | And it also has the current observation, which is the robot's camera
00:47:29.600 | image, an RGB image.
00:47:30.600 | It passes through a ViT, and then it passes through the large language model, and then outputs
00:47:36.560 | a list of tokens.
00:47:38.380 | So we leverage constrained decoding to make sure it always outputs eight numbers,
00:47:45.680 | because otherwise, we cannot de-tokenize it.
00:47:49.640 | It's very easy for language model to just miss one number.
00:47:52.860 | So we do have some mechanism, such as constraint decoding and beam search, to make sure the
00:47:58.320 | format is correct.
00:47:59.320 | After we get the string of eight numbers, we de-tokenize it to a delta T and delta R,
00:48:04.920 | which is the end-effector delta pose.
00:48:07.280 | And we can just directly run this on the robot.
00:48:10.120 | After it runs on the robot, we repeat this process.
00:48:13.280 | We get another new image, run through this process, and get a new action.
00:48:17.160 | And we repeat this process until a termination is decoded.
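A minimal sketch of the inference loop just described: format the task as a question, decode eight numbers from the vision-language model under a format constraint, de-tokenize them into an end-effector delta pose plus gripper and termination, execute, and repeat. The model and robot interfaces, the token ordering, and the termination threshold are all placeholders, not the real RT2 serving stack:

```python
def detokenize(token, low=-1.0, high=1.0, num_bins=256):
    """Map one integer token in [0, 255] back to a continuous value (assumed range)."""
    return low + (token / (num_bins - 1)) * (high - low)

def control_loop(vla_model, robot, task, max_steps=200):
    prompt = f"What should the robot do to {task}?"
    for _ in range(max_steps):
        image = robot.get_camera_image()
        # vla_model.generate is assumed to return exactly 8 integers in [0, 255],
        # enforced by constrained decoding; the ordering below is illustrative.
        terminate, dx, dy, dz, droll, dpitch, dyaw, grip = vla_model.generate(image, prompt)
        if terminate > 127:          # assumed threshold meaning "episode done"
            break
        robot.apply_delta_pose(
            translation=[detokenize(t) for t in (dx, dy, dz)],
            rotation=[detokenize(t) for t in (droll, dpitch, dyaw)],
            gripper=detokenize(grip, low=0.0, high=1.0),
        )
```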
00:48:21.240 | So some people might be concerned that this is rather slow.
00:48:27.040 | It's in fact quite slow, because it's 12 billion parameters, or 5 billion parameters.
00:48:33.160 | We cannot run on a robot.
00:48:34.920 | So we run it on a TPU cluster, and the robot is querying the TPU cluster to get the numbers
00:48:40.240 | and apply them on the robot.
00:48:42.980 | So for the 12 billion parameters, we can actually run at 10 hertz.
00:48:48.080 | So it's quite fast.
00:48:49.280 | For all those models, we can run at least three hertz.
00:48:52.060 | So that is sufficient for controlling a robot.
00:48:57.960 | And we see a lot of emergent skills that are not in the training set.
00:49:04.880 | Essentially, as I just mentioned, we are probing what this RT2 can do.
00:49:09.280 | We actually don't know.
00:49:10.280 | So we are trying to figure out what RT2 can do.
00:49:12.440 | So we test it with a lot of new tasks, such as put a strawberry into the correct bowl,
00:49:18.720 | or move a banana to Germany, just to test its understanding of symbols or flags.
00:49:26.360 | Pick a land animal.
00:49:27.360 | There's a horse.
00:49:28.360 | There's an octopus.
00:49:29.360 | So basically, test its semantic reasoning and also low-level manipulation skills.
00:49:36.240 | And we divide the tasks into symbol understanding, reasoning, and human recognition, and
00:49:44.280 | report the average.
00:49:45.280 | And we found that RT1, which is not trained on internet-scale data, does quite poorly
00:49:53.080 | in these emergent evaluation tasks.
00:49:56.200 | And the RT2 variants, which are co-fine-tuned on the internet data and our robotics data,
00:50:06.280 | do much better in these tasks.
00:50:08.280 | And there is also an effect of scale.
00:50:11.040 | So the RT2 with the 55-billion-parameter model is performing better than the 12-billion-parameter one, although
00:50:17.920 | they perform quite similarly for in-domain tasks.
00:50:20.840 | But the generalization is kind of interesting.
00:50:23.200 | It seems with larger scale, you can generalize better.
00:50:27.920 | And here are some videos of the robot achieving these tasks, like moving the banana to a number,
00:50:35.840 | put the strawberry into the correct bowl, move a Rubik's cube to the water bottle--
00:50:41.880 | but I'm speaking Chinese-- moving the banana to a German flag.
00:50:46.840 | So it's able to do all of these very interesting tasks.
00:50:51.960 | In terms of the quantitative evaluations, we also found that the RT2 policy is quite
00:50:57.880 | robust to unseen objects, unseen backgrounds, and unseen environments.
00:51:03.960 | And here is another evidence of positive transfer.
00:51:07.040 | So co-fine-tuning with VQA data outperforms fine-tuning on robotics data only.
00:51:12.520 | And if you're trained on robot data from scratch, it barely works.
00:51:16.680 | It almost doesn't work, because it overfits to robot data.
00:51:19.720 | And our robot data is just too small.
00:51:21.880 | So we do need to do co-fine-tuning, or at least fine-tuning, so it retains its internet
00:51:29.320 | scale knowledge.
00:51:30.660 | This is also a recipe for how people would develop a domain-specific vision language
00:51:35.960 | model.
00:51:36.960 | So you start from a very general vision language model, and you fine-tune on your domain.
00:51:41.240 | Or you can co-fine-tune with your specific domain data.
00:51:45.540 | This is likely a problem that each vertical of artificial intelligence will encounter someday.
00:51:54.520 | We can also test on other platforms.
00:51:56.800 | Like, this shows some cross-embodiment results: RT2 with PaLI-3B outperforms previous models in
00:52:02.120 | terms of moving blocks around a 2D environment.
00:52:09.240 | And in large-language models, we have this chain-of-thought reasoning, which is a method
00:52:14.640 | to elicit reasoning in large-language models.
00:52:18.120 | You can either do zero-shot chain-of-thought reasoning by saying, let's think step by step,
00:52:21.400 | or you can give it examples of reasoning.
00:52:23.800 | It's basically decoding more things and then come to the conclusion.
00:52:27.840 | We can use a similar procedure for the RT2 as well.
00:52:32.120 | So in RT2 PaLM-E, instead of directly decoding the actions, we can actually decode a plan
00:52:38.060 | and then append it with actions.
00:52:40.280 | So this gives the language model an opportunity to understand a question or parse a question
00:52:44.940 | differently.
00:52:45.940 | It also gives us the opportunity to reason about things a little bit.
00:52:50.200 | For example, if you say, "Bring me a drink," and it will say, "Pick up 7-up can," because
00:52:55.200 | there's a 7-up can on the table.
00:52:57.720 | So we synthesized a couple hundred such examples using a large-language model just by augmenting
00:53:02.960 | the instruction and then fine-tuned the RT2 just for a couple hundred steps.
00:53:07.080 | So it's between full fine-tuning and in-context learning, and it is able to do some reasoning.
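A minimal sketch of the "plan then act" format just described: the model is fine-tuned on a few hundred synthesized examples to emit a short natural-language plan before the eight action tokens. The field names and prompt layout are illustrative, not the dataset's real schema:

```python
def make_cot_example(instruction, plan, action_tokens):
    """Build one chain-of-thought training example: plan first, then action tokens."""
    prompt = f"Instruction: {instruction}\nPlan:"
    target = f" {plan}\nAction: {' '.join(str(t) for t in action_tokens)}"
    return {"prompt": prompt, "target": target}

example = make_cot_example(
    instruction="Bring me a drink",
    plan="pick up the 7up can",
    action_tokens=[0, 132, 114, 128, 25, 156, 119, 255],  # hypothetical tokens
)
print(example["prompt"] + example["target"])
```

Decoding the plan text before the action tokens is what gives the model a chance to reason about the instruction before committing to a motion.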
00:53:13.480 | And some of the interesting reasoning tasks include, "I need to hammer a nail.
00:53:17.560 | Which object from the scene might be useful?"
00:53:19.440 | And in the scene, there is a headphone, there is a rock, and there is a sticky note.
00:53:24.600 | And the robot will say, "Rocks," and then generate actions to pick up the rock.
00:53:28.460 | So it's interesting that it's able to do this sort of reasoning with RT2.
00:53:39.880 | And here is a demonstration of some of the chain-of-thought reasoning with RT2 PaLM-E.
00:53:39.880 | And the task is, "Pick up the thing that is different from all other objects."
00:53:49.120 | And it picks up the chocolate, because it is a snack and the other things are drinks.
00:53:53.920 | And I can also speak in a different language, and the plan will be to translate it into
00:53:53.920 | a language that it's familiar with, which is English, and then do the task.
00:54:01.280 | There are also potential failure cases of the chain-of-thought reasoning.
00:54:04.700 | So here I say, "Move the green objects together."
00:54:06.720 | And as you can see, the robot oscillates between the two green objects, because there are
00:54:10.720 | two possible plans.
00:54:11.720 | It could move the can to the bag of chips, or it could move the bag of chips to the can.
00:54:16.820 | It oscillates between the two plans until one action brings it closer to one of the objects, and then it
00:54:22.220 | commits to that plan rather than the other.
00:54:26.460 | It's not always guaranteed to work, but it's quite interesting.
00:54:29.920 | And it's also interesting that, again, we are testing the manipulation policy the way
00:54:34.460 | we test the intelligence of humans or animals or kids, because the policies are getting more and
00:54:40.980 | more advanced.
00:54:41.980 | As a summary, we have a vision-language-action model that is able to improve
00:54:49.560 | generalization.
00:54:50.560 | It can do new tasks and operate on new objects.
00:54:53.520 | It can also do chain-of-thought reasoning. And by improving the underlying model, such as
00:54:59.440 | the vision language model itself, by scaling it up and training it with larger or
00:55:06.320 | higher-quality internet-scale data, we can achieve better
00:55:11.040 | robot control, which is quite amazing, because the robotics field has traditionally developed
00:55:16.280 | quite slowly, bounded by hardware, bounded by a lot of different things, bounded by operations.
00:55:21.120 | But now it seems we can piggyback on the development of the foundation model field, and whatever
00:55:27.920 | they do will trickle down to our field as well.
00:55:30.560 | And future work will be to increase motion diversity, extend the chain-of-thought
00:55:35.040 | reasoning capability, and much more.
00:55:40.040 | And so there is another example of positive transfer, which you might have seen recently.
00:55:46.520 | So far, I've been talking about scaling differently.
00:55:49.520 | I've been talking about not scaling robotics data, but scaling other data instead.
00:55:54.120 | That's because robotics data is so hard to collect, and the purpose is not to avoid collecting
00:55:59.160 | robotics data.
00:56:00.160 | It's to develop a recipe that lets you do more with limited robotics data.
00:56:05.560 | However, there's also an effort from our team and the entire robotics field to scale up
00:56:12.640 | the robot data collection, which is called Open X-Embodiment.
00:56:16.840 | And the model trained on it is called RT-X, Robotics Transformer X.
00:56:20.440 | It's basically 22 types of embodiments, 527 skills, and 60 datasets pooled all together.
00:56:28.520 | So this will be the ultimate dataset we can use to study positive transfer and to study
00:56:33.880 | this joint scaling.
00:56:37.280 | And there is already evidence of positive transfer.
00:56:42.080 | So we pooled all the data together from all these labs and found a common action representation
00:56:50.020 | that we can use to train a robotics transformer.
00:56:53.080 | And we have already found that this jointly trained model can outperform the task-specific models
00:57:00.200 | developed in each of the labs.
00:57:02.560 | So there are benefits to pooling all the data together.
00:57:05.840 | So scaling robot data is also quite important.
00:57:12.560 | So the summary for this part is that we are having a model consolidation.
00:57:16.640 | We can now do the high-level reasoning and low-level control in one model.
00:57:21.240 | And the low-level control part is what excites me, because it's so far away from the traditional
00:57:26.720 | language model domain, it's so different, and it shows signs of life that a lot more can trickle
00:57:33.640 | down than we used to think was possible.
00:57:37.880 | And we can scale the pre-training of vision language models as well as scaling robotics
00:57:41.720 | data.
00:57:42.720 | And we observe more and more positive transfer, with the model benefiting from diverse joint training
00:57:47.560 | across internet-scale language, vision, and vision-language domains.
00:57:52.200 | All right, so I noticed that we are close to running out of time, so I will just very
00:58:00.120 | quickly go through the second part, which I think is also interesting: finding new
00:58:04.680 | interfaces for language models. But I will only talk about it at a very high level.
00:58:10.040 | So language models, as we can see, can directly output action tokens if we find an action representation.
00:58:15.800 | So we can treat action as yet another language to the language model.
00:58:20.240 | A language model can do translation, so it should be able to generate actions as well.
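To make the "actions as another language" idea concrete, here is a minimal sketch of discretizing continuous actions into integer bins and serializing them as a token string; the bin count, the per-dimension bounds, and the plain-integer encoding are illustrative assumptions inspired by this style of tokenization, not the exact RT2 scheme.

```python
# Minimal sketch of treating actions as "another language" by discretization.
# The 256-bin count and plain-integer token layout are assumptions for illustration.
import numpy as np

def action_to_tokens(action, low, high, num_bins=256):
    """Map each continuous action dimension (e.g. end-effector deltas, gripper)
    to an integer bin, so the action becomes a short token sequence the
    language model can read and write like text."""
    action = np.clip(np.asarray(action, dtype=float), low, high)
    bins = np.round((action - low) / (high - low) * (num_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str, low, high, num_bins=256):
    """Invert the mapping at execution time to recover a continuous action."""
    bins = np.array([int(t) for t in token_str.split()], dtype=float)
    return low + bins / (num_bins - 1) * (high - low)
```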
00:58:23.880 | But that requires fine-tuning.
00:58:25.540 | Can we do it without fine-tuning?
00:58:27.920 | Or can we generate more expressive actions that are beyond the scope of fine-tuning?
00:58:34.640 | So that is about finding the right interface.
00:58:38.040 | So previously, we have already established that a language model doesn't have an action
00:58:42.880 | interface.
00:58:43.880 | And even when it has an action interface, it's not as effective.
00:58:48.480 | So what is the best interface between language and the low-level actions?
00:58:51.840 | I would argue the best interface between language model and the low-level actions is reward
00:59:00.400 | functions.
00:59:01.400 | And reward functions are universal.
00:59:04.240 | It has been used in reinforcement learning.
00:59:06.500 | And it's also a reparameterization of actions.
00:59:11.000 | What is action?
00:59:12.000 | Let's say I want to pick up this bottle.
00:59:15.680 | And I can say, well, what is a skill?
00:59:17.720 | A skill is a mapping between my observation and my action.
00:59:21.480 | So the mapping between my observation and action can be seen as a skill.
00:59:25.360 | But a skill can have an alternative definition, which is a set of constraints and a set of
00:59:30.240 | objectives.
00:59:31.500 | So picking up the bottle means the bottle is in my right hand, and the bottle is off
00:59:37.640 | a supporting surface.
00:59:39.080 | That means picking up.
00:59:40.180 | And how I pick it up doesn't really matter.
00:59:42.960 | That is, in a broader sense, a definition of skills.
00:59:47.120 | It's more transferable between different skills.
00:59:51.600 | And the constraints and objectives can be represented as rewards.
00:59:57.560 | So we can ask the language model to generate these reward functions.
01:00:02.480 | And then there is an optimizer.
01:00:04.360 | It could be reinforcement learning, or it could be model predictive control, that optimizes
01:00:09.520 | for those rewards, and then we run it on the robot.
01:00:14.600 | So what is in the reward translator?
01:00:16.960 | Let's open a box.
01:00:19.560 | So the reward translator basically is a two-stage process.
01:00:23.360 | It's using the same language model, and it is using two different prompts.
01:00:28.160 | So the motion description basically describes the motion.
01:00:32.360 | So just now we found that the language model can output a description of how a robot dog
01:00:38.640 | should stand up, but it's not able to achieve that.
01:00:42.240 | But the motion description is still sensible.
01:00:44.400 | It still makes sense.
01:00:45.400 | It gives you the right thing.
01:00:46.820 | So we just generate this motion description, and then we have a reward coder that
01:00:52.600 | translates this motion description into a piece of code that represents the
01:01:00.440 | reward functions.
01:01:02.640 | And these reward functions cannot be directly executed on the robot, but they can go through
01:01:08.520 | our optimization process, which learns how to achieve those reward functions.
01:01:13.480 | So we're using reward as the interface between language model and a low-level controller.
01:01:20.020 | And for the low-level controller, we're using MuJoCo MPC, which is a model predictive control
01:01:26.140 | algorithm.
01:01:27.140 | It's basically a black box controller.
01:01:29.660 | It samples a lot of trajectories and finds one that optimizes your reward.
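Here is a compressed sketch of the two-stage language-to-reward pipeline just described: a motion descriptor prompt, a reward coder prompt, and an optimizer that consumes the generated reward code. The prompt strings, the `llm()` call, and the `mpc_optimize()` function are placeholders standing in for a real language model API and MuJoCo MPC; they are assumptions for illustration, not the actual system's interfaces.

```python
# Sketch of the two-stage reward translator (motion descriptor -> reward coder),
# followed by a sampling-based MPC optimizer. llm() and mpc_optimize() are
# placeholders, not real APIs.

MOTION_PROMPT = "Describe, step by step, the desired robot motion for this task: {task}"
REWARD_PROMPT = ("Translate this motion description into reward terms, using only "
                 "set_target('<name>', value) calls:\n{motion}")

def llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for a large language model call")

def mpc_optimize(reward_code: str):
    raise NotImplementedError("placeholder for an MPC that samples trajectories "
                              "and maximizes the generated reward")

def language_to_reward(task: str):
    motion = llm(MOTION_PROMPT.format(task=task))            # stage 1: motion descriptor
    reward_code = llm(REWARD_PROMPT.format(motion=motion))   # stage 2: reward coder
    return mpc_optimize(reward_code)                         # optimizer drives the robot
```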
01:01:36.180 | And we tested on a robot dog, a quadruped robot essentially, and a dexterous manipulator.
01:01:41.540 | So the dexterous manipulator has an arm with six or seven degrees of freedom and a hand.
01:01:49.380 | It's very hard to control because it has so many degrees of freedom.
01:01:52.400 | So it's highly challenging.
01:01:56.600 | So just to showcase some of the examples, I omitted the motion description part.
01:02:02.060 | I only output the reward code part.
01:02:07.380 | So it seems that the language model is able to generate the right reward functions to
01:02:13.340 | make the robot stand up on two back feet like a human.
01:02:18.140 | And then now we are a little bit more ambitious.
01:02:20.380 | We know it can stand up.
01:02:22.020 | Can we make the robot do a moonwalk while standing up like this?
01:02:25.380 | So a moonwalk is from Michael Jackson, and it's very challenging.
01:02:28.540 | How do we make the robot to do it?
01:02:29.980 | So it generates the motion description and generates the reward code.
01:02:35.260 | But the motion is not so correct, not exactly what we want.
01:02:41.060 | The nice thing about using a language model and using the reward function is that you
01:02:44.500 | can coach the robot.
01:02:45.820 | You can go back and explain what went wrong and ask the language model to fix it.
01:02:51.300 | So now, being very patient, we can actually coach it.
01:02:54.540 | You say: moonwalk means the robot should walk backward while the feet swing as if they are
01:02:59.940 | moving forward.
01:03:02.860 | Such a great explanation, kudos to my colleague. Then: correct your answer, and also make it walk
01:03:08.540 | at a speed of 0.5 meters per second.
01:03:11.380 | And after you've been very patient and given it the right instructions, it's able to modify
01:03:16.740 | the motion descriptor and also generate the right set of rewards to make this happen.
01:03:22.940 | And now you can teach a robot to do a moonwalk just by using the language as an interface.
01:03:29.980 | And one day we'll be able to do this on the real robot as well.
01:03:33.900 | So in the previous section, you showed how the language model outputs numbers, and
01:03:39.820 | you're constraining it to just output numbers.
01:03:42.140 | Here, how do you prevent it from just hallucinating some program?
01:03:46.420 | Right.
01:03:47.420 | So that's a great question.
01:03:49.940 | In this work, we are not preventing hallucination in a programmatic way.
01:03:56.580 | We have a set of system prompts or a set of rules that explain the API.
01:04:02.820 | After all, the reward functions need to be able to be compiled by the optimizer.
01:04:11.860 | We do need to have some check.
01:04:14.100 | What's more, if it doesn't compile, we can just give the error message to the language
01:04:17.900 | model.
01:04:18.900 | It doesn't have to propagate all the way to the motion descriptor; it can stay at the
01:04:22.300 | reward coder.
01:04:23.300 | We say: if there are errors, please fix them.
01:04:25.140 | After that, it's usually able to fix them.
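A hedged sketch of that compile-and-retry loop: if the generated reward code fails to compile, only the error message is sent back to the reward coder, without re-running the motion descriptor. The `llm` and `compile_rewards` callables and the retry count are assumptions for illustration.

```python
# Sketch of the compile-check feedback loop at the reward-coder stage.
# llm() and compile_rewards() are caller-supplied placeholders.

def generate_reward_with_retries(motion_description, llm, compile_rewards,
                                 max_retries=3):
    prompt = f"Write reward code for this motion:\n{motion_description}"
    code = llm(prompt)
    for _ in range(max_retries):
        try:
            # e.g. parse the generated code into reward terms the optimizer accepts
            return compile_rewards(code)
        except Exception as err:
            # Stay at the reward coder: report the error and ask for a fix.
            code = llm(f"{prompt}\nYour previous code:\n{code}\n"
                       f"It failed with this error: {err}. Please fix it.")
    raise RuntimeError("reward code still does not compile after retries")
```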
01:04:28.880 | We can also chain multiple tasks together.
01:04:32.100 | Using this framework, we can say, open a drawer, take the apple, put it into the drawer, and
01:04:38.540 | close the drawer, and it will be able to do that.
01:04:42.020 | So we tried that.
01:04:43.740 | Just using the reward coder alone is not good enough.
01:04:46.100 | Rather, our two-stage prompting is really, really helpful.
01:04:51.100 | I think that's another inspiration for other fields: when your domain is too different
01:04:56.380 | from the language domain, maybe it would be good to find an intermediate representation and
01:05:00.700 | ask the language model to explain itself in that intermediate representation before going
01:05:05.000 | directly to a more obscure representation.
01:05:07.620 | Finally, we want to transfer this to the real world, but there is a challenge.
01:05:14.220 | In simulation, it might generate motions that are too dexterous, like this one, which is
01:05:21.580 | not possible to do in the real world.
01:05:23.780 | So we add a few more regularizer terms to stabilize the motion, and we also run some
01:05:30.220 | state estimation on the real robot so that it knows where the cubes are, and then
01:05:37.020 | we can take the motion from simulation and achieve it in the real world.
01:05:41.660 | So here are some of the execution in the real world.
01:05:45.100 | So you can, say, pick up the Rubik's cube, and it will generate the motion to pick up
01:05:50.580 | the Rubik's cube and grab it.
01:05:52.300 | This is quite different from RT2.
01:05:53.700 | The motions are quite smooth.
01:05:57.580 | It's quite fast.
01:05:58.580 | It's much faster than 3 hertz.
01:06:02.100 | So here, it can do 10 hertz or even 30 hertz.
01:06:08.100 | So it's comparable with human beings.
01:06:13.360 | So that's language to reward.
01:06:15.420 | There's one last thing that I want to talk about in terms of finding a new interface.
01:06:20.580 | So a lot of the time, we have been thinking about the language model as a semantic engine, a semantic
01:06:25.860 | machine.
01:06:26.860 | It understands semantics.
01:06:27.860 | So, for example, you say, "The student takes out the..."
01:06:33.340 | and it will say "book."
01:06:34.340 | The language model is able to reason about such a sequence.
01:06:37.780 | But what if you use low-level patterns, like if you just give it obscure numbers? What can
01:06:42.700 | it do?
01:06:43.700 | That's actually a low-level interface.
01:06:45.540 | And we can open up the low-level interface of a language model and ask it to do robotics
01:06:51.380 | tasks.
01:06:52.380 | So in this paper, "Large Language Models as General Pattern Machines," we explore using
01:06:56.980 | the low-level interface of a large language model, essentially asking it to reason about
01:07:02.700 | different sequences.
01:07:03.860 | And it's surprisingly quite effective.
01:07:06.020 | And it can solve tasks like the ARC challenge and the PCFG.
01:07:11.940 | And it can even do sequence improvement.
01:07:14.020 | So I will dig a little bit into sequence improvement because that's quite relevant to robotics.
01:07:19.180 | So sequence improvement is where you prompt the language model with state, action, and
01:07:23.860 | reward tuples.
01:07:25.220 | And you then prompt it with a higher reward and see if it can generate actions that achieve
01:07:31.420 | that higher reward.
01:07:32.820 | So it's doing reinforcement learning, or a reinforcement-learning-like thing, but in context.
01:07:38.460 | So this is quite amazing.
01:07:39.460 | So previously, you would need a dedicated algorithm and a replay buffer of collected data to
01:07:45.060 | do this reinforcement learning.
01:07:46.940 | But now you can just do everything in the language model's context by leveraging the low-level
01:07:50.860 | interface of the language model.
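A minimal sketch of this in-context sequence improvement: show the model past (reward, action) pairs sorted by reward, then prompt it with a higher reward and read off the action it proposes. The `llm` callable and the plain-text encoding of the history are assumptions for illustration, not the paper's exact prompt format.

```python
# Sketch of in-context sequence improvement with a language model.
# llm() is a placeholder for a text-completion call.

def propose_improved_action(history, llm, target_reward):
    """history: list of (reward, action_tokens) pairs from previous trials."""
    lines = [f"reward: {r} action: {a}"
             for r, a in sorted(history, key=lambda x: x[0])]
    prompt = "\n".join(lines) + f"\nreward: {target_reward} action:"
    # The model completes the pattern with an action it predicts would earn
    # the higher reward, acting like an in-context policy improvement step.
    return llm(prompt)
```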
01:07:53.820 | And with that, we can actually do something like clicker training.
01:07:57.100 | So if you are not very familiar with clicker training, it's how you train a dog.
01:08:02.500 | You can have a dog, and when it does the right thing, you give it a reward by clicking.
01:08:09.020 | So the clicker training is giving the agent a reward.
01:08:16.380 | And we can now use clicker training to train robots as well.
01:08:20.020 | So here, the robot is exploring, and I give a click when it does the right thing or
01:08:24.860 | moves in the right direction.
01:08:26.300 | And over time, it is able to push the bag of chips, which is the objective of this
01:08:32.500 | training.
01:08:33.500 | So you can do this entire decision transformer-like operation, but purely in context, by just
01:08:40.060 | giving a language model a bunch of patterns and asking it to figure out the regularity
01:08:45.620 | of the sequence.
01:08:47.260 | And this way, it can generate new actions to improve the previous sequence.
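Building on the sequence-improvement sketch above, here is a hedged sketch of how a clicker-training loop could sit on top of it: a human click becomes a reward of 1, no click becomes 0, and the growing history is the only "training signal" the in-context learner receives. The `execute` and `get_human_click` callables are hypothetical placeholders.

```python
# Hypothetical clicker-training loop on top of the in-context improvement sketch.
# propose_improved_action() is the helper defined in the previous sketch.

def clicker_training_step(history, llm, execute, get_human_click):
    action = propose_improved_action(history, llm, target_reward=1)
    execute(action)                          # run the proposed action on the robot
    reward = 1 if get_human_click() else 0   # a click means "that was the right thing"
    history.append((reward, action))         # the context is the replay buffer
    return history
```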
01:08:54.980 | So for the language model, we can find new interfaces that are more suitable for teaching
01:09:02.140 | it low-level skills.
01:09:04.300 | Reward is a bridge between the language model and low-level control, and we can fully leverage it as a
01:09:09.460 | universal interface, and we can optimize in real time.
01:09:16.060 | Sometimes it outperforms generating actions directly.
01:09:18.420 | So it really motivates using reward functions as the interface.
01:09:23.640 | And with the language model as a general pattern machine, we can use the language model beyond
01:09:27.180 | semantic tasks.
01:09:28.180 | We can ask it to reason about low-level things.
01:09:30.500 | And also, robotics, as a domain, is rich in sequence transformation, sequence completion, and
01:09:36.020 | sequence improvement tasks.
01:09:37.540 | So we can really study the lower-level mechanisms of language models.
01:09:43.380 | And the key takeaway for this talk is that we are seeing more and more use of foundation
01:09:51.100 | models, not only on the semantic reasoning side of robotics, but more on the dexterous,
01:09:57.220 | on the generating actions, on the lower-level embodied intelligence side of robotics.
01:10:03.340 | And we need to rethink the scaling law of robotics and transformer.
01:10:07.300 | How do we scale it with limited amount of data?
01:10:10.260 | We have a new recipe for scaling robot model and data in RT2, which shows that you can
01:10:14.500 | do more with the same data: with essentially RT1 data plus internet data, you can generalize
01:10:19.460 | to a lot more things.
01:10:20.460 | And RTX shows that you can do a lot more with more data.
01:10:24.660 | There are also benefits to collecting more robotics data.
01:10:27.140 | And there are positive transfers everywhere.
01:10:29.260 | And part two, in terms of new interfaces for language models, I think it's worthwhile for the
01:10:35.160 | robotics field to think about developing new and lower-level interfaces to language models,
01:10:39.780 | which facilitate learning low-level skills.
01:10:43.300 | With that, I would like to conclude my talk.
01:10:45.540 | And if you find it interesting, there are a lot of references for you to look into.
01:10:50.540 | And special thanks to my team, Google DeepMind Robotics team.
01:10:55.740 | So we are at the forefront of developing foundation models for robotics.
01:10:59.900 | And stay tuned for more in the future.
01:11:01.580 | Thank you.
01:11:03.580 | You mentioned that raw numbers are difficult for a lot of language models, but if you're
01:11:15.760 | just generating the action tokens themselves, like "rock" or whatever you had in an example,
01:11:21.820 | why don't you just have a linear layer appended to the transformer that would just generate
01:11:28.860 | the numbers directly, so you can output whatever you need?
01:11:32.580 | Yeah.
01:11:33.580 | The question is: if large language models have difficulty understanding numbers,
01:11:40.140 | why don't we use a linear layer to output the actions directly?
01:11:43.460 | I think it is difficult for language models to understand numbers.
01:11:47.760 | But sometimes we still want it to bring in knowledge from the pre-training mixture.
01:11:57.140 | If I add a new layer, that new layer is not present in the pre-training.
01:12:02.260 | So how do I expect it to transfer?
01:12:04.460 | I think that's an interesting question.
01:12:06.420 | But at the same time, I don't necessarily think using the raw numbers is the right interface.
01:12:12.300 | We probably could do some action representation learning to learn a representation.
01:12:16.660 | And the language model can output that representation.
01:12:19.660 | So we're still trying to figure out what is the right representation.
01:12:24.180 | So among the representations that we have tried, like decimal numbers, float numbers, and
01:12:30.300 | extra tokens, we find that just using numbers or extra tokens is good enough.
01:12:36.500 | Yeah.
01:12:38.500 | [INAUDIBLE]
01:12:39.500 | Yeah, I think both directions are worth exploring.
01:13:01.980 | There are different advantages of generating action directly.
01:13:06.940 | I think it borrows the autoregressive nature of language modeling.
01:13:12.660 | And it aligns with a lot of other tasks, like visual question answering really well.
01:13:18.540 | The limitation is that when you are generating actions that way, it's heavily regularized.
01:13:23.900 | Can you generate dexterous actions that are very out of distribution? That's kind of difficult.
01:13:29.380 | The language to reward approach actually borrows a page from the book of traditional robotics, this
01:13:35.140 | optimization-based or model predictive control.
01:13:40.020 | And you can also take into account, let's say, safety constraints more easily.
01:13:46.180 | It can generate more diverse actions.
01:13:48.820 | Maybe one recipe is to generate a lot of data with the language to reward system and distill
01:13:54.180 | them into a transformer.
01:13:56.900 | Because then you are imbuing your large language model with all these other desirable behaviors. The
01:14:02.780 | language to reward approach itself, I don't know how scalable it is.
01:14:07.060 | We're not fine-tuning language model.
01:14:08.900 | So maybe you are limited to what-- you are at the mercy of the training data of the language
01:14:14.620 | model.
01:14:15.620 | The language model can do moonwalk because it knows what moonwalk is.
01:14:20.220 | It roughly knows how to do that.
01:14:24.180 | But if you want to scale to completely new things, maybe you can use the language to
01:14:28.100 | reward to bootstrap your data generation and then put it into the other policy.
01:14:33.460 | So can you tell us what's the next direction Google is pursuing?
01:14:39.940 | So is it like, is language to reward the right direction, like scaling out of the
01:14:43.180 | room, out of the racks, and so on?
01:14:45.220 | Yeah, I think that's a good question.
01:14:46.860 | So the scaling being the end of the lecture, that is a joke.
01:14:51.940 | But I'm being quite serious.
01:14:54.380 | It's actually a promising recipe.
01:14:57.040 | So I think everybody believes in the power of the scaling law.
01:15:04.860 | So just by giving it more data, giving it more compute, you will see interesting capabilities
01:15:10.740 | coming out.
01:15:11.740 | [INAUDIBLE]
01:15:12.740 | Yeah, I still think we don't quite have enough data.
01:15:30.360 | I think that's still probably the biggest bottleneck.
01:15:33.660 | So we are trying to find ways to do more with limited data.
01:15:38.360 | And we are trying to collect more data.
01:15:40.500 | And I think it needs some time for us to accumulate enough data.
01:15:45.200 | And currently, I'd say, we have signs of life for positive transfer.
01:15:50.320 | But in language models, people don't talk about positive transfers anymore because it's
01:15:54.680 | so commonplace.
01:15:55.680 | Right?
01:15:56.680 | You see it everywhere.
01:15:58.680 | And robotics is not at that stage yet.
01:16:01.840 | Yeah, how much has your team been thinking about safety and alignment?
01:16:06.440 | Yeah.
01:16:07.440 | And are you just, right now, relying on the ethics that emerge from the large language
01:16:13.400 | models?
01:16:14.400 | It won't tell you to kill someone to achieve that.
01:16:17.280 | Yeah, that's a very good question.
01:16:18.800 | Actually, we take safety very, very seriously, because in all of the other domains where
01:16:24.520 | language models are developed, they don't have a direct impact on the physical world.
01:16:31.800 | But here, it could have potential harm to humans and to the environment.
01:16:37.600 | And Gary Marcus actually gave a comment previously on our work: what if you say, bring out
01:16:45.560 | a bowl, feed the cat, and put it in the dishwasher?
01:16:47.440 | Will it put the cat in the dishwasher, right?
01:16:50.040 | If it misunderstands, it will actually have a catastrophic failure case.
01:16:57.720 | We take safety seriously by designing hardware and software safety layers.
01:17:03.520 | And there is also some constitutional-safety work that is coming out sometime soon.
01:17:11.280 | I cannot share many details right now, but sometime soon, we'll release some work.
01:17:17.240 | Is it something like, if there's a human, just don't interact?
01:17:21.320 | Well, no, no, no.
01:17:23.440 | I think it's a little bit more nuanced and more detailed than that.
01:17:27.600 | But we do take safety quite seriously.
01:17:29.880 | And in some of our experiments, actually, the robot's fingers would break off so that
01:17:34.040 | the robot cannot apply too much force to the environment.
01:17:36.280 | So that's just yet another way of ensuring safety.
01:17:39.600 | Can we have some visual language model and a synthesizer or something to stop the problem
01:17:45.600 | that both the internet and the robot?
01:17:49.120 | And maybe, this is kind of like interpretive, but both in some logical way.
01:17:55.160 | Right, right.
01:17:56.160 | So I think it would be possible.
01:17:58.000 | [INAUDIBLE]
01:17:59.000 | Thank you for the great talk.
01:18:06.960 | Thank you.