
Stanford CS25: V2 I Robotics and Imitation Learning



00:00:00.000 | I'm really happy to be here. I guess to shortly introduce myself. My name is Ted Xiao. I'm
00:00:13.580 | a senior research engineer on the Google Brain team. I've been working on robotics now
00:00:18.460 | for the past five years. I've touched upon a few topics including multitask learning,
00:00:23.980 | reinforcement learning, and then lately just broadly thinking about how we can scale robots
00:00:28.300 | to make sure that we can actually work in the wild in the real world. I guess today
00:00:33.060 | I'll be talking about quite a few different topics, but as a first preface, I guess the
00:00:38.940 | first thing to know is that our team is pretty massive now. All of these projects are huge
00:00:43.020 | collaborations, with some projects having more than 40 people working on them for many years.
00:00:48.240 | So these are large efforts and I'm just very fortunate to be on teams of
00:00:52.500 | very smart people. Secondly, some of my takes are spicier or more controversial than others.
00:00:59.300 | So all of those opinions are definitely only my own and don't reflect those of Google or
00:01:04.060 | anyone else on the team. So with that out of the way, yeah, welcome to my TEDx talk.
00:01:16.040 | So I think maybe some of you have seen a lot of the cool robot learning videos out in the
00:01:21.820 | wild these days, but I am more excited than ever and it's not just hype I think. I think
00:01:27.180 | there's been a fundamental shift in how researchers in robotics view learning over the past two
00:01:32.340 | years and I think the shift has a lot to do with all of the trends happening more broadly
00:01:38.160 | in foundation modeling, in large-scale internet models, across different fields like language,
00:01:44.220 | audio, and so on. But I think my goal today is to convey to you why I am particularly
00:01:50.280 | excited about this time today right now and why there's been a very fundamental 180-degree
00:01:56.340 | paradigm shift I think across the robot learning field. And if you walk away from this talk
00:02:01.740 | with just one thing, and that's that you're a bit more excited about robotics than you
00:02:05.860 | were before or believe that the time is now for these robots to really start scaling exponentially
00:02:11.380 | and doing something really cool, I think then my talk will have succeeded. The talk will
00:02:22.320 | have a few parts. We're going to start at a very high level and just talk about why
00:02:26.920 | a foundation model for robotics at all, what that might look like, and the ingredients
00:02:32.380 | and recipe for how we might get there. Then we'll dive into a few different works pretty
00:02:38.000 | deeply that my team has been very proud of over the past year or two. And finally, we'll
00:02:42.920 | go back to the high level and then zoom out and think about what's next for robot learning.
00:02:49.780 | So why a foundation model for robotics? One second, let me try to hide this thing. No,
00:02:59.980 | that's fine. I'll keep that bar there for now. But the top bar says why a foundation
00:03:03.520 | model for robotics. The term foundation model was coined here at Stanford, and I'll use the phrases internet-scale model,
00:03:10.000 | foundation model, and large language model pretty interchangeably throughout. And I hope
00:03:13.600 | it's pretty clear. But generally, when I'm talking about these big monolithic beasts
00:03:18.080 | that are training on tons of data, they have two very important properties that I think
00:03:22.640 | are quite nice. One is emergence. When very simple things kind of work at a small scale,
00:03:30.220 | they get a ton better when you just scale things up: more data, more compute, larger models.
00:03:36.080 | And what we see here is that when these models become good enough, the domain space
00:03:41.700 | of what they're good at and able to do starts to grow combinatorially. And here
00:03:46.560 | for these two points, I would like to suggest two blog posts I highly recommend. One is
00:03:51.980 | from Jacob Steinhardt called More Is Different for AI. And this links to the phenomenon
00:03:56.220 | that we see in other fields, like physics or biology. For example, individual water
00:04:01.380 | molecules will behave very differently and have very different, let's say, electrostatic
00:04:05.960 | forces, but then they start to clump up and behave as a liquid altogether.
00:04:10.440 | We see this in herds of animals and flocking patterns, we see this in humans and economies,
00:04:14.760 | we see this all across different fields. And now even an AI, we see models that are doing
00:04:19.060 | stuff that would not even be possible when they were at a smaller scale. But when they reach
00:04:23.500 | some critical scale in size, they start to work really, really well. This is documented
00:04:28.760 | by Jason Wei in his blog post on emergence in LLMs, where you see this plot on the bottom left:
00:04:35.580 | across a bunch of different tasks, whether it's modular arithmetic or Persian
00:04:39.640 | question answering, the success rate is basically flat until these models get big enough, good
00:04:44.500 | enough. And then the success rates just kind of skyrocket. And that's why I think these
00:04:50.320 | are particularly exciting. So yeah, question.
00:04:54.480 | I'm curious to know, do robotic foundation models display scaling in real life?
00:05:00.980 | Great question. And I'm really glad you asked. We have, I'm pretty excited to present some
00:05:06.120 | directions we have along that I hope will answer your question in maybe about 10 minutes
00:05:09.320 | or so. Yeah. But I think that's a question on all of our minds, including myself. So
00:05:15.920 | I think before we even get to the feasibility or the existence of any robotic foundation
00:05:20.100 | models, like is this even needed? And I think the argument that I don't think is obvious
00:05:25.600 | is that I think emergent capabilities and relying on these might actually be indispensable
00:05:30.360 | for robotics to actually work. A lot of the research over the past decades of robotics
00:05:34.360 | has been in one bin, one room, one table, one robot, one building even, but these are
00:05:40.280 | so vastly different from the orders-of-magnitude more complex real-world situations that
00:05:46.240 | humans operate in every single day. And I think to make that gigantic leap, we're going
00:05:50.780 | to have to rely on this emergent capability scaling curve where things kind of work. You
00:05:55.600 | have very canned demos. Maybe you have, you know, a humanoid robot programmed to backflip
00:06:00.440 | after hundreds of trials, but going from that to like the chaotic real world, I think we're
00:06:05.000 | going to have to rely on this emergence phenomenon for that. And I think maybe even intellectually
00:06:12.360 | or academically, it's also interesting to think about why or why not a foundation model
00:06:18.160 | for robotics might even work. It's worked in so many other domains. There's existence
00:06:22.720 | proofs in audio, music, coding, language, and, it seems, another domain every single day,
00:06:26.560 | with 3D models and beyond. But if there is something very special about robotics, whether
00:06:32.800 | it's embodiment or causality or physical grounding, and that is the barrier to making this very
00:06:38.520 | simple recipe that's working in all these other domains. If there is something special about
00:06:42.720 | robotics that causes this recipe to fail, I think that's quite interesting to study
00:06:46.620 | why that is. I'm personally an optimist. I don't think there is some magical secret sauce
00:06:51.600 | that's going to keep robotics from being tackled with the same formulas and recipes that's
00:06:56.000 | worked elsewhere. But, you know, I think this is a question I'd like to find out the answer
00:06:59.800 | to. And so maybe then instead of just motivating this philosophically, okay, we need foundation
00:07:07.400 | models, foundation models are great. Let's try to build one for robotics. How do we actually
00:07:11.520 | do that? Well, I think we can leverage a few ingredients by standing on the shoulder of
00:07:17.600 | giants and looking at other domains. The first one is looking at different design principles
00:07:22.360 | of ML scaling from other domains. Let's look first at high capacity architectures, the
00:07:28.520 | topic of this class today. Ideas such as self-attention, as all the different ideas encompassed in
00:07:35.000 | the transformer, as Andrej Karpathy famously said, it's like a magical universal differentiable
00:07:39.880 | computer that's very general, very robust, and very remarkably scalable on many different
00:07:44.720 | dimensions. Let's use those. We should also leverage the more guiding principles that
00:07:50.280 | have been seen, the scaling laws, the trends, this year's Chinchilla, you know, we not only
00:07:55.160 | have to scale the model size, we also have to scale compute, and we also have to scale
00:07:59.480 | the number of unique tokens in the corpus of the vast data sets that we train on. But
00:08:04.040 | if we do all three together, this has been shown to reliably have a pretty good chance
00:08:09.120 | of succeeding, no matter what domain you're looking at. And so, and finally, what that
00:08:14.560 | kind of means, and I think this is actually going to come up later, is that data set size
00:08:18.720 | seems to matter these days a lot more than quality. Even if you have some sentences on
00:08:22.880 | Wikipedia that are misspelled, or some, you know, falsehoods, or some things that aren't
00:08:27.360 | so desirable, if in aggregate, your data set is diverse enough, and interesting enough,
00:08:32.480 | these things will hopefully wash out in the mix. Ingredient number two, the proliferation
00:08:39.120 | of the internet scale models themselves, not just the principles. What's exciting, and
00:08:44.880 | I'm sure it's, you know, definitely been very shocking for both experts and lay people alike,
00:08:50.080 | is that a lot of these generative models across many different modalities have been experiencing
00:08:54.320 | emergent capabilities and have been surpassing all of our wildest expectations time and time
00:08:59.400 | again. But even when we think that we're exhausted, that all this stuff is too much, that it's
00:09:03.720 | not going to work, something will come out and completely blow me out of the water. And
00:09:06.880 | I think this trend will definitely keep continuing. And I think, in addition to that, they will not
00:09:11.440 | only continue coming online and accelerating more rapidly, they're going to happen
00:09:15.200 | whether or not we do anything. In the grand scheme of things, for me as
00:09:20.200 | a robotics researcher, or, you know, for you in whatever subfield you're in, there are parts
00:09:25.180 | of machine learning that you'll probably never touch, at least in the near future.
00:09:29.840 | And those parts will be seeing tremendous breakthroughs and scaling and new capabilities
00:09:33.280 | coming online every single week. And you can look at this not only in the impressiveness
00:09:39.760 | of the models, but also the acceleration of progress, the timescales in which new models
00:09:44.560 | are being released, where large collaborations are being worked on by many groups and then,
00:09:49.440 | you know, made available for all to use and build upon. And the final ingredient
00:09:56.460 | in this trend is more of a robotic specific one, but it is a vast shift from online robotic
00:10:02.860 | learning, where robots collect experience online, make actions and learn through trial
00:10:08.380 | and error to an offline setting where we decouple the data generation process from the data
00:10:14.260 | consumption process. As we've seen, and all these other foundation modeling domains, these
00:10:19.740 | big internet scale data sets are so diverse, and they're static, we just scrape them once
00:10:24.320 | or scrape them multiple times continuously. But we aggregate a continuous pile that's
00:10:29.040 | just growing. Here, we see either the Pile dataset from EleutherAI or LAION-5B for
00:10:34.860 | paired image-text data. And these are pretty big, and they're orders of magnitude more than
00:10:39.200 | what we've seen before. And they are definitely a key ingredient to why other domains have
00:10:43.560 | been doing so well at training these big foundation models. And this coming back to robotics,
00:10:50.800 | then I'd like to take a brief detour into how the shift came to be because it's very
00:10:55.840 | easy to say in a sentence, yeah, robotics is offline more than online. And this is coming
00:10:59.640 | as kind of a no brainer to many folks who are coming from other domains, like this is
00:11:04.040 | the way things are done. But in robotics, this has been a very big shift. And I think
00:11:09.040 | robotics has also been synonymous with RL, reinforcement learning for a lot of people.
00:11:13.720 | And I think increasingly, this is becoming less true. And so I'd like to take you down
00:11:17.840 | a brief trip down the history of my team; the subtitle of this part of the talk is a brief history of
00:11:23.400 | robotics at Google. And yeah, of course, thanks. And I think this is not just for dramatic
00:11:29.480 | exposition, it's really to try to guide you through how drastically our team's thinking
00:11:34.960 | has kind of evolved over the years, and how that's going to inform the design decisions
00:11:40.400 | and the kind of risks and research directions that we take in the specific projects that
00:11:45.240 | I'm going to show coming up. Thank you. So in 2016, some of you may have seen this, we
00:11:50.560 | had what we call the arm farm, seven KUKA robots in a room collecting picking data 24/7.
00:11:56.480 | And this was doing on policy RL in the real world, we were the first team to kind of say,
00:12:00.960 | hey, can we can we even do this with the goal of saying, can we do end to end robot learning
00:12:05.880 | with results in the real world, this was kind of risky at the time, it was not a common
00:12:10.000 | take. And from that we developed several interesting research directions that we started exploring,
00:12:14.720 | we looked into stuff like QT-Opt, which is a Q-learning method working on continuous
00:12:20.960 | control actions while taking in vision inputs. We worked on CycleGAN to transform simulation-
00:12:27.820 | based images into real-looking images for sim-to-real, and we looked at concurrent control
00:12:32.560 | and how we get robots moving faster and more efficiently in the real world. I'm sorry,
00:12:36.200 | do you have a question?
00:12:39.040 | Yeah, great question. And that one, I think was basically, the arms would pick stuff up
00:12:49.000 | from the bin, if they messed up, and it fell out, well, we come back the next morning,
00:12:52.680 | and there'd be objects scattered all throughout the room. So there was no reset. But if they
00:12:57.400 | missed a little bit, the objects would fall back into the bin and hopefully be in a position
00:13:01.000 | where they could pick them up again.
00:13:03.160 | Oh, yeah, of course. Thanks. I'll do that in the future. On this specific question was
00:13:09.680 | for this 24 seven arm farm, how did we do resets? And the answer is, well, we didn't
00:13:15.480 | we designed the bins so that they were kind of banked, so that if objects were slightly missed,
00:13:19.080 | they would fall back in the bin, rearrange themselves, and maybe add more diversity to the training
00:13:22.880 | data. But this was doing off-policy online RL with Q-learning, and we mixed in some
00:13:28.400 | data and deployed again.
00:13:31.800 | Next, we kind of went through this consolidation phase around 2020. When we're like, alright,
00:13:37.880 | this is pretty cool. And you know, but we want to get out of the bin, how do we do more
00:13:41.760 | complex tasks and a more practical setting that could be closer to something that humans
00:13:46.440 | would want to use that's more general every day. There, we kind of settled on this office
00:13:50.740 | micro kitchen environment, if you've heard of the famous Google micro kitchens. And I
00:13:55.240 | think this was the setting we decided to operate in. And there, we started collecting data,
00:14:00.600 | we scaled our real operations. And there, we kind of scaled approaches to some different
00:14:04.360 | things. And I think in the bottom right here is like the more mechanized reset version,
00:14:08.800 | I would say of the arm farm. Here, we had a bin that folded in half. And this was doing
00:14:13.640 | multitask RL in the real world. And the bin would flip in half dumping objects from one
00:14:17.480 | side to the other. So you could do more interesting tasks, whereas the arm farm was pick anything
00:14:21.200 | up. Now we could say, hey, pick up the carrot and place the tomato on to the plate. And
00:14:26.440 | then the bin would flip and you'd reset. Some other work was multitask imitation
00:14:31.060 | learning, this is BC-Z. And then we also looked at stuff like combining reinforcement
00:14:34.900 | learning with imitation learning bootstrapping.
00:14:39.360 | But in 2020, once again, we realized we were working on a ton of different directions,
00:14:45.160 | and we wanted to consolidate. And I think the two main things that were really bothering
00:14:49.080 | us at the time were that we were hitting two main walls across all these methods:
00:14:53.980 | some of them were plateauing at this, you know, rough 50 to 70% range in the real
00:14:58.880 | world. And other methods were requiring very specific data distributions, they had to be
00:15:04.240 | on policy, or they could only use demonstrations, or they blah, blah, blah, like, there were
00:15:07.920 | so many different nuances and like gotchas to all these different methods, and all these
00:15:11.880 | different drawbacks. And so the question we posed was, we're open to any method, any strategy
00:15:18.180 | that will enable us to solve tasks in a very performant manner, more than 90%, in the real
00:15:22.720 | world. And also that can scale with some kind of data that we can collect, you know, and
00:15:28.560 | maybe this is a bit more lax than let's say, an academic setting where you're much more
00:15:33.120 | resource constrained. But at the end of the day, you know, even our team does not have
00:15:36.600 | infinite money, we still have a certain number of robots, a certain number of operators,
00:15:40.560 | and we're constrained by the laws of physics. So we need some way to acquire more data that
00:15:44.000 | we can then learn from. And so we're all scratching our heads thinking about this for a few months
00:15:47.720 | in spring 2022. We decided on going with multitask imitation learning. So this was a vast departure
00:15:54.600 | from the 24/7 arm farm. This was a vast evolution of how we approach the problem. We found that,
00:16:00.560 | you know, with enough, you know, gentle care and love, multitask imitation learning was
00:16:04.520 | able to hit these 90% numbers, and it was able to get better with more demonstrations.
00:16:09.320 | These aren't the cheapest thing, but it was able to scale with additional demonstrations,
00:16:13.760 | which was the sign of life that we were looking for. So that brings us to less than a year
00:16:18.360 | ago, our team was deciding this is the path forward, at least in the near term future.
00:16:23.120 | But maybe, you know, we could just think about how the approach we were taking here might
00:16:30.040 | also spread out in the future. And we might be able to bring back these other threads.
00:16:34.280 | For example, if now that we're decoupling this data collection of demonstrations or
00:16:39.160 | etc. from how you learn from them with a multitask imitation learning policy, maybe we can in
00:16:44.080 | the future then do something like offline RL. But I think at a high level now, I've
00:16:48.840 | just you know, in a few short minutes, just compressed six years of very bitter lessons
00:16:54.040 | that our team has been learning. And I think from where we are today, and looking back,
00:16:57.960 | even just two years ago, if you told me that the strategies we're deploying today could
00:17:01.680 | just scale the way they are, I probably would not have believed you.
00:17:05.920 | Great question. So I think task conditioning is definitely still was an open question at
00:17:20.320 | the time. But I think with this work, BC-Z, we found that language, at least
00:17:27.120 | a templated language representation, was good enough where we could direct, I think,
00:17:32.120 | BC-Z's over 80 tasks. So they were very templated, like pick grapes, or
00:17:37.320 | move grapes onto plate, or drag cloth across table. And I think
00:17:44.240 | this representation was still enough where you're learning a good number of skills;
00:17:47.840 | you're passing in essentially a one-hot ID into your policy network, and it will learn
00:17:51.400 | to imitate that. And for each one of those 80 tasks, we'd collect hundreds or thousands of
00:17:55.320 | demonstrations. And I will touch upon the specifics of that a bit later, too.
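(To make that templated task conditioning concrete, here is a minimal sketch; the task strings and helper names are illustrative assumptions, not BC-Z's actual implementation.)

```python
import numpy as np

# Hypothetical templated task strings, in the spirit of BC-Z's ~80 tasks.
TASKS = [
    "pick grapes",
    "move grapes onto plate",
    "drag cloth across table",
    # ... more verb-noun templates ...
]
TASK_TO_ID = {task: i for i, task in enumerate(TASKS)}

def task_conditioning(instruction: str) -> np.ndarray:
    """Map a templated instruction to a one-hot conditioning vector."""
    one_hot = np.zeros(len(TASKS), dtype=np.float32)
    one_hot[TASK_TO_ID[instruction]] = 1.0
    return one_hot

# The policy network consumes (image, conditioning) pairs and learns to
# imitate the demonstrations collected for that task ID.
conditioning = task_conditioning("pick grapes")
```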
00:18:03.720 | So yeah, today, and or at least in 2022, let's do offline methods, let's decouple data generation
00:18:10.320 | from data consumption. And let's take these three lessons now that we touched upon. Let's
00:18:15.320 | take the design principles of ML scaling, and then figure out what lessons can actually
00:18:18.960 | be applied when you look into the future for recipe for robot learning and foundation models.
00:18:26.080 | The first lesson I think is very important is these high capacity architectures like
00:18:29.680 | attention. The second, which I'll touch on later, is data interoperability: tokenization,
00:18:34.960 | discretization. And the second ingredient is the proliferation of these models themselves.
00:18:40.440 | Can we leverage them because they will get better over time. And I think here, I would
00:18:44.680 | like to plug my colleague Karol Hausman's bitter lesson 2.0. The bitter lesson,
00:18:49.560 | the first one, from Richard Sutton, was that you should leverage methods that scale with more
00:18:53.560 | compute. And maybe in today's day and age, the lesson is that we should leverage methods
00:18:58.800 | that are able to utilize improvements in foundation models, because they're going to get better.
00:19:03.400 | Yeah. So both in the lesson 1.0 and 2.0, one thing that's always been clear to me is suppose
00:19:10.400 | I have a set of methods. And I want to choose the methods that are going to scale with more
00:19:13.400 | compute or in this case, scale with better foundation models. The question is, how do
00:19:17.360 | I actually decide which of those methods meet those criteria?
00:19:22.080 | Yeah, great question. I think, and maybe it's, I think that's a very, I don't have a good
00:19:27.720 | answer for that. Oh, sorry. Yeah, yeah. The question was in bitter lesson 1.0 and bitter
00:19:32.560 | lesson 2.0, the question is, well, that's great. That's the lesson, but how do we actually
00:19:36.480 | decide which methods meet this criteria? And I think, you know, my answer is it's not always
00:19:42.000 | obvious and it's actually quite tricky sometimes, but maybe, you know, sometimes, you know,
00:19:46.680 | there are some where you can be very confident that, oh yeah, this will definitely scale with more data
00:19:50.320 | and compute, and some where it's less clear. But basically, the more hard-coded you are, the more assumptions
00:19:54.400 | and heuristics you bake in, and, in our day and age, the more you rely
00:19:58.400 | on a specific implementation of a specific foundation model of a specific class of algorithm,
00:20:04.560 | maybe that will be less robust than a method that just assumes some very abstract input
00:20:09.000 | and output and assumes that how you get from that input and output can improve over time.
00:20:13.280 | And maybe the algorithm itself even changes altogether. So I think that would be my take
00:20:17.760 | on the bitter lesson 2.0, but this is definitely still, I think the jury is still out on this.
00:20:25.440 | And one of the things I like to propose is that language is the
00:20:30.280 | way that we can leverage bitter lesson 2.0. If you have language as the universal representation
00:20:35.600 | through which all of these foundation models communicate with each other, whether it's, you know, captioning
00:20:39.640 | or generation or whatnot, I think that's one way that we could leverage a bitter lesson
00:20:44.160 | 2.0. And finally, the third ingredient offline robot learning, decoupling data generation
00:20:51.680 | from data consumption. Putting these all together, my recipe for one take at a modern attempt
00:20:58.740 | at embodied intelligence would be to combine these large offline datasets
00:21:03.640 | with high capacity architectures by using language as the universal glue. And in the
00:21:08.560 | works I'm going to present shortly, all of our different projects, I think in some way
00:21:13.040 | or another are inspired by this philosophy. And now, now that we've kind of, you know,
00:21:22.320 | understood the motivations and potentially one possible approach... Of course, largely the recipe
00:21:29.600 | is large offline datasets, high capacity architectures, using language as a universal glue. I'm curious
00:21:34.320 | to know which, if any, of these are currently bottlenecks, if that's the right word, meaning
00:21:39.560 | they're limiting. Got it. Because it seems to me like we already have large offline datasets,
00:21:43.320 | we have high capacity architectures, and, you know, those architectures already deal with
00:21:46.560 | language, but it seems like we already have all the components necessary.
00:21:49.720 | So why is this then not a solved problem? The question was, it seems like we have
00:21:55.880 | a lot of these ingredients. And so why hasn't robotics been solved yet? So I would argue
00:22:01.240 | that actually this take here, and maybe this is the wrong audience at the moment,
00:22:05.880 | but I think this is very non-obvious across the robotics field. Many people do
00:22:09.720 | not agree with all of these, much less two of these, or even any of these points. And
00:22:15.160 | so I think, also, the scale and maturity of each of these components
00:22:20.520 | within robotics is at very different stages. And I would say, and we can talk a bit
00:22:24.760 | later about like, for example, like data scale, or the architectures that have kind of diffused
00:22:29.560 | through osmosis from other ML domains into robotics. But I think we're still at very
00:22:34.440 | different stages on how, how much people have actually bought into these lessons and invested
00:22:39.000 | in them.
00:22:40.000 | Yeah, I can probably, I also don't want to get into too much trouble here, but I'll
00:22:57.840 | probably get myself in a bit of hot water in a few slides. So I'll, I'll extend upon
00:23:01.560 | it a bit then.
00:23:02.560 | I'm just curious to know what their opinion is and why you think they're wrong.
00:23:07.280 | Yeah. And I would say that like me personally, and, you know, not speaking for my team, but
00:23:12.960 | a lot of people on my team are probably at the very extreme end of learning, scaling,
00:23:18.480 | data-driven, you know, foundation model based, let's go big. And I think a lot of people
00:23:24.160 | don't believe that. And yeah, happy to discuss why later, maybe after the Zoom as well. So,
00:23:30.160 | so yeah. Well, okay then let's, let's go ahead and dive in and see how this recipe
00:23:35.920 | might actually percolate into specific domains. And the first one is RT1. This is a recent
00:23:43.360 | work from our group that works on how we can scale imitation learning. And let's look at
00:23:47.960 | how we can actually apply these first principles.
00:23:50.880 | So the first one is to consider what we actually have. Let's put ourselves into the spring
00:23:55.920 | 2022 mindset. We've been collecting demonstrations for a while. This is a ton of demos, like
00:24:00.840 | a hundred thousand, that was collected over like a year and a half on many, many
00:24:05.200 | robots, on many, many tasks. It was expensive. And over time, this will not,
00:24:11.200 | you know, tick up at insane amounts. Like, we won't just get a hundred thousand
00:24:14.840 | new high quality demos every day. This will grow over time, but it's not going to, you
00:24:19.000 | know, grow for free. And autonomous ways of doing this are very hard. As you saw earlier
00:24:23.320 | with MT-Opt with the bin reset mechanism, or DeepMind has a work on RGB stacking, where
00:24:27.560 | they try to do autonomous resets. And you know what, the way that we're doing it right
00:24:31.160 | now, or at least for this paper, was human teleoperation, pioneered by BC-Z, and that
00:24:36.960 | was very expensive as well. So there's going to be limited throughput. And finally, BC-
00:24:41.480 | Z used a ResNet-based backbone, and it was pretty good, but we found that it was very
00:24:45.120 | sensitive to training distributions. For example, when they removed data from some teleoperators
00:24:49.680 | to make the data more homogeneous, performance got better, and that's not really a property
00:24:53.680 | we like, right? We want more data, even if it's not exactly the same. So the lesson here:
00:25:00.120 | models need to be robust and they need to generalize. Cool. So we need models to
00:25:04.440 | be robust and generalize. What else do we have? Well, off-the-shelf models are pretty
00:25:07.840 | slow. If we take in these huge, you know, vision transformers from other domains, they're
00:25:12.120 | not going to run on the real robot. We need to be able to run at a pretty high frequency.
00:25:16.000 | They need to be reactive. Inference time needs to be fast because all our models are vision
00:25:20.520 | based. And finally, we want our models to be able to understand language. As I mentioned,
00:25:26.720 | language is the universal glue. Our dataset already has some language. We want
00:25:30.400 | eventual models to be very multimodal. This is a first principle that we need to dig in.
00:25:35.840 | What does this mean? We can't just take something existing. We probably need to design or at
00:25:39.520 | least modify something from the ground up. And let's take the best practices that we've
00:25:43.780 | seen work in other fields. And so we worked for a bit and we came up with this architecture
00:25:51.160 | for RT1. Again, once again, this was a large team with a bunch of different contributions,
00:25:56.040 | and I'll just go through a few of them here. At a high level, RT-1 is Robotics Transformer 1.
00:26:02.560 | It operates at three hertz. It takes in visual input from the robot RGB camera, as well as
00:26:08.600 | a natural language instruction. There, the image is patchified and fed into a FiLM EfficientNet
00:26:14.680 | tokenizer. It's then passed into TokenLearner, which I'll talk about soon. And then
00:26:20.000 | also the language instructions are tokenized and then they are put into the same transformer.
00:26:25.440 | And then finally, we output discretized actions as tokens and send that to the real world
00:26:31.040 | at three hertz in closed loop. This transformer is decoder-only. We use a sparse categorical
00:26:38.740 | cross-entropy objective for action prediction, applying a causal mask. We use a pre-trained
00:26:44.400 | EfficientNet backbone, and we also use TokenLearner for faster inference. Diving a little
00:26:50.320 | bit deeper. Oh, sorry. Yeah. A question. Great question. So the image token, when it
00:27:01.920 | goes in from, so each image is the, you know, the high fidelity RGB image from the camera.
00:27:07.320 | We split that up into 81 separate patches. And so each patch is, you know, it's spatially
00:27:12.240 | just like the square there. But the cool thing is that what TokenLearner does here, this
00:27:18.080 | thing here, is, it's a previous work from our group that takes in a bunch of possible, you
00:27:24.640 | know, image patches and dynamically selects which of those image patch tokens are more
00:27:30.280 | relevant for the task at hand, given the existing context. So from those 81 image patch tokens,
00:27:36.240 | we sub sample eight of them to use for inference. And this happens at every time step. And that
00:27:42.520 | process has learned which of the eight patches are relevant at any given moment. And otherwise,
00:27:48.840 | we're sending in way too many tokens and the context length would explode and we wouldn't
00:27:52.600 | be able to do inference on robots. We are also passing in a sequence length
00:27:56.780 | of six images. History is quite important when you're doing temporally coherent tasks
00:28:01.920 | in the real world, where things like physics and, you know, exactly this nuanced detail
00:28:06.320 | of what the objects are doing in relation to each other and to your robot, those details
00:28:10.240 | really matter. And in total, the model size is 35 million parameters, which is quite
00:28:17.600 | a bit smaller than a lot of these other, you know, huge internet scale models. And finally,
00:28:23.720 | one main difference here is action discretization. Before, a lot of the projects we were doing
00:28:28.680 | were doing continuous control. And if you think about it, right, our robot does
00:28:32.640 | end-effector pose control, position control. And there, the real world is a continuous
00:28:37.260 | state space. But, um, to do that, we had to come up with many algorithmic novelties,
00:28:43.080 | for example, a CEM actor that did basically sampling of these continuous action spaces
00:28:48.440 | to propose the best candidates, which would get rated by the Q function. And we do this twice,
00:28:52.280 | blah, blah, blah. But that's so sensitive, and we needed to do that to
00:28:55.840 | get things to work. But now we just decided, let's just, you know, bin our actions into
00:29:00.560 | 256 discrete bins per dimension, and let's just predict those as tokens. Um, any questions?
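(As a rough illustration of the binning idea just described, here is a minimal sketch; the dimension names, ranges, and helper functions are assumptions for illustration, not RT-1's actual tokenizer.)

```python
import numpy as np

NUM_BINS = 256  # each continuous action dimension becomes one of 256 tokens

# Hypothetical ranges for an end-effector delta pose plus gripper command.
ACTION_RANGES = {
    "dx": (-0.05, 0.05), "dy": (-0.05, 0.05), "dz": (-0.05, 0.05),
    "droll": (-0.25, 0.25), "dpitch": (-0.25, 0.25), "dyaw": (-0.25, 0.25),
    "gripper": (0.0, 1.0),
}

def discretize(action: dict) -> list:
    """Map each continuous action dimension to an integer token in [0, 255]."""
    tokens = []
    for name, (lo, hi) in ACTION_RANGES.items():
        x = float(np.clip(action[name], lo, hi))
        tokens.append(int(round((x - lo) / (hi - lo) * (NUM_BINS - 1))))
    return tokens

def undiscretize(tokens: list) -> dict:
    """Map predicted tokens back to continuous values (bin centers)."""
    return {name: lo + tok / (NUM_BINS - 1) * (hi - lo)
            for (name, (lo, hi)), tok in zip(ACTION_RANGES.items(), tokens)}

# The transformer then just predicts one token per action dimension with a
# categorical cross-entropy loss, instead of regressing continuous values.
```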
00:29:06.360 | Yeah, what I was going to ask is, so you're mentioning that you have this design required
00:29:11.620 | or engineering requirement about speed and latency reaction. And then you say that that
00:29:16.060 | necessitates having a relatively small model, which makes sense. But one message of scaling
00:29:20.540 | when we're talking about foundation models is that we don't want to be bottlenecked by
00:29:23.380 | either data compute or parameters. So I guess what I'm curious to know is how do you balance
00:29:27.900 | these off in the sense that you want to have lots of parameters to have a really powerful
00:29:31.580 | model, while on the other hand, you want to have very fast inference.
00:29:34.660 | Yeah, great question. And to repeat it, the question is, um, we kind of set a pretty hard
00:29:39.560 | constraint with a hundred millisecond inference time. Yet a lot of the lessons in foundation
00:29:43.840 | modeling are that you shouldn't be constraining yourself along any dimension, whether it's
00:29:47.260 | dataset size, compute, or model capacity. And I think my initial answer to that is that's
00:29:52.440 | a very great point and something I think that's going to be coming up as a severe bottleneck
00:29:57.000 | in the future. But for our initial case, I think this is more of an exploration of
00:30:01.080 | whether these principles work already, even before scaling well beyond what we're looking at now.
00:30:05.440 | This 35 million is gigantic compared to a lot of prior work using, for
00:30:10.800 | example, a ResNet-34 or whatnot. So this is already much bigger than, you know, a lot
00:30:15.240 | of other options. And maybe for now, at least it's the easiest, it's the largest scale we
00:30:21.040 | could go to roughly in the short term without having to think of more tricks.
00:30:25.200 | Yeah, we can talk about it a bit later, maybe. I think I'd also love to hear your thoughts
00:30:33.040 | too, because it's very non-obvious how we can get past some of these bottlenecks.
00:31:00.160 | Yeah, great question. We ran some ablations on model size. I might have that in a few
00:31:06.040 | slides, but maybe we can return to that then. And if not, I can, yeah, but great question.
00:31:15.040 | So yeah, that's the architecture and I'll discuss some of the ablations and the trends
00:31:18.720 | later on, but maybe, you know, this is a robotics lecture, I should show you some pretty visuals,
00:31:23.920 | right? So let's look at some evaluations we did. We compared against some baselines. One
00:31:28.880 | is Gato, which you might be familiar with. And then the other one is BC0, the ResNet
00:31:34.560 | based one. We evaluate seen tasks versus unseen tasks. And we also add
00:31:40.320 | in various distractor objects. Our normal data collect looks like this top left picture,
00:31:45.160 | three cans on a gray desk, that's basically it. But then we push it further by bringing
00:31:49.960 | in a lot more objects so that the table is so cluttered that even as a human, sometimes
00:31:54.040 | it's hard to find the object that you're actually looking for. We add in tablecloths,
00:31:58.800 | we make the textures very different. We bring it to new micro kitchens with new surfaces
00:32:03.120 | all together. And we find that RT1 is more robust than these other different methods.
00:32:08.200 | Yeah.
00:32:09.200 | [inaudible]
00:32:10.200 | Good question. The question was, was the Gato model trained on our data or was it just already
00:32:25.200 | included in Gato? The answer is this data was not included in Gato. And so we retrained
00:32:29.560 | the Gato model only on our data. Yeah. And yeah, so here's just a different visualization
00:32:35.680 | of the robot going out in our micro kitchen and doing different interesting things. You
00:32:39.760 | can see here that it's trained on one setting, but then it goes into brand new kitchen, brand
00:32:44.400 | new countertops, new objects, and it's able to do all of them pretty robustly. We also
00:32:49.520 | put it into a long horizon setting using the SayCan framework that we'll talk about next.
00:32:56.360 | But in these settings, a lot of them are mixing all of these generalization capabilities.
00:33:00.920 | And on the plot on the left here, we're using what we call generalization levels inspired
00:33:04.600 | by the VIMA paper that would basically increasingly change more factors of variation simultaneously.
00:33:10.400 | And here we found RT1 is the most robust.
00:33:14.080 | Yeah, good question. We'll go into a bit more detail later, but I think at a high level,
00:33:25.360 | teleoperators get a structured, templated command of like verb-noun, something
00:33:30.400 | like pick Coke can or move apple near sponge. And we have around 700 tasks set up this way,
00:33:38.400 | and they go ahead and collect that data, task done. And then later we make sure
00:33:43.240 | that successes are actually successes and we discard stuff that's unsafe, for example.
00:33:48.000 | Oh yeah, I got it. For this paper, we utilize 130,000 demonstrations for this.
00:33:56.720 | Yeah, great question. I think a lot of prior work has also been done on this, but it's
00:34:13.120 | also noted that when you have, for example, the question was, did you find that the, the,
00:34:18.920 | the trajectories in your dataset were very multimodal. And I think what you mean by that
00:34:22.720 | is that to go from point A to point B, I can go left or I can go right, or I can go straight.
00:34:29.040 | And I think this kind of diversity in basically for a single image state, but yet my data
00:34:34.880 | has three possible labels that can have very bad effects sometimes. For us, I think because
00:34:40.120 | we are using teleoperator demonstrations, the data was more homogenous than perhaps
00:34:44.800 | like in the wild. For example, there's a type of data collection called play data, where operators
00:34:49.160 | just do whatever they want and we label in hindsight. And I think our data is more homogenous
00:34:53.120 | than that, but we did not find a lot of the issues that we've seen in prior projects.
00:34:57.760 | One potential answer is maybe it's, it's the, it's the architecture itself, but we can talk
00:35:03.160 | about that later too. Yeah. Question. Great question. We actually do have a termination
00:35:16.640 | action. So the question was, how do you determine when an episode
00:35:21.440 | is complete? The policy itself is able to predict terminate, because at the end of each teleoperation
00:35:26.800 | session, the operator can click a button and it's marked as episode done. Yeah, I think
00:35:39.720 | for these evaluations, we were quite strict, but definitely I think in some cases, you
00:35:44.760 | know, maybe, maybe if we're just doing an experiment for ourselves, we'll have a dense
00:35:48.800 | reward scale of like grasp the object and move closer, grasp the object and almost got
00:35:53.560 | there, but mess up at the end. And we'll have like a, a grading curve basically. But for
00:35:57.320 | all of these, all of these stats I'm showing here, it was zero or one, one fully complete
00:36:02.320 | zero was not fully complete. Yeah. Cool. And I think what was exciting, and maybe talking
00:36:10.600 | about the multimodality aspect, is that we then pushed the limit even further.
00:36:14.360 | We decided to train on very diverse data distributions. Okay. So
00:36:21.680 | right now you saw 130,000 demonstrations trained on this Everyday Robots proprietary mobile
00:36:28.760 | manipulator, but we were also looking to train on very different data distributions with
00:36:33.080 | very different, you know, action distributions, very different trajectories, even very different
00:36:37.000 | visuals objects tasks. And to do that, we included two other data sources. One was simulation
00:36:42.520 | data, which was kind of our robot but in sim, though it looked quite different. And also
00:36:47.220 | this data was collected with reinforcement learning and not with teleoperated demonstrations.
00:36:52.080 | In the past, with all of the IL plus RL work that I mentioned, we found that combining
00:36:56.920 | these two types of data was going to be very difficult, because RL data has very
00:37:01.920 | short, quick actions, very optimized for the specific reward function,
00:37:06.800 | versus human-collected teleoperation data, which is a lot more, you know, human-like, so to
00:37:11.960 | speak. And finally, we revived a data set from many years ago at 2018. If you remember
00:37:16.640 | the Kuka project, that arm farm has not been operational in that state for many years now,
00:37:21.000 | but we had that data still. And so we were hoping to see if a different robot with a
00:37:25.920 | different action space on different objects with different visuals in a different building
00:37:30.040 | could still be combined with data from this micro kitchen, a robot data set that we train
00:37:35.680 | on originally. And what was very surprising to me is that RT-1 was able to learn
00:37:40.280 | from all of these very diverse data distributions. I had never seen a result like this, where any
00:37:44.880 | other architecture, for example a ResNet, or even another learning method like reinforcement
00:37:50.180 | learning, could successfully learn on such different data distributions so robustly.
00:37:55.840 | And we evaluated, for example, on combining concepts. So we would have the original Everyday
00:38:01.000 | Robots robot pick up objects that were only seen in the KUKA project, or we would put
00:38:06.200 | objects only seen in simulation and see if our policy could understand that. So it did
00:38:10.080 | seem like it could generalize between objects seen in other datasets, and concepts that
00:38:14.080 | it had seen in other datasets, into the setting it was in now in the real micro kitchen. And
00:38:18.960 | that was a very fun result.
00:38:20.760 | I have a question. How did you combine the action spaces of the Everyday Robots robot with the KUKA?
00:38:27.160 | Great question. Yeah, we just tokenized them and made sure that the tokenization scheme
00:38:31.560 | was kind of interoperable, and I can dive into that in a bit later
00:38:36.840 | too. Yeah. And note that does not mean we can send the exact actions for one robot to
00:38:43.160 | another and have it execute. It was more just like in the data set, I think even by human
00:38:47.320 | inspect, you can tell that these are coming from two different robots.
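(A minimal sketch of what making two robots' action spaces interoperable through a shared token vocabulary might look like; the robot names and ranges here are made up for illustration, and the real data-mixing pipeline may differ.)

```python
import numpy as np

NUM_BINS = 256  # shared token vocabulary for all robots

# Hypothetical per-robot, per-dimension action ranges. Each robot's continuous
# actions are normalized into the same 256 bins, so trajectories from both
# datasets become comparable token sequences for one transformer.
ROBOT_ACTION_RANGES = {
    "everyday_robot": {"dx": (-0.05, 0.05), "dy": (-0.05, 0.05), "gripper": (0.0, 1.0)},
    "kuka":           {"dx": (-0.10, 0.10), "dy": (-0.10, 0.10), "gripper": (0.0, 1.0)},
}

def tokenize_action(robot: str, action: dict) -> list:
    """Map one robot's continuous action into the shared token vocabulary."""
    tokens = []
    for name, (lo, hi) in ROBOT_ACTION_RANGES[robot].items():
        x = float(np.clip(action[name], lo, hi))
        tokens.append(int(round((x - lo) / (hi - lo) * (NUM_BINS - 1))))
    return tokens

# Episodes from either robot end up as (image tokens, language tokens, action
# tokens) and can simply be mixed into the same training batches.
print(tokenize_action("kuka", {"dx": 0.02, "dy": -0.01, "gripper": 1.0}))
```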
00:38:51.680 | So yeah, let's look at some ablations for the scaling laws that we're all here for now.
00:38:56.040 | We found that, you know, reducing dataset size reduces performance. But more interesting
00:39:00.240 | maybe is task diversity was quite important. Here we have two different trends.
00:39:06.600 | The green line is what happens when you reduce the total amount of episodes per task. And
00:39:12.120 | then the purple curve here is for what happens when you reduce the total number
00:39:17.260 | of tasks. And we found that having more tasks is relatively more important than having more
00:39:23.080 | data for each task. And I think this was a lesson that I think is probably going to suggest
00:39:29.920 | ways that, you know, we should scale robotics even further is not to just collect more data
00:39:34.080 | of the same task in the same settings, but to go out into the wild and get more diverse
00:39:38.440 | behavior. How do you define diversity for data?
00:39:44.200 | Great question. Question is, how do you define data diversity? In this case, it's just a
00:39:49.040 | number of unique structured templated commands that teleoperators receive. So those 700 templated
00:39:54.760 | commands, when we start reducing them and only train on 500 or only train on 300 of
00:39:59.880 | them, performance drops much quicker than if we had taken the same proportional cuts
00:40:05.200 | to the total amount. Yeah, so I guess, it seems like there's almost
00:40:15.760 | a linear relationship between diversity and success rate. Yeah, the question was,
00:40:25.280 | there seems to be almost a linear correlation between data size and success rate. And I
00:40:29.240 | think, you know, we could apply some fancy, like, you know, scaling law, you know, trying
00:40:33.240 | to curve fitting, but we didn't look too much into that because, you know, this is a trend
00:40:37.800 | that we kind of expected. We just weren't sure about the magnitude of how much it would
00:40:41.640 | affect us. And I think I don't have any really good insights on this besides that we see
00:40:49.000 | this phenomenon empirically. Yeah. Yeah, and great question. So the question is, oh, maybe
00:41:08.880 | this will just go on indefinitely. Or is there something magical about, you know, this range?
00:41:12.760 | And I think this is maybe also where we start to conflate the algorithmic exploration
00:41:19.920 | with, like, the practical considerations of scaling real world operations, which was, when
00:41:23.960 | we got enough data, our policies were, you know, saturating, hitting close
00:41:27.320 | to a hundred percent. We were like, all right, let's collect another dataset. So
00:41:31.360 | we basically collect until it's at a hundred and then we switch to something else. But
00:41:35.240 | at this point, what was interesting is that when we kind of bet really big on this RT-1
00:41:39.480 | future, we'd already been collecting demos for a while. So it was possible that we had
00:41:43.600 | collected more than we needed. And in some cases, you could actually cut tasks without
00:41:47.680 | losing too much performance, which was quite interesting. But
00:41:50.200 | yeah, great question. And the question is whether or not
00:42:20.040 | all tasks are created equal in terms of like their capacity and entropy for different behaviors
00:42:24.120 | you could learn from them. And yeah, that's definitely true. Some tasks are much easier.
00:42:28.000 | We have a task that's just pick up this object. It's going to have much less interesting stuff
00:42:31.840 | you can squeeze out of it than, you know, moving something into a drawer and then closing
00:42:35.620 | the drawer. But yeah, great question. Great. Now ablations. We also trained without the
00:42:43.760 | big model size, without pre-training, with continuous instead
00:42:47.920 | of discrete actions, with autoregressive actions, without history, without the transformer.
00:42:53.720 | And I think all of these design choices did seem to be required for robust performance.
00:42:59.440 | Oh, yeah, of course. Yeah, I mean, like, and again, you know, for paper writing,
00:43:17.240 | it's kind of like the best thing that we can empirically find, that's the method,
00:43:21.560 | and then we'll figure out why each of these are important. And so, yeah, I think
00:43:25.080 | one surprising thing here, perhaps, was that autoregressive actions hurt. You might think
00:43:29.240 | that passing in more information is always better than passing in less.
00:43:33.760 | But in this case, maybe conditioning on your previous actions was doing something
00:43:38.480 | like in-context learning; it was doing online system identification to figure out which
00:43:44.120 | teleoperator this data came from, and then overfitting to that specific set of
00:43:48.680 | action history. And so removing that was actually better. One interesting tidbit there. Cool
00:43:56.880 | then. And maybe in the interest of time, I'll try to get through the other ones a bit
00:44:02.760 | quicker. And then maybe I'll just do the questions at the end, if
00:44:07.120 | that's possible, just so we have time to get through everything. The next work here, moving
00:44:11.700 | a bit away from skill learning, then and actually on to the planning level, I think the first
00:44:15.720 | project took a lot of the design principles of other fields, and this offline robot learning
00:44:20.400 | paradigm and put it into the skill learning. Can we actually bring that now to other parts
00:44:24.680 | of the robotic system? And the first work here is SayCan. If you remember here, back
00:44:28.640 | in this timeline, in 2022, we started thinking about, oh, yeah, how do we scale this multitask
00:44:34.440 | imitation learning, but at the same time, large language models and, you know, other
00:44:38.880 | types of foundation models are really picking up steam, whether it was Imagen or DALL-E 2.
00:44:44.440 | And we definitely wanted to figure out how we could use those as well. We had come up
00:44:47.880 | with this RT-1 design that we were betting big on. But from here, we started to explore
00:44:53.080 | how, à la bitter lesson 2.0, we could start utilizing foundation models within the context
00:44:58.280 | of our full stack system. The problem of doing this naively is that language models are not
00:45:04.880 | a completely natural fit for robotics. For example, if you're a robot in a kitchen,
00:45:09.900 | you ask a language model, I spilled my drink, what can you do? Language model will give
00:45:12.960 | you stuff that's not very relevant. It's going to ask you to vacuum it, it's going to ask
00:45:16.560 | you to call a cleaner, or it's going to apologize. And these are not things that the robot can
00:45:20.640 | do in your kitchen with your spilled drink to help you. And so there are two parts of
00:45:25.920 | this then. The one issue is that our robots are limited. They are very constrained with
00:45:31.480 | what they can do. They cannot do everything, but they can do certain things. And then the
00:45:35.600 | second problem is that the language models are also constrained. They don't know what
00:45:40.560 | the robot sees. They don't understand that they are in a robot body in a micro kitchen
00:45:44.760 | needing to do real stuff in the physical world. And so we need to get the robots to speak
00:45:50.240 | language model language, and then the language model to speak robot language. To do this,
00:45:55.440 | we present SayCan. In the same setting, please put an apple on the table, we score the predictions
00:46:02.600 | of the language model on a constrained set of tasks that we know the robot has been trained
00:46:07.240 | to do. And then we also take the affordance function from the robot. An affordance function
00:46:11.120 | is an estimation of, given some kind of state, what the robot is able to do, how confident
00:46:17.480 | it is that it can successfully accomplish that task in the given state. In our case,
00:46:21.480 | we use something like a value function from reinforcement learning, which kind of encompasses
00:46:24.800 | this quality. Given these two values, these two scores, we have the confidence from a
00:46:28.760 | language model, and then the confidence from the robot. We can combine these, and then
00:46:33.080 | hopefully the combined prediction is both something that's going to be very semantically
00:46:37.000 | relevant for the high level instruction; finding an apple is a good first step for please put
00:46:41.400 | an apple on the table. But it's also something that the robot can do. There's no robot in
00:46:44.960 | the frame, but it knows that it's been trained to find an apple, so it can navigate around
00:46:48.800 | to find it. And so hopefully we can do this then in closed loop, and then keep on going
00:46:53.100 | and predicting a high level plan from the language model that's grounded with the affordance
00:46:57.160 | function of what the robot understands. There's a video here of SayCan doing different
00:47:04.120 | stuff, but happy to share it later offline. It's very cool, trust me. It's the greatest
00:47:09.420 | thing since sliced bread. And yeah, some numbers then. We tested this out on very long horizon
00:47:20.800 | instructions encompassing more than 10 separate navigation and manipulation skills in the
00:47:26.080 | micro kitchen that you see on the bottom right. We ran hundreds of different evaluations
00:47:32.460 | on this, and we tested out a lot of different concepts, including things like rephrasing,
00:47:37.300 | using single primitives, and drawing on instructions that just came from colleagues and friends.
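(To make the scoring step described a moment ago concrete, here is a minimal sketch of SayCan-style skill selection; the skill list, scoring functions, and combination are hypothetical stand-ins, not the actual implementation.)

```python
import numpy as np

# Hypothetical set of skills the low-level policies were trained on.
SKILLS = ["find an apple", "pick up the apple", "go to the table",
          "place the apple on the table", "done"]

def llm_score(instruction: str, history: list, skill: str) -> float:
    """Stand-in for the LLM's log-likelihood that `skill` is the next step
    of a plan for `instruction`, given the steps executed so far."""
    raise NotImplementedError  # would query a language model in practice

def affordance_score(observation, skill: str) -> float:
    """Stand-in for the robot's value function: probability that the skill
    can be completed successfully from the current state."""
    raise NotImplementedError  # would query a learned value function

def next_skill(instruction: str, history: list, observation) -> str:
    # Combine semantic relevance (LLM) with feasibility (affordance) and
    # pick the skill that maximizes the combined score.
    combined = [llm_score(instruction, history, s) +
                np.log(max(affordance_score(observation, s), 1e-6))
                for s in SKILLS]
    return SKILLS[int(np.argmax(combined))]
```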
00:47:43.980 | And then we found that while there were failures in both the language model planning side,
00:47:49.420 | where it would predict the wrong path for the current situation, as well as on the policy
00:47:53.000 | execution side, even when it gets a good plan, the robot will mess up sometimes. Overall,
00:47:57.440 | it was still doing quite well. And now let's kind of take this back to the lesson. I think
00:48:05.520 | this is a very great example of how we can leverage internet scale foundation models
00:48:11.320 | as they get better. When we started the project, we started with a language model called Flan
00:48:15.280 | from Google. Throughout our implementation, PaLM came online, the Pathways Language Model.
00:48:21.040 | And when that happened, we were able to just hot swap it in, and performance just kind
00:48:25.860 | of got better for free without us having to do anything. By just assuming that language
00:48:30.100 | was the API, the plan just has to be any string. It can come from any source. It can come from
00:48:34.440 | a human. It can come from a language model. When we improve that language model, the system
00:48:38.340 | gets better overall. And here you see, with the scaling sizes, as the LLM increased
00:48:43.540 | in size, our planning performance got even better. And some cool tricks here to get it
00:48:52.460 | working. Well, how do we actually produce this plan? Just by prompting, as is
00:48:57.020 | the rage these days, with chain of thought and with better prompting of just giving examples
00:49:01.540 | of here are some great robot plans. Now give me a new plan starting with this high-level
00:49:06.860 | instruction. We saw that the robot could do everything from understanding different languages
00:49:11.660 | to asking them to do very complex reasoning, like, hey, give me something caffeinated,
00:49:16.780 | or I don't do caffeine anymore, get me something better. Or bring me a healthy snack
00:49:21.620 | versus bring me an unhealthy snack. SayCan was able to reason through all of these.
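As a rough sketch of the scoring step described above (and not the team's actual code), the idea is to multiply the language model's likelihood of each candidate skill by the robot's affordance estimate and pick the best-scoring skill. Here `llm_prob` and `affordance_value` are hypothetical stand-ins for the language model and the learned value function:

```python
from typing import Callable, List

def plan_next_skill(
    instruction: str,
    steps_so_far: List[str],
    skills: List[str],
    llm_prob: Callable[[str, str], float],      # P(skill text | prompt) under the language model
    affordance_value: Callable[[str], float],   # value-function estimate that the skill succeeds from the current state
) -> str:
    """Pick the skill whose combined semantic and affordance score is highest."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Steps so far: {', '.join(steps_so_far) or 'none'}\n"
        "Next step:"
    )
    # Semantic usefulness (from the LLM) times feasibility (from the robot).
    scores = {skill: llm_prob(prompt, skill) * affordance_value(skill) for skill in skills}
    return max(scores, key=scores.get)
```

Called in a loop, appending each chosen skill to `steps_so_far`, this gives the kind of grounded, step-by-step plan described above.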
00:49:29.380 | I think that was our kind of the first contact of robotics with language models on our team.
00:49:34.520 | And it was the first exploration into how these two worlds could overlap. There was
00:49:38.820 | definitely still room for improvement, though. And in Inner Monologue, we tried to improve those
00:49:41.940 | further by bringing in vision language models. The idea here is that we had a very high plan
00:49:49.180 | success rate with SayCan. But unfortunately, it wasn't really able to recover from failures.
00:49:54.980 | What I mean by that is that the language model would not really get updates of what was going
00:49:59.040 | on in the world, so that if this was the plan it proposed, go to the table, pick up a Coke,
00:50:03.140 | bring it to you, but you messed up picking the Coke can. You dropped it on the floor.
00:50:07.000 | It would still continue trying to bring it to you, put it aside, but all of that does
00:50:10.020 | not really matter anymore because you dropped the Coke can. And so in this work, Inner
00:50:15.300 | Monologue, we were really hoping to figure out how we could add closed-loop dynamic feedback
00:50:20.180 | from the environment into this planning process. Let's take that exact same example. Now, instead
00:50:27.240 | of just directly predicting every instruction, maybe we add back some feedback from the scene,
00:50:31.980 | also conveyed using language as the universal API here. The scene can tell you what's actually
00:50:36.980 | in there. Maybe the robot asks a question now. For the robot, this is a language model
00:50:41.660 | asking a clarification question. Maybe here, a human responds, or another language
00:50:45.380 | model. Then you can predict the action or the next task to do once the language model
00:50:49.860 | has enough context. And maybe you even add in stuff like success detection and so on
00:50:54.660 | and so forth. How do we do this then? Well, the first thing that we implement is what
00:51:00.700 | we call passive scene description. Just using either an engineered heuristic or off-the-shelf
00:51:06.060 | object detection models, something like ViLD, you can describe the scene in text
00:51:10.980 | and just convey all that context to the language model.
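As a minimal sketch of what passive scene description might look like, assuming a hypothetical detector wrapper that returns (label, confidence) pairs (something like ViLD would fill that role in practice):

```python
from typing import List, Tuple

def describe_scene(detections: List[Tuple[str, float]], min_conf: float = 0.4) -> str:
    """Turn open-vocabulary detector output into a plain-text scene description for the LLM."""
    visible = sorted({label for label, conf in detections if conf >= min_conf})
    if not visible:
        return "Scene: nothing relevant detected."
    return "Scene: I can see " + ", ".join(visible) + "."

# Example: describe_scene([("coke can", 0.9), ("apple", 0.7), ("sponge", 0.2)])
# -> "Scene: I can see apple, coke can."
```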
00:51:15.420 | For active scene description, this is maybe similar to visual question answering if you're
00:51:18.780 | familiar with that field. The language model can actually propose active queries that it's
00:51:24.740 | curious about in the scene, maybe to make sure that it has enough context to move on.
00:51:28.940 | And here, either a human can provide the answer, or in the future, a VQA model as they improve
00:51:34.100 | can provide that. And finally, for success detection, this is
00:51:38.340 | very important to allow the language model planner to know when to try to retry something.
00:51:43.940 | Here we take in the first and last image, fine tune a clip success detector, and use
00:51:48.420 | that to provide binary success/failure information back to our language model.
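Putting those feedback sources together, a minimal sketch of the closed loop might look like the following; `detect_objects`, `llm_next_step`, `detect_success`, and the `robot` interface are all hypothetical placeholders rather than the team's actual APIs (the real success detector is the fine-tuned CLIP model just described):

```python
def run_inner_monologue(instruction: str, robot, max_steps: int = 20) -> None:
    """Closed-loop planning: scene descriptions and success feedback are appended as text."""
    transcript = [f"Human: {instruction}"]
    for _ in range(max_steps):
        # Passive scene description: feed back what is currently visible, as text.
        transcript.append(describe_scene(detect_objects(robot.image())))
        # The language model plans the next skill given the full text history.
        step = llm_next_step("\n".join(transcript))
        if step.strip().lower() == "done":
            break
        transcript.append(f"Robot: {step}")
        first_image = robot.image()
        robot.execute(step)
        # Success detection closes the loop, so the planner knows whether to retry.
        succeeded = detect_success(first_image, robot.image(), step)
        transcript.append(f"Success: {'yes' if succeeded else 'no'}")
```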
00:51:56.620 | And results-wise, we can see a very similar SayCan long-horizon evaluation, but
00:52:02.340 | here what's interesting is that we're able to basically implement all these different
00:52:08.340 | automated feedback mechanisms on the robot, so that it's able to reason and recover
00:52:13.440 | from things. Here you see it's going to try to go to the
00:52:17.660 | table, but the human actually says, "Hey, I changed my mind." And then the human
00:52:23.300 | changes their mind again, asking it to go back and forth. And the robot's able to, maybe we're
00:52:27.720 | kind of torturing the language model at this point, but the language model's able to replan
00:52:31.300 | and make sure that the human intent is satisfied. We also tried, I'm not sure if this video
00:52:44.780 | shows it, but situations where we did adversarial inputs, where I walked around and just
00:52:49.620 | knocked objects out of the robot's hands, forcing the success detector to tell it,
00:52:49.620 | "You messed up, try again." And we also tried this out on a couple of different domains,
00:53:00.140 | a simulated tabletop manipulation domain, as well as a real-world manipulation domain,
00:53:04.200 | and we found that this was much better than SayCan, or let's say just only using visual
00:53:10.200 | features themselves with something like CLIPort. And I think here, it really speaks towards
00:53:17.600 | a trend that I've really come to appreciate. In 2018, a robotics professor once said that
00:53:23.200 | when they looked at all the different things preventing robot learning from scaling tremendously,
00:53:27.160 | they thought the bottleneck was high-level semantic planning, about reasoning, about common sense.
00:53:31.460 | And I think in 2022 and 2023, language models can provide one path for how this can kind
00:53:38.020 | of be offloaded, at least in the interim. And I think if language models are the API,
00:53:43.820 | then you can just bring in these vision language models as object detectors get better, as
00:53:48.240 | success detectors, as VQA, as language models get better, you can bring them all into the
00:53:51.980 | fold and they act as kind of a life vest. If your robot currently does not have common
00:53:56.820 | sense reasoning, these other models can act as a scaffold and a life vest to bring you
00:54:01.140 | up to par with what they currently know. And maybe then in the future, you'll get beyond
00:54:05.620 | what the language models know, but in the short term, it does seem that we can leverage
00:54:08.620 | them to accelerate what we can do in the real world. Moving on now: we saw how
00:54:15.840 | language models can do planning. We saw how vision language models can help planning.
00:54:19.860 | And now we're going to switch gears a bit and think about how vision language models
00:54:22.980 | can help other aspects of the bottlenecks that robot learning faces. One of these is
00:54:29.140 | that data collection is very expensive. As we mentioned before, we did have this 130,000
00:54:36.140 | demonstration data set, but it was collected over a year and a half at significant cost,
00:54:42.180 | both in resources and time and money and with many, many robots. And of course, these tasks
00:54:49.140 | too were also a bit limited, right? We use 700 very templated commands, instructions
00:54:55.020 | that we give to teleoperators, because we knew that this would scale, right? If we collected
00:55:00.140 | enough data for each of these templated tasks, we could do that specific task. And here's
00:55:05.420 | the flow that someone was asking about earlier. We give this "pick Coke can" instruction, the operator
00:55:09.820 | controls the robot in the real world, finishes the task, marks the episode as terminated,
00:55:14.620 | and then that gets saved out to this big orange data set. And that big orange data set is
00:55:18.580 | what we trained on in all of the previous projects for the control policies. What we
00:55:23.060 | additionally considered was adding a bit of crowdsourced hindsight annotation. If you're
00:55:27.420 | familiar with hindsight experience replay in reinforcement learning with goal
00:55:31.340 | conditioning: you know, maybe the robot did something that wasn't just this high-level
00:55:36.500 | template instruction. We could ask a human to describe more verbosely what the robot
00:55:41.500 | did. Maybe it picked up the Coke can that was on the right side of the table. Maybe it picked
00:55:45.060 | it up and then knocked it over. Maybe it moved it very slowly to the middle. There's a lot
00:55:49.500 | of semantic diversity encompassed in this demonstration that is not totally caught by
00:55:56.460 | this high-level templated "pick Coke can" instruction. So we labeled 3% of this big orange data set
00:56:02.580 | with these very verbose descriptions. And next, we kind of applied the pseudo-label
00:56:09.420 | strategy that's been seen in other fields, such as video pre-training with their inverse
00:56:13.860 | dynamics model. But instead, we apply that to the instructions, to the semantics of what's
00:56:18.700 | contained in your data set. So step one, we fine-tune a CLIP model on your small labeled
00:56:25.380 | data set of 3% of your main data. Then you go ahead and use that trained VLM to label
00:56:32.420 | all of the templated instruction demonstrations that you had before in that 130,000-episode data
00:56:38.540 | set. Now you have a relabeled data set, which has a large diversity of interesting
00:56:43.380 | semantic instructions. And then we plug in all of these data sets into RT1 and just train
00:56:50.140 | a language-conditioned behavior cloning policy, similarly to how we would normally. But even
00:56:55.740 | though normally we just use data set B, the orange one, now we use all three data sets.
00:57:01.540 | And then finally, we evaluate on entirely new unseen instructions. In the prior works,
00:57:08.340 | we were evaluating mainly on the 700 templated instructions. But in this work, we actually
00:57:13.100 | go beyond that. We can type in almost anything you want that you think might succeed. And
00:57:18.620 | you can phrase it however you want. You can add typos. You can even do it by referring to
00:57:23.340 | semantic concepts. You can add spatial concepts. And we see how it does. The reason that this
00:57:30.540 | might work, maybe visually to represent this, is here are the t-SNE embeddings on the left
00:57:36.300 | and the right. It's the same embeddings. But on the left, they're colored by the original
00:57:41.340 | templated instruction that was used to collect that episode. And on the right is what the
00:57:47.460 | vision language model thinks. If it's allowed to put a free form natural language caption
00:57:52.700 | and assign it to that episode, you see that on the left, you have these big clusters of
00:57:56.660 | "pick Coke can": there are, you know, hundreds or thousands of episodes, but we just call
00:58:00.620 | them all "pick Coke can." On the right, we can then expand those concepts and say, actually,
00:58:04.860 | this episode is picking up the red Coke can. This episode is picking up the crumpled Coke can.
00:58:10.320 | This is picking up the Coke can that's next to the chip bag. And so you can get a lot
00:58:14.160 | more mileage out of the same underlying data set by just using language as the diversity
00:58:19.140 | mechanism through which you kind of expand the concepts that you're considering. And
00:58:23.300 | for example, in the middle, you see, you know, open top drawer can become hold and pull out
00:58:27.500 | the top drawer. We have stuff like that in the center left for the middle episode, and
00:58:32.740 | for the bottom one, "pick green rice chips from white bowl" becomes "lift up the green
00:58:36.500 | chip bag from the bowl and drop it at the bottom left corner of the table." So you get
00:58:39.860 | a lot of these semantic, you know, spatial concepts that are now going to be in your
00:58:44.080 | target supervised labels.
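A minimal sketch of that relabeling step (not the actual pipeline): assume a CLIP-like scoring model fine-tuned on the ~3% of human-annotated episodes, exposed here as a hypothetical `score_caption(frames, caption)` function, plus a pool of candidate free-form captions; `episode.frames` and `episode.with_instruction` are likewise hypothetical.

```python
def relabel_episodes(episodes, candidate_captions, score_caption, min_score=0.0):
    """Attach the best-scoring free-form caption to each templated episode."""
    relabeled = []
    for episode in episodes:
        # Score every candidate instruction against the episode's frames.
        scored = [(score_caption(episode.frames, caption), caption) for caption in candidate_captions]
        best_score, best_caption = max(scored)
        # Some label noise is tolerable, but skip captions the model is very unsure about.
        if best_score >= min_score:
            relabeled.append(episode.with_instruction(best_caption))
    return relabeled
```

The relabeled episodes are then mixed with the original templated data and the small human-labeled set to train the language-conditioned policy as usual.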
00:58:45.880 | I have a question. Yeah. Great question. So I guess if I can rephrase a bit, the problem
00:59:15.740 | is that, like, it's actually a very difficult and perhaps even intractable problem of how
00:59:20.180 | you map all the linguistic concepts you see out in the wild down to like, maybe like embodied
00:59:25.060 | specific types of episodes. And like, here, maybe I would say is that we are definitely
00:59:30.180 | introducing a lot of our priors and our biases onto, like, what we call "left": do you
00:59:35.980 | mean left 10 centimeters or two centimeters? Like, what do words mean? And these
00:59:41.420 | definitions, what do they mean to us, to the crowd compute raters that generated these
00:59:45.980 | captions? What do they mean to the robot? What do they mean to the language models?
00:59:48.860 | Maybe these are all slightly different, but the hope is at least if they're roughly similar,
00:59:54.180 | we'll get, like, directionally correct improvements. So I would say the nuances of these specific
00:59:59.660 | hard definitional lines and the actual, you know, semantic meaning of these
01:00:04.900 | words, I think that's maybe out of scope right now, but maybe something we'll dive into further
01:00:10.500 | at a higher level, though. I think basically the bar is just so low. We have the 700 templated
01:00:14.940 | instructions that are basically one-hot IDs, and we just want to make those closer to natural
01:00:20.100 | language, even if by a little. And I think at least we're trying to get
01:00:25.500 | towards that with these vision language models that are captioning automatically. Hope that
01:00:30.180 | answers your question. And we also compare it to a few baselines on the top left here.
01:00:37.820 | We look at what if we only train on this 3% of these fancy human rated labels? What if
01:00:43.580 | we only train on the original RT1 data sets? What if we train on both of these? And what
01:00:48.820 | if we train on both of these plus all of the predictions given by our VLM? And what's interesting
01:00:54.120 | here is that, you know, relabeling seems to universally help. We evaluated only on novel
01:01:01.540 | instructions, which was new for this project. It's the first time on a robotics project
01:01:05.020 | where we only tested on sentences I typed in: whatever I thought of, I'd type it in. And
01:01:08.780 | that became the test set. And we just had to make sure that it was never contained in
01:01:13.300 | the training coverage. And you see all these interesting examples on the right here of
01:01:17.700 | stuff like move the lonely object to the others. I have no idea how this works. Stuff like
01:01:23.580 | you know, lifting the yellow rectangle, talking about colors, talking about move the right
01:01:27.700 | apple to the left. Here, we actually had two apples in the scene. And actually in our training
01:01:32.420 | demonstration data, we never collected scenes with duplicate objects, just because, you
01:01:37.180 | know, we thought of this multi-modality problem. If you just say pick Coke can and there are two Coke cans,
01:01:41.100 | it's going to be very difficult to figure out which one to do. But with language labeling,
01:01:44.860 | it seems like maybe we could do that now. So even though we never trained on scenes
01:01:47.900 | of two apples, now you can evaluate on them and just specify with language, which apple
01:01:52.180 | you want to go for. And it was working pretty reasonably. And finally, for the last example
01:02:03.500 | here, I thought it was kind of interesting. With a single Coke can, we try a novel behavior.
01:02:09.180 | "Push towards the left" was not a templated instruction. We only had move Coke can near
01:02:15.260 | Y, where Y is another object: move Coke can near apple, move Coke can near sponge. So pushing,
01:02:20.340 | this motion of just pushing the Coke can into open space essentially, was not something that we
01:02:20.340 | ever encompassed, but maybe it was in one of the labels. Maybe like if you've seen like
01:02:29.060 | move Coke can near apple and the apple's on the left, and you saw move Coke can near sponge
01:02:32.620 | and the sponge is on the left, the model can generalize and be like, oh,
01:02:32.620 | left means this side of the table, not a specific object. So maybe that's what's happening,
01:02:37.380 | but it's very unclear. This is, as I said, you know, just, I type, I thought of something,
01:02:42.380 | I typed it and just saw what happened. And we definitely hope to explore this more quantitatively
01:02:46.980 | in the future. Bottom left, of course, is I think comparing against non-visual augmentation.
01:02:51.980 | So maybe you can also get these interesting concepts just from language alone, right here.
01:02:56.380 | We tried adding random noise, or Mad Libs-style swapping out of words, or we even
01:03:01.260 | used an LLM, GPT-3 in this case, to propose rephrasings of existing instructions. But I think my takeaway
01:03:07.620 | there is that you really need visual grounding for the visual language model to say, actually,
01:03:12.260 | yeah, this caption is factually accurate at this given point in time. And that it's, you
01:03:17.300 | know, something perhaps that would be interesting for a robot. That fine-tuning process provides
01:03:22.300 | both of those. Yeah, yeah, definitely. These are just some subsets of five of these evaluation
01:03:37.860 | instructions, but we had over 60 of them. We didn't do a full quantitative ablation,
01:03:42.620 | for example, as we did in RT1. We had this like seen and unseen task set, and that was
01:03:47.900 | compositional. You would see, you know, move Coke near Apple, and you would see move Apple
01:03:52.500 | near sponge, but we'd hold out, move Coke near sponge, and we would test that out. But
01:03:56.420 | in this case, I think we can go much more beyond that. Because our language is completely
01:03:59.860 | freeform, the compositional space of what you can kind of combine is just going to be
01:04:04.700 | much larger. So we did try a little bit to answer your question. We tried some combinatorial
01:04:08.860 | evaluations, but there's definitely a lot more thoroughness that we could do there,
01:04:14.140 | too. How am I doing on time? Okay, 10 minutes. Maybe I'll try to wrap up pretty soon, then.
01:04:20.180 | The DIAL takeaway, then, has two parts, right? Lesson two, leverage foundation models.
01:04:25.260 | Let's use them as data augmentation. And lesson three, let's make sure that our offline data
01:04:29.460 | set, you know, is robust enough where these different behaviors exist, and you can describe
01:04:34.500 | them in language. If you don't have enough diverse behaviors, no matter how good your
01:04:37.940 | labeling is, you probably can't elicit all of the interesting concepts that you want
01:04:41.540 | to learn from. And maybe most exciting for me here was that actually some label noise
01:04:46.660 | is okay. Notoriously, in supervised learning and imitation learning, you need very clean
01:04:50.980 | labels that are always 100% true, right? You don't want to be learning from, like, noisy
01:04:55.540 | data where some, like, you know, large percentage is just not accurate. But in our case, it
01:05:00.180 | seems that, like, some label noise was okay. The vision language model was not always predicting
01:05:06.340 | factually accurate descriptions of the scene. And I think this definitely hurt when the noise
01:05:12.140 | got too high, but at smaller levels, it definitely still seemed to be okay and
01:05:17.060 | robust enough to handle that. So, that was a deep dive, then, on some individual works
01:05:23.560 | that use this big recipe of language, foundation models, offline data sets in different parts
01:05:29.460 | of the robot system. And this was the kind of pitch at the beginning, and I hope you
01:05:35.260 | at least see a little bit of how our team has tried to take these principles and apply
01:05:39.580 | them to accelerating robot learning in the real world. As we see these different types
01:05:44.660 | of ingredients and lessons map onto different parts of the robot system altogether. For
01:05:50.300 | skill learning, right, that was RT1 that we talked about. For planning, that was SayCan,
01:05:54.260 | and then adding the closed-loop feedback with vision language models, that was Inner Monologue.
01:05:58.580 | For low-level control, we didn't talk about this today, but an exciting work from our
01:06:01.700 | team is actually using language models to predict code that's executed on the robot
01:06:06.180 | directly, perhaps as low-level controllers. Language models, you know, they read textbooks,
01:06:11.100 | they've read ROS docs, they've read, you know, UR5 documentation and code, and they can
01:06:14.860 | write code for these robots, and we can execute that. For data augmentation, we saw DIAL with
01:06:20.100 | vision language models. And also, I didn't talk about this here, but for object-centric
01:06:25.100 | representations, for things like feature activation maps for specific objects, we can use those
01:06:29.820 | as task representations for mapping a scene. And in NLMap, they did that for object-centric
01:06:36.100 | navigation around the micro kitchen that we looked at. And I think, hopefully, in the
01:06:41.500 | next, you know, coming weeks and months, we have a few more rows and entries to add here
01:06:45.780 | as well, but I think this kind of mindset is a very exciting research direction of how
01:06:52.020 | you can apply these big high-level concepts about foundation models and offline data sets,
01:06:56.380 | when you look at what exists in the robot systems of today, and you find many gaps and
01:07:00.700 | opportunities still available where we can do everything from exploratory pilots on how
01:07:05.880 | this might look, all the way to more extensive evaluations and really building
01:07:09.460 | out robust systems. I think both of these have value. So, I'll conclude with just saying
01:07:16.380 | that it was very fun exploring all of these complementary directions, but there are still
01:07:21.160 | some major questions of how we can take these concepts even further, and how these trends
01:07:25.680 | and ideas might even evolve moving forward as foundation models get better, as more data
01:07:30.380 | sets become available online, as more data becomes homogenized and tokenized and interoperable.
01:07:36.120 | And I think a lot of the concepts from other fields, like linguistics and vision, and from,
01:07:40.660 | you know, all of the big scaling kind of level questions that are being pioneered in language-based
01:07:46.580 | foundation models, hopefully, those kind of ideas can trickle down to robotics. Maybe
01:07:50.660 | even robotics can provide something back by providing embodied action causal data sets
01:07:55.380 | that maybe might improve the quality of reasoning of some of these large language models that
01:07:59.820 | are not embodied. With that, though, I guess I'd like to, you know, thank everyone for
01:08:05.520 | your time and for Dave and Sia for inviting me, and open to any questions about the papers
01:08:11.000 | or just at a high level as well. Thanks so much.
01:08:36.020 | Yeah great question. So the question, I guess, is like, what about tasks that require more
01:08:39.560 | semantic reasoning, like, you know, operating at a certain speed or with maybe like, I don't
01:08:44.400 | know, numerical reasoning within the question, the prompt itself. I would say, so for a lot
01:08:50.040 | of the more common sense reasoning, like, you know, throw away three Coke cans, one
01:08:55.880 | after another, I think, you know, the language model is very good at that right now. So for
01:09:00.160 | the SayCan planner, it will predict, you know, throw away the Coke can three separate times.
01:09:05.260 | For the low-level skill policy learning, though, I think that's more high
01:09:11.120 | variance, I would say. And definitely for right now, we don't really condition on speed
01:09:16.680 | or how you do it exactly. But that's definitely maybe something I could do if you could relabel
01:09:22.420 | with, like, pick up the Coke can slowly versus pick up the Coke can quickly. Maybe that is
01:09:26.840 | something a vision language model could recognize.
01:09:55.660 | The question was, at what scale do we see like combinatorial generalization start to
01:10:00.700 | occur, maybe between like, you've seen colors of one block, and then you want to evaluate
01:10:04.820 | on a new color? And I think that's a great question. And unfortunately, my answer is
01:10:09.060 | going to be very vague. And it depends. It depends on how you define your tasks. It depends
01:10:13.240 | on the scale of your data set. And it depends on like, the concepts that you're trying to
01:10:16.400 | generalize across. I think there have been numerous attempts to kind of basically formalize
01:10:22.440 | what it means to generalize within, you know, learning and within robotics, even within
01:10:26.960 | like the specific settings we consider. And I don't think there are any clear trends
01:10:31.600 | where you can say, oh, yeah, this is the number I need to hit where, you know,
01:10:35.480 | I can generalize across x, y, z dimensions. Like, you could evaluate all those, but I
01:10:39.680 | don't think it will help you predict new trends, at least right now. I think we're probably,
01:10:43.160 | you know, this is just me talking, I would say we're one order of magnitude off before
01:10:47.480 | we can start to make very broadly generalizing statements about generalization capabilities.
01:10:53.680 | I think, you know, add one or two more zeros to our data set size, and we can start to
01:10:57.360 | talk about that in terms of tasks, objects, and skills. Yeah.
01:11:18.440 | Yeah, very astute observation. So the question was that in SayCan, the value
01:11:34.200 | functions that predict these scalars on the right here for the affordances are only scoring
01:11:39.280 | a certain limited number of tasks. So is that the bottleneck? And I would say yes, 100%.
01:11:44.000 | Scaling the number of tasks that your system is able to do that you can then give to the
01:11:48.240 | planner as its buffet of options to choose, that is the bottleneck, right? No matter how
01:11:52.480 | good your planner is, if you can only do like three tasks, there's only certain like combinations
01:11:58.320 | of those three tasks that it can do to, you know, map on to a high level instruction.
01:12:02.920 | So as you add more tasks, as the low level skill capabilities of your robot increases,
01:12:08.160 | you're kind of like adding precision to like the coverage of the high level instructions
01:12:13.040 | that your robot can try to do. So that's one of the main bottlenecks I see today.
01:12:32.080 | Great question. So have we tried RT1 with RLHF or with RL? I think the short answer
01:12:39.920 | is I think we have some stuff in the works that is doing that. But right now, for all
01:12:43.760 | of our projects, currently, we're just using this imitation learning loss. Again,
01:12:49.240 | I think I view this multi-task imitation bet that we're making as kind of an existence
01:12:52.960 | proof. It works, it's not cheap, but it kind of does work and it does scale. And that at
01:12:58.160 | least is a good starting point. And our main, you know, hope over the next months and years
01:13:03.120 | is can we improve beyond that? Can we add back in offline improvement? You know, can
01:13:07.040 | we add in RL back to the equation somehow? I'm an RL person at heart, so I really hope
01:13:11.520 | so. Sorry, could you repeat that?
01:13:37.040 | Yeah, good question. So regarding task balance and whether text-only data is sufficient for
01:13:54.200 | helping motor control learning, I think my hope is that when, you know, when we experience
01:14:01.800 | emergence in both the robotics space and we've already seen emergence in the language space,
01:14:06.600 | at some point, maybe these reasoning concepts will start to transfer between the two. I
01:14:10.520 | would point them to one interesting paper, which is, I think, "Can Wikipedia Help Offline Reinforcement
01:14:15.120 | Learning?" from Shane and some other folks. They pre-train, you know, a large policy network
01:14:21.360 | on, like, you know, autoregressive token prediction on Wikipedia, just text only, and they use
01:14:25.880 | that to initialize, like, control for Atari games with RL, and this actually helped. So,
01:14:31.400 | you know, maybe this is philosophical, but maybe there's something about decision-making
01:14:34.800 | reasoning that transfers between text and action data, so.
01:14:42.760 | Great question. I definitely agree. You know, passing in six images is not going to be enough
01:14:58.840 | when you're executing tasks for minutes at a time. Like, clean my whole house, and then
01:15:02.600 | you can only pass in the last, like, you know, two seconds. Like, come on. So, I think that's
01:15:07.400 | definitely going to be a limitation as our tasks get more complex and long-horizon, and
01:15:12.280 | I think here, it's another open question, too, is context length. We have high-dimensional
01:15:16.960 | images, even with TokenLearner for reducing the number of patches that we pass through,
01:15:22.100 | it's still, you know, very high-dimensional, and we quickly hit the context length cap.
01:15:26.620 | How do we, you know, improve beyond this? Maybe it's something like retrieval transformers
01:15:31.240 | or some other kind of mechanism. Great question. I think we are hoping to explore
01:15:41.160 | that in the future, but with this, like, context length limitation, we are already near the
01:15:45.280 | context length capacity with just these six images alone, much less, you know, passing
01:15:50.140 | in whole trajectories of zero-shot behavior or few-shot behavior we wish to see. So, TBD,
01:15:57.800 | I think. Cool. Thank you, guys.
01:16:01.800 | Thank you.
01:16:02.800 | [end of transcript]