Stanford CS25: V2 I Robotics and Imitation Learning
00:00:00.000 |
I'm really happy to be here. I guess to shortly introduce myself. My name is Ted Xiao. I'm 00:00:13.580 |
a senior research engineer on the Google Brain team. I've been working on robotics now 00:00:18.460 |
for the past five years. I've touched upon a few topics including multitask learning, 00:00:23.980 |
reinforcement learning, and then lately just broadly thinking about how we can scale robots 00:00:28.300 |
to make sure that we can actually work in the wild in the real world. I guess today 00:00:33.060 |
I'll be talking about quite a few different topics, but as a first preface, I guess the 00:00:38.940 |
first thing to know is that our team is pretty massive now. All of these projects are huge 00:00:43.020 |
collaborations, with some projects having more than 40 people working on these for many years. 00:00:48.240 |
So these are large efforts and I'm just very fortunate to be on teams of 00:00:52.500 |
very smart people. Secondly, some of my takes are spicier or more controversial than others. 00:00:59.300 |
So all of those opinions are definitely only my own and don't reflect those of Google or 00:01:04.060 |
anyone else on the team. So with that out of the way, yeah, welcome to my TEDx talk. 00:01:16.040 |
So I think maybe some of you have seen a lot of the cool robot learning videos out in the 00:01:21.820 |
wild these days, but I am more excited than ever and it's not just hype I think. I think 00:01:27.180 |
there's been a fundamental shift in how researchers in robotics view learning over the past two 00:01:32.340 |
years and I think the shift has a lot to do with all of the trends happening more broadly 00:01:38.160 |
in foundation modeling, in large-scale internet models, across different fields like language, 00:01:44.220 |
audio, and so on. But I think my goal today is to convey to you why I am particularly 00:01:50.280 |
excited about this time today right now and why there's been a very fundamental 180-degree 00:01:56.340 |
paradigm shift I think across the robot learning field. And if you walk away from this talk 00:02:01.740 |
with just one thing, and that's that you're a bit more excited about robotics than you 00:02:05.860 |
were before or believe that the time is now for these robots to really start scaling exponentially 00:02:11.380 |
and doing something really cool, I think then my talk will have succeeded. The talk will 00:02:22.320 |
have a few parts. We're going to start at a very high level and just talk about why 00:02:26.920 |
a foundation model for robotics at all, what that might look like, and the ingredients 00:02:32.380 |
and recipe for how we might get there. Then we'll dive into a few different works pretty 00:02:38.000 |
deeply that my team has been very proud of over the past year or two. And finally, we'll 00:02:42.920 |
go back to the high level and then zoom out and think about what's next for robot learning. 00:02:49.780 |
So why a foundation model for robotics? One second, let me try to hide this thing. No, 00:02:59.980 |
that's fine. I'll keep that bar there for now. But the top bar says why a foundation 00:03:03.520 |
model for robotics. The term foundation model was coined here at Stanford, and I'll use the phrases internet-scale model, 00:03:10.000 |
foundation model, and large language model pretty interchangeably throughout. And I hope 00:03:13.600 |
it's pretty clear. But generally, when I'm talking about these big monolithic beasts 00:03:18.080 |
that are training on tons of data, they have two very important properties that I think 00:03:22.640 |
are quite nice. One is emergence. When very simple things kind of work at a small scale, 00:03:30.220 |
they get a ton better when you just scale things up: more data, more compute, larger models. 00:03:36.080 |
And what we see here is that when these models even become good enough, the domain space 00:03:41.700 |
of what they're good at and able to do starts to grow combinatorially larger. And here 00:03:46.560 |
for these two points, I would like to suggest two blog posts I highly recommend. One is 00:03:51.980 |
from Jacob Steinhardt called More Is Different for AI. And this kind of links to the phenomenon 00:03:56.220 |
that we see in other fields, like physics or biology. For example, individual water 00:04:01.380 |
molecules will behave very differently and have very different, let's say, electrostatic 00:04:05.960 |
forces, but then they start to clump up and start behaving as a liquid altogether. 00:04:10.440 |
We see this in herds of animals and flocking patterns, we see this in humans and economies, 00:04:14.760 |
we see this all across different fields. And now even an AI, we see models that are doing 00:04:19.060 |
stuff that would not even be possible when they were at a smaller scale. But when they reach 00:04:23.500 |
some critical scale in size, they start to work really, really well. This is documented 00:04:28.760 |
by Jason Wei in his blog post on emergence in LLMs, where you see this plot on the bottom left: 00:04:35.580 |
across a bunch of different tasks, whether it's modular arithmetic or Persian 00:04:39.640 |
question answering, the success rate is basically flat until these models get big enough, good 00:04:44.500 |
enough. And then the success rates just kind of skyrocket. And that's why I think these 00:04:50.320 |
are particularly exciting. So yeah, question. 00:04:54.480 |
I'm curious to know, do robotic foundation models display scale in real life? 00:05:00.980 |
Great question. And I'm really glad you asked. We have, I'm pretty excited to present some 00:05:06.120 |
directions we have along those lines that I hope will answer your question in maybe about 10 minutes 00:05:09.320 |
or so. Yeah. But I think that's a question on all of our minds, including myself. So 00:05:15.920 |
I think before we even get to the feasibility or the existence of any robotic foundation 00:05:20.100 |
models, like is this even needed? And the argument, which I don't think is obvious, 00:05:25.600 |
is that emergent capabilities, and relying on these, might actually be indispensable 00:05:30.360 |
for robotics to actually work. A lot of the research over the past decades of robotics 00:05:34.360 |
has been in one bin, one room, one table, one robot, one building even, but these are 00:05:40.280 |
so vastly different from the orders-of-magnitude more complex, wild real-world situations that 00:05:46.240 |
humans operate in every single day. And I think to make that gigantic leap, we're going 00:05:50.780 |
to have to rely on this emergent-capability scaling curve where things kind of work. You 00:05:55.600 |
have very canned demos. Maybe you have, you know, a humanoid robot programmed to backflip 00:06:00.440 |
after hundreds of trials, but going from that to like the chaotic real world, I think we're 00:06:05.000 |
going to have to rely on this emergence phenomenon for that. And I think maybe even intellectually 00:06:12.360 |
or academically, it's also interesting to think about why or why not a foundation model 00:06:18.160 |
for robotics might even work. It's worked in so many other domains. There's existence 00:06:22.720 |
proofs in audio, music, coding, language, and another domain every single day, it seems, 00:06:26.560 |
with 3D models and beyond. But maybe there is something very special about robotics, whether 00:06:32.800 |
it's embodiment or causality or physical grounding, that is the barrier to making this very 00:06:38.520 |
simple recipe, which is working in all these other domains, work here. If there is something special about 00:06:42.720 |
robotics that causes this recipe to fail, I think that's quite interesting to study 00:06:46.620 |
why that is. I'm personally an optimist. I don't think there is some magical secret sauce 00:06:51.600 |
that's going to keep robotics from being tackled with the same formulas and recipes that's 00:06:56.000 |
worked elsewhere. But, you know, I think this is a question I'd like to find out the answer 00:06:59.800 |
to. And so maybe then instead of just motivating this philosophically, okay, we need foundation 00:07:07.400 |
models, foundation models are great. Let's try to build one for robotics. How do we actually 00:07:11.520 |
do that? Well, I think we can leverage a few ingredients by standing on the shoulders of 00:07:17.600 |
giants and looking at other domains. The first one is looking at different design principles 00:07:22.360 |
of ML scaling from other domains. Let's look first at high capacity architectures, the 00:07:28.520 |
topic of this class today. Ideas such as self-attention and all the different ideas encompassed in 00:07:35.000 |
the transformer, which, as Andrej Karpathy famously said, is like a magical universal differentiable 00:07:39.880 |
computer that's very general, very robust, and remarkably scalable on many different 00:07:44.720 |
dimensions. Let's use those. We should also leverage the more guiding principles that 00:07:50.280 |
have been seen, the scaling laws, the trends, this year's Chinchilla, you know, we not only 00:07:55.160 |
have to scale the model size, we also have to scale compute, and we also have to scale 00:07:59.480 |
the number of unique tokens in the corpus of the vast data sets that we train on. But 00:08:04.040 |
if we do all three together, this has been shown to reliably have a pretty good chance 00:08:09.120 |
of succeeding, no matter what domain you're looking at. And so, and finally, what that 00:08:14.560 |
kind of means, and I think this is actually going to come up later, is that data set size 00:08:18.720 |
seems to matter these days a lot more than quality. Even if you have some sentences on 00:08:22.880 |
Wikipedia that are misspelled, or some, you know, falsehoods, or some things that aren't 00:08:27.360 |
so desirable, if in aggregate, your data set is diverse enough, and interesting enough, 00:08:32.480 |
these things will hopefully wash out in the mix. 00:08:39.120 |
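(A quick aside on the Chinchilla-style guidance mentioned a moment ago: a minimal sketch of the commonly cited rule of thumb, assuming training compute C is roughly 6*N*D FLOPs and roughly 20 training tokens per parameter at the compute-optimal point; the exact fitted coefficients in the paper differ slightly.)

```python
import math

def chinchilla_optimal(compute_flops: float):
    """Rough compute-optimal split, assuming C ~ 6*N*D FLOPs and D ~ 20*N tokens."""
    # C = 6 * N * D with D = 20 * N  =>  C = 120 * N^2
    params = math.sqrt(compute_flops / 120.0)
    tokens = 20.0 * params
    return params, tokens

# Example: a ~5.8e23 FLOP budget lands near Chinchilla's ~70B parameters / ~1.4T tokens.
n, d = chinchilla_optimal(5.8e23)
print(f"~{n:.1e} parameters, ~{d:.1e} tokens")
```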
Ingredient number two: the proliferation of the internet-scale models themselves, not just the principles. What's exciting, and 00:08:44.880 |
I'm sure it's, you know, definitely been very shocking for both experts and lay people alike, 00:08:50.080 |
is that a lot of these generative models across many different modalities have been experiencing 00:08:54.320 |
emerging capabilities and have been surpassing all of our wildest expectations time and time 00:08:59.400 |
and again. But even when we think that we've exhausted all this, that it's too much and it's 00:09:03.720 |
not going to work, something will come out and completely blow me out of the water. And 00:09:06.880 |
I think this trend will definitely keep continuing. And I think, in addition to that, they not 00:09:11.440 |
only will continue coming online and accelerating more rapidly, they're going to happen 00:09:15.200 |
whether or not we do anything. You know, in the grand scheme, speaking for me as 00:09:20.200 |
a robotics researcher, or, you know, for you in whatever subfield you're in, there are parts 00:09:25.180 |
of machine learning that you'll probably not ever touch, at least in the near future. 00:09:29.840 |
And those parts will be seeing tremendous breakthroughs and scaling and new capabilities 00:09:33.280 |
coming online every single week. And you can look at this not only in the impressiveness 00:09:39.760 |
of the models, but also the acceleration of progress, the timescales in which new models 00:09:44.560 |
are being released, where large collaborations are being worked on by many groups, and then, 00:09:49.440 |
you know, being made available for all to use and build upon. And the final ingredient 00:09:56.460 |
in this trend is more of a robotic specific one, but it is a vast shift from online robotic 00:10:02.860 |
learning, where robots collect experience online, make actions and learn through trial 00:10:08.380 |
and error to an offline setting where we decouple the data generation process from the data 00:10:14.260 |
consumption process. As we've seen in all these other foundation modeling domains, these 00:10:19.740 |
big internet scale data sets are so diverse, and they're static, we just scrape them once 00:10:24.320 |
or scrape them multiple times continuously. But we aggregate a continuous pile that's 00:10:29.040 |
just growing. Here, we see either The Pile dataset from EleutherAI or LAION-5B for 00:10:34.860 |
paired image text. And these are pretty big, and they're orders of magnitude more than 00:10:39.200 |
what we've seen before. And they are definitely a key ingredient to why other domains have 00:10:43.560 |
been doing so well at training these big foundation models. And this coming back to robotics, 00:10:50.800 |
then I'd like to take a brief detour into how the shift came to be because it's very 00:10:55.840 |
easy to say in a sentence, yeah, robotics is offline more than online. And this is coming 00:10:59.640 |
as kind of a no brainer to many folks who are coming from other domains, like this is 00:11:04.040 |
the way things are done. But in robotics, this has been a very big shift. And I think 00:11:09.040 |
robotics has also been synonymous with RL, reinforcement learning for a lot of people. 00:11:13.720 |
And I think increasingly, this is becoming less true. And so I'd like to take you down 00:11:17.840 |
a brief trip down the history of my team; this part of the talk is a brief history of 00:11:23.400 |
robotics at Google. And yeah, of course, thanks. And I think this is not just for dramatic 00:11:29.480 |
exposition, it's really to try to guide you through how drastically our team's thinking 00:11:34.960 |
has kind of evolved over the years, and how that's going to inform the design decisions 00:11:40.400 |
and the kind of risks and research directions that we take in the specific projects that 00:11:45.240 |
I'm going to show coming up. Thank you. So in 2016, some of you may have seen this, we 00:11:50.560 |
had what we call the arm farm, seven KUKA robots in a room collecting picking data 24/7. 00:11:56.480 |
And this was doing on policy RL in the real world, we were the first team to kind of say, 00:12:00.960 |
hey, can we even do this, with the goal of saying, can we do end-to-end robot learning 00:12:05.880 |
with results in the real world? This was kind of risky at the time; it was not a common 00:12:10.000 |
take. And from that we developed several interesting research directions that we started exploring, 00:12:14.720 |
we looked into stuff like QT-Opt, which is a Q-learning method working on continuous 00:12:20.960 |
control actions while taking in vision inputs. We worked on CycleGAN to transform simulation- 00:12:27.820 |
based images into real-looking images for sim-to-real, and we looked at concurrent control 00:12:32.560 |
for how we get robots moving faster and more efficiently in the real world. I'm sorry, 00:12:39.040 |
Yeah, great question. And that one, I think was basically, the arms would pick stuff up 00:12:49.000 |
from the bin, if they messed up, and it fell out, well, we come back the next morning, 00:12:52.680 |
and there'd be objects scattered all throughout the room. So there was no reset. But if they 00:12:57.400 |
missed a little bit, the objects would fall back into the bin and hopefully be in a position 00:13:03.160 |
Oh, yeah, of course. Thanks. I'll do that in the future. The specific question was, 00:13:09.680 |
for this 24/7 arm farm, how did we do resets? And the answer is, well, we didn't. 00:13:15.480 |
We designed the bins so that they were kind of banked, so that if objects slightly missed, 00:13:19.080 |
they would fall back in the bin, rearrange themselves, and maybe add more diversity to the training 00:13:22.880 |
data. But this was doing off-policy online RL with Q-learning. And we mixed it with some 00:13:31.800 |
Next, we kind of went through this consolidation phase around 2020. When we're like, alright, 00:13:37.880 |
this is pretty cool. And you know, but we want to get out of the bin, how do we do more 00:13:41.760 |
complex tasks and a more practical setting that could be closer to something that humans 00:13:46.440 |
would want to use that's more general every day. There, we kind of settled on this office 00:13:50.740 |
micro kitchen environment, if you've heard of the famous Google micro kitchens. And I 00:13:55.240 |
think this was the setting we decided to operate in. And there, we started collecting data, 00:14:00.600 |
we scaled our real operations. And there, we kind of scaled approaches to some different 00:14:04.360 |
things. And I think in the bottom right here is like the more mechanized reset version, 00:14:08.800 |
I would say of the arm farm. Here, we had a bin that folded in half. And this was doing 00:14:13.640 |
multitask RL in the real world. And the bin would flip in half dumping objects from one 00:14:17.480 |
side to the other. So you could do more interesting tasks, whereas the arm farm was pick anything 00:14:21.200 |
up. Now we could say, hey, pick up the carrot and place the tomato on to the plate. And 00:14:26.440 |
then the bin would flip and you'd reset. Some other work was multitask imitation 00:14:31.060 |
learning, this is BC-Z. And then we also looked at stuff like combining reinforcement 00:14:34.900 |
learning with imitation learning bootstrapping. 00:14:39.360 |
But in 2020, once again, we realized we were working on a ton of different directions, 00:14:45.160 |
and we wanted to consolidate. And I think the two main things that were really bothering 00:14:49.080 |
us at the time were that we were hitting two main walls across all these methods: 00:14:53.980 |
some of them were plateauing at this, you know, rough 50 to 70% range in the real 00:14:58.880 |
world. And other methods were requiring very specific data distributions, they had to be 00:15:04.240 |
on policy, or they could only use demonstrations, or they blah, blah, blah, like, there were 00:15:07.920 |
so many different nuances and like gotchas to all these different methods, and all these 00:15:11.880 |
different drawbacks. And so the question we posed was, we're open to any method, any strategy 00:15:18.180 |
that will enable us to solve tasks in a very performant manner, more than 90% in the real 00:15:22.720 |
world. And also that can scale with some kind of data that we can collect, you know, and 00:15:28.560 |
maybe this is a bit more lax than let's say, an academic setting where you're much more 00:15:33.120 |
resource constrained. But at the end of the day, you know, even our team does not have 00:15:36.600 |
infinite money, we still have a certain number of robots, a certain number of operators, 00:15:40.560 |
and we're constrained by the laws of physics. So we need some way to acquire more data that 00:15:44.000 |
we can then learn from. And so we're all scratching our heads thinking about this for a few months 00:15:47.720 |
in spring 2022. We decided on going with multitask imitation learning. So this was a vast departure 00:15:54.600 |
from the 24/7 arm farm. This was a vast evolution of how we approach the problem. We found that, 00:16:00.560 |
you know, with enough, you know, gentle care and love, multitask imitation learning was 00:16:04.520 |
able to hit these 90% numbers, and it was able to get better with more demonstrations. 00:16:09.320 |
These aren't the cheapest thing, but it was able to scale with additional demonstrations, 00:16:13.760 |
which was the sign of life that we were looking for. So that brings us to less than a year 00:16:18.360 |
ago, our team was deciding this is the path forward, at least in the near term future. 00:16:23.120 |
But maybe, you know, we could just think about how the approach we were taking here might 00:16:30.040 |
also spread out in the future. And we might be able to bring back these other threads. 00:16:34.280 |
For example, now that we're decoupling this data collection of demonstrations or 00:16:39.160 |
etc. from how you learn from them with a multitask imitation learning policy, maybe we can in 00:16:44.080 |
the future then do something like offline RL. But I think at a high level now, I've 00:16:48.840 |
just you know, in a few short minutes, just compressed six years of very bitter lessons 00:16:54.040 |
that our team has been learning. And I think from where we are today, and looking back, 00:16:57.960 |
even just two years ago, if you told me that the strategies we're deploying today could 00:17:01.680 |
just scale the way they are, I probably would not have believed you. 00:17:05.920 |
Great question. So I think task conditioning was definitely still an open question at 00:17:20.320 |
the time. But I think with this work, BC-Z, we found that language, at least 00:17:27.120 |
a templated kind of language representation, was good enough that we could direct, I think, 00:17:32.120 |
BC-Z's over 80 tasks. So they were very templated, like pick grapes, or 00:17:37.320 |
move grapes onto plate, or drag cloth across table. And I think 00:17:44.240 |
this representation was still enough where you're learning a good number of skills that 00:17:47.840 |
you're passing in essentially a one hot ID into your policy network, and it will learn 00:17:51.400 |
to imitate that. And for each one of those 80 tasks, we'd collect hundreds or 1000s of 00:17:55.320 |
demonstrations. And I will touch upon the specifics of that a bit later, too. 00:18:03.720 |
So yeah, today, and or at least in 2022, let's do offline methods, let's decouple data generation 00:18:10.320 |
from data consumption. And let's take these three lessons now that we touched upon. Let's 00:18:15.320 |
take the design principles of ML scaling, and then figure out what lessons can actually 00:18:18.960 |
be applied when you look into the future for recipe for robot learning and foundation models. 00:18:26.080 |
The first lesson I think is very important is these high capacity architectures like 00:18:29.680 |
attention. The second I'll touch on later is data interoperability, tokenization, 00:18:34.960 |
discretization. And the second ingredient is the proliferation of these models themselves. 00:18:40.440 |
Can we leverage them because they will get better over time. And I think here, I would 00:18:44.680 |
like to plug my colleague Karol Hausman's bitter lesson 2.0, which is the bitter lesson. 00:18:49.560 |
The first one from Richard Sutton was, you should leverage methods that scale with more 00:18:53.560 |
compute. And maybe in today's day and age, the lesson is that we should leverage methods 00:18:58.800 |
that are able to utilize improvements in foundation models, because they're going to get better. 00:19:03.400 |
Yeah. So both in the lesson 1.0 and 2.0, one thing that's always been clear to me is suppose 00:19:10.400 |
I have a set of methods. And I want to choose the methods that are going to scale with more 00:19:13.400 |
compute or in this case, scale with better foundation models. The question is, how do 00:19:17.360 |
I actually decide which of those methods meet those criteria? 00:19:22.080 |
Yeah, great question. I think, and maybe it's, I think that's a very, I don't have a good 00:19:27.720 |
answer for that. Oh, sorry. Yeah, yeah. The question was in bitter lesson 1.0 and bitter 00:19:32.560 |
lesson 2.0, the question is, well, that's great. That's the lesson, but how do we actually 00:19:36.480 |
decide which methods meet this criteria? And I think, you know, my answer is it's not always 00:19:42.000 |
obvious and it's actually quite tricky sometimes, but maybe, you know, sometimes, you know, 00:19:46.680 |
what you can be very confident that, oh yeah, this will definitely scale with more data 00:19:50.320 |
and compute, and some where you can't. But basically, the more hard-coded you are, the more assumptions, 00:19:54.400 |
the more heuristics you bake in, and, in our day and age, the more you rely 00:19:58.400 |
on a specific implementation of a specific foundation model of a specific class of algorithm, 00:20:04.560 |
maybe that will be less robust than a method that just assumes some very abstract input 00:20:09.000 |
and output and assumes that how you get from that input and output can improve over time. 00:20:13.280 |
And maybe the algorithm itself even changes altogether. So I think that would be my take 00:20:17.760 |
on the bitter lesson 2.0, but this is definitely still, I think the jury is still out on this. 00:20:25.440 |
And one of the things I like to propose is that language is the 00:20:30.280 |
way that we can leverage bitter lesson 2.0. If you have language as the universal representation 00:20:35.600 |
through which all of these foundation models communicate with each other, whether it's, you know, captioning 00:20:39.640 |
or generation or whatnot, I think that's one way that we could leverage a bitter lesson 00:20:44.160 |
2.0. And finally, the third ingredient offline robot learning, decoupling data generation 00:20:51.680 |
from data consumption, putting these all together, my recipe for one take at a modern attempt 00:20:58.740 |
at embodied intelligence would be to combine these large offline datasets 00:21:03.640 |
with high capacity architectures by using language as the universal glue. And in the 00:21:08.560 |
works I'm going to present shortly, all of our different projects, I think in some way 00:21:13.040 |
or another are inspired by this philosophy. And now, now that we've kind of, you know, 00:21:22.320 |
understood the motivations and potentially one possible approach. Of course, the parts are largely 00:21:29.600 |
large offline datasets, high-capacity architectures, and using language as a universal glue. I'm curious 00:21:34.320 |
to know which, if any, of these are currently bottlenecks, maybe not the right word, which of these 00:21:39.560 |
are limiting. Got it. Because it seems to me like we already have large offline datasets, 00:21:43.320 |
we have high-capacity architectures, and, you know, those architectures already work 00:21:46.560 |
with language, so it seems like we already have all the components necessary. 00:21:49.720 |
So why is this then not a solved problem? The question was these, it seems like we have 00:21:55.880 |
a lot of these ingredients. And so why hasn't robotics been solved yet? So I would argue 00:22:01.240 |
that actually this take here, and maybe I'm, this is to the wrong audience at the moment, 00:22:05.880 |
but I think this is non, very non-obvious across the robotics field. Many people do 00:22:09.720 |
not agree with all of these, much less two of these, or even any of these points. And 00:22:15.160 |
so I think, also, the scale of how mature each of these components 00:22:20.520 |
is within robotics is at very different stages. And I would say, like, we can talk a bit 00:22:24.760 |
later about like, for example, like data scale, or the architectures that have kind of diffused 00:22:29.560 |
through osmosis from other ML domains into robotics. But I think we're still at very 00:22:34.440 |
different stages on how, how much people have actually bought into these lessons and invested 00:22:40.000 |
Yeah, I can probably, I also don't want to get into too much trouble here, but I'll 00:22:57.840 |
probably get myself in a bit of hot water in a few slides. So I'll, I'll extend upon 00:23:02.560 |
I'm just curious to know what their opinion is and why you think they're wrong. 00:23:07.280 |
Yeah. And I would say that like me personally, and, you know, not speaking for my team, but 00:23:12.960 |
a lot of people on my team are probably at the very extreme end of learning, scaling, 00:23:18.480 |
data-driven, you know, foundation model based, let's go big. And I think a lot of people 00:23:24.160 |
don't believe that. And yeah, happy to discuss why later, maybe after the Zoom as well. So, 00:23:30.160 |
so yeah. Well, okay then let's, let's go ahead and dive in and see how this recipe 00:23:35.920 |
might actually percolate into specific domains. And the first one is RT1. This is a recent 00:23:43.360 |
work from our group that works on how we can scale imitation learning. And let's look at 00:23:47.960 |
how we can actually apply these first principles. 00:23:50.880 |
So the first one is to consider what we actually have. Let's put ourselves into the spring 00:23:55.920 |
2022 mindset. We've been collecting demonstrations for a while. This is a ton of demos, like 00:24:00.840 |
a hundred thousand plus that were collected over like a year and a half on many, many 00:24:05.200 |
robots on many, many tasks. It was expensive. And over time, this will actually, 00:24:11.200 |
you know, not trickle up at insane amounts. Like we won't just get a hundred thousand 00:24:14.840 |
new high quality demos every day. This will grow over time, but it's not going to, you 00:24:19.000 |
know, grow for free. And autonomous ways of doing this is very hard. As you saw earlier 00:24:23.320 |
with MT-Opt with the bin reset mechanism, or DeepMind has a work on RGB stacking, where 00:24:27.560 |
they try to do autonomous resets. And you know what, the way that we're doing it right 00:24:31.160 |
now, or at least for this paper, was human teleoperation pioneered by BC-Z, and that 00:24:36.960 |
was very expensive as well. So there's going to be a limited throughput. And finally, BC-Z 00:24:41.480 |
used a ResNet-based backbone, and it was pretty good, but we found that it was very 00:24:45.120 |
sensitive to training distributions. For example, when they remove data from some teleoperators 00:24:49.680 |
to make the data more homogenous performance got better, and that's not really a property 00:24:53.680 |
we like, right? We want more data, even if it's not exactly the same. So the lesson here, 00:25:00.120 |
models need to be robust and they need to generalize. Cool. So we need models to 00:25:04.440 |
be robust and generalize. What else do we have? Well, off-the-shelf models are pretty 00:25:07.840 |
slow. If we take in these huge, you know, vision transformers from other domains, they're 00:25:12.120 |
not going to run on the real robot. We need to be able to run at a pretty high frequency. 00:25:16.000 |
They need to be reactive. Inference time needs to be low because all our models are vision 00:25:20.520 |
based. And finally, we want our models to be able to understand language. As I mentioned, 00:25:26.720 |
language is the universal glue. Our data set already has some language. We want 00:25:30.400 |
eventual models to be very multimodal. These are the first principles that we need to dig into. 00:25:35.840 |
What does this mean? We can't just take something existing. We probably need to design or at 00:25:39.520 |
least modify something from the ground up. And let's take the best practices that we've 00:25:43.780 |
seen work in other fields. And so we worked for a bit and we came up with this architecture 00:25:51.160 |
for RT1. Again, once again, this was a large team with a bunch of different contributions, 00:25:56.040 |
and I'll just go through a few of them here. At a high level, RT1 is robotics transformer. 00:26:02.560 |
It operates at three hertz. It takes in visual input from the robot RGB camera, as well as 00:26:08.600 |
a natural language instruction. There, the image is patchified and fed into a FiLM EfficientNet 00:26:14.680 |
tokenizer. It's then passed into TokenLearner, which I'll talk about soon. And then 00:26:20.000 |
also the language instructions are tokenized and then they are put into the same transformer. 00:26:25.440 |
And then finally, we output discretized actions as tokens and send those to the real world 00:26:31.040 |
at three hertz in closed loop. This transformer is decoder-only. We use a sparse categorical 00:26:38.740 |
cross-entropy objective for action prediction by applying a causal mask. We use the pre-trained 00:26:44.400 |
EfficientNet backbone, and we also use TokenLearner for faster inference. 00:26:50.320 |
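(To make that data flow concrete, here is a minimal, shape-level sketch of an RT-1-style control step. The component names film_efficientnet, token_learner, and decoder_transformer are hypothetical placeholders standing in for the modules described above, not the actual RT-1 code, and the 11-dimensional action output is an assumption for illustration.)

```python
import numpy as np

def rt1_step(frames, instruction, film_efficientnet, token_learner, decoder_transformer):
    """One ~3 Hz control step of an RT-1-style policy (shape-level sketch only)."""
    per_frame_tokens = []
    for frame in frames:                                  # history of 6 RGB frames
        patches = film_efficientnet(frame, instruction)   # -> (81, d) patch tokens,
                                                          #    language-conditioned via FiLM
        selected = token_learner(patches)                 # -> (8, d) dynamically selected tokens
        per_frame_tokens.append(selected)
    context = np.concatenate(per_frame_tokens, axis=0)    # 6 * 8 = 48 tokens total
    # Decoder-only transformer (causal mask) emits one discrete token per action dimension.
    action_tokens = decoder_transformer(context)          # e.g. 11 integers in [0, 255]
    return action_tokens                                  # de-tokenized into robot commands downstream
```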
Diving a little bit deeper. Oh, sorry. Yeah. A question. Great question. So the image token, when it 00:27:01.920 |
goes in from, so each image is the, you know, the high fidelity RGB image from the camera. 00:27:07.320 |
We split that up into 81 separate patches. And so each patch is, you know, it's spatially 00:27:12.240 |
just like the square there. But the cool thing is that what token learner does here, this 00:27:18.080 |
thing here is it's a previous work from our group that takes in a bunch of possible you 00:27:24.640 |
know, image patches and dynamically selects which of those image patch tokens are more 00:27:30.280 |
relevant for the task at hand, given the existing context. So from those 81 image patch tokens, 00:27:36.240 |
we sub sample eight of them to use for inference. And this happens at every time step. And that 00:27:42.520 |
process has learned which eight of the patches are relevant at any given moment. And otherwise, 00:27:48.840 |
we're sending in way too many tokens and the context length would explode and we wouldn't 00:27:52.600 |
be able to do inference on robots. We are also passing in a sequence length 00:27:56.780 |
of six images. History is quite important when you're doing temporally coherent tasks 00:28:01.920 |
in the real world where things like physics and you know, exactly this, this nuanced detail 00:28:06.320 |
of what the objects are doing in relation to each other and to your robot. Those details 00:28:10.240 |
really matter. And in total, the model size is 35 million parameters, which is quite 00:28:17.600 |
a bit smaller than a lot of these other, you know, huge internet scale models. And finally, 00:28:23.720 |
one main difference here is action discretization. Before, a lot of the projects we were doing 00:28:28.680 |
were doing continuous control. And if you think about it, right, our robot does have, 00:28:32.640 |
we do end-effector pose control, position control. And there, the real world is a continuous 00:28:37.260 |
state space. And to do that, we had to come up with many algorithmic novelties, 00:28:43.080 |
for example, a CEM actor that did basically sampling of these continuous action spaces 00:28:48.440 |
to propose the best ones, which would get rated by the Q function. And we do this twice, 00:28:52.280 |
blah, blah, blah. But that's so sensitive, and we needed to do that to 00:28:55.840 |
get things to work. But now we just decided, let's just, you know, bin our actions. It's 00:29:00.560 |
only 256 discrete bins per action dimension. And let's just predict those as tokens. 00:29:06.360 |
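(A toy illustration of that binning idea, not the actual RT-1 tokenizer; the action bounds below are made up. Each continuous action dimension is mapped to one of 256 bins and back.)

```python
import numpy as np

NUM_BINS = 256

def discretize(action, low, high):
    """Map continuous action values to integer tokens in [0, NUM_BINS - 1]."""
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)            # scale to [0, 1]
    return np.minimum((normalized * NUM_BINS).astype(int), NUM_BINS - 1)

def undiscretize(tokens, low, high):
    """Map integer tokens back to bin-center continuous values."""
    return low + (tokens + 0.5) / NUM_BINS * (high - low)

# Hypothetical end-effector delta bounds, purely for illustration.
low, high = np.array([-0.05, -0.05, -0.05]), np.array([0.05, 0.05, 0.05])
tokens = discretize(np.array([0.01, -0.02, 0.0]), low, high)
print(tokens, undiscretize(tokens, low, high))
```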
Um, any question? Yeah, what I was going to ask is, so you're mentioning that you have this design 00:29:11.620 |
or engineering requirement about speed, latency, and reaction. And then you say that that 00:29:16.060 |
necessitates having a relatively small model, which makes sense. But one message of scaling 00:29:20.540 |
when we're talking about foundation models is that we don't want to be bottlenecked by 00:29:23.380 |
either data compute or parameters. So I guess what I'm curious to know is how do you balance 00:29:27.900 |
these off in the sense that you want to have lots of parameters to have a really powerful 00:29:31.580 |
model, while on the other hand, you want to have very fast inference. 00:29:34.660 |
Yeah, great question. And to repeat it, the question is, um, we kind of set a pretty hard 00:29:39.560 |
constraint with a hundred millisecond inference time, yet a lot of the lessons in foundation 00:29:43.840 |
modeling are that you shouldn't be constraining yourself against any dimension, whether it's 00:29:47.260 |
data set, size, compute, or model capacity. And I think my initial answer to that is that's 00:29:52.440 |
a very great point and something I think that's going to be coming up as a severe bottleneck 00:29:57.000 |
in the future. But for our initial case, I think this is more of an exploration of 00:30:01.080 |
whether these principles work at all, before even scaling well beyond what we were looking at now. 00:30:05.440 |
Already, this 35 million is gigantic compared to a lot of prior work using, for 00:30:10.800 |
example, a ResNet-34 or whatnot. So this is already much bigger than, you know, a lot 00:30:15.240 |
of other options. And maybe for now, at least, it's the largest scale we 00:30:21.040 |
could go to roughly in the short term without having to think of more tricks. 00:30:25.200 |
Yeah, we can talk about it a bit later, maybe. I think I'd also love to hear your thoughts 00:30:33.040 |
too, because it's very non-obvious how we can get past some of these bottlenecks. 00:31:00.160 |
Yeah, great question. We ran some ablations on model size. I might have that in a few 00:31:06.040 |
slides, but maybe we can return to that then. And if not, I can, yeah, but great question. 00:31:15.040 |
So yeah, that's the architecture and I'll discuss some of the ablations and the trends 00:31:18.720 |
later on, but maybe, you know, this is a robotics lecture, I should show you some pretty visuals, 00:31:23.920 |
right? So let's look at some evaluations we did. We compared against some baselines. One 00:31:28.880 |
is Gato, which you might be familiar with. And then the other one is BC-Z, the ResNet- 00:31:34.560 |
based one. And we evaluate seen tasks versus unseen tasks. And we also add 00:31:40.320 |
in various distractor objects. Our normal data collection looks like this top left picture, 00:31:45.160 |
three cans on a gray desk, that's basically it. But then we push it further by bringing 00:31:49.960 |
in a lot more objects so that the table is so cluttered that even as a human, sometimes 00:31:54.040 |
it's hard to find the object that you're actually looking for. We add in tablecloths, 00:31:58.800 |
we make the textures very different. We bring it to new micro kitchens with new surfaces 00:32:03.120 |
all together. And we find that RT1 is more robust than these other different methods. 00:32:10.200 |
Good question. The question was, was the Gato model trained on our data or was it just already 00:32:25.200 |
included in Gato? The answer is this data was not included in Gato. And so we retrained 00:32:29.560 |
the Gato model only on our data. Yeah. And yeah, so here's just a different visualization 00:32:35.680 |
of the robot going out in our micro kitchen and doing different interesting things. You 00:32:39.760 |
can see here that it's trained on one setting, but then it goes into brand new kitchen, brand 00:32:44.400 |
new countertops, new objects, and it's able to do all of them pretty robustly. We also 00:32:49.520 |
put it into a long horizon setting using the SayCan framework that we'll talk about next. 00:32:56.360 |
But in these settings, a lot of them are mixing all of these generalization capabilities. 00:33:00.920 |
And on the plot on the left here, we're using what we call generalization levels inspired 00:33:04.600 |
by the VIMA paper that would basically increasingly change more factors of variation simultaneously. 00:33:14.080 |
Yeah, good question. We'll go into a bit more detail later, but I think at a high level, 00:33:25.360 |
teleoperators get a structured, templated command of like verb-noun form, something 00:33:30.400 |
like pick Coke can or move apple near sponge. And we have around 700 tasks set up this way 00:33:38.400 |
and they go ahead and collect that data, task done. And then later we make sure 00:33:43.240 |
that successes are actually successes and we discard stuff that's like unsafe, for example. 00:33:48.000 |
Oh yeah, I got it. For this paper, we utilized 130,000 demonstrations for this. 00:33:56.720 |
Yeah, great question. I think a lot of prior work has also been done on this, but it's 00:34:13.120 |
also noted that when you have, for example, the question was, did you find that the, the, 00:34:18.920 |
the trajectories in your dataset were very multimodal. And I think what you mean by that 00:34:22.720 |
is that to go from point A to point B, I can go left or I can go right, or I can go straight. 00:34:29.040 |
And I think this kind of diversity in basically for a single image state, but yet my data 00:34:34.880 |
has three possible labels that can have very bad effects sometimes. For us, I think because 00:34:40.120 |
we are using teleoperator demonstrations, the data was more homogenous than perhaps 00:34:44.800 |
like in the wild; for example, there's a type of data collection called play data where operators 00:34:49.160 |
just do whatever they want and we label in hindsight. And I think our data is more homogenous 00:34:53.120 |
than that, but we did not find a lot of the issues that we've seen in prior projects. 00:34:57.760 |
One potential answer is maybe it's, it's the, it's the architecture itself, but we can talk 00:35:03.160 |
about that later too. Yeah. Question. Great question. We actually do have a termination 00:35:16.640 |
action. So the, the policy itself, so the question was how do you determine when a episode 00:35:21.440 |
is complete and the policy is able to predict terminate because at the end of each teleoperation 00:35:26.800 |
session, the operator can click a button and it's marked as episodes done. Yeah, I think 00:35:39.720 |
for these evaluations, we were quite strict, but definitely I think in some cases, you 00:35:44.760 |
know, maybe, maybe if we're just doing an experiment for ourselves, we'll have a dense 00:35:48.800 |
reward scale of like grasp the object and move closer, grasp the object and almost got 00:35:53.560 |
there, but mess up at the end. And we'll have like a, a grading curve basically. But for 00:35:57.320 |
all of these, all of these stats I'm showing here, it was zero or one, one fully complete 00:36:02.320 |
zero was not fully complete. Yeah. Cool. And I think what was an exciting sign, and maybe talking 00:36:10.600 |
about the multimodality aspect, is that we then pushed the limit even further. We decided 00:36:14.360 |
to train on very diverse data distributions. You're back by then? Yeah. Okay. So 00:36:21.680 |
right now you saw 130,000 demonstrations trained on this Everyday Robots proprietary mobile 00:36:28.760 |
manipulator, but we were also looking to train on very different data distributions with 00:36:33.080 |
very different, you know, action distributions, very different trajectories, even very different 00:36:37.000 |
visuals objects tasks. And to do that, we included two other data sources. One was simulation 00:36:42.520 |
data, which was kind of our robot, but in sim, and it looked quite different. And also 00:36:47.220 |
this data was collected with reinforcement learning and not with teleoperated demonstrations. 00:36:52.080 |
In the past, with all of the IL-plus-RL work that I mentioned, we found that combining 00:36:56.920 |
these two types of data was going to be very difficult because RL data has very 00:37:01.920 |
short actions. It's very quick. It's very optimized for the specific reward function, 00:37:06.800 |
versus human-collected teleoperation data is a lot more, you know, human-like, so to 00:37:11.960 |
speak. And finally, we revived a data set from many years ago at 2018. If you remember 00:37:16.640 |
the Kuka project, that arm farm has not been operational in that state for many years now, 00:37:21.000 |
but we had that data still. And so we were hoping to see if a different robot with a 00:37:25.920 |
different action space on different objects with different visuals in a different building 00:37:30.040 |
could still be combined with data from this micro kitchen robot data set that we trained 00:37:35.680 |
on originally. And what was very surprising to me is that RT-1 was able to learn 00:37:40.280 |
from all of these very diverse data distributions. I had never seen a result like this; no 00:37:44.880 |
other architecture, for example, a ResNet, or even another learning method like reinforcement 00:37:50.180 |
learning, had successfully learned on such different data distributions so robustly. 00:37:55.840 |
And we evaluated, for example, on combining concepts. So we would have the original Everyday Robots 00:38:01.000 |
robot pick up objects that were only seen in the KUKA project, or we would put 00:38:06.200 |
objects only seen in simulation and see if our policy could understand that. So it did 00:38:10.080 |
seem like it could generalize objects seen in other data sets, and concepts it 00:38:14.080 |
had seen in other data sets, into the setting it was in now in the real micro kitchen. And 00:38:20.760 |
I have a question. How did you combine the action spaces of the everyday robot with the KUKA? 00:38:27.160 |
Great question. Yeah, we just tokenized it and made sure that the tokenization scheme 00:38:31.560 |
was kind of interoperable. And I can dive into that a bit later 00:38:36.840 |
too. Yeah. And note that does not mean we can send the exact actions from one robot to 00:38:43.160 |
another and have it execute. It was more just that in the data set, I think even by human 00:38:47.320 |
inspection, you can tell that these are coming from two different robots. 00:38:51.680 |
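(A hedged sketch of what making the tokenization "interoperable" across robots could look like: each robot's actions are normalized by that robot's own bounds before being binned into the shared 256-token vocabulary, so identical token IDs mean a fraction of each robot's range rather than an identical motion. The bounds and robot names here are invented for illustration.)

```python
import numpy as np

NUM_BINS = 256

# Hypothetical per-robot action bounds; each embodiment keeps its own ranges.
ACTION_BOUNDS = {
    "everyday_robot": (np.array([-0.05, -0.05, -0.05]), np.array([0.05, 0.05, 0.05])),
    "kuka":           (np.array([-0.10, -0.10, -0.10]), np.array([0.10, 0.10, 0.10])),
}

def to_shared_tokens(robot, action):
    """Normalize by the robot's own bounds, then bin into the shared vocabulary."""
    low, high = ACTION_BOUNDS[robot]
    normalized = (np.clip(action, low, high) - low) / (high - low)
    return np.minimum((normalized * NUM_BINS).astype(int), NUM_BINS - 1)

# The same physical delta maps to different tokens on each robot.
print(to_shared_tokens("everyday_robot", np.array([0.01, 0.0, -0.02])))
print(to_shared_tokens("kuka", np.array([0.01, 0.0, -0.02])))
```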
So yeah, let's look at some ablations for the scaling laws that we're all here for now. 00:38:56.040 |
We found that, you know, reducing data set size reduces performance. But more interesting 00:39:00.240 |
maybe is task diversity was quite important. Here we have two different trends. 00:39:06.600 |
The green line is what happens when you reduce the total amount of episodes per task. And 00:39:12.120 |
then the other curve here, the purple curve, is what happens when you reduce the total number 00:39:17.260 |
of tasks. And we found that having more tasks is relatively more important than having more 00:39:23.080 |
data for each task. And I think this was a lesson that probably suggests that the way 00:39:29.920 |
we should scale robotics even further is not to just collect more data 00:39:34.080 |
of the same task in the same settings, but to go out into the wild and get more diverse 00:39:38.440 |
behavior. How do you define diversity for data? 00:39:44.200 |
Great question. Question is, how do you define data diversity? In this case, it's just a 00:39:49.040 |
number of unique structured templated commands that teleoperators receive. So those 700 templated 00:39:54.760 |
commands, when we start reducing them and only train on 500 or only train on 300 of 00:39:59.880 |
them, performance drops much quicker than if we had taken the same proportional cuts 00:40:05.200 |
to the total amount. Yeah, so I guess I'm very familiar with, like, it seems almost 00:40:15.760 |
a linear relationship for diversity and structure. Yeah, I don't think we, the question was, 00:40:25.280 |
there seems to be almost a linear correlation between data size and success rate. And I 00:40:29.240 |
think, you know, we could apply some fancy, like, you know, scaling law, you know, trying 00:40:33.240 |
to curve fitting, but we didn't look too much into that because, you know, this is a trend 00:40:37.800 |
that we kind of expected. We just weren't sure about the magnitude of how much it would 00:40:41.640 |
affect us. And I think I don't have any really good insights on this besides that we see 00:40:49.000 |
this phenomenon empirically. Yeah. Yeah, and great question. So the question is, oh, maybe 00:41:08.880 |
this will just go on indefinitely. Or is there something magical about, you know, January 00:41:12.760 |
and I think this is maybe also a, this is one we start to conflate the algorithmic exploration 00:41:19.920 |
with like the practical considerations of scaling real world operations, which was when 00:41:23.960 |
we got enough data, our policies were, you know, saturating on these tasks, hitting close 00:41:27.320 |
to a hundred percent. We were like, all right, let's collect another data set. So 00:41:31.360 |
we basically collect until it's at a hundred and then we switch to something else. But 00:41:35.240 |
at this point, what was interesting is that when we kind of bet really big on this RT-1 00:41:39.480 |
future, we'd already been collecting demos for a while. So it was possible that we had 00:41:43.600 |
collected more than we needed. And in some cases, I actually, you could cut tasks without 00:41:47.680 |
losing too much performance, which was quite interesting. But 00:41:50.200 |
yeah, great question. And the question is whether or not 00:42:20.040 |
all tasks are created equal in terms of like their capacity and entropy for different behaviors 00:42:24.120 |
you could learn from them. And yeah, that's definitely true. Some tasks are much easier. 00:42:28.000 |
We have a task that's just pick up this object. It's going to have much less interesting stuff 00:42:31.840 |
you can squeeze out of it than, you know, moving something into a drawer and then closing 00:42:35.620 |
the drawer. But yeah, great question. Great. Now, ablations. We also trained without the 00:42:43.760 |
big model size. We did it without pre-training, with continuous instead 00:42:47.920 |
of discrete actions, with autoregressive actions, without history, and without the transformer. 00:42:53.720 |
And I think all of these design choices did seem to be required for robust performance. 00:42:59.440 |
Oh, yeah, of course. Yeah, I think all I mean, like, and again, you know, for paper writing, 00:43:17.240 |
it's kind of like the best thing that we can empirically find. That's that's the method. 00:43:21.560 |
And then we'll figure out why each of these are important. And so, yeah, I think what 00:43:25.080 |
one surprising thing here, perhaps, was that autoregressive actions hurt. You might think 00:43:29.240 |
that passing in more information is always better than passing in less. 00:43:33.760 |
But in this case, maybe conditioning on your previous actions was doing something kind of 00:43:38.480 |
like in-context learning: it was doing online system identification to figure out which 00:43:44.120 |
teleoperator this data came from, and like how you can overfit to that specific set of 00:43:48.680 |
action history. And so removing that was actually better. One interesting tidbit there. Cool 00:43:56.880 |
then. And maybe in the interest of time, I'll try to get through the other ones a bit more 00:44:02.760 |
quicker. And then we can maybe just do a few, I'll just do the questions at the end, if 00:44:07.120 |
that's possible, just so we have time to get through everything. The next work here, moving 00:44:11.700 |
a bit away from skill learning, then and actually on to the planning level, I think the first 00:44:15.720 |
project took a lot of the design principles of other fields, and this offline robot learning 00:44:20.400 |
paradigm and put it into the skill learning. Can we actually bring that now to other parts 00:44:24.680 |
of the robotic system? And the first work here is SayCan. If you remember here, back 00:44:28.640 |
in this timeline, in 2022, we started thinking about, oh, yeah, how do we scale this multitask 00:44:34.440 |
imitation learning, but at the same time, large language models and, you know, other 00:44:38.880 |
types of foundation models are really picking up steam, whether it was Imagen or DALL-E 2. 00:44:44.440 |
And we definitely wanted to figure out how we could use those as well. We had come up 00:44:47.880 |
with this RT-1 design that we were betting big on. But from here, we started to explore 00:44:53.080 |
how, following bitter lesson 2.0, we could start utilizing foundation models within the context 00:44:58.280 |
of our full stack system. The problem of doing this naively is that language models are not 00:45:04.880 |
a completely natural fit for robotics. For example, if you're a robot in a kitchen, 00:45:09.900 |
you ask a language model, I spilled my drink, what can you do? Language model will give 00:45:12.960 |
you stuff that's not very relevant. It's going to ask you to vacuum it, it's going to ask 00:45:16.560 |
you to call a cleaner, or it's going to apologize. And these are not things that the robot can 00:45:20.640 |
do in your kitchen with your spilled drink to help you. And so there are two parts of 00:45:25.920 |
this then. The one issue is that our robots are limited. They are very constrained with 00:45:31.480 |
what they can do. They cannot do everything, but they can do certain things. And then the 00:45:35.600 |
second problem is that the language models are also constrained. They don't know what 00:45:40.560 |
the robot sees. They don't understand that they are in a robot body in a micro kitchen 00:45:44.760 |
needing to do real stuff in the physical world. And so we need to get the robots to speak 00:45:50.240 |
language model language, and then the language model to speak robot language. To do this, 00:45:55.440 |
we present SayCan. In the same setting, please put an apple on the table, we score the predictions 00:46:02.600 |
of the language model on a constrained set of tasks that we know the robot has been trained 00:46:07.240 |
to do. And then we also take the affordance function from the robot. An affordance function 00:46:11.120 |
is an estimation of, given some kind of state, what the robot is able to do, how confident 00:46:17.480 |
it is that it can successfully accomplish that task in the given state. In our case, 00:46:21.480 |
we use something like a value function from reinforcement learning, which kind of encompasses 00:46:24.800 |
this quality. Given these two values, these two scores, we have the confidence from a 00:46:28.760 |
language model, and then the confidence from the robot. We can combine these, and then 00:46:33.080 |
hopefully the combined prediction is both something that's going to be very semantically 00:46:37.000 |
relevant for the high level instruction. Finding an apple is the first step, and please put 00:46:41.400 |
an apple on the table. But it's also something that the robot can do. There's no robot in 00:46:44.960 |
the frame, but it knows that it's been trained to find an apple, so it can navigate around 00:46:48.800 |
to find it. And so hopefully we can do this then in closed loop, and then keep on going 00:46:53.100 |
and predicting a high level plan from the language model that's grounded with the affordance 00:46:57.160 |
function of what the robot understands. 00:47:04.120 |
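(A minimal sketch of that scoring combination, assuming you have a language-model scorer llm_logprob and a learned affordance or value estimate per skill; the skill strings are made up and this is not the actual SayCan implementation.)

```python
import math

def saycan_next_skill(instruction, history, candidate_skills, llm_logprob, affordance, image):
    """Pick the next skill by combining LLM relevance with the robot's affordance estimate."""
    best_skill, best_score = None, -math.inf
    for skill in candidate_skills:
        # How likely the LLM thinks this skill is the right next step for the instruction.
        semantic = math.exp(llm_logprob(instruction, history, skill))
        # How confident the robot is that it can execute this skill from the current state.
        feasible = affordance(image, skill)
        score = semantic * feasible
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill

# e.g. candidate_skills = ["find an apple", "pick up the apple", "go to the table", "done"]
```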
There's a video here of SayCan doing different stuff, but happy to share it later offline. It's very cool, trust me. It's the greatest 00:47:09.420 |
thing since sliced bread. And yeah, some numbers then. We tested this out on very long horizon 00:47:20.800 |
instructions encompassing more than 10 separate navigation and manipulation skills in the 00:47:26.080 |
micro kitchen that you see on the bottom right. We evaluated hundreds of different evaluations 00:47:32.460 |
on this, and we tested out a lot of different concepts, including things like rephrasing, 00:47:37.300 |
using single primitives, and drawing on instructions that just came from colleagues and friends. 00:47:43.980 |
And then we found that while there were failures in both the language model planning side, 00:47:49.420 |
where it would predict the wrong path for the current situation, as well as on the policy 00:47:53.000 |
execution side, even when it gets a good plan, the robot will mess up sometimes. Overall, 00:47:57.440 |
it was still doing quite well. And now let's kind of take this back to the lesson. I think 00:48:05.520 |
this is a very great example of how we can leverage internet scale foundation models 00:48:11.320 |
as they get better. When we started the project, we started with a language model called FLAN 00:48:15.280 |
from Google. Throughout our implementation, PaLM, the Pathways Language Model, came online. 00:48:21.040 |
And when that happened, we were able to just hot swap it in, and performance just kind 00:48:25.860 |
of got better for free without us having to do anything. By just assuming that language 00:48:30.100 |
was the API, the plan just has to be any string. It can come from any source. It can come from 00:48:34.440 |
a human. It can come from a language model. When we improve that language model, the system 00:48:38.340 |
gets better overall. And here you see the scaling: as the LLM increased 00:48:43.540 |
in size, our planning performance got even better. And some cool tricks here to get it 00:48:52.460 |
working. Well, how do we actually produce this plan? Well, just by prompting, as is 00:48:57.020 |
the rage these days, with chain of thought and with better prompting of just giving examples 00:49:01.540 |
of here are some great robot plans. Now give me a new plan starting with this high-level 00:49:06.860 |
instruction. We saw that the robot could do all things from understanding different languages 00:49:11.660 |
to asking them to do very complex reasoning, like, hey, give me something caffeinated, 00:49:16.780 |
or, I don't do caffeine anymore, get me something better. Or it could be, bring me a healthy snack 00:49:21.620 |
versus bring me an unhealthy snack. SayCan was able to reason through all of these. 00:49:29.380 |
I think that was our kind of the first contact of robotics with language models on our team. 00:49:34.520 |
And it was the first exploration into how these two worlds could overlap. There was 00:49:38.820 |
definitely still room for improvement, though. And in Inner Monologue, we tried to improve things 00:49:41.940 |
further by bringing in vision language models. The idea here is that we had very high plan 00:49:49.180 |
success rate with SayCan. But unfortunately, it wasn't really able to recover from failures. 00:49:54.980 |
What I mean by that is that the language model would not really get updates of what was going 00:49:59.040 |
on in the world, so that if this was the plan it proposed, go to the table, pick up a Coke, 00:50:03.140 |
bring it to you, but you messed up picking the Coke can. You dropped it on the floor. 00:50:07.000 |
It would still continue trying to bring it to you, put it aside, but all of that does 00:50:10.020 |
not really matter anymore because you dropped the Coke can. And so in this work, Inner 00:50:15.300 |
Monologue, we were really hoping to figure out how we could add closed-loop dynamic feedback 00:50:20.180 |
from the environment into this planning process. Let's take that exact same example. Now, instead 00:50:27.240 |
of just directly predicting every instruction, maybe we add back some feedback from the scene, 00:50:31.980 |
also conveyed using language as the universal API here. The scene can tell you what's actually 00:50:36.980 |
in there. Maybe the robot asks a question now. Here, this is a language model 00:50:41.660 |
asking the clarification question. Maybe here, a human responds or another language 00:50:45.380 |
model. Then you can predict the action or the next task to do once the language model 00:50:49.860 |
has enough context. And maybe you even add in stuff like success detection and so on 00:50:54.660 |
and so forth. How do we do this then? Well, the first thing that we implement is what 00:51:00.700 |
we call passive scene description. Just using either an off-the-shelf engineered heuristic, 00:51:06.060 |
using object detection models, something like ViLD, you can describe the scene in text 00:51:10.980 |
and just convey all that context to the language model. 00:51:15.420 |
For active scene description, this is maybe similar to visual question answering if you're 00:51:18.780 |
familiar with that field. The language model can actually propose active queries that it's 00:51:24.740 |
curious about in the scene, maybe to make sure that it has enough context to move on. 00:51:28.940 |
And here, either a human can provide the answer, or in the future, a VQA model as they improve 00:51:34.100 |
can provide that. And finally, for success detection, this is 00:51:38.340 |
very important to allow the language model planner to know when to try to retry something. 00:51:43.940 |
Here we take in the first and last image, fine-tune a CLIP success detector, and use 00:51:48.420 |
that to provide binary success/failure information back to our language model. 00:51:56.620 |
And results-wise, we ran a very similar SayCan long-horizon evaluation, but 00:52:02.340 |
here what's interesting is that we're able to basically implement all these different 00:52:08.340 |
automated feedback mechanisms on the robot, so that it's able to reason and recover 00:52:13.440 |
from things. Here you see it's going to try to go to the 00:52:17.660 |
table, but the human actually says, "Hey, I changed my mind." And then the human 00:52:23.300 |
changes their mind again, asking it to go back and forth. And the robot's able to, maybe we're 00:52:27.720 |
kind of torturing the language model at this point, but the language model's able to replan 00:52:31.300 |
and make sure that the human intent is satisfied. We also tried, I'm not sure if this video 00:52:39.980 |
shows it, but situations where we did adversarial inputs, where I walked around and just kind 00:52:44.780 |
of knocked objects out of the robot's hands, forcing the success detector to tell it, 00:52:49.620 |
"You messed up, try again." And we also tried this out on a couple of different domains, 00:53:00.140 |
a simulated tabletop manipulation domain, as well as a real-world manipulation domain, 00:53:04.200 |
and we found that this was much better than SayCan, or let's say just only using visual 00:53:10.200 |
features themselves with something like CLIPort. And I think here, it really speaks towards 00:53:17.600 |
a trend that I've really come to appreciate. In 2018, a robotics professor once said that 00:53:23.200 |
when they looked at all the different things preventing robot learning from scaling tremendously, 00:53:27.160 |
they thought the bottleneck was high-level semantic planning, about reasoning, about common sense. 00:53:31.460 |
And I think in 2022 and 2023, language models can provide one path of how this can kind 00:53:38.020 |
of be offloaded, at least in the interim. And I think if language models are the API, 00:53:43.820 |
then you can just bring in these vision language models as object detectors get better, as 00:53:48.240 |
success detectors, as VQA, as language models get better, you can bring them all into the 00:53:51.980 |
fold and they act as kind of a life vest. If your robot currently does not have common 00:53:56.820 |
sense reasoning, these other models can act as a scaffold and a life vest to bring you 00:54:01.140 |
up to par with what they currently know. And maybe then in the future, you'll get beyond 00:54:05.620 |
what the language models know, but in the short term, it does seem that we can leverage 00:54:08.620 |
them to accelerate what we can do in the real world. Moving on now: we saw how 00:54:15.840 |
language models can do planning. We saw how vision language models can help planning. 00:54:19.860 |
And now we're going to switch gears a bit and think about how vision language models 00:54:22.980 |
can help other aspects of the bottlenecks that robot learning faces. One of these is 00:54:29.140 |
that data collection is very expensive. As we mentioned before, we did have this 130,000 00:54:36.140 |
demonstration data set, but it was collected over a year and a half at significant cost, 00:54:42.180 |
both in resources and time and money and with many, many robots. And of course, these tasks 00:54:49.140 |
too were also a bit limited, right? We use 700 very templated commands, instructions 00:54:55.020 |
that we give to teleoperators, because we knew that this would scale, right? If we collected 00:55:00.140 |
enough data for each of these templated tasks, we could do that specific task. And here's 00:55:05.420 |
the flow that someone was asking about earlier. We give this "pick Coke can" instruction, the operator 00:55:09.820 |
controls the robot in the real world, finishes the task, marks the episode as terminated, 00:55:14.620 |
and then that goes into this big orange data set. And that big orange data set is 00:55:18.580 |
what we trained on in all of the previous projects for the control policies. What we 00:55:23.060 |
additionally considered was adding a bit of crowdsourced hindsight annotation. If you're 00:55:27.420 |
familiar with hindsight experience replay and reinforcement learning with goal 00:55:31.340 |
conditioning: you know, maybe the robot did something that wasn't just this high-level 00:55:36.500 |
template instruction. We could ask a human to describe more verbosely what the robot 00:55:41.500 |
did. Maybe it picked up the Coke can that was on the right side of the table. Maybe it picked 00:55:45.060 |
it up and then knocked it over. Maybe it moved it very slowly to the middle. There's a lot 00:55:49.500 |
of semantic diversity encompassed in this demonstration that is not totally caught by 00:55:56.460 |
this high-level templated "pick Coke can" instruction. So we labeled 3% of this big orange data set 00:56:02.580 |
with these very verbose descriptions. And next, we kind of applied the pseudo-label 00:56:09.420 |
strategy that's been seen in other fields, such as video pre-training with their inverse 00:56:13.860 |
dynamics model. But instead, we apply that to the instructions, to the semantics of what's 00:56:18.700 |
contained in your data set. So step one, we pre-train a CLIP model on your small labeled 00:56:25.380 |
data set of 3% of your main data. Then you go ahead and use that trained VLM to label 00:56:32.420 |
all of the templated instruction demonstrations that you had before in that 130,000-episode data 00:56:38.540 |
set. Now you have a relabeled data set, which has a large diversity of interesting 00:56:43.380 |
semantic instructions. And then we plug in all of these data sets into RT1 and just train 00:56:50.140 |
a language-conditioned behavior cloning policy, similarly to how we would normally. But even 00:56:55.740 |
though normally we just use data set B, the orange one, now we use all three data sets. 00:57:01.540 |
And then finally, we evaluate on entirely new unseen instructions. In the prior works, 00:57:08.340 |
we were evaluating mainly on the 700 templated instructions. But in this work, we actually 00:57:13.100 |
go beyond that. We can type in almost anything you want that you think might succeed. And 00:57:18.620 |
you can phrase it however you want. You can add typos. You can even do it by referring to 00:57:23.340 |
semantic concepts. You can add spatial concepts. And we see how it does. The reason that this 00:57:30.540 |
might work, maybe visually to represent this, is here are the t-SNE embeddings on the left 00:57:36.300 |
and the right. It's the same embeddings. But on the left, they're colored by the original 00:57:41.340 |
templated instruction that was used to collect that episode. And on the right is what the 00:57:47.460 |
vision language model thinks. If it's allowed to put a free form natural language caption 00:57:52.700 |
and assign it to that episode, you see that on the left, you have these big clusters of 00:57:56.660 |
pick Coke can. There are, you know, hundreds or thousands of episodes, but we just call 00:58:00.620 |
them all pick Coke can. On the right, we can then expand those concepts and say, actually, 00:58:04.860 |
this episode is picking up the red Coke can. This episode is picking up the crumpled Coke can. 00:58:10.320 |
This is picking up the Coke can that's next to the chip bag. And so you can get a lot 00:58:14.160 |
more mileage out of the same underlying data set by just using language as the diversity 00:58:19.140 |
mechanism through which you kind of expand the concepts that you're considering. And 00:58:23.300 |
for example, in the middle, you see, you know, open top drawer can become hold and pull out 00:58:27.500 |
the top drawer. We have similar examples in the center left and the middle. And 00:58:32.740 |
for the bottom one, pick green rice chips from white bowl becomes lift up the green 00:58:36.500 |
chip bag from the bowl and drop it at the bottom left corner of the table. So you got 00:58:39.860 |
a lot of these semantic, you know, spatial concepts that are now going to be in your data set. 00:58:45.880 |
I have a question. Yeah. Great question. So I guess if I can rephrase a bit, the problem 00:59:15.740 |
is that like, it's actually a very difficult and perhaps even intractable problem of how 00:59:20.180 |
you map all the linguistic concepts you see out in the wild down to like, maybe like embodied 00:59:25.060 |
specific types of episodes. And like, here, maybe I would say is that we are definitely 00:59:30.180 |
introducing a lot of our priors and our biases onto, like, maybe what we call left. Do you 00:59:35.980 |
mean left 10 centimeters or two centimeters? Like, what do words mean? And these 00:59:41.420 |
definitions, what do they mean to us, to the crowd compute raters that generated these 00:59:45.980 |
captions? What do they mean to the robot? What do they mean to the language models? 00:59:48.860 |
Maybe these are all slightly different, but the hope is at least if they're roughly similar, 00:59:54.180 |
we'll get like directionally correct improvements. So I would say the nuances of this specific 00:59:59.660 |
hard lines of definitions and like actual, like, you know, semantic meaning of these 01:00:04.900 |
words, I think that's maybe out of scope right now, but maybe something we'll dive into further. 01:00:10.500 |
At a higher level, though, I think basically the bar is just so low. We have the 700 templated 01:00:14.940 |
instructions that are basically one hot IDs, and we just want to make those closer to natural 01:00:20.100 |
language, even if by a little. And I think at least we're trying to get 01:00:25.500 |
towards that with these vision language models that are captioning automatically. Hope that 01:00:30.180 |
answers your question. And we also compare it to a few baselines on the top left here. 01:00:37.820 |
We look at what if we only train on this 3% of these fancy human rated labels? What if 01:00:43.580 |
we only train on the original RT1 data sets? What if we train on both of these? And what 01:00:48.820 |
if we train on both of these plus all of the predictions given by our VLM? And what's interesting 01:00:54.120 |
here is that, you know, relabeling seems to universally help. We evaluated only on novel 01:01:01.540 |
instructions that were new for this project. It's the first time on a robotics project 01:01:05.020 |
where we only tested on sentences I could type: whatever I thought of, I'd type it in. And 01:01:08.780 |
that became the test set. And we just had to make sure that it was never contained in 01:01:13.300 |
the training coverage. And you see all these interesting examples on the right here of 01:01:17.700 |
stuff like move the lonely object to the others. I have no idea how this works. Stuff like 01:01:23.580 |
new verbs, like lifting the yellow rectangle, talking about colors, talking about moving the right 01:01:27.700 |
apple to the left. Here, we actually had two apples in the scene. And actually in our training 01:01:32.420 |
demonstration data, we never collected scenes with duplicate objects, just because, you 01:01:37.180 |
know, we thought of this multi-modality problem. If you just say pick Coke can and there are two Coke cans, 01:01:41.100 |
it's going to be very difficult to figure out which one to do. But with language labeling, 01:01:44.860 |
it seems like maybe we could do that now. So even though we never trained on scenes 01:01:47.900 |
of two apples, now you can evaluate on them and just specify with language, which apple 01:01:52.180 |
you want to go for. And it was working pretty reasonably. And finally, for the last example 01:01:58.340 |
here, I thought it was kind of interesting. With a single Coke can, we try to do a novel behavior. 01:02:03.500 |
Push towards the left was not a templated instruction. We only had move Coke can near 01:02:09.180 |
Y, where Y is another object: move Coke can near apple, move Coke can near sponge. So pushing, 01:02:15.260 |
this motion of just pushing the Coke can into air essentially, was not something that we 01:02:20.340 |
ever encompassed, but maybe it was in one of the labels. Maybe like if you've seen like 01:02:24.700 |
move Coke can near apple and the apple's on the left, and you saw move Coke can near sponge 01:02:29.060 |
and the sponge is on the left, the model can generalize and be like, oh, 01:02:32.620 |
left means this side of the table, not a specific object. So maybe that's what's happening, 01:02:37.380 |
but it's very unclear. This is, as I said, you know, just, I thought of something, 01:02:42.380 |
I typed it and just saw what happened. And we definitely hope to explore this more quantitatively 01:02:46.980 |
in the future. Bottom left, of course, is I think comparing against non-visual augmentation. 01:02:51.980 |
So maybe you can also get these interesting concepts just from language alone, right? Here, 01:02:56.380 |
we tried adding random noise, or Mad Libs style, just swapping out words, or we even 01:03:01.260 |
use an LLM, GPT-3 in this case, to propose rephrasings of existing instructions. But I think my takeaway 01:03:07.620 |
there is that you really need visual grounding for the vision language model to say, actually, 01:03:12.260 |
yeah, this caption is factually accurate at this given point in time. And that it's, you 01:03:17.300 |
know, something perhaps that would be interesting for a robot. That fine-tuning process provides 01:03:22.300 |
both of those. Yeah, yeah, definitely. These are just some subsets of five of these evaluation 01:03:37.860 |
instructions, but we had over 60 of them. We didn't do a full quantitative ablation, 01:03:42.620 |
for example, as we did in RT1. We had this like seen and unseen task set, and that was 01:03:47.900 |
compositional. You would see, you know, move Coke near Apple, and you would see move Apple 01:03:52.500 |
near sponge, but we'd hold out, move Coke near sponge, and we would test that out. But 01:03:56.420 |
in this case, I think we can go much more beyond that. Because our language is completely 01:03:59.860 |
freeform, the compositional space of what you can kind of combine is just going to be 01:04:04.700 |
much larger. So we did try a little bit to answer your question. We tried some combinatorial 01:04:08.860 |
evaluations, but there's definitely a lot more thoroughness that we could do there, 01:04:14.140 |
too. How am I doing on time? Okay, 10 minutes. Maybe I'll try to wrap up pretty soon, then. 01:04:20.180 |
The DIAL takeaway, then, has two parts, right? Lesson two, leverage foundation models. 01:04:25.260 |
Let's use them as data augmentation. And lesson three, let's make sure that our offline data 01:04:29.460 |
set, you know, is robust enough where these different behaviors exist, and you can describe 01:04:34.500 |
them in language. If you don't have enough diverse behaviors, no matter how good your 01:04:37.940 |
labeling is, you probably can't elicit all of the interesting concepts that you want 01:04:41.540 |
to learn from. And maybe most exciting for me here was that actually some label noise 01:04:46.660 |
is okay. Notoriously, in supervised learning and imitation learning, you need very clean 01:04:50.980 |
labels that are always 100% true, right? You don't want to be learning from, like, noisy 01:04:55.540 |
data where some, like, you know, large percentage is just not accurate. But in our case, it 01:05:00.180 |
seems that, like, some label noise was okay. The vision language model was not always predicting 01:05:06.340 |
factually accurate descriptions of the scene. And I think this definitely hurt when it got 01:05:12.140 |
too high, the noise, but at smaller levels, it definitely still seemed to be okay and 01:05:17.060 |
robust enough to handle that. So, that was a deep dive, then, on some individual works 01:05:23.560 |
that use this big recipe of language, foundation models, offline data sets in different parts 01:05:29.460 |
of the robot system. And this was the kind of pitch at the beginning, and I hope you 01:05:35.260 |
at least see a little bit of how our team has tried to take these principles and apply 01:05:39.580 |
them to accelerating robot learning in the real world. As we see these different types 01:05:44.660 |
of ingredients and lessons map onto different parts of the robot system altogether. For 01:05:50.300 |
skill learning, right, that was RT1 that we talked about. For planning, that was SayCan, 01:05:54.260 |
and then adding the closed-loop feedback with vision language models, that was inner monologue. 01:05:58.580 |
For low-level control, we didn't talk about this today, but an exciting work from our 01:06:01.700 |
team is actually using language models to predict code that's executed on the robot 01:06:06.180 |
directly, perhaps as low-level controllers. Language models, you know, they read textbooks, 01:06:11.100 |
they've read ROS docs, they've read, you know, UR5 documentation and code, and they can 01:06:14.860 |
write code for these robots, and we can execute that. For data augmentation, we saw DIAL with 01:06:20.100 |
vision language models. And also, I didn't talk about this here, but for object-centric 01:06:25.100 |
representations, for things like feature activation maps for specific objects, we can use those 01:06:29.820 |
as task representations for mapping a scene. And in NLMap, they did that for object-centric 01:06:36.100 |
navigation around the micro kitchen that we looked at. And I think, hopefully, in the 01:06:41.500 |
next, you know, coming weeks and months, we have a few more rows and entries to add here 01:06:45.780 |
as well, but I think this kind of mindset is a very exciting research direction of how 01:06:52.020 |
you can apply these big high-level concepts about foundation models and offline data sets, 01:06:56.380 |
when you look at what exists in the robot systems of today, and you find many gaps and 01:07:00.700 |
opportunities still available where we can do everything from exploratory pilots on how 01:07:05.880 |
this might look, all the way to more extensive evaluations and really building 01:07:09.460 |
out robust systems. I think both of these have value. So, I'll conclude with just saying 01:07:16.380 |
that it was very fun exploring all of these complementary directions, but there are still 01:07:21.160 |
some major questions of how we can take these concepts even further, and how these trends 01:07:25.680 |
and ideas might even evolve moving forward as foundation models get better, as more data 01:07:30.380 |
sets become available online, as more data becomes homogenized and tokenized and interoperable. 01:07:36.120 |
And I think a lot of the concepts from other fields, like linguistics and vision, and from, 01:07:40.660 |
you know, all of the big scaling kind of level questions that are being pioneered in language-based 01:07:46.580 |
foundation models, hopefully, those kind of ideas can trickle down to robotics. Maybe 01:07:50.660 |
even robotics can provide something back by providing embodied action causal data sets 01:07:55.380 |
that maybe might improve the quality of reasoning of some of these large language models that 01:07:59.820 |
are not embodied. With that, though, I guess I'd like to, you know, thank everyone for 01:08:05.520 |
your time and for Dave and Sia for inviting me, and open to any questions about the papers 01:08:11.000 |
or just at a high level as well. Thanks so much. 01:08:36.020 |
Yeah great question. So the question, I guess, is like, what about tasks that require more 01:08:39.560 |
semantic reasoning, like, you know, operating at a certain speed or with maybe like, I don't 01:08:44.400 |
know, numerical reasoning within the question, the prompt itself. I would say, so for a lot 01:08:50.040 |
of the more common sense reasoning, like, you know, throw away three Coke cans, you know, 01:08:55.880 |
one after another, I think, you know, the language model is very good at that right now. So for 01:09:00.160 |
the SayCan planner, it will predict, you know, throw away the Coke can three separate times. 01:09:05.260 |
For the low-level skill policy learning, though, I think that's more high 01:09:11.120 |
variance, I would say. And definitely for right now, we don't really condition on speed 01:09:16.680 |
or how you do it exactly. But that's definitely maybe something I could do if you could relabel 01:09:22.420 |
with, like, pick up the Coke can slowly versus pick up the Coke can quickly. Maybe that is 01:09:26.840 |
something a vision language model could recognize. 01:09:55.660 |
The question was, at what scale do we see like combinatorial generalization start to 01:10:00.700 |
occur, maybe between like, you've seen colors of one block, and then you want to evaluate 01:10:04.820 |
on a new color? And I think that's a great question. And unfortunately, my answer is 01:10:09.060 |
going to be very vague. And it depends. It depends on how you define your tasks. It depends 01:10:13.240 |
on the scale of your data set. And it depends on like, the concepts that you're trying to 01:10:16.400 |
generalize across. I think there have been numerous attempts to kind of basically formalize 01:10:22.440 |
what it means to generalize within, you know, learning and within robotics, even within 01:10:26.960 |
like the specific settings we consider. And I don't think there are any clear trends 01:10:31.600 |
where you can say, oh, yeah, this is the number I need to hit where, you know, 01:10:35.480 |
I can generalize across x, y, z dimensions. Like, you could evaluate all those, but I 01:10:39.680 |
don't think it will help you predict new trends, at least right now. I think we're probably, 01:10:43.160 |
you know, this is just me talking, I would say we're one order of magnitude off before 01:10:47.480 |
we can start to make very broadly generalizing statements about generalization capabilities. 01:10:53.680 |
I think, you know, add one or two more zeros to our data set size, and we can start to 01:10:57.360 |
talk about that in terms of tasks, objects, and skills. Yeah. 01:11:18.440 |
Yeah, very astute observation. So the question was that in SayCan, the value 01:11:34.200 |
functions that predict these scalars on the right here for the affordances are only scoring 01:11:39.280 |
a certain limited number of tasks. So is that the bottleneck? And I would say yes, 100%. 01:11:44.000 |
Scaling the number of tasks that your system is able to do that you can then give to the 01:11:48.240 |
planner as its buffet of options to choose, that is the bottleneck, right? No matter how 01:11:52.480 |
good your planner is, if you can only do like three tasks, there's only certain like combinations 01:11:58.320 |
of those three tasks that it can do to, you know, map on to a high level instruction. 01:12:02.920 |
So as you add more tasks, as the low level skill capabilities of your robot increases, 01:12:08.160 |
you're kind of like adding precision to like the coverage of the high level instructions 01:12:13.040 |
that your robot can try to do. So that's one of the main bottlenecks I see today. 01:12:32.080 |
Great question. So have we tried RT1 with RLHF or with RL? I think the short answer 01:12:39.920 |
is I think we have some stuff in the works that is doing that. But right now, for all 01:12:43.760 |
of our projects, currently, we're just using this imitation learning loss. Again, 01:12:49.240 |
I think I view this multitask imitation bet that we're making as kind of an existence 01:12:52.960 |
proof. It works, it's not cheap, but it kind of does work and it does scale. And that at 01:12:58.160 |
least is a good starting point. And our main, you know, hope over the next months and years 01:13:03.120 |
is can we improve beyond that? Can we add back in offline improvement? You know, can 01:13:07.040 |
we add in RL back to the equation somehow? I'm an RL person at heart, so I really hope 01:13:37.040 |
Yeah, good question. So regarding task balance and whether text-only data is sufficient for 01:13:54.200 |
helping motor control learning, I think my hope is that when, you know, when we experience 01:14:01.800 |
emergence in both the robotics space and we've already seen emergence in the language space, 01:14:06.600 |
at some point, maybe these reasoning concepts will start to transfer between the two. I 01:14:10.520 |
would point you to one interesting paper, which is, I think, Can Wikipedia Help Reinforcement 01:14:15.120 |
Learning, from Shane and some other folks. They pre-train, you know, a large policy network 01:14:21.360 |
on, like, you know, autoregressive token prediction on Wikipedia, just text only, and they use 01:14:25.880 |
that to initialize, like, control for Atari games with RL, and this actually helped. So, 01:14:31.400 |
you know, maybe this is philosophical, but maybe there's something about decision-making 01:14:34.800 |
reasoning that transfers between text and action data, so. 01:14:42.760 |
Great question. I definitely agree. You know, passing in six images is not going to be enough 01:14:58.840 |
when you're executing tasks for minutes at a time. Like, clean my whole house, and then 01:15:02.600 |
you can only pass in the last, like, you know, two seconds. Like, come on. So, I think that's 01:15:07.400 |
definitely going to be a limitation as our tasks get more complex and long-horizon, and 01:15:12.280 |
I think here, another open question, too, is context length. We have high-dimensional 01:15:16.960 |
images, even with TokenLearner for reducing the number of patches that we pass through, 01:15:22.100 |
it's still, you know, very high-dimensional, and we quickly hit the context length cap. 01:15:26.620 |
Can we do, how do we, you know, improve beyond this? Maybe it's like retrieval transformers 01:15:31.240 |
or some other kind of mechanism. Great question. I think we are hoping to explore 01:15:41.160 |
that in the future, but with this, like, context length limitation, we are already near the 01:15:45.280 |
context length capacity with just these six images alone, much less, you know, passing 01:15:50.140 |
in whole trajectories of zero-shot behavior, few-shot behavior we wish to see. So, TBD.