
Stanford CS25: V2 | Robotics and Imitation Learning


Transcript

I'm really happy to be here. To briefly introduce myself: my name is Ted Xiao. I'm a senior research engineer on the Google Brain team. I've been working on robotics for the past five years. I've touched upon a few topics including multitask learning, reinforcement learning, and lately just broadly thinking about how we can scale robots so that they can actually work in the wild, in the real world.

I guess today I'll be talking about quite a few different topics, but as a first preface, the first thing to know is that our team is pretty massive now. All of these projects are huge collaborations, with some projects having more than 40 people working on them for many years.

So these are large efforts, and I'm just very fortunate to be on teams of very smart people. Secondly, some of my takes are spicier or more controversial than others, so all of those opinions are definitely only my own and don't reflect those of Google or anyone else on the team.

So with that out of the way, yeah, welcome to my TEDx talk. I think maybe some of you have seen a lot of the cool robot learning videos out in the wild these days, but I am more excited than ever, and it's not just hype. I think there's been a fundamental shift in how researchers in robotics view learning over the past two years, and I think the shift has a lot to do with all of the trends happening more broadly in foundation modeling, in large-scale internet models, across different fields like language, audio, and so on.

But my goal today is to convey to you why I am particularly excited about this time right now, and why there's been a very fundamental, 180-degree paradigm shift across the robot learning field. And if you walk away from this talk with just one thing, that you're a bit more excited about robotics than you were before, or that you believe the time is now for these robots to really start scaling exponentially and doing something really cool, then I think my talk will have succeeded.

The talk will have a few parts. We're going to start at a very high level and just talk about why a foundation model for robotics at all, what that might look like, and the ingredients and recipe for how we might get there. Then we'll dive into a few different works pretty deeply that my team has been very proud of over the past year or two.

And finally, we'll go back to the high level and then zoom out and think about what's next for robot learning. So why a foundation model for robotics? One second, let me try to hide this thing. No, that's fine. I'll keep that bar there for now. But the top bar says why a foundation model for robotics.

The term foundation model was coined here at Stanford, and I'll use the phrases internet-scale model, foundation model, and large language model pretty interchangeably throughout; I hope that's pretty clear. But generally, when I'm talking about these big monolithic beasts that are trained on tons of data, they have two very important properties that I think are quite nice.

One is emergence. When very simple things kind of work at a small scale, they get a ton better when you just scale things up: more data, more compute, larger models. And what we see is that once these models become good enough, the space of what they're good at and able to do starts to grow combinatorially.

And for these two points, there are two blog posts I highly recommend. One is from Jacob Steinhardt, called More Is Different for AI. It links to phenomena that we see in other fields like physics or biology: for example, individual water molecules behave very differently, with, let's say, very different electrostatic forces, but then they start to clump up and behave as a liquid altogether.

We see this in herds of animals and flocking patterns, we see this in humans and economies, we see this all across different fields. And now even in AI, we see models doing things that would not even be possible when they were at a smaller scale. But when they reach some critical size, they start to work really, really well.

This is documented by Jason Wei in his blog post on emergence in LLMs, where you see the plot on the bottom left: across a bunch of different tasks, whether it's modular arithmetic or Persian question answering, the success rate is basically flat until these models get big enough, and then the success rates just skyrocket.

And that's why I think these are particularly exciting. So yeah, question: do robotic foundation models display this kind of scaling in real life? Great question, and I'm really glad you asked. I'm pretty excited to present some directions we have along those lines that I hope will answer your question in maybe 10 minutes or so.

Yeah. But I think that's a question on all of our minds, including mine. So before we even get to the feasibility or the existence of any robotic foundation model: is this even needed? And the argument, which I don't think is obvious, is that emergent capabilities, and relying on them, might actually be indispensable for robotics to work at all.

A lot of the research over the past decades of robotics has been in one bin, one room, one table, one robot, even one building, but these are vastly different from the orders-of-magnitude more complex real-world situations that humans operate in every single day. And I think to make that gigantic leap, we're going to have to rely on this emergent-capability scaling curve. Right now, things kind of work.

You have very canned demos. Maybe you have, you know, a humanoid robot programmed to backflip after hundreds of trials, but to go from that to the chaotic real world, I think we're going to have to rely on this emergence phenomenon. And maybe even intellectually or academically, it's also interesting to think about why a foundation model for robotics might or might not work.

It's worked in so many other domains. There are existence proofs in audio, music, coding, language, and, it seems, another domain every single day, with 3D models and beyond. But maybe there is something very special about robotics, whether it's embodiment or causality or physical grounding, that is the barrier to this very simple recipe that's working in all these other domains.

If there is something special about robotics that causes this recipe to fail, I think it's quite interesting to study why that is. I'm personally an optimist. I don't think there is some magical secret sauce that's going to keep robotics from being tackled with the same formulas and recipes that have worked elsewhere.

But, you know, I think this is a question I'd like to find out the answer to. So maybe then, instead of just motivating this philosophically, okay, we need foundation models, foundation models are great, let's try to build one for robotics: how do we actually do that? Well, I think we can leverage a few ingredients by standing on the shoulders of giants and looking at other domains.

The first one is looking at the design principles of ML scaling from other domains. Let's look first at high-capacity architectures, the topic of this class today: ideas such as self-attention and all the different ideas encompassed in the transformer, which, as Andrej Karpathy famously said, is like a magical universal differentiable computer that's very general, very robust, and remarkably scalable along many different dimensions.

Let's use those. We should also leverage the guiding principles that have been observed, the scaling laws, the trends, this year's Chinchilla: we not only have to scale the model size, we also have to scale compute, and we also have to scale the number of unique tokens in the corpus of the vast datasets that we train on.

But if we do all three together, this has been shown to reliably have a pretty good chance of succeeding, no matter what domain you're looking at. And finally, what that kind of means, and I think this is actually going to come up later, is that dataset size seems to matter a lot more these days than quality.

Even if you have some sentences on Wikipedia that are misspelled, or some, you know, falsehoods, or some things that aren't so desirable, if in aggregate, your data set is diverse enough, and interesting enough, these things will hopefully wash out in the mix. Ingredient number two, the proliferation of the internet scale models themselves, not just the principles.

What's exciting, and I'm sure it's definitely been very shocking for both experts and lay people alike, is that a lot of these generative models across many different modalities have been exhibiting emergent capabilities and have been surpassing our wildest expectations time and time again. Even when we think we've seen it all, that this stuff is too much, it's not going to work, something will come out and completely blow me out of the water.

And I think this trend will definitely keep continuing. In addition, these improvements will not only keep coming and accelerating, they're going to happen whether or not we do anything: broadly speaking, for me as a robotics researcher, or for you in whatever subfield you're in, there are parts of machine learning that you'll probably never touch, at least in the near future.

And those parts will be seeing tremendous breakthroughs, scaling, and new capabilities coming online every single week. And you can see this not only in the impressiveness of the models, but also in the acceleration of progress, the timescales on which new models are being released, where large collaborations are being worked on by many groups and then made available for all to use and build upon.

And the final ingredient in this trend is more of a robotic specific one, but it is a vast shift from online robotic learning, where robots collect experience online, make actions and learn through trial and error to an offline setting where we decouple the data generation process from the data consumption process.

As we've seen in all these other foundation modeling domains, these big internet-scale datasets are very diverse, and they're static: we just scrape them once, or scrape them multiple times continuously, and aggregate a pile that just keeps growing. Here, we see The Pile dataset from EleutherAI, or LAION-5B for paired image-text data.

And these are pretty big, orders of magnitude more than what we've seen before, and they are definitely a key ingredient in why other domains have been doing so well at training these big foundation models. Coming back to robotics, I'd like to take a brief detour into how this shift came to be, because it's very easy to say in a sentence: yeah, robotics is offline now more than online.

And this is coming as kind of a no brainer to many folks who are coming from other domains, like this is the way things are done. But in robotics, this has been a very big shift. And I think robotics has also been synonymous with RL, reinforcement learning for a lot of people.

And I think increasingly, this is becoming less true. So I'd like to take you on a brief trip through the history of my team; this part of the talk is a brief history of robotics at Google. And this is not just for dramatic exposition; it's really to guide you through how drastically our team's thinking has evolved over the years, and how that informs the design decisions and the kinds of risks and research directions we take in the specific projects that I'm going to show coming up.

Thank you. So in 2016, some of you may have seen this, we had what we called the arm farm: seven KUKA robots in a room collecting picking data 24/7. This was doing online RL in the real world. We were the first team to say, hey, can we even do this, with the goal of doing end-to-end robot learning with results in the real world. This was kind of risky at the time; it was not a common take.

And from that we developed several interesting research directions that we started exploring. We looked into stuff like QT-Opt, which is a Q-learning method working on continuous control actions while taking vision inputs; we worked on CycleGAN to transform simulation-based images into realistic-looking images for sim-to-real; and we looked at concurrent control, how we get robots moving faster and more efficiently in the real world.

I'm sorry, do you have a question? Yeah, great question. And that one, I think was basically, the arms would pick stuff up from the bin, if they messed up, and it fell out, well, we come back the next morning, and there'd be objects scattered all throughout the room. So there was no reset.

But if they missed a little bit, the objects would fall back into the bin and hopefully be in a position where they could pick them up again. Oh, yeah, of course. Thanks. I'll do that in the future. The specific question was: for this 24/7 arm farm, how did we do resets?

And the answer is, well, we didn't; we designed the bins so that they were kind of banked, so that if an object was slightly missed, it would fall back into the bin and rearrange itself, maybe adding more diversity to the training data. But this was doing off-policy online RL with Q-learning.

And we mixed in some data and deployed again. Next, we went through a consolidation phase around 2020, when we were like, alright, this is pretty cool, but we want to get out of the bin: how do we do more complex tasks in a more practical setting, closer to something more general that humans would want to use every day?

There, we settled on this office micro-kitchen environment, if you've heard of the famous Google micro kitchens; that was the setting we decided to operate in. And there, we started collecting data, we scaled our real-world operations, and we scaled a few different approaches.

And I think in the bottom right here is like the more mechanized reset version, I would say of the arm farm. Here, we had a bin that folded in half. And this was doing multitask RL in the real world. And the bin would flip in half dumping objects from one side to the other.

So you could do more interesting tasks, whereas the arm farm was just pick anything up. Now we could say, hey, pick up the carrot and place the tomato onto the plate, and then the bin would flip and you'd reset. Some other work was on multitask imitation learning; this is BC-Z.

And then we also looked at stuff like combining reinforcement learning with imitation learning bootstrapping. But in 2020, once again, we realized we were working on a ton of different directions, and we wanted to consolidate. And I think the two main things that were really bothering us at the time were that we were hitting two main walls across all these methods: some of them were plateauing in this rough 50 to 70% success range in the real world.

And other methods required very specific data distributions: they had to be on-policy, or they could only use demonstrations, or so on; there were so many different nuances and gotchas to all these different methods, and all these different drawbacks. And so the question we posed was: we're open to any method, any strategy, that will enable us to solve tasks in a very performant manner, more than 90% in the real world.

And also that can scale with some kind of data that we can collect, you know, and maybe this is a bit more lax than let's say, an academic setting where you're much more resource constrained. But at the end of the day, you know, even our team does not have infinite money, we still have a certain number of robots, a certain number of operators, and we're constrained by the laws of physics.

So we need some way to acquire more data that we can then learn from. We were all scratching our heads thinking about this for a few months, and in spring 2022 we decided on going with multitask imitation learning. This was a vast departure from the 24/7 arm farm.

This was a vast evolution of how we approach the problem. We found that, you know, with enough, you know, gentle care and love, multitask imitation learning was able to hit these 90% numbers, and it was able to get better with more demonstrations. These aren't the cheapest thing, but it was able to scale with additional demonstrations, which was the sign of life that we were looking for.

So that brings us to less than a year ago, our team was deciding this is the path forward, at least in the near term future. But maybe, you know, we could just think about how the approach we were taking here might also spread out in the future. And we might be able to bring back these other threads.

For example, now that we're decoupling the data collection of demonstrations from how you learn from them with a multitask imitation learning policy, maybe in the future we can do something like offline RL. But at a high level, in a few short minutes I've just compressed six years of very bitter lessons that our team has been learning.

And I think from where we are today, looking back even just two years, if you had told me that the strategies we're deploying today could scale the way they are, I probably would not have believed you. Great question. So I think task conditioning was definitely still an open question at the time.

But with this work, BC-Z, we found that language, at least a templated language representation, was good enough to direct BC-Z's over 80 tasks. They were very templated, like pick grapes, or move grapes onto plate, or drag cloth across table.

And this representation was still enough for learning a good number of skills: you're passing essentially a one-hot ID into your policy network, and it will learn to imitate that task. And for each of those 80 tasks, we'd collect hundreds or thousands of demonstrations.
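To make that conditioning concrete, here is a minimal sketch of templated task conditioning for a behavior-cloning policy. The task names, the dummy policy, and the 7-DoF action are illustrative placeholders, not the BC-Z code.

```python
import numpy as np

# Illustrative templated commands (BC-Z used roughly 80 of these); names are hypothetical.
TEMPLATED_TASKS = ["pick grapes", "move grapes onto plate", "drag cloth across table"]
TASK_TO_ID = {task: i for i, task in enumerate(TEMPLATED_TASKS)}

def one_hot_task(command: str) -> np.ndarray:
    """Each templated command becomes essentially a one-hot task ID."""
    vec = np.zeros(len(TEMPLATED_TASKS), dtype=np.float32)
    vec[TASK_TO_ID[command]] = 1.0
    return vec

def policy(image: np.ndarray, task_conditioning: np.ndarray) -> np.ndarray:
    """Stand-in for the imitation policy: (image, task conditioning) -> action."""
    return np.zeros(7, dtype=np.float32)  # dummy 7-DoF end-effector action

action = policy(np.zeros((256, 256, 3), dtype=np.uint8), one_hot_task("pick grapes"))
```

The point is just that the conditioning signal is a fixed task index; swapping it for a real language embedding is what the later projects explore.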

And I will touch upon the specifics of that a bit later, too. So yeah, today, or at least in 2022: let's do offline methods, let's decouple data generation from data consumption. And let's take these three lessons that we touched upon, the design principles of ML scaling, and figure out which lessons can actually be applied when you look into the future at a recipe for robot learning and foundation models.

The first lesson I think is very important is these high-capacity architectures like attention. The second, which I'll touch on later, is data interoperability: tokenization, discretization. And the second ingredient is the proliferation of these models themselves: can we leverage them, because they will get better over time? And here I would like to plug my colleague Karol Hausman's bitter lesson 2.0.

The first one, from Richard Sutton, was that you should leverage methods that scale with more compute. And maybe in today's day and age, the lesson is that we should leverage methods that are able to utilize improvements in foundation models, because they're going to get better. Yeah. So for both bitter lesson 1.0 and 2.0, one thing that's never been clear to me is: suppose I have a set of methods.

And I want to choose the methods that are going to scale with more compute, or in this case, scale with better foundation models. The question is, how do I actually decide which of those methods meet those criteria? Yeah, great question. I think... I don't have a good answer for that.

Oh, sorry. Yeah. The question was, in bitter lesson 1.0 and bitter lesson 2.0, that's great, that's the lesson, but how do we actually decide which methods meet the criteria? And my answer is that it's not always obvious, and it's actually quite tricky sometimes. Sometimes you can be very confident that, oh yeah, this will definitely scale with more data and compute.

And for some it's the same, but basically: the more hard-coded you are, the more assumptions and heuristics you bake in, and, in our day and age, the more you rely on a specific implementation of a specific foundation model or a specific class of algorithm, maybe that will be less robust than a method that just assumes some very abstract input and output, and assumes that how you get from that input to that output can improve over time.

And maybe the algorithm itself even changes altogether. So I think that would be my take on bitter lesson 2.0, but the jury is definitely still out on this. And one of the things I like to propose is that language is the way we can leverage bitter lesson 2.0.

If you have language as the universal representation through which all of these foundation models communicate with each other, whether it's captioning or generation or whatnot, I think that's one way we can leverage bitter lesson 2.0. And finally, the third ingredient is offline robot learning, decoupling data generation from data consumption. Putting these all together, my recipe for what one modern attempt at embodied intelligence would look like is to combine these large offline datasets with high-capacity architectures, using language as the universal glue.

And in the works I'm going to present shortly, all of our different projects are, in some way or another, inspired by this philosophy. And now that we've understood the motivations and one possible approach: large offline datasets, high-capacity architectures, language as the universal glue...

I'm curious to know which, if any, of these are currently bottlenecks, maybe not the right word, I mean which of these are limited. Got it. Because it seems to me like we already have large offline datasets, we have high-capacity architectures, and those architectures can already work with language, so it seems like we already have all the components necessary.

So why is this then not a solved problem? The question was: it seems like we have a lot of these ingredients, so why hasn't robotics been solved yet? I would argue that this take here, and maybe I'm presenting it to the wrong audience at the moment, is actually very non-obvious across the robotics field.

Many people do not agree with all of these points, much less two of them, or even any of them. And also, how mature each of these components is within robotics is at very different stages. We can talk a bit later about, for example, data scale, or the architectures that have kind of diffused through osmosis from other ML domains into robotics.

But I think we're still at very different stages on how, how much people have actually bought into these lessons and invested in them. Yeah, I can probably, I also don't want to get into too much trouble here, but I'll probably get myself in a bit of hot water in a few slides.

So I'll extend upon it a bit then. The comment was: I'm just curious to know what their opinion is and why you think they're wrong. Yeah. And I would say that me personally, and, not speaking for my team, but a lot of people on my team, are probably at the very extreme end of learning, scaling, data-driven, foundation-model-based, let's go big.

And I think a lot of people don't believe that. And yeah, happy to discuss why later, maybe after the Zoom as well. So yeah, okay then, let's go ahead and dive in and see how this recipe might actually percolate into specific domains. And the first one is RT-1.

This is a recent work from our group on how we can scale imitation learning. Let's look at how we can actually apply these first principles. Let's put ourselves into the spring 2022 mindset: we've been collecting demonstrations for a while.

We have a ton of demos, like a hundred thousand, collected over about a year and a half on many, many robots across many, many tasks. It was expensive. And over time, this will not trickle up at insane rates; we won't just get a hundred thousand new high-quality demos every day.

This will grow over time, but it's not going to grow for free, and autonomous ways of doing this are very hard, as you saw earlier with MT-Opt with the bin reset mechanism, or DeepMind's work on RGB stacking, where they try to do autonomous resets. The way we're doing it right now, or at least for this paper, was human teleoperation, pioneered by BC-Z, and that was very expensive as well.

So there's going to be limited throughput. And finally, BC-Z used a ResNet-based backbone, and it was pretty good, but we found that it was very sensitive to training distributions. For example, when they removed data from some teleoperators to make the data more homogeneous, performance got better, and that's not really a property we like, right?

We want more data, even if it's not exactly the same. So the lesson here: models need to be robust and they need to generalize. Cool. So we need models that are robust and generalize. What else do we have? Well, off-the-shelf models are pretty slow. If we take these huge vision transformers from other domains, they're not going to run on the real robot.

We need to be able to run at a pretty high frequency. The models need to be reactive; inference time needs to be low, even though all our models are vision-based. And finally, we want our models to be able to understand language. As I mentioned, language is the universal glue.

Our dataset already has some language, and we want eventual models to be very multimodal. These are first principles that we need to dig into. What does this mean? We can't just take something existing; we probably need to design, or at least modify, something from the ground up, taking the best practices that we've seen work in other fields.

And so we worked for a bit and came up with this architecture for RT-1. Once again, this was a large team with a bunch of different contributions, and I'll just go through a few of them here. At a high level, RT-1 is Robotics Transformer 1. It operates at 3 Hz.

It takes in visual input from the robot's RGB camera as well as a natural language instruction. The image is patchified and fed into a FiLM EfficientNet tokenizer. It's then passed into TokenLearner, which I'll talk about soon. The language instruction is also tokenized, and everything goes into the same transformer.

And then finally, we output discretized actions as tokens and send them to the real robot at 3 Hz in closed loop. This transformer is decoder-only. We use a sparse categorical cross-entropy objective for action prediction, applying a causal mask. We use a pre-trained EfficientNet backbone, and we also use TokenLearner for faster inference.
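To tie the pieces together, here is a schematic sketch of one control step as just described: tokenize the camera images with the language-conditioned backbone, compress with TokenLearner, run the decoder-only transformer, and emit discretized action tokens. All function names, feature sizes, and the action dimensionality are placeholders, not the actual RT-1 implementation.

```python
import numpy as np

def rt1_step(frames, instruction, image_tokenizer, token_learner, transformer):
    """One control step of an RT-1-style policy (schematic sketch only).

    frames:      the last 6 RGB camera images.
    instruction: natural-language command, e.g. "pick up the coke can".
    """
    # 1. Each image becomes 81 visual tokens; the instruction conditions the
    #    EfficientNet backbone via FiLM layers.
    tokens_per_frame = [image_tokenizer(img, instruction) for img in frames]  # 6 x (81, d)

    # 2. TokenLearner compresses 81 tokens per frame down to 8 for fast inference.
    compressed = [token_learner(t) for t in tokens_per_frame]                 # 6 x (8, d)

    # 3. A decoder-only transformer with a causal mask reads the 48 tokens and
    #    outputs logits over 256 bins for each action dimension.
    context = np.concatenate(compressed, axis=0)                              # (48, d)
    logits = transformer(context)                                             # (action_dims, 256)

    # 4. The most likely bin per dimension is the discretized action sent to the robot.
    return logits.argmax(axis=-1)

# Dummy stand-ins so the sketch runs end to end; sizes here are illustrative.
d = 512
frames = [np.zeros((300, 300, 3), dtype=np.uint8)] * 6
action_tokens = rt1_step(
    frames, "pick up the coke can",
    image_tokenizer=lambda img, instr: np.zeros((81, d), dtype=np.float32),
    token_learner=lambda tokens: tokens[:8],
    transformer=lambda context: np.random.randn(11, 256),
)
print(action_tokens.shape)  # (11,)
```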

Diving a little bit deeper. Oh, sorry, yeah, a question. Great question. So the image tokens: each image is the high-fidelity RGB image from the camera, and we split that up into 81 separate patches. Each patch is spatially just a square of the image.

But the cool thing is what TokenLearner does here: it's a previous work from our group that takes in a bunch of possible image patch tokens and dynamically selects which of those tokens are most relevant for the task at hand, given the existing context.

So from those 81 image patch tokens, we subsample eight of them to use for inference. This happens at every time step, and the process learns which eight patches are relevant at any given moment. Otherwise, we'd be sending in way too many tokens; the context length would explode and we wouldn't be able to do inference on robots.

We are also passing in a sequence length of six images. History is quite important when you're doing temporally coherent tasks in the real world, where things like physics and the nuanced details of what the objects are doing in relation to each other and to your robot really matter.

Those details really matter. And in total, the model size is 35 million parameters, which is quite a bit smaller than a lot of these other huge internet-scale models. And finally, one main difference here is action discretization. Before, a lot of the projects we were doing used continuous control.

And if you think about it, our robot does end-effector pose control with position control, and there the real world is a continuous state space. To handle that, we had to come up with many algorithmic novelties, for example a CEM actor that basically sampled these continuous action spaces to propose the best candidates, which would then get rated by the Q-function.

And we do this twice, blah, blah, blah. That's so sensitive, but we needed to do it to get things to work. But now we just decided: let's just bin our actions, only 256 discrete bins per dimension, and let's just predict those as tokens; a minimal sketch of that binning is below. Any questions?
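As a concrete illustration of that binning, here is a minimal sketch of mapping a continuous end-effector action to 256 bins per dimension and back. The bounds and the 7-dimensional action here are made up for illustration, not the exact values used on the robot.

```python
import numpy as np

NUM_BINS = 256
# Illustrative per-dimension bounds: xyz deltas (m), rotation deltas (rad), gripper in [0, 1].
LOW  = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
HIGH = np.array([ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0])

def discretize(action: np.ndarray) -> np.ndarray:
    """Continuous action -> one of 256 bins (a token) per dimension."""
    normalized = (np.clip(action, LOW, HIGH) - LOW) / (HIGH - LOW)
    return np.minimum((normalized * NUM_BINS).astype(np.int64), NUM_BINS - 1)

def undiscretize(tokens: np.ndarray) -> np.ndarray:
    """Predicted action tokens -> bin centers back in continuous space."""
    return LOW + (tokens + 0.5) / NUM_BINS * (HIGH - LOW)

tokens = discretize(np.array([0.01, -0.02, 0.0, 0.1, 0.0, -0.1, 1.0]))
print(tokens, undiscretize(tokens))
```

Training then reduces to a standard cross-entropy loss over these bins, which is why a decoder-only transformer slots in so naturally.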

Yeah, what I was going to ask is: you're mentioning that you have this design requirement, an engineering requirement, about speed and latency and reaction. And then you say that that necessitates having a relatively small model, which makes sense. But one message of scaling, when we're talking about foundation models, is that we don't want to be bottlenecked by data, compute, or parameters.

So I guess what I'm curious to know is how you balance these off, in the sense that you want lots of parameters to have a really powerful model, while on the other hand you want very fast inference. Yeah, great question. To repeat it: we kind of set a pretty hard constraint of a hundred-millisecond inference time, yet.

A lot of the lessons in foundation modeling is that you shouldn't be constraining yourself against any dimension, whether it's data set, size, compute, or model capacity. And I think my initial answer to that is that's a very great point and something I think that's going to be coming up as a severe bottleneck in the future.

But for our initial case, I think this was more an exploration of whether these principles work at all, before even scaling well beyond what we're looking at now. Already, 35 million parameters is gigantic compared to a lot of prior work using, for example, a ResNet-34 or whatnot.

So this is already much bigger than, you know, a lot of other options. And maybe for now, at least it's the easiest, it's the largest scale we could go to roughly in the short term without having to think of more tricks. Yeah, we can talk about it a bit later, maybe.

I think I'd also love to hear your thoughts too, because it's very non-obvious how we can get past some of these bottlenecks. Yeah, great question. We ran some ablations on model size. I might have that in a few slides, but maybe we can return to that then. And if not, I can, yeah, but great question.

So yeah, that's the architecture and I'll discuss some of the ablations and the trends later on, but maybe, you know, this is a robotics lecture, I should show you some pretty visuals, right? So let's look at some evaluations we did. We compared against some baselines. One is Gato, which you might be familiar with.

And the other one is BC-Z, the ResNet-based one. We evaluate seen tasks versus unseen tasks, and we also add in various distractor objects. Our normal data collection looks like the top left picture, three cans on a gray desk, that's basically it. But then we push it further by bringing in a lot more objects, so that the table is so cluttered that even as a human, sometimes it's hard to find the object that you're actually looking for.

We add in tablecloths, we make the textures very different, we bring it to new micro kitchens with new surfaces altogether. And we find that RT-1 is more robust than these other methods. Yeah. Good question. The question was, was the Gato model trained on our data, or was our data already included in Gato?

The answer is that this data was not included in Gato, so we retrained the Gato model only on our data. Yeah. And here's just a different visualization of the robot going out in our micro kitchen and doing different interesting things. You can see here that it's trained in one setting, but then it goes into a brand new kitchen, brand new countertops, new objects, and it's able to do all of them pretty robustly.

We also put it into a long-horizon setting using the SayCan framework that we'll talk about next. In these settings, a lot of them mix all of these generalization capabilities. And in the plot on the left here, we're using what we call generalization levels, inspired by the VIMA paper, which basically change more and more factors of variation simultaneously.

And here we found RT-1 is the most robust. Yeah, good question. We'll go into a bit more detail later, but at a high level, teleoperators get a structured, templated command of the form verb plus nouns, something like pick coke can, or move apple near sponge.

And we have around 700 tasks set up this way, and they go ahead and collect that data, and the task is done. And then later we make sure that successes are actually successes, and we discard stuff that's unsafe, for example. Oh yeah, got it. For this paper, we utilized 130,000 demonstrations.

Yeah, great question. A lot of prior work has been done on this. The question was: did you find that the trajectories in your dataset were very multimodal? And what I think you mean by that is that to go from point A to point B, I can go left, or I can go right, or I can go straight.

And this kind of diversity, where for a single image state my data has three possible labels, can have very bad effects sometimes. For us, because we are using teleoperated demonstrations, the data was more homogeneous than, for example, in-the-wild data; there's a type of data collection called play data, where operators just do whatever they want and we label it in hindsight.

Our data is more homogeneous than that, and we did not find a lot of the issues that we've seen in prior projects. One potential answer is that maybe it's the architecture itself, but we can talk about that later too. Yeah. Question. Great question. We actually do have a termination action.

So the question was how you determine when an episode is complete, and the policy is able to predict terminate, because at the end of each teleoperation session the operator clicks a button and the episode is marked as done. I think for these evaluations we were quite strict, but definitely in some cases, maybe if we're just doing an experiment for ourselves, we'll have a dense reward scale, like: grasped the object and moved closer; grasped the object and almost got there but messed up at the end.

And we'll have a grading curve, basically. But for all of these stats I'm showing here, it was zero or one: one, fully complete; zero, not fully complete. Yeah. Cool. And on the exciting side, and maybe speaking to the multimodality aspect, we then pushed the limit even further.

We decided to train on very diverse data distributions. So far you saw 130,000 demonstrations trained on this Everyday Robots proprietary mobile manipulator, but we were also looking to train on very different data distributions, with very different action distributions, very different trajectories, even very different visuals, objects, and tasks.

To do that, we included two other data sources. One was simulation data, which was kind of our robot but in sim, and it looked quite different; also, this data was collected with reinforcement learning and not with teleoperated demonstrations. In the past, with all of the IL-plus-RL work that I mentioned, we found that combining these two types of data was going to be very difficult, because RL data has very short actions.

It's very quick, very optimized for the specific reward function, whereas human-collected teleoperation data is a lot more, you know, human-like, so to speak. And finally, we revived a dataset from many years ago, from 2018. If you remember the KUKA project, that arm farm has not been operational in that state for many years now, but we still had that data.

So we were hoping to see if a different robot, with a different action space, on different objects, with different visuals, in a different building, could still be combined with data from this micro-kitchen robot dataset that we trained on originally. And what was very surprising to me is that RT-1 was able to learn from all of these very diverse data distributions.

I had never seen a result like this, where any other architecture, for example a ResNet, or even another learning method like reinforcement learning, could successfully learn on such different data distributions so robustly. And we evaluated, for example, on combining concepts: we would have the original Everyday Robots robot pick up objects that were only seen in the KUKA project, or we would put in objects only seen in simulation, and see if our policy could understand that.

So it did seem like it could generalize objects and concepts seen in other datasets into the setting it was in now, the real micro kitchen. And that was a very fun result. I have a question: how did you combine the action spaces of the Everyday Robots robot with the KUKA?

Great question. Yeah, we just tokenized them and made sure that the tokenization scheme was interoperable; I can dive into that a bit later too. And note that does not mean we can send the exact actions from one robot to another and have it execute them.

It was more that, in the dataset, even by human inspection you can tell these are coming from two different robots. So yeah, let's look at some ablations for the scaling laws that we're all here for. We found that reducing dataset size reduces performance.

But maybe more interesting is that task diversity was quite important. Here we have two different trends. The green curve is what happens when you reduce the total number of episodes per task, and the purple curve is what happens when you reduce the total number of tasks.

And we found that having more tasks is relatively more important than having more data for each task. And I think this is a lesson that probably suggests how we should scale robotics even further: not just collecting more data of the same tasks in the same settings, but going out into the wild and getting more diverse behavior.

How do you define diversity for data? Great question. The question is, how do you define data diversity? In this case, it's just the number of unique structured, templated commands that teleoperators receive. So those 700 templated commands: when we start reducing them and only train on 500, or only train on 300 of them, performance drops much quicker than if we had taken the same proportional cuts to the total amount of data.

Yeah, so it seems like almost a linear relationship there. Right, the question was that there seems to be almost a linear correlation between data size and success rate. And I think we could apply some fancy scaling-law curve fitting, but we didn't look too much into that, because this is a trend that we kind of expected.

We just weren't sure about the magnitude of how much it would affect us. And I don't have any really good insights on this, besides that we see this phenomenon empirically. Yeah, great question. So the question is: maybe this will just go on indefinitely, or is there something special going on? And I think this is also where we start to conflate the algorithmic exploration with the practical considerations of scaling real-world operations, which was: when we got enough data, our policies were saturating, hitting close to a hundred percent.

We were like, all right, let's collect another dataset. So we basically collect until it's at a hundred, and then we switch to something else. But at this point, what was interesting is that by the time we bet really big on this RT-1 effort, we'd already been collecting demos for a while.

So it was possible that we had collected more than we needed. And in some cases, you could actually cut tasks without losing too much performance, which was quite interesting. But yeah, great question. The question is whether or not all tasks are created equal in terms of their capacity and entropy for the different behaviors you could learn from them.

And yeah, that's definitely true. Some tasks are much easier. We have a task that's just pick up this object; it's going to have much less interesting stuff you can squeeze out of it than, you know, moving something into a drawer and then closing the drawer. But yeah, great question.

Great, now ablations. We also trained without the big model size, without pre-training, with continuous instead of discrete actions, with autoregressive actions, without history, and without the transformer. And all of these design choices did seem to be required for robust performance. Oh, yeah, of course.

Yeah, and again, for paper writing, it's kind of the best thing that we can empirically find; that's the method, and then we figure out why each of these is important. One surprising thing here, perhaps, was that autoregressive actions hurt: you might think that passing in more information is always better than passing in less.

But in this case, conditioning on your previous actions was doing something kind of like in-context learning: it was doing online system identification to figure out which teleoperator this data came from, and overfitting to that specific action history. And so removing that was actually better.

One interesting tidbit there. Cool then. In the interest of time, I'll try to get through the other ones a bit more quickly, and maybe I'll just do the questions at the end, if that's possible, just so we have time to get through everything.

The next work here moves a bit away from skill learning and on to the planning level. The first project took a lot of the design principles of other fields, and this offline robot learning paradigm, and put them into skill learning. Can we bring that now to other parts of the robotic system?

And the first work here is SayCan. If you remember, back in this timeline, in 2022, we started thinking about how we scale this multitask imitation learning, but at the same time, large language models and other types of foundation models were really picking up steam, whether it was Imagen or DALL-E 2.

And we definitely wanted to figure out how we could use those as well. We had come up with this RTU 2.1 design that we're betting big on. But from here, we started to explore how all of the beta lesson 2.0, we could start utilizing foundation models within the context of our full stack system.

The problem with doing this naively is that language models are not a very natural fit for robotics. For example, if you're a robot in a kitchen and you ask a language model, "I spilled my drink, what can you do?", the language model will give you stuff that's not very relevant.

It's going to ask you to vacuum it, it's going to ask you to call a cleaner, or it's going to apologize. And these are not things that the robot can do in your kitchen with your spilled drink to help you. And so there are two parts of this then.

The one issue is that our robots are limited. They are very constrained with what they can do. They cannot do everything, but they can do certain things. And then the second problem is that the language models are also constrained. They don't know what the robot sees. They don't understand that they are in a robot body in a micro kitchen needing to do real stuff in the physical world.

And so we need to get the robots to speak language-model language, and the language model to speak robot language. To do this, we present SayCan. In the same setting, please put an apple on the table: we score the predictions of the language model over a constrained set of tasks that we know the robot has been trained to do.

And then we also take the affordance function from the robot. An affordance function is an estimation of, given some kind of state, what the robot is able to do: how confident it is that it can successfully accomplish that task in the given state. In our case, we use something like a value function from reinforcement learning, which kind of encompasses this quality.

Given these two values, these two scores, we have the confidence from the language model and the confidence from the robot. We can combine these, and hopefully the combined prediction is something that's very semantically relevant for the high-level instruction; finding an apple is the first step in please put an apple on the table.

But it's also something that the robot can do. There's no apple in the frame, but it knows that it's been trained to find an apple, so it can navigate around to find one. And so hopefully we can do this in closed loop, and keep on going, predicting a high-level plan from the language model that's grounded with the affordance function of what the robot understands.
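Here is a minimal sketch of that scoring rule: the next skill is the one that maximizes the product of the language model's preference and the robot's affordance estimate. The skill names and the toy scoring functions below are placeholders, not the released SayCan implementation.

```python
import math

def saycan_step(instruction, plan_so_far, state, skills, llm_log_prob, affordance):
    """Pick the next skill by combining semantic relevance with feasibility (schematic).

    llm_log_prob(instruction, plan_so_far, skill) -> log p(skill | instruction, plan so far)
    affordance(state, skill) -> estimated probability the skill succeeds from this state
    """
    scores = {
        skill: math.exp(llm_log_prob(instruction, plan_so_far, skill)) * affordance(state, skill)
        for skill in skills
    }
    return max(scores, key=scores.get)

# Toy stand-ins so the sketch runs: the LLM prefers "find an apple" as the first step,
# and the value function says the robot can do that from its current state.
skills = ["find an apple", "pick up the apple", "place the apple on the table", "done"]
llm = lambda instr, plan, s: 0.0 if (s == "find an apple" and not plan) else -2.0
aff = lambda state, s: 0.9
print(saycan_step("please put an apple on the table", [], None, skills, llm, aff))
```

In the full system this repeats in closed loop: the chosen skill is executed by the low-level language-conditioned policy, appended to the plan, and the language model is queried again until it selects a termination step.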

There's a video here of SayCan doing different stuff, but I'm happy to share it later offline. It's very cool, trust me; it's the greatest thing since sliced bread. And yeah, some numbers then. We tested this out on very long-horizon instructions encompassing more than 10 separate navigation and manipulation skills in the micro kitchen that you see on the bottom right.

We ran hundreds of different evaluations on this, and we tested a lot of different concepts, including things like rephrasing, using single primitives, and drawing instructions that just came from colleagues and friends. And we found that there were failures on both the language model planning side, where it would predict the wrong plan for the current situation, and on the policy execution side, where even when it gets a good plan, the robot will mess up sometimes.

Overall, it was still doing quite well. And now let's take this back to the lesson: I think this is a great example of how we can leverage internet-scale foundation models as they get better. When we started the project, we started with a language model from Google called FLAN.

Throughout our implementation, PaLM came online, the Pathways Language Model. And when that happened, we were able to just hot-swap it in, and performance just got better for free without us having to do anything, by just assuming that language was the API: the plan just has to be a string.

It can come from any source. It can come from a human. It can come from a language model. When we improve that language model, the system gets better overall. And here you see, across model scales, as the LLM increased in size, our planning performance got even better.

And some cool tricks here to get it working. How do we actually produce this plan? Well, just by prompting, as is all the rage these days, with chain-of-thought and with better prompting: just giving examples of, here are some great robot plans, now give me a new plan starting with this high-level instruction.
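Roughly, the few-shot planning prompt looks something like the sketch below; this is a paraphrased illustration of the idea, not the exact prompt used in the paper.

```python
# Paraphrased illustration of a few-shot planning prompt (not the exact prompt used).
PROMPT = """Robot: I am a robot in an office kitchen. I can find objects, pick them up,
bring them to people, and open or close drawers.

Human: Bring me a snack.
Robot plan: 1. find a bag of chips, 2. pick up the bag of chips, 3. bring it to you, 4. done.

Human: I spilled my drink, can you help?
Robot plan: 1. find a sponge, 2. pick up the sponge, 3. bring it to you, 4. done.

Human: {instruction}
Robot plan: 1."""

print(PROMPT.format(instruction="I don't do caffeine anymore, get me something better"))
```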

We saw that the robot could do everything from understanding different languages to very complex reasoning, like: hey, give me something caffeinated; or, I don't do caffeine anymore, get me something better; or, bring me a healthy snack versus bring me an unhealthy snack.

SayCan was able to reason through all of these. I think that was our team's first contact of robotics with language models, and it was the first exploration into how these two worlds could overlap. There were definitely still improvements to make, though, and in Inner Monologue we tried to push further by bringing in vision-language models.

The idea here is that we had a very high planning success rate with SayCan, but unfortunately it wasn't really able to recover from failures. What I mean by that is that the language model would not really get updates on what was going on in the world. So if this was the plan it proposed, go to the table, pick up a Coke, bring it to you, but you messed up picking up the Coke can...

You dropped it on the floor. It would still continue trying to bring it to you, or put it aside, but all of that doesn't really matter anymore because you dropped the Coke can. And so in this work, Inner Monologue, we were really hoping to figure out how we could add closed-loop dynamic feedback from the environment into this planning process.

Let's take that exact same example. Now, instead of just directly predicting every instruction, maybe we add back some feedback from the scene, also conveyed using language as the universal API here. The scene can tell you what's actually in there. Maybe the robot asks a question now; for the robot, this is the language model asking a clarification question.

Maybe here, a human responds or another language model. Then you can predict the action or the next task to do once the language model has enough context. And maybe you even add in stuff like success detection and so on and so forth. How do we do this then? Well, the first thing that we implement is what we call passive scene description.

Just using either an off-the-shelf engineered heuristic or object detection models, something like ViLD, you can describe the scene in text and convey all that context to the language model. For active scene description, this is maybe similar to visual question answering, if you're familiar with that field: the language model can actually propose active queries that it's curious about in the scene, maybe to make sure that it has enough context to move on.

And here, either a human can provide the answer, or in the future a VQA model, as they improve, can provide it. And finally, success detection: this is very important to allow the language model planner to know when to retry something. Here we take the first and last images, fine-tune a CLIP-based success detector, and use that to provide binary success/failure information back to our language model.
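Putting the three feedback sources together, the closed loop looks roughly like the sketch below. Every function here is a placeholder for a component described above (a scene describer, a low-level skill policy, a success classifier, an LLM call), not the Inner Monologue code itself.

```python
def inner_monologue(instruction, llm_next_step, describe_scene, execute, detect_success,
                    max_steps=15):
    """Closed-loop planning where all feedback flows back to the LLM as text (schematic).

    llm_next_step(transcript)     -> next skill as a string, or "done"
    describe_scene(obs)           -> text scene description, e.g. from an object detector
    detect_success(before, after) -> True/False from a fine-tuned success detector
    """
    transcript = [f"Human: {instruction}"]
    obs = None  # placeholder for the robot's current camera observation
    for _ in range(max_steps):
        transcript.append(f"Scene: {describe_scene(obs)}")   # passive scene description
        step = llm_next_step("\n".join(transcript))
        if step == "done":
            break
        transcript.append(f"Robot: {step}")
        new_obs = execute(step)                               # run the low-level skill policy
        transcript.append(f"Success: {detect_success(obs, new_obs)}")  # binary feedback
        obs = new_obs
    return transcript
```

Active scene description would just add one more kind of line to the transcript: a question generated by the language model, answered by a human or a VQA model.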

Results-wise, we ran a very similar SayCan long-horizon evaluation, but here what's interesting is that we're able to implement all these different automated feedback mechanisms on the robot, so that it's able to reason and recover from things. Here you see it's going to try to go to the table, but the human says, "Hey, I changed my mind." And then the human changes their mind again, asking it to go back and forth.

Maybe we're kind of torturing the language model at this point, but it's able to replan and make sure that the human intent is satisfied. We also tried, I'm not sure if this video shows it, situations with adversarial inputs, where I walked around knocking objects out of the robot's hands and forcing the success detector to tell it, "You messed up, try again." And we also tried this out on a couple of different domains, a simulated tabletop manipulation domain as well as a real-world manipulation domain, and we found that this was much better than SayCan, or than just using visual features themselves with something like CLIPort.

And I think here, it really speaks towards a trend that I've really come to appreciate. In 2018, a robotics professor once said that when they looked at all the different things preventing robot learning from scaling tremendously, they thought the bottleneck was high-level semantic planning: reasoning, common sense.

And I think in 2022 and 2023, language models provide one path for how this can be offloaded, at least in the interim. And if language is the API, then you can just bring in these vision-language models: as object detectors get better, as success detectors, VQA models, and language models get better, you can bring them all into the fold, and they act as kind of a life vest.

If your robot currently does not have common-sense reasoning, these other models can act as a scaffold and a life vest to bring you up to par with what they currently know. And maybe in the future you'll get beyond what the language models know, but in the short term, it does seem that we can leverage them to accelerate what we can do in the real world.

Moving on: we saw how language models can do planning, and how vision-language models can help planning. Now we're going to switch gears a bit and think about how vision-language models can help with other bottlenecks that robot learning faces. One of these is that data collection is very expensive.

As we mentioned before, we did have this 130,000-demonstration dataset, but it was collected over a year and a half at significant cost, in resources, time, and money, and with many, many robots. And of course, these tasks were also a bit limited, right? We used 700 very templated commands, instructions that we give to teleoperators, because we knew that this would scale, right?

If we collected enough data for each of these templated tasks, we could do that specific task. And here's the flow that someone was asking about earlier: we give this pick coke can instruction, the operator controls the robot in the real world, finishes the task, marks the episode as terminated, and that goes into this big orange dataset.

And that big orange dataset is what we trained on in all of the previous projects for the control policies. What we additionally considered was adding a bit of crowdsourced hindsight annotation. If you're familiar with hindsight experience replay in reinforcement learning, with goal conditioning: maybe the robot did something that wasn't just this high-level templated instruction.

We could ask a human to describe more verbosely what the robot did. Maybe it picked up the Coke can that was on the right side of the table. Maybe it picked it up and then knocked it over. Maybe it moved it very slowly to the middle. There's a lot of semantic diversity encompassed in this demonstration that is not fully captured by this high-level templated pick coke can instruction.

So we labeled 3% of this big orange dataset with these very verbose descriptions. And next, we applied the pseudo-labeling strategy that's been seen in other fields, such as Video PreTraining with its inverse dynamics model, but instead we apply it to the instructions, to the semantics of what's contained in the dataset.

So step one, we train a CLIP-based model on the small labeled dataset, the 3% of the main data. Then you use that trained VLM to relabel all of the templated-instruction demonstrations that you had before, the 130,000-episode dataset. Now you have a relabeled dataset with a large diversity of interesting semantic instructions.

And then we plug all of these data sets into RT-1 and just train a language-conditioned behavior cloning policy, similarly to how we would normally. But whereas normally we just use data set B, the orange one, now we use all three data sets. And then finally, we evaluate on entirely new, unseen instructions.

In the prior works, we were evaluating mainly on the 700 templated instructions. But in this work, we actually go beyond that. We can type in almost anything you want that you think might succeed. You can phrase it however you want. You can add typos. You can even refer to semantic concepts.

You can add spatial concepts. And we see how it does. The reason this might work, represented visually: here are the t-SNE embeddings on the left and the right. They're the same embeddings, but on the left they're colored by the original templated instruction that was used to collect that episode.

And on the right is what the vision language model thinks, if it's allowed to assign a free-form natural language caption to that episode. You see that on the left, you have these big clusters of "pick coke can": hundreds or thousands of episodes, but we just call them all "pick coke can."

On the right, we can then expand those concepts and say, actually, this episode is picking up the red coke can. This episode is picking up the crumpled coke can. This one is picking up the coke can that's next to the chip bag. So you can get a lot more mileage out of the same underlying data set by just using language as the diversity mechanism through which you expand the concepts you're considering.
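As a loose illustration of that figure (not the actual analysis code), one could embed the episodes once and plot the same t-SNE projection twice, colored by coarse template on the left and by free-form VLM caption on the right; the features and label counts below are made up.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    n_episodes = 300
    embeddings = rng.normal(size=(n_episodes, 64))       # stand-in episode features
    template_ids = rng.integers(0, 5, size=n_episodes)   # a handful of coarse templates
    caption_ids = rng.integers(0, 40, size=n_episodes)   # many more free-form captions

    xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

    fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
    left.scatter(xy[:, 0], xy[:, 1], c=template_ids, cmap="tab10", s=8)
    left.set_title("colored by templated instruction")
    right.scatter(xy[:, 0], xy[:, 1], c=caption_ids, cmap="tab20", s=8)
    right.set_title("colored by VLM caption")
    plt.show()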

And for example, in the middle, you see "open top drawer" can become "hold and pull out the top drawer." For the bottom one, "pick green rice chips from white bowl" becomes "lift up the green chip bag from the bowl and drop it at the bottom left corner of the table."

So you get a lot of these semantic and spatial concepts that are now going to be in your target supervised labels. I have a question. Yeah. Great question. So if I can rephrase a bit: it's actually a very difficult, perhaps even intractable, problem how you map all the linguistic concepts you see out in the wild down to embodiment-specific types of episodes.

And here, I would say we are definitely introducing a lot of our priors and our biases onto what we call "left": do you mean left by ten centimeters or by two centimeters? What do words mean? And these definitions, what do they mean to us, to the crowd-compute raters that generated these captions?

What do they mean to the robot? What do they mean to the language models? Maybe these are all slightly different, but the hope is that if they're at least roughly similar, we'll get directionally correct improvements. So the nuances of these hard definitional lines and the actual semantic meaning of these words, I think that's out of scope right now, but maybe something we'll dive into further. At a higher level, though,

I think the bar is just so low. We have the 700 templated instructions that are basically one-hot IDs, and we just want to make those closer to natural language, even if only by a little. And at least we're trying to get towards that with these vision language models that are captioning automatically.

Hope that answers your question. We also compare against a few baselines on the top left here. What if we only train on this 3% of human-rated labels? What if we only train on the original RT-1 data set? What if we train on both of these?

And what if we train on both of these plus all of the predictions given by our VLM? And what's interesting here is that relabeling seems to universally help. We evaluated only on novel instructions that were new for this project. It was the first time on a robotics project where we only tested on sentences I typed in on the spot: whatever I thought of, I typed it in.

And that became the test set. We just had to make sure it was never contained in the training coverage. And you see all these interesting examples on the right here, stuff like "move the lonely object to the others." I have no idea how this works. Stuff like lifting the yellow rectangle, talking about colors, or "move the right apple to the left."

Here, we actually had two apples in the scene. And in our training demonstration data, we never collected scenes with duplicate objects, precisely because of this multi-modality problem: if you just say "pick coke can" and there are two coke cans, it's going to be very difficult to figure out which one to go for.

But with language labeling, it seems like maybe we could do that now. So even though we never trained on scenes with two apples, you can now evaluate on them and just specify with language which apple you want to go for. And it was working pretty reasonably. And finally, for the last example here, I thought it was kind of interesting.

With a single coke can, we tried a novel behavior: "push towards the left" was not a templated instruction. We only had "move coke can near Y," where Y is another object: "move coke can near apple," "move coke can near sponge." So this motion of pushing the coke can toward empty space, essentially, was not something we ever encompassed in the templates, but maybe it was in one of the labels.

Maybe if you've seen "move coke can near apple" when the apple is on the left, and you've seen "move coke can near sponge" when the sponge is on the left, the model can generalize and figure out that "left" means this side of the table, not a specific object.

So maybe that's what's happening, but it's very unclear. This is, as I said, just: I thought of something, I typed it in, and saw what happened. We definitely hope to explore this more quantitatively in the future. Bottom left, of course, is comparing against non-visual augmentation.

So maybe you can also get these interesting concepts from language alone. Here, we tried adding random noise, doing Mad Libs-style swapping of words, or even using an LLM, GPT-3 in this case, to propose rephrasings of existing instructions. But my takeaway is that you really need visual grounding for the vision language model to say, yes, this caption is factually accurate at this point in time, and that it's something that would actually be interesting for a robot. The fine-tuning process provides both of those.
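For intuition, the language-only baselines amount to something like the sketch below: they create textual variety, but nothing checks whether the new caption is visually true of the episode, which is exactly what the fine-tuned, grounded VLM adds. The word lists and helper functions here are illustrative, not the actual augmentation code, and the LLM-rephrasing baseline is omitted.

    import random

    random.seed(0)
    OBJECTS = ["coke can", "apple", "sponge", "chip bag"]
    VERBS = ["pick", "move", "push", "lift"]

    def madlib_swap(instruction):
        # Swap one known object or verb for another, Mad Libs style.
        for vocab in (OBJECTS, VERBS):
            for token in vocab:
                if token in instruction:
                    return instruction.replace(token, random.choice(vocab), 1)
        return instruction

    def add_char_noise(instruction, p=0.1):
        # Randomly drop characters to mimic noisy or typo-laden instructions.
        return "".join(c for c in instruction if random.random() > p)

    print(madlib_swap("pick coke can"))           # e.g. "pick apple" -- may be visually false
    print(add_char_noise("move coke can near apple"))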

Yeah, yeah, definitely. These are just a subset of five of these evaluation instructions, but we had over 60 of them. We didn't do a full quantitative ablation, for example, as we did in RT-1.

There, we had this seen and unseen task set, and it was compositional: you would see "move coke can near apple," and you would see "move apple near sponge," but we'd hold out "move coke can near sponge" and test that. But in this case, I think we can go much further beyond that.

Because our language is completely free-form, the compositional space of what you can combine is just going to be much larger. So, to answer your question, we did try some combinatorial evaluations, but there's definitely a lot more thoroughness we could add there, too.

How am I doing on time? Okay, 10 minutes. Maybe I'll try to wrap up pretty soon, then. The DIAL takeaway, then, has two parts. Lesson two: leverage foundation models, here as data augmentation. And lesson three: make sure your offline data set is rich enough that these different behaviors actually exist in it and can be described in language.

If you don't have enough diverse behaviors, no matter how good your labeling is, you probably can't elicit all of the interesting concepts that you want to learn from. And maybe most exciting for me here was that some label noise is actually okay. Notoriously, in supervised learning and imitation learning, you want very clean labels that are always 100% true, right?

You don't want to be learning from noisy data where some large percentage is just not accurate. But in our case, it seems that some label noise was okay. The vision language model was not always predicting factually accurate descriptions of the scene, and this definitely hurt when the noise got too high, but at smaller levels the training still seemed robust enough to handle it.

So, that was a deep dive into some individual works that use this big recipe of language, foundation models, and offline data sets in different parts of the robot system. And this was the pitch at the beginning, and I hope you can at least see a little bit of how our team has tried to take these principles and apply them to accelerating robot learning in the real world.

These different types of ingredients and lessons map onto different parts of the robot system altogether. For skill learning, that was RT-1, which we talked about. For planning, that was SayCan, and then adding closed-loop feedback with vision language models, that was Inner Monologue. For low-level control, we didn't talk about this today, but an exciting work from our team is actually using language models to predict code that's executed on the robot directly, perhaps as low-level controllers.

Language models have read textbooks, they've read ROS docs, they've read UR5 documentation and code, and they can write code for these robots, and we can execute that. For data augmentation, we saw DIAL with vision language models. And, I didn't talk about this here, but for object-centric representations, for things like feature activation maps for specific objects, we can use those as task representations for mapping a scene.

In NLMap, they did that for object-centric navigation around the micro-kitchen that we looked at. And hopefully, in the coming weeks and months, we'll have a few more rows and entries to add here as well. I think this mindset is a very exciting research direction: you take these big high-level concepts about foundation models and offline data sets, look at what exists in the robot systems of today, and find many gaps and opportunities still available, everything from exploratory pilots on how this might look, all the way to more extensive evaluations and really building out robust systems.

I think both of these have value. So, I'll conclude by saying that it was very fun exploring all of these complementary directions, but there are still some major questions of how we can take these concepts even further, and how these trends and ideas might evolve as foundation models get better, as more data becomes available online, and as more data becomes homogenized, tokenized, and interoperable.

And I think a lot of the concepts from other fields, like linguistics and vision, and from all of the big scaling-level questions being pioneered in language-based foundation models, can hopefully trickle down to robotics. Maybe robotics can even provide something back, by contributing embodied, causal action data sets that might improve the reasoning quality of some of these large language models that are not embodied.

With that, though, I'd like to thank everyone for your time, and thank Dave and Sia for inviting me, and I'm open to any questions about the papers or just at a high level as well. Thanks so much. Yeah, great question. So the question, I guess, is: what about tasks that require more semantic reasoning, like operating at a certain speed, or with numerical reasoning within the prompt itself?

I would say, for a lot of the more common-sense reasoning, like "throw away three coke cans, one after another," the language model is very good at that right now. So the SayCan planner will predict "throw away the coke can" three separate times.

For the low-level skill policy learning, though, that's more high variance, I would say. And right now we don't really condition on speed or on how exactly you do it. But that's definitely something we could do if we could relabel with, say, "pick up the coke can slowly" versus "pick up the coke can quickly."

Maybe that is something a vision language model could recognize. The next question was: at what scale do we see combinatorial generalization start to occur, maybe between having seen some colors of a block and then wanting to evaluate on a new color? And I think that's a great question.

And unfortunately, my answer is going to be very vague: it depends. It depends on how you define your tasks, it depends on the scale of your data set, and it depends on the concepts you're trying to generalize across. There have been numerous attempts to formalize what it means to generalize within learning and within robotics, even within the specific settings we consider.

And I don't think there are any clear trends where you can say, oh yeah, this is the number I need to hit so that I can generalize across x, y, z dimensions. You could evaluate all of those, but I don't think it will help you predict new trends, at least right now.

This is just me talking, but I would say we're probably one order of magnitude off before we can start to make broad statements about generalization capabilities. Add one or two more zeros to our data set size, and we can start to talk about that in terms of tasks, objects, and skills.

Yeah. Yeah, very astute observation. So the question was that in SayCan, the value functions that predict these affordance scalars on the right here only cover a certain limited number of tasks, so is that the bottleneck? And I would say yes, 100%. Scaling the number of tasks that your system is able to do, which you can then give to the planner as its buffet of options to choose from, that is the bottleneck, right?

No matter how good your planner is, if you can only do three tasks, there are only so many combinations of those three tasks that can map onto a high-level instruction. So as you add more tasks, as the low-level skill capabilities of your robot increase, you're adding precision to the coverage of high-level instructions that your robot can try to do.
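A toy sketch of that selection rule, with made-up numbers rather than the real SayCan implementation: the planner can only choose among skills that have both a language-model relevance score and a value-function affordance, so the size of that skill set caps what any high-level instruction can be decomposed into.

    skills = ["pick coke can", "open top drawer", "move coke can near apple"]

    # How relevant the LLM thinks each skill is to the high-level instruction
    # (illustrative values, not real model outputs).
    llm_likelihood = {"pick coke can": 0.70, "open top drawer": 0.05, "move coke can near apple": 0.25}

    # How likely the value function thinks each skill is to succeed from the
    # current state, i.e. the affordance. If a skill is not in this set at
    # all, the planner simply cannot choose it.
    affordance = {"pick coke can": 0.85, "open top drawer": 0.90, "move coke can near apple": 0.30}

    best_skill = max(skills, key=lambda s: llm_likelihood[s] * affordance[s])
    print("next skill:", best_skill)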

So that's one of the main bottlenecks I see today. Great question. So, have we tried RT-1 with RLHF or with RL? The short answer is that we have some things in the works doing that, but right now, for all of our projects, we're just using this imitation learning loss.

Again, I view this multitask imitation learning bet that we're making as kind of an existence proof. It works, it's not cheap, but it does work and it does scale, and that at least is a good starting point. And our main hope over the next months and years is: can we improve beyond that?

Can we add back in offline improvement? Can we add RL back into the equation somehow? I'm an RL person at heart, so I really hope so. Sorry, could you repeat that? Yeah, good question. So regarding task balance and whether text-only data is sufficient for helping motor control learning: my hope is that once we experience emergence in the robotics space, and we've already seen emergence in the language space, at some point these reasoning concepts may start to transfer between the two.

I would point you to one interesting paper, which is, I think, "Can Wikipedia Help Offline Reinforcement Learning?" from Shane and some other folks. They pre-train a large policy network on autoregressive token prediction on Wikipedia, just text only, and they use that to initialize control for Atari games with RL, and this actually helped.

So maybe this is philosophical, but perhaps there's something about decision-making and reasoning that transfers between text and action data. Great question. I definitely agree: passing in six images is not going to be enough when you're executing tasks for minutes at a time. Like "clean my whole house," and then you can only pass in the last two seconds of context.

Come on. So I think that's definitely going to be a limitation as our tasks get more complex and long-horizon, and I think another open question here is context length. We have high-dimensional images, and even with TokenLearner reducing the number of patches we pass through, it's still very high-dimensional, and we quickly hit the context length cap.
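A back-of-the-envelope sketch of why image history eats the context budget so quickly; the numbers below are illustrative, roughly in the spirit of the RT-1 setup rather than exact figures from the talk.

    # Illustrative token accounting for an image-history policy.
    patches_per_image = 81         # e.g. a 9x9 grid of visual feature patches
    tokens_after_tokenlearner = 8  # TokenLearner-style compression per image
    history_length = 6             # how many past frames the policy sees
    context_limit = 48             # hypothetical transformer context budget

    tokens_raw = patches_per_image * history_length
    tokens_compressed = tokens_after_tokenlearner * history_length
    print(f"without compression: {tokens_raw} tokens")        # 486
    print(f"with compression:    {tokens_compressed} tokens") # 48 -- already at the cap
    print("fits in context:", tokens_compressed <= context_limit)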

How do we improve beyond this? Maybe it's retrieval transformers or some other kind of mechanism. Great question. I think we are hoping to explore that in the future, but with this context length limitation, we are already near the context capacity with just these six images alone, much less passing in whole trajectories of the zero-shot or few-shot behavior we'd wish to see.

So, TBD, I think. Cool. Thank you, guys. Thank you.