Hi everyone, I'm Annika. This is Asta. We both work at NVIDIA, and we were part of the team that developed the GR00T N1 robotics foundation model. So today we're going to give you a sense of what that is and how you go about building a robotics foundation model. But before we get into it, I feel like a lot of people start talks here with a hot take.
So the hot take that I'm bringing to an AI conference is that we're not necessarily running out of jobs. This was a report from the McKinsey Global Institute showing that in the world's 30 most advanced economies, there are actually more jobs than there are people who could fill them.
And really, the two things to look at in this whole graph are the 4.2x, which is the rate at which jobs have been growing relative to the people who could fill them over the last decade, and this line that I'm highlighting in red. That's where we're in trouble: there are simply more jobs than able-bodied people to fill them.
And obviously there's a real conversation around AI and jobs, so it helps to look at which industries are most affected. I'm going to highlight a couple in red: leisure, hospitality, health care, construction, transportation, manufacturing. You can probably figure out what they have in common: none of them can be solved by ChatGPT alone.
They require operating instruments and devices in the physical world, and they require physical AI. That's really the big challenge I see over the coming years: how do we take the huge amount of intelligence we're seeing in language models and make it operable in the physical world?
The other question around humanoids is why do we build them to look like humans? It's not just because we want them to look like us. The world was made for humans. It's very hard to have generalist robots operate in our world and be generally useful without copying our physical form.
There's a lot of specialist robots that do incredible things. I don't know if you got to try the espresso from the barista robot downstairs; it makes a good espresso, but that robot couldn't even cook rice. So if we want a robot that can do multiple tasks for you, it's just a lot easier if that robot can operate in our human world.
So how do we do this? There are three big buckets, three big stages. The first is data: collecting it, generating it, or multiplying it, which we'll talk about quite a lot. Then, once you have this synthetic and real data, but largely synthetic data, you train a model. We'll also talk about what that architecture and training paradigm look like.
And then finally we deploy on the robot, or at the edge. This is what we call the physical AI lifecycle: generate the data, consume the data, and then finally deploy and have this robot operating in the physical world. NVIDIA also likes to call this the three-computer problem, because the three stages have very different compute characteristics.
At the simulation stage, you're looking for a computer that's powerful at simulating, something like an OVX Omniverse machine. There's a lot of really interesting work happening on the simulation side, but it's a very different type of workload from training, where we're using a DGX to consume these enormous amounts of data and learn from them.
And then finally, when we're deploying at the edge, it needs to be a model that's small enough and efficient enough to run on an edge device like an AGX. And really, this is Project GR00T. Project GR00T is NVIDIA's strategy for bringing humanoids and other forms of robots into the world.
It's everything from the compute infrastructure to the software to the research that's needed. It's not just one foundation model. But that foundation model is what we'll be focusing on in this talk, because that's what we worked on. So the GR00T N1 foundation model was announced at GTC in March.
It is open source. It is highly customizable. And a very big part of it is that it's cross-embodiment. There are specific embodiments that we have fine-tuned for, but the whole premise is that you can take this base model, a two-billion-parameter model, which in the world of LLMs is tiny but still pretty sizable for a robot, and go modify it for your embodiments and your use cases.
So let's start with the first huge, daunting task in the world of robotics: data. When the GR00T team started thinking about data, they put together this idea of the data pyramid, which is very elegant, but it was born out of desperation and necessity. The data you want does not exist in the quantities you need.
There is no internet-scale dataset to scrape or download or put together, because robots haven't made it to YouTube yet. So at the top of the pyramid we have the real-world data: real robots doing real tasks and solving them. And most of the time it's collected by humans teleoperating a robot.
So wearing something like an Apple Vision Pro and gloves; there are all kinds of ways to teleoperate the robot. But you have a real robot successfully completing a task, and then you have that ground-truth data. You can imagine this is very small in quantity and very expensive. And we put 24 hours per robot per day, because that's how many hours a human has in a day.
But the reality is that humans and robots get tired, so it's not even 24 hours. So really, this is a very, very limited dataset. And then at the bottom of the pyramid we have the internet: huge amounts of video data, typically of humans solving tasks.
You can imagine someone recording a cooking tutorial and putting it out there. This unstructured data isn't necessarily directly usable by robots, but there is some value in it, so we didn't want to completely discard it. It forms part of this cohesive data strategy. And then in the middle: synthetic data.
And this is a topic that could fill this entire talk; I've cut so many slides on just this section. Because in theory, this is infinite, right? You could just let the GPU keep generating more data. But in practice, creating high-quality simulation environments is very labor-intensive, and it requires serious skill.
And then on top of that, the other technique, which I'll share a little bit about, is taking the human trajectories that we do collect, the human teleoperation data, and trying to multiply it through what are essentially video generation models: world foundation models that we fine-tune for this task.
But even in that case, there's a lot of active research into how we take the little bits of high-quality data we have and multiply them, as well as how we effectively combine simulation data with this real-world data. This is DreamGen, which was announced at Computex very recently.
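To make that multiplication idea a bit more concrete, here is a minimal, purely illustrative sketch of what such a pipeline could look like. Every name in it (WorldModel, action_labeler, and so on) is a hypothetical placeholder, not the actual DreamGen or GR00T API, and the structure is a simplified assumption about how a video world model plus an action labeler could turn a few real demos into many synthetic ones.

```python
# Illustrative sketch of a DreamGen-style data-multiplication loop.
# All names here are hypothetical placeholders, not real APIs.

from dataclasses import dataclass

@dataclass
class Trajectory:
    frames: list      # video frames of the rollout
    actions: list     # per-step action vectors (recovered for generated video)
    prompt: str       # language instruction for the task

def multiply_demos(real_demos, world_model, action_labeler, new_prompts, n_variations=10):
    """Turn a small set of teleoperated demos into many synthetic trajectories."""
    synthetic = []
    for demo in real_demos:
        for prompt in new_prompts:
            for _ in range(n_variations):
                # 1. A fine-tuned video "world model" dreams a new rollout,
                #    conditioned on the demo's starting frame and a new instruction.
                video = world_model.generate(first_frame=demo.frames[0], prompt=prompt)
                # 2. A learned labeler recovers pseudo-actions from the dreamed video
                #    (e.g. an inverse-dynamics or latent-action model).
                actions = action_labeler.label(video)
                synthetic.append(Trajectory(frames=video, actions=actions, prompt=prompt))
    # The synthetic set then gets mixed back in with the real demos for training.
    return real_demos + synthetic
```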
All in all, this data piece is a huge part of what Project GR00T is about. There are many, many solutions here in terms of tele-op and the data strategy. But for now, the next piece is how we bring all this data into an architecture. So I'm going to hand over to Asta to explain that.
Thank you, Annika. Can you all hear me? All right, awesome. So before we dive into the architecture, I'm going to show you what an example input looks like and what an example output looks like. What you see here is the image observation, the robot state, and the language prompt.
That's the input. And then what's the output? The output is a robot action trajectory. The prompt was to pick up the industrial object and place it in the yellow bin, and that's what the robot does: it picks it up and places it in the yellow bin very neatly. But that's how it appears to us as humans.
But is the robot, the humanoid, seeing the same thing? Not really. The humanoid sees this: a bunch of floating-point vectors, which control the different joints. So you're seeing the output as a trajectory, the motion of the robot hand, but that's not what the robot sees.
It uses these vectors to generate a continuous action. And to set context on what a robot state and action are: you can imagine the state as a snapshot of the robot at an instant in time, including the robot's own configuration and its environment. The action is then what the robot decides to do next, based on that state.
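To make the input/output concrete, here is a minimal sketch of what an observation and an action chunk could look like as data structures. The field names, shapes, and dimensions are assumptions for illustration, not the exact GR00T N1 schema; only the prompt text comes from the example on the slide.

```python
import numpy as np
from dataclasses import dataclass

# Illustrative only: field names and shapes are assumptions, not the GR00T N1 schema.

@dataclass
class Observation:
    image: np.ndarray   # camera frame, e.g. (224, 224, 3) RGB
    state: np.ndarray   # proprioception: joint positions/velocities, gripper, etc.
    prompt: str         # language instruction

@dataclass
class ActionChunk:
    # A short horizon of future joint targets, one row per control step.
    # Each row is a flat vector of floats, one entry per controlled degree of freedom.
    actions: np.ndarray  # shape (horizon, num_dofs), e.g. (16, 43) for a humanoid

obs = Observation(
    image=np.zeros((224, 224, 3), dtype=np.uint8),
    state=np.zeros(64, dtype=np.float32),
    prompt="pick up the industrial object and place it in the yellow bin",
)
chunk = ActionChunk(actions=np.zeros((16, 43), dtype=np.float32))
```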
So, moving on and diving a bit deeper into the architecture. The GR00T N1 system introduces a very interesting concept, inspired by Daniel Kahneman's book Thinking, Fast and Slow. Show of hands, how many of you have read the book?
Amazing, that helps. So it's inspired by the same concept, but applied to a robotics context. We have two systems: system one and system two. System two, you can imagine, is the brain of the robot, or the brain of the model. That's the part which is actually trying to break down the complex task
into simpler pieces that system one can execute. So you can think of system two as the planner, which runs slowly to break down the complex task. And then system one is the fast one: it operates at around 120 hertz, and it basically executes on the tasks that system two puts out for it.
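To give a feel for the fast/slow split, here is a schematic control loop. The 120 Hz figure comes from the talk; the slower re-planning rate and all function names are illustrative assumptions, and in the real model the two systems are co-trained and run together rather than as two separate programs.

```python
import time

# Schematic only: 120 Hz is from the talk; the planning rate and all
# function names are illustrative assumptions.

CONTROL_HZ = 120      # System 1: fast action generation
PLAN_EVERY_N = 12     # re-run System 2 roughly every 12 control ticks (~10 Hz)

def run_policy(system2_plan, system1_act, get_observation, send_to_robot):
    plan = None
    tick = 0
    while True:
        obs = get_observation()
        if tick % PLAN_EVERY_N == 0:
            # Slow loop: the VLM reasons over the image + language prompt
            # and produces a plan / conditioning for the action module.
            plan = system2_plan(obs)
        # Fast loop: the action module turns the current plan and robot
        # state into the next low-level action.
        action = system1_act(plan, obs)
        send_to_robot(action)
        tick += 1
        time.sleep(1.0 / CONTROL_HZ)
```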
And now we're going to delve another level deeper into the architecture. It's okay if all of this feels complicated, because it's not very straightforward. We have as input the robot state and the noised action. You might be wondering why we call it a noised action.
The action is noised because of how the model is trained: the action module is a diffusion-style model, so during training we take the ground-truth action, corrupt it with noise, and teach the model to remove that noise, and at inference time it starts from noise and denoises its way to a clean action. So we have a noised action, and the state and noised action are passed to a state encoder and an action encoder, which generate tokens. You may be familiar with tokens; we've talked about LLMs and agents a lot. It's the same concept, just different kinds of tokens.
So: state tokens and action tokens. These are passed through a diffusion transformer block, which is essentially multiple layers of cross-attention and self-attention. Then we bring in the other piece, the vision input and the text input. You have a vision encoder that takes the image input, generates tokens, and passes them to the VLM to bring them into a standardized encoding.
Then the text tokenizer takes the text input, does the same thing, and passes it through the VLM. All of the output tokens from the VLM, which in the case of GR00T N1 is the Eagle-2 VLM, are fed into the cross-attention layers of the diffusion transformer block.
And then you get some output tokens. But these output tokens are still not ready to be consumed by the physical robot, so you need to make them consumable, and that's where this key piece called the action decoder comes in. It may seem like there are lots of encoders and decoders, but you could say the action decoder is the one that gives the model the capability to be a generalist.
You give it an action decoder that is specific to the embodiment you're going to use, whether that's a humanoid hand or an industrial robot arm. The action decoder translates the output tokens into an action vector for that specific embodiment, which can then be turned into continuous robot motion.
I'm just going to give you a second to digest all of this. You can see that the action decoder is very, very important, because otherwise you would only be able to train a model for one specific embodiment. But this model can leverage foundation knowledge from all the different embodiments and then bring it to one particular embodiment.
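Putting the pieces together, here is a schematic sketch of the forward pass we just walked through. Every module name is a placeholder standing in for a component on the slide, and the details (token handling, shapes, number of denoising steps, how embodiments are keyed) are simplified assumptions rather than the real implementation.

```python
# Conceptual sketch of the forward pass described above. All modules in `m`
# (vision_encoder, vlm, diffusion_transformer, ...) are placeholders.

def forward_pass(image, prompt, robot_state, noised_action, m, embodiment):
    # System 2 path: vision + language understanding through the VLM backbone.
    vision_tokens = m.vision_encoder(image)         # image -> visual tokens
    text_tokens = m.text_tokenizer(prompt)          # instruction -> text tokens
    vlm_tokens = m.vlm(vision_tokens, text_tokens)  # e.g. an Eagle-2-style VLM

    # System 1 path: embed the robot state and the noised action chunk.
    state_tokens = m.state_encoder(robot_state, embodiment)
    action_tokens = m.action_encoder(noised_action, embodiment)

    # Diffusion transformer block: self-attention over state/action tokens,
    # cross-attention into the VLM's output tokens, repeated over several layers.
    output_tokens = m.diffusion_transformer(
        tokens=state_tokens + action_tokens,  # concatenated token sequence
        context=vlm_tokens,
    )

    # Embodiment-specific action decoder maps tokens back into joint space.
    return m.action_decoder[embodiment](output_tokens)

# At inference time this pass is applied iteratively: the action chunk starts
# as pure noise and is denoised over a few steps into an executable action.
```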
The concept is essentially the same as that of a foundation model. Moving on to the next slide: there are two main ways of doing robot learning, and the reason I chose to keep this slide is that it came up a lot in conversations I've had over the past couple of days.
There are two ways of training robots: one is imitation learning and the other is reinforcement learning. In simple English, imitation means to copy someone, to learn by copying, and that's exactly what's happening here: you have a human expert, and the robot is trying to copy that expert.
And you're trying to minimize the loss between the robot's behavior and the expert's. So you have a gold standard, and you're trying to match up to that gold standard. In the case of reinforcement learning, it's more of a trial-and-error approach: what you're doing is simply maximizing a reward.
So you don't have a gold standard; you're just trying to do the best you can. You can think of it like having siblings: with siblings, parents compare the two of you and say you need to be like your elder sibling. But if there's no sibling, you just get to be as good as you can in your own way.
So that's reinforcement learning for you. And like all things in this world, both approaches come with pros and cons. With imitation learning, you're severely bottlenecked by the expert data, which is quite expensive. With reinforcement learning, you don't have that bottleneck, but its key challenge is sim-to-real transfer.
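To pin down the contrast, here are the two objectives written out in generic textbook form (standard notation, not GR00T-specific): imitation learning minimizes a loss between the policy's action and the expert's, while reinforcement learning maximizes the expected reward gathered by trial and error.

```latex
% Imitation learning (behavior cloning): minimize the gap to the expert's actions
\min_{\theta} \;
  \mathbb{E}_{(s,\,a^{\text{expert}}) \sim \mathcal{D}}
  \left[ \left\| \pi_{\theta}(s) - a^{\text{expert}} \right\|^{2} \right]

% Reinforcement learning: maximize expected discounted reward from trial and error
\max_{\theta} \;
  \mathbb{E}_{\tau \sim \pi_{\theta}}
  \left[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]
```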
There's a huge gap between going from simulation to the real world, and it's an active area of research; a lot of research labs and universities are working on it. So those are the two ways of training robots, and GR00T N1 uses both of them in some form. Here's an example of the trained model.
What can it do? On the left you see the model doing a few pick-and-place tasks in the kitchen. On the top right you can see that, with enough training, the model can be taught how to be romantic as well; you don't see all the fallen champagne glasses and flowers that went into capturing this perfect shot.
And then on the bottom right are two robot friends collaborating on an industrial task, again a pick-and-place task. But these are not the only tasks these humanoids or robots can do. They can be extended to any task and any environment, and that's why we have a foundation model.
A generalist foundation model which can be extended to any downstream task. So this is going to be my conclusion. There are three core principles that we spoke about today, and each of them is very hefty by itself. First, the data pyramid, which Annika spoke about. In the case of LLMs, of text models, you have the whole internet that you can scrape for data.
But there's no such internet-scale data for actions. That is one of the key challenges you need to address, whether via simulation, or via imitation learning and generating expert data through tele-operation, all sorts of things. The next thing is the dual-system architecture. Previously, each of these components would be trained independently.
That resulted in a kind of disagreement between the two systems. GR00T N1 introduces a coherent architecture where system one and system two are co-trained, which helps optimize the whole stack instead of training the pieces individually. And then the third and final piece is the generalist model.
With the generalist model, you're able to leverage foundation knowledge from the model and extend it to different embodiments and different tasks. You can think of it like large language models: you have a base foundation model, say a Llama 2 7B, or now Llama 3 or Llama 4.
I don't know which is the latest, but you can fine-tune it for any task, or domain-adapt it. Similarly, you have the GR00T N1 model, which can be adapted to any embodiment and any downstream task. Thank you so much for attending our talk today.
We're really happy you were here. Please let us know if you have questions. We'll be outside hanging out. Thank you so much. We appreciate it.