What Is a Humanoid Foundation Model? An Introduction to GR00T N1 - Annika & Aastha

00:00:00.000 |
Hi everyone, I'm Annika. This is Aastha. We both work at 00:00:19.420 |
NVIDIA and we were part of the team that developed the GR00T N1 00:00:22.680 |
robotics foundation model. So today we're going to give you a sense of what that 00:00:27.360 |
is and how you go about building a robotics foundation model. But before we 00:00:31.920 |
get into it, I feel like a lot of people start talks here with a hot take. So the 00:00:37.340 |
hot take that I'm bringing to an AI conference is that we're not necessarily 00:00:40.860 |
running out of jobs. So this was a report by the McKinsey Global 00:00:46.260 |
Institute, showing that in the world's 30 most advanced economies, there are actually 00:00:52.240 |
too many jobs for the number of people who could fill them. And really the two 00:00:57.100 |
things you should look at in this whole graph are the 4.2x, which is the rate at 00:01:01.720 |
which jobs have grown relative to the people available to fill them over the last decade, and 00:01:05.020 |
this line that I'm highlighting in red. That's where we're in trouble, where 00:01:08.980 |
there are just more jobs than able-bodied people to fill them. And obviously 00:01:13.840 |
there's a real conversation around AI and jobs, so it helps to look at what 00:01:18.540 |
industries are largely affected. I'm gonna highlight a couple in red. So leisure, 00:01:24.760 |
hospitality, health care, construction, transportation, manufacturing. I guess you 00:01:31.840 |
can figure out what they have in common. None of them can be solved by ChatGPT alone. They 00:01:37.840 |
require operating instruments and devices in the physical world, and they require 00:01:43.960 |
physical AI. So that's really the big challenge that I see over the coming years: how do we take 00:01:49.200 |
this huge amount of intelligence that we're seeing in language models and make it 00:01:54.560 |
operable in the physical world. The other question around humanoids is why do we build them to 00:02:00.540 |
look like humans? It's not just because we want them to look like us. The world was 00:02:05.520 |
made for humans. It's very hard to have generalist robots operate in our world and be 00:02:11.220 |
generally useful without copying our physical form. There's a lot of 00:02:16.740 |
specialist robots that do incredible things. I don't know if you got to try the 00:02:19.980 |
espresso from the barista robot downstairs. It makes a good espresso, but that robot 00:02:25.020 |
couldn't even cook rice. So if we want a robot that can do multiple tasks for you, it's 00:02:31.020 |
just a lot easier to imagine that robot operating in our human 00:02:36.080 |
world. So how do we do this? There are three big buckets, three big stages. The first one is 00:02:42.120 |
collecting, or generating, or multiplying data, which we'll talk quite a 00:02:46.260 |
lot about. Once you have this synthetic and real data, but largely synthetic 00:02:51.900 |
data, you train a model. We'll also talk about what that architecture and training 00:02:57.540 |
paradigm looks like. And then finally we deploy on the robot, or at the edge. This 00:03:04.260 |
is what we call the physical AI lifecycle. So generate the data, consume the data, and 00:03:10.300 |
then finally deploy and have this robot operable in the physical world. NVIDIA also 00:03:15.920 |
likes to call this the three-computer problem, because the three stages have very different 00:03:20.100 |
compute characteristics. So at the simulation stage, you're looking for a 00:03:24.320 |
computer that's powerful at simulating, something like an OVX Omniverse machine. 00:03:28.400 |
There's a lot of really interesting work happening on the simulation side, but it 00:03:32.780 |
has a very different type of workload than when we're training and we're using a DGX to 00:03:38.060 |
just consume these enormous amounts of data and learn from them. And then finally when 00:03:43.340 |
we're deploying at the edge, it needs to be a model that's small enough and 00:03:46.820 |
efficient enough to run on an edge device like an AGX. And really, this is 00:03:53.960 |
Project GR00T. So Project GR00T is NVIDIA's strategy for bringing humanoid and other forms 00:04:00.360 |
of robotics into the world. And it's everything from the compute infrastructure to the software, 00:04:05.900 |
to the research that's needed. It's not just one foundation model. But that is what we'll be focusing on in this talk, 00:04:13.540 |
because that's what we worked on. So the GR00T N1 foundation model was announced at GTC in March. 00:04:19.620 |
It is open source. It is highly customizable. And a very big part of it is that it is cross-embodiment. 00:04:26.700 |
So basically you can take this base model. There are specific embodiments that we have fine-tuned for, 00:04:31.620 |
but the whole premise is that you can take this base model, which is a two billion parameter model, 00:04:36.340 |
tiny in the world of LLMs but still pretty sizable for a robot, and then go and modify it 00:04:42.820 |
for your embodiments, your use cases. So let's start with the first huge, daunting task in the world of 00:04:50.180 |
robotics: data. When the GR00T team actually started thinking about data, they put together this idea of 00:04:57.620 |
the data pyramid, which is very elegant, but it was born out of desperation and necessity. The data you want 00:05:04.180 |
does not exist in the quantities you need. There is no internet-scale data set to scrape or download or put together, 00:05:11.540 |
because robots haven't made it to YouTube yet. So really, at the top of the pyramid, we have the real 00:05:17.540 |
world data, which is robots doing things, real robots doing real tasks and solving them. And how it's 00:05:23.700 |
collected is that humans teleoperate a robot most of the time. So wearing something like an Apple Vision Pro and wearing 00:05:31.140 |
gloves, there are all kinds of ways to teleoperate the robot. But you have a real robot successfully 00:05:35.620 |
completing a task, and then you have that ground truth data. So you can imagine this is very small 00:05:41.620 |
in quantity, very expensive. And we wrote down 24 hours per robot per day, because that's all the hours a human 00:05:48.660 |
has. But the reality is that humans and robots get tired. So it's not even 24 hours. So really, this is 00:05:55.620 |
a very, very limited data set. And then at the bottom of the pyramid, we have the internet. So we 00:06:00.900 |
have huge amounts of video data, and it's typically humans solving tasks. So you can imagine someone 00:06:07.620 |
recording a cooking video tutorial and putting that out there. This unstructured data is not 00:06:13.620 |
necessarily relevant to robots, but there is some value in it. So we didn't want to completely discard 00:06:20.420 |
it. It forms part of this cohesive data strategy. And then, in the middle, synthetic data. And this is a topic 00:06:28.900 |
that could fill this entire talk. And I've cut down so many slides on just this section, because in theory, 00:06:35.940 |
this is infinite, right? You could just let the GPU keep generating more data. But in practice, creating high 00:06:42.980 |
quality simulation environments is very labor intensive, and it requires serious skill. And then on top of that, the other 00:06:51.140 |
technique, which I will share a little bit about, is taking the human trajectories that we do collect, 00:07:00.100 |
so human teleoperation data, and trying to multiply it through essentially video generation models. So 00:07:07.140 |
through World Foundation Models that we fine-tune to do this task. But even in that case, there's a lot 00:07:12.180 |
of active research in how we take the little bits of high quality data that we have and multiply them, as well 00:07:18.660 |
as how we effectively combine simulation data with this real-world data. So this is DreamGen. This was 00:07:25.780 |
something that was announced at Computex very recently. All in all, this data piece is a huge part of what 00:07:33.220 |
Project GR00T is about. So there are many, many solutions here in terms of the tele-op and the data strategy. 00:07:41.220 |
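To make the idea of a cohesive data strategy concrete, here is a minimal sketch of how the three tiers of the pyramid could be combined into one training stream by weighted sampling. The tier names, weights, and example identifiers are illustrative assumptions, not published GR00T N1 mixing ratios.

```python
# Toy weighted sampler over the data pyramid tiers -- weights are made up for illustration.
import random

DATA_TIERS = {
    "real_teleop": {"weight": 0.2, "examples": ["real_demo_0001", "real_demo_0002"]},
    "synthetic":   {"weight": 0.5, "examples": ["sim_rollout_0001", "dreamgen_clip_0001"]},
    "web_video":   {"weight": 0.3, "examples": ["human_video_0001", "human_video_0002"]},
}

def sample_batch(batch_size=8, seed=0):
    """Draw a mixed batch: scarce real data is up-weighted relative to its raw size,
    while abundant synthetic and web data fill out the rest."""
    rng = random.Random(seed)
    tiers = list(DATA_TIERS)
    weights = [DATA_TIERS[t]["weight"] for t in tiers]
    batch = []
    for _ in range(batch_size):
        tier = rng.choices(tiers, weights=weights, k=1)[0]
        batch.append((tier, rng.choice(DATA_TIERS[tier]["examples"])))
    return batch

print(sample_batch())
```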
But for now, the next piece is how do we bring all this data into an architecture? So I'm going to hand over to Aastha to explain that. 00:07:47.300 |
Thank you, Annika. Do you guys hear me? All right, awesome. So before we dive into the architecture, I'm going to show you what an 00:07:59.220 |
example input looks like and what an example output looks like. So what you see here is the image observation, the robot 00:08:06.180 |
state and the language prompt. That's the input. And then what's the output? The output is a robot action trajectory. 00:08:13.380 |
So the prompt was to pick up the industrial object and place it in the yellow bin. And that's what the robot does. 00:08:19.940 |
It picks it up and places it in the yellow bin very neatly. But this is how it appears to us as humans. 00:08:26.740 |
But is the robot or the humanoid seeing the same? Not really. The humanoid sees this. It sees a bunch of 00:08:35.860 |
vectors, floating point vectors, which control the different joints. So you're seeing the output as a 00:08:41.300 |
trajectory, which is like motion of the robot hand, but that's not what the robot is seeing. It uses these 00:08:48.020 |
vectors to actually generate a continuous action. And to set context on what a robot state and action are: 00:08:55.380 |
you can imagine the state is a snapshot of the robot at an instant in time, including the 00:09:02.660 |
physique of the robot and the environment. That's the state. And then the action is what the robot decides to do next. 00:09:12.260 |
So moving on and diving a bit deeper into the architecture. The GR00T N1 system introduced a 00:09:21.060 |
very interesting concept. And this concept is inspired by Daniel Kahneman's book, Thinking, Fast and Slow. 00:09:26.980 |
Show of hands, how many of you have read the book? Amazing. That helps with the explanation. So it's inspired by 00:09:34.100 |
the same concept, but it's applied to a robotics context. So we have two systems, system one and system two. 00:09:42.180 |
System two, you can imagine, is the brain of the robot or the brain of the model. So that's the part 00:09:48.820 |
which is actually trying to break down the complex tasks, to make them simpler such that system one can 00:09:56.420 |
execute on them. So you can think of system two as the planner, which executes slowly to break down the 00:10:02.260 |
complex task. And then system one is the fast one. It operates at almost 120 hertz, and it basically 00:10:09.060 |
executes on the task that system two puts out for it. 00:10:16.180 |
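As a rough illustration of that timing split, a slow System 2 "planning" step can run every few ticks while a fast System 1 "acting" step runs on every tick of a roughly 120 Hz loop. The function names, rates, and re-planning interval below are hypothetical, not taken from the GR00T N1 code.

```python
# Hedged sketch of a two-rate control loop; rates and helpers are illustrative.
import time

SYSTEM1_HZ = 120      # fast action loop (the talk quotes roughly 120 Hz)
SYSTEM2_EVERY_N = 30  # re-plan about 4 times per second in this toy example

def system2_plan(observation, instruction):
    """Slow 'thinking' step: break the instruction into a short-term goal."""
    return {"goal": instruction, "obs_snapshot": observation}

def system1_act(observation, plan):
    """Fast 'acting' step: produce the next low-level joint command toward the plan."""
    return [0.0] * 7  # placeholder joint command

def control_loop(get_observation, send_action, instruction, steps=120):
    plan = None
    for step in range(steps):
        obs = get_observation()
        if plan is None or step % SYSTEM2_EVERY_N == 0:
            plan = system2_plan(obs, instruction)   # slow, infrequent
        send_action(system1_act(obs, plan))         # fast, every tick
        time.sleep(1.0 / SYSTEM1_HZ)

control_loop(lambda: [0.0] * 7, lambda a: None, "pick up the object")
```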
And now we're going to delve another level deeper into the architecture. And it's okay if all of this seems complicated, because it's not very 00:10:21.620 |
straightforward. So we have the input as the robot state and the noised action. You must be wondering 00:10:28.580 |
why we've called it noised action. The noised action is the natural state, because the sensors don't capture 00:10:34.260 |
the action perfectly. So we have noised action. And then they're passed to a state encoder and an action 00:10:40.980 |
encoder, which generate some tokens. And you may be familiar with tokens. We've talked about LLMs and agents 00:10:49.060 |
a lot. So it's the same concept, just different kinds of tokens: state tokens, action tokens. And then they're passed 00:10:56.020 |
through a diffusion transformer block. And the diffusion transformer block is essentially multiple layers of 00:11:03.220 |
cross-attention and self-attention. And then we bring in the other piece, which is the vision input and the text 00:11:10.740 |
input. So you have the vision encoder which takes the image input, generates some tokens, passes it to the VLM to 00:11:17.780 |
bring it to a standardized encoding format. And then the text tokenizer, which takes the text input, 00:11:23.780 |
again does the same and passes it through the VLM. And then all of the output tokens from the VLM, 00:11:29.860 |
which in the case of GR00T N1 was the Eagle-2 VLM, are passed into the cross-attention layer of the 00:11:38.740 |
diffusion transformer block. And then you get some output tokens. These are the output tokens. But these 00:11:46.660 |
output tokens are still not ready to be consumed by the physical robot. So you need to make them consumable 00:11:53.620 |
by the physical robot. And that's where you have this key piece called the action decoder. So it may seem 00:12:00.660 |
like there are lots of encoders and lots of decoders, but you can say that the action decoder is the one 00:12:06.180 |
which gives the model the capability to be a generalist. So you're giving it an action decoder which is 00:12:11.860 |
specific to the embodiment that you're going to use. Whether it's a humanoid hand or a robot arm, 00:12:17.460 |
an industrial robot arm, that's where that action decoder comes into play. It's specific to the 00:12:22.740 |
embodiment you're trying to use. And then it's going to translate the output specific to your embodiment and produce 00:12:28.820 |
an action vector which can be translated into continuous robot motion or embodiment motion. 00:12:34.580 |
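Putting the pieces together, here is a heavily simplified sketch of that data flow: state and noised-action encoders produce tokens, a single DiT-style block self-attends over them and cross-attends to stand-in VLM (vision plus text) tokens, and an embodiment-specific action decoder turns the result back into action vectors. The module choices, dimensions, and single transformer layer are toy assumptions, not the GR00T N1 implementation.

```python
# Simplified sketch of the data flow described above -- NOT the GR00T N1 implementation.
import torch
import torch.nn as nn

D = 256  # shared token width in this toy example

class ToyDiffusionPolicy(nn.Module):
    def __init__(self, state_dim=7, action_dim=7, vlm_dim=512):
        super().__init__()
        self.state_encoder = nn.Linear(state_dim, D)    # robot state   -> state token
        self.action_encoder = nn.Linear(action_dim, D)  # noised action -> action tokens
        self.vlm_proj = nn.Linear(vlm_dim, D)           # stand-in for VLM (vision + text) tokens
        # One DiT-style block: self-attention over state/action tokens,
        # cross-attention to the VLM tokens (the "memory").
        self.dit_block = nn.TransformerDecoderLayer(d_model=D, nhead=4, batch_first=True)
        self.action_decoder = nn.Linear(D, action_dim)  # embodiment-specific head

    def forward(self, state, noised_actions, vlm_tokens):
        state_tok = self.state_encoder(state).unsqueeze(1)   # (B, 1, D)
        action_tok = self.action_encoder(noised_actions)     # (B, T, D)
        tokens = torch.cat([state_tok, action_tok], dim=1)   # (B, 1+T, D)
        memory = self.vlm_proj(vlm_tokens)                   # (B, N, D)
        out = self.dit_block(tgt=tokens, memory=memory)      # self- + cross-attention
        # Drop the state token and decode the rest into denoised action vectors.
        return self.action_decoder(out[:, 1:, :])            # (B, T, action_dim)

policy = ToyDiffusionPolicy()
actions = policy(
    state=torch.zeros(1, 7),
    noised_actions=torch.randn(1, 16, 7),
    vlm_tokens=torch.zeros(1, 32, 512),
)
print(actions.shape)  # torch.Size([1, 16, 7])
```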
Just going to give you a second to digest all of this. So you can see that the action decoder is very, 00:12:41.460 |
very important because otherwise you would only be able to train a model for one specific embodiment. 00:12:47.540 |
But this model can leverage foundation knowledge from all different embodiments and then bring it to 00:12:53.780 |
one particular embodiment. The concept is essentially similar to that of a foundation model. 00:13:04.340 |
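Continuing the toy sketch above, adapting the base model to a new embodiment could look like swapping in new embodiment-specific input and output heads and fine-tuning only those on data from the target robot. The joint count and the freeze-the-backbone choice are illustrative assumptions, not the official GR00T N1 fine-tuning recipe.

```python
# Continuing the toy sketch above (illustrative only): adapt the base policy to a
# new embodiment by swapping its small embodiment-specific heads.
import torch.nn as nn

def adapt_to_new_embodiment(policy: ToyDiffusionPolicy, state_dim: int, action_dim: int, d_model: int = 256):
    # New embodiment-specific heads, e.g. for a hypothetical 12-joint arm.
    policy.state_encoder = nn.Linear(state_dim, d_model)
    policy.action_encoder = nn.Linear(action_dim, d_model)
    policy.action_decoder = nn.Linear(d_model, action_dim)
    # One common choice: freeze the shared backbone and train only the new heads
    # on teleoperation data from the target robot.
    for name, param in policy.named_parameters():
        param.requires_grad = name.startswith(("state_encoder", "action_encoder", "action_decoder"))
    return policy

new_policy = adapt_to_new_embodiment(ToyDiffusionPolicy(), state_dim=12, action_dim=12)
```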
There are two main ways of robot learning. And the reason I chose to keep this slide is because it 00:13:12.580 |
came up a lot in the conversations I was having the past couple of days. There are two ways of training robots. 00:13:19.460 |
One is imitation learning and the other is through reinforcement learning. 00:13:24.500 |
Imitation learning: in simple English terms, imitation means to copy someone, to learn by copying. That's 00:13:33.300 |
exactly what's happening here. So you have a human expert and the robot is trying to copy the human 00:13:39.300 |
expert. And you're trying to minimize the loss between the robot and the expert. So you have a gold standard. You're 00:13:45.940 |
trying to match up to the gold standard. And then in the case of reinforcement learning, it's more of a trial- 00:13:50.740 |
and-error format. So what you're doing is you just maximize the reward. So you don't have a gold 00:13:57.860 |
standard. You're just trying to get as far as you can, the best you can. You can think of it as similar to 00:14:04.020 |
having siblings. When there are siblings, parents try to compare the two and say you need to be like 00:14:08.980 |
your elder sibling. But if there's no sibling, you can just be as good as you want. 00:14:13.940 |
So that's reinforcement learning for you. And like all things good and bad in this world both of them 00:14:20.740 |
come with pros and cons. With imitation learning, you're severely bottlenecked by the expert data, which is 00:14:27.060 |
quite expensive. In the case of reinforcement learning, you don't have that bottleneck, but the key 00:14:33.540 |
challenge is sim-to-real. There's a huge gap between going from sim to real, and it's an active 00:14:39.220 |
area of research. A lot of research labs and universities are pursuing it. 00:14:46.820 |
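The contrast between the two objectives can be written in a couple of lines. This toy snippet only shows the difference in what is being optimized; it is not GR00T N1 training code, and the loss and reward definitions are illustrative.

```python
# Illustrative contrast of the two objectives (toy numpy version, not GR00T N1 training code).
import numpy as np

def imitation_loss(predicted_actions, expert_actions):
    """Imitation learning: minimize the gap to the expert demonstration."""
    return float(np.mean((predicted_actions - expert_actions) ** 2))

def rl_objective(rewards, gamma=0.99):
    """Reinforcement learning: no expert target, just maximize discounted reward
    collected by trial and error in (usually simulated) rollouts."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

print(imitation_loss(np.zeros((16, 7)), np.ones((16, 7))))  # 1.0 -- far from the expert
print(rl_objective([0.0, 0.0, 1.0]))                        # ~0.98 -- reward for eventual success
```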
So those were the two ways of training robots, and GR00T N1 used both of them in some ways. Here's an example of the trained model. 00:14:55.540 |
What can it do? So on the left you see the model being able to do a few pick and place tasks in the 00:15:01.060 |
kitchen. On the right top you can see with enough training the model can be taught how to be romantic 00:15:07.140 |
as well. You don't see all the fallen champagne glasses and fallen flowers that went into capturing 00:15:14.180 |
this perfect snap. And then in the bottom right are two robot friends trying to get an industrial task done, 00:15:21.940 |
like a pick-and-place task again. But these are not the only tasks that these humanoids or robots 00:15:28.420 |
can be doing. They can be extended to any task, any environment. And that's why we have a foundation 00:15:35.780 |
model. A generalist foundation model which can be expanded to any downstream task. 00:15:42.500 |
So this is going to be my conclusion. There are three core principles that we spoke about today. And 00:15:50.500 |
each of these is very hefty by itself. But primarily, the data pyramid. Annika spoke about this. In the case of 00:15:58.980 |
LLMs, or text models, you have the whole internet, which you can scrape to generate data. But 00:16:06.900 |
there's no such internet-scale data for actions. So that is one of the key challenges that you need to 00:16:12.340 |
address, whether via simulation, or by imitation learning, generating expert data, tele-operation, all sorts of 00:16:18.980 |
things. The next thing is the dual system architecture. Previously what used to happen was each of these 00:16:26.260 |
components was trained independently. And that resulted in some kind of disagreement between the two systems. 00:16:33.300 |
GR00T N1 introduces this coherent architecture where both system one and system two are being 00:16:41.060 |
co-trained. And that kind of helps to optimize the whole stack instead of individually trying to train the pieces. 00:16:50.020 |
And then the third and final piece is the generalist model. So in the case of the generalist model, 00:16:57.140 |
you are able to leverage foundation knowledge from the model and extend it to different embodiments, 00:17:04.500 |
different tasks. You can think of it like, in the case of large language models, you have a base 00:17:10.660 |
foundation Llama 2 7B model, or there's Llama 4 now, or Llama 3, I don't know which is the latest, but you can 00:17:18.340 |
fine-tune it to any task or domain-adapt it. Similarly, you have the GR00T N1 00:17:25.540 |
model which can be adapted to any embodiment and any downstream task. Thank you so much for attending our 00:17:31.620 |
talk today. We're really happy you were here. Please let us know if you have questions. We'll be outside hanging out.