
What Is a Humanoid Foundation Model? An Introduction to GR00T N1 - Annika & Aastha



00:00:00.000 | Hi everyone, I'm Annika. This is Aastha. We both work at
00:00:19.420 | NVIDIA and we were part of the team that developed the GR00T N1
00:00:22.680 | robotics foundation model. So today we're going to give you a sense of what that
00:00:27.360 | is and how you go about building a robotics foundation model. But before we
00:00:31.920 | get into it, I feel like a lot of people start talks here with a hot take. So the
00:00:37.340 | hot take that I'm bringing to an AI conference is that we're not necessarily
00:00:40.860 | running out of jobs. This was a report by the McKinsey Global
00:00:46.260 | Institute showing that in the world's 30 most advanced economies, there are actually
00:00:52.240 | too many jobs for the number of people who could fill them. Really, the two
00:00:57.100 | things to look at in this graph are the 4.2x, the rate at
00:01:01.720 | which we've been getting more jobs than people to fill them over the last decade, and
00:01:05.020 | this line that I'm highlighting in red. That's where we're in trouble, where
00:01:08.980 | there are just more jobs than able-bodied people to fill them. And obviously
00:01:13.840 | there's a real conversation around AI and jobs, so it helps to look at what
00:01:18.540 | industries are most affected. I'm going to highlight a few in red: leisure,
00:01:24.760 | hospitality, health care, construction, transportation, manufacturing. You can probably
00:01:31.840 | figure out what they have in common. None of them can be solved by ChatGPT alone. They
00:01:37.840 | require operating instruments and devices in the physical world, and they require
00:01:43.960 | physical AI. So that's really the big challenge that I see over the coming years is how do we take
00:01:49.200 | this huge amount of intelligence that we're seeing in language models and make it
00:01:54.560 | operable in the physical world. The other question around humanoids is why we build them to
00:02:00.540 | look like humans? It's not just because we want them to look like us. The world was
00:02:05.520 | made for humans. It's very hard to have generalist robots operate in our world and be
00:02:11.220 | generally useful without copying our physical form. There are a lot of
00:02:16.740 | specialist robots that do incredible things. I don't know if you got to try the
00:02:19.980 | espresso from the barista robot downstairs; it makes a good espresso, but that robot
00:02:25.020 | couldn't even cook rice. So if we want a robot that can do multiple tasks for you, it's
00:02:31.020 | just a lot easier if that robot can operate in our human
00:02:36.080 | world. So how do we do this? There are three big buckets, three big stages. The first one is
00:02:42.120 | collecting, generating, or multiplying data, which we'll talk quite a
00:02:46.260 | lot about. And now that you have this synthetic and real data, but largely synthetic
00:02:51.900 | data, you train a model. We'll also talk about what that architecture and training
00:02:57.540 | paradigm looks like. And then finally we deploy on the robot, or at the edge. This
00:03:04.260 | is what we call the physical AI lifecycle. So generate the data, consume the data, and
00:03:10.300 | then finally deploy and have this robot operable in the physical world. NVIDIA also
00:03:15.920 | likes to call this the three-computer problem, because the stages have very different
00:03:20.100 | compute characteristics. At the simulation stage, you're looking for a
00:03:24.320 | computer that's powerful at simulating, something like an OVX Omniverse machine.
00:03:28.400 | There's a lot of really interesting work happening on the simulation side, but it
00:03:32.780 | has a very different type of workload than when we're training and we're using a DGX to
00:03:38.060 | just consume these enormous amounts of data and learn from them. And then finally, when
00:03:43.340 | we're deploying at the edge, it needs to be a model that's small enough and
00:03:46.820 | efficient enough to run on an edge device like an AGX. And really, this
00:03:53.960 | is Project GR00T. Project GR00T is NVIDIA's strategy for bringing humanoid and other forms
00:04:00.360 | of robotics into the world. It's everything from the compute infrastructure to the software
00:04:05.900 | to the research that's needed. It's not just one foundation model. But that is what we'll be focusing on in this talk,
00:04:13.540 | because that's what we worked on. The GR00T N1 foundation model was announced at GTC in March.
00:04:19.620 | It is open source. It is highly customizable. And a very big part of it is that it's cross-embodiment.
00:04:26.700 | So basically you can take this base model; there are specific embodiments that we have fine-tuned for,
00:04:31.620 | but the whole premise is that you can take this base model, a two billion parameter model,
00:04:36.340 | which in the world of LLMs is tiny but still pretty sizable for a robot, and then go and modify it
00:04:42.820 | for your embodiments and your use cases. So let's start with the first huge, daunting task in the world of
00:04:50.180 | robotics: data. When the GR00T team started thinking about data, they put together this idea of
00:04:57.620 | the data pyramid, which is very elegant but was born out of desperation and necessity. The data you want
00:05:04.180 | does not exist in sufficient quantities. There is no internet-scale dataset to scrape or download or put together,
00:05:11.540 | because robots haven't made it to YouTube yet. So really, at the top of the pyramid, we have the real
00:05:17.540 | world data, which is real robots doing real tasks and solving them. How it's
00:05:23.700 | collected is that humans teleoperate a robot most of the time, wearing something like an Apple Vision Pro and
00:05:31.140 | gloves; there are all kinds of ways to teleoperate the robot. But you have a real robot successfully
00:05:35.620 | completing a task, and then you have that ground-truth data. So you can imagine this is very small
00:05:41.620 | in quantity and very expensive. We put 24 hours per robot per day, because that's how many hours a human
00:05:48.660 | has. But the reality is that humans and robots get tired, so it's not even 24 hours. Really, this is
00:05:55.620 | a very, very limited dataset. And then at the bottom of the pyramid, we have the internet. We
00:06:00.900 | have huge amounts of video data, and it's typically humans solving tasks. You can imagine someone
00:06:07.620 | recording a cooking tutorial and putting that out there. This unstructured data is not
00:06:13.620 | necessarily relevant to robots, but there is some value in it, so we didn't want to completely discard
00:06:20.420 | it. It forms part of this cohesive data strategy. And then in the middle: synthetic data. This is a topic
00:06:28.900 | that could fill this entire talk, and I've cut down so many slides on just this section, because in theory
00:06:35.940 | this is infinite, right? You could just let the GPU keep generating more data. But in practice, creating high-
00:06:42.980 | quality simulation environments is very labor-intensive and requires serious skill. On top of that, the other
00:06:51.140 | technique, which I'll share a little bit about, is taking the human trajectories that we do collect,
00:07:00.100 | so human teleoperation data, and trying to multiply it through essentially video generation models,
00:07:07.140 | that is, world foundation models that we fine-tune to do this task. But even in that case, there's a lot
00:07:12.180 | of active research on how we take the little bits of high-quality data that we have and multiply them, as well
00:07:18.660 | as how we effectively combine simulation data with this real-world data. This is DreamGen, which was
00:07:25.780 | announced at Computex very recently. All in all, this data piece is a huge part of what
00:07:33.220 | Project GR00T is about. There are many, many solutions here in terms of teleoperation and the data strategy; a rough sketch of how these sources might be mixed follows below.
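As a purely illustrative aside, here is a tiny, hypothetical Python sketch of the data-pyramid idea: training batches drawn from a weighted mixture of web human video, synthetic simulation data, and real teleoperation data. The source names, example IDs, and weights are made up for illustration and are not the actual GR00T N1 data mixture.

```python
# Hypothetical sketch of sampling from the three tiers of the data pyramid.
# Weights and example IDs are invented; they do not reflect the real recipe.
import random

data_sources = {
    "web_human_video": {"weight": 0.5, "examples": ["cooking_tutorial_001", "assembly_clip_042"]},
    "synthetic_sim":   {"weight": 0.4, "examples": ["sim_pick_place_17", "dreamgen_traj_903"]},
    "real_teleop":     {"weight": 0.1, "examples": ["teleop_kitchen_005", "teleop_bin_sort_12"]},
}

def sample_batch(batch_size: int = 8):
    """Draw a batch, choosing each example's source according to the mixture weights."""
    names = list(data_sources)
    weights = [data_sources[n]["weight"] for n in names]
    batch = []
    for _ in range(batch_size):
        source = random.choices(names, weights=weights, k=1)[0]
        batch.append((source, random.choice(data_sources[source]["examples"])))
    return batch

print(sample_batch())
```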
00:07:41.220 | The next piece is how we bring all this data into an architecture, so I'm going to hand over to Aastha to explain
00:07:47.300 | that. Thank you, Annika. Can you all hear me? All right, awesome. So before we dive into the architecture, I'm going to show you what an
00:07:59.220 | example input looks like and what an example output looks like. So what you see here is the image observation, the robot
00:08:06.180 | state and the language prompt. That's the input. And then what's the output? The output is a robot action trajectory.
00:08:13.380 | So the prompt was to pick up the industrial object and place it in the yellow bin. And that's what the robot does.
00:08:19.940 | It picks up the object and places it in the yellow bin very neatly. But that's how it appears to us as humans.
00:08:26.740 | Is the robot, the humanoid, seeing the same thing? Not really. The humanoid sees this. It sees a bunch of
00:08:35.860 | floating-point vectors, which control the different joints. So you're seeing the output as a
00:08:41.300 | trajectory, which is the motion of the robot hand, but that's not what the robot is seeing. It uses these
00:08:48.020 | vectors to actually generate a continuous action. And to set context on what a robot state and action are:
00:08:55.380 | you can imagine the state is a snapshot of the robot at an instant in time, including the
00:09:02.660 | physique of the robot and the environment. That's the state. And then the action is what the robot decides
00:09:08.100 | to do next based on the state.
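To make the input and output concrete, here is a minimal, hypothetical Python sketch of what one inference step could look like. The names (Observation, ActionChunk, policy_step) and the dimensions are illustrative placeholders, not the actual GR00T N1 interface.

```python
# Hypothetical shape of one inference step: observation in, action chunk out.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    image: np.ndarray   # e.g. (224, 224, 3) RGB camera frame
    state: np.ndarray   # proprioceptive snapshot, e.g. joint angles + gripper
    prompt: str         # language instruction

@dataclass
class ActionChunk:
    # A short horizon of future joint-space commands, one row per timestep.
    actions: np.ndarray  # shape (horizon, action_dim), floating-point vectors

def policy_step(obs: Observation, horizon: int = 16, action_dim: int = 7) -> ActionChunk:
    """Stand-in for the model: returns a placeholder trajectory of the right shape."""
    return ActionChunk(actions=np.zeros((horizon, action_dim), dtype=np.float32))

obs = Observation(
    image=np.zeros((224, 224, 3), dtype=np.uint8),
    state=np.zeros(7, dtype=np.float32),
    prompt="pick up the industrial object and place it in the yellow bin",
)
chunk = policy_step(obs)
print(chunk.actions.shape)  # (16, 7): these vectors are what drive the joints
```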
00:09:12.260 | So moving on and diving a bit deeper into the architecture. The GR00T N1 system introduced a
00:09:21.060 | very interesting concept, and it's inspired by Daniel Kahneman's book, Thinking, Fast and Slow.
00:09:26.980 | Show of hands, how many of you have read the book? Amazing. That helps to explain. So it's inspired by
00:09:34.100 | the same concept, but applied to a robotics context. We have two systems, system one and system two.
00:09:42.180 | System two, you can imagine, is the brain of the robot, or the brain of the model. That's the part
00:09:48.820 | which is actually trying to break down the complex task and make it simpler so that system one can
00:09:56.420 | execute on it. You can think of system two as the planner, which runs slowly to break down the
00:10:02.260 | complex task. And then system one is the fast one. It operates at almost 120 hertz and basically
00:10:09.060 | executes on the task that system two puts out for it. And now we're going to delve another level
00:10:16.180 | deeper into the architecture. And it's okay if all of this is complicated to you because it's not very
00:10:21.620 | straightforward. We have the input as the robot state and the noised action. You may be wondering
00:10:28.580 | why we call it a noised action; noise is a natural state because the sensors don't capture
00:10:34.260 | the action perfectly. So we have a noised action. These are passed to a state encoder and an action
00:10:40.980 | encoder, which generate some tokens. You may be familiar with tokens; we've talked about LLMs and agents
00:10:49.060 | a lot. It's the same concept, just different kinds of tokens: state tokens and action tokens. These are then passed
00:10:56.020 | through a diffusion transformer block, which is essentially multiple layers of
00:11:03.220 | cross-attention and self-attention. Then, bringing in the other piece, which is the vision input and the text
00:11:10.740 | input: you have the vision encoder, which takes the image input, generates some tokens, and passes them to the VLM to
00:11:17.780 | bring them into a standardized encoding format. The text tokenizer takes the text input and
00:11:23.780 | does the same, passing it through the VLM. Then all of the output tokens from the VLM,
00:11:29.860 | which in the case of GR00T N1 was the Eagle-2 VLM, are passed into the cross-attention layers of the
00:11:38.740 | diffusion transformer block. And then you get some output tokens. But these
00:11:46.660 | output tokens are still not ready to be consumed by the physical robot, so you need to make them consumable
00:11:53.620 | by the physical robot. That's where you have this key piece called the action decoder. It may seem
00:12:00.660 | like there are lots of encoders and lots of decoders, but you can say that the action decoder is the one
00:12:06.180 | which gives the model the capability to be a generalist. You're giving it an action decoder which is
00:12:11.860 | specific to the embodiment that you're going to use. Whether it's a humanoid hand, a robot arm, or
00:12:17.460 | an industrial robot arm, that's where the action decoder comes into play. It's specific to the
00:12:22.740 | embodiment you're trying to use, and it translates the output for your embodiment, producing
00:12:28.820 | an action vector which can be turned into continuous robot motion, or embodiment motion.
00:12:34.580 | I'm going to give you a second to digest all of this. You can see that the action decoder is very,
00:12:41.460 | very important, because otherwise you would only be able to train a model for one specific embodiment.
00:12:47.540 | But this model can leverage foundation knowledge from all different embodiments and then bring it to
00:12:53.780 | one particular embodiment. The concept is essentially similar to that of a foundation model.
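To make the data flow above a bit more concrete, here is a much-simplified, hypothetical PyTorch sketch of the layout just described: a vision-language backbone (system 2) produces context tokens, a diffusion-transformer stack (system 1) self-attends over state and noised-action tokens and cross-attends to that context, and an embodiment-specific action decoder maps the result to an action chunk. All module names, sizes, and the single denoising pass are illustrative assumptions, not the actual GR00T N1 implementation.

```python
# Hypothetical dual-system policy sketch: VLM context -> diffusion transformer -> action decoder.
import torch
import torch.nn as nn

D = 256  # shared hidden size for the sketch

class VLMStub(nn.Module):
    """Stand-in for the vision-language model (an Eagle-2-style backbone in GR00T N1)."""
    def __init__(self):
        super().__init__()
        self.vision_proj = nn.Linear(3 * 16 * 16, D)  # fake patch embedding
        self.text_embed = nn.Embedding(1000, D)       # fake text tokenizer/embedding

    def forward(self, image_patches, text_ids):
        vis = self.vision_proj(image_patches)         # (B, num_patches, D)
        txt = self.text_embed(text_ids)               # (B, num_text_tokens, D)
        return torch.cat([vis, txt], dim=1)           # context tokens for cross-attention

class DiffusionTransformerBlock(nn.Module):
    """One self-attention + cross-attention layer over state/action tokens."""
    def __init__(self):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

    def forward(self, tokens, context):
        tokens = tokens + self.self_attn(tokens, tokens, tokens)[0]
        tokens = tokens + self.cross_attn(tokens, context, context)[0]
        return tokens + self.mlp(tokens)

class PolicySketch(nn.Module):
    def __init__(self, state_dim=14, action_dim=7):
        super().__init__()
        self.vlm = VLMStub()                          # system 2 (slow)
        self.state_encoder = nn.Linear(state_dim, D)
        self.action_encoder = nn.Linear(action_dim, D)
        self.blocks = nn.ModuleList([DiffusionTransformerBlock() for _ in range(2)])
        # Embodiment-specific head: swap this to adapt to a different robot.
        self.action_decoder = nn.Linear(D, action_dim)

    def forward(self, image_patches, text_ids, state, noised_actions):
        context = self.vlm(image_patches, text_ids)
        tokens = torch.cat(
            [self.state_encoder(state).unsqueeze(1),
             self.action_encoder(noised_actions)], dim=1)   # state + action tokens
        for block in self.blocks:                            # system 1 (fast)
            tokens = block(tokens, context)
        return self.action_decoder(tokens[:, 1:, :])         # denoised action chunk

policy = PolicySketch()
out = policy(
    image_patches=torch.randn(1, 64, 3 * 16 * 16),
    text_ids=torch.randint(0, 1000, (1, 12)),
    state=torch.randn(1, 14),
    noised_actions=torch.randn(1, 16, 7),
)
print(out.shape)  # torch.Size([1, 16, 7])
```

In this sketch, swapping out the action_decoder (and the state/action encoders) for a new robot while keeping the shared backbone is, roughly, the cross-embodiment idea described above.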
00:13:01.060 | Moving on to the next slide.
00:13:04.340 | There are two main ways of robot learning. And the reason I chose to keep this slide is because it
00:13:12.580 | came up a lot in the conversations I was having over the past couple of days. There are two ways of training robots.
00:13:19.460 | One is imitation learning and the other is through reinforcement learning.
00:13:24.500 | Imitation learning, in simple English terms: imitation means to copy someone, to learn by copying. That's
00:13:33.300 | exactly what's happening here. You have a human expert, and the robot is trying to copy the human
00:13:39.300 | expert. You're trying to minimize the loss between the robot and the expert. So you have a gold standard, and you're
00:13:45.940 | trying to match up to the gold standard. In the case of reinforcement learning, it's more of a trial-
00:13:50.740 | and-error format. What you're doing is just maximizing the reward. You don't have a golden
00:13:57.860 | standard; you're just trying to get as far as you can, the best you can. You can think of it like
00:14:04.020 | having siblings. With siblings, parents compare the two and say you need to be like
00:14:08.980 | your elder sibling. But if there's no sibling, you can just be as good as you want.
00:14:13.940 | That's reinforcement learning for you. And like all things good and bad in this world, both of them
00:14:20.740 | come with pros and cons. With imitation learning, you're severely bottlenecked by the expert data, which is
00:14:27.060 | quite expensive. In the case of reinforcement learning, you don't have that bottleneck, but its key
00:14:33.540 | challenge is sim-to-real. There's a huge gap in going from simulation to the real world, and it's an active
00:14:39.220 | area of research that a lot of research labs and universities are pursuing. Those were the two ways of
00:14:46.820 | training robots, and GR00T N1 used both of them in some ways; a toy sketch contrasting the two follows below.
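Here is that toy, purely illustrative Python sketch of the two training signals: imitation learning minimizes a loss against an expert action, while reinforcement learning has no expert target and instead nudges the policy toward higher reward by trial and error. The one-dimensional "policy" and reward function are invented for illustration and have nothing to do with the actual GR00T N1 training recipe.

```python
# Toy contrast between imitation learning and reinforcement learning signals.
import numpy as np

rng = np.random.default_rng(0)

# --- Imitation learning: match the expert (the gold standard) ---
expert_action = 0.8                      # teleoperated "gold standard" action
policy_param = 0.0                       # our policy just outputs this scalar
for _ in range(100):
    grad = 2 * (policy_param - expert_action)   # gradient of the squared imitation loss
    policy_param -= 0.1 * grad                  # gradient descent toward the expert
print("imitation result:", round(policy_param, 3))   # converges toward 0.8

# --- Reinforcement learning: no expert, just trial and error on a reward ---
def reward(action):
    return -(action - 0.8) ** 2          # unknown to the agent except by sampling

policy_param = 0.0
for _ in range(200):
    candidate = policy_param + rng.normal(0, 0.1)    # try a perturbed action
    if reward(candidate) > reward(policy_param):     # keep it if it scores better
        policy_param = candidate
print("RL result:", round(policy_param, 3))          # should also approach 0.8
```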
00:14:55.540 | Here's an example of the trained model and what it can do. On the left, you see the model doing a few pick-and-place tasks in the
00:15:01.060 | kitchen. On the top right, you can see that with enough training the model can be taught how to be romantic
00:15:07.140 | as well. You don't see all the fallen champagne glasses and fallen flowers that went into capturing
00:15:14.180 | this perfect shot. And then on the bottom right are two robot friends tackling an industrial task,
00:15:21.940 | again a pick-and-place task. But these are not the only tasks that these humanoids or robots
00:15:28.420 | can be doing. They can be extended to any task, any environment. And that's why we have a foundation
00:15:35.780 | model. A generalist foundation model which can be expanded to any downstream task.
00:15:42.500 | So this is going to be my conclusion. There are three core principles that we spoke about today. And
00:15:50.500 | each of these is very hefty by itself. First, the data pyramid, which Annika spoke about. In the case of
00:15:58.980 | LLMs, or text models, you have the whole internet that you can scrape to generate data. But
00:16:06.900 | there's no such internet-scale data for actions. That is one of the key challenges that you need to
00:16:12.340 | address, either via simulation or via imitation learning, generating expert data, teleoperation, all sorts of
00:16:18.980 | things. The next thing is the dual-system architecture. Previously, each of these
00:16:26.260 | components was trained independently, and that resulted in some kind of disagreement between the two systems.
00:16:33.300 | GR00T N1 introduces a coherent architecture where system one and system two are
00:16:41.060 | co-trained, and that helps to optimize the whole stack instead of training the pieces individually.
00:16:50.020 | And then the third and final piece is the generalist model. With the generalist model,
00:16:57.140 | you are able to leverage foundation knowledge from the model and extend it to different embodiments and
00:17:04.500 | different tasks. You can think of it like large language models: you have a base
00:17:10.660 | foundation Llama 2 7B model, or there's Llama 4 now, or Llama 3, I don't know which is the latest, but you can
00:17:18.340 | fine-tune it for any task or domain-adapt it. Similarly, you have the GR00T N1
00:17:25.540 | model, which can be adapted to any embodiment and any downstream task. Thank you so much for attending our
00:17:31.620 | talk today. We're really happy you were here. Please let us know if you have questions. We'll be outside hanging out.
00:17:38.420 | Thank you so much. We appreciate it.