
What Is a Humanoid Foundation Model? An Introduction to GR00T N1 - Annika & Aastha



00:00:00.000 | Hi everyone, I'm Annika. This is Aastha. We both work at
00:00:19.420 | NVIDIA and we were part of the team that developed the GR00T N1
00:00:22.680 | robotics foundation model. So today we're going to give you a sense of what that
00:00:27.360 | is and how you go about building a robotics foundation model. But before we
00:00:31.920 | get into it, I feel like a lot of people start talks here with a hot take. So the
00:00:37.340 | hot take that I'm bringing to an AI conference is that we're not necessarily
00:00:40.860 | running out of jobs. This was a report by the McKinsey Global
00:00:46.260 | Institute showing that in the world's 30 most advanced economies, there are actually
00:00:52.240 | too many jobs for the number of people who could fill them. Really, the two
00:00:57.100 | things to look at in this graph are the 4.2x, the rate at
00:01:01.720 | which we've been getting more jobs than people to fill them over the last decade, and
00:01:05.020 | this line that I'm highlighting in red. That's where we're in trouble, where
00:01:08.980 | there are just more jobs than able-bodied people to fill them. And obviously
00:01:13.840 | there's a real conversation around AI and jobs, so it helps to look at what
00:01:18.540 | industries are most affected. I'm going to highlight a few in red: leisure,
00:01:24.760 | hospitality, health care, construction, transportation, manufacturing. You can probably
00:01:31.840 | figure out what they have in common. None of them can be solved by ChatGPT alone. They
00:01:37.840 | require operating instruments and devices in the physical world, and they require
00:01:43.960 | physical AI. So that's really the big challenge that I see over the coming years is how do we take
00:01:49.200 | this huge amount of intelligence that we're seeing in language models and make it
00:01:54.560 | operable in the physical world. The other question around humanoids is why we build them to
00:02:00.540 | look like humans? It's not just because we want them to look like us. The world was
00:02:05.520 | made for humans. It's very hard to have generalist robots operate in our world and be
00:02:11.220 | generally useful without copying our physical form. There are a lot of
00:02:16.740 | specialist robots that do incredible things. I don't know if you got to try the
00:02:19.980 | espresso from the barista robot downstairs; it makes a good espresso, but that robot
00:02:25.020 | couldn't even cook rice. So if we want a robot that can do multiple tasks for you, it's
00:02:31.020 | just a lot easier if that robot can operate in our human
00:02:36.080 | world. So how do we do this? There are three big buckets, three big stages. The first one is
00:02:42.120 | collecting, generating, or multiplying data, which we'll talk quite a
00:02:46.260 | lot about. And now that you have this synthetic and real data, but largely synthetic
00:02:51.900 | data, you train a model. We'll also talk about what that architecture and training
00:02:57.540 | paradigm looks like. And then finally we deploy on the robot, or at the edge. This
00:03:04.260 | is what we call the physical AI lifecycle. So generate the data, consume the data, and
00:03:10.300 | then finally deploy and have this robot operable in the physical world. NVIDIA also
00:03:15.920 | likes to call this the three-computer problem, because the stages have very different
00:03:20.100 | compute characteristics. At the simulation stage, you're looking for a
00:03:24.320 | computer that's powerful at simulating, something like an OVX Omniverse machine.
00:03:28.400 | There's a lot of really interesting work happening on the simulation side, but it
00:03:32.780 | has a very different type of workload than when we're training and we're using a DGX to
00:03:38.060 | just consume these enormous amounts of data and learn from them. And then finally, when
00:03:43.340 | we're deploying at the edge, it needs to be a model that's small enough and
00:03:46.820 | efficient enough to run on an edge device like an AGX. And really, this
00:03:53.960 | is Project GR00T. Project GR00T is NVIDIA's strategy for bringing humanoid and other forms
00:04:00.360 | of robotics into the world. It's everything from the compute infrastructure to the software
00:04:05.900 | to the research that's needed. It's not just one foundation model. But that is what we'll be focusing on in this talk,
00:04:13.540 | because that's what we worked on. The GR00T N1 foundation model was announced at GTC in March.
00:04:19.620 | It is open source. It is highly customizable. And a very big part of it is that it's cross-embodiment.
00:04:26.700 | So basically you can take this base model; there are specific embodiments that we have fine-tuned for,
00:04:31.620 | but the whole premise is that you can take this base model, a two billion parameter model,
00:04:36.340 | which in the world of LLMs is tiny but still pretty sizable for a robot, and then go and modify it
00:04:42.820 | for your embodiments and your use cases. So let's start with the first huge, daunting task in the world of
00:04:50.180 | robotics: data. When the GR00T team started thinking about data, they put together this idea of
00:04:57.620 | the data pyramid, which is very elegant but was born out of desperation and necessity. The data you want
00:05:04.180 | does not exist in sufficient quantities. There is no internet-scale dataset to scrape or download or put together,
00:05:11.540 | because robots haven't made it to YouTube yet. So really, at the top of the pyramid, we have the real
00:05:17.540 | world data, which is real robots doing real tasks and solving them. How it's
00:05:23.700 | collected is that humans teleoperate a robot most of the time, wearing something like an Apple Vision Pro and
00:05:31.140 | gloves; there are all kinds of ways to teleoperate the robot. But you have a real robot successfully
00:05:35.620 | completing a task, and then you have that ground-truth data. So you can imagine this is very small
00:05:41.620 | in quantity and very expensive. We put 24 hours per robot per day, because that's how many hours a human
00:05:48.660 | has. But the reality is that humans and robots get tired, so it's not even 24 hours. Really, this is
00:05:55.620 | a very, very limited dataset. And then at the bottom of the pyramid, we have the internet. We
00:06:00.900 | have huge amounts of video data, and it's typically humans solving tasks. You can imagine someone
00:06:07.620 | recording a cooking tutorial and putting that out there. This unstructured data is not
00:06:13.620 | necessarily relevant to robots, but there is some value in it, so we didn't want to completely discard
00:06:20.420 | it. It forms part of this cohesive data strategy. And then in the middle: synthetic data. This is a topic
00:06:28.900 | that could fill this entire talk, and I've cut down so many slides on just this section, because in theory
00:06:35.940 | this is infinite, right? You could just let the GPU keep generating more data. But in practice, creating high-
00:06:42.980 | quality simulation environments is very labor-intensive and requires serious skill. On top of that, the other
00:06:51.140 | technique, which I'll share a little bit about, is taking the human trajectories that we do collect,
00:07:00.100 | so human teleoperation data, and trying to multiply it through essentially video generation models,
00:07:07.140 | that is, world foundation models that we fine-tune to do this task. But even in that case, there's a lot
00:07:12.180 | of active research on how we take the little bits of high-quality data that we have and multiply them, as well
00:07:18.660 | as how we effectively combine simulation data with this real-world data. This is DreamGen, which was
00:07:25.780 | announced at Computex very recently. All in all, this data piece is a huge part of what
00:07:33.220 | Project GR00T is about. There are many, many solutions here in terms of teleoperation and the data strategy; a rough sketch of how these sources might be mixed follows below.
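As a purely illustrative aside, here is a tiny, hypothetical Python sketch of the data-pyramid idea: training batches drawn from a weighted mixture of web human video, synthetic simulation data, and real teleoperation data. The source names, example IDs, and weights are made up for illustration and are not the actual GR00T N1 data mixture.

```python
# Hypothetical sketch of sampling from the three tiers of the data pyramid.
# Weights and example IDs are invented; they do not reflect the real recipe.
import random

data_sources = {
    "web_human_video": {"weight": 0.5, "examples": ["cooking_tutorial_001", "assembly_clip_042"]},
    "synthetic_sim":   {"weight": 0.4, "examples": ["sim_pick_place_17", "dreamgen_traj_903"]},
    "real_teleop":     {"weight": 0.1, "examples": ["teleop_kitchen_005", "teleop_bin_sort_12"]},
}

def sample_batch(batch_size: int = 8):
    """Draw a batch, choosing each example's source according to the mixture weights."""
    names = list(data_sources)
    weights = [data_sources[n]["weight"] for n in names]
    batch = []
    for _ in range(batch_size):
        source = random.choices(names, weights=weights, k=1)[0]
        batch.append((source, random.choice(data_sources[source]["examples"])))
    return batch

print(sample_batch())
```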
00:07:41.220 | The next piece is how we bring all this data into an architecture, so I'm going to hand over to Aastha to explain
00:07:47.300 | that. Thank you, Annika. Can you all hear me? All right, awesome. So before we dive into the architecture, I'm going to show you what an
00:07:59.220 | example input looks like and what an example output looks like. So what you see here is the image observation, the robot
00:08:06.180 | state and the language prompt. That's the input. And then what's the output? The output is a robot action trajectory.
00:08:13.380 | So the prompt was to pick up the industrial object and place it in the yellow bin. And that's what the robot does.
00:08:19.940 | It picks up the object and places it in the yellow bin very neatly. But that's how it appears to us as humans.
00:08:26.740 | Is the robot, the humanoid, seeing the same thing? Not really. The humanoid sees this. It sees a bunch of
00:08:35.860 | floating-point vectors, which control the different joints. So you're seeing the output as a
00:08:41.300 | trajectory, which is the motion of the robot hand, but that's not what the robot is seeing. It uses these
00:08:48.020 | vectors to actually generate a continuous action. And to set context on what a robot state and action are:
00:08:55.380 | you can imagine the state is a snapshot of the robot at an instant in time, including the
00:09:02.660 | physique of the robot and the environment. That's the state. And then the action is what the robot decides
00:09:08.100 | to do next based on the state.
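To make the input and output concrete, here is a minimal, hypothetical Python sketch of what one inference step could look like. The names (Observation, ActionChunk, policy_step) and the dimensions are illustrative placeholders, not the actual GR00T N1 interface.

```python
# Hypothetical shape of one inference step: observation in, action chunk out.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    image: np.ndarray   # e.g. (224, 224, 3) RGB camera frame
    state: np.ndarray   # proprioceptive snapshot, e.g. joint angles + gripper
    prompt: str         # language instruction

@dataclass
class ActionChunk:
    # A short horizon of future joint-space commands, one row per timestep.
    actions: np.ndarray  # shape (horizon, action_dim), floating-point vectors

def policy_step(obs: Observation, horizon: int = 16, action_dim: int = 7) -> ActionChunk:
    """Stand-in for the model: returns a placeholder trajectory of the right shape."""
    return ActionChunk(actions=np.zeros((horizon, action_dim), dtype=np.float32))

obs = Observation(
    image=np.zeros((224, 224, 3), dtype=np.uint8),
    state=np.zeros(7, dtype=np.float32),
    prompt="pick up the industrial object and place it in the yellow bin",
)
chunk = policy_step(obs)
print(chunk.actions.shape)  # (16, 7): these vectors are what drive the joints
```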
00:09:12.260 | So moving on and diving a bit deeper into the architecture. The GR00T N1 system introduced a
00:09:21.060 | very interesting concept, and it's inspired by Daniel Kahneman's book, Thinking, Fast and Slow.
00:09:26.980 | Show of hands, how many of you have read the book? Amazing. That helps to explain. So it's inspired by
00:09:34.100 | the same concept, but applied to a robotics context. We have two systems, system one and system two.
00:09:42.180 | System two, you can imagine, is the brain of the robot, or the brain of the model. That's the part
00:09:48.820 | which is actually trying to break down the complex task and make it simpler so that system one can
00:09:56.420 | execute on it. You can think of system two as the planner, which runs slowly to break down the
00:10:02.260 | complex task. And then system one is the fast one. It operates at almost 120 hertz and basically
00:10:09.060 | executes on the task that system two puts out for it. And now we're going to delve another level
00:10:16.180 | deeper into the architecture. And it's okay if all of this is complicated to you because it's not very
00:10:21.620 | straightforward. We have the input as the robot state and the noised action. You may be wondering
00:10:28.580 | why we call it a noised action; noise is a natural state because the sensors don't capture
00:10:34.260 | the action perfectly. So we have a noised action. These are passed to a state encoder and an action
00:10:40.980 | encoder, which generate some tokens. You may be familiar with tokens; we've talked about LLMs and agents
00:10:49.060 | a lot. It's the same concept, just different kinds of tokens: state tokens and action tokens. These are then passed
00:10:56.020 | through a diffusion transformer block, which is essentially multiple layers of
00:11:03.220 | cross-attention and self-attention. Then, bringing in the other piece, which is the vision input and the text
00:11:10.740 | input: you have the vision encoder, which takes the image input, generates some tokens, and passes them to the VLM to
00:11:17.780 | bring them into a standardized encoding format. The text tokenizer takes the text input and
00:11:23.780 | does the same, passing it through the VLM. Then all of the output tokens from the VLM,
00:11:29.860 | which in the case of GR00T N1 was the Eagle-2 VLM, are passed into the cross-attention layers of the
00:11:38.740 | diffusion transformer block. And then you get some output tokens. But these
00:11:46.660 | output tokens are still not ready to be consumed by the physical robot, so you need to make them consumable
00:11:53.620 | by the physical robot. That's where you have this key piece called the action decoder. It may seem
00:12:00.660 | like there are lots of encoders and lots of decoders, but you can say that the action decoder is the one
00:12:06.180 | which gives the model the capability to be a generalist. You're giving it an action decoder which is
00:12:11.860 | specific to the embodiment that you're going to use. Whether it's a humanoid hand, a robot arm, or
00:12:17.460 | an industrial robot arm, that's where the action decoder comes into play. It's specific to the
00:12:22.740 | embodiment you're trying to use, and it translates the output for your embodiment, producing
00:12:28.820 | an action vector which can be turned into continuous robot motion, or embodiment motion.
00:12:34.580 | I'm going to give you a second to digest all of this. You can see that the action decoder is very,
00:12:41.460 | very important, because otherwise you would only be able to train a model for one specific embodiment.
00:12:47.540 | But this model can leverage foundation knowledge from all different embodiments and then bring it to
00:12:53.780 | one particular embodiment. The concept is essentially similar to that of a foundation model.
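To make the data flow above a bit more concrete, here is a much-simplified, hypothetical PyTorch sketch of the layout just described: a vision-language backbone (system 2) produces context tokens, a diffusion-transformer stack (system 1) self-attends over state and noised-action tokens and cross-attends to that context, and an embodiment-specific action decoder maps the result to an action chunk. All module names, sizes, and the single denoising pass are illustrative assumptions, not the actual GR00T N1 implementation.

```python
# Hypothetical dual-system policy sketch: VLM context -> diffusion transformer -> action decoder.
import torch
import torch.nn as nn

D = 256  # shared hidden size for the sketch

class VLMStub(nn.Module):
    """Stand-in for the vision-language model (an Eagle-2-style backbone in GR00T N1)."""
    def __init__(self):
        super().__init__()
        self.vision_proj = nn.Linear(3 * 16 * 16, D)  # fake patch embedding
        self.text_embed = nn.Embedding(1000, D)       # fake text tokenizer/embedding

    def forward(self, image_patches, text_ids):
        vis = self.vision_proj(image_patches)         # (B, num_patches, D)
        txt = self.text_embed(text_ids)               # (B, num_text_tokens, D)
        return torch.cat([vis, txt], dim=1)           # context tokens for cross-attention

class DiffusionTransformerBlock(nn.Module):
    """One self-attention + cross-attention layer over state/action tokens."""
    def __init__(self):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

    def forward(self, tokens, context):
        tokens = tokens + self.self_attn(tokens, tokens, tokens)[0]
        tokens = tokens + self.cross_attn(tokens, context, context)[0]
        return tokens + self.mlp(tokens)

class PolicySketch(nn.Module):
    def __init__(self, state_dim=14, action_dim=7):
        super().__init__()
        self.vlm = VLMStub()                          # system 2 (slow)
        self.state_encoder = nn.Linear(state_dim, D)
        self.action_encoder = nn.Linear(action_dim, D)
        self.blocks = nn.ModuleList([DiffusionTransformerBlock() for _ in range(2)])
        # Embodiment-specific head: swap this to adapt to a different robot.
        self.action_decoder = nn.Linear(D, action_dim)

    def forward(self, image_patches, text_ids, state, noised_actions):
        context = self.vlm(image_patches, text_ids)
        tokens = torch.cat(
            [self.state_encoder(state).unsqueeze(1),
             self.action_encoder(noised_actions)], dim=1)   # state + action tokens
        for block in self.blocks:                            # system 1 (fast)
            tokens = block(tokens, context)
        return self.action_decoder(tokens[:, 1:, :])         # denoised action chunk

policy = PolicySketch()
out = policy(
    image_patches=torch.randn(1, 64, 3 * 16 * 16),
    text_ids=torch.randint(0, 1000, (1, 12)),
    state=torch.randn(1, 14),
    noised_actions=torch.randn(1, 16, 7),
)
print(out.shape)  # torch.Size([1, 16, 7])
```

In this sketch, swapping out the action_decoder (and the state/action encoders) for a new robot while keeping the shared backbone is, roughly, the cross-embodiment idea described above.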
00:13:01.060 | Moving on to the next slide.
00:13:04.340 | There are two main ways of robot learning. And the reason I chose to keep this slide is because it
00:13:12.580 | came up a lot in the conversations I was having over the past couple of days. There are two ways of training robots.
00:13:19.460 | One is imitation learning and the other is through reinforcement learning.
00:13:24.500 | Imitation learning, in simple English terms: imitation means to copy someone, to learn by copying. That's
00:13:33.300 | exactly what's happening here. You have a human expert, and the robot is trying to copy the human
00:13:39.300 | expert. You're trying to minimize the loss between the robot and the expert. So you have a gold standard, and you're
00:13:45.940 | trying to match up to the gold standard. In the case of reinforcement learning, it's more of a trial-
00:13:50.740 | and-error format. What you're doing is just maximizing the reward. You don't have a golden
00:13:57.860 | standard; you're just trying to get as far as you can, the best you can. You can think of it like
00:14:04.020 | having siblings. With siblings, parents compare the two and say you need to be like
00:14:08.980 | your elder sibling. But if there's no sibling, you can just be as good as you want.
00:14:13.940 | That's reinforcement learning for you. And like all things good and bad in this world, both of them
00:14:20.740 | come with pros and cons. With imitation learning, you're severely bottlenecked by the expert data, which is
00:14:27.060 | quite expensive. In the case of reinforcement learning, you don't have that bottleneck, but its key
00:14:33.540 | challenge is sim-to-real. There's a huge gap in going from simulation to the real world, and it's an active
00:14:39.220 | area of research that a lot of research labs and universities are pursuing. Those were the two ways of
00:14:46.820 | training robots, and GR00T N1 used both of them in some ways; a toy sketch contrasting the two follows below.
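Here is that toy, purely illustrative Python sketch of the two training signals: imitation learning minimizes a loss against an expert action, while reinforcement learning has no expert target and instead nudges the policy toward higher reward by trial and error. The one-dimensional "policy" and reward function are invented for illustration and have nothing to do with the actual GR00T N1 training recipe.

```python
# Toy contrast between imitation learning and reinforcement learning signals.
import numpy as np

rng = np.random.default_rng(0)

# --- Imitation learning: match the expert (the gold standard) ---
expert_action = 0.8                      # teleoperated "gold standard" action
policy_param = 0.0                       # our policy just outputs this scalar
for _ in range(100):
    grad = 2 * (policy_param - expert_action)   # gradient of the squared imitation loss
    policy_param -= 0.1 * grad                  # gradient descent toward the expert
print("imitation result:", round(policy_param, 3))   # converges toward 0.8

# --- Reinforcement learning: no expert, just trial and error on a reward ---
def reward(action):
    return -(action - 0.8) ** 2          # unknown to the agent except by sampling

policy_param = 0.0
for _ in range(200):
    candidate = policy_param + rng.normal(0, 0.1)    # try a perturbed action
    if reward(candidate) > reward(policy_param):     # keep it if it scores better
        policy_param = candidate
print("RL result:", round(policy_param, 3))          # should also approach 0.8
```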
00:14:55.540 | Here's an example of the trained model and what it can do. On the left, you see the model doing a few pick-and-place tasks in the
00:15:01.060 | kitchen. On the top right, you can see that with enough training the model can be taught how to be romantic
00:15:07.140 | as well. You don't see all the fallen champagne glasses and fallen flowers that went into capturing
00:15:14.180 | this perfect shot. And then on the bottom right are two robot friends tackling an industrial task,
00:15:21.940 | again a pick-and-place task. But these are not the only tasks that these humanoids or robots
00:15:28.420 | can be doing. They can be extended to any task, any environment. And that's why we have a foundation
00:15:35.780 | model. A generalist foundation model which can be expanded to any downstream task.
00:15:42.500 | So this is going to be my conclusion. There are three core principles that we spoke about today. And
00:15:50.500 | each of these is very hefty by itself. First, the data pyramid, which Annika spoke about. In the case of
00:15:58.980 | LLMs, or text models, you have the whole internet that you can scrape to generate data. But
00:16:06.900 | there's no such internet-scale data for actions. That is one of the key challenges that you need to
00:16:12.340 | address, either via simulation or via imitation learning, generating expert data, teleoperation, all sorts of
00:16:18.980 | things. The next thing is the dual-system architecture. Previously, each of these
00:16:26.260 | components was trained independently, and that resulted in some kind of disagreement between the two systems.
00:16:33.300 | GR00T N1 introduces a coherent architecture where system one and system two are
00:16:41.060 | co-trained, and that helps to optimize the whole stack instead of training the pieces individually.
00:16:50.020 | And then the third and final piece is the generalist model. With the generalist model,
00:16:57.140 | you are able to leverage foundation knowledge from the model and extend it to different embodiments and
00:17:04.500 | different tasks. You can think of it like large language models: you have a base
00:17:10.660 | foundation Llama 2 7B model, or there's Llama 4 now, or Llama 3, I don't know which is the latest, but you can
00:17:18.340 | fine-tune it for any task or domain-adapt it. Similarly, you have the GR00T N1
00:17:25.540 | model, which can be adapted to any embodiment and any downstream task. Thank you so much for attending our
00:17:31.620 | talk today. We're really happy you were here. Please let us know if you have questions. We'll be outside hanging out.
00:17:38.420 | Thank you so much. We appreciate it.