Stanford CS25: V3 | Generalist Agents in Open-Ended Worlds
00:00:00.000 |
So today we're honored to have Jim Fan from NVIDIA, who will be talking about generalist 00:00:14.200 |
agents in open-ended worlds, and he's a senior AI research scientist at NVIDIA, where his 00:00:21.040 |
mission is to build generally capable AI agents with applications to gaming, robotics, and 00:00:27.920 |
His research spans foundation models, multi-modal AI, reinforcement learning, and open-ended learning. 00:00:35.060 |
Jim obtained his PhD degree in computer science from here, Stanford, advised by Professor Fei-Fei Li. 00:00:42.320 |
And previously, he did research internships at OpenAI, Google AI, as well as Mila Quebec 00:01:02.400 |
So I want to start with a story of two kittens. 00:01:08.040 |
It's a story that gave me a lot of inspiration over my career, so I want to share this one with you. 00:01:14.720 |
Back in 1963, there were two scientists from MIT, Held and Hein. 00:01:19.480 |
They did this ingenious experiment where they put two newborn kittens in this device, and 00:01:25.700 |
the kittens have not seen the visual world yet. 00:01:28.120 |
So it's kind of like a merry-go-round, where the two kittens are linked by a rigid mechanical 00:01:33.080 |
bar, so their movements are exactly mirrored. 00:01:37.000 |
And there's an active kitten on the right-hand side, and that's the only one able to move 00:01:41.000 |
freely and then transmit the motion over this link to the passive kitten, which is confined 00:01:47.280 |
to the basket and cannot really control its own movements. 00:01:51.760 |
And then after a couple of days, Held and Hein kind of take the kittens out of this 00:01:56.240 |
merry-go-round and then did visual testing on them. 00:01:59.440 |
And they found that only the active kitten was able to develop a healthy visual motor 00:02:03.560 |
loop, like responding correctly to approaching objects or visual cliffs, but the passive kitten did not. 00:02:13.180 |
So I find this experiment fascinating because it shows the importance of having this embodied 00:02:19.940 |
active experience to really ground a system of intelligence. 00:02:26.360 |
And let's put this experiment in today's AI context, right? 00:02:30.400 |
We actually have a very powerful passive kitten, and that is ChatGPT. 00:02:35.800 |
It passively observes and rehearses the text on the internet, and it doesn't have any embodiment. 00:02:41.860 |
And because of this, its knowledge is kind of abstract and ungrounded. 00:02:45.660 |
And that partially contributes to the fact that ChatGPT hallucinates things that are 00:02:50.540 |
just incompatible with our common sense and our physical experience. 00:02:55.460 |
And I believe the future belongs to active kittens, which translates to generalist agents. 00:03:01.800 |
They are the decision-makers in a constant feedback loop, and they're embodied in this world. 00:03:08.100 |
They're also not mutually exclusive with the passive kitten. 00:03:12.240 |
And in fact, I see the active embodiment part as a layer on top of the passive pre-training 00:03:26.660 |
You know, back in 2016, I remember it was like spring of 2016. 00:03:31.740 |
I was sitting in an undergraduate class at Columbia University, but I wasn't paying attention 00:03:36.660 |
I was watching a board game tournament on my laptop. 00:03:40.660 |
And this screenshot was the moment when AlphaGo played Lee Sedol, and AlphaGo won four 00:03:47.640 |
matches out of five, and became the first ever to beat a human champion at the game of Go. 00:03:52.580 |
You know, I remember the adrenaline that day, right? 00:03:56.020 |
Oh my God, we're finally getting to AGI, and everyone's like so excited. 00:04:00.980 |
And I think that was the moment when AI agents entered the mainstream. 00:04:05.540 |
And you know, like when the excitement fades, I felt that even though AlphaGo was so 00:04:12.340 |
mighty and so great, it could only do one thing and one thing alone, right? 00:04:18.620 |
And afterwards, you know, in 2019, there were more impressive achievements like OpenAI 5, 00:04:24.820 |
beating the human champions at a game of Dota, and AlphaStar from DeepMind beat the pros at StarCraft. 00:04:30.900 |
But all of these, along with AlphaGo, share a single kind of theme, and that is to master a single game. 00:04:38.420 |
There is this one objective that the agent needs to optimize. 00:04:42.540 |
And the models trained on Dota or Go cannot generalize to any other tasks. 00:04:48.720 |
It cannot even play other games like Super Mario or Minecraft. 00:04:52.700 |
And the world is fixed and has very little room for open-ended creativity and exploration. 00:04:59.840 |
So I argue that a generalist agent should have the following essential properties. 00:05:04.620 |
First, it should be able to pursue very complex, semantically rich and open world objectives. 00:05:11.100 |
Basically you explain what you want in natural language, and the agent should perform the task. 00:05:17.140 |
And second, the agent should have a large amount of pre-trained knowledge instead of 00:05:22.060 |
knowing only a few concepts that are extremely specific to the task. 00:05:29.900 |
A generalist agent, as the name implies, needs to do more than just a couple of things. 00:05:35.660 |
It should be, in the best case, infinitely multitask. 00:05:46.420 |
Correspondingly, we need three main ingredients. 00:05:51.980 |
The environment needs to be open-ended enough, because the agent's capability is upper bounded by the complexity of its environment. 00:06:01.220 |
And I'd argue that Earth is actually a perfect example because it's so open-ended, this world 00:06:05.700 |
we live in, that it allows an algorithm called natural evolution to produce all the diverse life forms we see today. 00:06:14.400 |
So can we have a simulator that is essentially a lo-fi Earth, but we can still run it on our computers? 00:06:23.340 |
And second, we need to provide the agent with massive pre-training data because exploration 00:06:27.740 |
in an open-ended world from scratch is just intractable. 00:06:31.780 |
And the data will serve at least two purposes. 00:06:34.260 |
One as a reference manual on how to do things. 00:06:37.500 |
And second, as a guidance on what are the interesting things worth pursuing. 00:06:42.660 |
And GPT, at least up to GPT-4, only learns from pure text on the web. 00:06:48.780 |
But can we provide the agent with much richer data, such as video walkthroughs or multimedia documents? 00:07:00.580 |
And finally, once we have the environment and the database, we are ready to train a foundation model for the agent. 00:07:08.300 |
And it should be flexible enough to pursue the open-ended tasks without any task-specific 00:07:13.340 |
assumptions, and also scalable enough to compress all of the multi-modal data that I just described. 00:07:20.940 |
And here language, I argue, will play at least two key roles. 00:07:24.860 |
One is as a simple and intuitive interface to communicate a task, to communicate the goal to the agent. 00:07:32.300 |
And second, as a bridge to ground all of the multi-modal concepts and signals. 00:07:38.040 |
And that train of thought landed us in Minecraft, the best-selling video game of all time. 00:07:45.480 |
And for those who are unfamiliar, Minecraft is a procedurally generated 3D voxel world. 00:07:51.640 |
And in the game, you can basically do whatever your heart desires. 00:07:55.460 |
And what's so special about the game is that unlike AlphaGo, StarCraft, or Dota, Minecraft 00:08:01.680 |
defines no particular objective to maximize, no particular opponent to beat, and doesn't even have a fixed storyline. 00:08:09.960 |
And that makes it very well suited as a truly open-ended AI playground. 00:08:13.760 |
And here, we see people doing extremely impressive things in Minecraft, like this is a YouTube 00:08:19.800 |
video where a gamer built the entire Hogwarts castle block by block, by hand, in the game. 00:08:28.200 |
And here's another example of someone just digging a big hole in the ground and then 00:08:32.120 |
making this beautiful underground temple with a river nearby. 00:08:40.120 |
And one more, this is someone building a functioning CPU circuit inside the game, because there 00:08:46.180 |
is something called redstone in Minecraft that you can build circuits out of, like logic gates. 00:09:00.600 |
And here, I want to highlight a number that is 140 million active players. 00:09:06.380 |
And just to put this number in perspective, this is more than twice the population of 00:09:13.240 |
And that is the amount of people playing Minecraft on a daily basis. 00:09:16.840 |
And it just so happens that gamers are generally happier than PhDs, so they love to stream their gameplay. 00:09:26.640 |
And that produces a huge amount of data every day online. 00:09:30.620 |
And there's this treasure trove of learning materials that we can tap into for training 00:09:36.700 |
You know, remember that data is the key for foundation models. 00:09:41.640 |
So we introduce MineDojo, a new open framework to help the community develop generally capable 00:09:48.520 |
agents using Minecraft as a kind of primordial soup. 00:09:55.400 |
MineDojo features three major parts: an open-ended environment, an internet-scale knowledge base, 00:10:01.300 |
and then a generalist agent developed with the simulator and massive data. 00:10:09.060 |
Here's a sample gallery of the interesting things that you can do with MineDojo's API. 00:10:15.220 |
We feature a massive benchmarking suite of more than 3,000 tasks. 00:10:19.380 |
And this is by far the largest open source agent benchmark to our knowledge. 00:10:25.540 |
And we implement a very versatile API that unlocks the full potential of the game. 00:10:29.700 |
For example, MineDojo supports multi-modal observations and the full action space. 00:10:42.260 |
You can tweak the terrain, the weather, block placement, monster spawning, and just about everything else. 00:10:52.640 |
And given the simulator, we introduce around 1,500 programmatic tasks, which are tasks 00:10:58.880 |
that have ground-truth success conditions defined in Python code. 00:11:02.660 |
And you can also explicitly write down sparse or dense reward functions using this API. 00:11:08.060 |
And some examples are like harvesting different resources, unlocking the tech tree, or fighting monsters. 00:11:16.060 |
And all these tasks come with language prompts that are templated. 00:11:20.700 |
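To make the programmatic tasks concrete, here is a minimal sketch of what a Python success condition and sparse reward could look like, assuming a generic gym-style environment loop. The observation fields, helper names, and environment interface are illustrative placeholders, not the actual MineDojo API.

```python
# Illustrative sketch of a programmatic task: "harvest 1 wool".
# The observation fields and env interface below are hypothetical placeholders.

def wool_success(obs) -> bool:
    # Ground-truth success condition checked directly against game state.
    return obs["inventory"].get("wool", 0) >= 1

def wool_sparse_reward(prev_obs, obs) -> float:
    # Sparse reward: +1 the first time the success condition flips to True.
    return float(wool_success(obs) and not wool_success(prev_obs))

def run_episode(env, policy, max_steps=500):
    # Generic gym-style rollout loop that accumulates the task reward.
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, _, done, _ = env.step(action)
        total += wool_sparse_reward(obs, next_obs)
        if wool_success(next_obs) or done:
            break
        obs = next_obs
    return total
```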
Next, we also introduce 1,500 creative tasks that are freeform and open-ended. 00:11:26.580 |
And that is in contrast to the programmatic tasks I just mentioned. 00:11:30.660 |
So for example, let's say we want the agent to build a house. 00:11:39.140 |
And just like image generation, where you don't know if the model generated a cat correctly or not, 00:11:45.240 |
it's very difficult to use simple Python programs to give these kinds of tasks reward signals. 00:11:51.020 |
And the best way is to use foundation models trained on internet-scale knowledge so that 00:11:56.660 |
the model itself understands abstract concepts like the concept of a house. 00:12:04.460 |
And finally, there's one task that holds a very special status, called Playthrough, which 00:12:08.380 |
is to beat the final boss of Minecraft, the Ender Dragon. 00:12:12.580 |
So Minecraft doesn't force you to do this task. 00:12:14.620 |
As we said, it doesn't have a fixed storyline, but it's still considered a really big milestone 00:12:23.100 |
I want to highlight that it is an extremely difficult task that requires very complex preparation and planning. 00:12:31.020 |
And for an average human, it will take many hours or even days to solve. 00:12:40.540 |
And that would be the longest benchmarking task for policy learning ever created here. 00:12:45.660 |
So I admit, I am personally a below average human. 00:12:49.300 |
I was never able to beat the Ender Dragon, and my friends laugh at me for it. 00:12:59.300 |
That was one of the motivations for this project. 00:13:03.380 |
Now let's move on to the second ingredient, the internet-scale knowledge base part of MineDojo. 00:13:09.420 |
We offer three datasets here, the YouTube, Wiki, and Reddit, and combined they are the 00:13:14.580 |
largest open-ended agent behavior database ever compiled to our knowledge. 00:13:20.900 |
The first is YouTube, and we already said Minecraft is one of the most streamed games 00:13:26.500 |
on YouTube, and the gamers love to narrate what they are doing. 00:13:30.740 |
So we collected more than 700,000 videos with 2 billion words in the corresponding transcripts. 00:13:37.820 |
And these transcripts will help the agent learn about human strategies and creativity in the game. 00:13:46.780 |
And second, the Minecraft player base is so crazy that they have compiled a huge Minecraft 00:13:54.860 |
specific Wikipedia that basically explains everything you ever need to know in every detail. 00:14:03.500 |
And we scraped 7,000 Wiki pages with interleaving multi-modal data, like images, tables, and diagrams. 00:14:12.620 |
Like this is a gallery of all of the monsters and their corresponding behaviors, like spawn conditions. 00:14:20.460 |
And also the thousands of crafting recipes are all present on the Wiki, and we scraped those as well. 00:14:26.500 |
And there's more, like complex diagrams, tables, and embedded figures. 00:14:29.740 |
Now that we have something like GPT-4V, it may be able to understand many of these diagrams. 00:14:36.980 |
And finally, the Minecraft sub-Reddit is one of the most active forums across the entire 00:14:42.820 |
Reddit, and players showcase their creations and also ask questions for help. 00:14:47.860 |
So we scraped more than 300,000 posts from Minecraft Reddit, and here are some examples 00:14:53.520 |
of how people use the Reddit as a kind of stack overflow for Minecraft. 00:14:59.540 |
And we can see that some of the top voted answers are actually quite good. 00:15:03.020 |
Like someone is asking, "Oh, why doesn't my wheat farm grow?" 00:15:06.540 |
And the answer says, "You need to light up the room with more torches, you don't have enough light." 00:15:10.540 |
Now, given the massive task suite and internet data, we have the essential components to train a generalist agent. 00:15:21.580 |
So in the first MineDojo paper, we introduced a foundation model called MineCLIP. 00:15:26.220 |
And the idea is very simple, I can explain in three slides. 00:15:30.620 |
Basically for our YouTube database, we have time-aligned videos and transcripts. 00:15:35.900 |
And these are actually the real tutorial videos from our data set. 00:15:39.900 |
You see in the third clip, "as I raise my axe in front of this pig, there's only one thing 00:15:45.900 |
that you know is going to happen." That's actually a quote from a big Minecraft YouTuber. 00:15:53.180 |
And then, given this data, we train MineCLIP in the same spirit as OpenAI's CLIP. 00:15:59.040 |
So for those who are unfamiliar, OpenAI's CLIP is a contrastive model that learns the association between images and their captions. 00:16:06.940 |
And here, it's a very similar idea, but this time it is a video text contrastive model. 00:16:12.700 |
And we associate the text with a video snippet that runs about eight to 16 seconds each. 00:16:22.780 |
And intuitively, MineCLIP learns the association between the video and the transcript that narrates it. 00:16:30.980 |
And MineCLIP outputs a score between 0 and 1, where 1 means a perfect correlation between 00:16:35.940 |
the text and the video, and 0 means the text is irrelevant to the activity. 00:16:41.420 |
So you see, this is effectively a language-prompted foundation reward model that knows the nuances 00:16:48.500 |
of things like forests, animal behaviors, and architectures in Minecraft. 00:16:57.580 |
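As a rough illustration of what training such a model involves, here is a sketch of a CLIP-style symmetric contrastive (InfoNCE) loss between video-snippet embeddings and transcript embeddings. The video and text encoders are assumed to exist elsewhere; this is not the actual MineCLIP code.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of aligned (video, text) pairs.

    video_emb, text_emb: [batch, dim] embeddings from hypothetical video and
    text encoders; matching pairs share the same row index.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature    # [batch, batch] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)         # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)     # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```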
Here's an example of our agent interacting with the simulator. 00:17:01.500 |
And here, the task is "shear sheep to obtain wool." 00:17:05.620 |
And as the agent explores in the simulator, it generates a video snippet as a moving window, 00:17:12.620 |
which can be encoded and fed into MineCLIP, along with an encoding of the text prompt. 00:17:22.260 |
The higher the association is, the more the agent's behavior in this video aligns with 00:17:27.540 |
the language, which is the task you want it to do. 00:17:30.980 |
And that becomes a reward function to any reinforcement learning algorithm. 00:17:38.480 |
It's essentially RL from human feedback, or RLHF, in Minecraft. 00:17:46.300 |
And RLHF was the cornerstone algorithm that made ChatGPT possible, and I believe it will 00:17:51.500 |
play a critical role in generalist agents as well. 00:17:55.940 |
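Here is a minimal sketch of how such a score can be wired up as an RL reward: keep a sliding window of recent frames, embed the window and the task prompt, and use their similarity as the per-step reward. The `encode_video` and `encode_text` callables are assumed stand-ins for a MineCLIP-style reward model, not the released implementation.

```python
from collections import deque
import torch.nn.functional as F

class VideoTextReward:
    """Sliding-window video-text similarity used as a dense RL reward (sketch).

    encode_video / encode_text are assumed callables that return embedding
    tensors; they stand in for a MineCLIP-style reward model.
    """
    def __init__(self, encode_video, encode_text, prompt, window=16):
        self.encode_video = encode_video
        self.frames = deque(maxlen=window)                 # last ~16 frames
        self.text_emb = F.normalize(encode_text(prompt), dim=-1)

    def __call__(self, frame) -> float:
        self.frames.append(frame)
        video_emb = F.normalize(self.encode_video(list(self.frames)), dim=-1)
        score = (video_emb * self.text_emb).sum().item()   # cosine similarity
        return max(score, 0.0)                             # keep the reward non-negative

# Usage inside any RL loop (sketch):
# reward_fn = VideoTextReward(encode_video, encode_text, "shear a sheep to obtain wool")
# reward = reward_fn(latest_frame)
```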
I'll quickly gloss over some quantitative results. 00:17:58.700 |
I promise there won't be, like, many tables of numbers here. 00:18:02.820 |
For these eight tasks, we show the percentage success rate over 200 test episodes. 00:18:08.180 |
And here, in the green circle, are two variants of our MineCLIP method. 00:18:15.920 |
So I'll highlight one baseline, which is that we construct a dense reward function manually 00:18:21.700 |
for each task using the MineDojo API, which is a Python API. 00:18:26.340 |
And you can consider this column as a kind of oracle, the upper bound of the performance, 00:18:31.180 |
because we put a lot of human effort into designing these reward functions just for this benchmark. 00:18:37.400 |
And we can see that MineCLIP is able to match the quality of many of these, not all of them, 00:18:42.940 |
but many of these manually generated rewards. 00:18:45.980 |
It is important to highlight that MineCLIP is open vocabulary. 00:18:49.660 |
So we use a single model for all of these tasks instead of one model for each. 00:18:53.660 |
And we simply prompt the reward model with different tasks. 00:19:03.420 |
One major feature of the foundation model is strong generalization out of the box. 00:19:07.220 |
So can our agent generalize to dramatic changes in the visual appearance? 00:19:12.620 |
So we did this experiment where during training, we only train our agents on a default terrain 00:19:21.820 |
But we tested zero shot in a diverse range of terrains, weathers, and day/night cycles. 00:19:27.000 |
And you can customize everything in MineDojo. 00:19:29.500 |
And in our paper, we have numbers showing that MineCLIP significantly beats an off-the-shelf 00:19:34.060 |
visual encoder when facing these kinds of distribution shifts out of the box. 00:19:38.980 |
And this is no surprise, right, because Mineclip was trained on hundreds of thousands of clips 00:19:43.900 |
from Minecraft videos on YouTube, which have a very good coverage of all the scenarios. 00:19:51.660 |
And I think that is just a testament to the big advantage of using internet-scale data. 00:20:01.780 |
And here are some demos of our learned agent behaviors on various tasks. 00:20:06.140 |
So you may notice that these tasks are relatively short, around like 100 to 500 time steps. 00:20:12.440 |
And that is because MineCLIP is not able to plan over very long time horizons. 00:20:18.700 |
It is an inherent limitation in the training pipeline, because we could only use 8- to 16-second video clips during training. 00:20:28.700 |
But our hope is to build an agent that can explore and make new discoveries autonomously over long horizons. 00:20:36.380 |
And in 2022, this goal seemed quite out of reach for us. 00:20:44.820 |
Then came GPT-4, a language model that is so good at coding and long-horizon planning. 00:20:54.220 |
We built Voyager, the first large-language-model-powered lifelong learning agent. 00:21:00.540 |
And when we set Voyager loose in Minecraft, we see that it just keeps going. 00:21:04.980 |
And by the way, all these video snippets are from a single episode of Voyager. 00:21:09.980 |
It's not from different episodes, it's a single one. 00:21:13.860 |
And we see that Voyager is just able to keep exploring the terrains, mine all kinds of 00:21:18.820 |
materials, fight monsters, craft hundreds of recipes, and unlock an ever-expanding tree of skills. 00:21:30.180 |
If we want to use the full power of GPT-4, a central question is how to stringify things, 00:21:35.700 |
converting this 3D world into a textual representation. 00:21:42.700 |
And thankfully, again, the crazy Minecraft community already built one for us. 00:21:49.740 |
It's called Mineflayer, a high-level JavaScript API that's actively maintained to work with 00:21:57.300 |
And the beauty of Mineflayer is it has access to the game states surrounding the agent, 00:22:02.660 |
like the nearby blocks, animals, and enemies. 00:22:05.940 |
So we effectively have a ground-truth perception module as textual input. 00:22:10.340 |
At the same time, Mineflayer also supports action APIs from which we can compose skills. 00:22:18.500 |
And now that we can convert everything to text, we are ready to construct an agent on top of the language model. 00:22:24.580 |
So on a high level, there are three components. 00:22:26.780 |
One is a coding module that writes JavaScript code to control the game bot, and it's the 00:22:33.180 |
main module that generates the executable actions. 00:22:36.340 |
And second, we have a code base to store the correctly written code and look it up in the 00:22:41.740 |
future if the agent needs to recall the skill. 00:22:46.980 |
And whenever facing similar situations in the future, the agent knows what to do. 00:22:51.240 |
And third, we have a curriculum that proposes what to do next, given the agent's current state and progress. 00:23:00.320 |
And when you wire these components up together, you get a loop that drives the agent indefinitely 00:23:06.800 |
and achieves something like lifelong learning. 00:23:13.520 |
We prompt GPT-4 with documentation and examples on how to use a subset of the Mineflayer API. 00:23:20.400 |
And GPT-4 writes code to take actions given the current assigned task. 00:23:25.080 |
And because JavaScript runs in a code interpreter, GPT-4 is able to define functions on the fly. 00:23:32.400 |
But the code that GPT-4 writes isn't always correct; just like human engineers, you can't expect it to be bug-free on the first try. 00:23:38.440 |
So we develop an iterative prompting mechanism to refine the program, using three kinds of feedback. 00:23:45.760 |
First, the environment feedback, like, you know, what are the new materials you've got after 00:23:49.400 |
taking an action, or, you know, whether there are enemies nearby. 00:23:53.240 |
Second, the execution error from the JavaScript interpreter, if it wrote some buggy code, 00:23:57.680 |
like an undefined variable, for example, if it hallucinates something. 00:24:02.120 |
And third, another GPT-4 that provides critique through self-reflection on the agent state and the task. 00:24:10.020 |
And that also helps refine the program effectively. 00:24:13.680 |
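Putting the three feedback signals together, a rough sketch of that refinement loop could look like the following; the `llm`, `execute`, and `critic` callables are hypothetical placeholders, not the actual Voyager implementation.

```python
def refine_program(llm, execute, critic, task, max_rounds=4):
    """Iteratively ask the LLM for code, run it, and feed back all three
    signals: environment feedback, interpreter errors, and a critic verdict.
    llm, execute, and critic are assumed callables (placeholders)."""
    feedback = ""
    for _ in range(max_rounds):
        code = llm(f"Task: {task}\nPrevious feedback:\n{feedback}\nWrite JavaScript code.")
        env_feedback, error = execute(code)           # run the code in the game bot
        success, critique = critic(task, env_feedback)
        if success and not error:
            return code                               # ready to store in the skill library
        feedback = f"env: {env_feedback}\nerror: {error}\ncritique: {critique}"
    return None                                       # give up after max_rounds
```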
So I want to show some quick examples of how the critic provides feedback on task completion. 00:24:20.000 |
So let's say in the first example, the task is to craft a spyglass. 00:24:24.200 |
And GPT-4 looks at the agent's inventory, and decides that it has enough copper, but not enough amethyst. 00:24:32.720 |
And the second task is to kill three sheep to collect food. 00:24:36.020 |
And each sheep drops one unit of wool, but there are only two units in inventory. 00:24:40.560 |
So GPT-4 reasons and says, okay, you have one more sheep to go, and so on. 00:24:47.180 |
Now moving on to the second part, once Voyager implements a skill correctly, we save it to the skill library. 00:24:55.720 |
And you can think of the skill library as a code repository written entirely by a language model. 00:25:04.740 |
And the agent can record new skills, and also retrieve skills from the library when facing similar situations. 00:25:11.960 |
So it doesn't have to go through this whole program refinement process that we just saw, which 00:25:15.720 |
is quite inefficient; you do it once and save the result to disk. 00:25:20.360 |
And in this way, Voyager kind of bootstraps its own capabilities recursively as it explores the world. 00:25:29.880 |
And let's dive a little bit deeper into how the skill library is implemented. 00:25:36.480 |
First we use GPT-3.5 to summarize the program into plain English. 00:25:41.000 |
Summarization is a very easy task, and GPT-4 is expensive, so the cheaper model is enough here. 00:25:47.600 |
And then we embed this summary as the key, and we save the program, which is a bunch of JavaScript code, as the value. 00:25:55.000 |
And we find that doing this makes retrieval better, because the summary is more semantic, 00:26:00.080 |
while the raw code is a bit more discrete, so it makes a worse embedding key. 00:26:06.940 |
And now for the retrieval process, when Voyager is faced with a new task, let's say craft 00:26:11.800 |
iron pickaxe, we again use GPT 3.5 to generate a hint on how to solve the task. 00:26:18.120 |
And that is something like a natural language paragraph. 00:26:20.780 |
And then we embed that and use it as the query into the vector database, and we retrieve the top matching skills. 00:26:30.920 |
So you can think of it as a kind of in-context replay buffer in the reinforcement learning sense. 00:26:37.920 |
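A minimal sketch of that key-value skill library, assuming an `embed` callable (an embedding model), a `summarize` callable (a cheap LLM), and a plain list standing in for the vector database:

```python
import numpy as np

class SkillLibrary:
    """Skill library sketch: key = embedding of a plain-English summary,
    value = the verified program. embed and summarize are assumed callables."""
    def __init__(self, embed, summarize):
        self.embed, self.summarize = embed, summarize
        self.keys, self.programs = [], []

    def add(self, program: str):
        summary = self.summarize(program)              # plain-English description
        self.keys.append(self.embed(summary))
        self.programs.append(program)

    def retrieve(self, hint: str, top_k=3):
        # hint: a natural-language suggestion for the new task (e.g. from GPT-3.5)
        query = self.embed(hint)
        sims = [float(np.dot(query, k) /
                      (np.linalg.norm(query) * np.linalg.norm(k)))
                for k in self.keys]
        best = np.argsort(sims)[::-1][:top_k]
        return [self.programs[i] for i in best]        # most relevant stored skills
```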
And now moving on to the third part, we have another GPT-4 that proposes what task to do next. 00:26:47.960 |
And here we give GPT 4 a very high-level unsupervised objective, that is to obtain as many unique 00:26:54.480 |
items as possible, that is our high-level directive. 00:26:58.080 |
And then GPT 4 takes this directive and implements a curriculum of progressively harder challenges 00:27:07.200 |
So it's kind of like curiosity-driven exploration, which is not a novel idea in the prior 00:27:13.400 |
literature, but here it is implemented purely in context. 00:27:16.360 |
If you're listening on Zoom, the next example is fun. 00:27:22.240 |
Let's go through this example together, just to show you how Voyager works, the whole complicated pipeline. 00:27:30.800 |
So the agent finds itself hungry, and only has one out of 20 hunger bar. 00:27:35.720 |
So GPT 4 knows that it needs to find food ASAP. 00:27:39.480 |
And then it senses there are four entities nearby: a cat, a villager, a pig, and some seeds. 00:27:46.600 |
And now GPT-4 starts a self-reflection, like, do I kill the cat and villager to get some food? 00:27:55.600 |
I can use the seeds to grow a farm, but that's going to take a very long time until I can eat. 00:28:01.560 |
So sorry, piggy, you are the one being chosen. 00:28:05.360 |
So GPT 4 looks at the inventory, which is the agent state. 00:28:12.120 |
So Voyager recalls a skill from the library, that is to craft an iron sword, and then use 00:28:18.360 |
that skill to start learning a new skill, and that is hunt pig. 00:28:23.880 |
And once the hunt pig routine is successful, GPT 4 saves it to the skill library. 00:28:33.640 |
And putting all these together, we have this iterative prompting mechanism, the skill library, 00:28:40.680 |
And all of these combined is Voyager's no-gradient architecture, where we don't train any new 00:28:46.680 |
models or fine-tune any parameters, and this allows Voyager to self-bootstrap on top of GPT-4, 00:28:54.360 |
even though we are treating the underlying language model as a black box. 00:28:59.200 |
It looks like my example worked, and they started to listen. 00:29:09.000 |
So yeah, these are the tasks that Voyager picked up along the way, and we didn't pre-program any of them. 00:29:16.680 |
The agent is kind of forever curious, and also forever pursuing new adventures just on its own. 00:29:24.160 |
So to quickly show some quantitative results, here we have a learning curve, where the x-axis 00:29:31.440 |
is the number of prompting iterations, and the y-axis is the number of unique items that 00:29:36.360 |
Voyager discovered as it's exploring an environment. 00:29:40.760 |
And these two curves are baselines, ReAct and Reflexion. 00:29:47.840 |
And this is AutoGPT, which is like a popular software repo. 00:29:50.720 |
Basically, you can think of it as combining ReAct and a task planner that decomposes a high-level objective into subgoals. 00:30:00.280 |
We're able to obtain three times more novel items than the prior methods, and also unlock the tech tree much faster. 00:30:09.320 |
And if you take away the skill library, you see that Voyager really suffers. 00:30:13.640 |
The performance takes a hit, because every time it needs to kind of repeat and relearn 00:30:18.480 |
every skill from scratch, and it starts to make a lot more mistakes, and that really hurts. 00:30:25.920 |
Here, these two are the bird's-eye views of the Minecraft map, and these circles are what 00:30:33.160 |
the prior methods are able to explore, given the same prompting iteration budget. 00:30:39.360 |
And we see that they tend to get stuck in local areas and kind of fail to explore more. 00:30:45.120 |
But Voyager is able to navigate distances at least two times as much as the prior works. 00:30:52.760 |
So it's able to visit a lot more places, because to satisfy this high-level directive of obtaining 00:30:59.000 |
as many unique items as possible, you've got to travel, right? 00:31:02.480 |
If you stay at one place, you will quickly exhaust interesting things to do. 00:31:06.760 |
And Voyager travels a lot, so that's how we came up with the name. 00:31:11.960 |
So finally, one limitation is that Voyager does not currently support visual perception, 00:31:18.080 |
because the GPT-4 that we used back then was text-only. 00:31:21.720 |
But there's nothing stopping Voyager from adopting multi-modal language models in the future. 00:31:27.360 |
So here we have a little proof-of-concept demo, where we ask a human to basically function as a visual critic. 00:31:34.040 |
And the human will tell Voyager, as you're building these houses, what are the things that went wrong. 00:31:39.960 |
Like, you placed a door incorrectly, like, the roof is also not done correctly. 00:31:45.200 |
So the human is acting as a critic module of the Voyager stack. 00:31:49.560 |
And we see that with some of that help, Voyager is able to build a farmhouse and a nether portal. 00:31:56.320 |
So it doesn't have a hard time understanding, you know, 3D spatial coordinates just by itself. 00:32:03.680 |
Now, after doing Voyager, we're considering, like, where else can we apply this idea, right, 00:32:11.440 |
of coding in an embodied environment, observing the feedback, and iteratively refining the program. 00:32:18.920 |
So we came to realize that physics simulations themselves are also just Python code. 00:32:24.740 |
So why not apply some of the principles from Voyager and do something in another domain? 00:32:31.560 |
What if you apply Voyager in the space of this physics simulator API? 00:32:35.420 |
And this is Eureka, which my team announced just, like, three days ago, fresh out of the oven. 00:32:41.840 |
It is an open-ended agent that designs reward functions for robot dexterity at superhuman 00:32:49.700 |
And it turns out that GPT-4 plus reinforcement learning can spin a pen much better than I can. 00:32:56.280 |
I gave up on this task a long time ago from childhood. 00:33:03.640 |
So Eureka's idea is very simple and intuitive. 00:33:07.040 |
GPT-4 generates a bunch of possible reward function candidates implemented in Python. 00:33:13.040 |
And then you just do a full reinforcement learning training loop for each candidate 00:33:18.640 |
in a GPU-accelerated simulator, and you get a performance metric, and you take the best 00:33:24.600 |
candidates and feed them back to GPT-4, and it samples the next proposals of candidates and keeps 00:33:31.020 |
improving the whole population of the reward functions. 00:33:35.600 |
It's kind of like an in-context evolutionary search. 00:33:40.360 |
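A sketch of that outer loop, under stated assumptions: `llm_propose_rewards`, `train_rl`, and the feedback formatting are hypothetical placeholders rather than the released Eureka code.

```python
def eureka_style_search(llm_propose_rewards, train_rl, env_source, task,
                        generations=5, population=8):
    """In-context evolutionary search over reward functions (sketch).
    Each candidate is a Python reward function proposed by the LLM;
    train_rl runs a full RL training job and returns a fitness score."""
    best_code, best_score, feedback = None, float("-inf"), ""
    for _ in range(generations):
        candidates = llm_propose_rewards(env_source, task, feedback, n=population)
        scored = [(train_rl(code), code) for code in candidates]   # slow inner loop
        top_score, top_code = max(scored, key=lambda x: x[0])
        if top_score > best_score:
            best_score, best_code = top_score, top_code
        # Feed the best candidate and its metrics back so the LLM can mutate it.
        feedback = f"best reward so far (score={top_score}):\n{top_code}"
    return best_code, best_score
```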
So here's the initial reward generation, where Eureka takes as context the environment code 00:33:46.100 |
of NVIDIA's Isaac Gym and a task description, and samples the initial reward function implementation. 00:33:54.540 |
So we found that the simulator code itself is actually a very good reference manual, 00:33:58.840 |
because it tells Eureka what are the variables you can use, like the hand positions, like 00:34:04.140 |
here, the fingertip position, the fingertip state, the rotation, angular velocity, et cetera. 00:34:09.780 |
So you know all of these variables from the simulator code, and you know how they interact with each other. 00:34:16.520 |
So that serves as a very good in-context instruction. 00:34:21.640 |
So Eureka doesn't need to reference any human-written reward functions. 00:34:26.760 |
And then once you have the generated reward, you plug it into any reinforcement learning algorithm. 00:34:33.860 |
So this step is typically very costly and very slow, because reinforcement learning is very sample-inefficient. 00:34:40.220 |
And we were only able to scale up Eureka because of NVIDIA's Isaac Gym, which runs 1,000 simulated environments in parallel. 00:34:50.360 |
So basically, you can think of it as speeding up reality by 1,000x. 00:34:57.360 |
And then after training, you will get the performance metrics back on each reward component. 00:35:02.640 |
And as we saw from Voyager, GPT-4 is very good at self-reflection. 00:35:10.080 |
And there's a software trial reminding you to activate a license. 00:35:17.920 |
Yeah, so Eureka reflects on it and then proposes mutations on the code. 00:35:26.800 |
So here, the mutations, we found, can be very diverse, ranging from something as simple 00:35:31.400 |
as just changing a hyperparameter in the reward function weighting to all the way to adding 00:35:36.600 |
completely novel components to the reward function. 00:35:41.280 |
And in our experiments, Eureka turns out to be a superhuman reward engineer, actually 00:35:47.720 |
outperforming some of the reward functions implemented by the expert human engineers on NVIDIA's benchmark tasks. 00:35:57.100 |
So here are some more demos of how Eureka is able to write very complex rewards that would be hard to design by hand. 00:36:05.940 |
And we can actually train the robot hand to rotate pens, not just in one direction, but 00:36:11.320 |
in different directions along different 3D axes. 00:36:15.480 |
I think one major contribution of Eureka, different from Voyager, is to bridge the gap 00:36:20.640 |
between high-level reasoning and low-level motor controls. 00:36:25.060 |
So Eureka introduces a new paradigm that I'm calling hybrid gradient architecture. 00:36:30.500 |
So recall Voyager is a no-gradient architecture. 00:36:32.800 |
We don't touch anything, and we don't train anything. 00:36:35.960 |
But Eureka is a hybrid gradient, where a black box inference-only language model instructs 00:36:47.760 |
The outer loop is gradient-free, and it's driven by GPT-4, kind of selecting the reward function candidates. 00:36:56.400 |
And the inner loop is gradient-based: you train a full reinforcement learning run on that reward to achieve extreme dexterity, 00:37:02.580 |
by training a specialized neural network controller. 00:37:06.640 |
And you must have both loops to succeed, to deliver this kind of dexterity. 00:37:11.820 |
And I think it will be a very useful paradigm for training robot agents in the future. 00:37:19.100 |
So you know, these days when I go on Twitter or X, I see AI conquering new lands like every day. 00:37:27.720 |
You know, chat, image generation, and music, they're all very well within reach. 00:37:33.260 |
But MineDojo, Voyager, and Eureka, these are just scratching the surface of open-ended agents. 00:37:40.720 |
And looking forward, I want to share two key research directions that I personally find 00:37:45.760 |
extremely promising, and I'm also working on it myself. 00:37:50.300 |
The first is a continuation of MineCLIP, basically how to develop methods that learn from internet-scale video data. 00:37:57.680 |
And the second is multimodal foundation models. 00:38:00.720 |
Now, GPT-4V is coming, but it is just the beginning of an era. 00:38:05.680 |
And I think it's important to have all of the modalities in a single foundation model. 00:38:17.400 |
There is so much data on YouTube, way too much for our limited GPUs to process. 00:38:24.600 |
They're extremely useful to train models that not only have dynamic perception and intuitive 00:38:30.400 |
physics, but also capture the complexity of human creativity and human behaviors. 00:38:36.520 |
It's all good, except that when you are using video to pre-train embodied agents, there is a catch. 00:38:44.600 |
You don't get action labels, and you don't get any of the grounding, because you are only passively watching. 00:38:51.160 |
So I think here's a demonstration of why learning from video is hard, even for natural intelligence. 00:38:57.280 |
So a little cat is watching boxers shaking their heads, and it thinks maybe shaking your head is how you fight. 00:39:20.280 |
You have no idea why Tyson is doing this, right? 00:39:22.720 |
Like, the cat has no idea, and then it associates this with just the wrong kind of policy. 00:39:30.880 |
But for sure, it doesn't help the fighting, but it definitely boosts the cat's confidence. 00:39:41.280 |
Now, I want to point out a few of the latest research directions in how to leverage so much video data. 00:39:51.480 |
The first is the simplest, just learn kind of a visual feature extractor from the videos. 00:39:57.080 |
So this is R3M from Chelsea Finn's group at Stanford. 00:40:01.360 |
And this model is still an image-level representation, just that it uses a video-level loss function 00:40:06.200 |
to train, more specifically, time-contrastive learning. 00:40:10.280 |
And after that, you can use this as an image backbone for any agent, but you still need 00:40:15.560 |
to kind of fine-tune using domain-specific data for the agent. 00:40:21.040 |
The second path is to learn reward functions from video, and MineCLIP is one model under this category. 00:40:28.920 |
It uses a contrastive objective between the transcript and video. 00:40:33.160 |
And here, this work, VIP, is another way to learn a similarity-based reward for goal-conditioned tasks. 00:40:41.920 |
So VIP is led by the researcher who is also the first author of Eureka; Eureka is his follow-up work. 00:40:52.680 |
And the third path is, can we directly do imitation learning from video, but better than the cat that we just saw? 00:41:00.040 |
So we just said, you know, the videos don't have the actions, right? 00:41:04.640 |
We need to find some ways to pseudo-label the actions. 00:41:07.860 |
And this is VPT, video pre-training, from OpenAI last year, to solve long-range tasks in Minecraft. 00:41:18.400 |
Basically, you use a keyboard and mouse action space, so you can align this action space with how humans actually play. 00:41:26.320 |
And OpenAI hires a bunch of Minecraft players and actually collects data in-house. 00:41:31.920 |
So they record the episodes done by those gamers. 00:41:35.240 |
And now you have a dataset of video and action pairs, right? 00:41:40.160 |
And you train something called an inverse dynamics model, which is to take the observation 00:41:45.440 |
and then predict the actions that caused the observation to change. 00:41:51.240 |
And that becomes a labeler that you can apply to in-the-wild YouTube videos that don't have action labels. 00:41:59.400 |
So you apply IDM to like 70K hours of in-the-wild YouTube videos, and you will get these pseudo-labeled 00:42:05.120 |
pseudo-actions that are not always correct, but also way better than random. 00:42:10.320 |
And then you run imitation learning on top of this augmented dataset. 00:42:14.520 |
And in this way, OpenAI is able to greatly expand the data because the original data 00:42:21.160 |
collected from the humans are high quality, but they're extremely expensive, while in-the-wild 00:42:25.920 |
YouTube videos are very cheap, but you don't have the actions. 00:42:29.020 |
So they kind of solved and got the best of both worlds. 00:42:32.920 |
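In pseudocode, that pipeline looks roughly like this; `train_idm`, `behavior_clone`, and the dataset shapes are assumed placeholders rather than OpenAI's released VPT code.

```python
def vpt_style_pipeline(labeled_clips, unlabeled_clips, train_idm, behavior_clone):
    """VPT-style pipeline sketch.
    labeled_clips:   small, expensive set of (frames, keyboard/mouse actions)
    unlabeled_clips: huge, cheap set of in-the-wild video frames only
    train_idm / behavior_clone are assumed training routines (placeholders)."""
    # 1. Train an inverse dynamics model on the small contractor dataset:
    #    given frames around time t, predict the action taken at t.
    idm = train_idm(labeled_clips)

    # 2. Pseudo-label the in-the-wild videos with the IDM.
    pseudo_labeled = [(frames, idm(frames)) for frames in unlabeled_clips]

    # 3. Behavior-clone a policy on the combined, mostly pseudo-labeled data.
    policy = behavior_clone(labeled_clips + pseudo_labeled)
    return policy
```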
But still, you know, it's really expensive to hire these humans. 00:42:40.480 |
I'm a firm believer that multimodal models will be the future. 00:42:44.560 |
And I see text as a very lousy kind of 1D projection of our physical world. 00:42:49.320 |
So it's essential to include the other sensory modalities to provide a full embodied experience. 00:42:55.480 |
And in the context of embodied agents, I think the input will be a mixture of text, images, 00:43:01.040 |
videos, and even audio in the future, and the output will be actions. 00:43:06.780 |
So here's a very early example of a multimodal language model for robot learning. 00:43:15.400 |
We can ask the robot to bring us a cup of tea from the kitchen. 00:43:19.360 |
But if we want to be more specific, like I want this particular cup, we can show an image of it. 00:43:26.620 |
And we also provide a video demo of how we want to mop the floor, and ask the robot to follow the same motion. 00:43:36.480 |
And when the robot sees an unfamiliar object, like a sweeper, we can explain it by providing 00:43:40.480 |
an image and showing, this is a sweeper, now, you know, go ahead and do something with it. 00:43:46.400 |
And finally, to ensure safety, we can say, take a picture of that room and just do not go in there. 00:43:51.720 |
To achieve this, back last year, we proposed a model called VIMA, which stands for VisuoMotor Attention agent. 00:43:59.360 |
And in this work, we introduced a concept called multimodal prompting, where the prompt interleaves text and images. 00:44:08.400 |
And this provides a very expressive API that just unifies a bunch of different robot tasks 00:44:14.060 |
that otherwise would require very different pipelines or specialized models to solve in the past. 00:44:20.960 |
And VIMA simply tokenizes everything, converting image and text into sequences of tokens and 00:44:28.340 |
training a transformer on top to output the robot arm actions autoregressively, one step at a time. 00:44:37.600 |
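Conceptually, that recipe could be sketched as follows; the tokenizers, transformer, and action head are assumed placeholders, not the actual VIMA architecture.

```python
def multimodal_prompt_policy(prompt_items, obs_history, tokenize_text,
                             tokenize_image, transformer, action_head):
    """Multimodal-prompt policy sketch: interleave text and image tokens from
    the prompt, append observation tokens, and decode the next arm action.
    All callables here are assumed placeholders."""
    tokens = []
    for item in prompt_items:                      # e.g. ["rearrange to match", goal_image]
        if isinstance(item, str):
            tokens.extend(tokenize_text(item))     # text -> token embeddings
        else:
            tokens.extend(tokenize_image(item))    # image -> patch tokens
    for obs in obs_history:
        tokens.extend(tokenize_image(obs))         # current scene observations
    hidden = transformer(tokens)                   # causal transformer forward pass
    return action_head(hidden[-1])                 # next action, one step at a time
```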
So just to look at some of the examples here, like this prompt: rearrange objects to match this scene. 00:44:43.840 |
It is a classical task called visual goal reaching that has a big body of prior works 00:44:47.480 |
on it, and that's how our robot does it, given this prompt. 00:44:54.440 |
And we can also give it novel concepts in context, like this is a blicker, this is a 00:45:03.760 |
So it's not in the training data, but VIMA is able to generalize zero-shot and follow the instruction. 00:45:11.560 |
So the bot understands what we want and then follows this trajectory. 00:45:15.600 |
And finally, we can give it more complex prompt, like these are the safety constraints, sweep 00:45:20.960 |
the box into this, but without exceeding that line. 00:45:24.000 |
And we do this using the interleaving image and text tokens. 00:45:30.720 |
And recently, Google Brain Robotics followed up after VIMA with RT-1 and RT-2, Robotics Transformer 1 and 2. 00:45:39.200 |
And RT2 is using a similar recipe as I described, where they first kind of pre-train on internet 00:45:45.520 |
scale data and then fine tune with some human collected demonstrations on the Google robots. 00:45:51.760 |
And RoboCat from DeepMind is another interesting work. 00:45:54.880 |
They train a single unified policy that works not just on a single robot, but actually across 00:46:02.120 |
different embodiments, different robot forms, and even generalizes to new hardware. 00:46:07.540 |
So I think this is like a higher form of multimodal agent with a physical form factor. 00:46:12.560 |
The morphology of the agent itself is another modality. 00:46:17.620 |
So that concludes our looking forward section. 00:46:22.220 |
And lastly, I want to kind of put all the links together of the works I described. 00:46:29.140 |
We have open-sourced everything; well, for all the projects where we can, 00:46:34.880 |
we open source as much as possible, including the model code, checkpoints, simulator, and datasets. 00:46:55.140 |
If you just want an excuse to play Minecraft at work, then MineDojo is perfect for you, 00:47:00.080 |
because you are collecting human demonstrations to train generalist agents. 00:47:04.220 |
And if there's one thing that you take away from this talk, it should be this slide. 00:47:10.180 |
And lastly, I just want to remind all of us, despite all the progress I've shown, what 00:47:14.780 |
we can do is still very far from human ingenuity as embodied agents. 00:47:21.380 |
These are the videos from our dataset of people doing things like decorating a winter wonderland 00:47:27.100 |
or building a functioning CPU circuit within Minecraft. 00:47:35.780 |
If humans can do these mind-blowing tasks, then why not our AI, right?