Stanford CS25: V3 | Generalist Agents in Open-Ended Worlds
00:00:00.000 |
So today we're honored to have Jim Fan from NVIDIA, who will be talking about generalist 00:00:14.200 |
agents in open-ended worlds, and he's a senior AI research scientist at NVIDIA, where his 00:00:21.040 |
mission is to build generally capable AI agents with applications to gaming, robotics, and 00:00:27.920 |
His research spans foundation models, multi-modal AI, reinforcement learning, and open-ended learning. 00:00:35.060 |
Jim obtained his PhD degree in computer science from here, Stanford, advised by Professor Fei-Fei Li. 00:00:42.320 |
And previously, he did research internships at OpenAI, Google AI, as well as Mila Quebec 00:01:02.400 |
So I want to start with a story of two kittens. 00:01:08.040 |
It's a story that gave me a lot of inspiration over my career, so I want to share this one with you. 00:01:14.720 |
Back in 1963, there were two scientists from MIT, Held and Hein. 00:01:19.480 |
They did this ingenious experiment where they put two newborn kittens in this device, and 00:01:25.700 |
the kittens have not seen the visual world yet. 00:01:28.120 |
So it's kind of like a merry-go-round, where the two kittens are linked by a rigid mechanical 00:01:33.080 |
bar, so their movements are exactly mirrored. 00:01:37.000 |
And there's an active kitten on the right-hand side, and that's the only one able to move 00:01:41.000 |
freely and then transmit the motion over this link to the passive kitten, which is confined 00:01:47.280 |
to the basket and cannot really control its own movements. 00:01:51.760 |
And then after a couple of days, Held and Hein kind of take the kittens out of this 00:01:56.240 |
merry-go-round and then did visual testing on them. 00:01:59.440 |
And they found that only the active kitten was able to develop a healthy visual motor 00:02:03.560 |
loop, like responding correctly to approaching objects or visual cliffs, but the passive kitten did not. 00:02:13.180 |
So I find this experiment fascinating because it shows the importance of having this embodied 00:02:19.940 |
active experience to really ground a system of intelligence. 00:02:26.360 |
And let's put this experiment in today's AI context, right? 00:02:30.400 |
We actually have a very powerful passive kitten, and that is ChatGPT. 00:02:35.800 |
It passively observes and rehearses the text on the internet, and it doesn't have any embodiment. 00:02:41.860 |
And because of this, its knowledge is kind of abstract and ungrounded. 00:02:45.660 |
And that partially contributes to the fact that ChatGPT hallucinates things that are 00:02:50.540 |
just incompatible with our common sense and our physical experience. 00:02:55.460 |
And I believe the future belongs to active kittens, which translates to generalist agents. 00:03:01.800 |
They are the decision-makers in a constant feedback loop, and they're embodied in this world. 00:03:08.100 |
They're also not mutually exclusive with the passive kitten. 00:03:12.240 |
And in fact, I see the active embodiment part as a layer on top of the passive pre-training 00:03:26.660 |
You know, back in 2016, I remember it was like spring of 2016. 00:03:31.740 |
I was sitting in an undergraduate class at Columbia University, but I wasn't paying attention 00:03:36.660 |
I was watching a board game tournament on my laptop. 00:03:40.660 |
And this screenshot was the moment when AlphaGo played Lee Sedol, and AlphaGo won four 00:03:47.640 |
matches out of five, and became the first ever to beat a human champion at the game of Go. 00:03:52.580 |
You know, I remember the adrenaline that day, right? 00:03:56.020 |
Oh my God, we're finally getting to AGI, and everyone's like so excited. 00:04:00.980 |
And I think that was the moment when AI agents entered the mainstream. 00:04:05.540 |
And you know, like when the excitement fades, I felt that even though AlphaGo was so 00:04:12.340 |
mighty and so great, it could only do one thing and one thing alone, right? 00:04:18.620 |
And afterwards, you know, in 2019, there were more impressive achievements like OpenAI 5, 00:04:24.820 |
beating the human champions at a game of Dota, and AlphaStar from DeepMind beat the pros at StarCraft. 00:04:30.900 |
But all of these, along with AlphaGo, share a single kind of theme, and that is to master a single game. 00:04:38.420 |
There is this one objective that the agent needs to optimize. 00:04:42.540 |
And the models trained on Dota or Go cannot generalize to any other tasks. 00:04:48.720 |
It cannot even play other games like Super Mario or Minecraft. 00:04:52.700 |
And the world is fixed and has very little room for open-ended creativity and exploration. 00:04:59.840 |
So I argue that a generalist agent should have the following essential properties. 00:05:04.620 |
First, it should be able to pursue very complex, semantically rich and open world objectives. 00:05:11.100 |
Basically you explain what you want in natural language, and the agent should perform the task. 00:05:17.140 |
And second, the agent should have a large amount of pre-trained knowledge instead of 00:05:22.060 |
knowing only a few concepts that are extremely specific to the task. 00:05:29.900 |
A generalist agent, as the name implies, needs to do more than just a couple of things. 00:05:35.660 |
It should be, in the best case, infinitely multitask. 00:05:46.420 |
Correspondingly, we need three main ingredients. 00:05:51.980 |
The environment needs to be open-ended enough, because the agent's capability is upper bounded by the complexity of its environment. 00:06:01.220 |
And I'd argue that Earth is actually a perfect example because it's so open-ended, this world 00:06:05.700 |
we live in, that it allows an algorithm called natural evolution to produce all the diverse life forms we see today. 00:06:14.400 |
So can we have a simulator that is essentially a lo-fi Earth, but we can still run it on our computers? 00:06:23.340 |
And second, we need to provide the agent with massive pre-training data because exploration 00:06:27.740 |
in an open-ended world from scratch is just intractable. 00:06:31.780 |
And the data will serve at least two purposes. 00:06:34.260 |
One as a reference manual on how to do things. 00:06:37.500 |
And second, as a guidance on what are the interesting things worth pursuing. 00:06:42.660 |
And GPT, at least up to GPT-4, only learns from pure text on the web. 00:06:48.780 |
But can we provide the agent with much richer data, such as video walkthroughs or multimedia documents? 00:07:00.580 |
And finally, once we have the environment and the database, we are ready to train a foundation model for the agent. 00:07:08.300 |
And it should be flexible enough to pursue the open-ended tasks without any task-specific 00:07:13.340 |
assumptions, and also scalable enough to compress all of the multi-modal data that I just described. 00:07:20.940 |
And here language, I argue, will play at least two key roles. 00:07:24.860 |
One is as a simple and intuitive interface to communicate a task, to communicate the goal to the agent. 00:07:32.300 |
And second, as a bridge to ground all of the multi-modal concepts and signals. 00:07:38.040 |
And that train of thought landed us in Minecraft, the best-selling video game of all time. 00:07:45.480 |
And for those who are unfamiliar, Minecraft is a procedurally generated 3D voxel world. 00:07:51.640 |
And in the game, you can basically do whatever your heart desires. 00:07:55.460 |
And what's so special about the game is that unlike AlphaGo, StarCraft, or Dota, Minecraft 00:08:01.680 |
defines no particular objective to maximize, no particular opponent to beat, and doesn't even have a fixed storyline. 00:08:09.960 |
And that makes it very well suited as a truly open-ended AI playground. 00:08:13.760 |
And here, we see people doing extremely impressive things in Minecraft, like this is a YouTube 00:08:19.800 |
video where a gamer built the entire Hogwarts castle block by block, by hand, in the game. 00:08:28.200 |
And here's another example of someone just digging a big hole in the ground and then 00:08:32.120 |
making this beautiful underground temple with a river nearby. 00:08:40.120 |
And one more, this is someone building a functioning CPU circuit inside the game, because there 00:08:46.180 |
is something called redstone in Minecraft that you can build circuits out of, like logic gates. 00:09:00.600 |
And here, I want to highlight a number that is 140 million active players. 00:09:06.380 |
And just to put this number in perspective, this is more than twice the population of 00:09:13.240 |
And that is the amount of people playing Minecraft on a daily basis. 00:09:16.840 |
And it just so happens that gamers are generally happier than PhDs, so they love to stream their gameplay. 00:09:26.640 |
And that produces a huge amount of data every day online. 00:09:30.620 |
And there's this treasure trove of learning materials that we can tap into for training 00:09:36.700 |
You know, remember that data is the key for foundation models. 00:09:41.640 |
So we introduce MineDojo, a new open framework to help the community develop generally capable 00:09:48.520 |
agents using Minecraft as a kind of primordial soup. 00:09:55.400 |
MineDojo features three major parts: an open-ended environment, an internet-scale knowledge base, 00:10:01.300 |
and then a generalist agent developed with the simulator and massive data. 00:10:09.060 |
Here's a sample gallery of the interesting things that you can do with MineDojo's API. 00:10:15.220 |
We feature a massive benchmarking suite of more than 3,000 tasks. 00:10:19.380 |
And this is by far the largest open source agent benchmark to our knowledge. 00:10:25.540 |
And we implement a very versatile API that unlocks the full potential of the game. 00:10:29.700 |
For example, MineDojo supports multi-modal observations and the full action space. 00:10:42.260 |
You can tweak the terrain, the weather, block placement, monster spawning, and just about everything else. 00:10:52.640 |
And given the simulator, we introduce around 1,500 programmatic tasks, which are tasks 00:10:58.880 |
that have ground-truth success conditions defined in Python code. 00:11:02.660 |
And you can also explicitly write down sparse or dense reward functions using this API. 00:11:08.060 |
And some examples are like harvesting different resources, unlocking the tech tree, or fighting monsters. 00:11:16.060 |
And all these tasks come with language prompts that are templated. 00:11:20.700 |
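To make the programmatic tasks concrete, here is a minimal sketch of what a Python success condition and sparse reward could look like, assuming a generic gym-style environment loop. The observation fields, helper names, and environment interface are illustrative placeholders, not the actual MineDojo API.

```python
# Illustrative sketch of a programmatic task: "harvest 1 wool".
# The observation fields and env interface below are hypothetical placeholders.

def wool_success(obs) -> bool:
    # Ground-truth success condition checked directly against game state.
    return obs["inventory"].get("wool", 0) >= 1

def wool_sparse_reward(prev_obs, obs) -> float:
    # Sparse reward: +1 the first time the success condition flips to True.
    return float(wool_success(obs) and not wool_success(prev_obs))

def run_episode(env, policy, max_steps=500):
    # Generic gym-style rollout loop that accumulates the task reward.
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, _, done, _ = env.step(action)
        total += wool_sparse_reward(obs, next_obs)
        if wool_success(next_obs) or done:
            break
        obs = next_obs
    return total
```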
Next, we also introduce 1,500 creative tasks that are freeform and open-ended. 00:11:26.580 |
And that is in contrast to the programmatic tasks I just mentioned. 00:11:30.660 |
So for example, let's say we want the agent to build a house. 00:11:39.140 |
And just like image generation, where you don't know if the model generated a cat correctly or not, 00:11:45.240 |
it's very difficult to use simple Python programs to give these kinds of tasks reward signals. 00:11:51.020 |
And the best way is to use foundation models trained on internet-scale knowledge so that 00:11:56.660 |
the model itself understands abstract concepts like the concept of a house. 00:12:04.460 |
And finally, there's one task that holds a very special status, called Playthrough, which 00:12:08.380 |
is to beat the final boss of Minecraft, the Ender Dragon. 00:12:12.580 |
So Minecraft doesn't force you to do this task. 00:12:14.620 |
As we said, it doesn't have a fixed storyline, but it's still considered a really big milestone 00:12:23.100 |
I want to highlight that it is an extremely difficult task that requires very complex preparation and planning. 00:12:31.020 |
And for an average human, it will take many hours or even days to solve. 00:12:40.540 |
And that would be the longest benchmarking task for policy learning ever created here. 00:12:45.660 |
So I admit, I am personally a below average human. 00:12:49.300 |
I was never able to beat the Ender Dragon, and my friends laugh at me for it. 00:12:59.300 |
That was one of the motivations for this project. 00:13:03.380 |
Now let's move on to the second ingredient, the internet-scale knowledge base part of MineDojo. 00:13:09.420 |
We offer three datasets here, the YouTube, Wiki, and Reddit, and combined they are the 00:13:14.580 |
largest open-ended agent behavior database ever compiled to our knowledge. 00:13:20.900 |
The first is YouTube, and we already said Minecraft is one of the most streamed games 00:13:26.500 |
on YouTube, and the gamers love to narrate what they are doing. 00:13:30.740 |
So we collected more than 700,000 videos with 2 billion words in the corresponding transcripts. 00:13:37.820 |
And these transcripts will help the agent learn about human strategies and creativity in the game. 00:13:46.780 |
And second, the Minecraft player base is so crazy that they have compiled a huge Minecraft 00:13:54.860 |
specific Wikipedia that basically explains everything you ever need to know in every detail. 00:14:03.500 |
And we scraped 7,000 Wiki pages with interleaving multi-modal data, like images, tables, and diagrams. 00:14:12.620 |
Like this is a gallery of all of the monsters and their corresponding behaviors, like spawn conditions. 00:14:20.460 |
And also the thousands of crafting recipes are all present on the Wiki, and we scraped those as well. 00:14:26.500 |
And there's more, like complex diagrams, tables, and embedded figures. 00:14:29.740 |
Now that we have something like GPT-4V, it may be able to understand many of these diagrams. 00:14:36.980 |
And finally, the Minecraft sub-Reddit is one of the most active forums across the entire 00:14:42.820 |
Reddit, and players showcase their creations and also ask questions for help. 00:14:47.860 |
So we scraped more than 300,000 posts from Minecraft Reddit, and here are some examples 00:14:53.520 |
of how people use the Reddit as a kind of stack overflow for Minecraft. 00:14:59.540 |
And we can see that some of the top voted answers are actually quite good. 00:15:03.020 |
Like someone is asking, "Oh, why doesn't my wheat farm grow?" 00:15:06.540 |
And the answer says, "You need to light up the room with more torches, you don't have enough light." 00:15:10.540 |
Now, given the massive task suite and internet data, we have the essential components to train a generalist agent. 00:15:21.580 |
So in the first MineDojo paper, we introduced a foundation model called MineCLIP. 00:15:26.220 |
And the idea is very simple, I can explain in three slides. 00:15:30.620 |
Basically for our YouTube database, we have time-aligned videos and transcripts. 00:15:35.900 |
And these are actually the real tutorial videos from our data set. 00:15:39.900 |
You see in the third clip, "as I raise my axe in front of this pig, there's only one thing 00:15:45.900 |
that you know is going to happen." That's actually a quote from a big Minecraft YouTuber. 00:15:53.180 |
And then, given this data, we train MineCLIP in the same spirit as OpenAI's CLIP. 00:15:59.040 |
So for those who are unfamiliar, OpenAI's CLIP is a contrastive model that learns the association between images and their captions. 00:16:06.940 |
And here, it's a very similar idea, but this time it is a video text contrastive model. 00:16:12.700 |
And we associate the text with a video snippet that runs about eight to 16 seconds each. 00:16:22.780 |
And intuitively, MineCLIP learns the association between the video and the transcript that narrates it. 00:16:30.980 |
And MineCLIP outputs a score between 0 and 1, where 1 means a perfect correlation between 00:16:35.940 |
the text and the video, and 0 means the text is irrelevant to the activity. 00:16:41.420 |
So you see, this is effectively a language-prompted foundation reward model that knows the nuances 00:16:48.500 |
of things like forests, animal behaviors, and architectures in Minecraft. 00:16:57.580 |
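As a rough illustration of what training such a model involves, here is a sketch of a CLIP-style symmetric contrastive (InfoNCE) loss between video-snippet embeddings and transcript embeddings. The video and text encoders are assumed to exist elsewhere; this is not the actual MineCLIP code.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of aligned (video, text) pairs.

    video_emb, text_emb: [batch, dim] embeddings from hypothetical video and
    text encoders; matching pairs share the same row index.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature    # [batch, batch] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)         # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)     # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```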
Here's an example of our agent interacting with the simulator. 00:17:01.500 |
And here, the task is "shear sheep to obtain wool." 00:17:05.620 |
And as the agent explores in the simulator, it generates a video snippet as a moving window, 00:17:12.620 |
which can be encoded and fed into MineCLIP, along with an encoding of the text prompt. 00:17:22.260 |
The higher the association is, the more the agent's behavior in this video aligns with 00:17:27.540 |
the language, which is the task you want it to do. 00:17:30.980 |
And that becomes a reward function to any reinforcement learning algorithm. 00:17:38.480 |
It's essentially RL from human feedback, or RLHF, in Minecraft. 00:17:46.300 |
And RLHF was the cornerstone algorithm that made ChatGPT possible, and I believe it will 00:17:51.500 |
play a critical role in generalist agents as well. 00:17:55.940 |
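Here is a minimal sketch of how such a score can be wired up as an RL reward: keep a sliding window of recent frames, embed the window and the task prompt, and use their similarity as the per-step reward. The `encode_video` and `encode_text` callables are assumed stand-ins for a MineCLIP-style reward model, not the released implementation.

```python
from collections import deque
import torch.nn.functional as F

class VideoTextReward:
    """Sliding-window video-text similarity used as a dense RL reward (sketch).

    encode_video / encode_text are assumed callables that return embedding
    tensors; they stand in for a MineCLIP-style reward model.
    """
    def __init__(self, encode_video, encode_text, prompt, window=16):
        self.encode_video = encode_video
        self.frames = deque(maxlen=window)                 # last ~16 frames
        self.text_emb = F.normalize(encode_text(prompt), dim=-1)

    def __call__(self, frame) -> float:
        self.frames.append(frame)
        video_emb = F.normalize(self.encode_video(list(self.frames)), dim=-1)
        score = (video_emb * self.text_emb).sum().item()   # cosine similarity
        return max(score, 0.0)                             # keep the reward non-negative

# Usage inside any RL loop (sketch):
# reward_fn = VideoTextReward(encode_video, encode_text, "shear a sheep to obtain wool")
# reward = reward_fn(latest_frame)
```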
I'll quickly gloss over some quantitative results. 00:17:58.700 |
I promise there won't be, like, many tables of numbers here. 00:18:02.820 |
For these eight tasks, we show the percentage success rate over 200 test episodes. 00:18:08.180 |
And here, in the green circle, are two variants of our MineCLIP method. 00:18:15.920 |
So I'll highlight one baseline, which is that we construct a dense reward function manually 00:18:21.700 |
for each task using the MineDojo API, which is a Python API. 00:18:26.340 |
And you can consider this column as a kind of oracle, the upper bound of the performance, 00:18:31.180 |
because we put a lot of human effort into designing these reward functions just for this benchmark. 00:18:37.400 |
And we can see that MineCLIP is able to match the quality of many of these, not all of them, 00:18:42.940 |
but many of these manually generated rewards. 00:18:45.980 |
It is important to highlight that MineCLIP is open vocabulary. 00:18:49.660 |
So we use a single model for all of these tasks instead of one model for each. 00:18:53.660 |
And we simply prompt the reward model with different tasks. 00:19:03.420 |
One major feature of the foundation model is strong generalization out of the box. 00:19:07.220 |
So can our agent generalize to dramatic changes in the visual appearance? 00:19:12.620 |
So we did this experiment where during training, we only train our agents on a default terrain 00:19:21.820 |
But we tested zero shot in a diverse range of terrains, weathers, and day/night cycles. 00:19:27.000 |
And you can customize everything in MineDojo. 00:19:29.500 |
And in our paper, we have numbers showing that MineCLIP significantly beats an off-the-shelf 00:19:34.060 |
visual encoder when facing these kinds of distribution shifts out of the box. 00:19:38.980 |
And this is no surprise, right, because Mineclip was trained on hundreds of thousands of clips 00:19:43.900 |
from Minecraft videos on YouTube, which have a very good coverage of all the scenarios. 00:19:51.660 |
And I think that is just a testament to the big advantage of using internet-scale data. 00:20:01.780 |
And here are some demos of our learned agent behaviors on various tasks. 00:20:06.140 |
So you may notice that these tasks are relatively short, around like 100 to 500 time steps. 00:20:12.440 |
And that is because MineCLIP is not able to plan over very long time horizons. 00:20:18.700 |
It is an inherent limitation in the training pipeline, because we could only use 8- to 16-second video clips during training. 00:20:28.700 |
But our hope is to build an agent that can explore and make new discoveries autonomously over long horizons. 00:20:36.380 |
And in 2022, this goal seemed quite out of reach for us. 00:20:44.820 |
Then came GPT-4, a language model that is so good at coding and long-horizon planning. 00:20:54.220 |
We built Voyager, the first large-language-model-powered lifelong learning agent. 00:21:00.540 |
And when we set Voyager loose in Minecraft, we see that it just keeps going. 00:21:04.980 |
And by the way, all these video snippets are from a single episode of Voyager. 00:21:09.980 |
It's not from different episodes, it's a single one. 00:21:13.860 |
And we see that Voyager is just able to keep exploring the terrains, mine all kinds of 00:21:18.820 |
materials, fight monsters, craft hundreds of recipes, and unlock an ever-expanding tree of skills. 00:21:30.180 |
If we want to use the full power of GPT-4, a central question is how to stringify things, 00:21:35.700 |
converting this 3D world into a textual representation. 00:21:42.700 |
And thankfully, again, the crazy Minecraft community already built one for us. 00:21:49.740 |
It's called Mineflayer, a high-level JavaScript API that's actively maintained to work with 00:21:57.300 |
And the beauty of Mineflayer is it has access to the game states surrounding the agent, 00:22:02.660 |
like the nearby blocks, animals, and enemies. 00:22:05.940 |
So we effectively have a ground-truth perception module as textual input. 00:22:10.340 |
At the same time, Mineflayer also supports action APIs from which we can compose skills. 00:22:18.500 |
And now that we can convert everything to text, we are ready to construct an agent on top of the language model. 00:22:24.580 |
So on a high level, there are three components. 00:22:26.780 |
One is a coding module that writes JavaScript code to control the game bot, and it's the 00:22:33.180 |
main module that generates the executable actions. 00:22:36.340 |
And second, we have a code base to store the correctly written code and look it up in the 00:22:41.740 |
future if the agent needs to recall the skill. 00:22:46.980 |
And whenever facing similar situations in the future, the agent knows what to do. 00:22:51.240 |
And third, we have a curriculum that proposes what to do next, given the agent's current state and progress. 00:23:00.320 |
And when you wire these components up together, you get a loop that drives the agent indefinitely 00:23:06.800 |
and achieves something like lifelong learning. 00:23:13.520 |
We prompt GPT-4 with documentation and examples on how to use a subset of the Mineflayer API. 00:23:20.400 |
And GPT-4 writes code to take actions given the current assigned task. 00:23:25.080 |
And because JavaScript runs in a code interpreter, GPT-4 is able to define functions on the fly. 00:23:32.400 |
But the code that GPT-4 writes isn't always correct; just like human engineers, you can't expect it to be bug-free on the first try. 00:23:38.440 |
So we develop an iterative prompting mechanism to refine the program, using three kinds of feedback. 00:23:45.760 |
First, the environment feedback, like, you know, what are the new materials you've got after 00:23:49.400 |
taking an action, or, you know, whether there are enemies nearby. 00:23:53.240 |
Second, the execution error from the JavaScript interpreter, if it wrote some buggy code, 00:23:57.680 |
like an undefined variable, for example, if it hallucinates something. 00:24:02.120 |
And third, another GPT-4 that provides critique through self-reflection on the agent state and the task. 00:24:10.020 |
And that also helps refine the program effectively. 00:24:13.680 |
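Putting the three feedback signals together, a rough sketch of that refinement loop could look like the following; the `llm`, `execute`, and `critic` callables are hypothetical placeholders, not the actual Voyager implementation.

```python
def refine_program(llm, execute, critic, task, max_rounds=4):
    """Iteratively ask the LLM for code, run it, and feed back all three
    signals: environment feedback, interpreter errors, and a critic verdict.
    llm, execute, and critic are assumed callables (placeholders)."""
    feedback = ""
    for _ in range(max_rounds):
        code = llm(f"Task: {task}\nPrevious feedback:\n{feedback}\nWrite JavaScript code.")
        env_feedback, error = execute(code)           # run the code in the game bot
        success, critique = critic(task, env_feedback)
        if success and not error:
            return code                               # ready to store in the skill library
        feedback = f"env: {env_feedback}\nerror: {error}\ncritique: {critique}"
    return None                                       # give up after max_rounds
```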
So I want to show some quick examples of how the critic provides feedback on task completion. 00:24:20.000 |
So let's say in the first example, the task is to craft a spyglass. 00:24:24.200 |
And GPT-4 looks at the agent's inventory, and decides that it has enough copper, but not enough amethyst. 00:24:32.720 |
And the second task is to kill three sheep to collect food. 00:24:36.020 |
And each sheep drops one unit of wool, but there are only two units in inventory. 00:24:40.560 |
So GPT-4 reasons and says, okay, you have one more sheep to go, and so on. 00:24:47.180 |
Now moving on to the second part, once Voyager implements a skill correctly, we save it to the skill library. 00:24:55.720 |
And you can think of the skill library as a code repository written entirely by a language model. 00:25:04.740 |
And the agent can record new skills, and also retrieve skills from the library when facing similar situations. 00:25:11.960 |
So it doesn't have to go through this whole program refinement process that we just saw, which 00:25:15.720 |
is quite inefficient; you do it once and save the result to disk. 00:25:20.360 |
And in this way, Voyager kind of bootstraps its own capabilities recursively as it explores the world. 00:25:29.880 |
And let's dive a little bit deeper into how the skill library is implemented. 00:25:36.480 |
First we use GPT-3.5 to summarize the program into plain English. 00:25:41.000 |
Summarization is a very easy task, and GPT-4 is expensive, so the cheaper model is enough here. 00:25:47.600 |
And then we embed this summary as the key, and we save the program, which is a bunch of JavaScript code, as the value. 00:25:55.000 |
And we find that doing this makes retrieval better, because the summary is more semantic, 00:26:00.080 |
while the raw code is a bit more discrete, so it makes a worse embedding key. 00:26:06.940 |
And now for the retrieval process, when Voyager is faced with a new task, let's say craft 00:26:11.800 |
iron pickaxe, we again use GPT 3.5 to generate a hint on how to solve the task. 00:26:18.120 |
And that is something like a natural language paragraph. 00:26:20.780 |
And then we embed that and use it as the query into the vector database, and we retrieve the top matching skills. 00:26:30.920 |
So you can think of it as a kind of in-context replay buffer in the reinforcement learning sense. 00:26:37.920 |
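A minimal sketch of that key-value skill library, assuming an `embed` callable (an embedding model), a `summarize` callable (a cheap LLM), and a plain list standing in for the vector database:

```python
import numpy as np

class SkillLibrary:
    """Skill library sketch: key = embedding of a plain-English summary,
    value = the verified program. embed and summarize are assumed callables."""
    def __init__(self, embed, summarize):
        self.embed, self.summarize = embed, summarize
        self.keys, self.programs = [], []

    def add(self, program: str):
        summary = self.summarize(program)              # plain-English description
        self.keys.append(self.embed(summary))
        self.programs.append(program)

    def retrieve(self, hint: str, top_k=3):
        # hint: a natural-language suggestion for the new task (e.g. from GPT-3.5)
        query = self.embed(hint)
        sims = [float(np.dot(query, k) /
                      (np.linalg.norm(query) * np.linalg.norm(k)))
                for k in self.keys]
        best = np.argsort(sims)[::-1][:top_k]
        return [self.programs[i] for i in best]        # most relevant stored skills
```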
And now moving on to the third part, we have another GPT-4 that proposes what task to do next. 00:26:47.960 |
And here we give GPT 4 a very high-level unsupervised objective, that is to obtain as many unique 00:26:54.480 |
items as possible, that is our high-level directive. 00:26:58.080 |
And then GPT 4 takes this directive and implements a curriculum of progressively harder challenges 00:27:07.200 |
So it's kind of like curiosity-driven exploration, which is not a novel idea in the prior 00:27:13.400 |
literature, but here it is implemented purely in context. 00:27:16.360 |
If you're listening on Zoom, the next example is fun. 00:27:22.240 |
Let's go through this example together, just to show you how Voyager works, the whole complicated pipeline. 00:27:30.800 |
So the agent finds itself hungry, and only has one out of 20 hunger bar. 00:27:35.720 |
So GPT 4 knows that it needs to find food ASAP. 00:27:39.480 |
And then it senses there are four entities nearby: a cat, a villager, a pig, and some seeds. 00:27:46.600 |
And now GPT-4 starts a self-reflection, like, do I kill the cat and villager to get some food? 00:27:55.600 |
I can use the seeds to grow a farm, but that's going to take a very long time until I can eat. 00:28:01.560 |
So sorry, piggy, you are the one being chosen. 00:28:05.360 |
So GPT 4 looks at the inventory, which is the agent state. 00:28:12.120 |
So Voyager recalls a skill from the library, that is to craft an iron sword, and then use 00:28:18.360 |
that skill to start learning a new skill, and that is hunt pig. 00:28:23.880 |
And once the hunt pig routine is successful, GPT 4 saves it to the skill library. 00:28:33.640 |
And putting all these together, we have this iterative prompting mechanism, the skill library, 00:28:40.680 |
And all of these combined is Voyager's no-gradient architecture, where we don't train any new 00:28:46.680 |
models or fine-tune any parameters, and this allows Voyager to self-bootstrap on top of GPT-4, 00:28:54.360 |
even though we are treating the underlying language model as a black box. 00:28:59.200 |
It looks like my example worked, and they started to listen. 00:29:09.000 |
So yeah, these are the tasks that Voyager picked up along the way, and we didn't pre-program any of them. 00:29:16.680 |
The agent is kind of forever curious, and also forever pursuing new adventures just on its own. 00:29:24.160 |
So to quickly show some quantitative results, here we have a learning curve, where the x-axis 00:29:31.440 |
is the number of prompting iterations, and the y-axis is the number of unique items that 00:29:36.360 |
Voyager discovered as it's exploring an environment. 00:29:40.760 |
And these two curves are baselines, ReAct and Reflexion. 00:29:47.840 |
And this is AutoGPT, which is like a popular software repo. 00:29:50.720 |
Basically, you can think of it as combining ReAct and a task planner that decomposes a high-level objective into subgoals. 00:30:00.280 |
We're able to obtain three times more novel items than the prior methods, and also unlock the tech tree much faster. 00:30:09.320 |
And if you take away the skill library, you see that Voyager really suffers. 00:30:13.640 |
The performance takes a hit, because every time it needs to kind of repeat and relearn 00:30:18.480 |
every skill from scratch, and it starts to make a lot more mistakes, and that really hurts. 00:30:25.920 |
Here, these two are the bird's-eye views of the Minecraft map, and these circles are what 00:30:33.160 |
the prior methods are able to explore, given the same prompting iteration budget. 00:30:39.360 |
And we see that they tend to get stuck in local areas and kind of fail to explore more. 00:30:45.120 |
But Voyager is able to navigate distances at least two times as much as the prior works. 00:30:52.760 |
So it's able to visit a lot more places, because to satisfy this high-level directive of obtaining 00:30:59.000 |
as many unique items as possible, you've got to travel, right? 00:31:02.480 |
If you stay at one place, you will quickly exhaust interesting things to do. 00:31:06.760 |
And Voyager travels a lot, so that's how we came up with the name. 00:31:11.960 |
So finally, one limitation is that Voyager does not currently support visual perception, 00:31:18.080 |
because the GPT-4 that we used back then was text-only. 00:31:21.720 |
But there's nothing stopping Voyager from adopting multi-modal language models in the future. 00:31:27.360 |
So here we have a little proof-of-concept demo, where we ask a human to basically function as a visual critic. 00:31:34.040 |
And the human will tell Voyager, as you're building these houses, what are the things that went wrong. 00:31:39.960 |
Like, you placed a door incorrectly, like, the roof is also not done correctly. 00:31:45.200 |
So the human is acting as a critic module of the Voyager stack. 00:31:49.560 |
And we see that with some of that help, Voyager is able to build a farmhouse and a nether portal. 00:31:56.320 |
So it doesn't have a hard time understanding, you know, 3D spatial coordinates just by itself. 00:32:03.680 |
Now, after doing Voyager, we're considering, like, where else can we apply this idea, right, 00:32:11.440 |
of coding in an embodied environment, observing the feedback, and iteratively refining the program. 00:32:18.920 |
So we came to realize that physics simulations themselves are also just Python code. 00:32:24.740 |
So why not apply some of the principles from Voyager and do something in another domain? 00:32:31.560 |
What if you apply Voyager in the space of this physics simulator API? 00:32:35.420 |
And this is Eureka, which my team announced just, like, three days ago, fresh out of the oven. 00:32:41.840 |
It is an open-ended agent that designs reward functions for robot dexterity at superhuman 00:32:49.700 |
And it turns out that GPT-4 plus reinforcement learning can spin a pen much better than I can. 00:32:56.280 |
I gave up on this task a long time ago from childhood. 00:33:03.640 |
So Eureka's idea is very simple and intuitive. 00:33:07.040 |
GPT-4 generates a bunch of possible reward function candidates implemented in Python. 00:33:13.040 |
And then you just do a full reinforcement learning training loop for each candidate 00:33:18.640 |
in a GPU-accelerated simulator, and you get a performance metric, and you take the best 00:33:24.600 |
candidates and feed them back to GPT-4, and it samples the next proposals of candidates and keeps 00:33:31.020 |
improving the whole population of the reward functions. 00:33:35.600 |
It's kind of like an in-context evolutionary search. 00:33:40.360 |
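A sketch of that outer loop, under stated assumptions: `llm_propose_rewards`, `train_rl`, and the feedback formatting are hypothetical placeholders rather than the released Eureka code.

```python
def eureka_style_search(llm_propose_rewards, train_rl, env_source, task,
                        generations=5, population=8):
    """In-context evolutionary search over reward functions (sketch).
    Each candidate is a Python reward function proposed by the LLM;
    train_rl runs a full RL training job and returns a fitness score."""
    best_code, best_score, feedback = None, float("-inf"), ""
    for _ in range(generations):
        candidates = llm_propose_rewards(env_source, task, feedback, n=population)
        scored = [(train_rl(code), code) for code in candidates]   # slow inner loop
        top_score, top_code = max(scored, key=lambda x: x[0])
        if top_score > best_score:
            best_score, best_code = top_score, top_code
        # Feed the best candidate and its metrics back so the LLM can mutate it.
        feedback = f"best reward so far (score={top_score}):\n{top_code}"
    return best_code, best_score
```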
So here's the initial reward generation, where Eureka takes as context the environment code 00:33:46.100 |
of NVIDIA's Isaac Gym and a task description, and samples the initial reward function implementation. 00:33:54.540 |
So we found that the simulator code itself is actually a very good reference manual, 00:33:58.840 |
because it tells Eureka what are the variables you can use, like the hand positions, like 00:34:04.140 |
here, the fingertip position, the fingertip state, the rotation, angular velocity, et cetera. 00:34:09.780 |
So you know all of these variables from the simulator code, and you know how they interact with each other. 00:34:16.520 |
So that serves as a very good in-context instruction. 00:34:21.640 |
So Eureka doesn't need to reference any human-written reward functions. 00:34:26.760 |
And then once you have the generated reward, you plug it into any reinforcement learning algorithm. 00:34:33.860 |
So this step is typically very costly and very slow, because reinforcement learning is very sample-inefficient. 00:34:40.220 |
And we were only able to scale up Eureka because of NVIDIA's Isaac Gym, which runs 1,000 simulated environments in parallel. 00:34:50.360 |
So basically, you can think of it as speeding up reality by 1,000x. 00:34:57.360 |
And then after training, you will get the performance metrics back on each reward component. 00:35:02.640 |
And as we saw from Voyager, GPT-4 is very good at self-reflection. 00:35:10.080 |
And there's a software trial reminding you to activate a license. 00:35:17.920 |
Yeah, so Eureka reflects on it and then proposes mutations on the code. 00:35:26.800 |
So here, the mutations, we found, can be very diverse, ranging from something as simple 00:35:31.400 |
as just changing a hyperparameter in the reward function weighting to all the way to adding 00:35:36.600 |
completely novel components to the reward function. 00:35:41.280 |
And in our experiments, Eureka turns out to be a superhuman reward engineer, actually 00:35:47.720 |
outperforming some of the reward functions implemented by the expert human engineers on NVIDIA's benchmark tasks. 00:35:57.100 |
So here are some more demos of how Eureka is able to write very complex rewards that would be hard to design by hand. 00:36:05.940 |
And we can actually train the robot hand to rotate pens, not just in one direction, but 00:36:11.320 |
in different directions along different 3D axes. 00:36:15.480 |
I think one major contribution of Eureka, different from Voyager, is to bridge the gap 00:36:20.640 |
between high-level reasoning and low-level motor controls. 00:36:25.060 |
So Eureka introduces a new paradigm that I'm calling hybrid gradient architecture. 00:36:30.500 |
So recall Voyager is a no-gradient architecture. 00:36:32.800 |
We don't touch anything, and we don't train anything. 00:36:35.960 |
But Eureka is a hybrid gradient, where a black box inference-only language model instructs 00:36:47.760 |
The outer loop is gradient-free, and it's driven by GPT-4, kind of selecting the reward function candidates. 00:36:56.400 |
And the inner loop is gradient-based: you train a full reinforcement learning run on that reward to achieve extreme dexterity, 00:37:02.580 |
by training a specialized neural network controller. 00:37:06.640 |
And you must have both loops to succeed, to deliver this kind of dexterity. 00:37:11.820 |
And I think it will be a very useful paradigm for training robot agents in the future. 00:37:19.100 |
So you know, these days when I go on Twitter or X, I see AI conquering new lands like every day. 00:37:27.720 |
You know, chat, image generation, and music, they're all very well within reach. 00:37:33.260 |
But MineDojo, Voyager, and Eureka, these are just scratching the surface of open-ended agents. 00:37:40.720 |
And looking forward, I want to share two key research directions that I personally find 00:37:45.760 |
extremely promising, and I'm also working on it myself. 00:37:50.300 |
The first is a continuation of MineCLIP, basically how to develop methods that learn from internet-scale video data. 00:37:57.680 |
And the second is multimodal foundation models. 00:38:00.720 |
Now, GPT-4V is coming, but it is just the beginning of an era. 00:38:05.680 |
And I think it's important to have all of the modalities in a single foundation model. 00:38:17.400 |
There is so much data on YouTube, way too much for our limited GPUs to process. 00:38:24.600 |
They're extremely useful to train models that not only have dynamic perception and intuitive 00:38:30.400 |
physics, but also capture the complexity of human creativity and human behaviors. 00:38:36.520 |
It's all good, except that when you are using video to pre-train embodied agents, there is a catch. 00:38:44.600 |
You don't get action labels, and you don't get any of the grounding, because you are only passively watching. 00:38:51.160 |
So I think here's a demonstration of why learning from video is hard, even for natural intelligence. 00:38:57.280 |
So a little cat is watching boxers shaking their heads, and it thinks maybe shaking your head is how you fight. 00:39:20.280 |
You have no idea why Tyson is doing this, right? 00:39:22.720 |
Like, the cat has no idea, and then it associates this with just the wrong kind of policy. 00:39:30.880 |
But for sure, it doesn't help the fighting, but it definitely boosts the cat's confidence. 00:39:41.280 |
Now, I want to point out a few of the latest research directions in how to leverage so much video data. 00:39:51.480 |
The first is the simplest, just learn kind of a visual feature extractor from the videos. 00:39:57.080 |
So this is R3M from Chelsea Finn's group at Stanford. 00:40:01.360 |
And this model is still an image-level representation, just that it uses a video-level loss function 00:40:06.200 |
to train, more specifically, time-contrastive learning. 00:40:10.280 |
And after that, you can use this as an image backbone for any agent, but you still need 00:40:15.560 |
to kind of fine-tune using domain-specific data for the agent. 00:40:21.040 |
The second path is to learn reward functions from video, and MineCLIP is one model under this category. 00:40:28.920 |
It uses a contrastive objective between the transcript and video. 00:40:33.160 |
And here, this work, VIP, is another way to learn a similarity-based reward for goal-conditioned tasks. 00:40:41.920 |
So VIP is led by the researcher who is also the first author of Eureka; Eureka is his follow-up work. 00:40:52.680 |
And the third path is, can we directly do imitation learning from video, but better than the cat that we just saw? 00:41:00.040 |
So we just said, you know, the videos don't have the actions, right? 00:41:04.640 |
We need to find some ways to pseudo-label the actions. 00:41:07.860 |
And this is VPT, video pre-training, from OpenAI last year, to solve long-range tasks in Minecraft. 00:41:18.400 |
Basically, you use a keyboard and mouse action space, so you can align this action space with how humans actually play. 00:41:26.320 |
And OpenAI hires a bunch of Minecraft players and actually collects data in-house. 00:41:31.920 |
So they record the episodes done by those gamers. 00:41:35.240 |
And now you have a dataset of video and action pairs, right? 00:41:40.160 |
And you train something called an inverse dynamics model, which is to take the observation 00:41:45.440 |
and then predict the actions that caused the observation to change. 00:41:51.240 |
And that becomes a labeler that you can apply to in-the-wild YouTube videos that don't have action labels. 00:41:59.400 |
So you apply IDM to like 70K hours of in-the-wild YouTube videos, and you will get these pseudo-labeled 00:42:05.120 |
pseudo-actions that are not always correct, but also way better than random. 00:42:10.320 |
And then you run imitation learning on top of this augmented dataset. 00:42:14.520 |
And in this way, OpenAI is able to greatly expand the data because the original data 00:42:21.160 |
collected from the humans are high quality, but they're extremely expensive, while in-the-wild 00:42:25.920 |
YouTube videos are very cheap, but you don't have the actions. 00:42:29.020 |
So they kind of solved and got the best of both worlds. 00:42:32.920 |
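In pseudocode, that pipeline looks roughly like this; `train_idm`, `behavior_clone`, and the dataset shapes are assumed placeholders rather than OpenAI's released VPT code.

```python
def vpt_style_pipeline(labeled_clips, unlabeled_clips, train_idm, behavior_clone):
    """VPT-style pipeline sketch.
    labeled_clips:   small, expensive set of (frames, keyboard/mouse actions)
    unlabeled_clips: huge, cheap set of in-the-wild video frames only
    train_idm / behavior_clone are assumed training routines (placeholders)."""
    # 1. Train an inverse dynamics model on the small contractor dataset:
    #    given frames around time t, predict the action taken at t.
    idm = train_idm(labeled_clips)

    # 2. Pseudo-label the in-the-wild videos with the IDM.
    pseudo_labeled = [(frames, idm(frames)) for frames in unlabeled_clips]

    # 3. Behavior-clone a policy on the combined, mostly pseudo-labeled data.
    policy = behavior_clone(labeled_clips + pseudo_labeled)
    return policy
```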
But still, you know, it's really expensive to hire these humans. 00:42:40.480 |
I'm a firm believer that multimodal models will be the future. 00:42:44.560 |
And I see text as a very lousy kind of 1D projection of our physical world. 00:42:49.320 |
So it's essential to include the other sensory modalities to provide a full embodied experience. 00:42:55.480 |
And in the context of embodied agents, I think the input will be a mixture of text, images, 00:43:01.040 |
videos, and even audio in the future, and the output will be actions. 00:43:06.780 |
So here's a very early example of a multimodal language model for robot learning. 00:43:15.400 |
We can ask the robot to bring us a cup of tea from the kitchen. 00:43:19.360 |
But if we want to be more specific, like I want this particular cup, we can show an image of it. 00:43:26.620 |
And we also provide a video demo of how we want to mop the floor, and ask the robot to follow the same motion. 00:43:36.480 |
And when the robot sees an unfamiliar object, like a sweeper, we can explain it by providing 00:43:40.480 |
an image and showing, this is a sweeper, now, you know, go ahead and do something with it. 00:43:46.400 |
And finally, to ensure safety, we can say, take a picture of that room and just do not go in there. 00:43:51.720 |
To achieve this, back last year, we proposed a model called VIMA, which stands for VisuoMotor Attention agent. 00:43:59.360 |
And in this work, we introduced a concept called multimodal prompting, where the prompt interleaves text and images. 00:44:08.400 |
And this provides a very expressive API that just unifies a bunch of different robot tasks 00:44:14.060 |
that otherwise would require very different pipelines or specialized models to solve in the past. 00:44:20.960 |
And VIMA simply tokenizes everything, converting image and text into sequences of tokens and 00:44:28.340 |
training a transformer on top to output the robot arm actions autoregressively, one step at a time. 00:44:37.600 |
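Conceptually, that recipe could be sketched as follows; the tokenizers, transformer, and action head are assumed placeholders, not the actual VIMA architecture.

```python
def multimodal_prompt_policy(prompt_items, obs_history, tokenize_text,
                             tokenize_image, transformer, action_head):
    """Multimodal-prompt policy sketch: interleave text and image tokens from
    the prompt, append observation tokens, and decode the next arm action.
    All callables here are assumed placeholders."""
    tokens = []
    for item in prompt_items:                      # e.g. ["rearrange to match", goal_image]
        if isinstance(item, str):
            tokens.extend(tokenize_text(item))     # text -> token embeddings
        else:
            tokens.extend(tokenize_image(item))    # image -> patch tokens
    for obs in obs_history:
        tokens.extend(tokenize_image(obs))         # current scene observations
    hidden = transformer(tokens)                   # causal transformer forward pass
    return action_head(hidden[-1])                 # next action, one step at a time
```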
So just to look at some of the examples here, like this prompt: rearrange objects to match this scene. 00:44:43.840 |
It is a classical task called visual goal reaching that has a big body of prior works 00:44:47.480 |
on it, and that's how our robot does it, given this prompt. 00:44:54.440 |
And we can also give it novel concepts in context, like this is a blicker, this is a 00:45:03.760 |
So it's not in the training data, but VIMA is able to generalize zero-shot and follow the instruction. 00:45:11.560 |
So the bot understands what we want and then follows this trajectory. 00:45:15.600 |
And finally, we can give it more complex prompt, like these are the safety constraints, sweep 00:45:20.960 |
the box into this, but without exceeding that line. 00:45:24.000 |
And we do this using the interleaving image and text tokens. 00:45:30.720 |
And recently, Google Brain Robotics followed up after VIMA with RT-1 and RT-2, Robotics Transformer 1 and 2. 00:45:39.200 |
And RT2 is using a similar recipe as I described, where they first kind of pre-train on internet 00:45:45.520 |
scale data and then fine tune with some human collected demonstrations on the Google robots. 00:45:51.760 |
And RoboCat from DeepMind is another interesting work. 00:45:54.880 |
They train a single unified policy that works not just on a single robot, but actually across 00:46:02.120 |
different embodiments, different robot forms, and even generalizes to new hardware. 00:46:07.540 |
So I think this is like a higher form of multimodal agent with a physical form factor. 00:46:12.560 |
The morphology of the agent itself is another modality. 00:46:17.620 |
So that concludes our looking forward section. 00:46:22.220 |
And lastly, I want to kind of put all the links together of the works I described. 00:46:29.140 |
We have open-sourced everything; well, for all the projects where we can, 00:46:34.880 |
we open source as much as possible, including the model code, checkpoints, simulator, and datasets. 00:46:55.140 |
If you just want an excuse to play Minecraft at work, then MineDojo is perfect for you, 00:47:00.080 |
because you are collecting human demonstrations to train generalist agents. 00:47:04.220 |
And if there's one thing that you take away from this talk, it should be this slide. 00:47:10.180 |
And lastly, I just want to remind all of us, despite all the progress I've shown, what 00:47:14.780 |
we can do is still very far from human ingenuity as embodied agents. 00:47:21.380 |
These are the videos from our dataset of people doing things like decorating a winter wonderland 00:47:27.100 |
or building a functioning CPU circuit within Minecraft. 00:47:35.780 |
If humans can do these mind-blowing tasks, then why not our AI, right?