
Stanford CS25: V3 I Generalist Agents in Open-Ended Worlds


Transcript

So today we're honored to have Jim Fan from NVIDIA, who will be talking about generalist agents in open-ended worlds. He's a senior AI research scientist at NVIDIA, where his mission is to build generally capable AI agents with applications to gaming, robotics, and software automation. His research spans foundation models, multi-modal AI, reinforcement learning, and open-ended learning.

Jim obtained his PhD degree in computer science from here, Stanford, advised by Professor Fei-Fei Li. And previously, he did research internships at OpenAI, Google AI, as well as Mila Quebec AI Institute. So yeah, give it up for Jim. Yeah, thanks for having me. So I want to start with a story of two kittens.

It's a story that gave me a lot of inspiration over my career, so I want to share this one first. Back in 1963, there were two scientists from MIT, Held and Hein. They did this ingenious experiment where they put two newborn kittens in this device, and the kittens have not seen the visual world yet.

So it's kind of like a merry-go-round, where the two kittens are linked by a rigid mechanical bar, so their movements are exactly mirrored. And there's an active kitten on the right-hand side, and that's the only one able to move freely and then transmit the motion over this link to the passive kitten, which is confined to the basket and cannot really control its own movements.

And then after a couple of days, Held and Hein took the kittens out of this merry-go-round and did visual testing on them. And they found that only the active kitten was able to develop a healthy visuomotor loop, like responding correctly to approaching objects or visual cliffs, but the passive kitten did not develop a healthy visual system.

So I find this experiment fascinating because it shows the importance of having this embodied, active experience to really ground a system of intelligence. And let's put this experiment in today's AI context, right? We actually have a very powerful passive kitten, and that is ChatGPT. It passively observes and rehearses the text on the internet, and it doesn't have any embodiment.

And because of this, its knowledge is kind of abstract and ungrounded. And that partially contributes to the fact that ChatGPT hallucinates things that are just incompatible with our common sense and our physical experience. And I believe the future belongs to active kittens, which translates to generalist agents. They are the decision-makers in a constant feedback loop, and they're embodied in this fully immersive world.

They're also not mutually exclusive with the passive kitten. And in fact, I see the active embodiment part as a layer on top of the passive pre-training from lots and lots of internet data. So are we there yet? Have we achieved generalist agents? You know, back in 2016, I remember it was like spring of 2016.

I was sitting in an undergraduate class at Columbia University, but I wasn't paying attention to the lecture. I was watching a board game tournament on my laptop. And this screenshot was the moment of AlphaGo versus Lee Sedol, where AlphaGo won the series and became the first AI ever to beat a human world champion at the game of Go.

You know, I remember the adrenaline that day, right? I've seen history unfold. Oh my God, we're finally getting to AGI, and everyone was so excited. And I think that was the moment when AI agents entered the mainstream. And you know, when the excitement faded, I felt that even though AlphaGo was so mighty and so great, it could only do one thing and one thing alone, right?

And afterwards, you know, in 2019, there were more impressive achievements like OpenAI Five beating the human champions at a game of Dota, and AlphaStar from DeepMind mastering StarCraft. But all of these, along with AlphaGo, share a single kind of theme, and that is to beat the opponent.

There is this one objective that the agent needs to do. And the models trained on Dota or Go cannot generalize to any other tasks. They cannot even play other games like Super Mario or Minecraft. And the world is fixed and has very little room for open-ended creativity and exploration.

So I argue that a generalist agent should have the following essential properties. First, it should be able to pursue very complex, semantically rich, open-world objectives. Basically you explain what you want in natural language, and the agent should perform the actions for you in a dynamic world. And second, the agent should have a large amount of pre-trained knowledge instead of knowing only a few concepts that are extremely specific to the task.

And third, massively multitask. A generalist agent, as the name implies, needs to do more than just a couple of things. It should be, in the best case, infinitely multitask, as expressive as human language can dictate. So what does it take? Correspondingly, we need three main ingredients. First is the environment.

The environment needs to be open-ended enough because the agent's capability is upper bounded by the environment complexity. And I'd argue that Earth is actually a perfect example because it's so open-ended, this world we live in, that it allows an algorithm called natural evolution to produce all the diverse forms and behaviors of life on this planet.

So can we have a simulator that is essentially a lo-fi Earth, but we can still run it on the lab clusters? And second, we need to provide the agent with massive pre-training data because exploration in an open-ended world from scratch is just intractable. And the data will serve at least two purposes.

One as a reference manual on how to do things. And second, as guidance on what are the interesting things worth pursuing. And GPT, at least up to GPT-4, only learns from pure text on the web. But can we provide the agent with much richer data, such as video walkthroughs, multimedia wiki documents, and other media forms?

And finally, once we have the environment and the database, we are ready to train foundation models for the agents. And it should be flexible enough to pursue the open-ended tasks without any task-specific assumptions, and also scalable enough to compress all of the multi-modal data that I just described. And here language, I argue, will play at least two key roles.

One is as a simple and intuitive interface to communicate a task, to communicate the human intentions to the agent. And second, as a bridge to ground all of the multi-modal concepts and signals. And that train of thought landed us in Minecraft, the best-selling video game of all time. And for those who are unfamiliar, Minecraft is a procedurally generated 3D voxel world.

And in the game, you can basically do whatever your heart desires. And what's so special about the game is that unlike AlphaGo, StarCraft, or Dota, Minecraft defines no particular objective to maximize, no particular opponent to beat, and doesn't even have a fixed storyline. And that makes it very well suited as a truly open-ended AI playground.

And here, we see people doing extremely impressive things in Minecraft, like this is a YouTube video where a gamer built the entire Hogwarts castle block by block, by hand, in the game. And here's another example of someone just digging a big hole in the ground and then making this beautiful underground temple with a river nearby.

It's all crafted by hand. And one more, this is someone building a functioning CPU circuit inside the game, because there is something called redstone in Minecraft that you can build circuits out of, like logic gates. And actually, the game is Turing-complete. You can simulate a computer inside the game.

Just think about how crazy that is. And here, I want to highlight a number: 140 million active players. And just to put this number in perspective, this is more than twice the population of the UK. And that is the number of people actively playing Minecraft.

And it just so happens that gamers are generally happier than PhDs, so they love to stream and share what they're doing. And that produces a huge amount of data every day online. And there's this treasure trove of learning materials that we can tap into for training generalist agents. You know, remember that data is the key for foundation models.

So we introduce MineDojo, a new open framework to help the community develop generally capable agents using Minecraft as a kind of primordial soup. MineDojo features three major parts: an open-ended environment, an internet-scale knowledge base, and then a generalist agent developed with the simulator and massive data. So let's zoom in on the first one.

Here's a sample gallery of the interesting things that you can do with MineDojo's API. We feature a massive benchmarking suite of more than 3,000 tasks. And this is by far the largest open-source agent benchmark to our knowledge. And we implement a very versatile API that unlocks the full potential of the game.

Like for example, MineDojo supports multi-modal observation and a full action space, like moving or attacking or inventory management. And that can be customized at every detail. Like you can tweak the terrain, the weather, block placement, monster spawning, and just anything you want to customize in the game. And given the simulator, we introduce around 1,500 programmatic tasks, which are tasks that have ground-truth success conditions defined in Python code.
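To make this concrete, here is a minimal sketch of what a programmatic task looks like in spirit: the success check and a simple sparse reward are just Python functions over the simulator state. The `env`, `policy`, and observation fields below are hypothetical placeholders, not the actual MineDojo API.

```python
# A hypothetical gym-style programmatic task. The observation layout and the
# `env`/`policy` objects are illustrative stand-ins, not MineDojo's real API.

def harvest_wool_success(obs: dict) -> bool:
    """Ground-truth success condition: at least one wool in the inventory."""
    return obs.get("inventory", {}).get("wool", 0) >= 1

def sparse_reward(obs: dict) -> float:
    """Sparse reward: 1.0 once the success condition holds, else 0.0."""
    return 1.0 if harvest_wool_success(obs) else 0.0

def run_episode(env, policy, max_steps: int = 1000) -> float:
    """Roll out a policy and accumulate the programmatic reward."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, _, done, _ = env.step(action)
        total += sparse_reward(obs)
        if done or harvest_wool_success(obs):
            break
    return total
```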

And you can also explicitly write down sparse or dense reward functions using this API. Some examples are harvesting different resources, unlocking the tech tree, or fighting various monsters and getting reward. And all these tasks come with language prompts that are templated. Next, we also introduce 1,500 creative tasks that are freeform and open-ended.

And that is in contrast to the programmatic tasks I just mentioned. So for example, let's say we want the agent to build a house. But what makes a house a house, right? It is ill-defined. And just like in image generation, you can't easily tell programmatically whether a cat was generated correctly or not.

So it's very difficult to use simple Python programs to give these kinds of tasks reward functions. And the best way is to use foundation models trained on internet-scale knowledge so that the model itself understands abstract concepts like the concept of a house. And finally, there's one task that holds a very special status called Playthrough, which is to beat the final boss of Minecraft, the Ender Dragon.

So Minecraft doesn't force you to do this task. As we said, it doesn't have a fixed storyline, but it's still considered a really big milestone for any beginner human player. I want to highlight that it is an extremely difficult task that requires very complex preparation, exploration, and also martial skills.

And for an average human, it will take many hours or even days to solve, easily over 1 million action steps in a single episode. And that would be the longest benchmarking task for policy learning ever created. So I admit, I am personally a below-average human. I was never able to beat the Ender Dragon, and my friends laugh at me, and I'm like, okay, one day my AI will avenge my poor skills.

That was one of the motivations for this project. Now let's move on to the second ingredient, the internet-scale knowledge base part of MineDojo. We offer three datasets here: YouTube, Wiki, and Reddit, and combined they are the largest open-ended agent behavior database ever compiled, to our knowledge. The first is YouTube, and we already said Minecraft is one of the most streamed games on YouTube, and the gamers love to narrate what they are doing.

So we collected more than 700,000 videos with 2 billion words in the corresponding transcripts. And these transcripts will help the agent learn about human strategies and creativity without us manually labeling things. And second, the Minecraft player base is so crazy that they have compiled a huge Minecraft-specific Wiki that basically explains everything you ever need to know in every version of the game.

It's crazy. And we scraped about 7,000 Wiki pages with interleaving multi-modal data, like images, tables, and diagrams, and here are some screenshots. Like this is a gallery of all of the monsters and their corresponding behaviors, like spawn and attack patterns. And also the thousands of crafting recipes are all present on the Wiki, and we scraped all of them.

And more, like complex diagrams and tables and embedded figures. Now that we have something like GPT-4V, it may be able to understand many of these diagrams. And finally, the Minecraft subreddit is one of the most active forums across the entire Reddit, and players showcase their creations and also ask questions for help.

So we scraped more than 300,000 posts from the Minecraft Reddit, and here are some examples of how people use Reddit as a kind of Stack Overflow for Minecraft. And we can see that some of the top-voted answers are actually quite good. Like someone is asking, "Oh, why doesn't my wheat farm grow?" And the answer says, "You need to light up the room with more torches, you don't have enough lighting." Now, given the massive task suite and internet data, we have the essential components to build generalist agents.

So in the first MineDojo paper, we introduced a foundation model called MineCLIP. And the idea is very simple, I can explain it in three slides. Basically for our YouTube database, we have time-aligned videos and transcripts. And these are actually real tutorial videos from our dataset. You see in the third clip, "as I raise my axe in front of this pig, there's only one thing that you know is going to happen." That's actually something a big Minecraft YouTuber said.

And then, given this data, we train MineCLIP in the same spirit as OpenAI's CLIP. So for those who are unfamiliar, OpenAI's CLIP is a contrastive model that learns the association between an image and its caption. And here, it's a very similar idea, but this time it is a video-text contrastive model.

And we associate the text with a video snippet that runs about eight to 16 seconds each. And intuitively, MineCLIP learns the association between the video and the transcript that describes the activity in the video. And MineCLIP outputs a score between 0 and 1, where 1 means a perfect correlation between the text and the video, and 0 means the text is irrelevant to the activity.

So you see, this is effectively a language-prompted foundation reward model that knows the nuances of things like forests, animal behaviors, and architectures in Minecraft. So how do we use MineCLIP in action? Here's an example of our agent interacting with the simulator. And here, the task is "shear sheep to obtain wool." And as the agent explores in the simulator, it generates a video snippet as a moving window, which can be encoded and fed into MineCLIP, along with an encoding of the text prompt here.

And MineCLIP computes the association. The higher the association is, the more the agent's behavior in this video aligns with the language, which is the task you want it to do. And that becomes a reward function for any reinforcement learning algorithm. So this looks very familiar, right? Because it's essentially RL from human feedback, or RLHF, in Minecraft.
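As a rough sketch of the idea (with stand-in encoder modules, not the released MineCLIP weights): embed the video snippet and the text prompt, and turn their similarity into a scalar in [0, 1] that any RL algorithm can consume as a reward.

```python
import torch
import torch.nn.functional as F

# Sketch of a MineCLIP-style reward model. The video and text encoders are
# assumed stand-in modules that each produce a fixed-size embedding.
class ClipStyleReward(torch.nn.Module):
    def __init__(self, video_encoder, text_encoder):
        super().__init__()
        self.video_encoder = video_encoder  # frames -> embedding
        self.text_encoder = text_encoder    # prompt tokens -> embedding

    @torch.no_grad()
    def forward(self, frames, prompt_tokens):
        v = F.normalize(self.video_encoder(frames), dim=-1)
        t = F.normalize(self.text_encoder(prompt_tokens), dim=-1)
        sim = (v * t).sum(dim=-1)      # cosine similarity in [-1, 1]
        return (sim + 1.0) / 2.0       # squashed to [0, 1], used as the RL reward
```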

And RLHF was the cornerstone algorithm that made ChatGPT possible, and I believe it will play a critical role in generalist agents as well. I'll quickly gloss over some quantitative results. I promise there won't be, like, many tables of numbers here. For these eight tasks, we show the percentage success rate over 200 test episodes.

And here, in the green circles, are two variants of our MineCLIP method. And in the orange circles are the baselines. So I'll highlight one baseline, where we construct a dense reward function manually for each task using the MineDojo Python API. And you can consider this column as a kind of oracle, the upper bound of the performance, because we put a lot of human effort into designing these reward functions just for these tasks.

And we can see that MineCLIP is able to match the quality of many of these, not all of them, but many of these manually generated rewards. It is important to highlight that MineCLIP is open-vocabulary. So we use a single model for all of these tasks instead of one model for each.

And we simply prompt the reward model with different tasks. And that's the only variation. One major feature of the foundation model is strong generalization out of the box. So can our agent generalize to dramatic changes in the visual appearance? So we did this experiment where during training, we only train our agents on a default terrain at noon on a sunny day.

But we tested zero-shot in a diverse range of terrains, weathers, and day/night cycles. And you can customize everything in MineDojo. And in our paper, we have numbers showing that MineCLIP significantly beats an off-the-shelf visual encoder when facing these kinds of distribution shifts out of the box. And this is no surprise, right, because MineCLIP was trained on hundreds of thousands of clips from Minecraft videos on YouTube, which have a very good coverage of all the scenarios.

And I think that is just a testament to the big advantage of using internet data, because you get robustness out of the box. And here are some demos of our learned agent behaviors on various tasks. So you may notice that these tasks are relatively short, around 100 to 500 time steps.

And that is because MineCLIP is not able to plan over very long time horizons. It is an inherent limitation in the training pipeline, because we could only use 8 to 16 seconds of video. So it's constrained to short actions. But our hope is to build an agent that can explore and make new discoveries autonomously, just all by itself, and it keeps going.

And in 2022, this goal seemed quite out of reach for us. MineDojo was released in June 2022. And this year, something happened. And that is GPT-4, a language model that is so good at coding and long-horizon planning that we just could not sit still, right? We built Voyager, the first large-language-model-powered lifelong learning agent.

And when we set Voyager loose in Minecraft, we see that it just keeps going. And by the way, all these video snippets are from a single episode of Voyager. It's not from different episodes, it's a single one. And we see that Voyager is just able to keep exploring the terrains, mine all kinds of materials, fight monsters, craft hundreds of recipes, and unlock an ever-expanding tree of diverse skills.

So how do we do this? If we want to use the full power of GPT-4, a central question is how to stringify things, converting this 3D world into a textual representation. We need a magic box here. And thankfully, again, the crazy Minecraft community already built one for us, and it's been around for many years.

It's called Mineflayer, a high-level JavaScript API that's actively maintained to work with any Minecraft version. And the beauty of Mineflayer is it has access to the game states surrounding the agent, like the nearby blocks, animals, and enemies. So we effectively have a ground-truth perception module as textual input. At the same time, Mineflayer also supports action APIs from which we can compose skills.

And now that we can convert everything to text, we are ready to construct an agent on top of GPT-4. So on a high level, there are three components. One is a coding module that writes JavaScript code to control the game bot, and it's the main module that generates the executable actions.

And second, we have a code base to store the correctly written code and look it up in the future if the agent needs to recall the skill. And in this way, we don't duplicate efforts. And whenever facing similar situations in the future, the agent knows what to do. And third, we have a curriculum that proposes what to do next, given the agent's current capabilities and also situation.

And when you wire these components up together, you get a loop that drives the agent indefinitely and achieves something like lifelong learning. So let's zoom in on the center module. We prompt GPT-4 with documentation and examples on how to use a subset of the Mineflayer API. And GPT-4 writes code to take actions given the currently assigned task.

And because JavaScript runs in an interpreter, GPT-4 is able to define functions on the fly and run them interactively. But the code that GPT-4 writes isn't always correct; just like human engineers, you can't get everything right on the first try. So we develop an iterative prompting mechanism to refine the program.

And there are three types of feedback here. The environment feedback, like, you know, what are the new materials you've got after taking an action, or, you know, some enemies nearby. And the execution error from the JavaScript interpreter, if you wrote some buggy code, like undefined variable, for example, if it hallucinates something.

And another GPT-4 provides critique through self-reflection on the agent state and the world state. And that also helps refine the program effectively. So I want to show some quick examples of how the critic provides feedback on the task completion progress. So let's say in the first example, the task is to craft a spyglass.

And GPT-4 looks at the agent's inventory and decides that it has enough copper, but not enough amethyst as material. And the second task is to kill three sheep to collect food. And each sheep drops one unit of wool, but there are only two units in inventory. So GPT-4 reasons and says, okay, you have one more sheep to go, and likewise.
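Here is a minimal sketch of that refinement loop, assuming placeholder `llm`, `execute_in_game`, and `critic` callables rather than Voyager's actual implementation: the model writes a Mineflayer program, we run it, and the three feedback channels are folded into the next prompt until the skill works.

```python
# Sketch of Voyager-style iterative prompting. `llm`, `execute_in_game`, and
# `critic` are hypothetical callables: the LLM returns a JavaScript program as
# a string, the executor runs it in the game, and the critic is another LLM
# call that judges task progress from the agent/world state.

def refine_skill(task: str, llm, execute_in_game, critic, max_rounds: int = 4):
    feedback = ""
    for _ in range(max_rounds):
        program = llm(
            f"Task: {task}\n"
            f"Feedback from the previous attempt: {feedback}\n"
            "Write JavaScript code using the Mineflayer API."
        )
        env_events, error = execute_in_game(program)   # environment feedback + execution errors
        success, critique = critic(task, env_events)   # self-reflection on task progress
        if success and error is None:
            return program                             # a working skill, ready to store
        feedback = f"env: {env_events}\nerror: {error}\ncritic: {critique}"
    return None                                        # give up after max_rounds
```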

Now moving on to the second part, once Voyager implements a skill correctly, we save it to our persistent storage. And you can think of the skill library as a code repository written entirely by a language model through interaction with a 3D world. And the agent can record new skills, and also retrieve skills from the library facing similar situations in the future.

So it doesn't have to go through this whole program refinement that we just saw, which is quite inefficient; you do it once and save it to disk. And in this way, Voyager kind of bootstraps its own capabilities recursively as it explores and experiments in the game. And let's dive a little bit deeper into how the skill library is implemented.

So this is how we insert a new skill. First we use GPT-3.5 to summarize the program into plain English. Summarization is very easy, and GPT-4 is expensive, so we just go for a cheaper tier. And then we embed this summary as the key, and we save the program, which is a bunch of code, as the value.

And we find that doing this makes retrieval better, because the summary is more semantic, while raw code is harder to match against, and that's how we insert it. And now for the retrieval process, when Voyager is faced with a new task, let's say craft an iron pickaxe, we again use GPT-3.5 to generate a hint on how to solve the task.

And that is something like a natural language paragraph. And then we embed that and use it as the query into the vector database, and we retrieve the skill from the library. So you can think of it as a kind of in-context replay buffer, in reinforcement learning terms. And now moving on to the third part, we have another GPT-4 that proposes what task to do, given its own capabilities at the moment.
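Before getting to the curriculum, here is a minimal sketch of that insert-and-retrieve flow, with `embed`, `summarize`, and `hint_for` as hypothetical helpers standing in for the embedding model and the GPT-3.5 calls.

```python
import numpy as np

# Sketch of the skill library: the summary embedding is the key, the generated
# program is the value, and retrieval embeds a how-to hint for the new task and
# does nearest-neighbor lookup. `embed` is assumed to return unit-norm vectors.
class SkillLibrary:
    def __init__(self, embed, summarize, hint_for):
        self.embed, self.summarize, self.hint_for = embed, summarize, hint_for
        self.keys, self.programs = [], []

    def insert(self, program: str):
        summary = self.summarize(program)          # cheap GPT-3.5-style call
        self.keys.append(self.embed(summary))      # semantic key
        self.programs.append(program)              # raw code as the value

    def retrieve(self, task: str, k: int = 1):
        query = self.embed(self.hint_for(task))    # embed a hint, not the raw task
        sims = [float(np.dot(query, key)) for key in self.keys]
        top = np.argsort(sims)[::-1][:k]           # highest-similarity skills first
        return [self.programs[i] for i in top]
```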

And here we give GPT-4 a very high-level unsupervised objective, that is, to obtain as many unique items as possible; that is our high-level directive. And then GPT-4 takes this directive and implements a curriculum of progressively harder and more novel challenges to solve. So it's kind of like curiosity-driven exploration, which is not novel relative to the prior literature, but here it's implemented purely in context.

If you're listening on Zoom, the next example is fun. Let's go through this example together, just to show you how Voyager works, the whole complicated data flow that I just showed. So the agent finds itself hungry, with only one out of 20 on the hunger bar. So GPT-4 knows that it needs to find food ASAP.

And then it senses there are four entities nearby: a cat, a villager, a pig, and some wheat seeds. And now GPT-4 starts a self-reflection, like, do I kill the cat and villager to get some meat? That sounds horrible. How about the wheat seeds? I can use the seeds to grow a farm, but that's going to take a very long time until I can generate some food.

So sorry, piggy, you are the one being chosen. So GPT-4 looks at the inventory, which is the agent state. There's a piece of iron in the inventory. So Voyager recalls a skill from the library, that is to craft an iron sword, and then uses that skill to start learning a new skill, and that is to hunt the pig.

And once the hunt-pig routine is successful, GPT-4 saves it to the skill library. That's roughly how it works. And putting all these together, we have this iterative prompting mechanism, the skill library, and an automatic curriculum. And all of these combined is Voyager's no-gradient architecture, where we don't train any new models or fine-tune any parameters, which allows Voyager to self-bootstrap on top of GPT-4, even though we are treating the underlying language model as a black box.

It looks like my example worked, and they started to listen. So yeah, these are the tasks that Voyager picked up along the way, and we didn't pre-program any of these. These were all Voyager's own ideas. The agent is kind of forever curious, and also forever pursuing new adventures just by itself.

So to quickly show some quantitative results, here we have a learning curve, where the x-axis is the number of prompting iterations, and the y-axis is the number of unique items that Voyager discovered as it explores the environment. And these two curves are baselines, ReAct and Reflexion. And this is AutoGPT, which is a popular software repo.

Basically, you can think of it as combining ReAct and a task planner that decomposes an objective into sub-goals. And this is Voyager. We're able to obtain three times more novel items than the prior methods, and also unlock the entire tech tree significantly faster. And if you take away the skill library, you see that Voyager really suffers.

The performance takes a hit, because every time it needs to kind of repeat and relearn every skill from scratch, and it starts to make a lot more mistakes, and that really degrades the exploration. Here, these two are the bird's-eye views of the Minecraft map, and these circles are what the prior methods are able to explore, given the same prompting iteration budget.

And we see that they tend to get stuck in local areas and kind of fail to explore more. But Voyager is able to navigate distances at least two times as much as the prior works. So it's able to visit a lot more places, because to satisfy this high-level directive of obtaining as many unique items as possible, you've got to travel, right?

If you stay at one place, you will quickly exhaust interesting things to do. And Voyager travels a lot, so that's how we came up with the name. So finally, one limitation is that Voyager does not currently support visual perception, because the GPT-4 that we used back then was text-only.

But there's nothing stopping Voyager from adopting, like, multi-modal language models in the future. So here we have a little proof-of-concept demo, where we ask a human to basically function as the image captioner. And the human will tell Voyager that, as you're building these houses, what are the things that are missing?

Like, you placed a door incorrectly, like, the roof is also not done correctly. So the human is acting as the critic module of the Voyager stack. And we see that with some of that help, Voyager is able to build a farmhouse and a nether portal. So it doesn't have too hard a time understanding 3D spatial coordinates just by itself in a textual domain.

Now, after doing Voyager, we were considering, like, where else can we apply this idea, right, of coding in an embodied environment, observing the feedback, and iteratively refining the program. So we came to realize that physics simulations themselves are also just Python code. So why not apply some of the principles from Voyager and do something in another domain?

What if you apply Voyager in the space of this physics simulator API? And this is Eureka, which my team announced just, like, three days ago, fresh out of the oven. It is an open-ended agent that designs reward functions for robot dexterity at superhuman level. And it turns out that GPT-4 plus reinforcement learning can spin a pen much better than I do.

I gave up on this task a long time ago, from childhood. It's so hard for me. So Eureka's idea is very simple and intuitive. GPT-4 generates a bunch of possible reward function candidates implemented in Python. And then you just do a full reinforcement learning training loop for each candidate in a GPU-accelerated simulator, you get a performance metric, and you take the best candidates and feed them back to GPT-4, which samples the next batch of candidates and keeps improving the whole population of reward functions.

That's the whole idea. It's kind of like an in-context evolutionary search. So here's the initial reward generation, where Eureka takes as context the environment code of NVIDIA's Isaac Gym and a task description, and samples the initial reward function implementation. So we found that the simulator code itself is actually a very good reference manual, because it tells Eureka what variables you can use, like the hand position, the fingertip positions, the rotation, the angular velocity, et cetera.

So you know all of these variables from the simulator code, and you know how they interact with each other. So that serves as a very good in-context instruction. So Eureka doesn't need to reference any human-written reward functions. And then once you have the generated reward, you plug it into any reinforcement learning algorithm and just train it to completion.

So this step is typically very costly and very slow, because reinforcement learning itself is slow. And we were only able to scale up Eureka because of NVIDIA's Isaac Gym, which runs 1,000 simulated environment copies on a single GPU. So basically, you can think of it as speeding up reality by 1,000x.

And then after training, you will get the performance metrics back on each reward component. And as we saw from Voyager, GPT-4 is very good at self-reflection. So we leveraged that capability. (And there's a software trial popup reminding me to activate a license.) Yeah, so Eureka reflects on the metrics and then proposes mutations to the code.

So here, the mutations, we found, can be very diverse, ranging from something as simple as changing a hyperparameter in the reward function weighting all the way to adding completely novel components to the reward function. And in our experiments, Eureka turns out to be a superhuman reward engineer, actually outperforming some of the reward functions implemented by the expert human engineers on NVIDIA's Isaac Gym team.
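Here is a minimal sketch of this in-context evolutionary search, with `llm` and `train_rl` as placeholder callables; in the real system, each candidate goes through a full Isaac Gym RL training run.

```python
# Sketch of Eureka-style reward search. `llm` returns a reward function as a
# Python string; `train_rl` is assumed to run a full RL training loop with that
# reward and return {"score": float, "components": dict} as training metrics.

def eureka_search(task_description, env_source_code, llm, train_rl,
                  iterations: int = 5, samples_per_iter: int = 16):
    best_reward_code, best_score, feedback = None, float("-inf"), ""
    for _ in range(iterations):
        # Outer loop (gradient-free): GPT-4-style model proposes reward candidates,
        # conditioned on the simulator source code and last round's feedback.
        candidates = [
            llm(f"Environment code:\n{env_source_code}\n"
                f"Task: {task_description}\n"
                f"Feedback from last round: {feedback}\n"
                "Write a Python reward function.")
            for _ in range(samples_per_iter)
        ]
        # Inner loop (gradient-based): train an RL policy with each candidate reward.
        results = [(code, train_rl(code)) for code in candidates]
        code, stats = max(results, key=lambda r: r[1]["score"])
        if stats["score"] > best_score:
            best_reward_code, best_score = code, stats["score"]
        # Reflection: feed per-component metrics back so the model can mutate the reward.
        feedback = f"best score {stats['score']}, component metrics: {stats['components']}"
    return best_reward_code
```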

So here are some more demos of how Eureka is able to write very complex rewards that lead to these extremely dexterous behaviors. And we can actually train the robot hand to rotate pens, not just in one direction, but in different directions along different 3D axes. I think one major contribution of Eureka, different from Voyager, is to bridge the gap between high-level reasoning and low-level motor control.

So Eureka introduces a new paradigm that I'm calling the hybrid-gradient architecture. So recall Voyager is a no-gradient architecture. We don't touch anything, and we don't train anything. But Eureka is hybrid-gradient, where a black-box, inference-only language model instructs a white-box, learnable neural network. So you can think of it as two loops, right?

The outer loop is gradient-free, and it's driven by GPT-4, kind of selecting the reward functions. And the inner loop is gradient-based: you run a full reinforcement learning training run on it, training a specialized neural network controller to achieve extreme dexterity. And you must have both loops to succeed, to deliver this kind of dexterity.

And I think it will be a very useful paradigm for training robot agents in the future. So you know, these days when I go on Twitter or X, I see AI conquering new lands like every week. You know, chat, image generation, and music, they're all very well within reach.

But MineDojo, Voyager, and Eureka are just scratching the surface of open-ended generalist agents. And looking forward, I want to share two key research directions that I personally find extremely promising, and I'm also working on them myself. The first is a continuation of MineCLIP, basically how to develop methods that learn from internet-scale videos.

And the second is multimodal foundation models. GPT-4V is just coming out, and it is only the beginning of an era. And I think it's important to have all of the modalities in a single foundation model. So first, about videos. We all know that videos are abundant, right? There's so much data on YouTube, way too much for our limited GPUs to process.

They're extremely useful to train models that not only have dynamic perception and intuitive physics, but also capture the complexity of human creativity and human behaviors. It's all good, except that when you use video to pre-train embodied agents, there is a huge distribution shift. You also don't get action labels, and you don't get any of the grounding because you are a passive observer.

So I think here's a demonstration of why learning from video is hard, even for natural intelligence. So a little cat is watching a boxer shaking his head, and it thinks maybe shaking its head is the best way to fight. This is why learning from video is hard. You have no idea, like, why...

This is too good. Let's play this again. You have no idea why Tyson is doing this, right? Like, the cat has no idea, and then it associates this with just the wrong kind of policy. For sure, it doesn't help the fighting, but it definitely boosts the cat's confidence.

That's why learning from video is hard. Now, I want to point out some of the latest research on how to leverage so much video for generalist agents. There are a couple of approaches. The first is the simplest: just learn kind of a visual feature extractor from the videos.

So this is R3M from Chelsea Finn's group at Stanford. And this model is still an image-level representation, just that it uses a video-level loss function to train, more specifically, time-contrastive learning. And after that, you can use this as an image backbone for any agent, but you still need to kind of fine-tune using domain-specific data for the agent.

The second path is to learn reward functions from video, and MineCLIP is one model under this category. It uses a contrastive objective between the transcript and the video. And here, this work, VIP, is another way to learn a similarity-based reward for goal-conditioned tasks in the image space. VIP is led by the same person who is the first author of Eureka, and Eureka was his internship project with me.

And the third idea is very interesting. Can we directly do imitation learning from video, but better than the cat that we just saw? So we just said, you know, the videos don't have the actions, right? We need to find some ways to pseudo-label the actions. And this is Video PreTraining, or VPT, from OpenAI last year, to solve long-range tasks in Minecraft.

And here, the pipeline works like this. Basically, you use a keyboard and a mouse action space, so you can align this action space with the human actions. And OpenAI hires a bunch of Minecraft players and actually collects data in-house. So they record the episodes done by those gamers. And now you have a dataset of video and action pairs, right?

And you train something called an inverse dynamics model, which takes the observations and predicts the actions that caused the observations to change. So that's the inverse dynamics model. And that becomes a labeler that you can apply to in-the-wild YouTube videos that don't have the actions. So you apply the IDM to like 70K hours of in-the-wild YouTube videos, and you get pseudo-labeled actions that are not always correct, but way better than random.

And then you train imitation learning on top of this augmented dataset. And in this way, OpenAI is able to greatly expand the data, because the original data collected from the humans is high quality but extremely expensive, while in-the-wild YouTube videos are very cheap, but you don't have the actions.
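Here is a minimal sketch of that three-stage pipeline, with the training functions as placeholders rather than OpenAI's actual code.

```python
# Sketch of a VPT-style pipeline: train an inverse dynamics model (IDM) on a
# small human-labeled set, pseudo-label a large pile of unlabeled videos, then
# run behavior cloning on the combined dataset. `train_idm` and `train_bc` are
# assumed stand-in training functions.

def vpt_pipeline(labeled_clips, unlabeled_frames, train_idm, train_bc):
    # 1. Contractor data: (frames, keyboard/mouse actions) pairs collected in-house.
    idm = train_idm(labeled_clips)            # predicts actions from surrounding frames

    # 2. Pseudo-label cheap in-the-wild videos that come without action labels.
    pseudo_labeled = [(frames, idm(frames)) for frames in unlabeled_frames]

    # 3. Imitation learning (behavior cloning) on the augmented dataset.
    return train_bc(labeled_clips + pseudo_labeled)
```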

So they kind of got the best of both worlds. But still, you know, it's really expensive to hire these humans. Now what's beyond videos, right? I'm a firm believer that multimodal models will be the future. And I see text as a very lousy kind of 1D projection of our physical world.

So it's essential to include the other sensory modalities to provide a full embodied experience. And in the context of embodied agents, I think the input will be a mixture of text, images, videos, and even audio in the future, and the output will be actions. So here's a very early example of a multimodal language model for robot learning.

So let's imagine a household robot. We can ask the robot to bring us a cup of tea from the kitchen. But if we want to be more specific, I want this particular cup, that is my favorite cup, so we show it this image. And we also provide a video demo of how we want to mop the floor and ask the robot to imitate a similar motion in context.

And when the robot sees an unfamiliar object, like a sweeper, we can explain it by providing an image and showing this is a sweeper, now, you know, go ahead and do something with the tool. And finally, to ensure safety, we can say, take a picture of that room and just do not enter that room.

To achieve this, back last year, we proposed a model called VIMA, which stands for VisuoMotor Attention. And in this work, we introduced a concept called multimodal prompting, where the prompt can be a mixture of text, images, and videos. And this provides a very expressive API that unifies a bunch of different robot tasks that would otherwise require very different pipelines or specialized models to solve in the prior literature.

And VIMA simply tokenizes everything, converting image and text into sequences of tokens and training a transformer on top to output the robot arm actions autoregressively, one step at a time during inference time. So just to look at some of the examples here, like this prompt, rearrange objects to match the scene.

It is a classical task called visual goal reaching that has a big body of prior work on it, and that's how our robot does it, given this prompt. And we can also give it novel concepts in context, like this is a blicker, this is a work, now put a work into a blicker.

And both words are nonsensical. So they're not in the training data, but VIMA is able to generalize zero-shot and follow the motion to manipulate this object. So the robot understands what we want and then follows this trajectory. And finally, we can give it a more complex prompt, like these are the safety constraints: sweep the box into this, but without exceeding that line.
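All of these go through the mechanism described a moment ago: the multimodal prompt is tokenized into one interleaved sequence, and the transformer decodes actions one step at a time. Here is a minimal sketch, with the tokenizer, image encoder, policy, and environment as placeholder callables rather than VIMA's actual implementation.

```python
# Sketch of multimodal prompting: interleaved text and image segments become a
# single token sequence, and a policy decodes robot actions autoregressively.
# `text_tokenizer`, `image_encoder`, `policy`, and `env` are hypothetical.

def tokenize_prompt(segments, text_tokenizer, image_encoder):
    """segments: list of ('text', str) or ('image', array) pairs in prompt order."""
    tokens = []
    for kind, content in segments:
        if kind == "text":
            tokens.extend(text_tokenizer(content))   # discrete word tokens
        else:
            tokens.extend(image_encoder(content))    # object / image tokens
    return tokens

def rollout(prompt_tokens, policy, env, max_steps: int = 10):
    """Decode one action per step, conditioned on the prompt and action history."""
    obs, history = env.reset(), []
    for _ in range(max_steps):
        action = policy(prompt_tokens, history, obs)
        obs, _, done, _ = env.step(action)
        history.append(action)
        if done:
            break
```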

And we do this using the interleaving image and text tokens. And recently, Google Brain Robotics followed up after VIMA with RT-1 and RT-2, Robotics Transformer 1 and 2. And RT-2 uses a similar recipe as I described, where they first kind of pre-train on internet-scale data and then fine-tune with human-collected demonstrations on the Google robots.

And RoboCat from DeepMind is another interesting work. They train a single unified policy that works not just on a single robot, but actually across different embodiments, different robot forms, and even generalizes to new hardware. So I think this is like a higher form of multimodal agent, with a physical form factor.

The morphology of the agent itself is another modality. So that concludes our looking-forward section. And lastly, I want to put together all the links for the works I described. So this is minedojo.org. We are big fans of open source, so for all these projects we open source as much as we can, including the model code, checkpoints, simulator code, and training data.

And this is voyager.minedojo.org. This is Eureka. And this is VIMA. And one more thing, right? If you just want an excuse to play Minecraft at work, then MineDojo is perfect for you, because you are collecting human demonstrations to train generalist agents. And if there's one thing that you take away from this talk, it should be this slide.

And lastly, I just want to remind all of us, despite all the progress I've shown, what we can do is still very far from human ingenuity as embodied agents. These are the videos from our dataset of people doing like decorating a winter wonderland or building the functioning CPU circuit within Minecraft.

And we're very far from that as AI researchers. So here's a call to the community. If humans can do these mind-blowing tasks, then why not our AI, right? Let's find out together.