If GPT-4 can train a robodog to balance on a rolling yoga ball that's being kicked or deflated, and do it better than we can, what's next? Are we sure that changing a lightbulb or fixing a plumbing leak is much more physically complex? And if it's a 2022-era language model, GPT-4, that is doing the teaching, what does that say about the learning rates of robots taught by even 2024-era AI?
This Dr. Eureka paper was released less than 48 hours ago, but I'll give you all the highlights and interview clips with two key figures behind the Eureka concept, Jim Fan and Guanzhi Wang. But first, what is the overall concept? What are they basically doing? They want to train a robot, in this case a quadruped robodog, in simulation and see if they can transfer that to the real world.
That's the sim-to-real part: from simulation to reality. And they want to use a language model, in this case GPT-4, to guide that process. And why would it be helpful to use a language model? Well, if you have to go in as a human and tweak all the different parameters, which we'll see in a moment, that takes ages.
As the paper says, that renders the process slow and human labor intensive. But this paper isn't just about saving time. The language model derived reward functions perform better than the human ones. In short, language models like GPT-4 are better teachers for robots. So why do I think this is so much more significant than yoga balls?
Language models like ChatGPT are brilliant at generating hypotheses, generating ideas, as the paper says. But as we all know, their great Achilles heel is hallucinations or confabulations, making stuff up, making mistakes. But if those ideas, even tens of thousands of them, can be tested in simulation, we can find just the good ones.
Thankfully, language models are infinitely patient. And so what we end up with are better implementations, in this case for robot training, than humans can produce. And crucially, as the paper points out, this works for novel or new robot tasks, ones not seen in the training data of the language model.
And this approach isn't just effective for new tasks, but for novel situations within existing tasks. We'll see how all of this is done in a moment, but here is the GPT-4 trained robo-dog reacting to the yoga ball being deflated. It overcomes that situation, despite not seeing that in training.
Before we get back to the paper, though, let's reiterate something in the words of Dr. Jim Fan. Once derived in simulation, this policy is transferred zero shot to the real world. Or translated, it's not relying on human demonstrations. The robo-dog doesn't have to see a human or another robo-dog balancing on a yoga ball.
No fine tuning, it just works. Not every single time, admittedly, but we'll get to the blooper reel later on. I will need to give you a minute of background on the Eureka paper, which came out in October of last year, before we can get to Dr. Eureka. And let me try to summarize the paper in less than 60 seconds.
A reward function is a way of specifying, in code, how to measure success in a task. And language models are great at coming up with them and modifying them based on environmental feedback. So the NVIDIA team proposed a task, in this case spinning a pen in a simulated robotic hand.
Then GPT-4 would propose a way of measuring success, a reward function. Of course, because it's infinitely patient, it could generate hundreds of these. These would be tested in simulation in parallel, thanks to NVIDIA hardware. GPT-4 was then prompted to reflect on the results. Then, based on those reflections, it would iterate the reward functions.
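If it helps to see that loop as code, here's a minimal sketch in Python. The query_gpt4 and train_in_simulation helpers are placeholders I've invented purely for illustration, not real APIs; the actual pipeline trains policies in massively parallel GPU simulation, but the control flow is roughly this shape.

```python
import random

# Stand-ins for the real components: an LLM call that writes reward-function
# code, and parallel simulation that trains and scores a policy against it.
# Both are placeholders invented for this sketch, not real APIs.
def query_gpt4(task: str, feedback: str) -> str:
    return f"# candidate reward function for: {task}\n# informed by: {feedback[:60]}"

def train_in_simulation(reward_code: str) -> float:
    return random.random()  # pretend this is the trained policy's task score

def eureka_loop(task: str, iterations: int = 5, samples: int = 16) -> str:
    """Propose many reward functions, score them in simulation, reflect, iterate."""
    feedback = "No previous results."
    best_code, best_score = "", float("-inf")
    for _ in range(iterations):
        # 1. Ask the LLM for a batch of candidate reward functions (as code).
        candidates = [query_gpt4(task, feedback) for _ in range(samples)]
        # 2. Train a policy against each candidate in (parallel) simulation.
        scores = [train_in_simulation(code) for code in candidates]
        # 3. Keep the best candidate seen so far.
        top = max(range(len(candidates)), key=lambda i: scores[i])
        if scores[top] > best_score:
            best_code, best_score = candidates[top], scores[top]
        # 4. Summarise the results so the LLM can reflect before the next batch.
        feedback = f"Best score so far: {best_score:.3f}. Try to improve on it."
    return best_code

print(eureka_loop("spin a pen in a simulated robotic hand"))
```

The key point is the outer loop: generate a batch, test everything in simulation, feed a summary back, and repeat.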
Spoiler alert: it got really good at spinning the pen, at least in simulation. Fast forward a few months and we now have Dr. Eureka. And no, GPT-4 didn't go off and get a PhD; we're still using vanilla GPT-4. The "DR" stands for domain randomization, which I'll get to in a moment.
Now, some of you may immediately put your hands up and say, what was wrong with Eureka? Couldn't that have just worked for real world deployment? In a nutshell, the basic issue is that the world is kind of weird and nitty gritty. There's a lot of physics you need to contend with and aspects of the domain or environment you can't quite predict.
How much power will be in the robot's motors and how much friction will the legs have on the ball? And then some of you might say, that's not a problem, just test every single scenario. But the problem with that is that in the real world, people have a limited compute budget.
It's not practical in 2024 to test every single possible scenario. We need to give the variables a realistic range, but not with human intuition, with LLM intuition. So let me now try to run through the Dr. Eureka process, which I think is genius. As with Eureka, we propose the task.
What we add to Eureka is a safety instruction. Basically: GPT-4, be realistic, our motors can only do this much. Other things they say include: this policy is going to be deployed in the real world, so be careful. So then we get the GPT-4 policy, or set of instructions, for example for controlling the legs of the robodog.
Now this is where it gets a little bit complicated, so you might have to focus. Taking that policy, what they then do is isolate each variable in the environment, in this case gravity. They amp it right up until the policy breaks, and they bring it right down until the policy breaks. That gives us a viable range where the policy works; that's the reward-aware part. And why limit ourselves to that range? Well, if you set hyper-unrealistic values for gravity, the policy will fail every single time in that setting, so we won't learn anything: there's no signal back to the system about what works. Keep things in a realistic range and we get a more reliable signal.
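If you want to see that search in code, here's a minimal sketch for a single parameter, friction, where the hypothetical policy_succeeds check stands in for actually rolling out the trained policy in simulation with that one value perturbed.

```python
# A stand-in for evaluating the trained policy in simulation with one physics
# parameter perturbed; here I just pretend it copes with friction in [0.2, 1.5].
def policy_succeeds(friction: float) -> bool:
    return 0.2 <= friction <= 1.5

def viable_range(lo: float = 0.0, hi: float = 5.0, steps: int = 500):
    """Sweep one parameter across a wide grid and return the lowest and
    highest values at which the policy still works (the viable range)."""
    values = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    working = [v for v in values if policy_succeeds(v)]
    return (min(working), max(working)) if working else None

print(viable_range())  # roughly (0.2, 1.5): the viable range for this parameter
```

The two ends of that sweep form the viable range, which GPT-4 then narrows down with common sense in the next step.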
Unfortunately, though, that's not enough, and that's where we need domain randomization. To explain that, I have to give you a vivid demonstration. At the previous stage, we were limited to ranges for these different variables where the policy could at least sometimes work.
Variables like the bounciness of the ball (restitution), friction, and gravity, as I mentioned. There you can see the motor strength range. But there's no real common sense here about what would actually happen with a yoga ball; that's why they called it an uninformative context. What the GPT-4-generated domain randomizations do is give a much more realistic range based on common sense.
Notice how with each of the ranges, we get an explanation from GPT-4 about why it's picking that range. For bounciness, it says we're not focused on bouncing, though it's still relevant for minor impacts; notice it's just between 0 and 0.5. For friction, it's thinking about tiles, grass, dirt, and so on. For motor strength, it's actually half of the full range.
And it says this is a moderate range, allowing for variability in motor performance. By limiting the ranges we're going to test, we get much more effective learning. This is where GPT-4 starts to outstrip humans in teaching robots. In case you didn't know, by the way, GPT-4 finished training in August of 2022.
How good GPT-5 is at training robots, only time will tell. Now, some of you, in bewilderment, will be saying: but Philip, why do we even need a range? Why can't we just guess a single value for each of these things? Well, the real world, again, is messy. By testing your set of instructions across these realistic scenarios, the policy becomes much more robust in the real world.
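To picture how those ranges actually get used during training, here's a minimal sketch. The parameter names and numbers are illustrative rather than the paper's exact values; the idea is simply that every training episode draws fresh physics values from within the LLM-chosen ranges, so the policy never overfits to one exact world.

```python
import random

# Illustrative, GPT-4-style domain randomization ranges. The parameter names
# and numbers are made up for this sketch rather than taken from the paper.
RANDOMIZATION_RANGES = {
    "restitution":    (0.0, 0.5),    # bounciness: only minor impacts matter
    "friction":       (0.4, 1.2),    # tiles, grass, dirt, and so on
    "motor_strength": (0.5, 1.0),    # a moderate slice of the full motor range
    "gravity_scale":  (0.95, 1.05),  # small perturbations around normal gravity
}

def sample_episode_physics(ranges=RANDOMIZATION_RANGES):
    """Draw one concrete physics configuration for a single training episode."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

# Each episode, the simulator would be reset with a fresh draw like this one.
print(sample_episode_physics())
```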
As we'll see, the original Eureka flops without this step. Before we carry on, some of you will be shaking your heads and saying, I'm sure humans could do better than this. Can't humans come up with better reward functions and more realistic ranges? Well, here's Guanzhi Wang describing how humans get stuck in local optima.
It has a lot of prior knowledge, and therefore it can just propose different kinds of mutations and variations of the reward function based on the environment context. For humans, you need to manually tune the reward functions, and it's very easy for humans to get stuck in a local optimum.
For GPT-4, it can generate tens of reward functions at the same time, and based on the performance of each reward function, it can continuously improve it. Humans simply don't have the patience of large language models. Or, to bring in some real numbers, Dr. Eureka-trained robodogs outperform those trained with human-designed reward functions and domain randomization parameters by 34% in forward velocity and 20% in distance traveled across various real-world evaluation terrains: grass, pavement, you name it.
By the way, they also did other tasks like cube rotations and there again, Dr. Eureka's best policy performs nearly 300% more of them within a fixed time period. More rotations for your money if you will. Remember, before this, we had to rely on domain experts to manually perturb different parameters such as friction.
And another problem, as I mentioned earlier, is that the human would then have to observe how those sets of instructions, or policies, did, test them in the real world effectively, and then try new reward functions. All of this delay is probably why we don't have robot servants already. To clarify, this is the first work to investigate whether large language models like GPT-4 can themselves be used to guide this simulation-to-reality transfer.
Now what about that safety instruction I mentioned earlier? Why is that crucial? Well, this is where it gets a little bit funny. Basically, without that safety instruction, GPT-4 starts to behave in a degenerate fashion. Things got pretty wild with GPT-4, but I'll give you the censored version. Basically, it would cheat by over-exerting the robot motors or learning unnatural behavior.
Essentially, it would propose things that conquer the simulation, but which wouldn't work in reality. For example, it would try thrusting its hip against the ground and dragging itself with three of its legs. Now I'm sure that none of you would try such degenerate behavior, but GPT-4 did. Put that into the real world though, and of course that behavior doesn't transfer.
With that policy, the robo-dog directly face plants at the starting line. More formally though, we got reward functions like this. And unlike human-designed reward functions, which would involve adding each component, this was multiplicative. The reward was the product of the terms above. And why is that really smart? Well, if any of these tend towards zero, the product will tend towards zero.
If you violate the degrees of freedom of the robot's joints, the entire reward function evaluates to zero. Remember, if you multiply anything by zero, you get zero. Whereas with the human-designed reward, you would add these things up and still get some reward.
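The contrast is easy to see in a toy example. In the sketch below, with terms I've made up purely for illustration, the multiplicative reward collapses to zero the moment any single constraint is violated, while the additive version still pays out partial credit for the hip-dragging style of cheating.

```python
import math

# Toy comparison of multiplicative vs. additive reward shaping. The individual
# terms are invented for illustration; the point is how they combine.
def reward_terms(torso_height, balance_error, joint_limit_violation):
    height_term  = math.exp(-abs(torso_height - 0.6))      # prefer torso near 0.6 m
    balance_term = math.exp(-balance_error)                # prefer staying centred
    limits_term  = 0.0 if joint_limit_violation else 1.0   # hard safety gate
    return height_term, balance_term, limits_term

def multiplicative_reward(*state):
    h, b, l = reward_terms(*state)
    return h * b * l            # any term near zero kills the whole reward

def additive_reward(*state):
    h, b, l = reward_terms(*state)
    return h + b + l            # partial credit survives a violated constraint

# A "cheating" state: joints past their limits while dragging along the ground.
cheat = (0.1, 0.3, True)
print(multiplicative_reward(*cheat))  # 0.0 -- no credit at all
print(additive_reward(*cheat))        # > 0 -- still rewarded despite cheating
```

That single property is what makes it so much harder for the policy to game the simulator.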
Here are some examples of the kinds of prompts they fed GPT-4 to emphasize realism and safety. The policy, they said, will be trained in simulation and deployed in the real world, so the policy, they reminded GPT-4, should be as steady and stable as possible: keep the torso high up, with the orientation perpendicular to gravity. Later, they say, please also penalize jittery or fast actions that may burn out the motors.
These kinds of safety-oriented prompts were crucial. Here you can see GPT-4 reflecting on a reward function that had failed and coming up with improvements. It was like, ah, I need an exponential reward component for the height reward so that the reward gradient is smoother. Then it updates the reward function.
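As a rough illustration of why that kind of edit helps (the numbers here are mine, not GPT-4's), compare a hard threshold on torso height with an exponential term: the threshold gives the learner no signal until it is already close to succeeding, while the exponential version rewards every bit of progress.

```python
import math

# An illustration, with made-up numbers, of the reward-smoothing edit described
# above: a hard height threshold versus an exponential height term.
TARGET_HEIGHT = 0.6  # desired torso height in metres (illustrative)

def threshold_height_reward(height: float) -> float:
    # All-or-nothing: zero signal until the torso is already near the target.
    return 1.0 if abs(height - TARGET_HEIGHT) < 0.05 else 0.0

def exponential_height_reward(height: float) -> float:
    # Smooth: the reward rises continuously as the torso approaches the target,
    # so every small improvement produces a gradient to learn from.
    return math.exp(-abs(height - TARGET_HEIGHT) / 0.1)

for h in (0.2, 0.4, 0.55, 0.6):
    print(h, threshold_height_reward(h), round(exponential_height_reward(h), 3))
```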
And here's another way that Dr. Eureka outperforms human training. When humans are trying to teach a robot a skill, they often come up with a curriculum, a set of things to learn in a particular order. So first they might teach a robot to move at half a meter per second, then one meter, then two meters per second.
These curricula have to be carefully designed. Well, with this approach, we don't need a reward curriculum. It's almost like the model throws out the human textbook and teaches itself. Oh, and why a yoga ball, by the way? Well, apparently they were inspired by the circus. It does make you wonder what they're going to try next, but let's see.
And what about limitations? Well, if you remember from earlier, they didn't incorporate any real-world feedback. They do admit that with dynamic adjustment of domain randomization parameters based on policy performance or real-world feedback, they could further improve the simulation-to-reality transferability. I actually had a discussion with Jim Fan about all of this back on my Patreon in January, and one of the things we discussed was another way to improve this approach: incorporating vision.
If GPT-4 could see where the robot is going wrong and not just read about it, it could do far better. And how about one more way to improve this approach? Co-evolution. Apologies for the slight audio distortion here. I honestly am struggling to see what the limit will be, and I'm wondering what you think about the limit to the Eureka approach as we are getting more and more powerful models.
I think that is a great question. You know, just by sheer coincidence, people are talking about Q* and there's this renewed interest in combining LLMs with classical approaches like search, right? Instead of just generating (you generate, then you get some feedback, then you generate more), you would do a little bit of search, and then you expand that search, and that kind of comes back to improve the model and also improve the intelligence of the whole system.
And actually, Eureka is doing exactly that. It uses GPT-4 to write reward functions, and the reward function instructs a robot hand to do tasks, and you get feedback. You know how good that robot is performing. And you can use that as a ground truth signal to improve even more, which we did in the paper.
And one limitation is that we are not able to fine-tune GPT, but it's possible that some of the open-source models will catch up in the future. And actually, we are also actively exploring how to use some open-source models in the loop for Eureka. Well, that means we will be able to not just improve in context, but also improve the intelligence of the underlying language model.
So basically the LLM, Eureka, and the robots can co-evolve and co-improve. And then, you know, that means basically the sky's the limit. Or, you know, the compute budget is the limit. In case you were wondering, all of this is open-source and the links will be in the description.
But what about the bigger implications? I predict that within a year, we will see a humanoid robot perform a complex physical dexterous task, one that is performed commonly in industry. That could be the wake-up call for many that the blue-collar world isn't completely immune to AI. Of course, there's a long way to go between where we are and the mass manufacturing of the robots needed to affect jobs at a big scale.
So of course, plumbers are safe for now. In high-stakes settings like self-driving, we're clearly not quite ready for widespread deployment. Although Waymo is doing pretty well. But for repetitive tasks, things might change faster than you think. And if you believe that the dexterity of human fingers is what will differentiate us, then Sanctuary AI will soon be on your case.
And with AI doing the training in parallel across thousands of simulations, things could change fast. Just an amazing paper and super enjoyable to read. And yes, I read many of the papers linked in the appendices. I kind of went deep for this one. So thank you as ever for watching to the end.
And if you do want to support the channel, check out my amazing Patreon. We have incredible networking on the Discord, plus I do podcasts and interviews and more. But regardless of all of that, I hope you have a wonderful day.