AI Conquers Gravity: Robo-dog, Trained by GPT-4, Stays Balanced on Rolling, Deflating Yoga Ball
00:00:00.000 |
If GPT-4 can train a robodog better than we can to balance on a rolling yoga ball that's 00:00:08.360 |
being kicked or deflated, what's next? Are we sure that changing a lightbulb or fixing 00:00:14.800 |
a plumbing leak is much more physically complex? And if it's a 2022 era language model, GPT-4, 00:00:23.640 |
that is doing the teaching, what does that say about the learning rates of robots taught 00:00:28.280 |
by even 2024 era AI? This Dr. Eureka paper was released less than 48 hours ago, but I'll 00:00:36.720 |
give you all the highlights and interview clips with two key figures behind the Eureka 00:00:42.140 |
concept, Jim Fan and Guanzhi Wang. But first, what is the overall concept? What are they 00:00:48.240 |
basically doing? They want to train a robot, in this case a quadruped robodog, in simulation 00:00:54.920 |
and see if they can transfer that to the real world. That's the sim to real part from simulation 00:01:00.880 |
to reality. And they want to use a language model, in this case GPT-4, to guide that process. 00:01:07.360 |
And why would it be helpful to use a language model? Well, if you have to go in as a human 00:01:12.760 |
and tweak all the different parameters, which we'll see in a moment, that takes ages. As 00:01:18.160 |
the paper says, that renders the process slow and human labor intensive. But this paper 00:01:23.840 |
isn't just about saving time. The language-model-derived reward functions perform better 00:01:29.440 |
than the human ones. In short, language models like GPT-4 are better teachers for robots. 00:01:35.740 |
So why do I think this is so much more significant than yoga balls? Language models like ChatGPT 00:01:42.360 |
are brilliant at generating hypotheses, generating ideas, as the paper says. But as we all know, 00:01:48.960 |
their great Achilles heel is hallucinations or confabulations, making stuff up, making 00:01:54.280 |
mistakes. But if those ideas, even tens of thousands of them, can be tested in simulation, 00:01:59.700 |
we can find just the good ones. Thankfully, language models are infinitely patient. And 00:02:04.800 |
so what we end up with are better implementations, in this case for robot training, than humans 00:02:10.400 |
can produce. And crucially, as the paper points out, this works for novel or new robot tasks, 00:02:16.960 |
ones not seen in the training data of the language model. And this approach isn't 00:02:21.480 |
just effective for new tasks, but for novel situations within existing tasks. We'll 00:02:27.320 |
see how all of this is done in a moment, but here is the GPT-4 trained robo-dog reacting 00:02:32.120 |
to the yoga ball being deflated. It overcomes that situation, despite not seeing that in 00:02:38.400 |
training. Before we get back to the paper, though, let's reiterate something in the 00:02:41.440 |
words of Dr. Jim Fan. Once derived in simulation, this policy is transferred zero shot to the 00:02:48.240 |
real world. Or translated, it's not relying on human demonstrations. The robo-dog doesn't 00:02:53.520 |
have to see a human or another robo-dog balancing on a yoga ball. No fine tuning, it just works. 00:03:00.320 |
Not every single time, admittedly, but we'll get to the blooper reel later on. I will need 00:03:06.260 |
to give you a minute of background on the Eureka paper, which came out in October of 00:03:11.640 |
last year, before we can get to Dr. Eureka. And let me try to summarize the paper in less 00:03:17.360 |
than 60 seconds. A reward function is a way of specifying, in code, how to measure success 00:03:23.880 |
in a task. And language models are great at coming up with them and modifying them based 00:03:29.720 |
on environmental feedback. So the NVIDIA team proposed a task, in this case spinning a pen 00:03:35.200 |
in a simulated robotic hand. Then GPT-4 would propose a way of measuring success, a reward 00:03:41.480 |
function. Of course, because it's infinitely patient, it could generate hundreds of these. 00:03:46.220 |
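To make that concrete, here is a minimal sketch of what one such reward-function candidate might look like for the pen-spinning task. It's written in the style of a batched simulator, but the tensor names, terms, and weights are illustrative assumptions, not the paper's actual output.
```python
import torch

# Hypothetical reward-function candidate for the pen-spinning task.
# Tensor names, terms, and weights are illustrative, not Eureka's real output.
def reward_candidate(pen_angular_vel: torch.Tensor,   # (num_envs, 3)
                     pen_pos: torch.Tensor,           # (num_envs, 3)
                     palm_pos: torch.Tensor,          # (num_envs, 3)
                     fingertip_force: torch.Tensor    # (num_envs, 3)
                     ) -> torch.Tensor:
    # Reward spin about the pen's long axis, clipped so it can't blow up
    spin_reward = torch.clamp(pen_angular_vel[:, 2], max=10.0)
    # Penalize the pen drifting away from the palm
    drift_penalty = torch.norm(pen_pos - palm_pos, dim=-1)
    # Lightly penalize excessive contact force so the hand doesn't crush the pen
    force_penalty = torch.norm(fingertip_force, dim=-1)
    return spin_reward - 0.5 * drift_penalty - 0.01 * force_penalty
```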
These would be tested in simulation in parallel, thanks to NVIDIA hardware. GPT-4 was then 00:03:52.360 |
prompted to reflect on the results. Then, based on those reflections, it would iterate 00:03:57.880 |
the reward functions. Spoiler alert, it got really good at spinning the pen, at least 00:04:03.680 |
in simulation. Fast forward a month and we now have Dr. Eureka. And no, GPT-4 didn't 00:04:09.520 |
go off and get a PhD. We're still using vanilla GPT-4. The 'DR' stands for domain randomization, which 00:04:15.960 |
I'll get to in a moment. Now, some of you may immediately put your hands up and say, 00:04:19.720 |
what was wrong with Eureka? Couldn't that have just worked for real world deployment? 00:04:23.960 |
In a nutshell, the basic issue is that the world is kind of weird and nitty gritty. There's 00:04:30.040 |
a lot of physics you need to contend with and aspects of the domain or environment you 00:04:35.880 |
can't quite predict. How much power will be in the robot's motors and how much friction 00:04:40.480 |
will the legs have on the ball? And then some of you might say, that's not a problem, just 00:04:44.440 |
test every single scenario. But the problem with that is that in the real world, people 00:04:50.320 |
have a limited compute budget. It's not practical in 2024 to test every single possible scenario. 00:04:58.120 |
We need to give the variables a realistic range, but not with human intuition, with 00:05:03.680 |
LLM intuition. So let me now try to run through the Dr. Eureka process, which I think is genius. 00:05:10.800 |
As with Eureka, we propose the task. What we add to Eureka is a safety instruction. 00:05:17.320 |
Basically: GPT-4, be realistic, our motors can only do this much. Other things they say 00:05:22.600 |
include: this policy is going to be deployed in the real world, be careful. So then we 00:05:27.260 |
get the GPT-4 policy or set of instructions. For example, controlling the legs of the robodog. 00:05:34.040 |
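As a rough sketch, the task description and the safety instruction might be assembled into a prompt something like this; the wording paraphrases the instructions quoted later in this video and is not the paper's verbatim prompt.
```python
# Hypothetical prompt assembly; the wording paraphrases the safety instructions
# quoted later in the video, it is not the paper's verbatim prompt.
task_description = "A quadruped robot must balance and walk on top of a yoga ball."
safety_instruction = (
    "The policy will be trained in simulation and deployed in the real world. "
    "Keep it as steady and stable as possible: keep the torso high, keep the "
    "orientation perpendicular to gravity, and penalize jittery or fast "
    "actions that may burn out the motors."
)
prompt = f"{task_description}\n\n{safety_instruction}\n\nWrite a reward function for this task."
```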
Now this is where it gets a little bit complicated, so you might have to focus. Taking that policy, 00:05:39.600 |
what they then do is isolate each variable in the environment, in this case, gravity. 00:05:45.120 |
But then they amp it right up until the policy breaks. They bring it right down until the 00:05:49.720 |
policy breaks. That gives us a viable range where the policy works. That's the 00:05:55.400 |
reward-aware part. And why limit ourselves to that range? Well, if you set hyper-unrealistic 00:06:00.760 |
settings for gravity, then we won't learn anything. The set of instructions will fail 00:06:06.560 |
every single time in that setting. So there's no signal back to the system of what works. 00:06:11.680 |
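Here's a minimal sketch of that range search, assuming a helper evaluate_policy() that runs the trained policy in simulation with a given parameter value and returns a success rate between 0 and 1; the scaling factor and success threshold are illustrative.
```python
# Hypothetical sketch of the range search. evaluate_policy is an assumed helper
# that runs the trained policy with the given parameter value and returns a
# success rate in [0, 1].
def find_viable_range(default_value, evaluate_policy,
                      factor=1.2, min_success=0.5, max_steps=30):
    def sweep(scale_up):
        value = default_value
        for _ in range(max_steps):
            candidate = value * factor if scale_up else value / factor
            if evaluate_policy(candidate) < min_success:
                break  # the policy just broke, so stop at the last working value
            value = candidate
        return value

    lower, upper = sweep(scale_up=False), sweep(scale_up=True)
    # Outside [lower, upper] the policy fails almost every time, so training
    # there would feed no useful reward signal back to the system.
    return lower, upper
```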
Keep things in a realistic range and we get a more reliable signal. Unfortunately, though, 00:06:16.560 |
that's not enough. And that's where we need domain randomization. And to explain that, 00:06:21.520 |
I have to give you a vivid demonstration. At the previous stage, we were limited to 00:06:26.520 |
ranges for these different variables that could at least sometimes work. Variables for 00:06:31.840 |
the bounciness of the ball (its restitution), friction, and gravity, as I mentioned. There 00:06:37.000 |
you can see the motor strength range. But there's no real common sense here about 00:06:41.560 |
what would happen with a yoga ball. That's why they called it an uninformative context. 00:06:47.160 |
What GPT-4 generated domain randomizations do is give a much more realistic range based 00:06:53.800 |
on common sense. Notice how with each of the ranges, we get an explanation from GPT-4 about 00:06:59.200 |
why it's picking that range. For bounciness, it says we're not focused on bouncing. It's 00:07:03.800 |
still relevant for minor impacts, though. Notice it's just between 0 and 0.5. For friction, 00:07:09.360 |
it's thinking about tiles, grass, dirt, etc. For motor strength, it's actually half of 00:07:14.680 |
the full range. And it says this is a moderate range, allowing for variability in motor performance. 00:07:20.640 |
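To make the output of this step concrete, here's an illustrative sketch of what a GPT-4-generated domain-randomization configuration could look like, with a range plus a short rationale per parameter. Only the restitution range of 0 to 0.5 is taken from the video; the other numbers are assumptions for illustration.
```python
# Illustrative domain-randomization configuration in the spirit of the paper's.
# Only the restitution range (0 to 0.5) comes from the video; everything else
# here is an assumption for illustration.
domain_randomization = {
    "restitution":    {"range": (0.0, 0.5),
                       "why": "Not focused on bouncing, but still relevant for minor impacts."},
    "friction":       {"range": (0.4, 1.2),
                       "why": "Covers surfaces like tile, grass, and dirt."},
    "motor_strength": {"range": (0.5, 1.0),
                       "why": "A moderate range allowing for variability in motor performance."},
    "gravity_scale":  {"range": (0.95, 1.05),
                       "why": "Small perturbations around nominal gravity."},
}
```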
By limiting the ranges we're going to test, we get much more effective learning. This 00:07:25.160 |
is where GPT-4 starts to outstrip humans in teaching robots. In case you didn't know, 00:07:30.480 |
by the way, GPT-4 finished training in August of 2022. How good GPT-5 is at training robots, 00:07:37.520 |
only time will tell. Now, some of you in bewilderment will be saying, but Philip, why do we even 00:07:42.640 |
need a range? Why can't we just guess a value for each of these things? Well, the real world 00:07:47.280 |
again is messy. By testing your instructions in these realistic scenarios, it becomes much 00:07:52.840 |
more robust in the real world. As we'll see, the original Eureka flops without this step. 00:07:58.100 |
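For reference, here's a minimal sketch of how such ranges typically get applied during training: at every environment reset, each physics parameter is re-sampled from its range, so the policy never overfits to one guessed value. The apply_to_simulator helper is an assumption standing in for whatever simulator API is in use.
```python
import random

# Minimal sketch of per-reset domain randomization. apply_to_simulator is an
# assumed stand-in for the real simulator API.
def randomize_environment(domain_randomization, apply_to_simulator):
    for name, spec in domain_randomization.items():
        low, high = spec["range"]
        apply_to_simulator(name, random.uniform(low, high))
```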
Before we carry on, some of you will be shaking your head and saying, I'm sure humans could 00:08:01.820 |
do better than this. Can't humans come up with better reward functions and more realistic 00:08:06.880 |
ranges? Well, here's Guanzhi Wang describing how humans get stuck in local optima. 00:08:12.040 |
It has a lot of prior knowledge, and therefore it can just propose different kinds of mutations 00:08:17.480 |
and variations of the reward function based on the environment context. For humans, you 00:08:22.600 |
need to manually tune the reward functions, and it's very easy for humans to get stuck 00:08:27.400 |
in a local optimum. GPT-4 can generate tens of reward functions at the same time, 00:08:32.800 |
and based on the performance of each reward function, it can continuously improve them. 00:08:38.280 |
Humans simply don't have the patience of large language models. Or to bring in some real 00:08:43.000 |
numbers, Dr. Eureka-trained robodogs outperform those trained with human-designed reward functions 00:08:49.600 |
and domain randomization parameters by 34% in forward velocity and 20% in distance traveled 00:08:56.640 |
across various real-world evaluation terrains: grass, pavement, you name it. By the way, 00:09:02.820 |
they also did other tasks like cube rotations and there again, Dr. Eureka's best policy 00:09:08.120 |
performs nearly 300% more of them within a fixed time period. More rotations for your 00:09:14.320 |
money, if you will. Remember, before this, we had to rely on domain 00:09:18.120 |
experts to manually perturb different parameters such as friction. And another problem, as 00:09:23.700 |
I mentioned earlier, is that the human would then have to observe how that set of instructions, 00:09:28.440 |
or policy, did, effectively testing it in the real world, and then try new reward functions. 00:09:33.080 |
All of this delay is probably why we don't have robot servants already. To clarify, this 00:09:38.140 |
is the first work to investigate whether large language models like GPT-4 themselves can 00:09:44.440 |
be used to guide this simulation to reality transfer. 00:09:48.320 |
Now what about that safety instruction I mentioned earlier? Why is that crucial? Well, this is 00:09:52.620 |
where it gets a little bit funny. Basically, without that safety instruction, GPT-4 starts 00:09:57.720 |
to behave in a degenerate fashion. Things got pretty wild with GPT-4, but I'll give 00:10:03.600 |
you the censored version. Basically, it would cheat by over-exerting the robot motors or 00:10:10.240 |
learning unnatural behavior. Essentially, it would propose things that conquer the simulation, 00:10:15.720 |
but which wouldn't work in reality. For example, it would try thrusting its hip against the 00:10:20.440 |
ground and dragging itself with three of its legs. Now I'm sure that none of you would 00:10:25.480 |
try such degenerate behavior, but GPT-4 did. Put that into the real world though, and of 00:10:30.800 |
course that behavior doesn't transfer. With that policy, the robo-dog directly face plants 00:10:36.640 |
at the starting line. More formally though, we got reward functions 00:10:40.140 |
like this. And unlike human-designed reward functions, which would involve adding each 00:10:44.800 |
component, this was multiplicative. The reward was the product of the terms above. And why 00:10:50.300 |
is that really smart? Well, if any of these tend towards zero, the product will tend towards 00:10:56.520 |
zero. If you violate the degree-of-freedom limits of the robot's joints, the entire reward 00:11:03.640 |
function will go to zero. Remember, if you multiply anything by zero, it's zero. 00:11:08.240 |
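Here's a tiny sketch of that multiplicative structure, with made-up term names; each term is assumed to lie between 0 and 1.
```python
# Made-up term names; each term is assumed to lie in [0, 1].
def multiplicative_reward(balance, velocity, joint_limits, smoothness):
    return balance * velocity * joint_limits * smoothness

# A joint-limit violation zeroes its term and hence the entire reward:
print(multiplicative_reward(0.8, 0.9, 0.0, 0.7))  # 0.0
# Simply adding the same four terms would still pay out 2.4 despite the violation.
```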
Whereas with the human-designed policy, you would add these things up and still get some 00:11:12.920 |
reward. Here are some of the examples of the kind 00:11:15.040 |
of prompts they fed GPT-4 to emphasize realism and safety. The policy, they said, will be 00:11:20.780 |
trained in simulation and deployed in the real world. So the policy, they reminded GPT-4, 00:11:25.100 |
should be as steady and stable as possible. Keep the torso high up and the orientation 00:11:30.980 |
should be perpendicular to gravity. Later, they say, please also penalize jittery or 00:11:36.340 |
fast actions that may burn out the motors. These kinds of safety-oriented prompts were 00:11:41.620 |
crucial. Here you can see GPT-4 reflecting on a reward function that had failed and coming 00:11:47.360 |
up with improvements. It was like, ah, I need an exponential reward component for the height 00:11:52.660 |
reward so that the reward gradient is smoother. Then it updates the reward function. 00:11:58.380 |
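A hypothetical sketch of the kind of fix that reflection describes: an exponential height term whose gradient stays smooth, rather than a hard threshold. The target height and temperature here are illustrative assumptions.
```python
import torch

# Illustrative exponential height term; target height and temperature are assumptions.
def height_reward(torso_height: torch.Tensor, target: float = 0.6,
                  temperature: float = 0.25) -> torch.Tensor:
    # Maximal when the torso is at the target height and decaying smoothly,
    # so every small improvement earns a little more reward.
    return torch.exp(-torch.abs(torso_height - target) / temperature)
```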
And here's another way that Dr. Eureka outperforms human training. When humans are trying to 00:12:03.640 |
teach a robot a skill, they often come up with a curriculum, a set of things to learn 00:12:09.620 |
in a particular order. So first they might teach a robot to move at half a meter per 00:12:14.940 |
second, then one meter, then two meters per second. These curricula have to be carefully 00:12:20.020 |
designed. Well, with this approach, we don't need a reward curriculum. It's almost like 00:12:24.700 |
the model throws out the human textbook and teaches itself. Oh, and why a yoga ball, by 00:12:30.660 |
the way? Well, apparently they were inspired by the circus. It does make you wonder what 00:12:36.060 |
they're going to try next, but let's see. And what about limitations? Well, if you remember 00:12:40.220 |
from earlier, they didn't incorporate any real world feedback, but of course they admit 00:12:45.060 |
that with dynamic adjustment of domain randomization parameters based on policy performance or 00:12:51.140 |
real world feedback, they could of course further improve the simulation to reality 00:12:55.820 |
transferability. I actually had a discussion with Jim Fan about all of this back on my 00:13:01.180 |
Patreon in January, and one of the things we discussed was another way to improve this 00:13:06.460 |
approach, incorporating vision. If GPT-4 could see where the robot is going wrong and not 00:13:11.900 |
just read about it, it could do far better. And how about one more way to improve this 00:13:17.580 |
approach? Co-evolution. Apologies for the slight audio distortion here. I honestly am 00:13:23.580 |
struggling to see what the limit will be, and I'm wondering what you think about the 00:13:28.860 |
limit to the Eureka approach as we are getting more and more powerful models. 00:13:35.300 |
I think that is a great question. You know, just by sheer coincidence, people are talking 00:13:39.780 |
about Q-star and there's this renewed interest in LLMs combining with classical approaches 00:13:46.060 |
like search, right? Instead of just generating, you generate and then you get some feedback 00:13:51.420 |
and you generate more, you would do a little bit of search, and then you expand that search 00:13:55.700 |
and that kind of comes back to improve the model and also improve just the intelligence 00:14:00.460 |
of the whole system. And actually, Eureka is doing exactly that. It uses GPT-4 to write 00:14:07.220 |
reward functions, and the reward function instructs a robot hand to do tasks, and you 00:14:12.020 |
get feedback. You know how good that robot is performing. And you can use that as a ground 00:14:16.540 |
truth signal to improve even more, which we did in the paper. And one limitation is that 00:14:22.460 |
we are not able to fine-tune GPT, but it's possible that some of the open-source models 00:14:27.820 |
will catch up in the future. And actually, we are also actively exploring how to use 00:14:32.140 |
some open-source models in the loop for Eureka. Well, that means we will be able to not just 00:14:37.060 |
improve in context, but also improve the intelligence of the underlying language model. So basically the 00:14:43.100 |
LLM and the Eureka and the robots, they can co-evolve and co-improve. And then, you know, 00:14:50.140 |
that means basically the sky's the limit. Or, you know, compute budget is the limit. 00:14:54.860 |
In case you were wondering, all of this is open-source and the links will be in the description. 00:14:59.900 |
But what about the bigger implications? I predict that within a year, we will see a 00:15:05.220 |
humanoid robot perform a complex physical dexterous task, one that is performed commonly 00:15:11.140 |
in industry. That could be the wake-up call for many that the blue-collar world isn't 00:15:16.940 |
completely immune to AI. Of course, there's a long way to go between where we are and 00:15:21.940 |
the mass manufacturing of the robots needed to affect jobs at a big scale. So of course, 00:15:28.180 |
plumbers are safe for now. In high-stakes settings like self-driving, we're clearly 00:15:32.900 |
not quite ready for widespread deployment. Although Waymo is doing pretty well. But for 00:15:38.140 |
repetitive tasks, things might change faster than you think. And if you believe that the 00:15:43.500 |
dexterity of human fingers is what will differentiate us, then Sanctuary AI will soon be on your 00:15:51.220 |
case. And with AI doing the training in parallel across thousands of simulations, things could 00:15:57.480 |
change fast. Just an amazing paper and super enjoyable to read. And yes, I read many of 00:16:03.380 |
the papers linked in the appendices. I kind of went deep for this one. So thank you as 00:16:08.700 |
ever for watching to the end. And if you do want to support the channel, check out my 00:16:13.580 |
amazing Patreon. We have incredible networking on the Discord, plus I do podcasts and interviews 00:16:19.740 |
and more. But regardless of all of that, I hope you have a wonderful day.