AI Conquers Gravity: Robo-dog, Trained by GPT-4, Stays Balanced on Rolling, Deflating Yoga Ball
00:00:00.000 |
If GPT-4 can train a robodog better than we can to balance on a rolling yoga ball that's 00:00:08.360 |
being kicked or deflated, what's next? Are we sure that changing a lightbulb or fixing 00:00:14.800 |
a plumbing leak is much more physically complex? And if it's a 2022 era language model, GPT-4, 00:00:23.640 |
that is doing the teaching, what does that say about the learning rates of robots taught 00:00:28.280 |
by even 2024 era AI? This Dr. Eureka paper was released less than 48 hours ago, but I'll 00:00:36.720 |
give you all the highlights and interview clips with two key figures behind the Eureka 00:00:42.140 |
concept, Jim Fan and Guanzhi Wang. But first, what is the overall concept? What are they 00:00:48.240 |
basically doing? They want to train a robot, in this case a quadruped robodog, in simulation 00:00:54.920 |
and see if they can transfer that to the real world. That's the sim to real part from simulation 00:01:00.880 |
to reality. And they want to use a language model, in this case GPT-4, to guide that process. 00:01:07.360 |
And why would it be helpful to use a language model? Well, if you have to go in as a human 00:01:12.760 |
and tweak all the different parameters, which we'll see in a moment, that takes ages. As 00:01:18.160 |
the paper says, that renders the process slow and human labor intensive. But this paper 00:01:23.840 |
isn't just about saving time. The language-model-derived reward functions perform better 00:01:29.440 |
than the human ones. In short, language models like GPT-4 are better teachers for robots. 00:01:35.740 |
So why do I think this is so much more significant than yoga balls? Language models like ChatGPT 00:01:42.360 |
are brilliant at generating hypotheses, generating ideas, as the paper says. But as we all know, 00:01:48.960 |
their great Achilles heel is hallucinations or confabulations, making stuff up, making 00:01:54.280 |
mistakes. But if those ideas, even tens of thousands of them, can be tested in simulation, 00:01:59.700 |
we can find just the good ones. Thankfully, language models are infinitely patient. And 00:02:04.800 |
so what we end up with are better implementations, in this case for robot training, than humans 00:02:10.400 |
can produce. And crucially, as the paper points out, this works for novel or new robot tasks, 00:02:16.960 |
ones not seen in the training data of the language model. And this approach isn't 00:02:21.480 |
just effective for new tasks, but for novel situations within existing tasks. We'll 00:02:27.320 |
see how all of this is done in a moment, but here is the GPT-4 trained robo-dog reacting 00:02:32.120 |
to the yoga ball being deflated. It overcomes that situation, despite not seeing that in 00:02:38.400 |
training. Before we get back to the paper, though, let's reiterate something in the 00:02:41.440 |
words of Dr. Jim Fan. Once derived in simulation, this policy is transferred zero shot to the 00:02:48.240 |
real world. Or translated, it's not relying on human demonstrations. The robo-dog doesn't 00:02:53.520 |
have to see a human or another robo-dog balancing on a yoga ball. No fine tuning, it just works. 00:03:00.320 |
Not every single time, admittedly, but we'll get to the blooper reel later on. I will need 00:03:06.260 |
to give you a minute of background on the Eureka paper, which came out in October of 00:03:11.640 |
last year, before we can get to Dr. Eureka. And let me try to summarize the paper in less 00:03:17.360 |
than 60 seconds. A reward function is a way of specifying, in code, how to measure success 00:03:23.880 |
in a task. And language models are great at coming up with them and modifying them based 00:03:29.720 |
on environmental feedback. So the NVIDIA team proposed a task, in this case spinning a pen 00:03:35.200 |
in a simulated robotic hand. Then GPT-4 would propose a way of measuring success, a reward 00:03:41.480 |
function. Of course, because it's infinitely patient, it could generate hundreds of these. 00:03:46.220 |
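To make that concrete, here is a minimal sketch of what one such reward-function candidate might look like for the pen-spinning task. It's written in the style of a batched simulator, but the tensor names, terms, and weights are illustrative assumptions, not the paper's actual output.
```python
import torch

# Hypothetical reward-function candidate for the pen-spinning task.
# Tensor names, terms, and weights are illustrative, not Eureka's real output.
def reward_candidate(pen_angular_vel: torch.Tensor,   # (num_envs, 3)
                     pen_pos: torch.Tensor,           # (num_envs, 3)
                     palm_pos: torch.Tensor,          # (num_envs, 3)
                     fingertip_force: torch.Tensor    # (num_envs, 3)
                     ) -> torch.Tensor:
    # Reward spin about the pen's long axis, clipped so it can't blow up
    spin_reward = torch.clamp(pen_angular_vel[:, 2], max=10.0)
    # Penalize the pen drifting away from the palm
    drift_penalty = torch.norm(pen_pos - palm_pos, dim=-1)
    # Lightly penalize excessive contact force so the hand doesn't crush the pen
    force_penalty = torch.norm(fingertip_force, dim=-1)
    return spin_reward - 0.5 * drift_penalty - 0.01 * force_penalty
```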
These would be tested in simulation in parallel, thanks to NVIDIA hardware. GPT-4 was then 00:03:52.360 |
prompted to reflect on the results. Then, based on those reflections, it would iterate 00:03:57.880 |
the reward functions. Spoiler alert, it got really good at spinning the pen, at least 00:04:03.680 |
in simulation. Fast forward a month and we now have Dr. Eureka. And no, GPT-4 didn't 00:04:09.520 |
go off and get a PhD. We're still using vanilla GPT-4. The 'DR' stands for domain randomization, which 00:04:15.960 |
I'll get to in a moment. Now, some of you may immediately put your hands up and say, 00:04:19.720 |
what was wrong with Eureka? Couldn't that have just worked for real world deployment? 00:04:23.960 |
In a nutshell, the basic issue is that the world is kind of weird and nitty gritty. There's 00:04:30.040 |
a lot of physics you need to contend with and aspects of the domain or environment you 00:04:35.880 |
can't quite predict. How much power will be in the robot's motors and how much friction 00:04:40.480 |
will the legs have on the ball? And then some of you might say, that's not a problem, just 00:04:44.440 |
test every single scenario. But the problem with that is that in the real world, people 00:04:50.320 |
have a limited compute budget. It's not practical in 2024 to test every single possible scenario. 00:04:58.120 |
We need to give the variables a realistic range, but not with human intuition, with 00:05:03.680 |
LLM intuition. So let me now try to run through the Dr. Eureka process, which I think is genius. 00:05:10.800 |
As with Eureka, we propose the task. What we add to Eureka is a safety instruction. 00:05:17.320 |
Basically: GPT-4, be realistic, our motors can only do this much. Other things they say 00:05:22.600 |
include: this policy is going to be deployed in the real world, be careful. So then we 00:05:27.260 |
get the GPT-4 policy or set of instructions. For example, controlling the legs of the robodog. 00:05:34.040 |
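As a rough sketch, the task description and the safety instruction might be assembled into a prompt something like this; the wording paraphrases the instructions quoted later in this video and is not the paper's verbatim prompt.
```python
# Hypothetical prompt assembly; the wording paraphrases the safety instructions
# quoted later in the video, it is not the paper's verbatim prompt.
task_description = "A quadruped robot must balance and walk on top of a yoga ball."
safety_instruction = (
    "The policy will be trained in simulation and deployed in the real world. "
    "Keep it as steady and stable as possible: keep the torso high, keep the "
    "orientation perpendicular to gravity, and penalize jittery or fast "
    "actions that may burn out the motors."
)
prompt = f"{task_description}\n\n{safety_instruction}\n\nWrite a reward function for this task."
```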
Now this is where it gets a little bit complicated, so you might have to focus. Taking that policy, 00:05:39.600 |
what they then do is isolate each variable in the environment, in this case, gravity. 00:05:45.120 |
But then they amp it right up until the policy breaks. They bring it right down until the 00:05:49.720 |
policy breaks. That gives us a viable range where the policy works. That's the 00:05:55.400 |
reward-aware part. And why limit ourselves to that range? Well, if you set hyper-unrealistic 00:06:00.760 |
settings for gravity, then we won't learn anything. The set of instructions will fail 00:06:06.560 |
every single time in that setting. So there's no signal back to the system of what works. 00:06:11.680 |
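Here's a minimal sketch of that range search, assuming a helper evaluate_policy() that runs the trained policy in simulation with a given parameter value and returns a success rate between 0 and 1; the scaling factor and success threshold are illustrative.
```python
# Hypothetical sketch of the range search. evaluate_policy is an assumed helper
# that runs the trained policy with the given parameter value and returns a
# success rate in [0, 1].
def find_viable_range(default_value, evaluate_policy,
                      factor=1.2, min_success=0.5, max_steps=30):
    def sweep(scale_up):
        value = default_value
        for _ in range(max_steps):
            candidate = value * factor if scale_up else value / factor
            if evaluate_policy(candidate) < min_success:
                break  # the policy just broke, so stop at the last working value
            value = candidate
        return value

    lower, upper = sweep(scale_up=False), sweep(scale_up=True)
    # Outside [lower, upper] the policy fails almost every time, so training
    # there would feed no useful reward signal back to the system.
    return lower, upper
```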
Keep things in a realistic range and we get a more reliable signal. Unfortunately, though, 00:06:16.560 |
that's not enough. And that's where we need domain randomization. And to explain that, 00:06:21.520 |
I have to give you a vivid demonstration. At the previous stage, we were limited to 00:06:26.520 |
ranges for these different variables that could at least sometimes work. Variables for 00:06:31.840 |
the bounciness of the ball (its restitution), friction, and gravity, as I mentioned. There 00:06:37.000 |
you can see the motor strength range. But there's no real common sense here about 00:06:41.560 |
what would happen with a yoga ball. That's why they called it an uninformative context. 00:06:47.160 |
What GPT-4 generated domain randomizations do is give a much more realistic range based 00:06:53.800 |
on common sense. Notice how with each of the ranges, we get an explanation from GPT-4 about 00:06:59.200 |
why it's picking that range. For bounciness, it says we're not focused on bouncing. It's 00:07:03.800 |
still relevant for minor impacts, though. Notice it's just between 0 and 0.5. For friction, 00:07:09.360 |
it's thinking about tiles, grass, dirt, etc. For motor strength, it's actually half of 00:07:14.680 |
the full range. And it says this is a moderate range, allowing for variability in motor performance. 00:07:20.640 |
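To make the output of this step concrete, here's an illustrative sketch of what a GPT-4-generated domain-randomization configuration could look like, with a range plus a short rationale per parameter. Only the restitution range of 0 to 0.5 is taken from the video; the other numbers are assumptions for illustration.
```python
# Illustrative domain-randomization configuration in the spirit of the paper's.
# Only the restitution range (0 to 0.5) comes from the video; everything else
# here is an assumption for illustration.
domain_randomization = {
    "restitution":    {"range": (0.0, 0.5),
                       "why": "Not focused on bouncing, but still relevant for minor impacts."},
    "friction":       {"range": (0.4, 1.2),
                       "why": "Covers surfaces like tile, grass, and dirt."},
    "motor_strength": {"range": (0.5, 1.0),
                       "why": "A moderate range allowing for variability in motor performance."},
    "gravity_scale":  {"range": (0.95, 1.05),
                       "why": "Small perturbations around nominal gravity."},
}
```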
By limiting the ranges we're going to test, we get much more effective learning. This 00:07:25.160 |
is where GPT-4 starts to outstrip humans in teaching robots. In case you didn't know, 00:07:30.480 |
by the way, GPT-4 finished training in August of 2022. How good GPT-5 is at training robots, 00:07:37.520 |
only time will tell. Now, some of you in bewilderment will be saying, but Philip, why do we even 00:07:42.640 |
need a range? Why can't we just guess a value for each of these things? Well, the real world 00:07:47.280 |
again is messy. By testing your instructions in these realistic scenarios, it becomes much 00:07:52.840 |
more robust in the real world. As we'll see, the original Eureka flops without this step. 00:07:58.100 |
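For reference, here's a minimal sketch of how such ranges typically get applied during training: at every environment reset, each physics parameter is re-sampled from its range, so the policy never overfits to one guessed value. The apply_to_simulator helper is an assumption standing in for whatever simulator API is in use.
```python
import random

# Minimal sketch of per-reset domain randomization. apply_to_simulator is an
# assumed stand-in for the real simulator API.
def randomize_environment(domain_randomization, apply_to_simulator):
    for name, spec in domain_randomization.items():
        low, high = spec["range"]
        apply_to_simulator(name, random.uniform(low, high))
```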
Before we carry on, some of you will be shaking your head and saying, I'm sure humans could 00:08:01.820 |
do better than this. Can't humans come up with better reward functions and more realistic 00:08:06.880 |
ranges? Well, here's Guanzhi Wang describing how humans get stuck in local optima. 00:08:12.040 |
It has a lot of prior knowledge, and therefore it can just propose different kinds of mutations 00:08:17.480 |
and variations of the reward function based on the environment context. For humans, you 00:08:22.600 |
need to manually tune the reward functions, and it's very easy for humans to get stuck 00:08:27.400 |
in a local optimum. GPT-4 can generate tens of reward functions at the same time, 00:08:32.800 |
and based on the performance of each reward function, it can continuously improve them. 00:08:38.280 |
Humans simply don't have the patience of large language models. Or to bring in some real 00:08:43.000 |
numbers, Dr. Eureka-trained robodogs outperform those trained with human-designed reward functions 00:08:49.600 |
and domain randomization parameters by 34% in forward velocity and 20% in distance traveled 00:08:56.640 |
across various real-world evaluation terrains: grass, pavement, you name it. By the way, 00:09:02.820 |
they also did other tasks like cube rotations and there again, Dr. Eureka's best policy 00:09:08.120 |
performs nearly 300% more of them within a fixed time period. More rotations for your 00:09:14.320 |
money, if you will. Remember, before this, we had to rely on domain 00:09:18.120 |
experts to manually perturb different parameters such as friction. And another problem, as 00:09:23.700 |
I mentioned earlier, is that the human would then have to observe how that set of instructions, 00:09:28.440 |
or policy, did, effectively testing it in the real world, and then try new reward functions. 00:09:33.080 |
All of this delay is probably why we don't have robot servants already. To clarify, this 00:09:38.140 |
is the first work to investigate whether large language models like GPT-4 themselves can 00:09:44.440 |
be used to guide this simulation to reality transfer. 00:09:48.320 |
Now what about that safety instruction I mentioned earlier? Why is that crucial? Well, this is 00:09:52.620 |
where it gets a little bit funny. Basically, without that safety instruction, GPT-4 starts 00:09:57.720 |
to behave in a degenerate fashion. Things got pretty wild with GPT-4, but I'll give 00:10:03.600 |
you the censored version. Basically, it would cheat by over-exerting the robot motors or 00:10:10.240 |
learning unnatural behavior. Essentially, it would propose things that conquer the simulation, 00:10:15.720 |
but which wouldn't work in reality. For example, it would try thrusting its hip against the 00:10:20.440 |
ground and dragging itself with three of its legs. Now I'm sure that none of you would 00:10:25.480 |
try such degenerate behavior, but GPT-4 did. Put that into the real world though, and of 00:10:30.800 |
course that behavior doesn't transfer. With that policy, the robo-dog directly face plants 00:10:36.640 |
at the starting line. More formally though, we got reward functions 00:10:40.140 |
like this. And unlike human-designed reward functions, which would involve adding each 00:10:44.800 |
component, this was multiplicative. The reward was the product of the terms above. And why 00:10:50.300 |
is that really smart? Well, if any of these tend towards zero, the product will tend towards 00:10:56.520 |
zero. If you violate the degree-of-freedom limits of the robot's joints, the entire reward 00:11:03.640 |
function will go to zero. Remember, if you multiply anything by zero, it's zero. 00:11:08.240 |
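Here's a tiny sketch of that multiplicative structure, with made-up term names; each term is assumed to lie between 0 and 1.
```python
# Made-up term names; each term is assumed to lie in [0, 1].
def multiplicative_reward(balance, velocity, joint_limits, smoothness):
    return balance * velocity * joint_limits * smoothness

# A joint-limit violation zeroes its term and hence the entire reward:
print(multiplicative_reward(0.8, 0.9, 0.0, 0.7))  # 0.0
# Simply adding the same four terms would still pay out 2.4 despite the violation.
```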
Whereas with the human-designed policy, you would add these things up and still get some 00:11:12.920 |
reward. Here are some of the examples of the kind 00:11:15.040 |
of prompts they fed GPT-4 to emphasize realism and safety. The policy, they said, will be 00:11:20.780 |
trained in simulation and deployed in the real world. So the policy, they reminded GPT-4, 00:11:25.100 |
should be as steady and stable as possible. Keep the torso high up and the orientation 00:11:30.980 |
should be perpendicular to gravity. Later, they say, please also penalize jittery or 00:11:36.340 |
fast actions that may burn out the motors. These kinds of safety-oriented prompts were 00:11:41.620 |
crucial. Here you can see GPT-4 reflecting on a reward function that had failed and coming 00:11:47.360 |
up with improvements. It was like, ah, I need an exponential reward component for the height 00:11:52.660 |
reward so that the reward gradient is smoother. Then it updates the reward function. 00:11:58.380 |
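A hypothetical sketch of the kind of fix that reflection describes: an exponential height term whose gradient stays smooth, rather than a hard threshold. The target height and temperature here are illustrative assumptions.
```python
import torch

# Illustrative exponential height term; target height and temperature are assumptions.
def height_reward(torso_height: torch.Tensor, target: float = 0.6,
                  temperature: float = 0.25) -> torch.Tensor:
    # Maximal when the torso is at the target height and decaying smoothly,
    # so every small improvement earns a little more reward.
    return torch.exp(-torch.abs(torso_height - target) / temperature)
```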
And here's another way that Dr. Eureka outperforms human training. When humans are trying to 00:12:03.640 |
teach a robot a skill, they often come up with a curriculum, a set of things to learn 00:12:09.620 |
in a particular order. So first they might teach a robot to move at half a meter per 00:12:14.940 |
second, then one meter, then two meters per second. These curricula have to be carefully 00:12:20.020 |
designed. Well, with this approach, we don't need a reward curriculum. It's almost like 00:12:24.700 |
the model throws out the human textbook and teaches itself. Oh, and why a yoga ball, by 00:12:30.660 |
the way? Well, apparently they were inspired by the circus. It does make you wonder what 00:12:36.060 |
they're going to try next, but let's see. And what about limitations? Well, if you remember 00:12:40.220 |
from earlier, they didn't incorporate any real world feedback, but of course they admit 00:12:45.060 |
that with dynamic adjustment of domain randomization parameters based on policy performance or 00:12:51.140 |
real world feedback, they could of course further improve the simulation to reality 00:12:55.820 |
transferability. I actually had a discussion with Jim Fan about all of this back on my 00:13:01.180 |
Patreon in January, and one of the things we discussed was another way to improve this 00:13:06.460 |
approach, incorporating vision. If GPT-4 could see where the robot is going wrong and not 00:13:11.900 |
just read about it, it could do far better. And how about one more way to improve this 00:13:17.580 |
approach? Co-evolution. Apologies for the slight audio distortion here. I honestly am 00:13:23.580 |
struggling to see what the limit will be, and I'm wondering what you think about the 00:13:28.860 |
limit to the Eureka approach as we are getting more and more powerful models. 00:13:35.300 |
I think that is a great question. You know, just by sheer coincidence, people are talking 00:13:39.780 |
about Q-star and there's this renewed interest in LLMs combining with classical approaches 00:13:46.060 |
like search, right? Instead of just generating, you generate and then you get some feedback 00:13:51.420 |
and you generate more, you would do a little bit of search, and then you expand that search 00:13:55.700 |
and that kind of comes back to improve the model and also improve just the intelligence 00:14:00.460 |
of the whole system. And actually, Eureka is doing exactly that. It uses GPT-4 to write 00:14:07.220 |
reward functions, and the reward function instructs a robot hand to do tasks, and you 00:14:12.020 |
get feedback. You know how good that robot is performing. And you can use that as a ground 00:14:16.540 |
truth signal to improve even more, which we did in the paper. And one limitation is that 00:14:22.460 |
we are not able to fine-tune GPT, but it's possible that some of the open-source models 00:14:27.820 |
will catch up in the future. And actually, we are also actively exploring how to use 00:14:32.140 |
some open-source models in the loop for Eureka. Well, that means we will be able to not just 00:14:37.060 |
improve in context, but also improve the intelligence of the underlying language model. So basically the 00:14:43.100 |
LLM and the Eureka and the robots, they can co-evolve and co-improve. And then, you know, 00:14:50.140 |
that means basically the sky's the limit. Or, you know, compute budget is the limit. 00:14:54.860 |
In case you were wondering, all of this is open-source and the links will be in the description. 00:14:59.900 |
But what about the bigger implications? I predict that within a year, we will see a 00:15:05.220 |
humanoid robot perform a complex physical dexterous task, one that is performed commonly 00:15:11.140 |
in industry. That could be the wake-up call for many that the blue-collar world isn't 00:15:16.940 |
completely immune to AI. Of course, there's a long way to go between where we are and 00:15:21.940 |
the mass manufacturing of the robots needed to affect jobs at a big scale. So of course, 00:15:28.180 |
plumbers are safe for now. In high-stakes settings like self-driving, we're clearly 00:15:32.900 |
not quite ready for widespread deployment. Although Waymo is doing pretty well. But for 00:15:38.140 |
repetitive tasks, things might change faster than you think. And if you believe that the 00:15:43.500 |
dexterity of human fingers is what will differentiate us, then Sanctuary AI will soon be on your 00:15:51.220 |
case. And with AI doing the training in parallel across thousands of simulations, things could 00:15:57.480 |
change fast. Just an amazing paper and super enjoyable to read. And yes, I read many of 00:16:03.380 |
the papers linked in the appendices. I kind of went deep for this one. So thank you as 00:16:08.700 |
ever for watching to the end. And if you do want to support the channel, check out my 00:16:13.580 |
amazing Patreon. We have incredible networking on the Discord, plus I do podcasts and interviews 00:16:19.740 |
and more. But regardless of all of that, I hope you have a wonderful day.