Pieter Abbeel: Deep Reinforcement Learning | Lex Fridman Podcast #10
Chapters
0:00 Intro
0:41 Robot tennis
4:05 Robot parkour
4:53 The psychology of robots
7:05 Can robots have emotion
10:40 Intuition behind RL
15:05 Time skills
16:17 Limitations
20:30 Reusable results
24:50 Modularity
26:38 Mathematical Formalisation
28:27 Self-play
33:51 Simulation
35:30 Safety
38:47 Human evolution
The following is a conversation with Pieter Abbeel. 00:00:03.120 |
He's a professor at UC Berkeley and the director of the Berkeley Robotics Learning Lab. 00:00:07.840 |
He's one of the top researchers in the world working on how we make robots understand and interact with the world around them, 00:00:15.360 |
especially using imitation and deep reinforcement learning. 00:00:19.720 |
This conversation is part of the MIT course on artificial general intelligence and the Artificial Intelligence Podcast. 00:00:26.400 |
If you enjoy it, please subscribe on YouTube, iTunes, or your podcast provider of choice, 00:00:31.680 |
or simply connect with me on Twitter @lexfridman, spelled F-R-I-D. 00:00:36.920 |
And now, here's my conversation with Pieter Abbeel. 00:00:41.400 |
You've mentioned that if there was one person you could meet, it would be Roger Federer. 00:00:46.200 |
So let me ask, when do you think we'll have a robot that fully autonomously can beat Roger Federer at tennis? 00:00:57.520 |
Well, first, if you can make it happen for me to meet Roger, let me know. 00:01:02.560 |
In terms of getting a robot to beat him at tennis, it's kind of an interesting question, 00:01:08.920 |
because for a lot of the challenges we think about in AI, the software is really the missing piece. 00:01:16.760 |
But for something like this, the hardware is nowhere near either. 00:01:22.640 |
To really have a robot that can physically run around, the Boston Dynamics robots are starting to get there, 00:01:28.520 |
but still not really human-level ability to run around and then swing a racket. 00:01:38.320 |
I don't think it's a hardware problem only. I think it's a hardware and a software problem. 00:01:41.560 |
I think it's both. And I think they'll have independent progress. 00:01:45.640 |
So I'd say the hardware, maybe in 10, 15 years. 00:01:51.560 |
On clay, not grass. I mean, grass is probably harder. 00:01:55.000 |
Well, the clay, I'm not sure what's harder, grass or clay. 00:01:58.840 |
The clay involves sliding, which might be harder to master, actually. 00:02:05.880 |
But you're not limited to bipedal. I'm sure there's no... 00:02:09.880 |
Well, if we can build a machine, it's a whole different question, of course. 00:02:13.080 |
If you can say, "Okay, this robot can be on wheels, it can move around on wheels, 00:02:17.560 |
and can be designed differently," then I think that can be done sooner. 00:02:26.120 |
What do you think about swinging a racket? So you've worked on basic manipulation. 00:02:31.160 |
How hard do you think is the task of swinging a racket, 00:02:34.120 |
with being able to hit a nice backhand or a forehand? 00:02:39.240 |
Let's say we just set up stationary, a nice robot arm, let's say, 00:02:43.960 |
you know, a standard industrial arm, and it can watch the ball come 00:02:48.360 |
and then swing the racket. It's a good question. 00:02:58.040 |
If we do it with reinforcement learning, it would require a lot of trial and error. 00:03:01.400 |
It's not going to swing it right the first time around. 00:03:03.240 |
But yeah, I don't see why it couldn't swing it the right way. 00:03:09.320 |
I think it's learnable. I think if you set up a ball machine, 00:03:11.960 |
let's say, on one side, and then a robot with a tennis racket on the other side, 00:03:17.640 |
I think it's learnable. And maybe a little bit of pre-training and simulation. 00:03:23.000 |
Yeah, I think that's feasible. I think swinging the racket is feasible. 00:03:27.160 |
It'd be very interesting to see how much precision it can get. 00:03:31.320 |
Because, I mean, that's where some of the human players can hit it... 00:03:41.960 |
Whether RL can learn to put a spin on the ball. 00:03:45.560 |
Well, you got me interested. Maybe someday we'll set this up. 00:03:51.080 |
Your answer is basically, okay, for this problem, it sounds fascinating. 00:03:54.120 |
But for the general problem of a tennis player, we might be a little bit farther away. 00:03:57.960 |
What's the most impressive thing you've seen a robot do in the physical world? 00:04:04.120 |
So physically, for me, it's the Boston Dynamics videos. 00:04:10.920 |
They always just hit home, and I'm just super impressed. 00:04:14.200 |
Recently, the robot running up the stairs, doing the parkour type thing. 00:04:19.400 |
I mean, yes, we don't know what's underneath. 00:04:23.880 |
But even if it's hard-coded underneath, which it might or might not be, 00:04:28.360 |
just the physical ability to do that parkour, that's very impressive. 00:04:32.600 |
So have you met Spot Mini or any of those robots in person? 00:04:36.680 |
I met Spot Mini last year in April at the MARS event that Jeff Bezos organizes. 00:04:42.840 |
They brought it out there and it was nicely following around Jeff. 00:04:47.640 |
When Jeff left the room, they had it follow him along, which was pretty impressive. 00:04:52.120 |
So I think it's fair to say with some confidence that there's no learning going on in those robots, 00:04:58.920 |
or if there's any learning going on, it's very limited. 00:05:03.400 |
I met Spot Mini earlier this year and knowing everything that's going on, 00:05:08.680 |
having one-on-one interaction, so I get to spend some time alone. 00:05:12.360 |
And there's immediately a deep connection on the psychological level. 00:05:18.200 |
Even though you know the fundamentals, how it works, there's something magical. 00:05:23.240 |
So do you think about the psychology of interacting with robots in the physical world? 00:05:29.000 |
Even, you know, you just showed me the PR2 robot, and there was a little bit of something like a face. 00:05:38.360 |
There's something that immediately draws you to it. 00:05:40.520 |
Do you think about that aspect of the robotics problem? 00:05:48.200 |
Well, we gave him a name: BRETT, Berkeley Robot for the Elimination of Tedious Tasks. 00:05:52.040 |
It's very hard to not think of the robot as a person. 00:05:56.440 |
And it seems like everybody calls them a "he" for whatever reason, 00:05:59.400 |
but that also makes it more of a person than if it was an "it." 00:06:01.880 |
And it seems pretty natural to think of it that way. 00:06:08.520 |
I've seen Pepper many times on videos, but then I was at an event organized by, 00:06:15.160 |
this was by Fidelity, and they had scripted Pepper to help moderate some sessions. 00:06:22.600 |
And they had scripted Pepper to have the personality of a child a little bit. 00:06:26.360 |
And it was very hard to not think of it as its own person in some sense, 00:06:31.720 |
because it was just kind of jumping, it would just jump into conversation, 00:06:35.720 |
The moderator would be saying something, and Pepper would just jump in, "Hold on, how about me? 00:06:41.240 |
And just like, "Okay, this is like a person." 00:06:45.400 |
And even then it was hard not to have that sense of somehow there is something there. 00:06:50.520 |
So as we have robots interact in this physical world, 00:06:54.280 |
is that a signal that could be used in reinforcement learning? 00:06:57.080 |
You've worked a little bit in this direction, 00:07:00.120 |
but do you think that psychology can be somehow pulled in? 00:07:02.920 |
Yes, that's a question I would say a lot of people ask. 00:07:08.920 |
And I think part of why they ask it is they're thinking about it 00:07:16.520 |
after they see some results: they see a computer play Go, 00:07:19.640 |
they see a computer do this and that, and they're like, "Okay, but can it really have emotion? 00:07:26.680 |
And then once you're around robots, you already start feeling it. 00:07:29.960 |
And I think that kind of maybe methodologically, the way that I think of it is, 00:07:34.280 |
if you run something like reinforcement learning, it's about optimizing some objective. 00:07:39.000 |
And there's no reason that the objective couldn't be tied into 00:07:47.480 |
how much does a person like interacting with this system? 00:07:50.600 |
And why couldn't the reinforcement learning system optimize for that? 00:07:56.040 |
And why wouldn't it then naturally become more and more attractive and more and more, 00:08:00.440 |
maybe like a person or like a pet, I don't know what it would exactly be, 00:08:04.520 |
but more and more have those features and acquire them automatically. 00:08:08.120 |
- As long as you can formalize an objective of what it means to like something, 00:08:19.320 |
Because you have to somehow collect that information from you, the human. 00:08:22.280 |
But you're saying if you can formulate it as an objective, it can be learned. 00:08:26.840 |
- There's no reason it couldn't emerge through learning. 00:08:29.320 |
And maybe one way to formulate it as an objective: 00:08:31.400 |
you wouldn't necessarily have to score it explicitly. 00:08:33.720 |
So standard rewards are numbers, and numbers are hard to come by. 00:08:45.320 |
"Okay, what you did the last five minutes was much nicer 00:08:53.000 |
And in fact, there have been some results in that. 00:08:55.160 |
For example, Paul Christiano and collaborators at OpenAI had the hopper, 00:09:00.040 |
the MuJoCo hopper, a one-legged robot, learn to do backflips. 00:09:03.720 |
Purely from feedback of the form "I like this better than that," 00:09:10.840 |
it figured out what the person was asking for, namely a backflip. 00:09:18.520 |
It was just getting a score from the comparisons. 00:09:21.880 |
- The person having in mind, in their own mind, 00:09:27.320 |
but the robot didn't know what it was supposed to be doing. 00:09:35.880 |
what the person was actually after was a backflip. 00:09:38.600 |
And I'd imagine the same would be true for things like, 00:09:45.000 |
"Oh, this kind of thing apparently is appreciated more than that." 00:09:53.880 |
Richard Sutton's reinforcement learning book describes reinforcement learning 00:10:03.160 |
as a powerful mechanism for machine learning. 00:10:14.840 |
So how do you think we can possibly learn anything 00:10:20.200 |
about the world when the reward for the actions is so sparse? 00:10:52.040 |
you do something maybe for like, I don't know, 00:10:55.000 |
you take a hundred actions and then you get a reward. 00:10:59.480 |
And I'm like, okay, three, not sure what that means. 00:11:04.280 |
And now you know that that sequence of a hundred actions 00:11:08.120 |
somehow was worse than the sequence of a hundred actions 00:11:15.000 |
Some might've been good and some bad in either one. 00:11:16.760 |
And so that's why you need so many experiences: 00:11:24.360 |
to figure out what is consistently there when you get a higher reward, 00:11:27.640 |
and what's consistently there when you get a lower reward. 00:11:31.880 |
And essentially the policy gradient update is to say, 00:11:36.840 |
make the actions that were present when the reward was higher more likely, and the ones present when it was lower less likely. 00:12:01.800 |
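As a rough, toy-scale illustration of that credit-assignment problem and the policy-gradient fix, here is a sketch where an agent takes a whole sequence of actions and only sees a single noisy return at the end. The environment, horizon, and learning rate are made up for the example.

```python
import numpy as np

# Toy REINFORCE sketch: 20 actions per episode, one noisy return at the end,
# so credit has to be smeared over the whole sequence. No baseline or other
# variance reduction; purely illustrative.

rng = np.random.default_rng(0)
horizon, n_actions, good_action = 20, 5, 2   # action 2 secretly adds +1 each time

logits = np.zeros(n_actions)                 # policy parameters
lr = 0.01

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(3000):
    probs = softmax(logits)
    actions = rng.choice(n_actions, size=horizon, p=probs)
    ret = np.sum(actions == good_action) + rng.normal()   # single delayed return

    # Sum of grad log pi(a_t) over the episode, in closed form for a softmax.
    grad = np.bincount(actions, minlength=n_actions) - horizon * probs

    # REINFORCE: make the actions of high-return episodes more likely,
    # the actions of low-return episodes less likely.
    logits += lr * ret * grad / horizon

print("learned policy:", softmax(logits).round(3))   # concentrates on action 2
```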
the policy gradient that somehow you can learn exactly. 00:12:16.760 |
- Yeah, so I think there's a few ways to think about this. 00:12:21.720 |
The way I tend to think about it mostly originally, 00:12:25.640 |
when we started working on deep reinforcement learning 00:12:28.920 |
here at Berkeley, which was maybe 2011, '12, '13, 00:12:32.680 |
around that time, John Schulman was a PhD student, 00:12:38.120 |
And the way we thought about it at the time was, 00:12:56.920 |
linear feedback control is extremely successful. 00:12:59.160 |
It can solve many, many problems surprisingly well. 00:13:02.600 |
I remember, for example, when we did helicopter flight, 00:13:07.240 |
not a non-stationary, but a stationary flight regime 00:13:10.360 |
like hover, you can use linear feedback control 00:13:12.360 |
to stabilize a helicopter, a very complex dynamical system, 00:13:18.280 |
And so I think that's a big part of it is that 00:13:22.200 |
even though the system you control can be very, 00:13:24.120 |
very complex, often relatively simple control architectures can get you pretty far, 00:13:30.360 |
but then also just linear is not good enough. 00:13:32.440 |
And so one way you can think of these neural networks is as a tiling of linear controllers, 00:13:36.920 |
which people were already trying to do more by hand. 00:13:49.400 |
And so it's benefiting from this linear control aspect, 00:13:53.480 |
but it's somehow tiling it one dimension at a time. 00:13:56.680 |
Because if, let's say, you have a two-layer network, 00:14:00.520 |
you make a transition from active to inactive one unit at a time. 00:14:12.200 |
And so you have this kind of very gradual tiling of the space 00:14:16.680 |
between the linear controllers that tile the space. 00:14:19.400 |
And that was always my intuition as to why to expect this to work. 00:14:42.440 |
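A tiny numerical check of that intuition: a one-hidden-layer ReLU policy is exactly a linear feedback law inside each region where the pattern of active units stays fixed. The weights below are random placeholders, just to verify the algebra.

```python
import numpy as np

# A one-hidden-layer ReLU "policy": 4-dim state in, 2-dim action out.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2 = rng.normal(size=(2, 8))

def policy(x):
    h = np.maximum(0.0, W1 @ x + b1)          # ReLU hidden layer
    return W2 @ h

def local_linear_controller(x):
    """The linear feedback law the network implements around state x."""
    active = (W1 @ x + b1 > 0).astype(float)  # which hidden units are "on"
    K = W2 @ (W1 * active[:, None])           # effective gain matrix
    k = W2 @ (b1 * active)                    # effective offset
    return K, k

x = rng.normal(size=4)
K, k = local_linear_controller(x)
# Within this activation region, the network IS the linear controller K x + k.
print(np.allclose(policy(x), K @ x + k))      # True
```

Crossing an activation boundary swaps in a slightly different (K, k), which is the "gradual tiling of the space between linear controllers" described above.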
of when you start going up the number of dimensions 00:14:51.480 |
in terms of how often you get a clean reward signal? 00:15:08.680 |
compared to the things we've looked at so far 00:15:23.240 |
maybe some student decided to do a PhD here, right? 00:15:40.120 |
And that's a very high frequency control thing, 00:15:48.120 |
and you can maybe do it slightly differently, 00:15:49.640 |
but typically that's how you affect the world. 00:15:51.960 |
And the decision of doing a PhD is like so abstract 00:15:56.200 |
relative to what you're actually doing in the world. 00:16:08.840 |
at a level that is just not available at all yet. 00:16:11.640 |
- Where do you think we can pick up hierarchical reasoning? 00:16:30.760 |
but the problem is that they were not grounded 00:16:46.200 |
And so it didn't tie into real objects and so forth. 00:17:09.960 |
to some of these more traditional approaches. 00:17:13.960 |
you need to do some kind of end-to-end training, 00:18:12.920 |
to the gas station because I need to get gas for my car. 00:18:15.400 |
Well, that'll now take five minutes to get there." 00:18:20.200 |
from the high-level action I took much earlier. 00:18:23.720 |
That, we had a very hard time getting success with. 00:18:30.520 |
but we had a lot of trouble getting that to work. 00:18:40.840 |
but you can think about what does hierarchy give us? 00:18:46.840 |
What is it that better credit assignment gives us? 00:18:53.960 |
And so, faster learning is ultimately maybe what we're after. 00:19:01.640 |
the RL squared paper on learning to reinforcement learn, 00:19:07.480 |
And that's exactly the meta-learning approach 00:19:10.920 |
where you say, "Okay, we don't know how to design hierarchy. 00:19:21.000 |
The maze navigation had consistent motion down hallways, 00:19:29.560 |
And then when there is an option to take a turn, 00:19:31.480 |
I can decide whether to take a turn or not and repeat. 00:19:47.000 |
that maybe you can meta-learn these hierarchical concepts. 00:19:51.000 |
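Structurally, the RL-squared idea is a recurrent policy that receives the previous action and reward as part of its input and keeps its hidden state across episodes of the same task, so any "fast" learning happens inside the recurrent dynamics. Below is a bare-bones sketch of that interaction loop with placeholder weights and a stand-in environment; a real implementation would train the recurrent weights with an outer RL algorithm over many sampled tasks.

```python
import numpy as np

# Structural sketch of RL^2 ("learning to reinforcement learn"): a recurrent
# policy fed (observation, previous action, previous reward), with hidden
# state carried across episodes of the same task. Weights are untrained and
# the environment calls are stand-ins.

rng = np.random.default_rng(0)
obs_dim, n_actions, hidden_dim = 4, 3, 16

Wx = rng.normal(scale=0.1, size=(hidden_dim, obs_dim + n_actions + 1))
Wh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
Wo = rng.normal(scale=0.1, size=(n_actions, hidden_dim))

def step_policy(h, obs, prev_action, prev_reward):
    """One step of the recurrent policy; returns new hidden state and action."""
    x = np.concatenate([obs, np.eye(n_actions)[prev_action], [prev_reward]])
    h = np.tanh(Wx @ x + Wh @ h)
    logits = Wo @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return h, rng.choice(n_actions, p=probs)

# The hidden state persists across episodes of the same task, so the policy
# can adapt within the trial -- that adaptation is the "learned RL".
h = np.zeros(hidden_dim)
prev_action, prev_reward = 0, 0.0
for episode in range(3):
    obs = rng.normal(size=obs_dim)          # stand-in for env.reset()
    for t in range(10):
        h, action = step_policy(h, obs, prev_action, prev_reward)
        obs = rng.normal(size=obs_dim)      # stand-in for env.step(action)
        reward = rng.normal()               # stand-in for the task's reward
        prev_action, prev_reward = action, reward
```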
I mean, it seems like through these meta-learning concepts, 00:20:12.760 |
So there's some signs that you can generalize a little bit, 00:20:19.400 |
or totally different breakthroughs are needed 00:20:35.960 |
- Well, there's just some very impressive results already. 00:20:43.000 |
even with the initial kind of big breakthrough in 2012 00:20:50.840 |
This does better on ImageNet, hence image recognition. 00:21:07.400 |
And that was often found to be the even bigger deal 00:21:16.680 |
you learn something for one scenario, and that was it. 00:21:29.000 |
And then recently, I feel like similar kind of, 00:21:43.000 |
some of the OpenAI results on language models 00:21:45.240 |
and some of the recent Google results on language models, 00:21:58.440 |
where somehow if you train a big enough model 00:22:11.000 |
in ways where it wasn't just doing reinforcement learning, 00:22:16.680 |
So I think there's a lot of interesting results already. 00:22:19.080 |
I think maybe where it's hard to wrap my head around 00:22:34.060 |
You draw this, by the way, just to frame things. 00:22:40.600 |
it's the difference between learning to master 00:22:50.120 |
of what learning to master and learning to generalize, 00:22:57.720 |
and I think it might've been one of your interviews, 00:23:14.840 |
let's say, the relative motion of our planets, 00:23:27.640 |
it would probably not predict what would happen, right? 00:23:31.960 |
And that's a different kind of generalization. 00:23:33.400 |
That's a generalization that relies on the ultimate, 00:23:41.400 |
whereas just pattern recognition could predict 00:23:43.560 |
our current solar system motion pretty well, no problem. 00:23:48.680 |
of a kind of generalization that is a little different 00:24:03.640 |
but that's what physics researchers do, right? 00:24:12.200 |
The master equation for the entire dynamics of the universe. 00:24:15.320 |
We haven't really pushed that direction as hard 00:24:21.880 |
but it seems a kind of generalization you get from that 00:24:24.360 |
that you don't get in our current methods so far. 00:24:27.240 |
- So I just talked to Vladimir Vapnik, for example, 00:24:42.280 |
Do you think that's a fruitless pursuit in the near term, 00:24:50.600 |
- I think that's a really interesting pursuit 00:25:02.680 |
And so I wouldn't maybe think of it as the theory, 00:25:26.200 |
they might be able to reuse parts of their brain 00:25:29.320 |
And so what that suggests is some kind of modularity 00:25:35.000 |
and I think it is a pretty natural thing to strive for, 00:25:48.360 |
But if you think of things like the neocortex, 00:25:51.560 |
that seems fairly modular, from the findings so far. 00:26:02.280 |
I think that would be the kind of interesting 00:26:16.680 |
of what it means to do something intelligent? 00:26:18.840 |
So reinforcement learning embodies both groups, right? 00:26:21.960 |
To prove that something converges, prove the bounds. 00:26:33.240 |
How do you think of those two parts of your brain? 00:26:55.560 |
And experimentation takes a long time to get through. 00:27:01.160 |
kind of reinforcement learning your research process, 00:27:09.720 |
And hopefully once you do a bunch of experiments, 00:27:14.360 |
You can do some derivations that leapfrog some experiments. 00:27:20.680 |
has been such that we have not been able to find 00:27:29.320 |
A new experiment here, a new experiment there 00:27:31.080 |
that gives us new insights and gradually building up, 00:27:34.280 |
but not getting to something yet where we're just, 00:27:36.440 |
"Okay, here's an equation that now explains how," 00:27:40.440 |
have been two years of experimentation to get there, 00:27:42.440 |
but this tells us what the result's going to be. 00:28:13.640 |
and eventually play, eventually interact with humans 00:28:21.720 |
What's more promising, you think, as a research direction? 00:28:57.080 |
And so whenever you can turn something into self-play, 00:29:01.960 |
where you can naturally learn much more quickly 00:29:04.680 |
than in most other reinforcement learning environments. 00:29:17.080 |
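A minimal sketch of what a self-play loop looks like, using rock-paper-scissors so everything fits in a few lines: one policy plays both sides and updates against its own current self, so the "opponent" automatically keeps pace. The game, update rule, and hyperparameters are illustrative, not what systems like AlphaGo or the Dota bots actually use.

```python
import numpy as np

# Self-play sketch on rock-paper-scissors: the same policy generates both
# players' moves and is updated from the outcome, so any exploitable bias
# (e.g. "always rock") is immediately punished by its own copy.

rng = np.random.default_rng(0)
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])            # row player's reward

logits = np.array([1.0, 0.0, -1.0])          # start from a biased policy
lr = 0.05

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(5000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)               # learner's move
    b = rng.choice(3, p=probs)               # opponent = current self
    reward = PAYOFF[a, b]

    grad = -probs
    grad[a] += 1.0                           # grad of log pi(a) for a softmax
    logits += lr * reward * grad             # policy-gradient update

# The policy never settles on a pure strategy; it keeps getting pushed back
# toward balance by its own improving opponent (the self-play "arms race").
print("current policy:", softmax(logits).round(3))
```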
So far, self-play has been largely around games 00:29:22.600 |
But if we could do self-play for other things, 00:29:29.400 |
but maybe it tries to build a hut or something. 00:29:41.400 |
where somebody figures out a formalism to say, 00:29:43.720 |
"Okay, any RL problem by playing this and this idea, 00:29:55.880 |
And so either we need to provide detailed reward, 00:29:58.760 |
that doesn't just reward for achieving a goal, 00:30:13.000 |
And now the question is, how do you show the robot? 00:30:16.360 |
One way to show it is to teleoperate the robot, 00:30:16.360 |
and then the robot really experiences things. 00:30:20.840 |
And that's nice because that's really high signal 00:30:22.840 |
to noise ratio data, and we've done a lot of that. 00:30:26.760 |
in just 10 minutes, you can teach a robot a new basic skill, 00:30:29.640 |
like, "Okay, pick up the bottle, place it somewhere else." 00:30:32.120 |
That's a skill, no matter where the bottle starts, 00:30:34.120 |
maybe it always goes onto a target or something. 00:30:36.040 |
That's fairly easy to teach your robot with teleop. 00:30:36.040 |
and doesn't experience it, but just watches it and says, 00:30:56.840 |
but I'm gonna use my hand, I do that mapping." 00:30:59.320 |
And so that's where I think one of the big breakthroughs 00:31:05.160 |
It's almost like learning a machine translation 00:31:08.040 |
for demonstrations, where you have a human demonstration, 00:31:19.800 |
And that, I think, opens up a lot of opportunities 00:31:26.360 |
Do you think this approach of third-person watching, 00:31:33.000 |
- So for autonomous driving, I would say it's, 00:31:41.400 |
And the reason I'm gonna say it's slightly easier 00:31:55.560 |
so I think the distinction between third-person 00:31:57.400 |
and first-person is not a very important distinction 00:32:01.720 |
They're very similar, because the distinction is really about 00:32:14.680 |
to a point, let's say, a couple meters in front of you. 00:32:17.320 |
And that's a problem that's very well understood. 00:32:26.680 |
For autonomous driving, I think there is still the question, 00:32:47.560 |
And of course, there are versions of imitation learning, 00:32:50.920 |
inverse reinforcement learning type imitation learning, 00:33:03.880 |
If it really doesn't have a notion of objectives 00:33:11.960 |
that you get from just behavioral cloning/supervised learning. 00:33:18.280 |
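For contrast with the inverse-RL style approaches, here is what plain behavioral cloning amounts to: treat teleoperated demonstrations as a supervised dataset of state-action pairs and fit a policy by regression. The synthetic "expert" and the linear policy are assumptions for the sketch; real systems use images and neural networks, and inherit the drift issues mentioned above because the clone has no notion of the underlying objective.

```python
import numpy as np

# Behavioral cloning sketch: supervised regression from logged states to the
# expert's actions. Data and policy class are synthetic placeholders.

rng = np.random.default_rng(0)

# Pretend expert: a hidden linear feedback law, e.g. "move toward the bottle".
true_K = rng.normal(size=(2, 6))
states = rng.normal(size=(500, 6))                               # logged robot states
actions = states @ true_K.T + 0.01 * rng.normal(size=(500, 2))   # expert actions

# Behavioral cloning here is just ordinary least squares from states to actions.
K_cloned, *_ = np.linalg.lstsq(states, actions, rcond=None)
K_cloned = K_cloned.T

def cloned_policy(state):
    return K_cloned @ state

print("max weight error:", np.abs(K_cloned - true_K).max())      # small
```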
whether it's self-play or even imitation learning, 00:33:26.360 |
And you're doing a lot of stuff in the physical world 00:33:32.440 |
power of simulation being boundless eventually 00:33:49.080 |
- So I think we could even rephrase that question 00:34:31.960 |
is sufficiently representative of the real world 00:34:34.520 |
such that it would work if you train in there. 00:34:39.720 |
then there is something that's good in all of them. 00:34:43.400 |
The real world will just be another one of them 00:34:50.600 |
- Another sample from the distribution of simulators. 00:34:59.160 |
It's definitely a very advanced simulator if it is. 00:35:07.320 |
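The "many simulators" idea is essentially domain randomization. A toy sketch, with a one-dimensional made-up system and a brute-force search standing in for training: pick the controller that does well across a whole distribution of randomized dynamics, then hope the real world behaves like one more sample from that distribution.

```python
import numpy as np

# Domain-randomization sketch: choose a single feedback gain that stabilizes
# x_{t+1} = a*x_t + b*u_t for many randomly sampled (a, b). Illustrative only.

rng = np.random.default_rng(0)

def rollout_cost(gain, a, b, steps=50):
    """Total squared error when controlling x_{t+1} = a x + b u with u = -gain*x."""
    x, cost = 1.0, 0.0
    for _ in range(steps):
        u = -gain * x
        x = a * x + b * u
        cost += x * x
    return cost

def randomized_sims(n):
    """Sample n simulators with randomized dynamics parameters."""
    return [(rng.uniform(0.8, 1.2), rng.uniform(0.5, 1.5)) for _ in range(n)]

train_sims = randomized_sims(20)

# Crude "training": pick the gain with the lowest cost averaged over all sims.
candidates = np.linspace(0.0, 2.0, 101)
avg_costs = [np.mean([rollout_cost(k, a, b) for a, b in train_sims]) for k in candidates]
best_gain = candidates[int(np.argmin(avg_costs))]

# "Real world" = one more simulator drawn from the same distribution.
a_real, b_real = randomized_sims(1)[0]
print("chosen gain:", best_gain, "real-world cost:", rollout_cost(best_gain, a_real, b_real))
```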
It's something you think about a little bit too. 00:35:09.320 |
Of course, you're really trying to build these systems, 00:35:18.120 |
as you build robots that are operating in the physical world? 00:35:24.920 |
in an engineering kind of way, in a systematic way? 00:35:32.200 |
you kind of have a few notions of safety to worry about. 00:35:41.480 |
Same for cars, which we can think of as robots too in some way. 00:35:48.200 |
So it could be not the kind of long-term AI safety concerns 00:35:51.640 |
that, okay, AI is smarter than us and now what do we do? 00:36:05.560 |
And I'm always wondering, like I always wonder, 00:36:07.400 |
let's say you look at, let's go back to driving 00:36:09.960 |
'cause a lot of people know driving well, of course. 00:36:12.200 |
What do we do to test somebody for driving, right? 00:36:19.400 |
I mean, you fill out some tests and then you drive. 00:36:27.640 |
that driving test is just you drive around the block, 00:36:34.600 |
and then you pull over again and you're pretty much done. 00:36:37.560 |
And you're like, okay, if a self-driving car did that, 00:36:45.080 |
And I'd be like, no, that's not enough for me to trust it. 00:36:49.800 |
that somebody being able to do that is representative 00:36:53.160 |
of them being able to do a lot of other things. 00:36:58.360 |
we've figured out representative tests of what it means 00:37:13.080 |
'cause they use the same neural net and so forth. 00:37:15.400 |
But still, I feel like we don't have this kind of unit tests 00:37:22.680 |
And I think there's something very interesting 00:37:28.120 |
you have a better self-driving car suite, you update it. 00:37:31.000 |
How do you know it's indeed more capable on everything 00:37:35.960 |
that you didn't have any bad things creep into it? 00:37:40.120 |
So I think that's a very interesting direction of research 00:37:46.520 |
'Cause we say, okay, you have a driving test, you passed, 00:37:53.400 |
or 10 million miles, something pretty phenomenal. 00:37:55.640 |
Compared to that short test that is being done. 00:38:01.560 |
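One way to picture the "unit tests for driving skills" idea: keep a fixed suite of scenario checks and require that an updated policy never fails a scenario the previous version passed. Everything below (the scenarios, the pass criteria, the toy policies) is a placeholder to show the shape of such a regression check, not a real evaluation harness.

```python
# Sketch of scenario-based regression testing for a driving policy update.

def run_scenario(policy, scenario):
    """Placeholder: roll the policy out in a simulated scenario, return pass/fail."""
    return scenario["expected_behavior"](policy)

SCENARIO_SUITE = [
    {"name": "pedestrian_steps_out",   "expected_behavior": lambda p: p("brake_test")},
    {"name": "merge_in_dense_traffic", "expected_behavior": lambda p: p("merge_test")},
    {"name": "unprotected_left_turn",  "expected_behavior": lambda p: p("yield_test")},
]

def regression_check(old_policy, new_policy):
    """The new version must pass every scenario the old version passed."""
    regressions = []
    for scenario in SCENARIO_SUITE:
        old_ok = run_scenario(old_policy, scenario)
        new_ok = run_scenario(new_policy, scenario)
        if old_ok and not new_ok:
            regressions.append(scenario["name"])
    return regressions

# Toy policies: each just "knows" a set of skills.
old_policy = lambda test: test in {"brake_test", "yield_test"}
new_policy = lambda test: test in {"brake_test", "merge_test"}

print(regression_check(old_policy, new_policy))  # ['unprotected_left_turn']
```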
- So let me ask, you've mentioned that Andrew Ng, 00:38:05.240 |
by example, showed you the value of kindness. 00:38:32.440 |
or if an AI system had to operate in this real world, 00:38:35.080 |
do you think it's really easy to find policies 00:38:41.160 |
Or is it like a very hard optimization problem? 00:38:44.440 |
- I mean, there is kind of two optimizations happening 00:38:56.680 |
And we're kind of predisposed to like certain things. 00:39:00.520 |
And that's in some sense what makes our learning easier 00:39:32.120 |
but at the same time, also to be very territorial 00:40:26.040 |
this kind of ability to interact with humans, 00:40:32.120 |
Do you think it's possible to teach RL based robot 00:40:36.200 |
and to inspire that human to love the robot back? 00:40:48.760 |
Maybe I'll answer it with another question, right? 00:40:58.040 |
okay, I mean, how close does some people's happiness get 00:41:02.840 |
from interacting with just a really nice dog? 00:41:11.960 |
It makes you happy when you come home to your dog. 00:41:20.520 |
your partner took him on a trip or something, 00:41:22.920 |
you might not be nearly as happy when you get home, right? 00:41:27.560 |
it seems like the level of reasoning a dog has 00:41:32.040 |
but then it's still not yet at the level of human reasoning. 00:41:35.480 |
And so it seems like we don't even need to achieve 00:41:37.640 |
human level reasoning to get like very strong affection 00:41:45.480 |
couldn't we achieve the kind of level of affection 00:41:55.800 |
It's a question, is it a good thing for us or not? 00:42:08.920 |
Maybe we should say love is the objective function.