Pieter Abbeel: Deep Reinforcement Learning

00:00:00.000 | The following is a conversation with Peter Abbeel.

00:00:03.120 | He's a professor at UC Berkeley and the director of the Berkeley Robotics Learning Lab.

00:00:07.840 | He's one of the top researchers in the world working on how we make robots understand and interact with the world around them,

00:00:15.360 | especially using imitation and deep reinforcement learning.

00:00:19.720 | This conversation is part of the MIT course on artificial general intelligence and the Artificial Intelligence Podcast.

00:00:26.400 | If you enjoy it, please subscribe on YouTube, iTunes, or your podcast provider of choice,

00:00:31.680 | or simply connect with me on Twitter @LexFriedman, spelled F-R-I-D.

00:00:36.920 | And now, here's my conversation with Peter Abbeel.

00:00:41.400 | You've mentioned that if there was one person you could meet, it would be Roger Federer.

00:00:46.200 | So let me ask, when do you think we'll have a robot that fully autonomously can beat Roger Federer at tennis,

00:00:54.320 | a Roger Federer-level player at tennis?

00:00:57.520 | Well, first, if you can make it happen for me to meet Roger, let me know.

00:01:02.560 | In terms of getting a robot to beat him at tennis, it's kind of an interesting question,

00:01:08.920 | because for a lot of the challenges we think about in AI, the software is really the missing piece.

00:01:16.760 | But for something like this, the hardware is nowhere near either.

00:01:22.640 | To really have a robot that can physically run around, the Boston Dynamics robots are starting to get there,

00:01:28.520 | but still not really human-level ability to run around and then swing a racket.

00:01:36.800 | So you think that's a hardware problem?

00:01:38.320 | I don't think it's a hardware problem only. I think it's a hardware and a software problem.

00:01:41.560 | I think it's both. And I think they'll have independent progress.

00:01:45.640 | So I'd say the hardware, maybe in 10, 15 years.

00:01:51.560 | On clay, not grass. I mean, grass is probably harder.

00:01:55.000 | Well, the clay, I'm not sure what's harder, grass or clay.

00:01:58.840 | The clay involves sliding, which might be harder to master, actually.

00:02:05.880 | But you're not limited to bipedal. I'm sure there's no...

00:02:09.880 | Well, if we can build a machine, it's a whole different question, of course.

00:02:13.080 | If you can say, "Okay, this robot can be on wheels, it can move around on wheels,

00:02:17.560 | and can be designed differently," then I think that can be done sooner,

00:02:22.680 | probably than a full humanoid type of setup.

00:02:26.120 | What do you think of swing a racket? So you've worked at basic manipulation.

00:02:31.160 | How hard do you think is the task of swinging a racket,

00:02:34.120 | with being able to hit a nice backhand or a forehand?

00:02:39.240 | Let's say we just set up stationary, a nice robot arm, let's say,

00:02:43.960 | you know, a standard industrial arm, and it can watch the ball come

00:02:48.360 | and then swing the racket. It's a good question.

00:02:51.400 | I'm not sure it would be super hard to do.

00:02:56.040 | I mean, I'm sure it would require a lot...

00:02:58.040 | If we do it with reinforced milling, it would require a lot of trial and error.

00:03:01.400 | It's not going to swing it right the first time around.

00:03:03.240 | But yeah, I don't see why I couldn't swing it the right way.

00:03:09.320 | I think it's learnable. I think if you set up a ball machine,

00:03:11.960 | let's say, on one side, and then a robot with a tennis racket on the other side,

00:03:17.640 | I think it's learnable. And maybe a little bit of pre-training and simulation.

00:03:23.000 | Yeah, I think that's feasible. I think the swing the racket is feasible.

00:03:27.160 | It'd be very interesting to see how much precision it can get.

00:03:31.320 | Because, I mean, that's where, I mean, some of the human players can hit it

00:03:36.760 | on the lines, which is very high precision.

00:03:39.160 | With spin. The spin is an interesting...

00:03:41.960 | Whether RL can learn to put a spin on the ball.

00:03:45.560 | Well, you got me interested. Maybe someday we'll set this up.

00:03:48.120 | Someday, sure.

00:03:48.620 | You got me intrigued.

00:03:51.080 | Your answer is basically, okay, for this problem, it sounds fascinating.

00:03:54.120 | But for the general problem of a tennis player, we might be a little bit farther away.

00:03:57.960 | What's the most impressive thing you've seen a robot do in the physical world?

00:04:04.120 | So physically, for me, it's the Boston Dynamics videos.

00:04:10.920 | Always just bring home and just super impressed.

00:04:14.200 | Recently, the robot running up the stairs, doing the parkour type thing.

00:04:19.400 | I mean, yes, we don't know what's underneath.

00:04:22.200 | They don't really write a lot of detail.

00:04:23.880 | But even if it's hard-coded underneath, which it might or might not be,

00:04:28.360 | just the physical abilities of doing that parkour, that's a very impressive.

00:04:32.600 | So have you met Spot Mini or any of those robots in person?

00:04:36.680 | Met Spot Mini last year in April at the Mars event that Jeff Bezos organizes.

00:04:42.840 | They brought it out there and it was nicely following around Jeff.

00:04:47.640 | When Jeff left the room, they had it follow him along, which was pretty impressive.

00:04:52.120 | So I think there's some confidence to know that there's no learning going on in those robots.

00:04:57.960 | The psychology of it.

00:04:58.920 | So while knowing that, while knowing there's not, if there's any learning going on, it's very limited.

00:05:03.400 | I met Spot Mini earlier this year and knowing everything that's going on,

00:05:08.680 | having one-on-one interaction, so I get to spend some time alone.

00:05:12.360 | And there's immediately a deep connection on the psychological level.

00:05:18.200 | Even though you know the fundamentals, how it works, there's something magical.

00:05:23.240 | So do you think about the psychology of interacting with robots in the physical world?

00:05:29.000 | Even you just showed me the PR2, the robot, and there was a little bit something like a face,

00:05:35.960 | had a little bit something like a face.

00:05:38.360 | There's something that immediately draws you to it.

00:05:40.520 | Do you think about that aspect of the robotics problem?

00:05:45.000 | Well, it's very hard with Brad here.

00:05:48.200 | We'll give him a name, Berkeley Robot, for the elimination of tedious tasks.

00:05:52.040 | It's very hard to not think of the robot as a person.

00:05:56.440 | And it seems like everybody calls them a he for whatever reason,

00:05:59.400 | but that also makes it more a person than if it was a it.

00:06:01.880 | And it seems pretty natural to think of it that way.

00:06:07.160 | This past weekend really struck me.

00:06:08.520 | I've seen Pepper many times on videos, but then I was at an event organized by,

00:06:15.160 | this was by Fidelity, and they had scripted Pepper to help moderate some sessions.

00:06:22.600 | And they had scripted Pepper to have the personality of a child a little bit.

00:06:26.360 | And it was very hard to not think of it as its own person in some sense,

00:06:31.720 | because it was just kind of jumping, it would just jump into conversation,

00:06:34.360 | making it very interactive.

00:06:35.720 | Moderate would be saying, Pepper would just jump in, "Hold on, how about me?

00:06:39.240 | Can I participate in this too?"

00:06:41.240 | And just like, "Okay, this is like a person."

00:06:43.560 | And that was 100% scripted.

00:06:45.400 | And even then it was hard not to have that sense of somehow there is something there.

00:06:50.520 | So as we have robots interact in this physical world,

00:06:54.280 | is that a signal that could be used in reinforcement learning?

00:06:57.080 | You've worked a little bit in this direction,

00:07:00.120 | but do you think that psychology can be somehow pulled in?

00:07:02.920 | Yes, that's a question I would say a lot of people ask.

00:07:08.920 | And I think part of why they ask it is they're thinking about

00:07:12.840 | how unique are we really still as people?

00:07:16.520 | Like after they see some results, they see a computer play Go,

00:07:19.640 | they see a computer do this, that, they're like, "Okay, but can it really have emotion?

00:07:23.640 | Can it really interact with us in that way?"

00:07:26.680 | And then once you're around robots, you already start feeling it.

00:07:29.960 | And I think that kind of maybe methodologically, the way that I think of it is,

00:07:34.280 | if you run something like reinforcement learning, it's about optimizing some objective.

00:07:39.000 | And there's no reason that the objective couldn't be tied into

00:07:47.480 | how much does a person like interacting with this system?

00:07:50.600 | And why could not the reinforcement learning system optimize for

00:07:54.280 | the robot being fun to be around?

00:07:56.040 | And why wouldn't it then naturally become more and more attractive and more and more,

00:08:00.440 | maybe like a person or like a pet, I don't know what it would exactly be,

00:08:04.520 | but more and more have those features and acquire them automatically.

00:08:08.120 | - As long as you can formalize an objective of what it means to like something,

00:08:13.160 | how you exhibit, what's the ground truth?

00:08:17.240 | How do you get the reward from human?

00:08:19.320 | Because you have to somehow collect that information within you, human.

00:08:22.280 | But you're saying if you can formulate as an objective, it can be learned.

00:08:26.840 | - There's no reason it couldn't emerge through learning.

00:08:29.320 | And maybe one way to formulate as an objective,

00:08:31.400 | you wouldn't have to necessarily score it explicitly.

00:08:33.720 | So standard rewards are numbers, and numbers are hard to come by.

00:08:37.720 | This is a 1.5 or 1.7 on some scale.

00:08:41.160 | It's very hard to do for a person.

00:08:42.840 | But much easier is for a person to say,

00:08:45.320 | "Okay, what you did the last five minutes was much nicer

00:08:49.000 | than we did the previous five minutes."

00:08:51.080 | And that now gives a comparison.

00:08:53.000 | And in fact, there have been some results in that.

00:08:55.160 | For example, Paul Christiano and collaborators at OpenAI had the hopper,

00:09:00.040 | Mojoco hopper, a one-legged robot, learn to do backflips.

00:09:03.720 | Purely from feedback, I like this better than that.

00:09:06.840 | That's kind of equally good.

00:09:08.600 | And after a bunch of interactions,

00:09:10.840 | it figured out what it was the person was asking for, namely a backflip.

00:09:14.200 | And so I think the same thing--

00:09:15.240 | - Oh, it wasn't trying to do a backflip.

00:09:18.520 | It was just getting a score from the comparison score

00:09:20.680 | from the person based on--

00:09:21.880 | - The person having in mind, in their own mind,

00:09:25.240 | I wanted to do a backflip,

00:09:27.320 | but the robot didn't know what it was supposed to be doing.

00:09:30.680 | It just knew that sometimes the person said,

00:09:32.680 | "This is better, this is worse."

00:09:34.440 | And then the robot figured out

00:09:35.880 | what the person was actually after was a backflip.

00:09:38.600 | And I'd imagine the same would be true for things

00:09:40.840 | like more interactive robots

00:09:43.000 | that the robot would figure out over time,

00:09:45.000 | "Oh, this kind of thing apparently is appreciated more

00:09:47.960 | than this other kind of thing."

00:09:49.160 | - So when I first picked up Sutton's,

00:09:53.880 | Richard Sutton's reinforcement learning book,

00:09:55.880 | before sort of this deep learning,

00:09:59.880 | before the re-emergence of neural networks

00:10:03.160 | as a powerful mechanism for machine learning,

00:10:04.920 | RL seemed to me like magic.

00:10:07.560 | It was beautiful.

00:10:10.200 | So that seemed like what intelligence is,

00:10:13.480 | RL reinforcement learning.

00:10:14.840 | So how do you think we can possibly learn anything

00:10:20.200 | about the world when the reward for the actions

00:10:22.920 | is delayed, is so sparse?

00:10:25.240 | Why do you think RL works?

00:10:29.240 | Why do you think you can learn anything

00:10:32.120 | under such sparse rewards,

00:10:34.440 | whether it's regular reinforcement learning

00:10:36.760 | or deep reinforcement learning?

00:10:38.360 | What's your intuition?

00:10:40.360 | - The counterpart of that is why is RL,

00:10:43.240 | why does it need so many samples,

00:10:47.000 | so many experiences to learn from?

00:10:48.520 | Because really what's happening is

00:10:50.520 | when you have a sparse reward,

00:10:52.040 | you do something maybe for like, I don't know,

00:10:55.000 | you take a hundred actions and then you get a reward.

00:10:57.160 | And maybe you get like a score of three.

00:10:59.480 | And I'm like, okay, three, not sure what that means.

00:11:02.760 | You go again and now you get two.

00:11:04.280 | And now you know that that sequence of a hundred actions

00:11:06.920 | that you did the second time around

00:11:08.120 | somehow was worse than the sequence of a hundred actions

00:11:10.440 | you did the first time around.

00:11:11.720 | But that's tough to now know

00:11:13.560 | which one of those were better or worse.

00:11:15.000 | Some might've been good and bad in either one.

00:11:16.760 | And so that's why you need so many experiences.

00:11:19.640 | But once you have enough experiences,

00:11:21.160 | effectively RL is teasing that apart.

00:11:23.240 | It's starting to say, okay,

00:11:24.360 | what is consistently there when you get a higher reward?

00:11:27.640 | And what's consistently there when you get a lower reward?

00:11:29.800 | And then kind of the magic of,

00:11:31.880 | and sometimes the policy gradient update is to say,

00:11:34.040 | now let's update the neural network

00:11:36.840 | to make the actions that were kind of present

00:11:39.000 | when things are good, more likely,

00:11:41.320 | and make the actions that are present

00:11:42.920 | when things are not as good, less likely.

00:11:44.680 | - So that is the counterpoint,

00:11:46.920 | but it seems like you would need to run it

00:11:49.480 | a lot more than you do.

00:11:50.840 | Even though right now people could say

00:11:52.600 | that RL is very inefficient,

00:11:54.280 | but it seems to be way more efficient

00:11:56.120 | than one would imagine on paper.

00:11:58.280 | That the simple updates to the policy,

00:12:01.800 | the policy gradient that somehow you can learn exactly.

00:12:05.160 | You just said, what are the common actions

00:12:07.560 | that seem to produce some good results?

00:12:09.640 | That that somehow can learn anything.

00:12:11.720 | It seems counterintuitive at least.

00:12:14.440 | Is there some intuition behind it?

00:12:16.760 | - Yeah, so I think there's a few ways to think about this.

00:12:21.720 | The way I tend to think about it mostly originally,

00:12:25.640 | when we started working on deep reinforcement learning

00:12:28.920 | here at Berkeley, which was maybe 2011, 12, 13,

00:12:32.680 | around that time, John Shulman was a PhD student,

00:12:36.040 | initially kind of driving it forward here.

00:12:38.120 | And the way we thought about it at the time was,

00:12:43.960 | if you think about rectified linear units

00:12:46.920 | or kind of rectifier type neural networks,

00:12:49.000 | what do you get?

00:12:50.840 | You get something that's piecewise

00:12:52.840 | linear feedback control.

00:12:54.200 | And if you look at the literature,

00:12:56.920 | linear feedback control is extremely successful.

00:12:59.160 | It can solve many, many problems surprisingly well.

00:13:02.600 | I remember, for example, when we did helicopter flight,

00:13:05.560 | if you're in a stationary flight regime,

00:13:07.240 | not a non-stationary, but a stationary flight regime

00:13:10.360 | like hover, you can use linear feedback control

00:13:12.360 | to stabilize a helicopter, a very complex dynamical system,

00:13:15.400 | but the controller is relatively simple.

00:13:18.280 | And so I think that's a big part of it is that

00:13:19.880 | if you do feedback control,

00:13:22.200 | even though the system you control can be very,

00:13:24.120 | very complex, often relatively simple control architectures

00:13:28.600 | can already do a lot,

00:13:30.360 | but then also just linear is not good enough.

00:13:32.440 | And so one way you can think of these neural networks

00:13:35.000 | is that in some sense, they tile the space,

00:13:36.920 | which people were already trying to do more by hand

00:13:39.320 | or with finite state machines.

00:13:40.920 | Say this linear controller here,

00:13:42.360 | this linear controller here.

00:13:43.640 | Neural network learns to tile the space

00:13:45.480 | and say linear controller here,

00:13:46.440 | another linear controller here,

00:13:48.120 | but it's more subtle than that.

00:13:49.400 | And so it's benefiting from this linear control aspect,

00:13:51.880 | it's benefiting from the tiling,

00:13:53.480 | but it's somehow tiling it one dimension at a time.

00:13:56.680 | Because if let's say you have a two layer network,

00:13:59.320 | if in that hidden layer,

00:14:00.520 | you make a transition from active to inactive

00:14:04.840 | or the other way around,

00:14:05.720 | that is essentially one axis,

00:14:08.280 | but not axis aligned,

00:14:09.400 | but one direction that you change.

00:14:12.200 | And so you have this kind of very gradual tiling of the space

00:14:15.160 | where you have a lot of sharing

00:14:16.680 | between the linear controllers that tile the space.

00:14:19.400 | And that was always my intuition as to why to expect

00:14:22.120 | that this might work pretty well.

00:14:24.680 | It's essentially leveraging the fact

00:14:25.960 | that linear feedback control is so good,

00:14:28.360 | but of course not enough.

00:14:29.720 | And this is a gradual tiling of the space

00:14:31.640 | with linear feedback controls

00:14:33.400 | that share a lot of expertise across them.

00:14:36.040 | - So that's really nice intuition.

00:14:38.840 | But do you think that scales

00:14:40.840 | to the more and more general problems

00:14:42.440 | of when you start going up the number of dimensions

00:14:47.000 | when you start going down

00:14:51.480 | in terms of how often you get a clean reward signal?

00:14:55.240 | Does that intuition carry forward

00:14:57.240 | to those crazier, weirder worlds

00:14:59.560 | that we think of as the real world?

00:15:01.160 | - So I think where things get really tricky

00:15:07.880 | in the real world

00:15:08.680 | compared to the things we've looked at so far

00:15:10.840 | with great success in reinforcement learning

00:15:13.080 | is the time scales,

00:15:17.160 | which takes us to an extreme.

00:15:18.920 | So when you think about the real world,

00:15:21.640 | I mean, I don't know,

00:15:23.240 | maybe some student decided to do a PhD here, right?

00:15:26.760 | Okay, that's a decision.

00:15:28.520 | That's a very high level decision.

00:15:29.960 | But if you think about their lives,

00:15:32.440 | I mean, any person's life,

00:15:33.960 | it's a sequence of muscle fiber contractions

00:15:37.240 | and relaxations,

00:15:38.200 | and that's how you interact with the world.

00:15:40.120 | And that's a very high frequency control thing,

00:15:42.520 | but it's ultimately what you do

00:15:44.440 | and how you affect the world.

00:15:45.560 | Until I guess we have brain readings,

00:15:48.120 | and you can maybe do it slightly differently,

00:15:49.640 | but typically that's how you affect the world.

00:15:51.960 | And the decision of doing a PhD is like so abstract

00:15:56.200 | relative to what you're actually doing in the world.

00:15:59.160 | And I think that's where credit assignment

00:16:01.000 | becomes just completely beyond

00:16:04.680 | what any current RL algorithm can do.

00:16:06.600 | And we need hierarchical reasoning

00:16:08.840 | at a level that is just not available at all yet.

00:16:11.640 | - Where do you think we can pick up hierarchical reasoning?

00:16:14.760 | By which mechanisms?

00:16:15.880 | - Yeah, so maybe let me highlight

00:16:18.520 | what I think the limitations are

00:16:20.600 | of what already was done 20, 30 years ago.

00:16:25.880 | In fact, you'll find reasoning systems

00:16:27.560 | that reason over relatively long horizons,

00:16:30.760 | but the problem is that they were not grounded

00:16:32.680 | in the real world.

00:16:33.480 | So people would have to hand design

00:16:37.000 | some kind of logical,

00:16:40.520 | dynamical descriptions of the world,

00:16:43.800 | and that didn't tie into perception.

00:16:46.200 | And so it didn't tie into real objects and so forth.

00:16:49.080 | And so that was a big gap.

00:16:51.000 | Now with deep learning,

00:16:52.200 | we start having the ability to

00:16:55.400 | really see with sensors, process that,

00:16:59.480 | and understand what's in the world.

00:17:01.320 | And so it's a good time to try

00:17:02.680 | to bring these things together.

00:17:03.880 | I see a few ways of getting there.

00:17:06.280 | One way to get there would be to say,

00:17:08.040 | deep learning can get bolted on somehow

00:17:09.960 | to some of these more traditional approaches.

00:17:12.200 | Now, bolted on would probably mean

00:17:13.960 | you need to do some kind of end-to-end training,

00:17:16.200 | where you say, "My deep learning processing

00:17:18.440 | somehow leads to a representation

00:17:20.680 | that in turn uses some kind of

00:17:23.720 | traditional underlying dynamical systems

00:17:27.320 | that can be used for planning."

00:17:29.720 | And that's, for example, the direction

00:17:31.320 | Aviv Tamar and Thanar Kuretach here

00:17:33.320 | have been pushing with causal infoGAN

00:17:35.000 | and of course other people too.

00:17:36.200 | That's one way.

00:17:38.120 | Can we somehow force it into the form factor

00:17:41.000 | that is amenable to reasoning?

00:17:43.560 | Another direction we've been thinking about

00:17:46.440 | for a long time

00:17:47.640 | and didn't make any progress on

00:17:50.120 | was more information-theoretic approaches.

00:17:53.560 | So the idea there was that

00:17:55.080 | what it means to take high-level action

00:17:57.960 | is to choose a latent variable now

00:18:02.440 | that tells you a lot about

00:18:03.560 | what's going to be the case in the future.

00:18:05.240 | Because that's what it means

00:18:06.280 | to take a high-level action.

00:18:08.600 | I say, "Okay, I decide I'm going to navigate

00:18:12.920 | to the gas station because I need to get gas for my car.

00:18:15.400 | Well, that'll now take five minutes to get there."

00:18:17.720 | But the fact that I get there,

00:18:19.160 | I could already tell that

00:18:20.200 | from the high-level action I took much earlier.

00:18:23.720 | That, we had a very hard time getting success with.

00:18:27.560 | Not saying it's a dead end necessarily,

00:18:30.520 | but we had a lot of trouble getting that to work.

00:18:32.280 | And then we started revisiting the notion

00:18:34.520 | of what are we really trying to achieve?

00:18:36.120 | What we're trying to achieve

00:18:38.920 | is not necessarily hierarchy per se,

00:18:40.840 | but you can think about what does hierarchy give us?

00:18:42.840 | What we hope it would give us

00:18:45.720 | is better credit assignment.

00:18:46.840 | What is better credit assignment is giving us?

00:18:51.720 | It gives us faster learning.

00:18:53.960 | And so, faster learning is ultimately maybe what we're after.

00:18:59.640 | And so, that's what we ended up with,

00:19:01.640 | the RL squared paper on learning to reinforcement learn,

00:19:04.760 | which at a time Rocky Duan led.

00:19:07.480 | And that's exactly the meta-learning approach

00:19:10.920 | where you say, "Okay, we don't know how to design hierarchy.

00:19:13.400 | We know what we want to get from it.

00:19:15.560 | Let's just enter and optimize

00:19:17.160 | for what we want to get from it

00:19:18.760 | and see if it might emerge."

00:19:19.960 | And we saw things emerge.

00:19:21.000 | The maze navigation had consistent motion down hallways,

00:19:24.760 | which is what you want.

00:19:26.920 | A hierarchical control should say,

00:19:28.120 | "I want to go down this hallway."

00:19:29.560 | And then when there is an option to take a turn,

00:19:31.480 | I can decide whether to take a turn or not and repeat.

00:19:33.640 | Even had the notion of,

00:19:35.240 | "Where have you been before or not?"

00:19:37.080 | Do not revisit places you've been before.

00:19:39.000 | It still didn't scale yet

00:19:41.880 | to the real world kind of scenarios

00:19:44.760 | I think you had in mind,

00:19:45.800 | but it was some sign of life

00:19:47.000 | that maybe you can meta-learn these hierarchical concepts.

00:19:51.000 | I mean, it seems like through these meta-learning concepts,

00:19:56.040 | we get at what I think is one of the hardest

00:19:59.640 | and most important problems of AI,

00:20:02.200 | which is transfer learning.

00:20:03.320 | So it's generalization.

00:20:05.080 | How far along this journey

00:20:08.280 | towards building general systems are we?

00:20:10.360 | Being able to do transfer learning well.

00:20:12.760 | So there's some signs that you can generalize a little bit,

00:20:16.680 | but do you think we're on the right path

00:20:19.400 | or totally different breakthroughs are needed

00:20:23.560 | to be able to transfer knowledge

00:20:26.680 | between different learned models?

00:20:28.680 | - Yeah, I'm pretty torn on this.

00:20:33.000 | I think there are some very impressive--

00:20:35.160 | - The present of the day.

00:20:35.960 | - Well, there's just some very impressive results already.

00:20:40.120 | Right?

00:20:40.280 | I mean, I would say when,

00:20:43.000 | even with the initial kind of big breakthrough in 2012

00:20:47.080 | with AlexNet, right?

00:20:48.520 | The initial thing is, okay, great.

00:20:50.840 | This does better on ImageNet, hence image recognition.

00:20:54.680 | But then immediately thereafter,

00:20:57.560 | there was of course the notion that,

00:20:59.560 | wow, what was learned on ImageNet,

00:21:03.080 | and you now want to solve a new task,

00:21:04.840 | you can fine tune AlexNet for new tasks.

00:21:07.400 | And that was often found to be the even bigger deal

00:21:11.880 | that you learn something that was reusable,

00:21:14.120 | which was not often the case before.

00:21:15.880 | Usually machine learning,

00:21:16.680 | you learn something for one scenario, and that was it.

00:21:18.920 | - Yeah, and that's really exciting.

00:21:20.120 | I mean, that's a huge application.

00:21:22.200 | That's probably the biggest success

00:21:23.480 | of transfer learning to date,

00:21:25.160 | in terms of scope and impact.

00:21:27.320 | - That was a huge breakthrough.

00:21:29.000 | And then recently, I feel like similar kind of,

00:21:33.960 | by scaling things up,

00:21:35.960 | it seems like this has been expanded upon.

00:21:37.880 | Like people training even bigger networks,

00:21:39.800 | they might transfer even better.

00:21:41.320 | If you looked at, for example,

00:21:43.000 | some of the OpenAI results on language models

00:21:45.240 | and some of the recent Google results on language models,

00:21:48.120 | they are learned for just prediction,

00:21:52.200 | and then they get reused for other tasks.

00:21:56.760 | And so I think there is something there

00:21:58.440 | where somehow if you train a big enough model

00:22:00.360 | on enough things,

00:22:01.960 | it seems to transfer some deep mind results

00:22:04.920 | that I thought were very impressive,

00:22:05.880 | the Unreal results,

00:22:07.080 | where it was learned to navigate mazes

00:22:11.000 | in ways where it wasn't just doing reinforcement learning,

00:22:13.640 | but it had other objectives,

00:22:14.920 | was optimizing for.

00:22:16.680 | So I think there's a lot of interesting results already.

00:22:19.080 | I think maybe where it's hard to wrap my head around

00:22:24.040 | is to which extent,

00:22:26.200 | or when do we call something generalization,

00:22:28.440 | or the levels of generalization

00:22:31.240 | involved in these different tasks.

00:22:33.560 | - Yeah.

00:22:34.060 | You draw this, by the way, just to frame things.

00:22:38.360 | I've heard you say somewhere,

00:22:40.600 | it's the difference between learning to master

00:22:42.600 | versus learning to generalize.

00:22:44.360 | It's a nice line to think about,

00:22:47.720 | and I guess you're saying that's a gray area

00:22:50.120 | of what learning to master and learning to generalize,

00:22:53.560 | where one starts and one ends.

00:22:54.600 | - I think I might've heard this.

00:22:55.880 | I might've heard it somewhere else,

00:22:57.720 | and I think it might've been one of your interviews,

00:23:00.360 | maybe the one with Yoshua Benjamin,

00:23:02.040 | I'm not 100% sure,

00:23:03.480 | but I liked the example,

00:23:05.080 | and I'm not sure who it was,

00:23:08.280 | but the example was essentially,

00:23:10.520 | if you use current deep learning techniques,

00:23:13.160 | what we're doing to predict,

00:23:14.840 | let's say, the relative motion of our planets,

00:23:20.440 | it would do pretty well,

00:23:21.560 | but then now if a massive new mass

00:23:26.280 | enters our solar system,

00:23:27.640 | it would probably not predict what would happen, right?

00:23:31.960 | And that's a different kind of generalization.

00:23:33.400 | That's a generalization that relies on the ultimate,

00:23:36.520 | simplest, simplest explanation

00:23:38.440 | that we have available today

00:23:40.040 | to explain the motion of planets,

00:23:41.400 | whereas just pattern recognition could predict

00:23:43.560 | our current solar system motion pretty well, no problem.

00:23:47.160 | And so I think that's an example

00:23:48.680 | of a kind of generalization that is a little different

00:23:52.200 | from what we've achieved so far.

00:23:53.960 | And it's not clear if just regularizing more

00:23:59.560 | and forcing it to come up with a simpler,

00:24:01.320 | simpler, simpler explanation,

00:24:02.440 | say, look, this is not simple,

00:24:03.640 | but that's what physics researchers do, right?

00:24:05.400 | They say, can I make this even simpler?

00:24:08.040 | How simple can I get this?

00:24:09.240 | What's the simplest equation

00:24:10.280 | that can explain everything, right?

00:24:12.200 | The master equation for the entire dynamics of the universe.

00:24:15.320 | We haven't really pushed that direction as hard

00:24:17.480 | in deep learning, I would say.

00:24:19.160 | Not sure if it should be pushed,

00:24:21.880 | but it seems a kind of generalization you get from that

00:24:24.360 | that you don't get in our current methods so far.

00:24:27.240 | - So I just talked to Vladimir Vapnik, for example,

00:24:29.960 | who's a statistician, statistical learning,

00:24:34.040 | and he kind of dreams of creating,

00:24:36.920 | the E equals MC squared for learning, right?

00:24:40.920 | The general theory of learning.

00:24:42.280 | Do you think that's a fruitless pursuit in near term,

00:24:47.640 | within the next several decades?

00:24:50.600 | - I think that's a really interesting pursuit

00:24:53.480 | and in the following sense,

00:24:55.560 | in that there is a lot of evidence

00:24:57.960 | that the brain is pretty modular.

00:25:02.680 | And so I wouldn't maybe think of it as the theory,

00:25:05.400 | maybe the underlying theory,

00:25:06.920 | but more kind of the principle

00:25:09.000 | where there have been findings

00:25:12.200 | where people who are blind

00:25:15.000 | will use the part of the brain

00:25:16.440 | usually used for vision for other functions.

00:25:20.200 | And even after some kind of,

00:25:24.520 | if people get rewired in some way,

00:25:26.200 | they might be able to reuse parts of their brain

00:25:28.520 | for other functions.

00:25:29.320 | And so what that suggests is some kind of modularity

00:25:35.000 | and I think it is a pretty natural thing to strive for,

00:25:39.080 | to see, can we find that modularity?

00:25:41.560 | Can we find this thing?

00:25:43.000 | Of course, it's not every part of the brain

00:25:44.840 | is not exactly the same.

00:25:45.800 | Not everything can be rewired arbitrarily.

00:25:48.360 | But if you think of things like the neocortex,

00:25:50.040 | which is a pretty big part of the brain,

00:25:51.560 | that seems fairly modular from what the findings so far.

00:25:55.720 | Can you design something equally modular?

00:25:59.080 | And if you can just grow it,

00:26:00.280 | it becomes more capable probably.

00:26:02.280 | I think that would be the kind of interesting

00:26:04.760 | underlying principle to shoot for

00:26:06.280 | that is not unrealistic.

00:26:08.360 | - Do you think you prefer math

00:26:12.600 | or empirical trial and error

00:26:15.080 | for the discovery of the essence

00:26:16.680 | of what it means to do something intelligent?

00:26:18.840 | So reinforcement learning embodies both groups, right?

00:26:21.960 | To prove that something converges, prove the bounds.

00:26:26.280 | And then at the same time,

00:26:27.720 | a lot of those successes are,

00:26:29.160 | well, let's try this and see if it works.

00:26:31.400 | So which do you gravitate towards?

00:26:33.240 | How do you think of those two parts of your brain?

00:26:35.720 | - So maybe I would prefer

00:26:42.360 | we could make the progress with mathematics.

00:26:45.480 | And the reason maybe I would prefer that

00:26:46.920 | is because often if you have something

00:26:49.000 | you can mathematically formalize,

00:26:51.880 | you can leapfrog a lot of experimentation.

00:26:55.560 | And experimentation takes a long time to get through.

00:26:58.120 | And a lot of trial and error,

00:27:01.160 | kind of reinforcement learning your research process,

00:27:03.160 | but you need to do a lot of trial and error

00:27:05.480 | before you get to a success.

00:27:06.520 | So if you can leapfrog that, to my mind,

00:27:08.360 | that's what the math is about.

00:27:09.720 | And hopefully once you do a bunch of experiments,

00:27:13.080 | you start seeing a pattern.

00:27:14.360 | You can do some derivations that leapfrog some experiments.

00:27:17.480 | But I agree with you.

00:27:18.840 | I mean, in practice, a lot of the progress

00:27:20.680 | has been such that we have not been able to find

00:27:23.080 | the math that allows it to leapfrog ahead.

00:27:25.000 | And we are kind of making gradual progress

00:27:27.960 | one step at a time.

00:27:29.320 | A new experiment here, a new experiment there

00:27:31.080 | that gives us new insights and gradually building up,

00:27:34.280 | but not getting to something yet where we're just,

00:27:36.440 | "Okay, here's an equation that now explains how,"

00:27:38.920 | you know, that would be,

00:27:40.440 | have been two years of experimentation to get there,

00:27:42.440 | but this tells us what the result's going to be.

00:27:44.760 | Unfortunately, not so much yet.

00:27:47.080 | - Not so much yet, but your hope is there.

00:27:49.320 | In trying to teach robots or systems

00:27:53.560 | to do everyday tasks or even in simulation,

00:27:58.840 | what do you think you're more excited about?

00:28:01.400 | Imitation learning or self-play?

00:28:04.680 | So letting robots learn from humans

00:28:07.400 | or letting robots plan their own,

00:28:11.240 | try to figure out in their own way

00:28:13.640 | and eventually play, eventually interact with humans

00:28:18.120 | or solve whatever problem is.

00:28:19.960 | What's the more exciting to you?

00:28:21.720 | What's more promising, you think, as a research direction?

00:28:24.280 | - So when we look at self-play,

00:28:31.400 | what's so beautiful about it is,

00:28:34.200 | goes back to kind of the challenges

00:28:36.280 | in reinforcement learning.

00:28:37.160 | So the challenge of reinforcement learning

00:28:38.360 | is getting signal.

00:28:39.160 | And if you don't never succeed,

00:28:41.880 | you don't get any signal.

00:28:43.160 | In self-play, you're on both sides.

00:28:46.600 | So one of you succeeds.

00:28:47.800 | And the beauty is also one of you fails.

00:28:49.800 | And so you see the contrast.

00:28:51.000 | You see the one version of me

00:28:52.680 | that did better than the other version.

00:28:53.880 | And so every time you play yourself,

00:28:55.880 | you get signal.

00:28:57.080 | And so whenever you can turn something into self-play,

00:28:59.880 | you're in a beautiful situation

00:29:01.960 | where you can naturally learn much more quickly

00:29:04.680 | than in most other reinforced learning environments.

00:29:07.800 | So I think if somehow we can turn more

00:29:11.720 | reinforcement learning problems

00:29:13.640 | into self-play formulations,

00:29:15.400 | that would go really, really far.

00:29:17.080 | So far, self-play has been largely around games

00:29:20.520 | where there is natural opponents.

00:29:22.600 | But if we could do self-play for other things,

00:29:24.520 | and let's say, I don't know,

00:29:25.400 | a robot learns to build a house.

00:29:26.760 | I mean, that's a pretty advanced thing

00:29:28.280 | to try to do for a robot,

00:29:29.400 | but maybe it tries to build a hut or something.

00:29:31.720 | If that can be done through self-play,

00:29:34.040 | it would learn a lot more quickly

00:29:35.240 | if somebody can figure that out.

00:29:36.360 | And I think that would be something

00:29:37.880 | where it goes closer

00:29:39.320 | to kind of the mathematical leapfrogging

00:29:41.400 | where somebody figures out a formalism to say,

00:29:43.720 | "Okay, any RL problem by playing this and this idea,

00:29:47.000 | "you can turn it into a self-play problem

00:29:48.520 | "where you get signal a lot more easily."

00:29:52.440 | Reality is, many problems

00:29:54.200 | we don't know how to turn into self-play.

00:29:55.880 | And so either we need to provide detailed reward,

00:29:58.760 | that doesn't just reward for achieving a goal,

00:30:00.840 | but rewards for making progress,

00:30:02.680 | and that becomes time-consuming.

00:30:04.440 | And once you're starting to do that,

00:30:05.720 | let's say you want a robot to do something,

00:30:07.000 | you need to give all this detailed reward,

00:30:09.000 | well, why not just give a demonstration?

00:30:10.600 | - Right.

00:30:11.160 | - Because why not just show the robot?

00:30:13.000 | And now the question is, how do you show the robot?

00:30:16.360 | One way to show is to tally operate the robot,

00:30:18.440 | and then the robot really experiences things.

00:30:20.840 | And that's nice because that's really high signal

00:30:22.840 | to noise ratio data, and we've done a lot of that.

00:30:24.840 | And you teach your robot skills,

00:30:26.760 | in just 10 minutes, you can teach a robot a new basic skill,

00:30:29.640 | like, "Okay, pick up the bottle, place it somewhere else."

00:30:32.120 | That's a skill, no matter where the bottle starts,

00:30:34.120 | maybe it always goes onto a target or something.

00:30:36.040 | That's fairly easy to teach your robot with tally-up.

00:30:39.080 | Now, what's even more interesting,

00:30:42.120 | if you can now teach your robot

00:30:43.160 | through third-person learning,

00:30:44.920 | where the robot watches you do something,

00:30:46.840 | and doesn't experience it, but just watches it and says,

00:30:49.960 | "Okay, well, if you're showing me that,

00:30:51.400 | that means I should be doing this,

00:30:53.640 | and I'm not gonna be using your hand,

00:30:55.160 | because I don't get to control your hand,

00:30:56.840 | but I'm gonna use my hand, I do that mapping."

00:30:59.320 | And so that's where I think one of the big breakthroughs

00:31:01.960 | has happened this year.

00:31:03.160 | This was led by Chelsea Finn here.

00:31:05.160 | It's almost like learning a machine translation

00:31:08.040 | for demonstrations, where you have a human demonstration,

00:31:11.080 | and the robot learns to translate it into

00:31:13.080 | what it means for the robot to do it.

00:31:15.640 | And that was a meta-learning formulation,

00:31:17.400 | learn from one to get the other.

00:31:19.800 | And that, I think, opens up a lot of opportunities

00:31:22.840 | to learn a lot more quickly.

00:31:24.280 | - So my focus is on autonomous vehicles.

00:31:26.360 | Do you think this approach of third-person watching,

00:31:28.920 | the autonomous driving is amenable

00:31:31.800 | to this kind of approach?

00:31:33.000 | - So for autonomous driving, I would say it's,

00:31:37.960 | third-person is slightly easier.

00:31:41.400 | And the reason I'm gonna say it's slightly easier

00:31:43.240 | to do with third-person is because

00:31:46.520 | the car dynamics are very well understood.

00:31:48.760 | So the-- - Easier than

00:31:51.720 | first-person, you mean?

00:31:53.800 | Or easier than-- - I think it's,

00:31:55.560 | so I think the distinction between third-person

00:31:57.400 | and first-person is not a very important distinction

00:32:00.120 | for autonomous driving.

00:32:01.720 | They're very similar, because the distinction is really about

00:32:06.040 | who turns the steering wheel,

00:32:07.720 | or maybe, let me put it differently.

00:32:11.720 | How to get from a point where you are now

00:32:14.680 | to a point, let's say, a couple meters in front of you.

00:32:17.320 | And that's a problem that's very well understood.

00:32:19.080 | And that's the only distinction

00:32:20.120 | between third and first-person there.

00:32:21.640 | Whereas with the robot manipulation,

00:32:23.080 | interaction forces are very complex,

00:32:25.240 | and it's still a very different thing.

00:32:26.680 | For autonomous driving, I think there is still the question,

00:32:31.320 | imitation versus RL.

00:32:33.880 | So imitation gives you a lot more signal.

00:32:36.600 | I think where imitation is lacking

00:32:38.760 | and needs some extra machinery is,

00:32:42.280 | it doesn't, in its normal format,

00:32:45.320 | doesn't think about goals or objectives.

00:32:47.560 | And of course, there are versions of imitation learning,

00:32:50.920 | inverse reinforcement learning type imitation learning,

00:32:52.760 | which also thinks about goals.

00:32:54.440 | I think then we're getting much closer.

00:32:56.840 | But I think it's very hard to think of a

00:32:58.520 | fully reactive car generalizing well.

00:33:03.880 | If it really doesn't have a notion of objectives

00:33:05.800 | to generalize well to the kind of generality

00:33:08.520 | that you would want.

00:33:09.400 | You'd want more than just that reactivity

00:33:11.960 | that you get from just behavioral cloning/supervised learning.

00:33:15.160 | - So a lot of the work,

00:33:18.280 | whether it's self-play or even imitation learning,

00:33:21.880 | would benefit significantly from simulation,

00:33:24.040 | from effective simulation.

00:33:26.360 | And you're doing a lot of stuff in the physical world

00:33:28.120 | and in simulation.

00:33:29.480 | Do you have hope for greater and greater

00:33:32.440 | power of simulation being boundless eventually

00:33:38.200 | to where most of what we need to operate

00:33:40.520 | in the physical world could be simulated

00:33:43.560 | to a degree that's directly transferable

00:33:46.280 | to the physical world?

00:33:47.400 | Or are we still very far away from that?

00:33:49.080 | - So I think we could even rephrase that question

00:33:57.560 | in some sense.

00:33:58.280 | - Please.

00:33:58.920 | - And so the power of simulation, right?

00:34:03.560 | As simulators get better and better,

00:34:06.440 | of course, becomes stronger

00:34:08.840 | and we can learn more in simulation.

00:34:11.080 | But there's also another version,

00:34:12.280 | which is where you say the simulator

00:34:13.560 | doesn't even have to be that precise.

00:34:15.240 | As long as it's somewhat representative

00:34:17.800 | and instead of trying to get one simulator

00:34:20.360 | that is sufficiently precise to learn in

00:34:23.000 | and transfer really well to the real world,

00:34:25.160 | I'm going to build many simulators.

00:34:26.600 | - Ensemble of simulators.

00:34:28.040 | - Ensemble of simulators.

00:34:29.240 | Not any single one of them

00:34:31.960 | is sufficiently representative of the real world

00:34:34.520 | such that it would work if you train in there.

00:34:37.800 | But if you train in all of them,

00:34:39.720 | then there is something that's good in all of them.

00:34:43.400 | The real world will just be another one of them

00:34:47.480 | that's not identical to any one of them,

00:34:49.480 | but just another one of them.

00:34:50.600 | - Another sample from the distribution of simulators.

00:34:52.920 | - Exactly.

00:34:53.240 | - We do live in a simulation,

00:34:54.680 | so this is just one other one.

00:34:57.400 | - I'm not sure about that, but yeah.

00:34:59.160 | It's definitely a very advanced simulator if it is.

00:35:03.320 | - Yeah, it's a pretty good one.

00:35:04.760 | I've talked to Russell.

00:35:07.320 | It's something you think about a little bit too.

00:35:09.320 | Of course, you're really trying to build these systems,

00:35:11.880 | but do you think about the future of AI?

00:35:13.640 | A lot of people have concern about safety.

00:35:15.560 | How do you think about AI safety

00:35:18.120 | as you build robots that are operating in the physical world?

00:35:20.920 | How do you approach this problem

00:35:24.920 | in an engineering kind of way, in a systematic way?

00:35:27.400 | - So when a robot is doing things,

00:35:32.200 | you kind of have a few notions of safety to worry about.

00:35:36.120 | One is that the robot is physically strong

00:35:39.240 | and of course could do a lot of damage.

00:35:41.480 | Same for cars, which we can think of as robots too in some way.

00:35:45.240 | And this could be completely unintentional.

00:35:48.200 | So it could be not the kind of long-term AI safety concerns

00:35:51.640 | that, okay, AI is smarter than us and now what do we do?

00:35:54.200 | But it could be just very practical.

00:35:55.720 | Okay, this robot, if it makes a mistake,

00:35:57.720 | what are the results going to be?

00:36:00.520 | Of course, simulation comes in a lot there

00:36:02.200 | to test in simulation.

00:36:04.280 | It's a difficult question.

00:36:05.560 | And I'm always wondering, like I always wonder,

00:36:07.400 | let's say you look at, let's go back to driving

00:36:09.960 | 'cause a lot of people know driving well, of course.

00:36:12.200 | What do we do to test somebody for driving, right?

00:36:16.600 | To get a driver's license?

00:36:18.440 | What do they really do?

00:36:19.400 | I mean, you fill out some tests and then you drive.

00:36:24.840 | And I mean, in suburban California,

00:36:27.640 | that driving test is just you drive around the block,

00:36:31.080 | pull over, you do a stop sign successfully

00:36:34.600 | and then you pull over again and you're pretty much done.

00:36:37.560 | And you're like, okay, if a self-driving car did that,

00:36:42.680 | would you trust it that it can drive?

00:36:45.080 | And I'd be like, no, that's not enough for me to trust it.

00:36:47.240 | But somehow for humans, we've figured out

00:36:49.800 | that somebody being able to do that is representative

00:36:53.160 | of them being able to do a lot of other things.

00:36:56.280 | And so I think somehow for humans,

00:36:58.360 | we've figured out representative tests of what it means

00:37:02.040 | if you can do this, what you can really do.

00:37:04.120 | Of course, testing humans,

00:37:05.720 | humans don't wanna be tested at all times.

00:37:07.400 | Self-driving cars or robots

00:37:08.520 | could be tested more often probably.

00:37:10.200 | You can have replicants that get tested

00:37:11.640 | and they're known to be identical

00:37:13.080 | 'cause they use the same neural net and so forth.

00:37:15.400 | But still, I feel like we don't have this kind of unit tests

00:37:19.480 | or proper tests for robots.

00:37:22.680 | And I think there's something very interesting

00:37:23.800 | to be thought about there,

00:37:25.080 | especially as you update things.

00:37:26.680 | Your software improves,

00:37:28.120 | you have a better self-driving car suite, you update it.

00:37:31.000 | How do you know it's indeed more capable on everything

00:37:34.600 | than what you had before

00:37:35.960 | that you didn't have any bad things creep into it?

00:37:40.120 | So I think that's a very interesting direction of research

00:37:42.120 | that there is no real solution yet,

00:37:44.920 | except that somehow for humans we do.

00:37:46.520 | 'Cause we say, okay, you have a driving test, you passed,

00:37:49.400 | you can go on the road now.

00:37:50.520 | And humans have accidents every like million

00:37:53.400 | or 10 million miles, something pretty phenomenal.

00:37:55.640 | Compared to that short test that is being done.

00:38:01.560 | - So let me ask, you've mentioned that Andrew Ang,

00:38:05.240 | by example, showed you the value of kindness.

00:38:07.560 | Do you think the space of policies,

00:38:14.440 | good policies for humans and for AI

00:38:16.680 | is populated by policies that,

00:38:20.120 | with kindness or ones that are the opposite?

00:38:25.720 | Exploitation, even evil.

00:38:28.040 | So if you just look at the sea of policies

00:38:30.120 | we operate under as human beings,

00:38:32.440 | or if AI system had to operate in this real world,

00:38:35.080 | do you think it's really easy to find policies

00:38:37.880 | that are full of kindness,

00:38:39.400 | like would naturally fall into them?

00:38:41.160 | Or is it like a very hard optimization problem?

00:38:44.440 | - I mean, there is kind of two optimizations happening

00:38:50.680 | for humans, right?

00:38:52.120 | So for humans, there's kind of

00:38:53.080 | the very long-term optimization,

00:38:54.600 | which evolution has done for us.

00:38:56.680 | And we're kind of predisposed to like certain things.

00:39:00.520 | And that's in some sense what makes our learning easier

00:39:02.600 | because I mean, we know things like pain

00:39:04.600 | and hunger and thirst.

00:39:08.200 | And the fact that we know about those

00:39:09.960 | is not something that we were taught.

00:39:11.640 | That's kind of innate.

00:39:12.520 | When we're hungry, we're unhappy.

00:39:13.800 | When we're thirsty, we're unhappy.

00:39:15.160 | When we have pain, we're unhappy.

00:39:18.280 | And ultimately evolution built that into us

00:39:21.560 | to think about those things.

00:39:22.520 | And so I think there is a notion that

00:39:24.520 | it seems somehow humans evolved in general

00:39:27.320 | to prefer to get along in some ways,

00:39:32.120 | but at the same time, also to be very territorial

00:39:36.280 | and kind of centric to their own tribe.

00:39:39.720 | It seems like that's the kind of space

00:39:43.480 | we converged onto.

00:39:44.600 | I mean, I'm not an expert in anthropology,

00:39:46.520 | but it seems like we're very kind of

00:39:47.960 | good within our own tribe,

00:39:50.200 | but need to be taught.

00:39:52.440 | To be nice to other tribes.

00:39:54.360 | - Well, if you look at Steven Pinker,

00:39:56.200 | he highlights this pretty nicely

00:39:57.800 | in "Better Angels of Our Nature"

00:40:02.200 | where he talks about violence

00:40:03.480 | decreasing over time consistently.

00:40:05.480 | So whatever tension, whatever teams we pick,

00:40:08.200 | it seems that the long arc of history

00:40:11.000 | goes towards us getting along more and more.

00:40:13.720 | - I hope so.

00:40:14.760 | (laughing)

00:40:16.200 | - So do you think that,

00:40:17.880 | do you think it's possible to teach RL

00:40:22.200 | based robots this kind of kindness,

00:40:26.040 | this kind of ability to interact with humans,

00:40:28.280 | this kind of policy,

00:40:29.560 | even to, let me ask a fun one.

00:40:32.120 | Do you think it's possible to teach RL based robot

00:40:35.000 | to love a human being

00:40:36.200 | and to inspire that human to love the robot back?

00:40:39.960 | So to like a RL based algorithm

00:40:43.720 | that leads to a happy marriage.

00:40:45.880 | - That's an interesting question.

00:40:48.760 | Maybe I'll answer it with another question, right?

00:40:52.520 | (laughing)

00:40:54.120 | 'Cause I mean, but I'll come back to it.

00:40:56.520 | So another question you can have is,

00:40:58.040 | okay, I mean, how close does some people's happiness get

00:41:02.840 | from interacting with just a really nice dog?

00:41:07.480 | Like, I mean, dogs, you come home,

00:41:09.720 | that's what dogs do.

00:41:10.520 | They greet you, they're excited.

00:41:11.960 | It makes you happy when you come home to your dog.

00:41:14.520 | You're just like, okay, this is exciting.

00:41:16.280 | They're always happy when I'm here.

00:41:18.120 | I mean, if they don't greet you,

00:41:19.640 | 'cause maybe whatever,

00:41:20.520 | your partner took him on a trip or something,

00:41:22.920 | you might not be nearly as happy when you get home, right?

00:41:25.960 | And so the kind of,

00:41:27.560 | it seems like the level of reasoning a dog has

00:41:31.000 | is pretty sophisticated,

00:41:32.040 | but then it's still not yet at the level of human reasoning.

00:41:35.480 | And so it seems like we don't even need to achieve

00:41:37.640 | human level reasoning to get like very strong affection

00:41:40.360 | with humans.

00:41:41.560 | And so my thinking is why not, right?

00:41:44.280 | Why couldn't with an AI,

00:41:45.480 | couldn't we achieve the kind of level of affection

00:41:48.920 | that humans feel among each other

00:41:51.960 | or with friendly animals and so forth?

00:41:55.800 | It's a question, is it a good thing for us or not?

00:41:59.560 | That's another thing, right?

00:42:01.240 | Because I mean,

00:42:02.040 | but I don't see why not.

00:42:05.720 | - Why not, yeah.

00:42:07.000 | So Elon Musk says love is the answer.

00:42:08.920 | Maybe he should say love is the objective function

00:42:12.520 | and then RL is the answer, right?

00:42:14.360 | (both laughing)

00:42:15.640 | - Well, maybe.

00:42:16.200 | (both laughing)

00:42:17.480 | - Oh, Peter, thank you so much.

00:42:18.760 | I don't wanna take up more of your time.

00:42:20.120 | Thank you so much for talking today.

00:42:21.400 | - Well, thanks for coming by.

00:42:23.320 | Great to have you visit.

00:42:24.680 | (upbeat music)

00:42:27.260 | (upbeat music)

00:42:29.840 | (upbeat music)

00:42:32.420 | (upbeat music)

00:42:35.000 | (upbeat music)

00:42:37.580 | (upbeat music)

00:42:40.160 | [BLANK_AUDIO]

Pieter Abbeel: Deep Reinforcement Learning | Lex Fridman Podcast #10

Chapters