
Pieter Abbeel: Deep Reinforcement Learning | Lex Fridman Podcast #10


Chapters

0:00 Intro
0:41 Robot tennis
4:05 Robot parkour
4:53 The psychology of robots
7:05 Can robots have emotion
10:40 Intuition behind RL
15:05 Time scales
16:17 Limitations
20:30 Reusable results
24:50 Modularity
26:38 Mathematical Formalisation
28:27 Self-play
33:51 Simulation
35:30 Safety
38:47 Human evolution

Whisper Transcript

00:00:00.000 | The following is a conversation with Pieter Abbeel.
00:00:03.120 | He's a professor at UC Berkeley and the director of the Berkeley Robotics Learning Lab.
00:00:07.840 | He's one of the top researchers in the world working on how we make robots understand and interact with the world around them,
00:00:15.360 | especially using imitation and deep reinforcement learning.
00:00:19.720 | This conversation is part of the MIT course on artificial general intelligence and the Artificial Intelligence Podcast.
00:00:26.400 | If you enjoy it, please subscribe on YouTube, iTunes, or your podcast provider of choice,
00:00:31.680 | or simply connect with me on Twitter @lexfridman, spelled F-R-I-D.
00:00:36.920 | And now, here's my conversation with Pieter Abbeel.
00:00:41.400 | You've mentioned that if there was one person you could meet, it would be Roger Federer.
00:00:46.200 | So let me ask, when do you think we'll have a robot that fully autonomously can beat Roger Federer at tennis,
00:00:54.320 | a Roger Federer-level player at tennis?
00:00:57.520 | Well, first, if you can make it happen for me to meet Roger, let me know.
00:01:02.560 | In terms of getting a robot to beat him at tennis, it's kind of an interesting question,
00:01:08.920 | because for a lot of the challenges we think about in AI, the software is really the missing piece.
00:01:16.760 | But for something like this, the hardware is nowhere near either.
00:01:22.640 | To really have a robot that can physically run around, the Boston Dynamics robots are starting to get there,
00:01:28.520 | but still not really human-level ability to run around and then swing a racket.
00:01:36.800 | So you think that's a hardware problem?
00:01:38.320 | I don't think it's a hardware problem only. I think it's a hardware and a software problem.
00:01:41.560 | I think it's both. And I think they'll have independent progress.
00:01:45.640 | So I'd say the hardware, maybe in 10, 15 years.
00:01:51.560 | On clay, not grass. I mean, grass is probably harder.
00:01:55.000 | Well, the clay, I'm not sure what's harder, grass or clay.
00:01:58.840 | The clay involves sliding, which might be harder to master, actually.
00:02:05.880 | But you're not limited to bipedal. I'm sure there's no...
00:02:09.880 | Well, if we can build a machine, it's a whole different question, of course.
00:02:13.080 | If you can say, "Okay, this robot can be on wheels, it can move around on wheels,
00:02:17.560 | and can be designed differently," then I think that can be done sooner,
00:02:22.680 | probably than a full humanoid type of setup.
00:02:26.120 | What do you think about swinging a racket? You've worked on basic manipulation.
00:02:31.160 | How hard do you think the task of swinging a racket is,
00:02:34.120 | being able to hit a nice backhand or a forehand?
00:02:39.240 | Let's say we just set up stationary, a nice robot arm, let's say,
00:02:43.960 | you know, a standard industrial arm, and it can watch the ball come
00:02:48.360 | and then swing the racket. It's a good question.
00:02:51.400 | I'm not sure it would be super hard to do.
00:02:56.040 | I mean, I'm sure it would require a lot...
00:02:58.040 | If we do it with reinforcement learning, it would require a lot of trial and error.
00:03:01.400 | It's not going to swing it right the first time around.
00:03:03.240 | But yeah, I don't see why it couldn't swing it the right way.
00:03:09.320 | I think it's learnable. I think if you set up a ball machine,
00:03:11.960 | let's say, on one side, and then a robot with a tennis racket on the other side,
00:03:17.640 | I think it's learnable. And maybe a little bit of pre-training and simulation.
00:03:23.000 | Yeah, I think that's feasible. I think swinging the racket is feasible.
00:03:27.160 | It'd be very interesting to see how much precision it can get.
00:03:31.320 | Because, I mean, that's where, I mean, some of the human players can hit it
00:03:36.760 | on the lines, which is very high precision.
00:03:39.160 | With spin. The spin is an interesting...
00:03:41.960 | Whether RL can learn to put a spin on the ball.
00:03:45.560 | Well, you got me interested. Maybe someday we'll set this up.
00:03:48.120 | Someday, sure.
00:03:48.620 | You got me intrigued.
00:03:51.080 | Your answer is basically, okay, for this problem, it sounds fascinating.
00:03:54.120 | But for the general problem of a tennis player, we might be a little bit farther away.
00:03:57.960 | What's the most impressive thing you've seen a robot do in the physical world?
00:04:04.120 | So physically, for me, it's the Boston Dynamics videos.
00:04:10.920 | They always just bring it home, and I'm just super impressed.
00:04:14.200 | Recently, the robot running up the stairs, doing the parkour type thing.
00:04:19.400 | I mean, yes, we don't know what's underneath.
00:04:22.200 | They don't really write a lot of detail.
00:04:23.880 | But even if it's hard-coded underneath, which it might or might not be,
00:04:28.360 | just the physical ability of doing that parkour, that's very impressive.
00:04:32.600 | So have you met Spot Mini or any of those robots in person?
00:04:36.680 | Met Spot Mini last year in April at the Mars event that Jeff Bezos organizes.
00:04:42.840 | They brought it out there and it was nicely following around Jeff.
00:04:47.640 | When Jeff left the room, they had it follow him along, which was pretty impressive.
00:04:52.120 | So I think we can say with some confidence that there's no learning going on in those robots.
00:04:57.960 | The psychology of it.
00:04:58.920 | So while knowing that, knowing that if there's any learning going on, it's very limited,
00:05:03.400 | I met Spot Mini earlier this year and knowing everything that's going on,
00:05:08.680 | having one-on-one interaction, so I get to spend some time alone.
00:05:12.360 | And there's immediately a deep connection on the psychological level.
00:05:18.200 | Even though you know the fundamentals, how it works, there's something magical.
00:05:23.240 | So do you think about the psychology of interacting with robots in the physical world?
00:05:29.000 | Even the PR2 robot you just showed me,
00:05:35.960 | it had a little bit of something like a face.
00:05:38.360 | There's something that immediately draws you to it.
00:05:40.520 | Do you think about that aspect of the robotics problem?
00:05:45.000 | Well, it's very hard with BRETT here.
00:05:48.200 | We gave him a name: Berkeley Robot for the Elimination of Tedious Tasks.
00:05:52.040 | It's very hard to not think of the robot as a person.
00:05:56.440 | And it seems like everybody calls them a he for whatever reason,
00:05:59.400 | but that also makes it more a person than if it was a it.
00:06:01.880 | And it seems pretty natural to think of it that way.
00:06:07.160 | This past weekend really struck me.
00:06:08.520 | I've seen Pepper many times on videos, but then I was at an event organized by,
00:06:15.160 | this was by Fidelity, and they had scripted Pepper to help moderate some sessions.
00:06:22.600 | And they had scripted Pepper to have the personality of a child a little bit.
00:06:26.360 | And it was very hard to not think of it as its own person in some sense,
00:06:31.720 | because it was just kind of jumping, it would just jump into conversation,
00:06:34.360 | making it very interactive.
00:06:35.720 | The moderator would be saying something, and Pepper would just jump in, "Hold on, how about me?
00:06:39.240 | Can I participate in this too?"
00:06:41.240 | And just like, "Okay, this is like a person."
00:06:43.560 | And that was 100% scripted.
00:06:45.400 | And even then it was hard not to have that sense of somehow there is something there.
00:06:50.520 | So as we have robots interact in this physical world,
00:06:54.280 | is that a signal that could be used in reinforcement learning?
00:06:57.080 | You've worked a little bit in this direction,
00:07:00.120 | but do you think that psychology can be somehow pulled in?
00:07:02.920 | Yes, that's a question I would say a lot of people ask.
00:07:08.920 | And I think part of why they ask it is they're thinking about
00:07:12.840 | how unique are we really still as people?
00:07:16.520 | Like after they see some results, they see a computer play Go,
00:07:19.640 | they see a computer do this, that, they're like, "Okay, but can it really have emotion?
00:07:23.640 | Can it really interact with us in that way?"
00:07:26.680 | And then once you're around robots, you already start feeling it.
00:07:29.960 | And I think that kind of maybe methodologically, the way that I think of it is,
00:07:34.280 | if you run something like reinforcement learning, it's about optimizing some objective.
00:07:39.000 | And there's no reason that the objective couldn't be tied into
00:07:47.480 | how much does a person like interacting with this system?
00:07:50.600 | And why could not the reinforcement learning system optimize for
00:07:54.280 | the robot being fun to be around?
00:07:56.040 | And why wouldn't it then naturally become more and more attractive and more and more,
00:08:00.440 | maybe like a person or like a pet, I don't know what it would exactly be,
00:08:04.520 | but more and more have those features and acquire them automatically.
00:08:08.120 | - As long as you can formalize an objective of what it means to like something,
00:08:13.160 | how you exhibit, what's the ground truth?
00:08:17.240 | How do you get the reward from human?
00:08:19.320 | Because you have to somehow collect that information from the human.
00:08:22.280 | But you're saying if you can formulate as an objective, it can be learned.
00:08:26.840 | - There's no reason it couldn't emerge through learning.
00:08:29.320 | And maybe one way to formulate as an objective,
00:08:31.400 | you wouldn't have to necessarily score it explicitly.
00:08:33.720 | So standard rewards are numbers, and numbers are hard to come by.
00:08:37.720 | This is a 1.5 or 1.7 on some scale.
00:08:41.160 | It's very hard to do for a person.
00:08:42.840 | But much easier is for a person to say,
00:08:45.320 | "Okay, what you did the last five minutes was much nicer
00:08:49.000 | than what you did the previous five minutes."
00:08:51.080 | And that now gives a comparison.
00:08:53.000 | And in fact, there have been some results in that.
00:08:55.160 | For example, Paul Christiano and collaborators at OpenAI had the hopper,
00:09:00.040 | MuJoCo hopper, a one-legged robot, learn to do backflips.
00:09:03.720 | Purely from feedback, I like this better than that.
00:09:06.840 | That's kind of equally good.
00:09:08.600 | And after a bunch of interactions,
00:09:10.840 | it figured out what the person was asking for, namely a backflip.
00:09:14.200 | And so I think the same thing--
00:09:15.240 | - Oh, it wasn't trying to do a backflip.
00:09:18.520 | It was just getting a score from the comparison score
00:09:20.680 | from the person based on--
00:09:21.880 | - The person having in mind, in their own mind,
00:09:25.240 | I wanted to do a backflip,
00:09:27.320 | but the robot didn't know what it was supposed to be doing.
00:09:30.680 | It just knew that sometimes the person said,
00:09:32.680 | "This is better, this is worse."
00:09:34.440 | And then the robot figured out
00:09:35.880 | what the person was actually after was a backflip.
00:09:38.600 | And I'd imagine the same would be true for things
00:09:40.840 | like more interactive robots
00:09:43.000 | that the robot would figure out over time,
00:09:45.000 | "Oh, this kind of thing apparently is appreciated more
00:09:47.960 | than this other kind of thing."
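
To make that concrete, here is a minimal sketch (not from the conversation) of learning a reward model from pairwise preferences, in the spirit of the backflip result mentioned above: a Bradley-Terry likelihood over two trajectory segments, with the learned reward then handed to an ordinary RL algorithm. The dimensions, network size, and names are illustrative assumptions.

```python
# Hedged sketch: reward learning from pairwise human comparisons.
# All sizes and names below are illustrative assumptions, not the original code.
import torch
import torch.nn as nn

obs_act_dim = 14                           # assumed size of an (observation, action) vector
reward_model = nn.Sequential(              # maps one (obs, action) step to a scalar reward
    nn.Linear(obs_act_dim, 64), nn.ReLU(), nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def segment_return(segment):
    """Sum of predicted rewards over a trajectory segment of shape (T, obs_act_dim)."""
    return reward_model(segment).sum()

def preference_loss(seg_a, seg_b, human_prefers_a):
    # Bradley-Terry model: P(a preferred) = exp(R_a) / (exp(R_a) + exp(R_b))
    logits = torch.stack([segment_return(seg_a), segment_return(seg_b)])
    log_probs = torch.log_softmax(logits, dim=0)
    return -log_probs[0 if human_prefers_a else 1]

# One update from a single labeled comparison (segments would come from rollouts).
seg_a, seg_b = torch.randn(50, obs_act_dim), torch.randn(50, obs_act_dim)
loss = preference_loss(seg_a, seg_b, human_prefers_a=True)
optimizer.zero_grad(); loss.backward(); optimizer.step()
# The learned reward_model then serves as the reward signal for a standard RL algorithm.
```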
00:09:49.160 | - So when I first picked up Sutton's,
00:09:53.880 | Richard Sutton's reinforcement learning book,
00:09:55.880 | before sort of this deep learning,
00:09:59.880 | before the re-emergence of neural networks
00:10:03.160 | as a powerful mechanism for machine learning,
00:10:04.920 | RL seemed to me like magic.
00:10:07.560 | It was beautiful.
00:10:10.200 | So that seemed like what intelligence is,
00:10:13.480 | RL reinforcement learning.
00:10:14.840 | So how do you think we can possibly learn anything
00:10:20.200 | about the world when the reward for the actions
00:10:22.920 | is delayed, is so sparse?
00:10:25.240 | Why do you think RL works?
00:10:29.240 | Why do you think you can learn anything
00:10:32.120 | under such sparse rewards,
00:10:34.440 | whether it's regular reinforcement learning
00:10:36.760 | or deep reinforcement learning?
00:10:38.360 | What's your intuition?
00:10:40.360 | - The counterpart of that is why is RL,
00:10:43.240 | why does it need so many samples,
00:10:47.000 | so many experiences to learn from?
00:10:48.520 | Because really what's happening is
00:10:50.520 | when you have a sparse reward,
00:10:52.040 | you do something maybe for like, I don't know,
00:10:55.000 | you take a hundred actions and then you get a reward.
00:10:57.160 | And maybe you get like a score of three.
00:10:59.480 | And I'm like, okay, three, not sure what that means.
00:11:02.760 | You go again and now you get two.
00:11:04.280 | And now you know that that sequence of a hundred actions
00:11:06.920 | that you did the second time around
00:11:08.120 | somehow was worse than the sequence of a hundred actions
00:11:10.440 | you did the first time around.
00:11:11.720 | But that's tough to now know
00:11:13.560 | which one of those were better or worse.
00:11:15.000 | Some might've been good and bad in either one.
00:11:16.760 | And so that's why you need so many experiences.
00:11:19.640 | But once you have enough experiences,
00:11:21.160 | effectively RL is teasing that apart.
00:11:23.240 | It's starting to say, okay,
00:11:24.360 | what is consistently there when you get a higher reward?
00:11:27.640 | And what's consistently there when you get a lower reward?
00:11:29.800 | And then kind of the magic of,
00:11:31.880 | and sometimes the policy gradient update is to say,
00:11:34.040 | now let's update the neural network
00:11:36.840 | to make the actions that were kind of present
00:11:39.000 | when things are good, more likely,
00:11:41.320 | and make the actions that are present
00:11:42.920 | when things are not as good, less likely.
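
A minimal REINFORCE-style sketch of that update, assuming a small discrete-action problem: actions from above-average episodes get their log-probability pushed up, actions from below-average episodes get pushed down. The sizes, learning rate, and fake data are illustrative assumptions.

```python
# Hedged sketch of the policy-gradient update described above (REINFORCE with a baseline).
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def update(episodes):
    """episodes: list of (observations (T, obs_dim), actions (T,), total_return) tuples."""
    returns = torch.tensor([ret for _, _, ret in episodes])
    baseline = returns.mean()                         # subtracting a baseline reduces variance
    loss = 0.0
    for obs, acts, ret in episodes:
        log_probs = torch.log_softmax(policy(obs), dim=-1)
        chosen = log_probs.gather(1, acts.unsqueeze(1)).squeeze(1)
        # Actions present in above-average episodes become more likely, and vice versa.
        loss = loss - (ret - baseline) * chosen.sum()
    optimizer.zero_grad()
    (loss / len(episodes)).backward()
    optimizer.step()

# Example with fake data: two episodes of length 5, one better and one worse.
fake_eps = [(torch.randn(5, obs_dim), torch.randint(n_actions, (5,)), 3.0),
            (torch.randn(5, obs_dim), torch.randint(n_actions, (5,)), 2.0)]
update(fake_eps)
```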
00:11:44.680 | - So that is the counterpoint,
00:11:46.920 | but it seems like you would need to run it
00:11:49.480 | a lot more than you do.
00:11:50.840 | Even though right now people could say
00:11:52.600 | that RL is very inefficient,
00:11:54.280 | but it seems to be way more efficient
00:11:56.120 | than one would imagine on paper.
00:11:58.280 | That the simple updates to the policy,
00:12:01.800 | the policy gradient that somehow you can learn exactly.
00:12:05.160 | You just said, what are the common actions
00:12:07.560 | that seem to produce some good results?
00:12:09.640 | That that somehow can learn anything.
00:12:11.720 | It seems counterintuitive at least.
00:12:14.440 | Is there some intuition behind it?
00:12:16.760 | - Yeah, so I think there's a few ways to think about this.
00:12:21.720 | The way I tend to think about it mostly originally,
00:12:25.640 | when we started working on deep reinforcement learning
00:12:28.920 | here at Berkeley, which was maybe 2011, 12, 13,
00:12:32.680 | around that time, John Shulman was a PhD student,
00:12:36.040 | initially kind of driving it forward here.
00:12:38.120 | And the way we thought about it at the time was,
00:12:43.960 | if you think about rectified linear units
00:12:46.920 | or kind of rectifier type neural networks,
00:12:49.000 | what do you get?
00:12:50.840 | You get something that's piecewise
00:12:52.840 | linear feedback control.
00:12:54.200 | And if you look at the literature,
00:12:56.920 | linear feedback control is extremely successful.
00:12:59.160 | It can solve many, many problems surprisingly well.
00:13:02.600 | I remember, for example, when we did helicopter flight,
00:13:05.560 | if you're in a stationary flight regime,
00:13:07.240 | not a non-stationary, but a stationary flight regime
00:13:10.360 | like hover, you can use linear feedback control
00:13:12.360 | to stabilize a helicopter, a very complex dynamical system,
00:13:15.400 | but the controller is relatively simple.
00:13:18.280 | And so I think that's a big part of it is that
00:13:19.880 | if you do feedback control,
00:13:22.200 | even though the system you control can be very,
00:13:24.120 | very complex, often relatively simple control architectures
00:13:28.600 | can already do a lot,
00:13:30.360 | but then also just linear is not good enough.
00:13:32.440 | And so one way you can think of these neural networks
00:13:35.000 | is that in some sense, they tile the space,
00:13:36.920 | which people were already trying to do more by hand
00:13:39.320 | or with finite state machines.
00:13:40.920 | Say this linear controller here,
00:13:42.360 | this linear controller here.
00:13:43.640 | Neural network learns to tile the space
00:13:45.480 | and say linear controller here,
00:13:46.440 | another linear controller here,
00:13:48.120 | but it's more subtle than that.
00:13:49.400 | And so it's benefiting from this linear control aspect,
00:13:51.880 | it's benefiting from the tiling,
00:13:53.480 | but it's somehow tiling it one dimension at a time.
00:13:56.680 | Because if let's say you have a two layer network,
00:13:59.320 | if in that hidden layer,
00:14:00.520 | you make a transition from active to inactive
00:14:04.840 | or the other way around,
00:14:05.720 | that is essentially one axis,
00:14:08.280 | but not axis aligned,
00:14:09.400 | but one direction that you change.
00:14:12.200 | And so you have this kind of very gradual tiling of the space
00:14:15.160 | where you have a lot of sharing
00:14:16.680 | between the linear controllers that tile the space.
00:14:19.400 | And that was always my intuition as to why to expect
00:14:22.120 | that this might work pretty well.
00:14:24.680 | It's essentially leveraging the fact
00:14:25.960 | that linear feedback control is so good,
00:14:28.360 | but of course not enough.
00:14:29.720 | And this is a gradual tiling of the space
00:14:31.640 | with linear feedback controls
00:14:33.400 | that share a lot of expertise across them.
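
A small numerical illustration of that intuition, assuming a one-hidden-layer ReLU policy: for any fixed pattern of active hidden units, the network computes exactly a linear feedback law u = Kx + k, and the activation pattern selects which local linear controller (which "tile") is in effect. Sizes are arbitrary.

```python
# Hedged illustration: a ReLU network policy is, locally, a linear feedback controller.
import numpy as np

state_dim, hidden, action_dim = 4, 16, 2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(hidden, state_dim)), rng.normal(size=hidden)
W2, b2 = rng.normal(size=(action_dim, hidden)), rng.normal(size=action_dim)

def policy(x):
    h = np.maximum(W1 @ x + b1, 0.0)            # ReLU hidden layer
    return W2 @ h + b2                          # control action u = pi(x)

def local_linear_controller(x):
    """Within one activation pattern, the policy is exactly u = K x + k."""
    active = (W1 @ x + b1 > 0).astype(float)    # which hidden units are "on" at this state
    K = W2 @ (active[:, None] * W1)             # local feedback gain matrix
    k = W2 @ (active * b1) + b2
    return K, k

x = rng.normal(size=state_dim)
K, k = local_linear_controller(x)
assert np.allclose(policy(x), K @ x + k)        # same output: locally linear feedback
```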
00:14:36.040 | - So that's really nice intuition.
00:14:38.840 | But do you think that scales
00:14:40.840 | to the more and more general problems
00:14:42.440 | of when you start going up in the number of dimensions,
00:14:47.000 | and when you start going down
00:14:51.480 | in terms of how often you get a clean reward signal?
00:14:55.240 | Does that intuition carry forward
00:14:57.240 | to those crazier, weirder worlds
00:14:59.560 | that we think of as the real world?
00:15:01.160 | - So I think where things get really tricky
00:15:07.880 | in the real world
00:15:08.680 | compared to the things we've looked at so far
00:15:10.840 | with great success in reinforcement learning
00:15:13.080 | is the time scales,
00:15:17.160 | which takes us to an extreme.
00:15:18.920 | So when you think about the real world,
00:15:21.640 | I mean, I don't know,
00:15:23.240 | maybe some student decided to do a PhD here, right?
00:15:26.760 | Okay, that's a decision.
00:15:28.520 | That's a very high level decision.
00:15:29.960 | But if you think about their lives,
00:15:32.440 | I mean, any person's life,
00:15:33.960 | it's a sequence of muscle fiber contractions
00:15:37.240 | and relaxations,
00:15:38.200 | and that's how you interact with the world.
00:15:40.120 | And that's a very high frequency control thing,
00:15:42.520 | but it's ultimately what you do
00:15:44.440 | and how you affect the world.
00:15:45.560 | Until I guess we have brain readings,
00:15:48.120 | and you can maybe do it slightly differently,
00:15:49.640 | but typically that's how you affect the world.
00:15:51.960 | And the decision of doing a PhD is like so abstract
00:15:56.200 | relative to what you're actually doing in the world.
00:15:59.160 | And I think that's where credit assignment
00:16:01.000 | becomes just completely beyond
00:16:04.680 | what any current RL algorithm can do.
00:16:06.600 | And we need hierarchical reasoning
00:16:08.840 | at a level that is just not available at all yet.
00:16:11.640 | - Where do you think we can pick up hierarchical reasoning?
00:16:14.760 | By which mechanisms?
00:16:15.880 | - Yeah, so maybe let me highlight
00:16:18.520 | what I think the limitations are
00:16:20.600 | of what already was done 20, 30 years ago.
00:16:25.880 | In fact, you'll find reasoning systems
00:16:27.560 | that reason over relatively long horizons,
00:16:30.760 | but the problem is that they were not grounded
00:16:32.680 | in the real world.
00:16:33.480 | So people would have to hand design
00:16:37.000 | some kind of logical,
00:16:40.520 | dynamical descriptions of the world,
00:16:43.800 | and that didn't tie into perception.
00:16:46.200 | And so it didn't tie into real objects and so forth.
00:16:49.080 | And so that was a big gap.
00:16:51.000 | Now with deep learning,
00:16:52.200 | we start having the ability to
00:16:55.400 | really see with sensors, process that,
00:16:59.480 | and understand what's in the world.
00:17:01.320 | And so it's a good time to try
00:17:02.680 | to bring these things together.
00:17:03.880 | I see a few ways of getting there.
00:17:06.280 | One way to get there would be to say,
00:17:08.040 | deep learning can get bolted on somehow
00:17:09.960 | to some of these more traditional approaches.
00:17:12.200 | Now, bolted on would probably mean
00:17:13.960 | you need to do some kind of end-to-end training,
00:17:16.200 | where you say, "My deep learning processing
00:17:18.440 | somehow leads to a representation
00:17:20.680 | that in turn uses some kind of
00:17:23.720 | traditional underlying dynamical systems
00:17:27.320 | that can be used for planning."
00:17:29.720 | And that's, for example, the direction
00:17:31.320 | Aviv Tamar and Thanard Kurutach here
00:17:33.320 | have been pushing with Causal InfoGAN
00:17:35.000 | and of course other people too.
00:17:36.200 | That's one way.
00:17:38.120 | Can we somehow force it into the form factor
00:17:41.000 | that is amenable to reasoning?
00:17:43.560 | Another direction we've been thinking about
00:17:46.440 | for a long time
00:17:47.640 | and didn't make any progress on
00:17:50.120 | was more information-theoretic approaches.
00:17:53.560 | So the idea there was that
00:17:55.080 | what it means to take high-level action
00:17:57.960 | is to choose a latent variable now
00:18:02.440 | that tells you a lot about
00:18:03.560 | what's going to be the case in the future.
00:18:05.240 | Because that's what it means
00:18:06.280 | to take a high-level action.
00:18:08.600 | I say, "Okay, I decide I'm going to navigate
00:18:12.920 | to the gas station because I need to get gas for my car.
00:18:15.400 | Well, that'll now take five minutes to get there."
00:18:17.720 | But the fact that I get there,
00:18:19.160 | I could already tell that
00:18:20.200 | from the high-level action I took much earlier.
00:18:23.720 | That, we had a very hard time getting success with.
00:18:27.560 | Not saying it's a dead end necessarily,
00:18:30.520 | but we had a lot of trouble getting that to work.
00:18:32.280 | And then we started revisiting the notion
00:18:34.520 | of what are we really trying to achieve?
00:18:36.120 | What we're trying to achieve
00:18:38.920 | is not necessarily hierarchy per se,
00:18:40.840 | but you can think about what does hierarchy give us?
00:18:42.840 | What we hope it would give us
00:18:45.720 | is better credit assignment.
00:18:46.840 | And what does better credit assignment give us?
00:18:51.720 | It gives us faster learning.
00:18:53.960 | And so, faster learning is ultimately maybe what we're after.
00:18:59.640 | And so, that's what we ended up with,
00:19:01.640 | the RL squared paper on learning to reinforcement learn,
00:19:04.760 | which at the time Rocky Duan led.
00:19:07.480 | And that's exactly the meta-learning approach
00:19:10.920 | where you say, "Okay, we don't know how to design hierarchy.
00:19:13.400 | We know what we want to get from it.
00:19:15.560 | Let's just end-to-end optimize
00:19:17.160 | for what we want to get from it
00:19:18.760 | and see if it might emerge."
00:19:19.960 | And we saw things emerge.
00:19:21.000 | The maze navigation had consistent motion down hallways,
00:19:24.760 | which is what you want.
00:19:26.920 | A hierarchical control should say,
00:19:28.120 | "I want to go down this hallway."
00:19:29.560 | And then when there is an option to take a turn,
00:19:31.480 | I can decide whether to take a turn or not and repeat.
00:19:33.640 | Even had the notion of,
00:19:35.240 | "Where have you been before or not?"
00:19:37.080 | Do not revisit places you've been before.
00:19:39.000 | It still didn't scale yet
00:19:41.880 | to the real world kind of scenarios
00:19:44.760 | I think you had in mind,
00:19:45.800 | but it was some sign of life
00:19:47.000 | that maybe you can meta-learn these hierarchical concepts.
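
A rough sketch of the RL-squared idea described above, under assumed names and architecture: the policy is a recurrent network whose input includes the previous action and reward, so "fast learning" can happen inside the hidden state across the episodes of one task, while an ordinary policy-gradient method trains the weights across many tasks. This is illustrative, not the paper's code.

```python
# Hedged sketch of a "learning to reinforcement learn" (RL^2-style) policy.
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        # Input = observation + one-hot previous action + previous reward + done flag.
        self.rnn = nn.GRUCell(obs_dim + n_actions + 2, hidden)
        self.logits = nn.Linear(hidden, n_actions)
        self.n_actions = n_actions

    def step(self, obs, prev_action, prev_reward, done, h):
        prev_a = torch.zeros(self.n_actions)
        prev_a[prev_action] = 1.0
        x = torch.cat([obs, prev_a, torch.tensor([prev_reward, float(done)])])
        h = self.rnn(x.unsqueeze(0), h)           # hidden state carries the "fast" learning
        action = torch.distributions.Categorical(logits=self.logits(h)).sample().item()
        return action, h

policy = RL2Policy(obs_dim=6, n_actions=3)
h = None                                          # fresh hidden state at the start of a new task
action, h = policy.step(torch.zeros(6), prev_action=0, prev_reward=0.0, done=False, h=h)
# Outer loop (not shown): sample a task, run several episodes WITHOUT resetting h,
# and train everything end to end with a standard policy-gradient method across tasks.
```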
00:19:51.000 | I mean, it seems like through these meta-learning concepts,
00:19:56.040 | we get at what I think is one of the hardest
00:19:59.640 | and most important problems of AI,
00:20:02.200 | which is transfer learning.
00:20:03.320 | So it's generalization.
00:20:05.080 | How far along this journey
00:20:08.280 | towards building general systems are we?
00:20:10.360 | Being able to do transfer learning well.
00:20:12.760 | So there's some signs that you can generalize a little bit,
00:20:16.680 | but do you think we're on the right path
00:20:19.400 | or totally different breakthroughs are needed
00:20:23.560 | to be able to transfer knowledge
00:20:26.680 | between different learned models?
00:20:28.680 | - Yeah, I'm pretty torn on this.
00:20:33.000 | I think there are some very impressive--
00:20:35.160 | - The present of the day.
00:20:35.960 | - Well, there's just some very impressive results already.
00:20:40.120 | Right?
00:20:40.280 | I mean, I would say when,
00:20:43.000 | even with the initial kind of big breakthrough in 2012
00:20:47.080 | with AlexNet, right?
00:20:48.520 | The initial thing is, okay, great.
00:20:50.840 | This does better on ImageNet, hence image recognition.
00:20:54.680 | But then immediately thereafter,
00:20:57.560 | there was of course the notion that,
00:20:59.560 | wow, what was learned on ImageNet,
00:21:03.080 | and you now want to solve a new task,
00:21:04.840 | you can fine tune AlexNet for new tasks.
00:21:07.400 | And that was often found to be the even bigger deal
00:21:11.880 | that you learn something that was reusable,
00:21:14.120 | which was not often the case before.
00:21:15.880 | Usually machine learning,
00:21:16.680 | you learn something for one scenario, and that was it.
00:21:18.920 | - Yeah, and that's really exciting.
00:21:20.120 | I mean, that's a huge application.
00:21:22.200 | That's probably the biggest success
00:21:23.480 | of transfer learning to date,
00:21:25.160 | in terms of scope and impact.
00:21:27.320 | - That was a huge breakthrough.
00:21:29.000 | And then recently, I feel like similar kind of,
00:21:33.960 | by scaling things up,
00:21:35.960 | it seems like this has been expanded upon.
00:21:37.880 | Like people training even bigger networks,
00:21:39.800 | they might transfer even better.
00:21:41.320 | If you looked at, for example,
00:21:43.000 | some of the OpenAI results on language models
00:21:45.240 | and some of the recent Google results on language models,
00:21:48.120 | they are trained just for prediction,
00:21:52.200 | and then they get reused for other tasks.
00:21:56.760 | And so I think there is something there
00:21:58.440 | where somehow if you train a big enough model
00:22:00.360 | on enough things,
00:22:01.960 | it seems to transfer. Some DeepMind results
00:22:04.920 | that I thought were very impressive,
00:22:05.880 | the UNREAL results,
00:22:07.080 | where it learned to navigate mazes
00:22:11.000 | in ways where it wasn't just doing reinforcement learning,
00:22:13.640 | but it had other objectives
00:22:14.920 | it was optimizing for.
00:22:16.680 | So I think there's a lot of interesting results already.
00:22:19.080 | I think maybe where it's hard to wrap my head around
00:22:24.040 | is to which extent,
00:22:26.200 | or when do we call something generalization,
00:22:28.440 | or the levels of generalization
00:22:31.240 | involved in these different tasks.
00:22:33.560 | - Yeah.
00:22:34.060 | You draw this, by the way, just to frame things.
00:22:38.360 | I've heard you say somewhere,
00:22:40.600 | it's the difference between learning to master
00:22:42.600 | versus learning to generalize.
00:22:44.360 | It's a nice line to think about,
00:22:47.720 | and I guess you're saying that's a gray area
00:22:50.120 | of what learning to master and learning to generalize,
00:22:53.560 | where one starts and one ends.
00:22:54.600 | - I think I might've heard this.
00:22:55.880 | I might've heard it somewhere else,
00:22:57.720 | and I think it might've been one of your interviews,
00:23:00.360 | maybe the one with Yoshua Bengio,
00:23:02.040 | I'm not 100% sure,
00:23:03.480 | but I liked the example,
00:23:05.080 | and I'm not sure who it was,
00:23:08.280 | but the example was essentially,
00:23:10.520 | if you use current deep learning techniques,
00:23:13.160 | what we're doing to predict,
00:23:14.840 | let's say, the relative motion of our planets,
00:23:20.440 | it would do pretty well,
00:23:21.560 | but then now if a massive new mass
00:23:26.280 | enters our solar system,
00:23:27.640 | it would probably not predict what would happen, right?
00:23:31.960 | And that's a different kind of generalization.
00:23:33.400 | That's a generalization that relies on the ultimate,
00:23:36.520 | simplest, simplest explanation
00:23:38.440 | that we have available today
00:23:40.040 | to explain the motion of planets,
00:23:41.400 | whereas just pattern recognition could predict
00:23:43.560 | our current solar system motion pretty well, no problem.
00:23:47.160 | And so I think that's an example
00:23:48.680 | of a kind of generalization that is a little different
00:23:52.200 | from what we've achieved so far.
00:23:53.960 | And it's not clear if just regularizing more
00:23:59.560 | and forcing it to come up with a simpler,
00:24:01.320 | simpler, simpler explanation,
00:24:02.440 | say, look, this is not simple,
00:24:03.640 | but that's what physics researchers do, right?
00:24:05.400 | They say, can I make this even simpler?
00:24:08.040 | How simple can I get this?
00:24:09.240 | What's the simplest equation
00:24:10.280 | that can explain everything, right?
00:24:12.200 | The master equation for the entire dynamics of the universe.
00:24:15.320 | We haven't really pushed that direction as hard
00:24:17.480 | in deep learning, I would say.
00:24:19.160 | Not sure if it should be pushed,
00:24:21.880 | but it seems a kind of generalization you get from that
00:24:24.360 | that you don't get in our current methods so far.
00:24:27.240 | - So I just talked to Vladimir Vapnik, for example,
00:24:29.960 | who's a statistician working on statistical learning theory,
00:24:34.040 | and he kind of dreams of creating,
00:24:36.920 | the E equals MC squared for learning, right?
00:24:40.920 | The general theory of learning.
00:24:42.280 | Do you think that's a fruitless pursuit in near term,
00:24:47.640 | within the next several decades?
00:24:50.600 | - I think that's a really interesting pursuit
00:24:53.480 | and in the following sense,
00:24:55.560 | in that there is a lot of evidence
00:24:57.960 | that the brain is pretty modular.
00:25:02.680 | And so I wouldn't maybe think of it as the theory,
00:25:05.400 | maybe the underlying theory,
00:25:06.920 | but more kind of the principle
00:25:09.000 | where there have been findings
00:25:12.200 | where people who are blind
00:25:15.000 | will use the part of the brain
00:25:16.440 | usually used for vision for other functions.
00:25:20.200 | And even after some kind of,
00:25:24.520 | if people get rewired in some way,
00:25:26.200 | they might be able to reuse parts of their brain
00:25:28.520 | for other functions.
00:25:29.320 | And so what that suggests is some kind of modularity
00:25:35.000 | and I think it is a pretty natural thing to strive for,
00:25:39.080 | to see, can we find that modularity?
00:25:41.560 | Can we find this thing?
00:25:43.000 | Of course, it's not every part of the brain
00:25:44.840 | is not exactly the same.
00:25:45.800 | Not everything can be rewired arbitrarily.
00:25:48.360 | But if you think of things like the neocortex,
00:25:50.040 | which is a pretty big part of the brain,
00:25:51.560 | that seems fairly modular from the findings so far.
00:25:55.720 | Can you design something equally modular?
00:25:59.080 | And if you can just grow it,
00:26:00.280 | it becomes more capable probably.
00:26:02.280 | I think that would be the kind of interesting
00:26:04.760 | underlying principle to shoot for
00:26:06.280 | that is not unrealistic.
00:26:08.360 | - Do you think you prefer math
00:26:12.600 | or empirical trial and error
00:26:15.080 | for the discovery of the essence
00:26:16.680 | of what it means to do something intelligent?
00:26:18.840 | So reinforcement learning embodies both groups, right?
00:26:21.960 | To prove that something converges, prove the bounds.
00:26:26.280 | And then at the same time,
00:26:27.720 | a lot of those successes are,
00:26:29.160 | well, let's try this and see if it works.
00:26:31.400 | So which do you gravitate towards?
00:26:33.240 | How do you think of those two parts of your brain?
00:26:35.720 | - So maybe I would prefer
00:26:42.360 | we could make the progress with mathematics.
00:26:45.480 | And the reason maybe I would prefer that
00:26:46.920 | is because often if you have something
00:26:49.000 | you can mathematically formalize,
00:26:51.880 | you can leapfrog a lot of experimentation.
00:26:55.560 | And experimentation takes a long time to get through.
00:26:58.120 | And a lot of trial and error,
00:27:01.160 | kind of like reinforcement learning in your research process,
00:27:03.160 | but you need to do a lot of trial and error
00:27:05.480 | before you get to a success.
00:27:06.520 | So if you can leapfrog that, to my mind,
00:27:08.360 | that's what the math is about.
00:27:09.720 | And hopefully once you do a bunch of experiments,
00:27:13.080 | you start seeing a pattern.
00:27:14.360 | You can do some derivations that leapfrog some experiments.
00:27:17.480 | But I agree with you.
00:27:18.840 | I mean, in practice, a lot of the progress
00:27:20.680 | has been such that we have not been able to find
00:27:23.080 | the math that allows it to leapfrog ahead.
00:27:25.000 | And we are kind of making gradual progress
00:27:27.960 | one step at a time.
00:27:29.320 | A new experiment here, a new experiment there
00:27:31.080 | that gives us new insights and gradually building up,
00:27:34.280 | but not getting to something yet where we're just,
00:27:36.440 | "Okay, here's an equation that now explains how,"
00:27:38.920 | you know, that would be,
00:27:40.440 | have been two years of experimentation to get there,
00:27:42.440 | but this tells us what the result's going to be.
00:27:44.760 | Unfortunately, not so much yet.
00:27:47.080 | - Not so much yet, but your hope is there.
00:27:49.320 | In trying to teach robots or systems
00:27:53.560 | to do everyday tasks or even in simulation,
00:27:58.840 | what do you think you're more excited about?
00:28:01.400 | Imitation learning or self-play?
00:28:04.680 | So letting robots learn from humans
00:28:07.400 | or letting robots plan their own,
00:28:11.240 | try to figure out in their own way
00:28:13.640 | and eventually play, eventually interact with humans
00:28:18.120 | or solve whatever problem is.
00:28:19.960 | What's the more exciting to you?
00:28:21.720 | What's more promising, you think, as a research direction?
00:28:24.280 | - So when we look at self-play,
00:28:31.400 | what's so beautiful about it is,
00:28:34.200 | goes back to kind of the challenges
00:28:36.280 | in reinforcement learning.
00:28:37.160 | So the challenge of reinforcement learning
00:28:38.360 | is getting signal.
00:28:39.160 | And if you never succeed,
00:28:41.880 | you don't get any signal.
00:28:43.160 | In self-play, you're on both sides.
00:28:46.600 | So one of you succeeds.
00:28:47.800 | And the beauty is also one of you fails.
00:28:49.800 | And so you see the contrast.
00:28:51.000 | You see the one version of me
00:28:52.680 | that did better than the other version.
00:28:53.880 | And so every time you play yourself,
00:28:55.880 | you get signal.
00:28:57.080 | And so whenever you can turn something into self-play,
00:28:59.880 | you're in a beautiful situation
00:29:01.960 | where you can naturally learn much more quickly
00:29:04.680 | than in most other reinforcement learning environments.
00:29:07.800 | So I think if somehow we can turn more
00:29:11.720 | reinforcement learning problems
00:29:13.640 | into self-play formulations,
00:29:15.400 | that would go really, really far.
00:29:17.080 | So far, self-play has been largely around games
00:29:20.520 | where there is natural opponents.
00:29:22.600 | But if we could do self-play for other things,
00:29:24.520 | and let's say, I don't know,
00:29:25.400 | a robot learns to build a house.
00:29:26.760 | I mean, that's a pretty advanced thing
00:29:28.280 | to try to do for a robot,
00:29:29.400 | but maybe it tries to build a hut or something.
00:29:31.720 | If that can be done through self-play,
00:29:34.040 | it would learn a lot more quickly
00:29:35.240 | if somebody can figure that out.
00:29:36.360 | And I think that would be something
00:29:37.880 | where it goes closer
00:29:39.320 | to kind of the mathematical leapfrogging
00:29:41.400 | where somebody figures out a formalism to say,
00:29:43.720 | "Okay, any RL problem by playing this and this idea,
00:29:47.000 | "you can turn it into a self-play problem
00:29:48.520 | "where you get signal a lot more easily."
00:29:52.440 | Reality is, many problems
00:29:54.200 | we don't know how to turn into self-play.
00:29:55.880 | And so either we need to provide detailed reward,
00:29:58.760 | that doesn't just reward for achieving a goal,
00:30:00.840 | but rewards for making progress,
00:30:02.680 | and that becomes time-consuming.
00:30:04.440 | And once you're starting to do that,
00:30:05.720 | let's say you want a robot to do something,
00:30:07.000 | you need to give all this detailed reward,
00:30:09.000 | well, why not just give a demonstration?
00:30:10.600 | - Right.
00:30:11.160 | - Because why not just show the robot?
00:30:13.000 | And now the question is, how do you show the robot?
00:30:16.360 | One way to show it is to teleoperate the robot,
00:30:18.440 | and then the robot really experiences things.
00:30:20.840 | And that's nice because that's really high signal
00:30:22.840 | to noise ratio data, and we've done a lot of that.
00:30:24.840 | And you teach your robot skills,
00:30:26.760 | in just 10 minutes, you can teach a robot a new basic skill,
00:30:29.640 | like, "Okay, pick up the bottle, place it somewhere else."
00:30:32.120 | That's a skill, no matter where the bottle starts,
00:30:34.120 | maybe it always goes onto a target or something.
00:30:36.040 | That's fairly easy to teach a robot with teleop.
00:30:39.080 | Now, what's even more interesting,
00:30:42.120 | if you can now teach your robot
00:30:43.160 | through third-person learning,
00:30:44.920 | where the robot watches you do something,
00:30:46.840 | and doesn't experience it, but just watches it and says,
00:30:49.960 | "Okay, well, if you're showing me that,
00:30:51.400 | that means I should be doing this,
00:30:53.640 | and I'm not gonna be using your hand,
00:30:55.160 | because I don't get to control your hand,
00:30:56.840 | but I'm gonna use my hand, I do that mapping."
00:30:59.320 | And so that's where I think one of the big breakthroughs
00:31:01.960 | has happened this year.
00:31:03.160 | This was led by Chelsea Finn here.
00:31:05.160 | It's almost like learning a machine translation
00:31:08.040 | for demonstrations, where you have a human demonstration,
00:31:11.080 | and the robot learns to translate it into
00:31:13.080 | what it means for the robot to do it.
00:31:15.640 | And that was a meta-learning formulation,
00:31:17.400 | learn from one to get the other.
00:31:19.800 | And that, I think, opens up a lot of opportunities
00:31:22.840 | to learn a lot more quickly.
00:31:24.280 | - So my focus is on autonomous vehicles.
00:31:26.360 | Do you think this approach of third-person watching,
00:31:28.920 | the autonomous driving is amenable
00:31:31.800 | to this kind of approach?
00:31:33.000 | - So for autonomous driving, I would say it's,
00:31:37.960 | third-person is slightly easier.
00:31:41.400 | And the reason I'm gonna say it's slightly easier
00:31:43.240 | to do with third-person is because
00:31:46.520 | the car dynamics are very well understood.
00:31:48.760 | So the-- - Easier than
00:31:51.720 | first-person, you mean?
00:31:53.800 | Or easier than-- - I think it's,
00:31:55.560 | so I think the distinction between third-person
00:31:57.400 | and first-person is not a very important distinction
00:32:00.120 | for autonomous driving.
00:32:01.720 | They're very similar, because the distinction is really about
00:32:06.040 | who turns the steering wheel,
00:32:07.720 | or maybe, let me put it differently.
00:32:11.720 | How to get from a point where you are now
00:32:14.680 | to a point, let's say, a couple meters in front of you.
00:32:17.320 | And that's a problem that's very well understood.
00:32:19.080 | And that's the only distinction
00:32:20.120 | between third and first-person there.
00:32:21.640 | Whereas with the robot manipulation,
00:32:23.080 | interaction forces are very complex,
00:32:25.240 | and it's still a very different thing.
00:32:26.680 | For autonomous driving, I think there is still the question,
00:32:31.320 | imitation versus RL.
00:32:33.880 | So imitation gives you a lot more signal.
00:32:36.600 | I think where imitation is lacking
00:32:38.760 | and needs some extra machinery is,
00:32:42.280 | it doesn't, in its normal format,
00:32:45.320 | doesn't think about goals or objectives.
00:32:47.560 | And of course, there are versions of imitation learning,
00:32:50.920 | inverse reinforcement learning type imitation learning,
00:32:52.760 | which also thinks about goals.
00:32:54.440 | I think then we're getting much closer.
00:32:56.840 | But I think it's very hard to think of a
00:32:58.520 | fully reactive car generalizing well,
00:33:03.880 | if it really doesn't have a notion of objectives,
00:33:05.800 | to the kind of generality
00:33:08.520 | that you would want.
00:33:09.400 | You'd want more than just that reactivity
00:33:11.960 | that you get from just behavioral cloning/supervised learning.
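
For contrast with the goal-aware approaches he mentions, here is a minimal sketch of the plain behavioral cloning baseline: supervised learning on (observation, expert action) pairs, with no notion of objectives. The dimensions and the fake demonstration data are illustrative assumptions.

```python
# Hedged sketch of behavioral cloning: imitate expert actions with supervised regression.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 32, 2                     # e.g. assumed features -> (steering, throttle)
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def behavioral_cloning_step(expert_obs, expert_actions):
    """expert_obs: (N, obs_dim), expert_actions: (N, act_dim), taken from demonstrations."""
    loss = F.mse_loss(policy(expert_obs), expert_actions)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Fake demonstration batch just to show the call; real data would come from a human driver.
demo_obs, demo_act = torch.randn(256, obs_dim), torch.randn(256, act_dim)
behavioral_cloning_step(demo_obs, demo_act)
```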
00:33:15.160 | - So a lot of the work,
00:33:18.280 | whether it's self-play or even imitation learning,
00:33:21.880 | would benefit significantly from simulation,
00:33:24.040 | from effective simulation.
00:33:26.360 | And you're doing a lot of stuff in the physical world
00:33:28.120 | and in simulation.
00:33:29.480 | Do you have hope for greater and greater
00:33:32.440 | power of simulation being boundless eventually
00:33:38.200 | to where most of what we need to operate
00:33:40.520 | in the physical world could be simulated
00:33:43.560 | to a degree that's directly transferable
00:33:46.280 | to the physical world?
00:33:47.400 | Or are we still very far away from that?
00:33:49.080 | - So I think we could even rephrase that question
00:33:57.560 | in some sense.
00:33:58.280 | - Please.
00:33:58.920 | - And so the power of simulation, right?
00:34:03.560 | As simulators get better and better,
00:34:06.440 | of course, becomes stronger
00:34:08.840 | and we can learn more in simulation.
00:34:11.080 | But there's also another version,
00:34:12.280 | which is where you say the simulator
00:34:13.560 | doesn't even have to be that precise.
00:34:15.240 | As long as it's somewhat representative
00:34:17.800 | and instead of trying to get one simulator
00:34:20.360 | that is sufficiently precise to learn in
00:34:23.000 | and transfer really well to the real world,
00:34:25.160 | I'm going to build many simulators.
00:34:26.600 | - Ensemble of simulators.
00:34:28.040 | - Ensemble of simulators.
00:34:29.240 | Not any single one of them
00:34:31.960 | is sufficiently representative of the real world
00:34:34.520 | such that it would work if you train in there.
00:34:37.800 | But if you train in all of them,
00:34:39.720 | then there is something that's good in all of them.
00:34:43.400 | The real world will just be another one of them
00:34:47.480 | that's not identical to any one of them,
00:34:49.480 | but just another one of them.
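
A tiny sketch of that ensemble-of-simulators idea (often called domain randomization), with made-up parameter ranges and hypothetical helpers `make_sim` and `agent`: resample the simulator's parameters every episode so the real world is effectively just one more draw from the training distribution.

```python
# Hedged domain-randomization sketch; parameter ranges and helpers are assumptions.
import random

def sample_simulator_params():
    return {
        "friction":   random.uniform(0.5, 1.5),
        "mass_scale": random.uniform(0.8, 1.2),
        "latency_ms": random.uniform(0.0, 40.0),
    }

def train(agent, make_sim, n_episodes):
    for _ in range(n_episodes):
        sim = make_sim(**sample_simulator_params())   # a fresh randomized simulator each episode
        agent.run_episode_and_update(sim)             # any RL algorithm can go here
```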
00:34:50.600 | - Another sample from the distribution of simulators.
00:34:52.920 | - Exactly.
00:34:53.240 | - We do live in a simulation,
00:34:54.680 | so this is just one other one.
00:34:57.400 | - I'm not sure about that, but yeah.
00:34:59.160 | It's definitely a very advanced simulator if it is.
00:35:03.320 | - Yeah, it's a pretty good one.
00:35:04.760 | I've talked to Russell.
00:35:07.320 | It's something you think about a little bit too.
00:35:09.320 | Of course, you're really trying to build these systems,
00:35:11.880 | but do you think about the future of AI?
00:35:13.640 | A lot of people have concern about safety.
00:35:15.560 | How do you think about AI safety
00:35:18.120 | as you build robots that are operating in the physical world?
00:35:20.920 | How do you approach this problem
00:35:24.920 | in an engineering kind of way, in a systematic way?
00:35:27.400 | - So when a robot is doing things,
00:35:32.200 | you kind of have a few notions of safety to worry about.
00:35:36.120 | One is that the robot is physically strong
00:35:39.240 | and of course could do a lot of damage.
00:35:41.480 | Same for cars, which we can think of as robots too in some way.
00:35:45.240 | And this could be completely unintentional.
00:35:48.200 | So it could be not the kind of long-term AI safety concerns
00:35:51.640 | that, okay, AI is smarter than us and now what do we do?
00:35:54.200 | But it could be just very practical.
00:35:55.720 | Okay, this robot, if it makes a mistake,
00:35:57.720 | what are the results going to be?
00:36:00.520 | Of course, simulation comes in a lot there
00:36:02.200 | to test in simulation.
00:36:04.280 | It's a difficult question.
00:36:05.560 | And I'm always wondering, like I always wonder,
00:36:07.400 | let's say you look at, let's go back to driving
00:36:09.960 | 'cause a lot of people know driving well, of course.
00:36:12.200 | What do we do to test somebody for driving, right?
00:36:16.600 | To get a driver's license?
00:36:18.440 | What do they really do?
00:36:19.400 | I mean, you fill out some tests and then you drive.
00:36:24.840 | And I mean, in suburban California,
00:36:27.640 | that driving test is just you drive around the block,
00:36:31.080 | pull over, you do a stop sign successfully
00:36:34.600 | and then you pull over again and you're pretty much done.
00:36:37.560 | And you're like, okay, if a self-driving car did that,
00:36:42.680 | would you trust it that it can drive?
00:36:45.080 | And I'd be like, no, that's not enough for me to trust it.
00:36:47.240 | But somehow for humans, we've figured out
00:36:49.800 | that somebody being able to do that is representative
00:36:53.160 | of them being able to do a lot of other things.
00:36:56.280 | And so I think somehow for humans,
00:36:58.360 | we've figured out representative tests of what it means
00:37:02.040 | if you can do this, what you can really do.
00:37:04.120 | Of course, testing humans,
00:37:05.720 | humans don't wanna be tested at all times.
00:37:07.400 | Self-driving cars or robots
00:37:08.520 | could be tested more often probably.
00:37:10.200 | You can have replicas that get tested
00:37:11.640 | and they're known to be identical
00:37:13.080 | 'cause they use the same neural net and so forth.
00:37:15.400 | But still, I feel like we don't have this kind of unit tests
00:37:19.480 | or proper tests for robots.
00:37:22.680 | And I think there's something very interesting
00:37:23.800 | to be thought about there,
00:37:25.080 | especially as you update things.
00:37:26.680 | Your software improves,
00:37:28.120 | you have a better self-driving car suite, you update it.
00:37:31.000 | How do you know it's indeed more capable on everything
00:37:34.600 | than what you had before
00:37:35.960 | that you didn't have any bad things creep into it?
00:37:40.120 | So I think that's a very interesting direction of research
00:37:42.120 | that there is no real solution yet,
00:37:44.920 | except that somehow for humans we do.
00:37:46.520 | 'Cause we say, okay, you have a driving test, you passed,
00:37:49.400 | you can go on the road now.
00:37:50.520 | And humans have accidents every like million
00:37:53.400 | or 10 million miles, something pretty phenomenal.
00:37:55.640 | Compared to that short test that is being done.
00:38:01.560 | - So let me ask, you've mentioned that Andrew Ng,
00:38:05.240 | by example, showed you the value of kindness.
00:38:07.560 | Do you think the space of policies,
00:38:14.440 | good policies for humans and for AI
00:38:16.680 | is populated by policies that,
00:38:20.120 | with kindness or ones that are the opposite?
00:38:25.720 | Exploitation, even evil.
00:38:28.040 | So if you just look at the sea of policies
00:38:30.120 | we operate under as human beings,
00:38:32.440 | or if AI system had to operate in this real world,
00:38:35.080 | do you think it's really easy to find policies
00:38:37.880 | that are full of kindness,
00:38:39.400 | like would naturally fall into them?
00:38:41.160 | Or is it like a very hard optimization problem?
00:38:44.440 | - I mean, there is kind of two optimizations happening
00:38:50.680 | for humans, right?
00:38:52.120 | So for humans, there's kind of
00:38:53.080 | the very long-term optimization,
00:38:54.600 | which evolution has done for us.
00:38:56.680 | And we're kind of predisposed to like certain things.
00:39:00.520 | And that's in some sense what makes our learning easier
00:39:02.600 | because I mean, we know things like pain
00:39:04.600 | and hunger and thirst.
00:39:08.200 | And the fact that we know about those
00:39:09.960 | is not something that we were taught.
00:39:11.640 | That's kind of innate.
00:39:12.520 | When we're hungry, we're unhappy.
00:39:13.800 | When we're thirsty, we're unhappy.
00:39:15.160 | When we have pain, we're unhappy.
00:39:18.280 | And ultimately evolution built that into us
00:39:21.560 | to think about those things.
00:39:22.520 | And so I think there is a notion that
00:39:24.520 | it seems somehow humans evolved in general
00:39:27.320 | to prefer to get along in some ways,
00:39:32.120 | but at the same time, also to be very territorial
00:39:36.280 | and kind of centric to their own tribe.
00:39:39.720 | It seems like that's the kind of space
00:39:43.480 | we converged onto.
00:39:44.600 | I mean, I'm not an expert in anthropology,
00:39:46.520 | but it seems like we're very kind of
00:39:47.960 | good within our own tribe,
00:39:50.200 | but we need to be taught
00:39:52.440 | to be nice to other tribes.
00:39:54.360 | - Well, if you look at Steven Pinker,
00:39:56.200 | he highlights this pretty nicely
00:39:57.800 | in "Better Angels of Our Nature"
00:40:02.200 | where he talks about violence
00:40:03.480 | decreasing over time consistently.
00:40:05.480 | So whatever tension, whatever teams we pick,
00:40:08.200 | it seems that the long arc of history
00:40:11.000 | goes towards us getting along more and more.
00:40:13.720 | - I hope so.
00:40:14.760 | (laughing)
00:40:16.200 | - So do you think that,
00:40:17.880 | do you think it's possible to teach RL
00:40:22.200 | based robots this kind of kindness,
00:40:26.040 | this kind of ability to interact with humans,
00:40:28.280 | this kind of policy,
00:40:29.560 | even to, let me ask a fun one.
00:40:32.120 | Do you think it's possible to teach an RL-based robot
00:40:35.000 | to love a human being
00:40:36.200 | and to inspire that human to love the robot back?
00:40:39.960 | So, like, an RL-based algorithm
00:40:43.720 | that leads to a happy marriage.
00:40:45.880 | - That's an interesting question.
00:40:48.760 | Maybe I'll answer it with another question, right?
00:40:52.520 | (laughing)
00:40:54.120 | 'Cause I mean, but I'll come back to it.
00:40:56.520 | So another question you can have is,
00:40:58.040 | okay, I mean, how close does some people's happiness get
00:41:02.840 | from interacting with just a really nice dog?
00:41:07.480 | Like, I mean, dogs, you come home,
00:41:09.720 | that's what dogs do.
00:41:10.520 | They greet you, they're excited.
00:41:11.960 | It makes you happy when you come home to your dog.
00:41:14.520 | You're just like, okay, this is exciting.
00:41:16.280 | They're always happy when I'm here.
00:41:18.120 | I mean, if they don't greet you,
00:41:19.640 | 'cause maybe whatever,
00:41:20.520 | your partner took him on a trip or something,
00:41:22.920 | you might not be nearly as happy when you get home, right?
00:41:25.960 | And so the kind of,
00:41:27.560 | it seems like the level of reasoning a dog has
00:41:31.000 | is pretty sophisticated,
00:41:32.040 | but then it's still not yet at the level of human reasoning.
00:41:35.480 | And so it seems like we don't even need to achieve
00:41:37.640 | human level reasoning to get like very strong affection
00:41:40.360 | with humans.
00:41:41.560 | And so my thinking is why not, right?
00:41:44.280 | Why couldn't with an AI,
00:41:45.480 | couldn't we achieve the kind of level of affection
00:41:48.920 | that humans feel among each other
00:41:51.960 | or with friendly animals and so forth?
00:41:55.800 | It's a question, is it a good thing for us or not?
00:41:59.560 | That's another thing, right?
00:42:01.240 | Because I mean,
00:42:02.040 | but I don't see why not.
00:42:05.720 | - Why not, yeah.
00:42:07.000 | So Elon Musk says love is the answer.
00:42:08.920 | Maybe he should say love is the objective function
00:42:12.520 | and then RL is the answer, right?
00:42:14.360 | (both laughing)
00:42:15.640 | - Well, maybe.
00:42:16.200 | (both laughing)
00:42:17.480 | - Oh, Peter, thank you so much.
00:42:18.760 | I don't wanna take up more of your time.
00:42:20.120 | Thank you so much for talking today.
00:42:21.400 | - Well, thanks for coming by.
00:42:23.320 | Great to have you visit.
00:42:24.680 | (upbeat music)