Sergey Levine: Robotics and Machine Learning | Lex Fridman Podcast #108
Chapters
0:00 Introduction
3:05 State-of-the-art robots vs humans
16:13 Robotics may help us understand intelligence
22:49 End-to-end learning in robotics
27:01 Canonical problem in robotics
31:44 Commonsense reasoning in robotics
34:41 Can we solve robotics through learning?
44:55 What is reinforcement learning?
66:36 Tesla Autopilot
68:15 Simulation in reinforcement learning
73:46 Can we learn gravity from data?
76:03 Self-play
77:39 Reward functions
87:01 Bitter lesson by Rich Sutton
92:13 Advice for students interested in AI
93:55 Meaning of life
00:00:00.000 |
The following is a conversation with Sergey Levine, 00:00:03.300 |
a professor at Berkeley and a world-class researcher 00:00:12.540 |
for end-to-end training of neural network policies 00:00:17.540 |
scalable algorithms for inverse reinforcement learning 00:00:30.280 |
by downloading Cash App and using code LexPodcast 00:00:45.300 |
If you enjoy this thing, subscribe on YouTube, 00:01:18.300 |
Since Cash App does fractional share trading, 00:01:20.660 |
let me mention that the order execution algorithm 00:01:23.540 |
that works behind the scenes to create the abstraction 00:01:26.820 |
of the fractional orders is an algorithmic marvel. 00:01:32.320 |
for taking a step up to the next layer of abstraction 00:01:35.840 |
making trading more accessible for new investors 00:01:41.640 |
So again, if you get Cash App from the App Store 00:01:48.120 |
you get $10 and Cash App will also donate $10 to FIRST, 00:01:52.720 |
an organization that is helping to advance robotics 00:01:55.320 |
and STEM education for young people around the world. 00:02:08.320 |
to support this podcast and to get an extra three months free 00:02:18.640 |
I think ExpressVPN is the best VPN out there. 00:02:23.040 |
but it happens to be true in my humble opinion. 00:02:36.560 |
It's really important that they don't log your data. 00:02:40.080 |
It works on Linux and every other operating system, 00:02:43.180 |
but Linux of course is the best operating system. 00:02:46.640 |
Shout out to my favorite flavor, Ubuntu MATE 20.04. 00:02:54.560 |
to support this podcast and to get an extra three months free 00:03:00.800 |
And now here's my conversation with Sergey Levine. 00:03:05.300 |
What's the difference between a state of the art human, 00:03:09.920 |
well, I don't know if we qualify as state of the art humans, 00:03:11.880 |
but a state of the art human and a state of the art robot? 00:03:22.360 |
I think it's a very tricky thing to understand 00:03:25.320 |
because there are some things that are difficult 00:03:34.280 |
between capabilities of robots in terms of hardware 00:03:42.760 |
There is a little video that I think robotics researchers 00:03:47.280 |
special robotics learning researchers like myself 00:03:52.160 |
which demonstrates a prototype robot called the PR1. 00:03:59.200 |
And there's this beautiful video showing the PR1 00:04:23.880 |
Now, obviously, like human bodies are sophisticated 00:04:28.200 |
but on the whole, if we're willing to like spend 00:04:32.560 |
we can kind of close the hardware gap almost. 00:04:35.520 |
But the intelligence gap, that one is very wide. 00:04:41.280 |
you're referring to the physical sort of the actuators, 00:04:45.040 |
as opposed to the hardware on which the cognition, 00:04:50.760 |
I'm referring to the body rather than the mind. 00:04:53.320 |
So that means that kind of the work is cut out for us. 00:04:56.640 |
Like while we can still make the body better, 00:04:59.000 |
we kind of know that the big bottleneck right now 00:05:16.800 |
- The gap is very large and the gap becomes larger 00:05:20.600 |
the more unexpected events can happen in the world. 00:05:24.560 |
So essentially the spectrum along which you can measure 00:05:32.160 |
If you control everything in the world very tightly, 00:05:47.200 |
But as soon as anything starts to vary in the environment, 00:05:50.360 |
now it'll trip up and if many, many things vary 00:05:52.600 |
like they would like in your kitchen, for example, 00:06:01.960 |
but how much on the human side of the cognitive abilities 00:06:18.520 |
from sort of scratch from the day we're born? 00:06:23.280 |
as asking about the implications of this for AI. 00:06:28.720 |
I can't really speak authoritatively about it. 00:06:40.440 |
well, first of course, biology is very messy. 00:06:44.960 |
And if you ask the question, how does a person do something 00:06:53.080 |
and oftentimes you can find support for many different, 00:07:05.440 |
So maybe a person is from birth very, very good 00:07:09.840 |
at some things like, for example, recognizing faces. 00:07:12.000 |
There's a very strong evolutionary pressure to do that. 00:07:25.240 |
the minimal sufficient thing is we could, for example, 00:07:30.600 |
that evolution couldn't have prepared them for. 00:07:33.800 |
Our daily lives actually do this to us all the time. 00:07:42.520 |
that we can find ourselves in and we do very well there. 00:07:45.720 |
Like I can give you a joystick to control a robotic arm, 00:07:50.720 |
and you might be pretty bad for the first couple of seconds. 00:07:54.600 |
on using this robotic arm to like open this door, 00:07:59.480 |
Even though you've never seen this device before, 00:08:11.200 |
And that's exactly where our current robotic systems 00:08:33.360 |
to introspect all the knowledge I have about the world. 00:08:36.720 |
But it seems like there might be an iceberg underneath 00:08:40.360 |
of the amount of knowledge we actually bring to the table. 00:08:44.320 |
- I think there's absolutely an iceberg of knowledge 00:08:47.940 |
but I think it's very likely that iceberg of knowledge 00:08:53.840 |
Because we have a lot of prior experience to draw on 00:08:58.400 |
and it kind of makes sense that the right way 00:09:25.880 |
like a common sense understanding of the world. 00:09:29.480 |
it's not because something about machine learning itself 00:09:40.960 |
Kind of the input output X's go to Y's sort of model. 00:09:46.440 |
is to view the world more as like a massive experience 00:09:51.340 |
that is not necessarily providing any rigid supervision, 00:09:58.240 |
into some sort of common sense understanding. 00:10:03.040 |
Well, you're painting an optimistic, beautiful picture, 00:10:12.360 |
figure out how we can get access to more and more data 00:10:16.320 |
for those learning algorithms to extract signal from, 00:10:19.080 |
and then accumulate that iceberg of knowledge. 00:10:27.860 |
And this is where we perhaps reach the limits 00:10:32.760 |
But one thing that I think that the research community 00:10:38.040 |
is how much it matters where that experience comes from. 00:10:41.680 |
Like, do you just like download everything on the internet 00:10:44.920 |
and cram it into essentially the 21st century analog 00:10:49.000 |
of the giant language model and then see what happens? 00:10:52.560 |
Or does it actually matter whether your machine 00:10:56.680 |
or in the sense that it actually attempts things, 00:11:01.440 |
and kind of augments its experience that way? 00:11:05.880 |
it gets to interact with and observe and learn from. 00:11:10.360 |
- Right, it may be that the world is so complex 00:11:12.760 |
that simply obtaining a large mass of sort of IID samples 00:11:20.840 |
But if you are actually interacting with the world 00:11:24.000 |
and essentially performing this sort of hard negative mining 00:11:32.200 |
and augmenting your understanding using that experience, 00:11:35.680 |
and you're just doing this continually for many years, 00:11:50.480 |
or lack of common sense is often characterized 00:11:58.480 |
here I'm this bottle of water sitting on the table, 00:12:01.860 |
everything is fine if I were to knock it over, 00:12:07.400 |
And I know that nothing good would happen from that, 00:12:10.280 |
but if I have a bad understanding of the world, 00:12:15.940 |
If I actually go about my daily life doing the things 00:12:20.720 |
that my current understanding of the world suggests 00:12:24.320 |
in some ways I'll get exactly the right supervision 00:12:56.320 |
Can we do pretty well by reading all of Wikipedia, 00:12:59.480 |
sort of randomly sampling it like language models do, 00:13:11.040 |
- So I think this is first an open scientific problem, 00:13:31.520 |
So perhaps it's okay if you spend part of your day 00:13:37.200 |
visiting interesting regions of your state space, 00:13:43.080 |
make sure that you really try out the solutions 00:13:46.860 |
that your current model of the world suggests 00:13:48.900 |
might be effective, and observe whether those solutions 00:13:56.260 |
to have kind of a perpetual improvement loop. 00:13:59.560 |
Like this perpetual improvement loop is really like, 00:14:05.360 |
the best current methods from the best methods 00:14:19.280 |
So you kind of mentioned there's an optimization problem, 00:14:21.420 |
you kind of explore the specifics of a particular strategy, 00:14:27.680 |
How important is it to explore totally outside 00:14:30.880 |
of the strategies that have been working for you so far? 00:14:35.160 |
- Yeah, I think it's a very problem dependent 00:14:51.800 |
and some of the sort of more open-ended reformulations 00:14:56.320 |
of that problem that have been explored in recent years. 00:15:00.280 |
is framed as a problem of maximizing utility, 00:15:07.680 |
But a very interesting kind of way to look at, 00:15:26.560 |
And that might suggest a somewhat different solution. 00:15:28.840 |
So if you don't know what you're gonna be tasked with doing 00:15:31.160 |
and you just wanna prepare yourself optimally 00:15:35.320 |
maybe then you will choose to attain some sort of coverage, 00:15:39.360 |
build up sort of an arsenal of cognitive tools, if you will, 00:15:47.080 |
you will be well-prepared to undertake that task. 00:15:56.860 |
the general intelligence kind of formulation. 00:16:04.480 |
I don't think that's by any means the mainstream 00:16:19.000 |
You actually kind of painted two pictures here, 00:16:21.080 |
one of sort of the narrow, one of the general. 00:16:23.200 |
What, in your view, is the big problem of robotics? 00:16:26.500 |
Again, ridiculously philosophical, high-level questions. 00:16:29.600 |
- I think that, you know, maybe there are two ways 00:16:40.800 |
what would sort of maximize the usefulness of robots? 00:16:43.720 |
And there the answer might be something like a system 00:16:47.420 |
where a system that can perform whatever task 00:16:59.500 |
If you tell it to teleport to another planet, 00:17:05.500 |
then potentially with a little bit of additional training 00:17:08.400 |
or a little bit of additional trial and error, 00:17:11.900 |
in much the same way as like a human teleoperator 00:17:14.300 |
ought to figure out how to drive the robot to do that. 00:17:36.020 |
in the world of robotics, but more the other way around, 00:17:40.700 |
to help us understand artificial intelligence. 00:17:43.300 |
- So your dream fundamentally is to understand intelligence. 00:17:47.860 |
- Yes, I think that's the dream for many people 00:17:53.220 |
I think that there's something very pragmatic 00:17:58.540 |
but I do think that a lot of people that go into this field, 00:18:01.140 |
actually, you know, the things that they draw inspiration 00:18:06.860 |
to help us learn about intelligence and about ourselves. 00:18:10.620 |
- So that's fascinating, that robotics is basically 00:18:15.220 |
the space by which you can get closer to understanding 00:18:20.580 |
So what is it about robotics that's different 00:18:25.360 |
So if we look at some of the early breakthroughs 00:18:27.860 |
in deep learning or in the computer vision space 00:18:36.300 |
and thereby came up with a lot of brilliant ideas. 00:18:39.900 |
between computer vision, purely defined an image net 00:18:50.200 |
you kind of have to take away many of the crutches. 00:18:55.340 |
So you have to deal with both the particular problems 00:19:01.700 |
but you also have to deal with the integration 00:19:04.220 |
And classically, we've always thought of the integration 00:19:08.700 |
So a classic kind of modular engineering approach 00:19:12.860 |
and wire them together, and then the whole thing works. 00:19:22.100 |
might lead to just like very different solutions 00:19:24.300 |
than if we were to study the parts and wire them together. 00:19:26.540 |
So the integrative nature of robotics research 00:19:29.820 |
helps us see the different perspectives on the problem. 00:19:34.060 |
Another part of the answer is that with robotics, 00:19:37.820 |
it casts a certain paradox into very sharp relief. 00:19:41.420 |
So this is sometimes referred to as Moravec's paradox, 00:19:50.700 |
can be very easy for machines and vice versa. 00:19:54.780 |
So, you know, integral and differential calculus 00:20:03.660 |
it can derive derivatives and integrals for you 00:20:16.340 |
And sometimes when we see such blatant discrepancies, 00:20:23.040 |
So if we really try to zero in on those discrepancies, 00:20:25.620 |
we might find that little bit that we're missing. 00:20:27.900 |
And it's not that we need to make machines better 00:20:30.460 |
or worse at math and better at drinking water, 00:20:33.040 |
but just that by studying those discrepancies, 00:20:49.420 |
So the Hans Moravec paradox is probably referring 00:20:59.220 |
all the kind of stuff we do in the physical world. 00:21:05.800 |
if you were to try to disentangle the Moravec paradox, 00:21:10.800 |
like why is there such a gap in our intuition about it? 00:21:17.640 |
Why do you think manipulating objects is so hard 00:21:22.520 |
from applying reinforcement learning in this space? 00:21:25.660 |
- Yeah, I think that one reason is maybe that 00:21:31.240 |
for many of the other problems that we've studied 00:21:51.540 |
to cast it as a very tightly supervised problem. 00:22:00.460 |
You can do it, it just doesn't seem to work all that well. 00:22:06.040 |
where we know exactly which motor commands to send 00:22:11.220 |
that's not actually like such a great solution. 00:22:13.560 |
And it also doesn't seem to be even remotely similar 00:22:20.360 |
here's how you fire your muscles in order to walk. 00:22:29.640 |
- And that's what you mean by tightly coupled, 00:22:33.760 |
gets a supervised signal of whether it's a good one or not. 00:22:39.100 |
you could sort of imagine up to a level of abstraction 00:22:48.120 |
- If we look at sort of the sub spaces of robotics, 00:22:58.040 |
and we get to see how this beautiful mess interplays. 00:23:06.280 |
broadly speaking, understanding the environment. 00:23:14.480 |
Then there's prediction in trying to anticipate 00:23:20.580 |
in order for you to be able to act in that world. 00:23:24.360 |
And then there's also this game theoretic aspect 00:23:28.120 |
of how your actions will change the behavior of others. 00:23:36.200 |
and this is bigger than reinforcement learning, 00:23:38.120 |
this is just broadly looking at the problem in robotics. 00:23:46.280 |
that when you start to look at all of them together, 00:23:54.320 |
Like you can't even say which one individually is harder 00:23:58.800 |
you should only be looking at them all together. 00:24:01.480 |
- I think when you look at them all together, 00:24:05.160 |
And I think that's actually pretty important. 00:24:15.520 |
reinforcement learning for robotic manipulation skills 00:24:20.660 |
that seemed a little inflammatory and controversial 00:24:29.440 |
the point that we were actually trying to make in that work 00:24:36.060 |
you could actually do better if you treat them together 00:24:39.580 |
And the way that we tried to demonstrate this 00:24:41.420 |
is we picked a fairly simple motor control task 00:24:43.640 |
where a robot had to insert a little red trapezoid 00:25:05.600 |
essentially the pressure on the perception part 00:25:18.680 |
because vertically it just pushes it down all the way 00:25:21.960 |
And their perceptual errors are a lot less harmful, 00:25:24.560 |
whereas perpendicular to the direction of motion, 00:25:28.940 |
So the point is that if you combine these two things, 00:25:32.120 |
you can trade off errors between the components 00:25:39.740 |
while still leading to better overall performance. 00:25:43.920 |
I mean, in the space of pegs and things like that, 00:25:51.160 |
but that seems to be at least intuitively an idea 00:25:54.960 |
that should generalize to basically all aspects 00:26:05.160 |
sort of perceptual heuristics in humans and animals 00:26:10.760 |
of this is something called the gaze heuristic, 00:26:17.200 |
So if you want to catch a ball, for instance, 00:26:24.080 |
solve a complex system of differential equations 00:26:31.400 |
so that the object stays in the same position 00:26:47.040 |
to figure out if they're about to collide with somebody. 00:26:49.080 |
Frogs use this to catch insects and so on and so on. 00:26:51.480 |
So this is something that actually happens in nature. 00:26:56.760 |
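[A minimal sketch of the gaze heuristic described here, written as a simple feedback rule; it is not from the conversation itself, and the point-mass kinematics, gain, and function names are illustrative assumptions.]

```python
import math

def gaze_heuristic_step(pos, heading, target, prev_bearing, dt=0.1, speed=1.0, gain=3.0):
    """One step of the gaze heuristic: steer so the bearing to the target stays constant."""
    # Line-of-sight (bearing) angle from the pursuer to the target.
    bearing = math.atan2(target[1] - pos[1], target[0] - pos[0])
    bearing_rate = (bearing - prev_bearing) / dt
    # No differential equations for the target's motion: just null out bearing drift.
    heading += gain * bearing_rate * dt
    new_pos = (pos[0] + speed * math.cos(heading) * dt,
               pos[1] + speed * math.sin(heading) * dt)
    return new_pos, heading, bearing
```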
just because all the scientists were able to identify 00:27:14.040 |
when you're thinking about some of these problems? 00:27:21.360 |
at least the robotics community has converged towards that 00:27:42.880 |
- I don't think I have like a really great answer 00:27:46.600 |
And I think partly the reason I don't have a great answer 00:27:51.120 |
it has to do with the fact that the difficulty 00:27:54.880 |
is really in the flexibility and adaptability 00:27:57.400 |
rather than in doing a particular thing really, really well. 00:28:12.760 |
It's really the ability to quickly figure out 00:28:16.840 |
how to do some arbitrary new thing well enough 00:28:21.840 |
to like, you know, to move on to the next arbitrary thing. 00:28:43.160 |
so if you had asked me this question around like 2016, 00:28:46.080 |
maybe I would have probably said that robotic grasping 00:28:50.840 |
because it's a task with great real world utility. 00:28:54.200 |
Like you will get a lot of money if you can do it well. 00:29:03.120 |
So you will get a lot of money if you do it well 00:29:04.400 |
because lots of people want to run warehouses with robots. 00:29:12.360 |
will require very different grasping strategies. 00:29:16.880 |
people have gotten really good at building systems 00:29:21.120 |
It's to the point where I'm not actually sure 00:29:29.400 |
But it's kind of interesting to see the kind of methods 00:29:41.280 |
So people who have studied the history of computer vision 00:29:53.560 |
people thought of it as an inverse physics problem. 00:29:56.000 |
Essentially, you look at what's in front of you, 00:29:59.880 |
then use your best estimate of the laws of physics 00:30:11.320 |
including our own, but also ones from many other labs, 00:30:16.760 |
with some combination of either exhaustive simulation 00:30:25.200 |
solving geometry problems or physics problems. 00:30:27.480 |
- So what are, just by the way, in the grasping, 00:30:32.360 |
what are the difficulties that have been worked on? 00:30:42.400 |
why is picking stuff up such a difficult problem? 00:30:51.560 |
or the variety of things that you have to deal with 00:30:54.520 |
And oftentimes things that work for one class of objects 00:31:00.240 |
So if you get really good at picking up boxes 00:31:06.160 |
you just need to employ a very different strategy. 00:31:15.280 |
It has to do with the bits that are easier to pick up, 00:31:20.840 |
the bits that will cause the thing to pivot and bend 00:31:24.960 |
versus the bits that result in a nice secure grasp, 00:31:29.120 |
things that if you pick them up the wrong way, 00:31:30.560 |
they'll fall upside down and the contents will spill out. 00:31:33.720 |
So there's all these little details that come up, 00:31:46.960 |
there creeps in this notion that starts to sound 00:31:52.980 |
Do you think solving the general problem of robotics 00:32:07.520 |
like you said, be robust and deal with uncertainty, 00:32:13.360 |
and assimilate different pieces of knowledge that you have? 00:32:32.320 |
it's the other way around is that studying robotics 00:32:36.000 |
can help us understand how to put common sense 00:32:43.000 |
and why our current systems might lack common sense, 00:32:47.080 |
is an emergent property of actually having to interact 00:32:51.620 |
with a particular world, a particular universe, 00:33:13.900 |
like give it a picture of a person wearing a fur coat 00:33:18.460 |
But I think what's really happening in those settings 00:33:20.720 |
is that the system doesn't actually live in our world, 00:33:24.120 |
it lives in its own world that consists of pixels 00:33:41.160 |
And if we build AI systems that are forced to deal 00:33:43.200 |
with all of the messiness and complexity of our universe, 00:33:52.880 |
don't have to do that, they can take some shortcut. 00:33:57.200 |
You've a couple times already sort of reframed 00:34:03.880 |
I don't know if my way of thinking is common, 00:34:36.660 |
That's really interesting way to think about it. 00:34:59.080 |
- I think that in terms of the spirit of the question, 00:35:11.120 |
Like, I think that in some ways when we build algorithms, 00:35:28.560 |
that you're getting at, I do think the answer is yes. 00:35:34.240 |
that have previously required meticulous manual engineering 00:35:39.920 |
And actually one thing I will say on this topic is 00:35:42.200 |
I don't think this is actually a very radical 00:35:49.840 |
as a way to do control for a very, very long time. 00:35:53.240 |
And in some ways what's changed is really more the name. 00:35:57.880 |
So today we would say that, oh, my robot does 00:36:01.760 |
machine learning, it does reinforcement learning. 00:36:39.360 |
So this feels like there's an accumulation of knowledge 00:36:44.800 |
that one big difference between learning-based systems 00:36:52.160 |
should get better and better the more they do something. 00:36:58.040 |
- So if we look back at the world of expert systems 00:37:10.960 |
do you think that will have a role at some point? 00:37:14.760 |
Deep learning, machine learning, reinforcement learning 00:37:17.080 |
has shown incredible results and breakthroughs 00:37:21.240 |
and just inspired thousands, maybe millions of researchers. 00:37:30.600 |
but it used to be popular idea of symbolic AI. 00:37:44.640 |
So this is the highly biased history from my perspective. 00:38:00.720 |
like what action do I take in order for X to be true? 00:38:05.600 |
your logical symbolic representation to get an answer. 00:38:08.360 |
What that turned into somewhere in the 1990s is, 00:38:14.280 |
and statements that have true or false values, 00:38:30.520 |
just probabilistic logical inference systems. 00:38:32.800 |
And then people said, well, let's actually learn 00:38:35.560 |
the individual probabilities inside these models. 00:38:40.680 |
let's not even specify the nodes in the models. 00:38:48.800 |
It's essentially instantiating rational decision-making 00:38:53.600 |
and learning by means of an optimization process. 00:38:56.680 |
So in a sense, I would say, yes, it has a place. 00:39:06.680 |
it looks slightly different than it was before. 00:39:28.840 |
that just happens to be expressed by a neural net, 00:39:35.400 |
to figure out the answer to a query that you have. 00:39:48.040 |
is you can follow the reasoning of the system. 00:39:50.600 |
That to us mere humans is somehow compelling. 00:39:54.160 |
It's just, I don't know what to make of this fact 00:40:00.440 |
that there's a human desire for intelligent systems 00:40:20.160 |
Like we shouldn't expect that of intelligent systems. 00:40:27.600 |
But if I were to sort of psychoanalyze the researchers 00:40:45.680 |
of learning-based systems will be as explainable 00:40:50.680 |
as the dream was with expert systems, for example? 00:41:06.920 |
Like, why do you want your system to explain itself? 00:41:14.640 |
- But in some ways that's a much bigger problem, actually. 00:41:19.320 |
and then it might screw up in how it explains itself. 00:41:24.920 |
so that it's not actually doing what it was supposed to do. 00:41:30.400 |
is really as a bigger problem of verification and validation 00:41:36.160 |
of which explainability is sort of one component. 00:41:41.160 |
I see explainability, you put it beautifully. 00:41:43.960 |
I think you actually summarize the field of explainability. 00:41:46.800 |
But to me, there's another aspect of explainability 00:41:51.760 |
that has nothing to do with errors or with like, 00:42:08.060 |
It's just that for other intelligence systems 00:42:21.780 |
neural networks are less capable of doing that. 00:42:32.700 |
Maybe one specific story I can tell you about 00:42:40.500 |
who's now a professor at MIT named Jacob Andreas. 00:42:43.320 |
Jacob actually works in natural language processing, 00:42:45.780 |
but he had this idea to do a little bit of work 00:42:49.140 |
and how natural language can basically structure 00:42:55.580 |
And one of the things he did is he set up a model 00:43:03.740 |
but the model reads in a natural language instruction. 00:43:08.820 |
So you tell it like, you know, go to the red house 00:43:11.600 |
and then it's supposed to go to the red house. 00:43:14.960 |
is he treated that sentence not as a command from a person, 00:43:19.540 |
but as a representation of the internal kind of state 00:43:28.540 |
what it would do is it would basically try to think 00:43:33.540 |
attempt to do them and see if they led to the right outcome. 00:43:35.580 |
So it would kind of think out loud, like, you know, 00:43:37.580 |
I'm faced with this new task, what am I gonna do? 00:43:46.740 |
oh, go to the green plant, that's what's working. 00:43:49.500 |
And then you could look at the string that it came up with, 00:43:57.320 |
as internal state and you can start getting some handle 00:44:00.880 |
- And then what I was kind of trying to get to is that 00:44:12.300 |
people who review that story, how much they like it. 00:44:16.620 |
So that, you know, initially that could be a hyper parameter 00:44:23.420 |
but it's an interesting notion of the convincingness 00:44:28.580 |
of the story becoming part of the reward function, 00:44:31.800 |
the objective function of the explainability. 00:44:33.960 |
It's in the world of sort of Twitter and fake news, 00:44:37.500 |
that might be a scary notion that the nature of truth 00:44:42.000 |
may not be as important as the convincingness of the, 00:44:45.020 |
how convinced you are in telling the story around the facts. 00:44:57.040 |
in reinforcement learning, deep reinforcement learning, 00:45:04.500 |
- I think that what reinforcement learning refers to today 00:45:06.940 |
is really just the kind of the modern incarnation 00:45:15.700 |
which is that it's literally learning from reinforcement, 00:45:22.540 |
But really I think the way the term is used today 00:45:24.340 |
is it's used to refer more broadly to learning-based control. 00:45:35.820 |
So is action is the fundamental element there? 00:45:48.240 |
Now, like, it's easier to see that kind of idea 00:45:52.300 |
in the space of maybe games, in the space of robotics. 00:46:00.220 |
Like, where are the limits of the applicability 00:46:07.380 |
is essentially the encapsulation of the AI problem 00:46:12.980 |
So any problem that we would want a machine to do, 00:46:18.220 |
can likely be represented as a decision-making problem. 00:46:20.460 |
Classifying images is a decision-making problem, 00:46:25.060 |
Controlling a chemical plant is a decision-making problem. 00:46:43.820 |
is one of the ways to reach a very broad swath 00:47:00.220 |
as a generalization of supervised machine learning. 00:47:10.140 |
You have the assumption that someone actually told you 00:47:18.220 |
as essentially relaxing some of those assumptions. 00:47:20.340 |
Now that's not always a very productive way to look at it, 00:47:22.100 |
because if you actually have a supervised learning problem, 00:47:24.300 |
you'll probably solve it much more effectively 00:47:25.980 |
by using supervised learning methods, because it's easier. 00:47:41.580 |
the kind of tools we bring to the table today, 00:47:46.500 |
everything will be a reinforcement learning problem, 00:47:48.860 |
just like you said, image classification should be mapped 00:47:58.820 |
Sort of supervised learning has been used very effectively 00:48:06.540 |
Reinforcement learning kind of represents the dream of AI. 00:48:15.220 |
in sort of captivating the imagination of people, 00:48:25.380 |
So my question comes from the more practical sense. 00:48:31.900 |
between the more general reinforcement learning 00:48:34.540 |
and the very specific, yes, it's sequential decision-making 00:48:38.780 |
with one step in the sequence of the supervised learning? 00:48:52.060 |
that we might see closing over the next couple of years, 00:48:54.700 |
is the ability of reinforcement learning algorithms 00:48:57.100 |
to effectively utilize large amounts of prior data. 00:49:00.420 |
So one of the reasons why it's a bit difficult today 00:49:03.300 |
to use reinforcement learning for all the things 00:49:10.220 |
rational decision-making, it's a little bit tough 00:49:13.060 |
to just deploy some policy that does crazy stuff 00:49:21.100 |
a lot of logs of some other policy that you've got, 00:49:23.980 |
and then maybe if you can get a good policy out of that, 00:49:27.620 |
then you deploy it and let it kind of fine tune 00:49:30.500 |
But algorithmically, it's quite difficult to do that. 00:49:33.340 |
So I think that once we figure out how to get 00:49:36.180 |
reinforcement learning to bootstrap effectively 00:49:49.820 |
And I think we're seeing a lot of research right now 00:49:52.260 |
that's bringing us closer and closer to that. 00:49:54.580 |
- Can you maybe paint the picture of the different methods? 00:50:05.220 |
What are the different categories of reinforcement learning? 00:50:08.060 |
- So one way we can think about reinforcement learning 00:50:23.940 |
And you do that, of course, from experience, from data. 00:50:28.300 |
So you build a model that answers these what-if questions, 00:50:31.860 |
use it to figure out the best action you can take, 00:50:35.220 |
and see if the outcome agrees with what you predicted. 00:50:41.700 |
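[The "what if I do X?" loop can be sketched roughly as below; the gym-style `env` interface and the `model.predict`/`model.update` methods are assumptions for illustration, not a particular algorithm from this discussion.]

```python
def what_if_loop(env, model, score_outcome, n_steps=200, n_candidates=32):
    """Sketch: answer what-if questions with a learned model, act, then check the answer."""
    obs = env.reset()
    for _ in range(n_steps):
        # Ask the model "what happens if I take this action?" for a few candidates.
        candidates = [env.action_space.sample() for _ in range(n_candidates)]
        predictions = [model.predict(obs, a) for a in candidates]
        best = max(range(n_candidates), key=lambda i: score_outcome(predictions[i]))
        # Take the best-looking action and see if the outcome agrees with the prediction.
        next_obs, reward, done, info = env.step(candidates[best])
        model.update(obs, candidates[best], next_obs)
        obs = env.reset() if done else next_obs
```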
basically refer to different ways of doing it. 00:50:50.820 |
Value-based methods, they answer the question 00:50:57.060 |
But in a sense, they're not really all that different 00:51:06.340 |
answering what-if questions can be really hard 00:51:15.700 |
You would just repeat the thing that worked before. 00:51:18.900 |
And that's really a big part of why RL is a little bit tough. 00:51:23.340 |
So if you have a purely on-policy online process, 00:51:31.060 |
then you go and try doing those mistaken things, 00:51:35.460 |
that'll teach you not to do those things again. 00:51:39.940 |
and you just want to synthesize the best policy you can 00:51:43.740 |
then you really have to deal with the challenges 00:51:49.900 |
- Yeah, a policy is a model or some kind of function 00:51:54.900 |
that maps from observations of the world to actions. 00:52:06.300 |
So we say the state kind of encompasses everything 00:52:08.020 |
you need to fully define where the world is at at the moment. 00:52:11.100 |
And depending on how we formulate the problem, 00:52:16.940 |
which is some snapshot, some piece of the state. 00:52:29.020 |
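[Concretely, a policy can be as simple as the sketch below: a parameterized function from an observation vector to an action vector. The linear-Gaussian form, dimensions, and noise level are placeholder assumptions, not anything specific to Levine's work.]

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearGaussianPolicy:
    """A policy: a (possibly stochastic) mapping from observations to actions."""

    def __init__(self, obs_dim, act_dim, noise_std=0.1):
        self.W = rng.normal(scale=0.01, size=(act_dim, obs_dim))
        self.b = np.zeros(act_dim)
        self.noise_std = noise_std

    def act(self, obs):
        mean = self.W @ obs + self.b                                 # deterministic part
        return mean + self.noise_std * rng.normal(size=mean.shape)   # exploration noise

policy = LinearGaussianPolicy(obs_dim=4, act_dim=2)
action = policy.act(np.zeros(4))  # e.g. torques for a two-joint arm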
- So yeah, so the terms on-policy and off-policy 00:52:37.220 |
maybe you get your data from some manually programmed system 00:52:46.540 |
But if you got the data by actually acting in the world 00:52:48.980 |
based on what your current policy thinks is good, 00:52:53.260 |
And obviously on-policy data is more useful to you 00:52:55.780 |
because if your current policy makes some bad decisions, 00:52:59.300 |
you will actually see that those decisions are bad. 00:53:01.740 |
Off-policy data, however, might be much easier to obtain 00:53:08.580 |
- So we talk about, offline talked about autonomous vehicles 00:53:12.940 |
so you can envision off-policy kind of approaches 00:53:18.420 |
ton of robots out there, but they don't get the luxury 00:53:38.420 |
people have made a little bit of progress on that. 00:53:44.260 |
but I can tell you some of the things that, for example, 00:53:46.380 |
we've done to try to address some of the challenges. 00:53:57.100 |
to give accurate predictions for any possible action. 00:54:00.140 |
So if I've never tried to, if in my data set, 00:54:10.100 |
is probably not going to predict the right thing 00:54:11.860 |
if I ask what would happen if I were to steer the car 00:54:15.540 |
So one of the important things you have to do 00:54:25.260 |
And you can use kind of distribution estimation methods, 00:54:32.180 |
So you could figure out that, well, this action, 00:54:44.180 |
that will essentially tell you not to ask those questions 00:54:50.620 |
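[One crude way to picture this distribution-estimation idea: fit a density model to the actions in the logged data and penalize candidate actions the data says little about, so the learner avoids "asking those questions." This is only an illustrative sketch, with a single Gaussian over actions, not the specific method used in Levine's research.]

```python
import numpy as np

def conservative_values(q_values, candidate_actions, logged_actions, penalty=10.0):
    """Down-weight Q-values of actions that are unlikely under the logged data."""
    mu = logged_actions.mean(axis=0)
    sigma = logged_actions.std(axis=0) + 1e-6
    # Log-likelihood of each candidate action under a Gaussian fit to the data.
    log_prob = -0.5 * (((candidate_actions - mu) / sigma) ** 2
                       + np.log(2.0 * np.pi * sigma ** 2)).sum(axis=1)
    # Actions far outside the data distribution get a large penalty.
    return q_values + penalty * np.minimum(log_prob - log_prob.max(), 0.0)
```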
- What would lead to breakthroughs in this space, 00:54:57.580 |
Do we need to collect big benchmark data sets 00:55:09.620 |
Or maybe coming together in a space of robotics 00:55:12.260 |
and defining the right problem to be working on. 00:55:15.100 |
- I think for off-policy reinforcement learning 00:55:24.860 |
that that just takes some very smart people to get together 00:55:28.980 |
Whereas if it was like a data problem or hardware problem, 00:55:34.620 |
So that's why I'm pretty excited about that problem 00:55:44.940 |
the problems at their core are very related to problems 00:55:51.420 |
Because what you're really dealing with is situations 00:56:00.180 |
And if it's a model that's generalizing properly, 00:56:04.660 |
If it's a model that picks up on spurious correlations 00:56:08.820 |
And then you have an arsenal of tools you could use. 00:56:15.580 |
you could try to make it generalize better somehow, 00:56:24.940 |
where most of it, like 90, 95% is off policy, 00:56:35.580 |
Like, what's that role of mixing them together? 00:56:39.980 |
I think that this is something that you actually 00:56:43.140 |
described very well at the beginning of our discussion 00:56:51.580 |
You'd use that for off-policy reinforcement learning. 00:56:54.020 |
And then, of course, if you've never, you know, 00:57:00.300 |
then you have to go out and fiddle with it a little bit, 00:57:12.700 |
Or is there, what's, and maybe taking a step back, 00:57:25.620 |
- In general, I actually think that one of the things 00:57:30.660 |
that is a very beautiful idea in reinforcement learning 00:57:32.980 |
is just the idea that you can obtain a near-optimal control 00:57:37.980 |
or a near-optimal policy without actually having 00:57:55.700 |
but from a control's perspective, it's a very weird thing 00:57:58.180 |
because classically, you know, we think about 00:58:03.020 |
engineered systems and controlling engineered systems 00:58:05.660 |
as the problem of writing down some equations 00:58:08.460 |
and then figuring out, given these equations, 00:58:11.780 |
figure out the thing that maximizes its performance. 00:58:18.900 |
actually gives us a mathematically principled framework 00:58:21.340 |
to reason about, you know, optimizing some quantity 00:58:28.740 |
And that, I don't know, to me, that actually seems 00:58:34.020 |
not something that sort of becomes immediately obvious, 00:58:40.060 |
- Does it make sense to you that it works at all? 00:58:42.420 |
- Well, I think it makes sense when you take some time 00:58:46.700 |
to think about it, but it is a little surprising. 00:58:53.060 |
deeper representations, which is also very surprising, 00:59:01.740 |
the space of environments that this kind of approach 00:59:10.220 |
- Well, deep reinforcement learning simply refers 00:59:18.340 |
neural net representations, which is, you know, 00:59:21.460 |
kind of, it might at first seem like a pretty arbitrary 00:59:26.340 |
But the reason that it's something that has become 00:59:29.900 |
so important in recent years is that reinforcement learning, 00:59:35.100 |
it kind of faces an exacerbated version of a problem 00:59:38.020 |
that has faced many other machine learning techniques. 00:59:39.980 |
So if we go back to like, you know, the early 2000s 00:59:46.740 |
on machine learning methods that have some very appealing 00:59:52.380 |
the convex optimization problems, for instance, 01:00:07.580 |
So they have some kind of good representation 01:00:12.420 |
And for a long time, people were very worried 01:00:14.060 |
about features in the world of supervised learning 01:00:15.820 |
because somebody had to actually build those features. 01:00:18.140 |
So you couldn't just take an image and plug it 01:00:19.940 |
into your logistic regression or your SVM or something. 01:00:22.740 |
Someone had to take that image and process it 01:00:29.700 |
And suddenly we could apply learning directly 01:00:32.140 |
to the raw inputs, which was great for images, 01:00:34.780 |
but it was even more great for all the other fields 01:00:37.540 |
where people hadn't come up with good features yet. 01:00:39.860 |
And one of those fields is actually reinforcement learning 01:00:43.300 |
the notion of features, if you don't use neural nets 01:01:00.780 |
Like, I don't even know how to start thinking about it. 01:01:05.380 |
an expert chess player looks for whether the knight 01:01:09.140 |
So that's a feature, is knight in middle of board? 01:01:15.820 |
And that was really kind of getting us nowhere. 01:01:17.420 |
- And that's a little, chess is a little more accessible 01:01:48.260 |
but it feels hopeful that the control problem 01:01:57.700 |
is it surprising to you how far the deep side 01:02:03.220 |
like what the space of problems has been able to tackle 01:02:10.940 |
and AlphaZero and just the representation power there 01:02:21.780 |
of this representation power and the control context? 01:02:26.180 |
- I think that in regard to the limits that here, 01:02:30.100 |
I think that one thing that makes it a little hard 01:02:33.660 |
to fully answer this question is because in settings 01:02:38.660 |
where we would like to push these things to the limit, 01:02:53.580 |
it's not because its neural net is not big enough. 01:02:56.040 |
It's because when you try to actually do trial 01:02:59.700 |
and error learning, reinforcement learning directly 01:03:03.140 |
in the real world, where you have the potential 01:03:05.120 |
to gather these large, highly varied and complex datasets, 01:03:13.780 |
it'll first sound like a very pragmatic problem, 01:03:16.860 |
but it actually turns out to be a pretty deep scientific 01:03:18.540 |
problem, take the robot, put it in your kitchen, 01:03:20.820 |
have it try to learn to do the dishes with trial and error, 01:03:27.060 |
Now you might think this is a very practical issue, 01:03:30.020 |
which is that if you have a person trying to do this, 01:03:32.300 |
a person will have some degree of common sense, 01:03:35.180 |
they'll be a little more careful with the next one. 01:03:38.060 |
they're gonna go and get more or something like that. 01:03:42.900 |
that comes very naturally to us for our learning process, 01:03:46.720 |
like if I have to learn something through trial and error, 01:03:49.780 |
I have the common sense to know that I have to try multiple 01:03:52.660 |
times, if I screw something up, I ask for help, 01:03:57.360 |
And all of that is kind of outside of the classic 01:04:02.020 |
There are other things that can also be categorized 01:04:05.060 |
as kind of scaffolding, but are very important. 01:04:07.300 |
Like for example, where do you get your reward function? 01:04:13.460 |
well, how do I know if I've done it correctly? 01:04:15.300 |
Now that probably requires an entire computer vision system 01:04:24.560 |
what we really need to get reinforcement learning 01:04:28.360 |
And I think that many of these things actually suggest 01:04:30.920 |
a little bit of a shortcoming in the problem formulation 01:04:33.440 |
and a few deeper questions that we have to resolve. 01:04:37.000 |
I talked to like David Silver about AlphaZero, 01:04:47.820 |
in the context when there's no broken dishes. 01:04:54.940 |
So again, like the bottleneck is the amount of money 01:05:00.840 |
and then maybe the different, the scaffolding around 01:05:09.980 |
Now we move to the real world and there's the broken dishes, 01:05:12.540 |
there's all the, and the reward function like you mentioned. 01:05:19.860 |
Do you think, there's this kind of sample efficiency 01:05:38.100 |
How do we, how do we not break too many dishes? 01:05:41.180 |
- Yeah, well, one way we can think about that is that 01:05:44.780 |
maybe we need to be better at reusing our data, 01:06:04.360 |
can just master complex tasks in like in minutes, 01:06:09.780 |
Perhaps what it really needs to do is have an existence, 01:06:17.020 |
prepare it to do new things more efficiently. 01:06:20.040 |
And, you know, the study of these kinds of questions 01:06:22.900 |
typically falls under categories like multitask learning 01:06:25.580 |
or meta learning, but they all fundamentally deal 01:06:28.260 |
with the same general theme, which is use experience 01:06:32.580 |
for doing other things to learn to do new things 01:06:38.900 |
if you just look at one particular case study 01:06:41.220 |
of a Tesla autopilot that has quickly approaching 01:06:47.460 |
where some percentage of the time, 30, 40% of the time 01:07:06.180 |
From the human side, how can we use that data? 01:07:14.100 |
Do you have ideas in this autonomous vehicle space 01:07:17.820 |
You know, it's a safety critical environment. 01:07:23.860 |
- So I think that actually the kind of problems 01:07:28.020 |
that come up when we want systems that are reliable 01:07:36.660 |
they're actually very similar to the kind of problems 01:07:46.140 |
when you can trust the predictions of your model, 01:07:50.940 |
some pattern of behavior for which your model 01:07:54.020 |
then you shouldn't use that to modify your policy. 01:07:57.260 |
And it's actually very similar to the problem 01:07:58.500 |
that we're faced when we actually then deploy that thing 01:08:07.760 |
And that's a very deep research question, of course, 01:08:10.160 |
but it's also a question that a lot of people are working on. 01:08:11.700 |
So I'm pretty optimistic that we can make some progress 01:08:15.760 |
- What's the role of simulation in reinforcement learning, 01:08:18.880 |
deep reinforcement learning, reinforcement learning? 01:08:22.920 |
It's been essential for the breakthroughs so far, 01:08:32.000 |
I mean, again, this connects to our off policy discussion, 01:08:35.220 |
but do you think we can ever get rid of simulation 01:08:38.260 |
or do you think simulation will actually take over? 01:08:40.060 |
We'll create more and more realistic simulations 01:08:42.100 |
that will allow us to solve actual real world problems, 01:08:46.080 |
like transfer the models we learn in simulation 01:08:50.060 |
I think that simulation is a very pragmatic tool 01:08:57.660 |
we will need to build machines that can learn 01:09:04.580 |
Because if we can't have our machines learn from real data, 01:09:09.780 |
eventually the simulator becomes the bottleneck. 01:09:13.500 |
If your machine has any bottleneck that is built by humans 01:09:20.240 |
it will eventually be the thing that holds it back. 01:09:23.100 |
And if you're entirely reliant on your simulator, 01:09:25.940 |
If you're entirely reliant on a manually designed controller, 01:09:32.120 |
It's very pragmatic, but it's not a substitute 01:09:43.660 |
especially in the context of some of the things 01:09:51.860 |
like these are not problems that you would ever stumble on 01:09:54.820 |
when working in a purely simulated kind of environment. 01:09:59.740 |
when we try to actually run these things in the real world. 01:10:03.220 |
- To throw a brief wrench into our discussion, let me ask, 01:10:09.900 |
- Do you think that's a useful thing to even think about, 01:10:12.440 |
about the fundamental physics nature of reality? 01:10:23.300 |
is to think about how difficult is it to create 01:10:29.580 |
sort of a virtual reality game type situation 01:10:32.940 |
that will be sufficiently convincing to us humans, 01:10:36.500 |
or sufficiently enjoyable that we wouldn't wanna leave. 01:10:40.140 |
I mean, that's actually a practical engineering challenge. 01:10:43.420 |
And I personally really enjoy virtual reality, 01:10:46.220 |
but it's quite far away, but I kind of think about 01:10:49.180 |
what would it take for me to wanna spend more time 01:11:06.700 |
where a majority of the population lives in a virtual reality 01:11:09.100 |
and that's how we create the simulation, right? 01:11:11.380 |
You don't need to actually simulate the quantum gravity 01:11:23.260 |
is if we wanna make sufficiently realistic simulations 01:11:31.920 |
thereby just some of the things we've been talking about, 01:11:37.680 |
if we can create actually interesting, rich simulations. 01:11:43.640 |
casts your previous question in a very interesting light, 01:11:49.720 |
well, the more kind of practical version of this, 01:11:53.760 |
like, can we build simulators that are good enough 01:11:56.100 |
to train essentially AI systems that will work in the world? 01:12:00.600 |
And it's kind of interesting to think about this, 01:12:06.300 |
it kind of implies that it's easier to create the universe 01:12:09.980 |
And that seems like, put this way, it seems kind of weird. 01:12:14.300 |
- The aspect of the simulation most interesting to me 01:12:30.240 |
agrees with that notion, just as a quick aside, 01:12:33.600 |
what are your thoughts about when the human enters 01:12:39.800 |
How does that change the reinforcement learning problem, 01:12:45.040 |
- Yeah, I think that's a, it's a kind of a complex question. 01:12:48.560 |
And I guess my hope for a while had been that 01:12:56.880 |
that are multitask, that utilize lots of prior data 01:13:03.120 |
the bit where they have to interact with people 01:13:08.760 |
So if they have prior experience of interacting with people 01:13:13.200 |
of interacting with people for this new task, 01:13:20.520 |
And there's quite a bit of research in that area. 01:13:28.540 |
the ability to understand that other beings in the world 01:13:33.360 |
have their own goals, intentions, and thoughts, and so on, 01:13:36.240 |
whether that kind of understanding can emerge automatically 01:13:40.000 |
from simply learning to do things with and maximize utility. 01:13:49.260 |
that you don't need to explicitly inject anything 01:13:53.480 |
into the system that can be learned from the data. 01:14:00.820 |
What are the limits of what we can learn from data? 01:14:13.340 |
do you really think we can learn gravity from just data? 01:14:19.820 |
- So something that I think is a common kind of pitfall 01:14:23.720 |
when thinking about prior knowledge and learning 01:14:27.040 |
is to assume that just because we know something, 01:14:32.040 |
then that it's better to tell the machine about that 01:14:34.680 |
rather than have it figure it out on its own. 01:14:46.660 |
Like, if things, if every time you drop something, 01:14:49.340 |
it falls down, like, yeah, you might not get the, 01:14:54.220 |
not Einstein's version, but it'll be pretty good. 01:15:03.220 |
So things that are readily apparent from the data, 01:15:10.220 |
- It just feels like that there might be a space 01:15:12.440 |
of many local minima in terms of theories of this world 01:15:21.600 |
- That Newtonian mechanics is not necessarily 01:15:27.600 |
- Yeah, and well, in fact, in some fields of science, 01:15:34.080 |
So for example, if you think about how people 01:15:43.300 |
the kind of principles that serve us very well 01:15:45.680 |
in our day-to-day lives actually serve us very poorly 01:15:50.120 |
We had kind of very superstitious and weird ideas 01:15:57.920 |
So that does seem to be a failing of this approach, 01:16:01.000 |
but it's also a failing of human intelligence, arguably. 01:16:09.080 |
in reinforcement learning, sort of these competitive, 01:16:11.440 |
creating a competitive context in which agents 01:16:19.040 |
and thereby increasing each other's skill level. 01:16:21.040 |
It seems to be this kind of self-improving mechanism 01:16:34.560 |
and also can be generalized to other contexts 01:16:50.440 |
to actually generalizing it to the robotic setting 01:17:07.900 |
they can play with each other, they can play with people, 01:17:17.860 |
You have to interact with a natural environment that. 01:17:24.580 |
So the reason that self-play works for a board game 01:17:33.660 |
So the kind of intelligent behavior that will emerge 01:18:15.540 |
this question has kind of been treated as a non-issue 01:18:19.320 |
that you sort of treat the reward as this external thing 01:18:22.400 |
that comes from some other bit of your biology 01:18:30.140 |
a little bit of a mistake that we should worry about it. 01:18:33.240 |
And we can approach it in a few different ways. 01:18:36.860 |
by thinking of reward as a communication medium. 01:18:39.020 |
We can say, well, how does a person communicate 01:18:50.380 |
kind of a general objective that leads to good capability? 01:18:55.120 |
Like, for example, can you write down some objectives 01:18:56.800 |
such that even in the absence of any other task, 01:19:02.640 |
This is something that has sometimes been called 01:19:07.020 |
which I think is a really fascinating area of research, 01:19:14.840 |
we can have some notion of unsupervised reinforcement 01:19:19.840 |
learning by means of information theoretic quantities, 01:19:23.440 |
like for instance, minimizing a Bayesian measure of surprise. 01:19:34.360 |
that you can actually learn pretty interesting skills 01:19:36.920 |
by essentially behaving in a way that allows you 01:19:40.480 |
to make accurate predictions about the world. 01:19:42.440 |
It seems a little circular, like do the things 01:19:44.200 |
that will lead to you getting the right answer 01:19:48.740 |
But you can, by doing this, you can sort of discover 01:19:53.040 |
You can discover that if you're playing Tetris, 01:19:55.480 |
then correctly clearing the rows will let you play Tetris 01:19:58.880 |
for longer and keep the board nice and clean, 01:20:00.600 |
which sort of satisfies some desire for order in the world. 01:20:08.760 |
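[A tiny sketch of this kind of unsupervised objective: with no task reward at all, the agent is rewarded for landing in states its own dynamics model predicts accurately, i.e., for minimizing surprise. The mean-squared-error form here is an illustrative stand-in for a proper Bayesian surprise measure.]

```python
import numpy as np

def surprise_minimizing_reward(predicted_next_obs, actual_next_obs):
    """Intrinsic reward: negative prediction error of the agent's own world model."""
    surprise = float(np.mean((predicted_next_obs - actual_next_obs) ** 2))
    return -surprise  # higher reward when the world behaved as predicted
```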
- Is there a role for a human notion of curiosity 01:20:12.560 |
in itself being the reward, sort of discovering new things 01:20:18.480 |
- So one of the things that I'm pretty interested in 01:20:27.800 |
of some other objective that quantifies capability. 01:20:33.160 |
maybe it might not by itself be the right answer, 01:20:45.840 |
but I don't have a clear answer for you there yet. 01:20:52.000 |
to see sort of creative patterns of curiosity 01:21:03.880 |
- Is there ways to understand or anticipate unexpected, 01:21:09.800 |
unintended consequences of particular reward functions, 01:21:21.920 |
and try to avoid highly detrimental strategies? 01:21:28.640 |
that has been pretty hard in reinforcement learning 01:21:40.200 |
One way to mitigate it is to actually define an objective 01:21:46.920 |
You can say just like, don't enter situations 01:21:49.360 |
that have low probability under the distribution of states 01:21:54.640 |
It turns out that that's actually one very good way 01:21:56.400 |
to do off-policy reinforcement learning actually. 01:22:01.200 |
- If we slowly venture in speaking about reward functions 01:22:07.000 |
into greater and greater levels of intelligence, 01:22:09.240 |
there's, I mean, Stuart Russell thinks about this, 01:22:18.040 |
So how do we ensure that AGI systems align with us humans? 01:22:34.680 |
with the broader intended success interest of human beings. 01:22:43.360 |
of where reinforcement learning fits into this? 01:22:45.240 |
Or are you really focused on the current moment 01:22:53.120 |
but, you know, and I do think that this is a problem 01:22:59.320 |
For my part, I'm actually a bit more concerned 01:23:01.800 |
about the other side of this equation that, you know, 01:23:21.160 |
when we, for instance, try to use these techniques 01:23:23.960 |
for safety critical systems like cars and aircraft and so on. 01:23:34.480 |
to face the issue of them not being optimized well enough. 01:23:37.000 |
- But you don't think unintended consequences can arise 01:23:49.400 |
for improving reliability, safety, and things like that 01:23:52.840 |
is more with systems that like need to work better, 01:23:56.480 |
that need to optimize their objective better. 01:23:58.920 |
- Do you have thoughts, concerns about existential threats 01:24:04.720 |
Sort of, if we put on our hat of looking in 10, 20, 100, 01:24:15.640 |
- I think there are absolutely existential threats 01:24:28.720 |
will come down to people with nefarious intent. 01:24:38.800 |
And some of them will, of course, come down to AI systems 01:24:51.920 |
and principally the one with nefarious humans, 01:24:55.840 |
actually it's the nefarious humans that have been 01:25:01.360 |
And I think that right now the best that I can do 01:25:09.760 |
to promote responsible use of that technology. 01:25:12.120 |
- Do you think RL systems has something to teach us, humans? 01:25:18.720 |
You said nefarious humans getting us in trouble. 01:25:21.120 |
I mean, machine learning systems have in some ways 01:25:23.840 |
have revealed to us the ethical flaws in our data. 01:25:28.200 |
In that same kind of way, can reinforcement learning 01:25:41.080 |
- I'm not sure what I've learned about myself, 01:25:44.680 |
but maybe part of the answer to your question 01:25:54.520 |
of reinforcement learning for decision-making support 01:26:03.360 |
And I think we will see some interesting stuff emerge there. 01:26:06.680 |
We will see, for instance, what kind of behaviors 01:26:21.720 |
we'll see some interesting stuff come out in that area. 01:26:25.360 |
'cause the exciting space where this could be observed 01:26:28.880 |
is sort of large companies that deal with large data. 01:26:36.720 |
when I look at social networks and just online 01:26:45.080 |
And that'd be interesting from a research perspective 01:26:57.880 |
about the behavior of these AI systems in the real world. 01:27:21.680 |
So basically don't try to do any kind of fancy algorithms, 01:27:31.160 |
I think the high level idea makes a lot of sense. 01:27:50.600 |
the acquisition of experience in the real world 01:27:58.640 |
So if the claim is that automated general methods 01:28:06.440 |
then it makes sense that we should build general methods 01:28:09.840 |
that we can deploy and get them to go out there 01:28:11.560 |
and like collect their experience autonomously. 01:28:23.480 |
which is easy to do in a simulated board game, 01:28:27.720 |
- Yeah, it keeps coming back to this one problem, right? 01:28:30.520 |
So your mind is focused there now in this real world. 01:28:35.760 |
It just seems scary, this step of collecting the data. 01:28:40.480 |
And it seems unclear to me how we can do it effectively. 01:28:45.200 |
- Yeah, well, you know, seven billion people in the world, 01:28:48.280 |
each of them have to do that at some point in their lives. 01:28:54.840 |
We should be able to try to collect that kind of data. 01:29:05.280 |
would book or books, technical or fiction or philosophical, 01:29:10.280 |
had a big impact on the way you saw the world, 01:29:22.120 |
would you recommend people consider reading on their own? 01:29:31.520 |
- I don't know if this is like a scientifically, 01:29:39.280 |
but like the honest answer is that I actually found 01:29:50.840 |
- You don't think it had a ripple effect in your life? 01:29:55.160 |
But yeah, I think that a vision of a future where, 01:30:07.080 |
artificial robotic systems have kind of a big place, 01:30:12.480 |
And where we try to imagine the sort of the limiting case 01:30:17.480 |
of technological advancement and how that might play out 01:30:23.680 |
But yeah, I think that that was in some way influential. 01:30:28.680 |
I don't really know how, but I would recommend it. 01:30:33.040 |
I mean, if nothing else, you'd be well entertained. 01:30:35.440 |
- When did you first yourself like fall in love 01:31:00.760 |
it just wasn't really high on my priority list 01:31:05.640 |
where we're going to see very substantial advances 01:31:14.360 |
the time when I really decided I wanted to work on this 01:31:29.040 |
But one of the things that really resonated with me 01:31:33.640 |
well, he used to have graduate students come to him 01:31:39.280 |
and give them some math problem to deal with. 01:31:41.360 |
But now he's actually thinking that this is an area 01:31:55.320 |
when someone who had been working on that kind of stuff 01:32:18.440 |
in machine learning or reinforcement learning, 01:32:21.120 |
what advice would you give to maybe an undergraduate student 01:32:27.800 |
and further on, what are the steps to take on that journey? 01:32:32.680 |
- So something that I think is important to do 01:32:42.960 |
imagining the kind of outcome that you might like to see. 01:32:46.160 |
So, you know, one outcome might be a successful career, 01:32:50.920 |
or state-of-the-art results on some benchmark, 01:32:54.760 |
that's like the main driving force for somebody. 01:33:03.040 |
like takes a little while, sits down and thinks like, 01:33:11.040 |
what I want to see a natural language system? 01:33:18.200 |
or like something that you'd like to see in the world, 01:33:21.080 |
and then actually sit down and think about the steps 01:33:24.960 |
And hopefully that thing is not a better number 01:33:28.760 |
It's like, it's probably like an actual thing 01:33:30.600 |
that we can't do today that would be really awesome. 01:33:36.120 |
a really awesome healthcare decision-making support system, 01:34:00.480 |
and you just give an advice on looking forward 01:34:04.280 |
what kind of change you would like to make in the world. 01:34:13.280 |
What gives you fulfillment, purpose, happiness, and meaning? 01:34:21.800 |
- What's the reward function under which you're operating? 01:34:27.520 |
- Yeah, I think one thing that does give, you know, 01:34:35.120 |
that I'm working on a problem that really matters. 01:34:41.680 |
but it's quite nice to take things to spend my time on 01:34:54.640 |
but if you're successful, what does that look like? 01:35:01.880 |
Now, of course, success is built on top of success, 01:35:05.620 |
and you keep going forever, but what is the dream? 01:35:15.760 |
is to see machines that actually get better and better 01:35:26.880 |
that we have today, but I think we really don't. 01:35:37.880 |
all of the machines that we've been able to build 01:35:40.160 |
don't sort of improve up to the limit of that complexity. 01:35:45.520 |
Maybe they hit a wall because they're in a simulator 01:35:52.240 |
or they hit a wall because they rely on a labeled dataset, 01:36:02.440 |
that can go as far as possible in that regard. 01:36:09.400 |
- Well, I don't think there's a better way to end it, Sergey. 01:36:16.240 |
that you have to publish in the education space 01:36:27.280 |
with Sergey Levine, and thank you to our sponsors, 01:36:35.520 |
by downloading Cash App and using code LexPodcast 01:36:52.280 |
If you enjoy this thing, subscribe on YouTube, 01:36:57.060 |
support on Patreon, or connect with me on Twitter 01:37:04.160 |
without using the letter E, just F-R-I-D-M-A-N. 01:37:12.520 |
Intelligence without ambition is a bird without wings. 01:37:17.660 |
Thank you for listening, and hope to see you next time.