Sergey Levine: Robotics and Machine Learning | Lex Fridman Podcast #108


Chapters

0:00 Introduction
3:05 State-of-the-art robots vs humans
16:13 Robotics may help us understand intelligence
22:49 End-to-end learning in robotics
27:01 Canonical problem in robotics
31:44 Commonsense reasoning in robotics
34:41 Can we solve robotics through learning?
44:55 What is reinforcement learning?
66:36 Tesla Autopilot
68:15 Simulation in reinforcement learning
73:46 Can we learn gravity from data?
76:03 Self-play
77:39 Reward functions
87:01 Bitter lesson by Rich Sutton
92:13 Advice for students interested in AI
93:55 Meaning of life


00:00:00.000 | The following is a conversation with Sergey Levine,
00:00:03.300 | a professor at Berkeley and a world-class researcher
00:00:06.340 | in deep learning, reinforcement learning,
00:00:08.400 | robotics and computer vision,
00:00:10.360 | including the development of algorithms
00:00:12.540 | for end-to-end training of neural network policies
00:00:15.160 | that combine perception and control,
00:00:17.540 | scalable algorithms for inverse reinforcement learning
00:00:20.360 | and in general, deep RL algorithms.
00:00:23.940 | Quick summary of the ads.
00:00:25.280 | Two sponsors, Cash App and ExpressVPN.
00:00:28.580 | Please consider supporting the podcast
00:00:30.280 | by downloading Cash App and using code LexPodcast
00:00:33.800 | and signing up at expressvpn.com/lexpod.
00:00:38.720 | Click the links, buy the stuff.
00:00:41.040 | It's the best way to support this podcast
00:00:42.800 | and in general, the journey I'm on.
00:00:45.300 | If you enjoy this thing, subscribe on YouTube,
00:00:48.720 | review it with five stars on Apple Podcasts,
00:00:50.840 | follow on Spotify, support it on Patreon
00:00:53.440 | or connect with me on Twitter @lexfridman.
00:00:57.640 | As usual, I'll do a few minutes of ads now
00:00:59.900 | and never any ads in the middle
00:01:01.120 | that can break the flow of the conversation.
00:01:03.900 | This show is presented by Cash App,
00:01:06.060 | the number one finance app in the App Store.
00:01:08.340 | When you get it, use code LexPodcast.
00:01:11.580 | Cash App lets you send money to friends,
00:01:13.700 | buy Bitcoin and invest in the stock market
00:01:15.860 | with as little as $1.
00:01:18.300 | Since Cash App does fractional share trading,
00:01:20.660 | let me mention that the order execution algorithm
00:01:23.540 | that works behind the scenes to create the abstraction
00:01:26.820 | of the fractional orders is an algorithmic marvel.
00:01:30.080 | So big props to the Cash App engineers
00:01:32.320 | for taking a step up to the next layer of abstraction
00:01:34.560 | over the stock market,
00:01:35.840 | making trading more accessible for new investors
00:01:38.600 | and diversification much easier.
00:01:41.640 | So again, if you get Cash App from the App Store
00:01:43.920 | or Google Play and use the code LexPodcast,
00:01:48.120 | you get $10 and Cash App will also donate $10 to FIRST,
00:01:52.720 | an organization that is helping to advance robotics
00:01:55.320 | and STEM education for young people around the world.
00:01:58.660 | This show is also sponsored by ExpressVPN.
00:02:04.000 | Get it at ExpressVPN.com/lexpod
00:02:08.320 | to support this podcast and to get an extra three months free
00:02:12.580 | on a one year package.
00:02:14.520 | I've been using ExpressVPN for many years.
00:02:17.400 | I love it.
00:02:18.640 | I think ExpressVPN is the best VPN out there.
00:02:21.960 | They told me to say it,
00:02:23.040 | but it happens to be true in my humble opinion.
00:02:26.160 | It doesn't log your data.
00:02:27.480 | It's crazy fast and it's easy to use
00:02:30.000 | literally just one big power on button.
00:02:32.800 | Again, it's probably obvious to you,
00:02:35.020 | but I should say it again.
00:02:36.560 | It's really important that they don't log your data.
00:02:40.080 | It works on Linux and every other operating system,
00:02:43.180 | but Linux of course is the best operating system.
00:02:46.640 | Shout out to my favorite flavor, Ubuntu MATE 20.04.
00:02:50.600 | Once again, get it at ExpressVPN.com/lexpod
00:02:54.560 | to support this podcast and to get an extra three months free
00:02:58.760 | on a one year package.
00:03:00.800 | And now here's my conversation with Sergey Levine.
00:03:05.300 | What's the difference between a state of the art human,
00:03:08.680 | such as you and I,
00:03:09.920 | well, I don't know if we qualify as state of the art humans,
00:03:11.880 | but a state of the art human and a state of the art robot?
00:03:15.480 | - That's a very interesting question.
00:03:18.720 | Robot capability is, it's kind of a,
00:03:22.360 | I think it's a very tricky thing to understand
00:03:25.320 | because there are some things that are difficult
00:03:28.200 | that we wouldn't think are difficult
00:03:29.280 | and some things that are easy
00:03:30.120 | that we wouldn't think are easy.
00:03:31.680 | And there's also a really big gap
00:03:34.280 | between capabilities of robots in terms of hardware
00:03:37.780 | and their physical capability
00:03:38.800 | and capabilities of robots
00:03:40.380 | in terms of what they can do autonomously.
00:03:42.760 | There is a little video that I think robotics researchers
00:03:46.440 | really like to show,
00:03:47.280 | special robotics learning researchers like myself
00:03:49.360 | from 2004 from Stanford,
00:03:52.160 | which demonstrates a prototype robot called the PR1.
00:03:55.320 | And the PR1 was a robot that was designed
00:03:57.200 | as a home assistance robot.
00:03:59.200 | And there's this beautiful video showing the PR1
00:04:01.800 | tidying up a living room, putting away toys,
00:04:04.880 | and at the end, bringing a beer
00:04:07.200 | to the person sitting on the couch,
00:04:09.320 | which looks really amazing.
00:04:11.540 | And then the punchline is that this robot
00:04:14.080 | is entirely controlled by a person.
00:04:15.960 | So you can, so in some ways,
00:04:17.720 | the gap between a state-of-the-art human
00:04:19.320 | and a state-of-the-art robot,
00:04:20.540 | if the robot has a human brain,
00:04:22.360 | is actually not that large.
00:04:23.880 | Now, obviously, like human bodies are sophisticated
00:04:25.880 | and very robust and resilient in many ways,
00:04:28.200 | but on the whole, if we're willing to like spend
00:04:30.800 | a bit of money and do a bit of engineering,
00:04:32.560 | we can kind of close the hardware gap almost.
00:04:35.520 | But the intelligence gap, that one is very wide.
00:04:40.360 | - And when you say hardware,
00:04:41.280 | you're referring to the physical sort of the actuators,
00:04:43.780 | the actual body of the robot
00:04:45.040 | as opposed to the hardware on which the cognition,
00:04:48.040 | the hardware of the nervous system.
00:04:49.920 | - Yes, exactly.
00:04:50.760 | I'm referring to the body rather than the mind.
00:04:53.320 | So that means that kind of the work is cut out for us.
00:04:56.640 | Like while we can still make the body better,
00:04:59.000 | we kind of know that the big bottleneck right now
00:05:00.800 | is really the mind.
00:05:01.800 | - And how big is that gap?
00:05:03.920 | How big is the difference in your sense
00:05:08.480 | of ability to learn, ability to reason,
00:05:11.040 | ability to perceive the world
00:05:12.400 | between humans and our best robots?
00:05:16.800 | - The gap is very large and the gap becomes larger
00:05:20.600 | the more unexpected events can happen in the world.
00:05:24.560 | So essentially the spectrum along which you can measure
00:05:27.400 | the size of that gap is the spectrum
00:05:30.920 | of how open the world is.
00:05:32.160 | If you control everything in the world very tightly,
00:05:33.920 | if you put the robot in like a factory
00:05:36.200 | and you tell it where everything is
00:05:37.440 | and you rigidly program its motion,
00:05:39.680 | then it can do things.
00:05:41.840 | One might even say in a superhuman way.
00:05:43.600 | It can move faster, it's stronger,
00:05:44.960 | it can lift up a car and things like that.
00:05:47.200 | But as soon as anything starts to vary in the environment,
00:05:50.360 | now it'll trip up and if many, many things vary
00:05:52.600 | like they would like in your kitchen, for example,
00:05:54.640 | then things are pretty much like wide open.
00:05:57.820 | - Now again, we're gonna stick a bit
00:06:00.600 | on the philosophical questions,
00:06:01.960 | but how much on the human side of the cognitive abilities
00:06:06.960 | in your sense is nature versus nurture?
00:06:10.520 | So how much of it is a product of evolution
00:06:15.520 | and how much of it is something we'll learn
00:06:18.520 | from sort of scratch from the day we're born?
00:06:22.080 | - I'm gonna read into your question
00:06:23.280 | as asking about the implications of this for AI.
00:06:26.840 | - Of course, exactly.
00:06:27.680 | - I'm not a biologist,
00:06:28.720 | I can't really speak authoritatively about it.
00:06:30.520 | - So meaning, if it's all about learning,
00:06:35.480 | then there's more hope for AI.
00:06:38.480 | So the way that I look at this is that,
00:06:40.440 | well, first of course, biology is very messy.
00:06:44.960 | And if you ask the question, how does a person do something
00:06:49.200 | or how does a person's mind do something,
00:06:51.280 | you can come up with a bunch of hypotheses
00:06:53.080 | and oftentimes you can find support for many different,
00:06:55.480 | often conflicting hypotheses.
00:06:57.000 | One way that we can approach the question
00:07:00.160 | of what the implications of this for AI are
00:07:03.440 | is we can think about what's sufficient.
00:07:05.440 | So maybe a person is from birth very, very good
00:07:09.840 | at some things like, for example, recognizing faces.
00:07:12.000 | There's a very strong evolutionary pressure to do that.
00:07:14.040 | If you can recognize your mother's face,
00:07:16.280 | then you're more likely to survive
00:07:18.280 | and therefore people are good at this.
00:07:20.520 | But we can also ask like,
00:07:21.360 | what's the minimum sufficient thing?
00:07:23.680 | And one of the ways that we can study
00:07:25.240 | the minimal sufficient thing is we could, for example,
00:07:27.600 | see what people do in unusual situations.
00:07:29.680 | If you present them with things
00:07:30.600 | that evolution couldn't have prepared them for.
00:07:33.800 | Our daily lives actually do this to us all the time.
00:07:35.520 | We didn't evolve to deal with automobiles
00:07:39.320 | and space flight and whatever.
00:07:41.440 | So there are all these situations
00:07:42.520 | that we can find ourselves in and we do very well there.
00:07:45.720 | Like I can give you a joystick to control a robotic arm,
00:07:49.160 | which you've never used before
00:07:50.720 | and you might be pretty bad for the first couple of seconds.
00:07:52.880 | But if I tell you like, your life depends
00:07:54.600 | on using this robotic arm to like open this door,
00:07:58.000 | you'll probably manage it.
00:07:59.480 | Even though you've never seen this device before,
00:08:01.200 | you've never used the joystick control,
00:08:03.280 | and you'll kind of muddle through it.
00:08:04.640 | And that's not your evolved natural ability,
00:08:08.400 | that's your flexibility, your adaptability.
00:08:11.200 | And that's exactly where our current robotic systems
00:08:13.200 | really kind of fall flat.
00:08:14.640 | - But I wonder how much general,
00:08:17.760 | almost what we think of as common sense,
00:08:20.480 | pre-trained models are underneath all of that.
00:08:24.120 | So that ability to adapt to a joystick
00:08:26.640 | requires you to have a kind of,
00:08:31.760 | you know, I'm human, so it's hard for me
00:08:33.360 | to introspect all the knowledge I have about the world.
00:08:36.720 | But it seems like there might be an iceberg underneath
00:08:40.360 | of the amount of knowledge we actually bring to the table.
00:08:43.120 | That's kind of the open question.
00:08:44.320 | - I think there's absolutely an iceberg of knowledge
00:08:46.960 | that we bring to the table,
00:08:47.940 | but I think it's very likely that iceberg of knowledge
00:08:51.240 | is actually built up over our lifetimes.
00:08:53.840 | Because we have a lot of prior experience to draw on
00:08:58.400 | and it kind of makes sense that the right way
00:09:01.260 | for us to optimize our efficiency,
00:09:05.840 | our evolutionary fitness and so on,
00:09:07.320 | is to utilize all of that experience
00:09:10.440 | to build up the best iceberg we can get.
00:09:13.240 | And that's actually one of,
00:09:14.840 | while that sounds an awful lot
00:09:16.140 | like what machine learning actually does,
00:09:18.240 | I think that for modern machine learning,
00:09:19.560 | it's actually a really big challenge
00:09:21.120 | to take this unstructured massive experience
00:09:23.520 | and distill out something that looks
00:09:25.880 | like a common sense understanding of the world.
00:09:28.240 | And perhaps part of that is,
00:09:29.480 | it's not because something about machine learning itself
00:09:32.320 | is broken or hard,
00:09:34.400 | but because we've been a little too rigid
00:09:37.040 | in subscribing to a very supervised,
00:09:39.160 | very rigid notion of learning.
00:09:40.960 | Kind of the input output X's go to Y's sort of model.
00:09:43.880 | And maybe what we really need to do
00:09:46.440 | is to view the world more as like a massive experience
00:09:51.340 | that is not necessarily providing any rigid supervision,
00:09:53.880 | but sort of providing many, many instances
00:09:55.600 | of things that could be.
00:09:56.880 | And then you take that and you distill it
00:09:58.240 | into some sort of common sense understanding.
00:10:00.700 | - I see.
00:10:03.040 | Well, you're painting an optimistic, beautiful picture,
00:10:05.560 | especially from the robotics perspective,
00:10:07.680 | 'cause that means we just need to invest
00:10:09.760 | and build better learning algorithms,
00:10:12.360 | figure out how we can get access to more and more data
00:10:16.320 | for those learning algorithms to extract signal from,
00:10:19.080 | and then accumulate that iceberg of knowledge.
00:10:22.960 | It's a beautiful picture.
00:10:24.000 | It's a hopeful one.
00:10:25.240 | - I think it's potentially a little bit more
00:10:26.640 | than just that.
00:10:27.860 | And this is where we perhaps reach the limits
00:10:31.600 | of our current understanding.
00:10:32.760 | But one thing that I think that the research community
00:10:35.980 | hasn't really resolved in a satisfactory way
00:10:38.040 | is how much it matters where that experience comes from.
00:10:41.680 | Like, do you just like download everything on the internet
00:10:44.920 | and cram it into essentially the 21st century analog
00:10:49.000 | of the giant language model and then see what happens?
00:10:52.560 | Or does it actually matter whether your machine
00:10:55.200 | physically experiences the world,
00:10:56.680 | or in the sense that it actually attempts things,
00:11:00.080 | observes the outcome of its actions,
00:11:01.440 | and kind of augments its experience that way?
00:11:03.620 | - That it chooses which parts of the world
00:11:05.880 | it gets to interact with and observe and learn from.
00:11:10.360 | - Right, it may be that the world is so complex
00:11:12.760 | that simply obtaining a large mass of sort of IID samples
00:11:17.760 | of the world is a very difficult way to go.
00:11:20.840 | But if you are actually interacting with the world
00:11:24.000 | and essentially performing this sort of hard negative mining
00:11:26.360 | by attempting what you think might work,
00:11:28.600 | observing the sometimes happy
00:11:30.480 | and sometimes sad outcomes of that,
00:11:32.200 | and augmenting your understanding using that experience,
00:11:35.680 | and you're just doing this continually for many years,
00:11:38.320 | maybe that sort of data in some sense
00:11:40.900 | is actually much more favorable
00:11:42.200 | to obtaining a common sense understanding.
00:11:44.400 | One reason we might think that this is true
00:11:46.200 | is that what we associate with common sense
00:11:50.480 | or lack of common sense is often characterized
00:11:53.160 | by the ability to reason
00:11:54.640 | about kind of counterfactual questions.
00:11:56.560 | Like if I were to,
00:12:01.860 | here I have this bottle of water sitting on the table,
00:12:04.280 | everything is fine. If I were to knock it over,
00:12:04.280 | which I'm not gonna do,
00:12:05.120 | but if I were to do that, what would happen?
00:12:07.400 | And I know that nothing good would happen from that,
00:12:10.280 | but if I have a bad understanding of the world,
00:12:12.720 | I might think that that's a good way for me
00:12:14.200 | to like gain more utility.
00:12:15.940 | If I actually go about my daily life doing the things
00:12:20.720 | that my current understanding of the world suggests
00:12:23.040 | will give me high utility,
00:12:24.320 | in some ways I'll get exactly the right supervision
00:12:28.840 | to tell me not to do those bad things
00:12:31.240 | and to keep doing the good things.
00:12:33.100 | - So there's a spectrum between IID,
00:12:36.160 | random walk through the space of data,
00:12:38.400 | and then there's what we humans do.
00:12:41.080 | Well, I don't even know if we do it optimal,
00:12:43.200 | but there might be beyond.
00:12:44.920 | So this open question that you raised,
00:12:48.640 | where do you think systems,
00:12:51.600 | intelligent systems that would be able
00:12:53.760 | to deal with this world fall?
00:12:56.320 | Can we do pretty well by reading all of Wikipedia,
00:12:59.480 | sort of randomly sampling it like language models do,
00:13:03.720 | or do we have to be exceptionally selective
00:13:06.880 | and intelligent about which aspects
00:13:09.480 | of the world we interact with?
00:13:11.040 | - So I think this is first an open scientific problem,
00:13:14.440 | and I don't have like a clear answer,
00:13:15.920 | but I can speculate a little bit.
00:13:18.100 | And what I would speculate is that
00:13:20.280 | you don't need to be super, super careful.
00:13:23.600 | I think it's less about like being careful
00:13:26.320 | to avoid the useless stuff,
00:13:27.840 | and more about making sure that you hit
00:13:29.640 | on the really important stuff.
00:13:31.520 | So perhaps it's okay if you spend part of your day
00:13:34.560 | just guided by your curiosity,
00:13:37.200 | visiting interesting regions of your state space,
00:13:40.040 | but it's important for you to,
00:13:42.240 | every once in a while,
00:13:43.080 | make sure that you really try out the solutions
00:13:46.860 | that your current model of the world suggests
00:13:48.900 | might be effective, and observe whether those solutions
00:13:51.180 | are working as you expect or not.
00:13:52.880 | And perhaps some of that is really essential
00:13:56.260 | to have kind of a perpetual improvement loop.
00:13:59.560 | Like this perpetual improvement loop is really like,
00:14:01.700 | that's really the key,
00:14:03.140 | the key that's gonna potentially distinguish
00:14:05.360 | the best current methods from the best methods
00:14:07.240 | of tomorrow in a sense.
00:14:08.740 | - How important do you think is exploration
00:14:10.560 | or total out of the box thinking,
00:14:15.140 | exploration in this space,
00:14:16.540 | is you jump to totally different domains.
00:14:19.280 | So you kind of mentioned there's an optimization problem,
00:14:21.420 | you kind of explore the specifics of a particular strategy,
00:14:25.680 | whatever the thing you're trying to solve.
00:14:27.680 | How important is it to explore totally outside
00:14:30.880 | of the strategies that have been working for you so far?
00:14:34.160 | What's your intuition there?
00:14:35.160 | - Yeah, I think it's a very problem dependent
00:14:37.900 | kind of question.
00:14:38.780 | And I think that that's actually,
00:14:40.680 | in some ways that question gets at
00:14:45.160 | one of the big differences between
00:14:47.520 | sort of the classic formulation
00:14:50.040 | of a reinforcement learning problem
00:14:51.800 | and some of the sort of more open-ended reformulations
00:14:56.320 | of that problem that have been explored in recent years.
00:14:58.080 | So classically, reinforcement learning
00:15:00.280 | is framed as a problem of maximizing utility,
00:15:02.440 | like any kind of rational AI agent.
00:15:04.560 | And then anything you do is in service
00:15:06.240 | to maximizing that utility.
00:15:07.680 | But a very interesting kind of way to look at,
00:15:14.240 | and I'm not necessarily saying
00:15:15.080 | this is the best way to look at it,
00:15:15.900 | but an interesting alternative way
00:15:16.920 | to look at these problems is as something
00:15:19.660 | where you first get to explore the world,
00:15:22.000 | however you please, and then afterwards,
00:15:24.360 | you will be tasked with doing something.
00:15:26.560 | And that might suggest a somewhat different solution.
00:15:28.840 | So if you don't know what you're gonna be tasked with doing
00:15:31.160 | and you just wanna prepare yourself optimally
00:15:33.000 | for whatever your uncertain future holds,
00:15:35.320 | maybe then you will choose to attain some sort of coverage,
00:15:39.360 | build up sort of an arsenal of cognitive tools, if you will,
00:15:42.680 | such that later on when someone tells you,
00:15:44.440 | now your job is to fetch the coffee for me,
00:15:47.080 | you will be well-prepared to undertake that task.
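
For reference (standard notation, not something stated in the conversation), the classic utility-maximizing framing of reinforcement learning mentioned above is the expected discounted return:

$$\max_{\pi}\; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]$$

The alternative formulation sketched here simply defers the reward: the agent first explores with no $r(s_t, a_t)$ available, and only afterwards is handed a task-specific reward to optimize.
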
00:15:49.000 | - And you see that as the modern formulation
00:15:52.160 | of the reinforcement learning problem,
00:15:54.180 | as the kind of, the more multitask,
00:15:56.860 | the general intelligence kind of formulation.
00:15:59.320 | - I think that's one possible vision
00:16:02.740 | of where things might be headed.
00:16:04.480 | I don't think that's by any means the mainstream
00:16:06.720 | or standard way of doing things,
00:16:08.040 | and it's not like if I had to--
00:16:09.800 | - But I like it.
00:16:10.640 | It's a beautiful vision.
00:16:11.740 | So maybe actually take a step back.
00:16:13.960 | What is the goal of robotics?
00:16:16.520 | What's the general problem of robotics
00:16:18.160 | we're trying to solve?
00:16:19.000 | You actually kind of painted two pictures here,
00:16:21.080 | one of sort of the narrow, one of the general.
00:16:23.200 | What, in your view, is the big problem of robotics?
00:16:26.500 | Again, ridiculously philosophical, high-level questions.
00:16:29.600 | - I think that, you know, maybe there are two ways
00:16:33.520 | I can answer this question.
00:16:34.520 | One is there's a very pragmatic problem,
00:16:36.920 | which is like what would make robots,
00:16:40.800 | what would sort of maximize the usefulness of robots?
00:16:43.720 | And there the answer might be something like a system
00:16:47.420 | that can perform whatever task
00:16:52.420 | a human user sets for it, you know,
00:16:57.640 | within the physical constraints, of course.
00:16:59.500 | If you tell it to teleport to another planet,
00:17:01.440 | it probably can't do that.
00:17:02.460 | But if you ask it to do something
00:17:03.900 | that's within its physical capability,
00:17:05.500 | then potentially with a little bit of additional training
00:17:08.400 | or a little bit of additional trial and error,
00:17:10.340 | it ought to be able to figure it out
00:17:11.900 | in much the same way as like a human teleoperator
00:17:14.300 | ought to figure out how to drive the robot to do that.
00:17:16.580 | That's kind of the very pragmatic view
00:17:19.580 | of what it would take to kind of solve
00:17:21.980 | the robotics problem, if you will.
00:17:23.700 | But I think that there is a second answer,
00:17:26.940 | and that answer is a lot closer
00:17:28.940 | to why I want to work on robotics,
00:17:30.540 | which is that I think it's less about
00:17:33.320 | what it would take to do a really good job
00:17:36.020 | in the world of robotics, but more the other way around,
00:17:38.100 | what robotics can bring to the table
00:17:40.700 | to help us understand artificial intelligence.
00:17:43.300 | - So your dream fundamentally is to understand intelligence.
00:17:47.860 | - Yes, I think that's the dream for many people
00:17:50.940 | who actually work in this space.
00:17:53.220 | I think that there's something very pragmatic
00:17:56.700 | and very useful about studying robotics,
00:17:58.540 | but I do think that a lot of people that go into this field,
00:18:01.140 | actually, you know, the things that they draw inspiration
00:18:03.900 | from are the potential for robots
00:18:06.860 | to help us learn about intelligence and about ourselves.
00:18:10.620 | - So that's fascinating, that robotics is basically
00:18:15.220 | the space by which you can get closer to understanding
00:18:18.420 | the fundamentals of artificial intelligence.
00:18:20.580 | So what is it about robotics that's different
00:18:23.780 | from some of the other approaches?
00:18:25.360 | So if we look at some of the early breakthroughs
00:18:27.860 | in deep learning or in the computer vision space
00:18:30.500 | and the natural language processing,
00:18:32.500 | there's really nice, clean benchmarks
00:18:34.860 | that a lot of people competed on
00:18:36.300 | and thereby came up with a lot of brilliant ideas.
00:18:38.380 | What's the fundamental difference to you
00:18:39.900 | between computer vision purely defined on ImageNet
00:18:43.780 | and kind of the bigger robotics problem?
00:18:46.540 | - So there are a couple of things.
00:18:48.060 | One is that with robotics,
00:18:50.200 | you kind of have to take away many of the crutches.
00:18:55.340 | So you have to deal with both the particular problems
00:19:00.260 | of perception, control, and so on,
00:19:01.700 | but you also have to deal with the integration
00:19:03.060 | of those things.
00:19:04.220 | And classically, we've always thought of the integration
00:19:07.100 | as kind of a separate problem.
00:19:08.700 | So a classic kind of modular engineering approach
00:19:11.060 | is that we solve the individual sub problems
00:19:12.860 | and wire them together, and then the whole thing works.
00:19:15.860 | And one of the things that we've been seeing
00:19:17.580 | over the last couple of decades is that,
00:19:19.460 | well, maybe studying the thing as a whole
00:19:22.100 | might lead to just like very different solutions
00:19:24.300 | than if we were to study the parts and wire them together.
00:19:26.540 | So the integrative nature of robotics research
00:19:29.820 | helps us see the different perspectives on the problem.
00:19:34.060 | Another part of the answer is that with robotics,
00:19:37.820 | it casts a certain paradox into very sharp relief.
00:19:41.420 | So this is sometimes referred to as Moravec's paradox,
00:19:44.580 | the idea that in artificial intelligence,
00:19:48.220 | things that are very hard for people
00:19:50.700 | can be very easy for machines and vice versa.
00:19:52.660 | Things that are very easy for people
00:19:53.740 | can be very hard for machines.
00:19:54.780 | So, you know, integral and differential calculus
00:19:59.660 | is pretty difficult to learn for people,
00:20:01.920 | but if you program a computer to do it,
00:20:03.660 | it can derive derivatives and integrals for you
00:20:06.020 | all day long without any trouble.
00:20:08.220 | Whereas some things like, you know,
00:20:10.980 | drinking from a cup of water,
00:20:12.420 | very easy for a person to do,
00:20:13.620 | very hard for a robot to deal with.
00:20:16.340 | And sometimes when we see such blatant discrepancies,
00:20:20.140 | that gives us a really strong hint
00:20:21.520 | that we're missing something important.
00:20:23.040 | So if we really try to zero in on those discrepancies,
00:20:25.620 | we might find that little bit that we're missing.
00:20:27.900 | And it's not that we need to make machines better
00:20:30.460 | or worse at math and better at drinking water,
00:20:33.040 | but just that by studying those discrepancies,
00:20:34.940 | we might find some new insight.
00:20:37.660 | - So that could be in any space.
00:20:40.260 | It doesn't have to be robotics,
00:20:41.460 | but you're saying, I mean,
00:20:44.100 | I get it's kind of interesting that robotics
00:20:46.660 | seems to have a lot of those discrepancies.
00:20:49.420 | So the Hans Moravec paradox is probably referring
00:20:53.740 | to the space of the physical interaction,
00:20:56.500 | like you said, object manipulation, walking,
00:20:59.220 | all the kind of stuff we do in the physical world.
00:21:02.340 | How do you make sense,
00:21:05.800 | if you were to try to disentangle the Moravec paradox,
00:21:10.800 | like why is there such a gap in our intuition about it?
00:21:17.640 | Why do you think manipulating objects is so hard
00:21:20.640 | from everything you've learned
00:21:22.520 | from applying reinforcement learning in this space?
00:21:25.660 | - Yeah, I think that one reason is maybe that
00:21:31.240 | for many of the other problems that we've studied
00:21:34.320 | in AI and computer science and so on,
00:21:36.880 | the notion of input, output and supervision
00:21:41.220 | is much, much cleaner.
00:21:42.280 | So computer vision, for example,
00:21:43.760 | deals with very complex inputs,
00:21:45.600 | but it's comparatively a bit easier,
00:21:48.540 | at least up to some level of abstraction
00:21:51.540 | to cast it as a very tightly supervised problem.
00:21:54.720 | It's comparatively much, much harder
00:21:56.680 | to cast robotic manipulation
00:21:58.480 | as a very tightly supervised problem.
00:22:00.460 | You can do it, it just doesn't seem to work all that well.
00:22:03.360 | So you could say that, well,
00:22:04.360 | maybe we get a label data set
00:22:06.040 | where we know exactly which motor commands to send
00:22:08.200 | and then we train on that.
00:22:09.100 | But for various reasons,
00:22:11.220 | that's not actually like such a great solution.
00:22:13.560 | And it also doesn't seem to be even remotely similar
00:22:16.320 | to how people and animals learn to do things
00:22:17.920 | because we're not told by like our parents,
00:22:20.360 | here's how you fire your muscles in order to walk.
00:22:24.200 | We do get some guidance,
00:22:26.140 | but the really low level detailed stuff,
00:22:28.160 | we figure out mostly on our own.
00:22:29.640 | - And that's what you mean by tightly coupled,
00:22:31.120 | that every single little sub action
00:22:33.760 | gets a supervised signal of whether it's a good one or not.
00:22:37.120 | - Right.
00:22:37.960 | So while in computer vision,
00:22:39.100 | you could sort of imagine up to a level of abstraction
00:22:41.320 | that maybe somebody told you this is a car
00:22:43.400 | and this is a cat and this is a dog,
00:22:45.320 | in motor control, it's very clear
00:22:46.760 | that that was not the case.
00:22:48.120 | - If we look at sort of the sub spaces of robotics,
00:22:53.880 | that again, as you said,
00:22:56.280 | robotics integrates all of them together
00:22:58.040 | and we get to see how this beautiful mess interplays.
00:23:01.240 | So there's nevertheless still perception.
00:23:03.960 | So it's the computer vision problem,
00:23:06.280 | broadly speaking, understanding the environment.
00:23:09.720 | Then there's also, maybe you can correct me
00:23:11.720 | on this kind of categorization of the space.
00:23:14.480 | Then there's prediction in trying to anticipate
00:23:18.520 | what things are going to do into the future
00:23:20.580 | in order for you to be able to act in that world.
00:23:24.360 | And then there's also this game theoretic aspect
00:23:28.120 | of how your actions will change the behavior of others.
00:23:32.920 | In this kind of space, what,
00:23:36.200 | and this is bigger than reinforcement learning,
00:23:38.120 | this is just broadly looking at the problem in robotics.
00:23:40.820 | What's the hardest problem here?
00:23:42.720 | Or is there, or is what you said true
00:23:46.280 | that when you start to look at all of them together,
00:23:52.040 | that's a whole nother thing?
00:23:54.320 | Like you can't even say which one individually is harder
00:23:57.440 | because all of them together,
00:23:58.800 | you should only be looking at them all together.
00:24:01.480 | - I think when you look at them all together,
00:24:03.400 | some things actually become easier.
00:24:05.160 | And I think that's actually pretty important.
00:24:07.420 | So we had, back in 2014, we had some work,
00:24:12.420 | basically our first work on end-to-end
00:24:15.520 | reinforcement learning for robotic manipulation skills
00:24:17.880 | from vision, which at the time was something
00:24:20.660 | that seemed a little inflammatory and controversial
00:24:23.720 | in the robotics world.
00:24:25.340 | But other than the inflammatory
00:24:28.200 | and controversial part of it,
00:24:29.440 | the point that we were actually trying to make in that work
00:24:31.800 | is that for the particular case
00:24:33.960 | of combining perception and control,
00:24:36.060 | you could actually do better if you treat them together
00:24:38.380 | than if you try to separate them.
00:24:39.580 | And the way that we tried to demonstrate this
00:24:41.420 | is we picked a fairly simple motor control task
00:24:43.640 | where a robot had to insert a little red trapezoid
00:24:47.420 | into a trapezoidal hole.
00:24:49.340 | And we had our separated solution,
00:24:52.320 | which involved first detecting the hole
00:24:53.920 | using a pose detector,
00:24:54.940 | and then actuating the arm to put it in.
00:24:57.480 | And then our end-to-end solution,
00:24:58.780 | which just mapped pixels to the torques.
00:25:01.440 | And one of the things we observed is that
00:25:03.920 | if you use the intent solution,
00:25:05.600 | essentially the pressure on the perception part
00:25:07.160 | of the model is actually lower.
00:25:08.280 | Like it doesn't have to figure out exactly
00:25:09.720 | where the thing is in 3D space.
00:25:11.240 | It just needs to figure out where it is,
00:25:14.120 | distributing the errors in such a way
00:25:15.600 | that the horizontal difference matters
00:25:17.560 | more than the vertical difference,
00:25:18.680 | because vertically it just pushes it down all the way
00:25:20.500 | until it can't go any further.
00:25:21.960 | And their perceptual errors are a lot less harmful,
00:25:24.560 | whereas perpendicular to the direction of motion,
00:25:26.920 | perceptual errors are much more harmful.
00:25:28.940 | So the point is that if you combine these two things,
00:25:32.120 | you can trade off errors between the components
00:25:34.440 | optimally to best accomplish the task.
00:25:37.980 | And the components can actually be weaker
00:25:39.740 | while still leading to better overall performance.
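
As a rough illustration of that contrast, here is a minimal, hypothetical sketch in PyTorch; it is not the architecture from the 2014 work, just a pixels-to-torques policy next to a comment describing the separated baseline (the module names in the comment are made up).

```python
import torch
import torch.nn as nn

class EndToEndPolicy(nn.Module):
    """Maps raw pixels directly to joint torques, so the perception layers are
    trained on the task objective and can trade off errors in whatever way
    hurts the task least (e.g. tolerating vertical error more than horizontal)."""
    def __init__(self, num_joints=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, num_joints))

    def forward(self, image):                   # image: (batch, 3, H, W)
        return self.head(self.encoder(image))   # torques: (batch, num_joints)

# The "separated" baseline would instead train a pose detector on its own
# perception loss and hand its estimate to a hand-designed controller:
#   pose = pose_detector(image)                  # optimized for pose accuracy alone
#   torques = feedback_controller(pose, target)  # hypothetical controller
# Training end-to-end on the task loss is what lets the perception component
# be "weaker" while overall task performance improves.
```
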
00:25:41.960 | - That's a profound idea.
00:25:43.920 | I mean, in the space of pegs and things like that,
00:25:47.240 | it's quite simple.
00:25:48.760 | It almost is tempting to overlook,
00:25:51.160 | but that seems to be at least intuitively an idea
00:25:54.960 | that should generalize to basically all aspects
00:25:58.560 | of perception and control.
00:26:00.080 | - Of course.
00:26:00.900 | - That one strengthens the other.
00:26:01.960 | - Yeah, and people who have studied
00:26:05.160 | sort of perceptual heuristics in humans and animals
00:26:07.560 | find things like that all the time.
00:26:08.840 | So one very well-known example
00:26:10.760 | of this is something called the gaze heuristic,
00:26:12.280 | which is a little trick that you can use
00:26:15.320 | to intercept a flying object.
00:26:17.200 | So if you want to catch a ball, for instance,
00:26:19.280 | you could try to localize it in 3D space,
00:26:21.560 | estimate its velocity,
00:26:22.680 | estimate the effect of wind resistance,
00:26:24.080 | solve a complex system of differential equations
00:26:25.880 | in your head,
00:26:27.240 | or you can maintain a running speed
00:26:31.400 | so that the object stays in the same position
00:26:32.920 | as in your field of view.
00:26:34.040 | So if it dips a little bit, you speed up.
00:26:35.640 | If it rises a little bit, you slow down.
00:26:38.080 | And if you follow the simple rule,
00:26:39.240 | you'll actually arrive at exactly the place
00:26:40.680 | where the object lands and you'll catch it.
00:26:42.720 | And humans use it when they play baseball.
00:26:45.160 | Human pilots use it when they fly airplanes
00:26:47.040 | to figure out if they're about to collide with somebody.
00:26:49.080 | Frogs use this to catch insects and so on and so on.
00:26:51.480 | So this is something that actually happens in nature.
00:26:53.560 | And I'm sure this is just one instance of it
00:26:55.440 | that we were able to identify
00:26:56.760 | just because all the scientists were able to identify
00:26:59.160 | because it's so prevalent,
00:27:00.000 | but there are probably many others.
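
To make the gaze heuristic concrete, here is a minimal sketch (illustrative only, not from the conversation) of the rule described above: adjust running speed so that the object's apparent elevation in the field of view stays constant.

```python
def gaze_heuristic_speed_update(current_speed, elevation_angle,
                                previous_elevation_angle, gain=1.0):
    """One step of the gaze heuristic: if the ball appears to dip in the
    field of view, speed up; if it appears to rise, slow down."""
    angle_change = elevation_angle - previous_elevation_angle
    # Negative change (ball dips) -> increase speed; positive change -> decrease it.
    return current_speed - gain * angle_change
```

Applied at every perception step, this simple update brings the catcher to the landing point without estimating velocity, wind resistance, or solving any differential equations.
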
00:27:02.200 | - Do you have a, just so we can zoom in
00:27:04.200 | as we talk about robotics,
00:27:05.400 | do you have a canonical problem,
00:27:07.240 | sort of a simple, clean, beautiful,
00:27:10.000 | representative problem in robotics
00:27:12.520 | that you think about
00:27:14.040 | when you're thinking about some of these problems?
00:27:15.880 | We talked about robotic manipulation.
00:27:18.640 | To me, that seems intuitively,
00:27:21.360 | at least the robotics community has converged towards that
00:27:25.560 | as a space that's the canonical problem.
00:27:28.600 | If you agree, then maybe do you zoom in
00:27:30.760 | in some particular aspect of that problem
00:27:33.200 | that you just like?
00:27:34.400 | Like if we solve that problem perfectly,
00:27:37.000 | it'll unlock a major step
00:27:39.120 | towards human-level intelligence.
00:27:42.880 | - I don't think I have like a really great answer
00:27:45.760 | to that.
00:27:46.600 | And I think partly the reason I don't have a great answer
00:27:49.520 | kind of has to do with the,
00:27:51.120 | it has to do with the fact that the difficulty
00:27:54.880 | is really in the flexibility and adaptability
00:27:57.400 | rather than in doing a particular thing really, really well.
00:28:00.960 | So it's hard to just say like,
00:28:03.760 | oh, if you can, I don't know,
00:28:05.120 | like shuffle a deck of cards as fast
00:28:07.720 | as like a Vegas casino dealer,
00:28:10.320 | then you'll be very proficient.
00:28:12.760 | It's really the ability to quickly figure out
00:28:16.840 | how to do some arbitrary new thing well enough
00:28:21.840 | to like, you know, to move on to the next arbitrary thing.
00:28:26.000 | - But the source of newness and uncertainty,
00:28:29.800 | have you found problems in which it's easy
00:28:33.800 | to generate new kinds of newness?
00:28:37.840 | - Yeah.
00:28:38.680 | - New types of newness.
00:28:40.400 | - Yeah.
00:28:41.240 | So a few years ago,
00:28:43.160 | so if you had asked me this question around like 2016,
00:28:46.080 | maybe I would have probably said that robotic grasping
00:28:48.680 | is a really great example of that
00:28:50.840 | because it's a task with great real world utility.
00:28:54.200 | Like you will get a lot of money if you can do it well.
00:28:57.160 | - What is robotic grasping?
00:28:58.840 | - Picking up any object.
00:29:00.760 | - With a robotic hand.
00:29:02.280 | - Exactly.
00:29:03.120 | So you will get a lot of money if you do it well
00:29:04.400 | because lots of people want to run warehouses with robots.
00:29:07.560 | And it's highly non-trivial
00:29:08.800 | because very different objects
00:29:12.360 | will require very different grasping strategies.
00:29:14.960 | But actually since then,
00:29:16.880 | people have gotten really good at building systems
00:29:19.360 | to solve this problem.
00:29:21.120 | It's to the point where I'm not actually sure
00:29:22.760 | how much more progress we can make
00:29:25.680 | with that as like the main guiding thing.
00:29:29.400 | But it's kind of interesting to see the kind of methods
00:29:31.880 | that have actually worked well in that space
00:29:33.600 | because a robotic grasping classically
00:29:36.800 | used to be regarded very much as
00:29:39.160 | kind of almost like a geometry problem.
00:29:41.280 | So people who have studied the history of computer vision
00:29:44.960 | will find this very familiar
00:29:46.720 | that it's kind of in the same way
00:29:48.200 | that in the early days of computer vision,
00:29:49.600 | people thought of it very much
00:29:50.600 | as like an inverse graphics thing.
00:29:52.280 | In robotic grasping,
00:29:53.560 | people thought of it as an inverse physics problem.
00:29:56.000 | Essentially, you look at what's in front of you,
00:29:58.640 | figure out the shapes,
00:29:59.880 | then use your best estimate of the laws of physics
00:30:02.480 | to figure out where to put your fingers
00:30:03.720 | and then you pick up the thing.
00:30:05.800 | And it turns out that what works really well
00:30:07.400 | for robotic grasping,
00:30:08.440 | instantiated in many different recent works,
00:30:11.320 | including our own, but also ones from many other labs,
00:30:13.760 | is to use learning methods
00:30:16.760 | with some combination of either exhaustive simulation
00:30:19.880 | or like actual real world trial and error.
00:30:22.080 | And it turns out that those things
00:30:23.080 | actually work really well
00:30:23.920 | and then you don't have to worry about
00:30:25.200 | solving geometry problems or physics problems.
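
As a rough sketch of that learning-based recipe (the `success_model` interface and function names here are hypothetical, not any particular system's API): score candidate grasps with a success predictor trained on trial-and-error outcomes and execute the highest-scoring one.

```python
import numpy as np

def choose_grasp(success_model, image, candidate_grasps):
    """Pick the candidate grasp the learned model believes is most likely to succeed."""
    scores = [success_model.predict_success(image, grasp) for grasp in candidate_grasps]
    return candidate_grasps[int(np.argmax(scores))]

# success_model would be trained with ordinary supervised learning on
# (image, grasp, succeeded?) tuples gathered from exhaustive simulation or
# real-world attempts -- no explicit geometry or physics reasoning required.
```
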
00:30:27.480 | - So what are, just by the way, in the grasping,
00:30:32.360 | what are the difficulties that have been worked on?
00:30:35.280 | So one is like the materials of things,
00:30:38.320 | maybe occlusions and the perception side.
00:30:40.800 | Why is it such a difficult,
00:30:42.400 | why is picking stuff up such a difficult problem?
00:30:45.000 | - Yeah, it's a difficult problem
00:30:47.280 | because the number of things
00:30:50.160 | that you might have to deal with
00:30:51.560 | or the variety of things that you have to deal with
00:30:53.120 | is extremely large.
00:30:54.520 | And oftentimes things that work for one class of objects
00:30:58.880 | won't work for other classes of objects.
00:31:00.240 | So if you get really good at picking up boxes
00:31:03.880 | and now you have to pick up plastic bags,
00:31:06.160 | you just need to employ a very different strategy.
00:31:09.560 | And there are many properties of objects
00:31:13.160 | that are more than just their geometry.
00:31:15.280 | It has to do with the bits that are easier to pick up,
00:31:18.800 | the bits that are harder to pick up,
00:31:19.760 | the bits that are more flexible,
00:31:20.840 | the bits that will cause the thing to pivot and bend
00:31:23.520 | and drop out of your hand
00:31:24.960 | versus the bits that result in a nice secure grasp,
00:31:27.840 | things that are flexible,
00:31:29.120 | things that if you pick them up the wrong way,
00:31:30.560 | they'll fall upside down and the contents will spill out.
00:31:33.720 | So there's all these little details that come up,
00:31:36.000 | but the task is still kind of,
00:31:38.200 | can be characterized as one task.
00:31:39.600 | Like there's a very clear notion of
00:31:41.200 | you did it or you didn't do it.
00:31:42.760 | - So in terms of spilling things,
00:31:46.960 | there creeps in this notion that starts to sound
00:31:50.520 | and feel like common sense reasoning.
00:31:52.980 | Do you think solving the general problem of robotics
00:31:57.980 | requires common sense reasoning,
00:32:01.640 | requires general intelligence,
00:32:04.720 | this kind of human level capability of,
00:32:07.520 | like you said, be robust and deal with uncertainty,
00:32:11.680 | but also be able to sort of reason
00:32:13.360 | and assimilate different pieces of knowledge that you have?
00:32:16.460 | Yeah.
00:32:19.280 | What are your thoughts on the needs
00:32:23.600 | of common sense reasoning in the space
00:32:25.640 | of the general robotics problem?
00:32:28.520 | - So I'm gonna slightly dodge that question
00:32:30.280 | and say that I think maybe actually
00:32:32.320 | it's the other way around is that studying robotics
00:32:36.000 | can help us understand how to put common sense
00:32:38.260 | into our AI systems.
00:32:40.420 | One way to think about common sense is that,
00:32:43.000 | and why our current systems might lack common sense,
00:32:45.440 | is that common sense is a property,
00:32:47.080 | is an emergent property of actually having to interact
00:32:51.620 | with a particular world, a particular universe,
00:32:54.240 | and get things done in that universe.
00:32:56.080 | So you might think that, for instance,
00:32:57.920 | like an image captioning system,
00:33:00.640 | maybe it looks at pictures of the world
00:33:03.760 | and it types out English sentences.
00:33:05.820 | So it kind of deals with our world.
00:33:07.900 | And then you can easily construct situations
00:33:11.040 | where image captioning systems do things
00:33:12.840 | that defy common sense,
00:33:13.900 | like give it a picture of a person wearing a fur coat
00:33:16.200 | and it'll say it's a teddy bear.
00:33:18.460 | But I think what's really happening in those settings
00:33:20.720 | is that the system doesn't actually live in our world,
00:33:24.120 | it lives in its own world that consists of pixels
00:33:26.120 | and English sentences,
00:33:27.480 | and doesn't actually consist of like,
00:33:29.920 | having to put on a fur coat in the winter
00:33:31.440 | so you don't get cold.
00:33:33.120 | So perhaps the reason for the disconnect
00:33:35.960 | is that the systems that we have now
00:33:39.240 | simply inhabit a different universe.
00:33:41.160 | And if we build AI systems that are forced to deal
00:33:43.200 | with all of the messiness and complexity of our universe,
00:33:46.560 | maybe they will have to acquire common sense
00:33:48.900 | to essentially maximize their utility.
00:33:51.520 | Whereas the systems we're building now
00:33:52.880 | don't have to do that, they can take some shortcut.
00:33:56.060 | - That's fascinating.
00:33:57.200 | You've a couple times already sort of reframed
00:34:00.120 | the role of robotics in this whole thing.
00:34:02.040 | And for some reason,
00:34:03.880 | I don't know if my way of thinking is common,
00:34:06.640 | but I thought like,
00:34:08.040 | we need to understand and solve intelligence
00:34:10.360 | in order to solve robotics.
00:34:12.720 | And you're kind of framing it as,
00:34:14.840 | no, robotics is one of the best ways
00:34:16.720 | to just study artificial intelligence
00:34:18.760 | and build sort of like,
00:34:20.440 | robotics is like the right space
00:34:22.260 | in which you get to explore
00:34:25.640 | some of the fundamental learning mechanisms,
00:34:28.580 | fundamental sort of multimodal, multitask,
00:34:33.120 | aggregation of knowledge mechanisms
00:34:35.060 | that are required for general intelligence.
00:34:36.660 | That's really interesting way to think about it.
00:34:39.180 | But let me ask about learning.
00:34:41.420 | Can the general sort of robotics,
00:34:44.100 | the epitome of the robotics problem
00:34:45.740 | be solved purely through learning,
00:34:47.860 | perhaps end-to-end learning?
00:34:51.740 | Sort of learning from scratch
00:34:54.620 | as opposed to injecting human expertise
00:34:57.080 | and rules and heuristics and so on?
00:34:59.080 | - I think that in terms of the spirit of the question,
00:35:02.420 | I would say yes.
00:35:04.700 | I mean, I think that though in some ways
00:35:07.560 | it's maybe like an overly sharp dichotomy.
00:35:11.120 | Like, I think that in some ways when we build algorithms,
00:35:14.540 | at some point a person does something.
00:35:18.040 | - Yeah, hyper parameters, there's always--
00:35:19.800 | - A person turned on the computer,
00:35:21.160 | a person implemented TensorFlow.
00:35:24.920 | But yeah, I think that in terms of the point
00:35:28.560 | that you're getting at, I do think the answer is yes.
00:35:30.280 | I think that we can solve many problems
00:35:34.240 | that have previously required meticulous manual engineering
00:35:37.480 | through automated optimization techniques.
00:35:39.920 | And actually one thing I will say on this topic is
00:35:42.200 | I don't think this is actually a very radical
00:35:44.240 | or very new idea.
00:35:45.200 | I think people have been thinking
00:35:47.840 | about automated optimization techniques
00:35:49.840 | as a way to do control for a very, very long time.
00:35:53.240 | And in some ways what's changed is really more the name.
00:35:57.880 | So today we would say that, oh, my robot does
00:36:01.760 | machine learning, it does reinforcement learning.
00:36:03.560 | Maybe in the 1960s you'd say,
00:36:05.400 | oh, my robot is doing optimal control.
00:36:08.280 | And maybe the difference between typing out
00:36:10.440 | a system of differential equations
00:36:12.160 | and doing feedback linearization
00:36:14.000 | versus training a neural net,
00:36:15.680 | maybe it's not such a large difference.
00:36:16.920 | It's just pushing the optimization deeper
00:36:20.480 | and deeper into the thing.
00:36:22.280 | - Well, it's interesting you think that way,
00:36:23.880 | but with, especially with deep learning,
00:36:26.320 | that the accumulation of sort of experiences
00:36:30.440 | in data form to form deep representations
00:36:34.920 | starts to feel like knowledge
00:36:36.840 | as opposed to optimal control.
00:36:39.360 | So this feels like there's an accumulation of knowledge
00:36:41.720 | through the learning process.
00:36:43.000 | - Yes, yeah.
00:36:43.840 | So I think that is a good point,
00:36:44.800 | that one big difference between learning-based systems
00:36:48.080 | and classic optimal control systems
00:36:49.800 | is that learning-based systems in principle
00:36:52.160 | should get better and better the more they do something.
00:36:55.080 | And I do think that that's actually
00:36:56.320 | a very, very powerful difference.
00:36:58.040 | - So if we look back at the world of expert systems
00:37:01.600 | and symbolic AI and so on,
00:37:03.920 | of using logic to accumulate expertise,
00:37:07.400 | human expertise, human encoded expertise,
00:37:10.960 | do you think that will have a role at some point?
00:37:14.760 | Deep learning, machine learning, reinforcement learning
00:37:17.080 | has shown incredible results and breakthroughs
00:37:21.240 | and just inspired thousands, maybe millions of researchers.
00:37:26.240 | But there's this less popular now,
00:37:30.600 | but it used to be popular idea of symbolic AI.
00:37:32.800 | Do you think that will have a role?
00:37:35.240 | - I think in some ways,
00:37:36.480 | the kind of the descendants of symbolic AI
00:37:42.240 | actually already have a role.
00:37:44.640 | So this is the highly biased history from my perspective.
00:37:48.760 | You say that, well, initially we thought
00:37:50.960 | that rational decision-making
00:37:52.760 | involves logical manipulation.
00:37:54.680 | So you have some model of the world
00:37:56.960 | expressed in terms of logic.
00:37:59.880 | You have some query,
00:38:00.720 | like what action do I take in order for X to be true?
00:38:04.600 | And then you manipulate
00:38:05.600 | your logical symbolic representation to get an answer.
00:38:08.360 | What that turned into somewhere in the 1990s is,
00:38:11.880 | well, instead of building kind of predicates
00:38:14.280 | and statements that have true or false values,
00:38:17.480 | we'll build probabilistic systems
00:38:19.200 | where things have probabilities associated
00:38:21.880 | and probabilities of being true and false.
00:38:23.120 | And that turned into Bayes nets.
00:38:24.960 | And that provided sort of a boost
00:38:27.000 | to what were really,
00:38:28.800 | still essentially logical inference systems,
00:38:30.520 | just probabilistic logical inference systems.
00:38:32.800 | And then people said, well, let's actually learn
00:38:35.560 | the individual probabilities inside these models.
00:38:39.120 | And then people said, well,
00:38:40.680 | let's not even specify the nodes in the models.
00:38:42.760 | Let's just put a big neural net in there.
00:38:45.320 | But in many ways,
00:38:46.160 | I see these as actually kind of descendants
00:38:47.880 | from the same idea.
00:38:48.800 | It's essentially instantiating rational decision-making
00:38:51.440 | by means of some inference process
00:38:53.600 | and learning by means of an optimization process.
00:38:56.680 | So in a sense, I would say, yes, it has a place.
00:39:00.000 | And in many ways, that place is,
00:39:01.920 | it already holds that place.
00:39:04.360 | - It's already in there.
00:39:05.480 | Yeah, it's just by different,
00:39:06.680 | it looks slightly different than it was before.
00:39:09.000 | - Yeah, but there are some things
00:39:10.440 | that we can think about
00:39:11.640 | that make this a little bit more obvious.
00:39:13.120 | Like if I train a big neural net model
00:39:15.800 | to predict what will happen
00:39:17.160 | in response to my robot's actions,
00:39:18.880 | and then I run probabilistic inference,
00:39:21.480 | meaning I invert that model
00:39:22.800 | to figure out the actions
00:39:23.720 | that lead to some plausible outcome.
00:39:24.960 | Like to me, that seems like a kind of logic.
00:39:27.440 | You have a model of the world
00:39:28.840 | that just happens to be expressed by a neural net,
00:39:31.240 | and you are doing some inference procedure,
00:39:33.560 | some sort of manipulation on that model
00:39:35.400 | to figure out the answer to a query that you have.
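
A minimal sketch of that kind of inference on a learned model (the `dynamics_model(state, action)` interface and its `action_dim` attribute are assumptions for illustration): optimize a sequence of actions so the model's predicted outcome matches a desired goal.

```python
import torch

def plan_actions(dynamics_model, initial_state, goal_state,
                 horizon=10, iters=100, lr=0.05):
    """Gradient-based action optimization through a learned, differentiable model."""
    actions = torch.zeros(horizon, dynamics_model.action_dim, requires_grad=True)
    optimizer = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        state = initial_state
        for t in range(horizon):
            state = dynamics_model(state, actions[t])  # predicted next state
        loss = torch.sum((state - goal_state) ** 2)    # distance to the desired outcome
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return actions.detach()
```
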
00:39:39.520 | - It's the interpretability,
00:39:41.000 | it's the explainability though
00:39:42.520 | that seems to be lacking more so.
00:39:44.520 | Because the nice thing about expert systems
00:39:48.040 | is you can follow the reasoning of the system.
00:39:50.600 | That to us mere humans is somehow compelling.
00:39:54.160 | It's just, I don't know what to make of this fact
00:40:00.440 | that there's a human desire for intelligent systems
00:40:04.640 | to be able to convey in a poetic way to us
00:40:10.040 | why it made the decisions it did.
00:40:12.480 | Like tell a convincing story.
00:40:15.160 | And perhaps that's like a silly human thing.
00:40:20.160 | Like we shouldn't expect that of intelligent systems.
00:40:22.960 | Like we should be super happy
00:40:24.380 | that there is intelligent systems out there.
00:40:27.600 | But if I were to sort of psychoanalyze the researchers
00:40:31.920 | at the time, I would say expert systems
00:40:33.800 | connected to that part.
00:40:35.880 | That desire of AI researchers
00:40:37.800 | for systems to be explainable.
00:40:40.160 | I mean, maybe on that topic,
00:40:41.640 | do you have a hope that sort of inferences
00:40:45.680 | of learning-based systems will be as explainable
00:40:50.680 | as the dream was with expert systems, for example?
00:40:53.940 | - I think it's a very complicated question
00:40:56.680 | because I think that in some ways
00:40:58.620 | the question of explainability
00:41:00.680 | is kind of very closely tied
00:41:03.480 | to the question of like performance.
00:41:06.920 | Like, why do you want your system to explain itself?
00:41:09.320 | Well, so that when it screws up,
00:41:11.320 | you can kind of figure out why it did it.
00:41:13.800 | - Right.
00:41:14.640 | - But in some ways that's a much bigger problem, actually.
00:41:17.240 | Like your system might screw up
00:41:19.320 | and then it might screw up in how it explains itself.
00:41:22.680 | Or you might have some bug somewhere
00:41:24.920 | so that it's not actually doing what it was supposed to do.
00:41:26.920 | So, maybe a good way to view that problem
00:41:30.400 | is really as a bigger problem of verification and validation
00:41:36.160 | of which explainability is sort of one component.
00:41:39.320 | - I see.
00:41:40.160 | I just see it differently.
00:41:41.160 | I see explainability, you put it beautifully.
00:41:43.960 | I think you actually summarized the field of explainability.
00:41:46.800 | But to me, there's another aspect of explainability
00:41:49.720 | which is like storytelling
00:41:51.760 | that has nothing to do with errors or with like,
00:41:56.760 | it uses errors as elements of its story
00:42:03.860 | as opposed to a fundamental need
00:42:05.740 | to be explainable when errors occur.
00:42:08.060 | It's just that for other intelligence systems
00:42:10.460 | to be in our world,
00:42:11.620 | we seem to want to tell each other stories.
00:42:14.460 | And that's true in the political world,
00:42:17.780 | that's true in the academic world.
00:42:19.720 | And that, you know,
00:42:21.780 | neural networks are less capable of doing that.
00:42:23.820 | Or perhaps they're equally capable
00:42:25.340 | of storytelling.
00:42:26.660 | Maybe it doesn't matter
00:42:28.420 | what the fundamentals of the system are.
00:42:30.260 | You just need to be a good storyteller.
00:42:32.700 | Maybe one specific story I can tell you about
00:42:35.620 | in that space is actually about some work
00:42:38.120 | that was done by my former collaborator
00:42:40.500 | who's now a professor at MIT named Jacob Andreas.
00:42:43.320 | Jacob actually works in natural language processing,
00:42:45.780 | but he had this idea to do a little bit of work
00:42:47.660 | in reinforcement learning
00:42:49.140 | and how natural language can basically structure
00:42:52.740 | the internals of policies trained with RL.
00:42:55.580 | And one of the things he did is he set up a model
00:42:59.140 | that attempts to perform some tasks
00:43:01.420 | that's defined by a reward function,
00:43:03.740 | but the model reads in a natural language instruction.
00:43:06.500 | So this is a pretty common thing to do
00:43:07.820 | in instruction following.
00:43:08.820 | So you tell it like, you know, go to the red house
00:43:11.600 | and then it's supposed to go to the red house.
00:43:13.560 | But then one of the things that Jacob did
00:43:14.960 | is he treated that sentence not as a command from a person,
00:43:19.540 | but as a representation of the internal kind of state
00:43:23.540 | of the mind of this policy, essentially.
00:43:26.620 | So that when it was faced with a new task,
00:43:28.540 | what it would do is it would basically try to think
00:43:31.020 | of possible language descriptions,
00:43:33.540 | attempt to do them and see if they led to the right outcome.
00:43:35.580 | So it would kind of think out loud, like, you know,
00:43:37.580 | I'm faced with this new task, what am I gonna do?
00:43:39.380 | Let me go to the red house.
00:43:40.500 | Oh, that didn't work.
00:43:41.340 | Let me go to the blue room or something.
00:43:43.780 | Let me go to the green plant.
00:43:45.420 | And once it got some reward, it would say,
00:43:46.740 | oh, go to the green plant, that's what's working.
00:43:48.220 | I'm gonna go to the green plant.
00:43:49.500 | And then you could look at the string that it came up with,
00:43:51.100 | and that was a description of how it thought
00:43:52.720 | it should solve the problem.
00:43:54.380 | So you could basically incorporate language
00:43:57.320 | as internal state and you can start getting some handle
00:43:59.500 | on these kinds of things.
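A rough sketch of that search-over-descriptions loop, with a toy stand-in for the instruction-conditioned policy. The candidate strings, run_instruction_policy, and toy_env below are hypothetical illustrations, not the actual setup from the work described.

def run_instruction_policy(instruction, env):
    # Stand-in for running an instruction-conditioned policy in an environment
    # and returning the total reward it obtained.
    return env.get(instruction, 0.0)

candidate_descriptions = [
    "go to the red house",
    "go to the blue room",
    "go to the green plant",
]
toy_env = {"go to the green plant": 1.0}   # only one description earns reward here

best = max(candidate_descriptions,
           key=lambda s: run_instruction_policy(s, toy_env))
print(best)   # the chosen string doubles as a description of the strategy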
00:44:00.880 | - And then what I was kind of trying to get to is that
00:44:04.000 | also if you add to the reward function,
00:44:07.000 | the convincingness of that story.
00:44:10.060 | So I have another reward signal of like,
00:44:12.300 | people who review that story, how much they like it.
00:44:16.620 | So that, you know, initially that could be a hyperparameter,
00:44:21.420 | sort of hard-coded heuristic type of thing,
00:44:23.420 | but it's an interesting notion of the convincingness
00:44:28.580 | of the story becoming part of the reward function,
00:44:31.800 | the objective function of the explainability.
00:44:33.960 | It's in the world of sort of Twitter and fake news,
00:44:37.500 | that might be a scary notion that the nature of truth
00:44:42.000 | may not be as important as
00:44:45.020 | how convincing you are in telling the story around the facts.
00:44:50.020 | Well, let me ask the basic question.
00:44:55.220 | You're one of the world-class researchers
00:44:57.040 | in reinforcement learning, deep reinforcement learning,
00:44:59.580 | certainly in the robotics space.
00:45:01.700 | What is reinforcement learning?
00:45:04.500 | - I think that what reinforcement learning refers to today
00:45:06.940 | is really just the kind of the modern incarnation
00:45:10.740 | of learning-based control.
00:45:12.980 | So classically reinforcement learning
00:45:14.360 | has a much more narrow definition,
00:45:15.700 | which is that it's literally learning from reinforcement,
00:45:18.980 | like the thing does something
00:45:20.200 | and then it gets a reward or punishment.
00:45:22.540 | But really I think the way the term is used today
00:45:24.340 | is it's used to refer more broadly to learning-based control.
00:45:28.140 | So some kind of system
00:45:29.220 | that's supposed to be controlling something
00:45:31.820 | and it uses data to get better.
00:45:34.660 | - And what does control mean?
00:45:35.820 | So is action is the fundamental element there?
00:45:38.460 | - It means making rational decisions.
00:45:40.780 | And rational decisions are decisions
00:45:42.460 | that maximize a measure of utility.
00:45:44.300 | - And sequentially, so you make decisions
00:45:46.620 | time and time and time again.
00:45:48.240 | Now, like, it's easier to see that kind of idea
00:45:52.300 | in the space of maybe games, in the space of robotics.
00:45:56.420 | Do you see it bigger than that?
00:45:58.820 | Is it applicable?
00:46:00.220 | Like, where are the limits of the applicability
00:46:02.940 | of reinforcement learning?
00:46:04.500 | - Yeah, so rational decision-making
00:46:07.380 | is essentially the encapsulation of the AI problem
00:46:11.380 | viewed through a particular lens.
00:46:12.980 | So any problem that we would want a machine to do,
00:46:16.420 | an intelligent machine,
00:46:18.220 | can likely be represented as a decision-making problem.
00:46:20.460 | Classifying images is a decision-making problem,
00:46:23.060 | although not a sequential one typically.
00:46:25.060 | Controlling a chemical plant is a decision-making problem.
00:46:30.260 | Deciding what videos to recommend on YouTube
00:46:32.580 | is a decision-making problem.
00:46:34.460 | And one of the really appealing things
00:46:35.820 | about reinforcement learning is,
00:46:38.220 | if it does encapsulate the range
00:46:40.300 | of all of these decision-making problems,
00:46:41.740 | perhaps working on reinforcement learning
00:46:43.820 | is one of the ways to reach a very broad swath
00:46:47.580 | of AI problems.
00:46:50.260 | - But what is the fundamental difference
00:46:52.740 | between reinforcement learning
00:46:54.260 | and maybe supervised machine learning?
00:46:56.620 | - So reinforcement learning can be viewed
00:47:00.220 | as a generalization of supervised machine learning.
00:47:02.740 | You can certainly cast supervised learning
00:47:04.380 | as a reinforcement learning problem.
00:47:05.620 | You can just say your loss function
00:47:06.780 | is the negative of your reward,
00:47:08.980 | but you have stronger assumptions.
00:47:10.140 | You have the assumption that someone actually told you
00:47:12.340 | what the correct answer was,
00:47:14.420 | that your data was IID and so on.
00:47:15.940 | So you could view reinforcement learning
00:47:18.220 | as essentially relaxing some of those assumptions.
00:47:20.340 | Now that's not always a very productive way to look at it,
00:47:22.100 | because if you actually have a supervised learning problem,
00:47:24.300 | you'll probably solve it much more effectively
00:47:25.980 | by using supervised learning methods, because it's easier.
00:47:29.380 | But you can view reinforcement learning
00:47:31.460 | as a generalization of that.
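The reduction being described fits in a couple of lines; the snippet below is only a toy illustration of the relationship between loss and reward, not a practical training recipe.

def loss(prediction, y_true):
    return (prediction - y_true) ** 2      # ordinary supervised loss

def reward(prediction, y_true):
    return -loss(prediction, y_true)       # the RL view: reward = negative loss

# In supervised learning we are told y_true for every input (and the data is IID);
# in RL we would only see the scalar reward for the action we happened to take.
print(reward(prediction=0.8, y_true=1.0))  # roughly -0.04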
00:47:32.380 | - No, for sure.
00:47:33.220 | But they're fundamentally different.
00:47:35.780 | That's a mathematical statement
00:47:37.140 | that's absolutely correct.
00:47:38.540 | But it seems that reinforcement learning,
00:47:41.580 | the kind of tools we bring to the table today,
00:47:43.700 | so maybe down the line,
00:47:46.500 | everything will be a reinforcement learning problem,
00:47:48.860 | just like you said, image classification should be mapped
00:47:52.340 | to a reinforcement learning problem.
00:47:53.660 | But today, the tools and ideas,
00:47:56.300 | the way we think about them are different.
00:47:58.820 | Sort of supervised learning has been used very effectively
00:48:02.780 | to solve basic, narrow AI problems.
00:48:06.540 | Reinforcement learning kind of represents the dream of AI.
00:48:11.460 | It's very much so in the research space now,
00:48:15.220 | in sort of captivating the imagination of people,
00:48:17.780 | of what we can do with intelligent systems,
00:48:19.980 | but it hasn't yet had as wide of an impact
00:48:23.540 | as the supervised learning approaches.
00:48:25.380 | So my question comes from the more practical sense.
00:48:29.020 | Like, what do you see is the gap
00:48:31.900 | between the more general reinforcement learning
00:48:34.540 | and the very specific, yes, it's sequential decision-making
00:48:38.780 | with one step in the sequence of the supervised learning?
00:48:43.060 | - So from a practical standpoint,
00:48:44.500 | I think that one thing that is potentially
00:48:48.540 | a little tough now, and this is, I think,
00:48:49.980 | something that we'll see, this is a gap
00:48:52.060 | that we might see closing over the next couple of years,
00:48:54.700 | is the ability of reinforcement learning algorithms
00:48:57.100 | to effectively utilize large amounts of prior data.
00:49:00.420 | So one of the reasons why it's a bit difficult today
00:49:03.300 | to use reinforcement learning for all the things
00:49:05.700 | that we might wanna use it for is that
00:49:07.900 | in most of the settings where we wanna do
00:49:10.220 | rational decision-making, it's a little bit tough
00:49:13.060 | to just deploy some policy that does crazy stuff
00:49:16.820 | and learns purely through trial and error.
00:49:18.740 | It's much easier to collect a lot of data,
00:49:21.100 | a lot of logs of some other policy that you've got,
00:49:23.980 | and then maybe if you can get a good policy out of that,
00:49:27.620 | then you deploy it and let it kind of fine tune
00:49:29.140 | a little bit.
00:49:30.500 | But algorithmically, it's quite difficult to do that.
00:49:33.340 | So I think that once we figure out how to get
00:49:36.180 | reinforcement learning to bootstrap effectively
00:49:37.940 | from large data sets, then we'll see
00:49:40.780 | a very, very rapid growth in applications
00:49:44.020 | of these technologies.
00:49:44.860 | So this is what's referred to as
00:49:45.820 | off-policy reinforcement learning,
00:49:47.340 | or offline RL, or batch RL.
00:49:49.820 | And I think we're seeing a lot of research right now
00:49:52.260 | that's bringing us closer and closer to that.
00:49:54.580 | - Can you maybe paint the picture of the different methods?
00:49:57.260 | So you said off-policy, what's value-based
00:50:01.100 | reinforcement learning?
00:50:01.940 | What's policy-based?
00:50:02.820 | What's model-based?
00:50:03.660 | What's off-policy, on-policy?
00:50:05.220 | What are the different categories of reinforcement learning?
00:50:08.060 | - So one way we can think about reinforcement learning
00:50:10.740 | is that it's, in some very fundamental way,
00:50:15.060 | it's about learning models that can answer
00:50:18.540 | kind of what-if questions.
00:50:20.100 | So what would happen if I take this action
00:50:22.340 | that I hadn't taken before?
00:50:23.940 | And you do that, of course, from experience, from data.
00:50:26.700 | And oftentimes, you do it in a loop.
00:50:28.300 | So you build a model that answers these what-if questions,
00:50:31.860 | use it to figure out the best action you can take,
00:50:33.860 | and then go and try taking that
00:50:35.220 | and see if the outcome agrees with what you predicted.
00:50:38.820 | So the different kinds of techniques
00:50:41.700 | basically refer to different ways of doing it.
00:50:43.300 | So model-based methods answer a question
00:50:45.580 | of what state you would get,
00:50:48.060 | basically what would happen to the world
00:50:49.740 | if you were to take a certain action.
00:50:50.820 | Value-based methods, they answer the question
00:50:53.740 | of what value you would get,
00:50:54.780 | meaning what utility you would get.
00:50:57.060 | But in a sense, they're not really all that different
00:50:59.020 | because they're both really just answering
00:51:01.460 | these what-if questions.
00:51:03.340 | Now, unfortunately for us,
00:51:05.020 | with current machine learning methods,
00:51:06.340 | answering what-if questions can be really hard
00:51:08.380 | because they are really questions
00:51:10.420 | about things that didn't happen.
00:51:12.500 | If you wanted to answer what-if questions
00:51:13.780 | about things that did happen,
00:51:14.780 | you wouldn't need a learned model.
00:51:15.700 | You would just repeat the thing that worked before.
00:51:18.900 | And that's really a big part of why RL is a little bit tough.
00:51:23.340 | So if you have a purely on-policy online process,
00:51:27.900 | then you ask these what-if questions,
00:51:29.780 | you make some mistakes,
00:51:31.060 | then you go and try doing those mistaken things,
00:51:33.460 | and then you observe the counter examples
00:51:35.460 | that'll teach you not to do those things again.
00:51:37.740 | If you have a bunch of off-policy data
00:51:39.940 | and you just want to synthesize the best policy you can
00:51:42.620 | out of that data,
00:51:43.740 | then you really have to deal with the challenges
00:51:46.500 | of making these counterfactual inferences.
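A toy illustration of the two flavors of what-if question mentioned here, with hypothetical stand-ins for a learned dynamics model and a learned Q-function; nothing below comes from a real system.

def model(state, action):
    # Model-based "what if": predict the next state for this action.
    return state + action

def q_value(state, action):
    # Value-based "what if": predict the utility of this action directly.
    goal = 10.0
    return -abs((state + action) - goal)

state = 3.0
for action in (-1.0, 0.0, 1.0):
    print(action, model(state, action), q_value(state, action))
# Picking the action with the highest q_value answers "what if I did that?"
# without having to actually try every option in the real world.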
00:51:48.140 | - First of all, what's a policy?
00:51:49.900 | - Yeah, a policy is a model or some kind of function
00:51:54.900 | that maps from observations of the world to actions.
00:51:59.860 | So in reinforcement learning,
00:52:01.540 | we often refer to the current configuration
00:52:05.100 | of the world as the state.
00:52:06.300 | So we say the state kind of encompasses everything
00:52:08.020 | you need to fully define where the world is at at the moment.
00:52:11.100 | And depending on how we formulate the problem,
00:52:13.660 | we might say you either get to see the state
00:52:15.340 | or you get to see an observation,
00:52:16.940 | which is some snapshot, some piece of the state.
00:52:19.780 | - So policy just includes everything in it
00:52:23.660 | in order to be able to act in this world.
00:52:25.820 | - Yes.
00:52:26.660 | - And so what does off-policy mean?
00:52:29.020 | - So yeah, so the terms on-policy and off-policy
00:52:31.620 | refer to how you get your data.
00:52:33.460 | So if you get your data from somebody else
00:52:36.020 | who was doing some other stuff,
00:52:37.220 | maybe you get your data from some manually programmed system
00:52:41.620 | that was just running in the world before,
00:52:44.540 | that's referred to as off-policy data.
00:52:46.540 | But if you got the data by actually acting in the world
00:52:48.980 | based on what your current policy thinks is good,
00:52:51.340 | we call that on-policy data.
00:52:53.260 | And obviously on-policy data is more useful to you
00:52:55.780 | because if your current policy makes some bad decisions,
00:52:59.300 | you will actually see that those decisions are bad.
00:53:01.740 | Off-policy data, however, might be much easier to obtain
00:53:03.980 | because maybe that's all the log data
00:53:06.420 | that you have from before.
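As a small sketch of the distinction, assuming a toy one-dimensional environment and hypothetical policies: the only difference between the two datasets below is which policy chose the actions.

import random

def current_policy(observation):
    # The policy we are trying to improve.
    return 1 if observation > 0 else -1

def logged_policy(observation):
    # Some other controller (e.g. an older, manually programmed system).
    return random.choice([-1, 1])

def step(observation, action):
    # Toy environment: the action nudges the observation; reward prefers zero.
    next_obs = observation + 0.5 * action
    return next_obs, -abs(next_obs)

def collect(policy, n=100):
    data, obs = [], 0.0
    for _ in range(n):
        act = policy(obs)
        next_obs, rew = step(obs, act)
        data.append((obs, act, rew, next_obs))
        obs = next_obs
    return data

on_policy_data = collect(current_policy)   # gathered by acting ourselves
off_policy_data = collect(logged_policy)   # logs of somebody else's behavior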
00:53:08.580 | - So we talked offline about autonomous vehicles,
00:53:12.940 | so you can envision off-policy kind of approaches
00:53:15.660 | in robotic spaces where there's already
00:53:18.420 | a ton of robots out there, but they don't get the luxury
00:53:20.900 | of being able to explore based on
00:53:24.260 | a reinforcement learning framework.
00:53:26.140 | So how do we make, again, open question,
00:53:29.180 | but how do we make off-policy methods work?
00:53:32.300 | - Yeah, so this is something that has been
00:53:35.140 | kind of a big open problem for a while.
00:53:36.940 | And in the last few years,
00:53:38.420 | people have made a little bit of progress on that.
00:53:41.740 | You know, I can tell you about,
00:53:42.900 | and it's not by any means solved yet,
00:53:44.260 | but I can tell you some of the things that, for example,
00:53:46.380 | we've done to try to address some of the challenges.
00:53:49.620 | It turns out that one really big challenge
00:53:51.620 | with off-policy reinforcement learning
00:53:53.580 | is that you can't really trust your models
00:53:57.100 | to give accurate predictions for any possible action.
00:54:00.140 | So if I've never tried to, if in my data set,
00:54:03.420 | I never saw somebody steering the car
00:54:05.740 | off the road onto the sidewalk,
00:54:07.980 | my value function or my model
00:54:10.100 | is probably not going to predict the right thing
00:54:11.860 | if I ask what would happen if I were to steer the car
00:54:13.940 | off the road onto the sidewalk.
00:54:15.540 | So one of the important things you have to do
00:54:18.100 | to get off-policy RL to work
00:54:20.300 | is you have to be able to figure out
00:54:21.620 | whether a given action will result
00:54:23.500 | in a trustworthy prediction or not.
00:54:25.260 | And you can use kind of distribution estimation methods,
00:54:29.300 | kind of density estimation methods
00:54:31.260 | to try to figure that out.
00:54:32.180 | So you could figure out that, well, this action,
00:54:33.920 | my model is telling me that it's great,
00:54:35.580 | but it looks totally different
00:54:36.820 | from any action I've taken before,
00:54:37.940 | so my model is probably not correct.
00:54:40.020 | And you can incorporate regularization terms
00:54:43.060 | into your learning objective
00:54:44.180 | that will essentially tell you not to ask those questions
00:54:48.260 | that your model is unable to answer.
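One way to picture this, as a rough sketch rather than any specific published method: fit a crude density model to the actions in the dataset and penalize the learned value estimate wherever that density is low, so the policy never gets credit for actions the data cannot vouch for. The Gaussian density, the stand-in Q-function, and the penalty weight below are all illustrative assumptions.

import numpy as np

dataset_actions = np.random.default_rng(0).normal(0.0, 0.3, size=1000)
mu, sigma = dataset_actions.mean(), dataset_actions.std()

def log_density(action):
    # Crude Gaussian density estimate of the actions seen in the dataset.
    return -0.5 * ((action - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def q_estimate(action):
    # Stand-in for a learned Q-function that is over-optimistic far from the data.
    return 2.0 * action

def conservative_q(action, penalty_weight=5.0):
    # Don't trust the value estimate where the data has nothing to say.
    return q_estimate(action) + penalty_weight * min(log_density(action), 0.0)

for a in (0.1, 1.0, 3.0):
    print(a, q_estimate(a), conservative_q(a))
# The raw estimate prefers the wild action a=3.0; the penalized one does not.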
00:54:50.620 | - What would lead to breakthroughs in this space,
00:54:53.260 | do you think?
00:54:54.100 | Like what's needed?
00:54:55.380 | Is this a data set question?
00:54:57.580 | Do we need to collect big benchmark data sets
00:55:01.220 | that allow us to explore the space?
00:55:03.580 | Is it a new kinds of methodologies?
00:55:07.580 | Like what's your sense?
00:55:09.620 | Or maybe coming together in a space of robotics
00:55:12.260 | and defining the right problem to be working on.
00:55:15.100 | - I think for off-policy reinforcement learning
00:55:16.700 | in particular, it's very much
00:55:17.720 | an algorithms question right now.
00:55:19.460 | And this is something that I think is great
00:55:22.620 | because an algorithms question
00:55:24.860 | just takes some very smart people to get together
00:55:27.620 | and think about it really hard.
00:55:28.980 | Whereas if it was like a data problem or hardware problem,
00:55:32.780 | that would take some serious engineering.
00:55:34.620 | So that's why I'm pretty excited about that problem
00:55:37.060 | because I think that we're in a position
00:55:38.380 | where we can make some real progress on it
00:55:40.100 | just by coming up with the right algorithms.
00:55:42.100 | In terms of which algorithms they could be,
00:55:44.940 | the problems at their core are very related to problems
00:55:48.820 | in things like causal inference, right?
00:55:51.420 | Because what you're really dealing with is situations
00:55:53.740 | where you have a model, a statistical model,
00:55:56.340 | that's trying to make predictions
00:55:57.860 | about things that it hadn't seen before.
00:56:00.180 | And if it's a model that's generalizing properly,
00:56:03.220 | that'll make good predictions.
00:56:04.660 | If it's a model that picks up on spurious correlations
00:56:07.060 | that will not generalize properly.
00:56:08.820 | And then you have an arsenal of tools you could use.
00:56:11.020 | You could, for example, figure out
00:56:12.620 | what are the regions where it's trustworthy,
00:56:14.460 | or on the other hand,
00:56:15.580 | you could try to make it generalize better somehow,
00:56:17.580 | or some combination of the two.
00:56:20.700 | - Is there room for mixing, sort of,
00:56:24.940 | where most of it, like 90, 95% is off policy,
00:56:29.620 | you already have the dataset,
00:56:31.180 | and then you get to send the robot out
00:56:34.140 | to do a little exploration?
00:56:35.580 | Like, what's that role of mixing them together?
00:56:39.140 | - Yeah, absolutely.
00:56:39.980 | I think that this is something that you actually
00:56:43.140 | described very well at the beginning of our discussion
00:56:45.780 | when you talked about the iceberg.
00:56:47.340 | Like, this is the iceberg.
00:56:48.460 | The 99% of your prior experience,
00:56:50.420 | that's your iceberg.
00:56:51.580 | You'd use that for off-policy reinforcement learning.
00:56:54.020 | And then, of course, if you've never, you know,
00:56:56.860 | opened that particular kind of door
00:56:58.620 | with that particular lock before,
00:57:00.300 | then you have to go out and fiddle with it a little bit,
00:57:02.060 | and that's that additional 1%
00:57:03.740 | to help you figure out a new task.
00:57:05.140 | And I think that's actually, like,
00:57:06.100 | a pretty good recipe going forward.
00:57:08.140 | - Is this, to you, the most exciting space
00:57:11.380 | of reinforcement learning now?
00:57:12.700 | Or is there, what's, and maybe taking a step back,
00:57:16.460 | not just now, but what's, to you,
00:57:18.060 | is the most beautiful idea?
00:57:20.100 | Apologize for the romanticized question,
00:57:22.020 | but the beautiful idea or concept
00:57:24.300 | in reinforcement learning?
00:57:25.620 | - In general, I actually think that one of the things
00:57:30.660 | that is a very beautiful idea in reinforcement learning
00:57:32.980 | is just the idea that you can obtain a near-optimal control
00:57:37.980 | or a near-optimal policy without actually having
00:57:43.180 | a complete model of the world.
00:57:45.420 | This is, you know, it's something that feels
00:57:49.660 | perhaps kind of obvious if you just hear
00:57:53.020 | the term reinforcement learning
00:57:54.020 | or you think about trial and error learning,
00:57:55.700 | but from a control's perspective, it's a very weird thing
00:57:58.180 | because classically, you know, we think about
00:58:03.020 | engineered systems and controlling engineered systems
00:58:05.660 | as the problem of writing down some equations
00:58:08.460 | and then figuring out, given these equations,
00:58:10.420 | you know, basically like solve for X,
00:58:11.780 | figure out the thing that maximizes its performance.
00:58:15.220 | And the theory of reinforcement learning
00:58:18.900 | actually gives us a mathematically principled framework
00:58:21.340 | to reason about, you know, optimizing some quantity
00:58:25.540 | when you don't actually know the equations
00:58:27.660 | that govern that system.
00:58:28.740 | And that, I don't know, to me, that actually seems
00:58:31.860 | kind of, you know, very elegant,
00:58:34.020 | not something that sort of becomes immediately obvious,
00:58:38.700 | at least in the mathematical sense.
00:58:40.060 | - Does it make sense to you that it works at all?
00:58:42.420 | - Well, I think it makes sense when you take some time
00:58:46.700 | to think about it, but it is a little surprising.
00:58:49.140 | - Well, then taking a step into the more
00:58:53.060 | deeper representations, which is also very surprising,
00:58:56.740 | of sort of the richness of the state space,
00:59:01.740 | the space of environments that this kind of approach
00:59:05.200 | can operate in, can you maybe say
00:59:07.380 | what is deep reinforcement learning?
00:59:10.220 | - Well, deep reinforcement learning simply refers
00:59:13.960 | to taking reinforcement learning algorithms
00:59:16.180 | and combining them with high capacity
00:59:18.340 | neural net representations, which is, you know,
00:59:21.460 | kind of, it might at first seem like a pretty arbitrary
00:59:23.700 | thing, just take these two components
00:59:24.900 | and stick them together.
00:59:26.340 | But the reason that it's something that has become
00:59:29.900 | so important in recent years is that reinforcement learning,
00:59:35.100 | it kind of faces an exacerbated version of a problem
00:59:38.020 | that has faced many other machine learning techniques.
00:59:39.980 | So if we go back to like, you know, the early 2000s
00:59:44.020 | or the late 90s, we'll see a lot of research
00:59:46.740 | on machine learning methods that have some very appealing
00:59:50.340 | mathematical properties, like they reduce
00:59:52.380 | to convex optimization problems, for instance,
00:59:54.780 | but they require very special inputs.
00:59:57.140 | They require a representation of the input
00:59:59.620 | that is clean in some way, like for example,
01:00:02.540 | clean in the sense that the classes
01:00:05.060 | in your multi-class classification problems
01:00:06.620 | separate linearly.
01:00:07.580 | So they have some kind of good representation
01:00:10.020 | and we call this a feature representation.
01:00:12.420 | And for a long time, people were very worried
01:00:14.060 | about features in the world of supervised learning
01:00:15.820 | because somebody had to actually build those features.
01:00:18.140 | So you couldn't just take an image and plug it
01:00:19.940 | into your logistic regression or your SVM or something.
01:00:22.740 | Someone had to take that image and process it
01:00:24.740 | using some handwritten code.
01:00:26.700 | And then neural nets came along
01:00:27.980 | and they could actually learn the features.
01:00:29.700 | And suddenly we could apply learning directly
01:00:32.140 | to the raw inputs, which was great for images,
01:00:34.780 | but it was even more great for all the other fields
01:00:37.540 | where people hadn't come up with good features yet.
01:00:39.860 | And one of those fields is actually reinforcement learning
01:00:41.780 | because in reinforcement learning,
01:00:43.300 | the notion of features, if you don't use neural nets
01:00:45.340 | and you have to design your own features,
01:00:46.860 | is very, very opaque.
01:00:48.340 | Like it's very hard to imagine,
01:00:51.100 | let's say I'm playing chess or go,
01:00:53.740 | what is a feature with which I can represent
01:00:56.020 | the value function for go
01:00:57.580 | or even the optimal policy for go linearly?
01:01:00.780 | Like, I don't even know how to start thinking about it.
01:01:02.980 | And people tried all sorts of things.
01:01:04.300 | They would write down, you know,
01:01:05.380 | an expert chess player looks for whether the knight
01:01:07.940 | is in the middle of the board or not.
01:01:09.140 | So that's a feature, is knight in middle of board?
01:01:11.660 | And they would write these like long lists
01:01:13.220 | of kind of arbitrary made up stuff.
01:01:15.820 | And that was really kind of getting us nowhere.
01:01:17.420 | - And that's a little, chess is a little more accessible
01:01:20.300 | than the robotics problem.
01:01:21.780 | - Absolutely.
01:01:22.620 | - Right, there's at least experts
01:01:24.540 | in the different features for chess.
01:01:27.900 | But still like the neural network there,
01:01:31.740 | to me that's, I mean, you put it eloquently
01:01:34.580 | and almost made it seem like a natural step
01:01:36.700 | to add neural networks.
01:01:38.180 | But the fact that neural networks are able
01:01:41.020 | to discover features in the control problem,
01:01:44.340 | it's very interesting, it's hopeful.
01:01:47.020 | I'm not sure what to think about it,
01:01:48.260 | but it feels hopeful that the control problem
01:01:51.580 | has features to be learned.
01:01:54.540 | - Like, I guess my question is,
01:01:57.700 | is it surprising to you how far the deep side
01:02:01.620 | of deep reinforcement learning was able to,
01:02:03.220 | like what the space of problems has been able to tackle
01:02:06.620 | from, especially in games with the AlphaStar
01:02:10.940 | and AlphaZero and just the representation power there
01:02:15.940 | and in the robotics space.
01:02:18.860 | And what is your sense of the limits
01:02:21.780 | of this representation power and the control context?
01:02:26.180 | - I think that in regard to the limits here,
01:02:30.100 | one thing that makes it a little hard
01:02:33.660 | to fully answer this question is that in settings
01:02:38.660 | where we would like to push these things to the limit,
01:02:41.940 | we encounter other bottlenecks.
01:02:43.940 | So like the reason that I can't get my robot
01:02:48.380 | to learn how to like, I don't know,
01:02:51.260 | do the dishes in the kitchen,
01:02:53.580 | it's not because its neural net is not big enough.
01:02:56.040 | It's because when you try to actually do trial
01:02:59.700 | and error learning, reinforcement learning directly
01:03:03.140 | in the real world, where you have the potential
01:03:05.120 | to gather these large, highly varied and complex datasets,
01:03:09.880 | you start running into other problems.
01:03:11.620 | Like one problem you run into very quickly,
01:03:13.780 | it'll first sound like a very pragmatic problem,
01:03:16.860 | but it actually turns out to be a pretty deep scientific
01:03:18.540 | problem: take the robot, put it in your kitchen,
01:03:20.820 | have it try to learn to do the dishes with trial and error,
01:03:22.980 | it'll break all your dishes
01:03:24.500 | and then we'll have no more dishes to clean.
01:03:27.060 | Now you might think this is a very practical issue,
01:03:28.940 | but there's something to this,
01:03:30.020 | which is that if you have a person trying to do this,
01:03:32.300 | a person will have some degree of common sense,
01:03:34.120 | they'll break one dish,
01:03:35.180 | they'll be a little more careful with the next one.
01:03:37.020 | And if they break all of them,
01:03:38.060 | they're gonna go and get more or something like that.
01:03:41.020 | So there's all sorts of scaffolding
01:03:42.900 | that comes very naturally to us for our learning process,
01:03:46.720 | like if I have to learn something through trial and error,
01:03:49.780 | I have the common sense to know that I have to try multiple
01:03:52.660 | times, if I screw something up, I ask for help,
01:03:55.100 | or I reset things or something like that.
01:03:57.360 | And all of that is kind of outside of the classic
01:03:59.620 | reinforcement learning problem formulation.
01:04:02.020 | There are other things that can also be categorized
01:04:05.060 | as kind of scaffolding, but are very important.
01:04:07.300 | Like for example, where do you get your reward function?
01:04:09.460 | If I wanna learn how to pour a cup of water,
01:04:13.460 | well, how do I know if I've done it correctly?
01:04:15.300 | Now that probably requires an entire computer vision system
01:04:17.600 | to be built just to determine that.
01:04:19.360 | And that seems a little bit inelegant.
01:04:21.160 | So there are all sorts of things like this
01:04:22.960 | that start to come up when we think through
01:04:24.560 | what we really need to get reinforcement learning
01:04:26.440 | to happen at scale in the real world.
01:04:28.360 | And I think that many of these things actually suggest
01:04:30.920 | a little bit of a shortcoming in the problem formulation
01:04:33.440 | and a few deeper questions that we have to resolve.
01:04:36.160 | - That's really interesting.
01:04:37.000 | I talked to like David Silver about AlphaZero,
01:04:41.480 | and it seems like, again,
01:04:44.400 | we haven't hit the limit at all
01:04:47.820 | in the context where there's no broken dishes.
01:04:50.060 | So in the case of Go, you can,
01:04:53.000 | it's really about just scaling compute.
01:04:54.940 | So again, like the bottleneck is the amount of money
01:04:59.120 | you're willing to invest in compute,
01:05:00.840 | and then maybe the scaffolding around
01:05:04.400 | how difficult it is to scale compute.
01:05:07.240 | But there, there's no limit.
01:05:08.780 | And it's interesting.
01:05:09.980 | Now we move to the real world and there's the broken dishes,
01:05:12.540 | there's all that, and the reward function, like you mentioned.
01:05:16.380 | That's really nice.
01:05:17.220 | So what, how do we push forward there?
01:05:19.860 | Do you think, there's this kind of sample efficiency
01:05:23.460 | question that people bring up of, you know,
01:05:27.020 | not having to break 100,000 dishes.
01:05:30.660 | Is this an algorithm question?
01:05:32.920 | Is this a data selection like question?
01:05:37.100 | What do you think?
01:05:38.100 | How do we, how do we not break too many dishes?
01:05:41.180 | - Yeah, well, one way we can think about that is that
01:05:44.780 | maybe we need to be better at reusing our data,
01:05:52.220 | building that iceberg.
01:05:53.900 | So perhaps it's too much to hope that
01:05:57.340 | you can have a machine that, in isolation,
01:06:02.540 | in a vacuum without anything else,
01:06:04.360 | can just master complex tasks in minutes,
01:06:07.240 | the way that people do.
01:06:08.580 | But perhaps it also doesn't have to.
01:06:09.780 | Perhaps what it really needs to do is have an existence,
01:06:12.560 | a lifetime where it does many things
01:06:15.260 | and the previous things that it has done,
01:06:17.020 | prepare it to do new things more efficiently.
01:06:20.040 | And, you know, the study of these kinds of questions
01:06:22.900 | typically falls under categories like multitask learning
01:06:25.580 | or meta learning, but they all fundamentally deal
01:06:28.260 | with the same general theme, which is use experience
01:06:32.580 | for doing other things to learn to do new things
01:06:35.660 | efficiently and quickly.
01:06:37.180 | - So what do you think about,
01:06:38.900 | if you just look at one particular case study
01:06:41.220 | of Tesla Autopilot, which is quickly approaching
01:06:44.820 | a million vehicles on the road,
01:06:47.460 | where some percentage of the time, 30, 40% of the time
01:06:50.780 | is driven using the computer vision,
01:06:53.620 | multitask HydraNet, right?
01:06:57.740 | That's what they call it, HydraNet.
01:06:59.660 | And then the other percent,
01:07:01.420 | the other percent is human controlled.
01:07:06.180 | From the human side, how can we use that data?
01:07:09.740 | What's your sense?
01:07:10.580 | So like, what's the signal?
01:07:14.100 | Do you have ideas in this autonomous vehicle space
01:07:16.060 | when people can lose their lives?
01:07:17.820 | You know, it's a safety critical environment.
01:07:21.340 | So how do we use that data?
01:07:23.860 | - So I think that actually the kind of problems
01:07:28.020 | that come up when we want systems that are reliable
01:07:33.020 | and that can kind of understand the limits
01:07:35.320 | of their capabilities,
01:07:36.660 | they're actually very similar to the kind of problems
01:07:38.220 | that come up when we're doing
01:07:39.700 | off-policy reinforcement learning.
01:07:41.100 | So as I mentioned before,
01:07:41.980 | in off-policy reinforcement learning,
01:07:43.700 | the big problem is you need to know
01:07:46.140 | when you can trust the predictions of your model,
01:07:48.280 | because if you're trying to evaluate
01:07:50.940 | some pattern of behavior for which your model
01:07:52.540 | doesn't give you an accurate prediction,
01:07:54.020 | then you shouldn't use that to modify your policy.
01:07:57.260 | And it's actually very similar to the problem
01:07:58.500 | that we're faced when we actually then deploy that thing
01:08:01.100 | and we want to decide whether we trust it
01:08:03.180 | in the moment or not.
01:08:05.060 | So perhaps we just need to do a better job
01:08:06.760 | of figuring out that part.
01:08:07.760 | And that's a very deep research question, of course,
01:08:10.160 | but it's also a question that a lot of people are working on.
01:08:11.700 | So I'm pretty optimistic that we can make some progress
01:08:13.500 | on that over the next few years.
01:08:15.760 | - What's the role of simulation in reinforcement learning,
01:08:18.880 | deep reinforcement learning, reinforcement learning?
01:08:21.080 | Like how essential is it?
01:08:22.920 | It's been essential for the breakthroughs so far,
01:08:26.680 | for some interesting breakthroughs.
01:08:28.120 | Do you think it's a crutch that we rely on?
01:08:32.000 | I mean, again, this connects to our off policy discussion,
01:08:35.220 | but do you think we can ever get rid of simulation
01:08:38.260 | or do you think simulation will actually take over?
01:08:40.060 | We'll create more and more realistic simulations
01:08:42.100 | that will allow us to solve actual real world problems,
01:08:46.080 | like transfer the models we learn in simulation
01:08:48.220 | to real world problems.
01:08:49.060 | - Yeah.
01:08:50.060 | I think that simulation is a very pragmatic tool
01:08:52.660 | that we can use to get a lot of useful stuff
01:08:54.740 | to work right now.
01:08:56.100 | But I think that in the long run,
01:08:57.660 | we will need to build machines that can learn
01:09:00.620 | from real data, because that's the only way
01:09:02.580 | that we'll get them to improve perpetually.
01:09:04.580 | Because if we can't have our machines learn from real data,
01:09:08.180 | if they have to rely on simulated data,
01:09:09.780 | eventually the simulator becomes the bottleneck.
01:09:12.420 | In fact, this is a general thing.
01:09:13.500 | If your machine has any bottleneck that is built by humans
01:09:17.940 | and that doesn't improve from data,
01:09:20.240 | it will eventually be the thing that holds it back.
01:09:23.100 | And if you're entirely reliant on your simulator,
01:09:25.100 | that'll be the bottleneck.
01:09:25.940 | If you're entirely reliant on a manually designed controller,
01:09:28.820 | that's gonna be the bottleneck.
01:09:30.400 | So simulation is very useful.
01:09:32.120 | It's very pragmatic, but it's not a substitute
01:09:35.260 | for being able to utilize real experience.
01:09:38.640 | And this is, by the way, this is something
01:09:41.220 | that I think is quite relevant now,
01:09:43.660 | especially in the context of some of the things
01:09:45.560 | we've discussed, because some of these kind
01:09:47.900 | of scaffolding issues that I mentioned,
01:09:49.260 | things like the broken dishes
01:09:50.660 | and the unknown reward function,
01:09:51.860 | like these are not problems that you would ever stumble on
01:09:54.820 | when working in a purely simulated kind of environment.
01:09:58.700 | But they become very apparent
01:09:59.740 | when we try to actually run these things in the real world.
01:10:03.220 | - To throw a brief wrench into our discussion, let me ask,
01:10:05.620 | do you think we're living in a simulation?
01:10:08.100 | - Oh, I have no idea.
01:10:09.900 | - Do you think that's a useful thing to even think about,
01:10:12.440 | about the fundamental physics nature of reality?
01:10:17.160 | Or another perspective, the reason I think
01:10:20.900 | the simulation hypothesis is interesting
01:10:23.300 | is to think about how difficult is it to create
01:10:29.580 | sort of a virtual reality game type situation
01:10:32.940 | that will be sufficiently convincing to us humans,
01:10:36.500 | or sufficiently enjoyable that we wouldn't wanna leave.
01:10:40.140 | I mean, that's actually a practical engineering challenge.
01:10:43.420 | And I personally really enjoy virtual reality,
01:10:46.220 | but it's quite far away, but I kind of think about
01:10:49.180 | what would it take for me to wanna spend more time
01:10:51.740 | in virtual reality versus the real world?
01:10:54.820 | And that's a sort of a nice, clean question,
01:10:58.640 | because at that point, we've reached,
01:11:02.260 | if I wanna live in a virtual reality,
01:11:04.740 | that means we're just a few years away
01:11:06.700 | from where a majority of the population lives in a virtual reality,
01:11:09.100 | and that's how we create the simulation, right?
01:11:11.380 | You don't need to actually simulate the quantum gravity
01:11:15.940 | and just every aspect of the universe.
01:11:19.740 | And that's an interesting question
01:11:21.460 | for reinforcement learning, too,
01:11:23.260 | is if we wanna make sufficiently realistic simulations
01:11:25.980 | that blend the difference between
01:11:29.400 | sort of the real world and the simulation,
01:11:31.920 | thereby some of the problems we've been talking about
01:11:36.180 | kind of go away,
01:11:37.680 | if we can create actually interesting, rich simulations.
01:11:40.800 | - It's an interesting question.
01:11:41.640 | And it actually, I think your question
01:11:43.640 | casts your previous question in a very interesting light,
01:11:46.840 | because in some ways, asking whether we can,
01:11:49.720 | well, the more kind of practical version of this,
01:11:53.760 | like, can we build simulators that are good enough
01:11:56.100 | to train essentially AI systems that will work in the world?
01:12:00.600 | And it's kind of interesting to think about this,
01:12:04.300 | about what this implies, if true,
01:12:06.300 | it kind of implies that it's easier to create the universe
01:12:08.540 | than it is to create a brain.
01:12:09.980 | And that seems like, put this way, it seems kind of weird.
01:12:14.300 | - The aspect of the simulation most interesting to me
01:12:17.540 | is the simulation of other humans.
01:12:20.860 | That seems to be a complexity
01:12:25.160 | that makes the robotics problem harder.
01:12:27.880 | Now, I don't know if every robotics person
01:12:30.240 | agrees with that notion, just as a quick aside,
01:12:33.600 | what are your thoughts about when the human enters
01:12:37.380 | the picture of the robotics problem?
01:12:39.800 | How does that change the reinforcement learning problem,
01:12:42.200 | the learning problem in general?
01:12:45.040 | - Yeah, I think that's a, it's a kind of a complex question.
01:12:48.560 | And I guess my hope for a while had been that
01:12:53.560 | if we build these robotic learning systems
01:12:56.880 | that are multitask, that utilize lots of prior data
01:13:01.020 | and that learn from their own experience,
01:13:03.120 | the bit where they have to interact with people
01:13:05.480 | will be perhaps handled in much the same way
01:13:07.600 | as all the other bits.
01:13:08.760 | So if they have prior experience of interacting with people
01:13:11.240 | and they can learn from their own experience
01:13:13.200 | of interacting with people for this new task,
01:13:15.040 | maybe that'll be enough.
01:13:17.280 | Now, of course, if it's not enough,
01:13:19.360 | there are many other things we can do.
01:13:20.520 | And there's quite a bit of research in that area.
01:13:22.800 | But I think it's worth a shot to see
01:13:24.560 | whether the multi-agent interaction,
01:13:28.540 | the ability to understand that other beings in the world
01:13:33.360 | have their own goals, intentions, and thoughts, and so on,
01:13:36.240 | whether that kind of understanding can emerge automatically
01:13:40.000 | from simply learning to do things with and maximize utility.
01:13:44.040 | - That information arises from the data.
01:13:46.420 | You've said something about gravity,
01:13:49.260 | that you don't need to explicitly inject anything
01:13:53.480 | into the system that can be learned from the data.
01:13:55.720 | And gravity is an example of something
01:13:57.440 | that could be learned from data,
01:13:58.920 | sort of like the physics of the world.
01:14:00.820 | What are the limits of what we can learn from data?
01:14:07.680 | So a very simple, clean way to ask that is,
01:14:13.340 | do you really think we can learn gravity from just data?
01:14:16.960 | The idea, the laws of gravity.
01:14:19.820 | - So something that I think is a common kind of pitfall
01:14:23.720 | when thinking about prior knowledge and learning
01:14:27.040 | is to assume that just because we know something,
01:14:32.040 | then that it's better to tell the machine about that
01:14:34.680 | rather than have it figure it out on its own.
01:14:36.760 | In many cases, things that are important
01:14:41.400 | that affect many of the events
01:14:43.620 | that the machine will experience
01:14:45.140 | are actually pretty easy to learn.
01:14:46.660 | Like, if every time you drop something,
01:14:49.340 | it falls down, then, yeah,
01:14:52.540 | you might get kind of the Newton's version,
01:14:54.220 | not Einstein's version, but it'll be pretty good.
01:14:56.820 | And it will probably be sufficient for you
01:14:58.820 | to act rationally in the world
01:15:00.820 | because you see the phenomenon all the time.
01:15:03.220 | So things that are readily apparent from the data,
01:15:06.140 | we might not need to specify those by hand.
01:15:07.900 | It might actually be easier
01:15:08.740 | to let the machine figure them out.
01:15:10.220 | - It just feels like there might be a space
01:15:12.440 | of many local minima in terms of theories of this world
01:15:17.440 | that we would discover and get stuck on.
01:15:20.760 | - Yeah, of course.
01:15:21.600 | - That Newtonian mechanics is not necessarily
01:15:25.760 | easy to come by.
01:15:27.600 | - Yeah, and well, in fact, in some fields of science,
01:15:31.160 | for example, human civilizations
01:15:32.600 | fell into these local optimums.
01:15:34.080 | So for example, if you think about how people
01:15:37.840 | tried to figure out biology and medicine,
01:15:40.420 | for the longest time, the kind of rules,
01:15:43.300 | the kind of principles that serve us very well
01:15:45.680 | in our day-to-day lives actually serve us very poorly
01:15:47.920 | in understanding medicine and biology.
01:15:50.120 | We had kind of very superstitious and weird ideas
01:15:53.740 | about how the body worked until the advent
01:15:55.680 | of the modern scientific method.
01:15:57.920 | So that does seem to be a failing of this approach,
01:16:01.000 | but it's also a failing of human intelligence, arguably.
01:16:04.000 | - Maybe a small aside, but some,
01:16:06.720 | the idea of self-play is fascinating
01:16:09.080 | in reinforcement learning, sort of these competitive,
01:16:11.440 | creating a competitive context in which agents
01:16:14.080 | can play against each other in a,
01:16:17.660 | sort of at the same skill level
01:16:19.040 | and thereby increasing each other's skill level.
01:16:21.040 | It seems to be this kind of self-improving mechanism
01:16:24.840 | is exceptionally powerful in the context
01:16:26.900 | where it could be applied.
01:16:28.760 | First of all, is that beautiful to you
01:16:32.080 | that this mechanism work as well as it does
01:16:34.560 | and also can be generalized to other contexts
01:16:38.800 | like in the robotic space or anything
01:16:41.720 | that's applicable to the real world?
01:16:43.840 | - I think that it's a very interesting idea,
01:16:47.560 | but I suspect that the bottleneck
01:16:50.440 | to actually generalizing it to the robotic setting
01:16:53.760 | is actually going to be the same
01:16:54.720 | as the bottleneck for everything else,
01:16:57.080 | that we need to be able to build machines
01:16:59.940 | that can get better and better
01:17:01.900 | through natural interaction with the world.
01:17:04.660 | And once we can do that,
01:17:05.780 | then they can go out and play with,
01:17:07.900 | they can play with each other, they can play with people,
01:17:09.560 | they can play with the natural environment.
01:17:11.860 | But before we get there,
01:17:14.100 | we've got all these other problems
01:17:15.260 | we have to get out of the way.
01:17:16.380 | - So there's no shortcut around that.
01:17:17.860 | You have to interact with the natural environment.
01:17:21.060 | - Well, because in a self-play setting,
01:17:22.980 | you still need a mediating mechanism.
01:17:24.580 | So the reason that self-play works for a board game
01:17:29.260 | is because the rules of that board game
01:17:31.280 | mediate the interaction between the agents.
01:17:33.660 | So the kind of intelligent behavior that will emerge
01:17:36.300 | depends very heavily on the nature
01:17:37.860 | of that mediating mechanism.
01:17:39.860 | - So on the side of reward functions,
01:17:42.100 | that's coming up with good reward functions
01:17:44.220 | seems to be the thing that we associate
01:17:46.200 | with general intelligence,
01:17:47.780 | like human beings seem to value the idea
01:17:51.420 | of developing our own reward functions
01:17:53.460 | of arriving at meaning and so on.
01:17:58.220 | And yet for reinforcement learning,
01:17:59.860 | we often kind of specify this, the given.
01:18:02.460 | What's your sense of how we develop
01:18:05.080 | good reward functions?
01:18:08.900 | - Yeah, I think that's a very complicated
01:18:11.260 | and very deep question.
01:18:12.120 | And you're completely right that classically
01:18:14.140 | in reinforcement learning,
01:18:15.540 | this question has kind of been treated as a non-issue
01:18:19.320 | that you sort of treat the reward as this external thing
01:18:22.400 | that comes from some other bit of your biology
01:18:26.420 | and you kind of don't worry about it.
01:18:28.400 | And I do think that that's actually,
01:18:30.140 | a little bit of a mistake that we should worry about it.
01:18:33.240 | And we can approach it in a few different ways.
01:18:34.860 | We can approach it, for instance,
01:18:36.860 | by thinking of reward as a communication medium.
01:18:39.020 | We can say, well, how does a person communicate
01:18:41.320 | to a robot what its objective is?
01:18:43.320 | You can approach it also as a sort of more
01:18:45.760 | of an intrinsic motivation medium.
01:18:47.720 | You could say, can we write down
01:18:50.380 | kind of a general objective that leads to good capability?
01:18:55.120 | Like, for example, can you write down some objectives
01:18:56.800 | such that even in the absence of any other task,
01:18:58.960 | if you maximize that objective,
01:19:00.200 | you'll sort of learn useful things.
01:19:02.640 | This is something that has sometimes been called
01:19:05.440 | unsupervised reinforcement learning,
01:19:07.020 | which I think is a really fascinating area of research,
01:19:09.960 | especially today.
01:19:11.520 | We've done a bit of work on that recently.
01:19:12.960 | One of the things we've studied is whether
01:19:14.840 | we can have some notion of unsupervised reinforcement
01:19:19.840 | learning by means of information theoretic quantities,
01:19:23.440 | like for instance, minimizing a Bayesian measure of surprise.
01:19:26.640 | This is an idea that was pioneered actually
01:19:29.120 | in the computational neuroscience community
01:19:30.640 | by folks like Carl Friston.
01:19:32.640 | And we've done some work recently that shows
01:19:34.360 | that you can actually learn pretty interesting skills
01:19:36.920 | by essentially behaving in a way that allows you
01:19:40.480 | to make accurate predictions about the world.
01:19:42.440 | It seems a little circular, like do the things
01:19:44.200 | that will lead to you getting the right answer
01:19:46.680 | for prediction.
01:19:48.740 | But you can, by doing this, you can sort of discover
01:19:51.800 | stable niches in the world.
01:19:53.040 | You can discover that if you're playing Tetris,
01:19:55.480 | then correctly clearing the rows will let you play Tetris
01:19:58.880 | for longer and keep the board nice and clean,
01:20:00.600 | which sort of satisfies some desire for order in the world.
01:20:04.040 | And as a result, get some degree of leverage
01:20:05.960 | over your domain.
01:20:07.320 | So we're exploring that pretty actively.
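A very rough sketch of the surprise-minimization idea, with a toy running-Gaussian "world model" standing in for a real predictive model. The environment, the update rule, and the random action choice are all placeholders; a real agent would choose its actions so as to keep future surprise low.

import numpy as np

class RunningGaussianModel:
    # A crude stand-in for a learned predictive model of observations.
    def __init__(self):
        self.mean, self.var, self.n = 0.0, 1.0, 0

    def log_prob(self, obs):
        return -0.5 * ((obs - self.mean) ** 2 / self.var + np.log(2 * np.pi * self.var))

    def update(self, obs):
        self.n += 1
        delta = obs - self.mean
        self.mean += delta / self.n
        self.var = 0.9 * self.var + 0.1 * delta ** 2

model = RunningGaussianModel()
rng = np.random.default_rng(0)
obs = 0.0
for t in range(100):
    action = rng.uniform(-1, 1)             # a real agent would pick actions
    obs = 0.8 * obs + action                # expected to keep surprise low
    intrinsic_reward = model.log_prob(obs)  # high when the world is predictable
    model.update(obs)
print(intrinsic_reward)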
01:20:08.760 | - Is there a role for a human notion of curiosity
01:20:12.560 | in itself being the reward, sort of discovering new things
01:20:16.600 | about the world?
01:20:18.480 | - So one of the things that I'm pretty interested in
01:20:21.400 | is actually whether discovering new things
01:20:25.520 | can actually be an emergent property
01:20:27.800 | of some other objective that quantifies capability.
01:20:30.640 | So new things for the sake of new things,
01:20:33.160 | maybe it might not by itself be the right answer,
01:20:37.280 | but perhaps we can figure out an objective
01:20:40.040 | for which discovering new things
01:20:41.880 | is actually the natural consequence.
01:20:44.360 | That's something we're working on right now,
01:20:45.840 | but I don't have a clear answer for you there yet.
01:20:47.680 | That's still a work in progress.
01:20:49.500 | - You mean just as a curious observation
01:20:52.000 | to see sort of creative patterns of curiosity
01:20:57.000 | on the way to optimize for a particular--
01:21:00.920 | - On the way to optimize
01:21:01.920 | for a particular measure of capability.
01:21:03.880 | - Is there ways to understand or anticipate unexpected,
01:21:09.800 | unintended consequences of particular reward functions,
01:21:16.800 | sort of anticipate the kind of strategies
01:21:20.960 | that might be developed
01:21:21.920 | and try to avoid highly detrimental strategies?
01:21:25.760 | - Yeah, so classically, this is something
01:21:28.640 | that has been pretty hard in reinforcement learning
01:21:30.360 | because it's difficult for a designer
01:21:33.320 | to have good intuition about
01:21:34.960 | what a learning algorithm will come up with
01:21:36.280 | when they give it some objective.
01:21:37.920 | There are ways to mitigate that.
01:21:40.200 | One way to mitigate it is to actually define an objective
01:21:43.400 | that says like, don't do weird stuff.
01:21:46.080 | You can actually quantify it.
01:21:46.920 | You can say just like, don't enter situations
01:21:49.360 | that have low probability under the distribution of states
01:21:52.800 | you've seen before.
01:21:54.640 | It turns out that that's actually one very good way
01:21:56.400 | to do off-policy reinforcement learning actually.
01:21:59.440 | So we can do some things like that.
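A small sketch of that "don't do weird stuff" term, using a deliberately crude count-based estimate of how familiar a state is; the counts, the penalty weight, and the floor value are made-up numbers for illustration only.

import math
from collections import Counter

seen_states = Counter({0: 50, 1: 30, 2: 15, 3: 5})   # visitation counts so far
total = sum(seen_states.values())

def shaped_reward(task_reward, state, weight=1.0, floor=-10.0):
    # Subtract a penalty for states that are unlikely under what we've seen before.
    count = seen_states.get(state, 0)
    log_prob = math.log(count / total) if count > 0 else floor
    return task_reward + weight * log_prob

print(shaped_reward(1.0, 0))   # familiar state: small penalty
print(shaped_reward(1.0, 7))   # never-seen state: large penalty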
01:22:01.200 | - If we slowly venture in speaking about reward functions
01:22:07.000 | into greater and greater levels of intelligence,
01:22:09.240 | there's, I mean, Stuart Russell thinks about this,
01:22:12.600 | the alignment of AI systems with us humans.
01:22:18.040 | So how do we ensure that AGI systems align with us humans?
01:22:23.040 | It's kind of a reward function question
01:22:27.120 | of specifying the behavior of AI systems
01:22:31.840 | such that their success aligns
01:22:34.680 | with the broader intended success interest of human beings.
01:22:39.680 | Do you have thoughts on this?
01:22:41.800 | Do you have kind of concerns
01:22:43.360 | of where reinforcement learning fits into this?
01:22:45.240 | Or are you really focused on the current moment
01:22:48.200 | of us being quite far away
01:22:49.600 | and trying to solve the robotics problem?
01:22:51.760 | - I don't have a great answer to this,
01:22:53.120 | but, you know, and I do think that this is a problem
01:22:56.800 | that's important to figure out.
01:22:59.320 | For my part, I'm actually a bit more concerned
01:23:01.800 | about the other side of this equation that, you know,
01:23:05.560 | maybe rather than unintended consequences
01:23:09.480 | for objectives that are specified too well,
01:23:12.560 | I'm actually more worried right now
01:23:14.080 | about unintended consequences for objectives
01:23:16.000 | that are not optimized well enough,
01:23:18.680 | which might become a very pressing problem
01:23:21.160 | when we, for instance, try to use these techniques
01:23:23.960 | for safety-critical systems like cars and aircraft and so on.
01:23:28.480 | I think at some point we'll face the issue
01:23:30.160 | of objectives being optimized too well,
01:23:31.840 | but right now I think we're more likely
01:23:34.480 | to face the issue of them not being optimized well enough.
01:23:37.000 | - But you don't think unintended consequences can arise
01:23:39.480 | even when you're far from optimality,
01:23:41.240 | sort of like on the path to it?
01:23:43.520 | - Oh no, I think unintended consequences
01:23:45.160 | can absolutely arise.
01:23:46.840 | It's just, I think right now the bottleneck
01:23:49.400 | for improving reliability, safety, and things like that
01:23:52.840 | is more with systems that like need to work better,
01:23:56.480 | that need to optimize their objective better.
01:23:58.920 | - Do you have thoughts, concerns about existential threats
01:24:03.120 | of human level intelligence?
01:24:04.720 | Sort of, if we put on our hat of looking in 10, 20, 100,
01:24:09.120 | 500 years from now, do you have concerns
01:24:11.720 | about existential threats of AI systems?
01:24:15.640 | - I think there are absolutely existential threats
01:24:17.720 | for AI systems, just like there are
01:24:19.160 | for any powerful technology.
01:24:20.640 | But I think that these kinds of problems
01:24:24.800 | can take many forms and some of those forms
01:24:28.720 | will come down to people with nefarious intent.
01:24:33.720 | Some of them will come down to AI systems
01:24:36.920 | that have some fatal flaws.
01:24:38.800 | And some of them will, of course, come down to AI systems
01:24:41.280 | that are too capable in some way.
01:24:43.000 | But among this set of potential concerns,
01:24:48.600 | I would actually be much more concerned
01:24:50.080 | about the first two right now
01:24:51.920 | than I am about the others,
01:24:53.760 | and principally the one with nefarious humans,
01:24:55.840 | because through all of human history,
01:24:57.120 | it's actually the nefarious humans that have been
01:24:59.840 | the problem, not the nefarious machines.
01:25:01.360 | And I think that right now the best that I can do
01:25:04.760 | to make sure things go well is to build
01:25:07.120 | the best technology I can and also hopefully
01:25:09.760 | to promote responsible use of that technology.
01:25:12.120 | - Do you think RL systems have something to teach us humans?
01:25:18.720 | You said nefarious humans getting us in trouble.
01:25:21.120 | I mean, machine learning systems have in some ways
01:25:23.840 | revealed to us the ethical flaws in our data.
01:25:28.200 | In that same kind of way, can reinforcement learning
01:25:30.720 | teach us about ourselves?
01:25:32.600 | Has it taught something?
01:25:34.400 | What have you learned about yourself
01:25:36.840 | from trying to build robots
01:25:39.200 | and reinforcement learning systems?
01:25:41.080 | - I'm not sure what I've learned about myself,
01:25:44.680 | but maybe part of the answer to your question
01:25:49.680 | might become a little bit more apparent
01:25:52.520 | once we see more widespread deployment
01:25:54.520 | of reinforcement learning for decision-making support
01:25:57.160 | in domains like healthcare, education,
01:26:01.520 | social media, et cetera.
01:26:03.360 | And I think we will see some interesting stuff emerge there.
01:26:06.680 | We will see, for instance, what kind of behaviors
01:26:09.320 | these systems come up with in situations
01:26:12.600 | where there is interaction with humans
01:26:14.240 | and where they have possibility
01:26:16.800 | of influencing human behavior.
01:26:18.960 | I think we're not quite there yet,
01:26:20.160 | but maybe in the next few years,
01:26:21.720 | we'll see some interesting stuff come out in that area.
01:26:23.840 | - I hope outside the research space,
01:26:25.360 | 'cause the exciting space where this could be observed
01:26:28.880 | is sort of large companies that deal with large data.
01:26:32.160 | And I hope there's some transparency.
01:26:34.520 | And one of the things that's unclear
01:26:36.720 | when I look at social networks and just online
01:26:39.400 | is why an algorithm did something
01:26:42.240 | or whether an algorithm was even involved.
01:26:45.080 | And that'd be interesting from a research perspective
01:26:48.160 | just to observe the results of algorithms
01:26:53.160 | to open up that data
01:26:55.480 | or to at least be sufficiently transparent
01:26:57.880 | about the behavior of these AI systems in the real world.
01:27:00.720 | What's your sense?
01:27:03.080 | I don't know if you looked at the blog post,
01:27:04.840 | The Bitter Lesson by Rich Sutton,
01:27:07.680 | where it looks at sort of the big lesson
01:27:11.400 | of researching AI and reinforcement learning
01:27:14.880 | is that simple methods, general methods
01:27:18.320 | that leverage computation seem to work well.
01:27:21.680 | So basically don't try to do any kind of fancy algorithms,
01:27:24.480 | just wait for computation to get fast.
01:27:26.920 | Do you share this kind of intuition?
01:27:31.160 | - I think the high-level idea makes a lot of sense.
01:27:34.320 | I'm not sure that my takeaway would be
01:27:35.840 | that we don't need to work on algorithms.
01:27:37.480 | I think that my takeaway would be
01:27:39.520 | that we should work on general algorithms.
01:27:43.480 | And actually I think that this idea
01:27:46.840 | of needing to better automate
01:27:50.600 | the acquisition of experience in the real world
01:27:53.360 | actually follows pretty naturally
01:27:55.920 | from Rich Sutton's conclusion.
01:27:58.640 | So if the claim is that automated general methods
01:28:03.600 | plus data leads to good results,
01:28:06.440 | then it makes sense that we should build general methods
01:28:08.240 | and we should build the kind of methods
01:28:09.840 | that we can deploy and get them to go out there
01:28:11.560 | and like collect their experience autonomously.
01:28:14.480 | I think that one place where
01:28:16.960 | the current state of things
01:28:18.840 | falls a little bit short of that
01:28:19.880 | is actually the going out there
01:28:21.600 | and collecting the data autonomously,
01:28:23.480 | which is easy to do in a simulated board game,
01:28:26.000 | but very hard to do in the real world.
01:28:27.720 | - Yeah, it keeps coming back to this one problem, right?
01:28:30.520 | So your mind is focused there now in this real world.
01:28:35.760 | It just seems scary, this step of collecting the data.
01:28:40.480 | And it seems unclear to me how we can do it effectively.
01:28:45.200 | - Yeah, well, you know, seven billion people in the world,
01:28:48.280 | each of them have to do that at some point in their lives.
01:28:50.960 | - And we should leverage that experience
01:28:52.680 | that they've all gathered.
01:28:54.840 | We should be able to try to collect that kind of data.
01:28:58.200 | Okay, big questions.
01:29:01.520 | Maybe stepping back through your life,
01:29:05.280 | what book or books, technical, fiction, or philosophical,
01:29:10.280 | had a big impact on the way you saw the world,
01:29:14.080 | on the way you thought about the world,
01:29:15.640 | your life in general?
01:29:16.840 | And maybe what books, if it's different,
01:29:22.120 | would you recommend people consider reading
01:29:24.800 | on their own intellectual journey?
01:29:26.240 | It could be within reinforcement learning,
01:29:28.760 | but it could be very much bigger.
01:29:31.520 | - I don't know if this is like a scientifically,
01:29:36.400 | like particularly meaningful answer,
01:29:39.280 | but like the honest answer is that I actually found
01:29:43.760 | a lot of the work by Isaac Asimov
01:29:45.800 | to be very inspiring when I was younger.
01:29:47.840 | I don't know if that has anything to do
01:29:49.040 | with AI necessarily.
01:29:50.840 | - You don't think it had a ripple effect in your life?
01:29:53.000 | - Maybe it did.
01:29:55.160 | But yeah, I think that a vision of a future where,
01:30:00.160 | well, first of all, artificial intelligence systems,
01:30:05.720 | or I might say artificial robotic systems,
01:30:07.080 | have kind of a big place,
01:30:10.760 | a big role in society.
01:30:12.480 | And where we try to imagine sort of the limiting case
01:30:17.480 | of technological advancement and how that might play out
01:30:21.400 | in our future history.
01:30:23.680 | But yeah, I think that that was in some way influential.
01:30:28.680 | I don't really know how, but I would recommend it.
01:30:33.040 | I mean, if nothing else, you'd be well entertained.
01:30:35.440 | - When did you first yourself like fall in love
01:30:38.000 | with the idea of artificial intelligence,
01:30:40.280 | get captivated by this field?
01:30:42.240 | - So my honest answer here is actually that
01:30:47.120 | I only really started to think about it
01:30:49.920 | as something that I might want to do
01:30:52.400 | actually in graduate school pretty late.
01:30:54.800 | And a big part of that was that until,
01:30:58.040 | somewhere around 2009, 2010,
01:31:00.760 | it just wasn't really high on my priority list
01:31:02.960 | because I didn't think that it was something
01:31:05.640 | where we're going to see very substantial advances
01:31:07.800 | in my lifetime.
01:31:08.720 | And maybe in terms of my career,
01:31:14.360 | the time when I really decided I wanted to work on this
01:31:18.520 | was when I actually took a seminar course
01:31:21.080 | that was taught by Professor Andrew Ng.
01:31:23.040 | And at that point, I of course had
01:31:26.240 | like a decent understanding
01:31:27.320 | of the technical things involved.
01:31:29.040 | But one of the things that really resonated with me
01:31:30.680 | was when he said in the opening lecture,
01:31:32.520 | something to the effect of like,
01:31:33.640 | well, he used to have graduate students come to him
01:31:36.040 | and talk about how they want to work on AI
01:31:38.280 | and he would kind of chuckle
01:31:39.280 | and give them some math problem to deal with.
01:31:41.360 | But now he's actually thinking that this is an area
01:31:43.560 | where we might see like substantial advances
01:31:45.240 | in our lifetime.
01:31:46.640 | And that kind of got me thinking because,
01:31:49.960 | you know, in some abstract sense,
01:31:51.680 | yeah, like you can kind of imagine that,
01:31:53.600 | but in a very real sense,
01:31:55.320 | when someone who had been working on that kind of stuff
01:31:57.920 | their whole career suddenly says that,
01:32:00.360 | yeah, like that had some effect on me.
01:32:03.960 | - Yeah, this might be a special moment
01:32:05.560 | in the history of the field.
01:32:07.800 | That this is where we might see
01:32:10.520 | some interesting breakthroughs.
01:32:13.840 | So in the space of advice,
01:32:16.120 | somebody who's interested in getting started
01:32:18.440 | in machine learning or reinforcement learning,
01:32:21.120 | what advice would you give to maybe an undergraduate student
01:32:23.760 | or maybe even younger,
01:32:25.160 | how, what are the first steps to take
01:32:27.800 | and further on, what are the steps to take on that journey?
01:32:32.680 | - So something that I think is important to do
01:32:37.680 | is to not be afraid to like spend time
01:32:42.960 | imagining the kind of outcome that you might like to see.
01:32:46.160 | So, you know, one outcome might be a successful career,
01:32:49.640 | a large paycheck or something,
01:32:50.920 | or state-of-the-art results on some benchmark,
01:32:53.680 | but hopefully that's not the thing
01:32:54.760 | that's like the main driving force for somebody.
01:32:57.600 | But I think that if someone who's a student
01:33:01.840 | considering a career in AI,
01:33:03.040 | like takes a little while, sits down and thinks like,
01:33:05.840 | what do I really want to see?
01:33:07.320 | What do I want to see a machine do?
01:33:08.600 | What do I want to see a robot do?
01:33:10.200 | What do I want to see
01:33:11.040 | a natural language system do?
01:33:12.520 | Just like imagine, you know,
01:33:14.840 | imagine it almost like a commercial
01:33:16.520 | for a future product or something,
01:33:18.200 | or like something that you'd like to see in the world,
01:33:21.080 | and then actually sit down and think about the steps
01:33:23.360 | that are necessary to get there.
01:33:24.960 | And hopefully that thing is not a better number
01:33:27.480 | on ImageNet classification.
01:33:28.760 | It's like, it's probably like an actual thing
01:33:30.600 | that we can't do today that would be really awesome.
01:33:32.640 | Whether it's a robot butler or a, you know,
01:33:36.120 | a really awesome healthcare decision-making support system,
01:33:38.840 | whatever it is that you find inspiring.
01:33:41.560 | And I think that thinking about that
01:33:43.040 | and then backtracking from there
01:33:44.720 | and imagining the steps needed to get there
01:33:46.520 | will actually lead to much better research.
01:33:48.080 | It'll lead to rethinking the assumptions.
01:33:50.320 | It'll lead to working on the bottlenecks
01:33:53.080 | that other people aren't working on.
01:33:55.720 | - And then naturally to turn to you,
01:33:58.120 | we've talked about reward functions,
01:34:00.480 | and you just gave advice on looking forward,
01:34:03.200 | how you'd like to see,
01:34:04.280 | what kind of change you would like to make in the world.
01:34:06.640 | What do you think, ridiculous, big question,
01:34:09.200 | what do you think is the meaning of life?
01:34:11.400 | What is the meaning of your life?
01:34:13.280 | What gives you fulfillment, purpose, happiness, and meaning?
01:34:18.280 | - That's a very big question.
01:34:21.800 | - What's the reward function under which you're operating?
01:34:27.520 | - Yeah, I think one thing that does give, you know,
01:34:30.280 | if not meaning, at least satisfaction
01:34:31.920 | is some degree of confidence
01:34:35.120 | that I'm working on a problem that really matters.
01:34:37.400 | I feel like it's less important to me
01:34:38.720 | to like actually solve a problem,
01:34:41.680 | but it's quite nice to spend my time on things
01:34:46.680 | that I believe really matter.
01:34:48.960 | And I try pretty hard to look for that.
01:34:52.120 | - I don't know if it's easy to answer this,
01:34:54.640 | but if you're successful, what does that look like?
01:34:59.640 | What's the big dream?
01:35:01.880 | Now, of course, success is built on top of success,
01:35:05.620 | and you keep going forever, but what is the dream?
01:35:10.640 | - Yeah, so one very concrete thing,
01:35:12.400 | or maybe as concrete as it's gonna get here
01:35:15.760 | is to see machines that actually get better and better
01:35:20.760 | the longer they exist in the world.
01:35:23.360 | And that kind of seems like on the surface,
01:35:25.720 | one might even think that that's something
01:35:26.880 | that we have today, but I think we really don't.
01:35:28.880 | I think that there is an unending complexity
01:35:33.880 | in the universe, and to date,
01:35:37.880 | all of the machines that we've been able to build
01:35:40.160 | don't sort of improve up to the limit of that complexity.
01:35:43.680 | They hit a wall somewhere.
01:35:45.520 | Maybe they hit a wall because they're in a simulator
01:35:48.000 | that is only a very limited,
01:35:50.120 | very pale imitation of the real world,
01:35:52.240 | or they hit a wall because they rely on a labeled dataset,
01:35:55.320 | but they never hit the wall
01:35:56.760 | of like running out of stuff to see.
01:35:58.820 | So I'd like to build a machine
01:36:02.440 | that can go as far as possible in that regard.
01:36:04.720 | - Runs up against the ceiling
01:36:06.460 | of the complexity of the universe.
01:36:07.880 | - Yes.
01:36:09.400 | - Well, I don't think there's a better way to end it, Sergey.
01:36:11.840 | Thank you so much.
01:36:12.680 | It's a huge honor.
01:36:13.500 | I can't wait to see the amazing work
01:36:16.240 | that you have yet to publish in the education space
01:36:20.420 | in terms of reinforcement learning.
01:36:21.680 | Thank you for inspiring the world.
01:36:22.900 | Thank you for the great research you do.
01:36:24.640 | - Thank you.
01:36:25.560 | - Thanks for listening to this conversation
01:36:27.280 | with Sergey Levine, and thank you to our sponsors,
01:36:30.820 | Cash App and ExpressVPN.
01:36:33.440 | Please consider supporting this podcast
01:36:35.520 | by downloading Cash App and using code LexPodcast
01:36:40.020 | and signing up at expressvpn.com/lexpod.
01:36:44.640 | Click all the links, buy all the stuff.
01:36:47.760 | It's the best way to support this podcast
01:36:50.280 | and the journey I'm on.
01:36:52.280 | If you enjoy this thing, subscribe on YouTube,
01:36:54.760 | review it with five stars on Apple Podcast,
01:36:57.060 | support on Patreon, or connect with me on Twitter
01:36:59.880 | at Lex Friedman, spelled somehow,
01:37:02.920 | if you can figure out how,
01:37:04.160 | without using the letter E, just F-R-I-D-M-A-N.
01:37:08.960 | And now, let me leave you with some words
01:37:11.200 | from Salvador Dali.
01:37:12.520 | Intelligence without ambition is a bird without wings.
01:37:17.660 | Thank you for listening, and hope to see you next time.
01:37:21.700 | (upbeat music)
01:37:24.280 | (upbeat music)