Sergey Levine: Robotics and Machine Learning | Lex Fridman Podcast #108
Chapters
0:00 Introduction
3:05 State-of-the-art robots vs humans
16:13 Robotics may help us understand intelligence
22:49 End-to-end learning in robotics
27:01 Canonical problem in robotics
31:44 Commonsense reasoning in robotics
34:41 Can we solve robotics through learning?
44:55 What is reinforcement learning?
66:36 Tesla Autopilot
68:15 Simulation in reinforcement learning
73:46 Can we learn gravity from data?
76:03 Self-play
77:39 Reward functions
87:01 Bitter lesson by Rich Sutton
92:13 Advice for students interested in AI
93:55 Meaning of life
00:00:00.000 |
The following is a conversation with Sergey Levine, 00:00:03.300 |
a professor at Berkeley and a world-class researcher 00:00:12.540 |
for end-to-end training of neural network policies 00:00:17.540 |
scalable algorithms for inverse reinforcement learning 00:00:30.280 |
by downloading Cash App and using code LexPodcast 00:00:45.300 |
If you enjoy this thing, subscribe on YouTube, 00:01:18.300 |
Since Cash App does fractional share trading, 00:01:20.660 |
let me mention that the order execution algorithm 00:01:23.540 |
that works behind the scenes to create the abstraction 00:01:26.820 |
of the fractional orders is an algorithmic marvel. 00:01:32.320 |
for taking a step up to the next layer of abstraction 00:01:35.840 |
making trading more accessible for new investors 00:01:41.640 |
So again, if you get Cash App from the App Store 00:01:48.120 |
you get $10 and Cash App will also donate $10 to FIRST, 00:01:52.720 |
an organization that is helping to advance robotics 00:01:55.320 |
and STEM education for young people around the world. 00:02:08.320 |
to support this podcast and to get an extra three months free 00:02:18.640 |
I think ExpressVPN is the best VPN out there. 00:02:23.040 |
but it happens to be true in my humble opinion. 00:02:36.560 |
It's really important that they don't log your data. 00:02:40.080 |
It works on Linux and every other operating system, 00:02:43.180 |
but Linux of course is the best operating system. 00:02:46.640 |
Shout out to my favorite flavor, Ubuntu MATE 20.04. 00:02:54.560 |
to support this podcast and to get an extra three months free 00:03:00.800 |
And now here's my conversation with Sergey Levine. 00:03:05.300 |
What's the difference between a state of the art human, 00:03:09.920 |
well, I don't know if we qualify as state of the art humans, 00:03:11.880 |
but a state of the art human and a state of the art robot? 00:03:22.360 |
I think it's a very tricky thing to understand 00:03:25.320 |
because there are some things that are difficult 00:03:34.280 |
between capabilities of robots in terms of hardware 00:03:42.760 |
There is a little video that I think robotics researchers 00:03:47.280 |
special robotics learning researchers like myself 00:03:52.160 |
which demonstrates a prototype robot called the PR1. 00:03:59.200 |
And there's this beautiful video showing the PR1 00:04:23.880 |
Now, obviously, like human bodies are sophisticated 00:04:28.200 |
but on the whole, if we're willing to like spend 00:04:32.560 |
we can kind of close the hardware gap almost. 00:04:35.520 |
But the intelligence gap, that one is very wide. 00:04:41.280 |
you're referring to the physical sort of the actuators, 00:04:45.040 |
as opposed to the hardware on which the cognition, 00:04:50.760 |
I'm referring to the body rather than the mind. 00:04:53.320 |
So that means that kind of the work is cut out for us. 00:04:56.640 |
Like while we can still make the body better, 00:04:59.000 |
we kind of know that the big bottleneck right now 00:05:16.800 |
- The gap is very large and the gap becomes larger 00:05:20.600 |
the more unexpected events can happen in the world. 00:05:24.560 |
So essentially the spectrum along which you can measure 00:05:32.160 |
If you control everything in the world very tightly, 00:05:47.200 |
But as soon as anything starts to vary in the environment, 00:05:50.360 |
now it'll trip up and if many, many things vary 00:05:52.600 |
like they would like in your kitchen, for example, 00:06:01.960 |
but how much on the human side of the cognitive abilities 00:06:18.520 |
from sort of scratch from the day we're born? 00:06:23.280 |
as asking about the implications of this for AI. 00:06:28.720 |
I can't really speak authoritatively about it. 00:06:40.440 |
well, first of course, biology is very messy. 00:06:44.960 |
And if you ask the question, how does a person do something 00:06:53.080 |
and oftentimes you can find support for many different, 00:07:05.440 |
So maybe a person is from birth very, very good 00:07:09.840 |
at some things like, for example, recognizing faces. 00:07:12.000 |
There's a very strong evolutionary pressure to do that. 00:07:25.240 |
the minimal sufficient thing is we could, for example, 00:07:30.600 |
that evolution couldn't have prepared them for. 00:07:33.800 |
Our daily lives actually do this to us all the time. 00:07:42.520 |
that we can find ourselves in and we do very well there. 00:07:45.720 |
Like I can give you a joystick to control a robotic arm, 00:07:50.720 |
and you might be pretty bad for the first couple of seconds. 00:07:54.600 |
on using this robotic arm to like open this door, 00:07:59.480 |
Even though you've never seen this device before, 00:08:11.200 |
And that's exactly where our current robotic systems 00:08:33.360 |
to introspect all the knowledge I have about the world. 00:08:36.720 |
But it seems like there might be an iceberg underneath 00:08:40.360 |
of the amount of knowledge we actually bring to the table. 00:08:44.320 |
- I think there's absolutely an iceberg of knowledge 00:08:47.940 |
but I think it's very likely that iceberg of knowledge 00:08:53.840 |
Because we have a lot of prior experience to draw on 00:08:58.400 |
and it kind of makes sense that the right way 00:09:25.880 |
like a common sense understanding of the world. 00:09:29.480 |
it's not because something about machine learning itself 00:09:40.960 |
Kind of the input output X's go to Y's sort of model. 00:09:46.440 |
is to view the world more as like a massive experience 00:09:51.340 |
that is not necessarily providing any rigid supervision, 00:09:58.240 |
into some sort of common sense understanding. 00:10:03.040 |
Well, you're painting an optimistic, beautiful picture, 00:10:12.360 |
figure out how we can get access to more and more data 00:10:16.320 |
for those learning algorithms to extract signal from, 00:10:19.080 |
and then accumulate that iceberg of knowledge. 00:10:27.860 |
And this is where we perhaps reach the limits 00:10:32.760 |
But one thing that I think that the research community 00:10:38.040 |
is how much it matters where that experience comes from. 00:10:41.680 |
Like, do you just like download everything on the internet 00:10:44.920 |
and cram it into essentially the 21st century analog 00:10:49.000 |
of the giant language model and then see what happens? 00:10:52.560 |
Or does it actually matter whether your machine 00:10:56.680 |
or in the sense that it actually attempts things, 00:11:01.440 |
and kind of augments its experience that way? 00:11:05.880 |
it gets to interact with and observe and learn from. 00:11:10.360 |
- Right, it may be that the world is so complex 00:11:12.760 |
that simply obtaining a large mass of sort of IID samples 00:11:20.840 |
But if you are actually interacting with the world 00:11:24.000 |
and essentially performing this sort of hard negative mining 00:11:32.200 |
and augmenting your understanding using that experience, 00:11:35.680 |
and you're just doing this continually for many years, 00:11:50.480 |
or lack of common sense is often characterized 00:11:58.480 |
here I'm this bottle of water sitting on the table, 00:12:01.860 |
everything is fine if I were to knock it over, 00:12:07.400 |
And I know that nothing good would happen from that, 00:12:10.280 |
but if I have a bad understanding of the world, 00:12:15.940 |
If I actually go about my daily life doing the things 00:12:20.720 |
that my current understanding of the world suggests 00:12:24.320 |
in some ways I'll get exactly the right supervision 00:12:56.320 |
Can we do pretty well by reading all of Wikipedia, 00:12:59.480 |
sort of randomly sampling it like language models do, 00:13:11.040 |
- So I think this is first an open scientific problem, 00:13:31.520 |
So perhaps it's okay if you spend part of your day 00:13:37.200 |
visiting interesting regions of your state space, 00:13:43.080 |
make sure that you really try out the solutions 00:13:46.860 |
that your current model of the world suggests 00:13:48.900 |
might be effective, and observe whether those solutions 00:13:56.260 |
to have kind of a perpetual improvement loop. 00:13:59.560 |
Like this perpetual improvement loop is really like, 00:14:05.360 |
the best current methods from the best methods 00:14:19.280 |
So you kind of mentioned there's an optimization problem, 00:14:21.420 |
you kind of explore the specifics of a particular strategy, 00:14:27.680 |
How important is it to explore totally outside 00:14:30.880 |
of the strategies that have been working for you so far? 00:14:35.160 |
- Yeah, I think it's a very problem dependent 00:14:51.800 |
and some of the sort of more open-ended reformulations 00:14:56.320 |
of that problem that have been explored in recent years. 00:15:00.280 |
is framed as a problem of maximizing utility, 00:15:07.680 |
But a very interesting kind of way to look at, 00:15:26.560 |
And that might suggest a somewhat different solution. 00:15:28.840 |
So if you don't know what you're gonna be tasked with doing 00:15:31.160 |
and you just wanna prepare yourself optimally 00:15:35.320 |
maybe then you will choose to attain some sort of coverage, 00:15:39.360 |
build up sort of an arsenal of cognitive tools, if you will, 00:15:47.080 |
you will be well-prepared to undertake that task. 00:15:56.860 |
the general intelligence kind of formulation. 00:16:04.480 |
I don't think that's by any means the mainstream 00:16:19.000 |
You actually kind of painted two pictures here, 00:16:21.080 |
one of sort of the narrow, one of the general. 00:16:23.200 |
What, in your view, is the big problem of robotics? 00:16:26.500 |
Again, ridiculously philosophical, high-level questions. 00:16:29.600 |
- I think that, you know, maybe there are two ways 00:16:40.800 |
what would sort of maximize the usefulness of robots? 00:16:43.720 |
And there the answer might be something like a system 00:16:47.420 |
where a system that can perform whatever task 00:16:59.500 |
If you tell it to teleport to another planet, 00:17:05.500 |
then potentially with a little bit of additional training 00:17:08.400 |
or a little bit of additional trial and error, 00:17:11.900 |
in much the same way as like a human teleoperator 00:17:14.300 |
ought to figure out how to drive the robot to do that. 00:17:36.020 |
in the world of robotics, but more the other way around, 00:17:40.700 |
to help us understand artificial intelligence. 00:17:43.300 |
- So your dream fundamentally is to understand intelligence. 00:17:47.860 |
- Yes, I think that's the dream for many people 00:17:53.220 |
I think that there's something very pragmatic 00:17:58.540 |
but I do think that a lot of people that go into this field, 00:18:01.140 |
actually, you know, the things that they draw inspiration 00:18:06.860 |
to help us learn about intelligence and about ourselves. 00:18:10.620 |
- So that's fascinating, that robotics is basically 00:18:15.220 |
the space by which you can get closer to understanding 00:18:20.580 |
So what is it about robotics that's different 00:18:25.360 |
So if we look at some of the early breakthroughs 00:18:27.860 |
in deep learning or in the computer vision space 00:18:36.300 |
and thereby came up with a lot of brilliant ideas. 00:18:39.900 |
between computer vision, purely defined an image net 00:18:50.200 |
you kind of have to take away many of the crutches. 00:18:55.340 |
So you have to deal with both the particular problems 00:19:01.700 |
but you also have to deal with the integration 00:19:04.220 |
And classically, we've always thought of the integration 00:19:08.700 |
So a classic kind of modular engineering approach 00:19:12.860 |
and wire them together, and then the whole thing works. 00:19:22.100 |
might lead to just like very different solutions 00:19:24.300 |
than if we were to study the parts and wire them together. 00:19:26.540 |
So the integrative nature of robotics research 00:19:29.820 |
helps us see the different perspectives on the problem. 00:19:34.060 |
Another part of the answer is that with robotics, 00:19:37.820 |
it casts a certain paradox into very sharp relief. 00:19:41.420 |
So this is sometimes referred to as Moravec's paradox, 00:19:50.700 |
can be very easy for machines and vice versa. 00:19:54.780 |
So, you know, integral and differential calculus 00:20:03.660 |
it can derive derivatives and integrals for you 00:20:16.340 |
And sometimes when we see such blatant discrepancies, 00:20:23.040 |
So if we really try to zero in on those discrepancies, 00:20:25.620 |
we might find that little bit that we're missing. 00:20:27.900 |
And it's not that we need to make machines better 00:20:30.460 |
or worse at math and better at drinking water, 00:20:33.040 |
but just that by studying those discrepancies, 00:20:49.420 |
So the Hans Moravec paradox is probably referring 00:20:59.220 |
all the kind of stuff we do in the physical world. 00:21:05.800 |
if you were to try to disentangle the Moravec paradox, 00:21:10.800 |
like why is there such a gap in our intuition about it? 00:21:17.640 |
Why do you think manipulating objects is so hard 00:21:22.520 |
from applying reinforcement learning in this space? 00:21:25.660 |
- Yeah, I think that one reason is maybe that 00:21:31.240 |
for many of the other problems that we've studied 00:21:51.540 |
to cast it as a very tightly supervised problem. 00:22:00.460 |
You can do it, it just doesn't seem to work all that well. 00:22:06.040 |
where we know exactly which motor commands to send 00:22:11.220 |
that's not actually like such a great solution. 00:22:13.560 |
And it also doesn't seem to be even remotely similar 00:22:20.360 |
here's how you fire your muscles in order to walk. 00:22:29.640 |
- And that's what you mean by tightly coupled, 00:22:33.760 |
gets a supervised signal of whether it's a good one or not. 00:22:39.100 |
you could sort of imagine up to a level of abstraction 00:22:48.120 |
- If we look at sort of the sub spaces of robotics, 00:22:58.040 |
and we get to see how this beautiful mess interplays. 00:23:06.280 |
broadly speaking, understanding the environment. 00:23:14.480 |
Then there's prediction in trying to anticipate 00:23:20.580 |
in order for you to be able to act in that world. 00:23:24.360 |
And then there's also this game theoretic aspect 00:23:28.120 |
of how your actions will change the behavior of others. 00:23:36.200 |
and this is bigger than reinforcement learning, 00:23:38.120 |
this is just broadly looking at the problem in robotics. 00:23:46.280 |
that when you start to look at all of them together, 00:23:54.320 |
Like you can't even say which one individually is harder 00:23:58.800 |
you should only be looking at them all together. 00:24:01.480 |
- I think when you look at them all together, 00:24:05.160 |
And I think that's actually pretty important. 00:24:15.520 |
reinforcement learning for robotic manipulation skills 00:24:20.660 |
that seemed a little inflammatory and controversial 00:24:29.440 |
the point that we were actually trying to make in that work 00:24:36.060 |
you could actually do better if you treat them together 00:24:39.580 |
And the way that we tried to demonstrate this 00:24:41.420 |
is we picked a fairly simple motor control task 00:24:43.640 |
where a robot had to insert a little red trapezoid 00:25:05.600 |
essentially the pressure on the perception part 00:25:18.680 |
because vertically it just pushes it down all the way 00:25:21.960 |
And their perceptual errors are a lot less harmful, 00:25:24.560 |
whereas perpendicular to the direction of motion, 00:25:28.940 |
So the point is that if you combine these two things, 00:25:32.120 |
you can trade off errors between the components 00:25:39.740 |
while still leading to better overall performance. 00:25:43.920 |
I mean, in the space of pegs and things like that, 00:25:51.160 |
but that seems to be at least intuitively an idea 00:25:54.960 |
that should generalize to basically all aspects 00:26:05.160 |
sort of perceptual heuristics in humans and animals 00:26:10.760 |
of this is something called the gaze heuristic, 00:26:17.200 |
So if you want to catch a ball, for instance, 00:26:24.080 |
solve a complex system of differential equations 00:26:31.400 |
so that the object stays in the same position 00:26:47.040 |
to figure out if they're about to collide with somebody. 00:26:49.080 |
Frogs use this to catch insects and so on and so on. 00:26:51.480 |
So this is something that actually happens in nature. 00:26:56.760 |
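[A minimal sketch of the gaze heuristic described here, written as a simple feedback rule; it is not from the conversation itself, and the point-mass kinematics, gain, and function names are illustrative assumptions.]

```python
import math

def gaze_heuristic_step(pos, heading, target, prev_bearing, dt=0.1, speed=1.0, gain=3.0):
    """One step of the gaze heuristic: steer so the bearing to the target stays constant."""
    # Line-of-sight (bearing) angle from the pursuer to the target.
    bearing = math.atan2(target[1] - pos[1], target[0] - pos[0])
    bearing_rate = (bearing - prev_bearing) / dt
    # No differential equations for the target's motion: just null out bearing drift.
    heading += gain * bearing_rate * dt
    new_pos = (pos[0] + speed * math.cos(heading) * dt,
               pos[1] + speed * math.sin(heading) * dt)
    return new_pos, heading, bearing
```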
just because all the scientists were able to identify 00:27:14.040 |
when you're thinking about some of these problems? 00:27:21.360 |
at least the robotics community has converged towards that 00:27:42.880 |
- I don't think I have like a really great answer 00:27:46.600 |
And I think partly the reason I don't have a great answer 00:27:51.120 |
it has to do with the fact that the difficulty 00:27:54.880 |
is really in the flexibility and adaptability 00:27:57.400 |
rather than in doing a particular thing really, really well. 00:28:12.760 |
It's really the ability to quickly figure out 00:28:16.840 |
how to do some arbitrary new thing well enough 00:28:21.840 |
to like, you know, to move on to the next arbitrary thing. 00:28:43.160 |
so if you had asked me this question around like 2016, 00:28:46.080 |
maybe I would have probably said that robotic grasping 00:28:50.840 |
because it's a task with great real world utility. 00:28:54.200 |
Like you will get a lot of money if you can do it well. 00:29:03.120 |
So you will get a lot of money if you do it well 00:29:04.400 |
because lots of people want to run warehouses with robots. 00:29:12.360 |
will require very different grasping strategies. 00:29:16.880 |
people have gotten really good at building systems 00:29:21.120 |
It's to the point where I'm not actually sure 00:29:29.400 |
But it's kind of interesting to see the kind of methods 00:29:41.280 |
So people who have studied the history of computer vision 00:29:53.560 |
people thought of it as an inverse physics problem. 00:29:56.000 |
Essentially, you look at what's in front of you, 00:29:59.880 |
then use your best estimate of the laws of physics 00:30:11.320 |
including our own, but also ones from many other labs, 00:30:16.760 |
with some combination of either exhaustive simulation 00:30:25.200 |
solving geometry problems or physics problems. 00:30:27.480 |
- So what are, just by the way, in the grasping, 00:30:32.360 |
what are the difficulties that have been worked on? 00:30:42.400 |
why is picking stuff up such a difficult problem? 00:30:51.560 |
or the variety of things that you have to deal with 00:30:54.520 |
And oftentimes things that work for one class of objects 00:31:00.240 |
So if you get really good at picking up boxes 00:31:06.160 |
you just need to employ a very different strategy. 00:31:15.280 |
It has to do with the bits that are easier to pick up, 00:31:20.840 |
the bits that will cause the thing to pivot and bend 00:31:24.960 |
versus the bits that result in a nice secure grasp, 00:31:29.120 |
things that if you pick them up the wrong way, 00:31:30.560 |
they'll fall upside down and the contents will spill out. 00:31:33.720 |
So there's all these little details that come up, 00:31:46.960 |
there creeps in this notion that starts to sound 00:31:52.980 |
Do you think solving the general problem of robotics 00:32:07.520 |
like you said, be robust and deal with uncertainty, 00:32:13.360 |
and assimilate different pieces of knowledge that you have? 00:32:32.320 |
it's the other way around is that studying robotics 00:32:36.000 |
can help us understand how to put common sense 00:32:43.000 |
and why our current systems might lack common sense, 00:32:47.080 |
is an emergent property of actually having to interact 00:32:51.620 |
with a particular world, a particular universe, 00:33:13.900 |
like give it a picture of a person wearing a fur coat 00:33:18.460 |
But I think what's really happening in those settings 00:33:20.720 |
is that the system doesn't actually live in our world, 00:33:24.120 |
it lives in its own world that consists of pixels 00:33:41.160 |
And if we build AI systems that are forced to deal 00:33:43.200 |
with all of the messiness and complexity of our universe, 00:33:52.880 |
don't have to do that, they can take some shortcut. 00:33:57.200 |
You've a couple times already sort of reframed 00:34:03.880 |
I don't know if my way of thinking is common, 00:34:36.660 |
That's really interesting way to think about it. 00:34:59.080 |
- I think that in terms of the spirit of the question, 00:35:11.120 |
Like, I think that in some ways when we build algorithms, 00:35:28.560 |
that you're getting at, I do think the answer is yes. 00:35:34.240 |
that have previously required meticulous manual engineering 00:35:39.920 |
And actually one thing I will say on this topic is 00:35:42.200 |
I don't think this is actually a very radical 00:35:49.840 |
as a way to do control for a very, very long time. 00:35:53.240 |
And in some ways what's changed is really more the name. 00:35:57.880 |
So today we would say that, oh, my robot does 00:36:01.760 |
machine learning, it does reinforcement learning. 00:36:39.360 |
So this feels like there's an accumulation of knowledge 00:36:44.800 |
that one big difference between learning-based systems 00:36:52.160 |
should get better and better the more they do something. 00:36:58.040 |
- So if we look back at the world of expert systems 00:37:10.960 |
do you think that will have a role at some point? 00:37:14.760 |
Deep learning, machine learning, reinforcement learning 00:37:17.080 |
has shown incredible results and breakthroughs 00:37:21.240 |
and just inspired thousands, maybe millions of researchers. 00:37:30.600 |
but it used to be popular idea of symbolic AI. 00:37:44.640 |
So this is the highly biased history from my perspective. 00:38:00.720 |
like what action do I take in order for X to be true? 00:38:05.600 |
your logical symbolic representation to get an answer. 00:38:08.360 |
What that turned into somewhere in the 1990s is, 00:38:14.280 |
and statements that have true or false values, 00:38:30.520 |
just probabilistic logical inference systems. 00:38:32.800 |
And then people said, well, let's actually learn 00:38:35.560 |
the individual probabilities inside these models. 00:38:40.680 |
let's not even specify the nodes in the models. 00:38:48.800 |
It's essentially instantiating rational decision-making 00:38:53.600 |
and learning by means of an optimization process. 00:38:56.680 |
So in a sense, I would say, yes, it has a place. 00:39:06.680 |
it looks slightly different than it was before. 00:39:28.840 |
that just happens to be expressed by a neural net, 00:39:35.400 |
to figure out the answer to a query that you have. 00:39:48.040 |
is you can follow the reasoning of the system. 00:39:50.600 |
That to us mere humans is somehow compelling. 00:39:54.160 |
It's just, I don't know what to make of this fact 00:40:00.440 |
that there's a human desire for intelligent systems 00:40:20.160 |
Like we shouldn't expect that of intelligent systems. 00:40:27.600 |
But if I were to sort of psychoanalyze the researchers 00:40:45.680 |
of learning-based systems will be as explainable 00:40:50.680 |
as the dream was with expert systems, for example? 00:41:06.920 |
Like, why do you want your system to explain itself? 00:41:14.640 |
- But in some ways that's a much bigger problem, actually. 00:41:19.320 |
and then it might screw up in how it explains itself. 00:41:24.920 |
so that it's not actually doing what it was supposed to do. 00:41:30.400 |
is really as a bigger problem of verification and validation 00:41:36.160 |
of which explainability is sort of one component. 00:41:41.160 |
I see explainability, you put it beautifully. 00:41:43.960 |
I think you actually summarize the field of explainability. 00:41:46.800 |
But to me, there's another aspect of explainability 00:41:51.760 |
that has nothing to do with errors or with like, 00:42:08.060 |
It's just that for other intelligence systems 00:42:21.780 |
neural networks are less capable of doing that. 00:42:32.700 |
Maybe one specific story I can tell you about 00:42:40.500 |
who's now a professor at MIT named Jacob Andreas. 00:42:43.320 |
Jacob actually works in natural language processing, 00:42:45.780 |
but he had this idea to do a little bit of work 00:42:49.140 |
and how natural language can basically structure 00:42:55.580 |
And one of the things he did is he set up a model 00:43:03.740 |
but the model reads in a natural language instruction. 00:43:08.820 |
So you tell it like, you know, go to the red house 00:43:11.600 |
and then it's supposed to go to the red house. 00:43:14.960 |
is he treated that sentence not as a command from a person, 00:43:19.540 |
but as a representation of the internal kind of state 00:43:28.540 |
what it would do is it would basically try to think 00:43:33.540 |
attempt to do them and see if they led to the right outcome. 00:43:35.580 |
So it would kind of think out loud, like, you know, 00:43:37.580 |
I'm faced with this new task, what am I gonna do? 00:43:46.740 |
oh, go to the green plant, that's what's working. 00:43:49.500 |
And then you could look at the string that it came up with, 00:43:57.320 |
as internal state and you can start getting some handle 00:44:00.880 |
- And then what I was kind of trying to get to is that 00:44:12.300 |
people who review that story, how much they like it. 00:44:16.620 |
So that, you know, initially that could be a hyper parameter 00:44:23.420 |
but it's an interesting notion of the convincingness 00:44:28.580 |
of the story becoming part of the reward function, 00:44:31.800 |
the objective function of the explainability. 00:44:33.960 |
It's in the world of sort of Twitter and fake news, 00:44:37.500 |
that might be a scary notion that the nature of truth 00:44:42.000 |
may not be as important as the convincingness of the, 00:44:45.020 |
how convinced you are in telling the story around the facts. 00:44:57.040 |
in reinforcement learning, deep reinforcement learning, 00:45:04.500 |
- I think that what reinforcement learning refers to today 00:45:06.940 |
is really just the kind of the modern incarnation 00:45:15.700 |
which is that it's literally learning from reinforcement, 00:45:22.540 |
But really I think the way the term is used today 00:45:24.340 |
is it's used to refer more broadly to learning-based control. 00:45:35.820 |
So is action is the fundamental element there? 00:45:48.240 |
Now, like, it's easier to see that kind of idea 00:45:52.300 |
in the space of maybe games, in the space of robotics. 00:46:00.220 |
Like, where are the limits of the applicability 00:46:07.380 |
is essentially the encapsulation of the AI problem 00:46:12.980 |
So any problem that we would want a machine to do, 00:46:18.220 |
can likely be represented as a decision-making problem. 00:46:20.460 |
Classifying images is a decision-making problem, 00:46:25.060 |
Controlling a chemical plant is a decision-making problem. 00:46:43.820 |
is one of the ways to reach a very broad swath 00:47:00.220 |
as a generalization of supervised machine learning. 00:47:10.140 |
You have the assumption that someone actually told you 00:47:18.220 |
as essentially relaxing some of those assumptions. 00:47:20.340 |
Now that's not always a very productive way to look at it, 00:47:22.100 |
because if you actually have a supervised learning problem, 00:47:24.300 |
you'll probably solve it much more effectively 00:47:25.980 |
by using supervised learning methods, because it's easier. 00:47:41.580 |
the kind of tools we bring to the table today, 00:47:46.500 |
everything will be a reinforcement learning problem, 00:47:48.860 |
just like you said, image classification should be mapped 00:47:58.820 |
Sort of supervised learning has been used very effectively 00:48:06.540 |
Reinforcement learning kind of represents the dream of AI. 00:48:15.220 |
in sort of captivating the imagination of people, 00:48:25.380 |
So my question comes from the more practical sense. 00:48:31.900 |
between the more general reinforcement learning 00:48:34.540 |
and the very specific, yes, it's sequential decision-making 00:48:38.780 |
with one step in the sequence of the supervised learning? 00:48:52.060 |
that we might see closing over the next couple of years, 00:48:54.700 |
is the ability of reinforcement learning algorithms 00:48:57.100 |
to effectively utilize large amounts of prior data. 00:49:00.420 |
So one of the reasons why it's a bit difficult today 00:49:03.300 |
to use reinforcement learning for all the things 00:49:10.220 |
rational decision-making, it's a little bit tough 00:49:13.060 |
to just deploy some policy that does crazy stuff 00:49:21.100 |
a lot of logs of some other policy that you've got, 00:49:23.980 |
and then maybe if you can get a good policy out of that, 00:49:27.620 |
then you deploy it and let it kind of fine tune 00:49:30.500 |
But algorithmically, it's quite difficult to do that. 00:49:33.340 |
So I think that once we figure out how to get 00:49:36.180 |
reinforcement learning to bootstrap effectively 00:49:49.820 |
And I think we're seeing a lot of research right now 00:49:52.260 |
that's bringing us closer and closer to that. 00:49:54.580 |
- Can you maybe paint the picture of the different methods? 00:50:05.220 |
What are the different categories of reinforcement learning? 00:50:08.060 |
- So one way we can think about reinforcement learning 00:50:23.940 |
And you do that, of course, from experience, from data. 00:50:28.300 |
So you build a model that answers these what-if questions, 00:50:31.860 |
use it to figure out the best action you can take, 00:50:35.220 |
and see if the outcome agrees with what you predicted. 00:50:41.700 |
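[The "what if I do X?" loop can be sketched roughly as below; the gym-style `env` interface and the `model.predict`/`model.update` methods are assumptions for illustration, not a particular algorithm from this discussion.]

```python
def what_if_loop(env, model, score_outcome, n_steps=200, n_candidates=32):
    """Sketch: answer what-if questions with a learned model, act, then check the answer."""
    obs = env.reset()
    for _ in range(n_steps):
        # Ask the model "what happens if I take this action?" for a few candidates.
        candidates = [env.action_space.sample() for _ in range(n_candidates)]
        predictions = [model.predict(obs, a) for a in candidates]
        best = max(range(n_candidates), key=lambda i: score_outcome(predictions[i]))
        # Take the best-looking action and see if the outcome agrees with the prediction.
        next_obs, reward, done, info = env.step(candidates[best])
        model.update(obs, candidates[best], next_obs)
        obs = env.reset() if done else next_obs
```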
basically refer to different ways of doing it. 00:50:50.820 |
Value-based methods, they answer the question 00:50:57.060 |
But in a sense, they're not really all that different 00:51:06.340 |
answering what-if questions can be really hard 00:51:15.700 |
You would just repeat the thing that worked before. 00:51:18.900 |
And that's really a big part of why RL is a little bit tough. 00:51:23.340 |
So if you have a purely on-policy online process, 00:51:31.060 |
then you go and try doing those mistaken things, 00:51:35.460 |
that'll teach you not to do those things again. 00:51:39.940 |
and you just want to synthesize the best policy you can 00:51:43.740 |
then you really have to deal with the challenges 00:51:49.900 |
- Yeah, a policy is a model or some kind of function 00:51:54.900 |
that maps from observations of the world to actions. 00:52:06.300 |
So we say the state kind of encompasses everything 00:52:08.020 |
you need to fully define where the world is at at the moment. 00:52:11.100 |
And depending on how we formulate the problem, 00:52:16.940 |
which is some snapshot, some piece of the state. 00:52:29.020 |
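[Concretely, a policy can be as simple as the sketch below: a parameterized function from an observation vector to an action vector. The linear-Gaussian form, dimensions, and noise level are placeholder assumptions, not anything specific to Levine's work.]

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearGaussianPolicy:
    """A policy: a (possibly stochastic) mapping from observations to actions."""

    def __init__(self, obs_dim, act_dim, noise_std=0.1):
        self.W = rng.normal(scale=0.01, size=(act_dim, obs_dim))
        self.b = np.zeros(act_dim)
        self.noise_std = noise_std

    def act(self, obs):
        mean = self.W @ obs + self.b                                 # deterministic part
        return mean + self.noise_std * rng.normal(size=mean.shape)   # exploration noise

policy = LinearGaussianPolicy(obs_dim=4, act_dim=2)
action = policy.act(np.zeros(4))  # e.g. torques for a two-joint arm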
- So yeah, so the terms on-policy and off-policy 00:52:37.220 |
maybe you get your data from some manually programmed system 00:52:46.540 |
But if you got the data by actually acting in the world 00:52:48.980 |
based on what your current policy thinks is good, 00:52:53.260 |
And obviously on-policy data is more useful to you 00:52:55.780 |
because if your current policy makes some bad decisions, 00:52:59.300 |
you will actually see that those decisions are bad. 00:53:01.740 |
Off-policy data, however, might be much easier to obtain 00:53:08.580 |
- So we talk about, offline talked about autonomous vehicles 00:53:12.940 |
so you can envision off-policy kind of approaches 00:53:18.420 |
ton of robots out there, but they don't get the luxury 00:53:38.420 |
people have made a little bit of progress on that. 00:53:44.260 |
but I can tell you some of the things that, for example, 00:53:46.380 |
we've done to try to address some of the challenges. 00:53:57.100 |
to give accurate predictions for any possible action. 00:54:00.140 |
So if I've never tried to, if in my data set, 00:54:10.100 |
is probably not going to predict the right thing 00:54:11.860 |
if I ask what would happen if I were to steer the car 00:54:15.540 |
So one of the important things you have to do 00:54:25.260 |
And you can use kind of distribution estimation methods, 00:54:32.180 |
So you could figure out that, well, this action, 00:54:44.180 |
that will essentially tell you not to ask those questions 00:54:50.620 |
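[One crude way to picture this distribution-estimation idea: fit a density model to the actions in the logged data and penalize candidate actions the data says little about, so the learner avoids "asking those questions." This is only an illustrative sketch, with a single Gaussian over actions, not the specific method used in Levine's research.]

```python
import numpy as np

def conservative_values(q_values, candidate_actions, logged_actions, penalty=10.0):
    """Down-weight Q-values of actions that are unlikely under the logged data."""
    mu = logged_actions.mean(axis=0)
    sigma = logged_actions.std(axis=0) + 1e-6
    # Log-likelihood of each candidate action under a Gaussian fit to the data.
    log_prob = -0.5 * (((candidate_actions - mu) / sigma) ** 2
                       + np.log(2.0 * np.pi * sigma ** 2)).sum(axis=1)
    # Actions far outside the data distribution get a large penalty.
    return q_values + penalty * np.minimum(log_prob - log_prob.max(), 0.0)
```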
- What would lead to breakthroughs in this space, 00:54:57.580 |
Do we need to collect big benchmark data sets 00:55:09.620 |
Or maybe coming together in a space of robotics 00:55:12.260 |
and defining the right problem to be working on. 00:55:15.100 |
- I think for off-policy reinforcement learning 00:55:24.860 |
that that just takes some very smart people to get together 00:55:28.980 |
Whereas if it was like a data problem or hardware problem, 00:55:34.620 |
So that's why I'm pretty excited about that problem 00:55:44.940 |
the problems at their core are very related to problems 00:55:51.420 |
Because what you're really dealing with is situations 00:56:00.180 |
And if it's a model that's generalizing properly, 00:56:04.660 |
If it's a model that picks up on spurious correlations 00:56:08.820 |
And then you have an arsenal of tools you could use. 00:56:15.580 |
you could try to make it generalize better somehow, 00:56:24.940 |
where most of it, like 90, 95% is off policy, 00:56:35.580 |
Like, what's that role of mixing them together? 00:56:39.980 |
I think that this is something that you actually 00:56:43.140 |
described very well at the beginning of our discussion 00:56:51.580 |
You'd use that for off-policy reinforcement learning. 00:56:54.020 |
And then, of course, if you've never, you know, 00:57:00.300 |
then you have to go out and fiddle with it a little bit, 00:57:12.700 |
Or is there, what's, and maybe taking a step back, 00:57:25.620 |
- In general, I actually think that one of the things 00:57:30.660 |
that is a very beautiful idea in reinforcement learning 00:57:32.980 |
is just the idea that you can obtain a near-optimal control 00:57:37.980 |
or a near-optimal policy without actually having 00:57:55.700 |
but from a control's perspective, it's a very weird thing 00:57:58.180 |
because classically, you know, we think about 00:58:03.020 |
engineered systems and controlling engineered systems 00:58:05.660 |
as the problem of writing down some equations 00:58:08.460 |
and then figuring out, given these equations, 00:58:11.780 |
figure out the thing that maximizes its performance. 00:58:18.900 |
actually gives us a mathematically principled framework 00:58:21.340 |
to reason about, you know, optimizing some quantity 00:58:28.740 |
And that, I don't know, to me, that actually seems 00:58:34.020 |
not something that sort of becomes immediately obvious, 00:58:40.060 |
- Does it make sense to you that it works at all? 00:58:42.420 |
- Well, I think it makes sense when you take some time 00:58:46.700 |
to think about it, but it is a little surprising. 00:58:53.060 |
deeper representations, which is also very surprising, 00:59:01.740 |
the space of environments that this kind of approach 00:59:10.220 |
- Well, deep reinforcement learning simply refers 00:59:18.340 |
neural net representations, which is, you know, 00:59:21.460 |
kind of, it might at first seem like a pretty arbitrary 00:59:26.340 |
But the reason that it's something that has become 00:59:29.900 |
so important in recent years is that reinforcement learning, 00:59:35.100 |
it kind of faces an exacerbated version of a problem 00:59:38.020 |
that has faced many other machine learning techniques. 00:59:39.980 |
So if we go back to like, you know, the early 2000s 00:59:46.740 |
on machine learning methods that have some very appealing 00:59:52.380 |
the convex optimization problems, for instance, 01:00:07.580 |
So they have some kind of good representation 01:00:12.420 |
And for a long time, people were very worried 01:00:14.060 |
about features in the world of supervised learning 01:00:15.820 |
because somebody had to actually build those features. 01:00:18.140 |
So you couldn't just take an image and plug it 01:00:19.940 |
into your logistic regression or your SVM or something. 01:00:22.740 |
Someone had to take that image and process it 01:00:29.700 |
And suddenly we could apply learning directly 01:00:32.140 |
to the raw inputs, which was great for images, 01:00:34.780 |
but it was even more great for all the other fields 01:00:37.540 |
where people hadn't come up with good features yet. 01:00:39.860 |
And one of those fields is actually reinforcement learning 01:00:43.300 |
the notion of features, if you don't use neural nets 01:01:00.780 |
Like, I don't even know how to start thinking about it. 01:01:05.380 |
an expert chess player looks for whether the knight 01:01:09.140 |
So that's a feature, is knight in middle of board? 01:01:15.820 |
And that was really kind of getting us nowhere. 01:01:17.420 |
- And that's a little, chess is a little more accessible 01:01:48.260 |
but it feels hopeful that the control problem 01:01:57.700 |
is it surprising to you how far the deep side 01:02:03.220 |
like what the space of problems has been able to tackle 01:02:10.940 |
and AlphaZero and just the representation power there 01:02:21.780 |
of this representation power and the control context? 01:02:26.180 |
- I think that in regard to the limits that here, 01:02:30.100 |
I think that one thing that makes it a little hard 01:02:33.660 |
to fully answer this question is because in settings 01:02:38.660 |
where we would like to push these things to the limit, 01:02:53.580 |
it's not because its neural net is not big enough. 01:02:56.040 |
It's because when you try to actually do trial 01:02:59.700 |
and error learning, reinforcement learning directly 01:03:03.140 |
in the real world, where you have the potential 01:03:05.120 |
to gather these large, highly varied and complex datasets, 01:03:13.780 |
it'll first sound like a very pragmatic problem, 01:03:16.860 |
but it actually turns out to be a pretty deep scientific 01:03:18.540 |
problem, take the robot, put it in your kitchen, 01:03:20.820 |
have it try to learn to do the dishes with trial and error, 01:03:27.060 |
Now you might think this is a very practical issue, 01:03:30.020 |
which is that if you have a person trying to do this, 01:03:32.300 |
a person will have some degree of common sense, 01:03:35.180 |
they'll be a little more careful with the next one. 01:03:38.060 |
they're gonna go and get more or something like that. 01:03:42.900 |
that comes very naturally to us for our learning process, 01:03:46.720 |
like if I have to learn something through trial and error, 01:03:49.780 |
I have the common sense to know that I have to try multiple 01:03:52.660 |
times, if I screw something up, I ask for help, 01:03:57.360 |
And all of that is kind of outside of the classic 01:04:02.020 |
There are other things that can also be categorized 01:04:05.060 |
as kind of scaffolding, but are very important. 01:04:07.300 |
Like for example, where do you get your reward function? 01:04:13.460 |
well, how do I know if I've done it correctly? 01:04:15.300 |
Now that probably requires an entire computer vision system 01:04:24.560 |
what we really need to get reinforcement learning 01:04:28.360 |
And I think that many of these things actually suggest 01:04:30.920 |
a little bit of a shortcoming in the problem formulation 01:04:33.440 |
and a few deeper questions that we have to resolve. 01:04:37.000 |
I talked to like David Silver about AlphaZero, 01:04:47.820 |
in the context when there's no broken dishes. 01:04:54.940 |
So again, like the bottleneck is the amount of money 01:05:00.840 |
and then maybe the different, the scaffolding around 01:05:09.980 |
Now we move to the real world and there's the broken dishes, 01:05:12.540 |
there's all the, and the reward function like you mentioned. 01:05:19.860 |
Do you think, there's this kind of sample efficiency 01:05:38.100 |
How do we, how do we not break too many dishes? 01:05:41.180 |
- Yeah, well, one way we can think about that is that 01:05:44.780 |
maybe we need to be better at reusing our data, 01:06:04.360 |
can just master complex tasks in like in minutes, 01:06:09.780 |
Perhaps what it really needs to do is have an existence, 01:06:17.020 |
prepare it to do new things more efficiently. 01:06:20.040 |
And, you know, the study of these kinds of questions 01:06:22.900 |
typically falls under categories like multitask learning 01:06:25.580 |
or meta learning, but they all fundamentally deal 01:06:28.260 |
with the same general theme, which is use experience 01:06:32.580 |
for doing other things to learn to do new things 01:06:38.900 |
if you just look at one particular case study 01:06:41.220 |
of a Tesla autopilot that has quickly approaching 01:06:47.460 |
where some percentage of the time, 30, 40% of the time 01:07:06.180 |
From the human side, how can we use that data? 01:07:14.100 |
Do you have ideas in this autonomous vehicle space 01:07:17.820 |
You know, it's a safety critical environment. 01:07:23.860 |
- So I think that actually the kind of problems 01:07:28.020 |
that come up when we want systems that are reliable 01:07:36.660 |
they're actually very similar to the kind of problems 01:07:46.140 |
when you can trust the predictions of your model, 01:07:50.940 |
some pattern of behavior for which your model 01:07:54.020 |
then you shouldn't use that to modify your policy. 01:07:57.260 |
And it's actually very similar to the problem 01:07:58.500 |
that we're faced when we actually then deploy that thing 01:08:07.760 |
And that's a very deep research question, of course, 01:08:10.160 |
but it's also a question that a lot of people are working on. 01:08:11.700 |
So I'm pretty optimistic that we can make some progress 01:08:15.760 |
- What's the role of simulation in reinforcement learning, 01:08:18.880 |
deep reinforcement learning, reinforcement learning? 01:08:22.920 |
It's been essential for the breakthroughs so far, 01:08:32.000 |
I mean, again, this connects to our off policy discussion, 01:08:35.220 |
but do you think we can ever get rid of simulation 01:08:38.260 |
or do you think simulation will actually take over? 01:08:40.060 |
We'll create more and more realistic simulations 01:08:42.100 |
that will allow us to solve actual real world problems, 01:08:46.080 |
like transfer the models we learn in simulation 01:08:50.060 |
I think that simulation is a very pragmatic tool 01:08:57.660 |
we will need to build machines that can learn 01:09:04.580 |
Because if we can't have our machines learn from real data, 01:09:09.780 |
eventually the simulator becomes the bottleneck. 01:09:13.500 |
If your machine has any bottleneck that is built by humans 01:09:20.240 |
it will eventually be the thing that holds it back. 01:09:23.100 |
And if you're entirely reliant on your simulator, 01:09:25.940 |
If you're entirely reliant on a manually designed controller, 01:09:32.120 |
It's very pragmatic, but it's not a substitute 01:09:43.660 |
especially in the context of some of the things 01:09:51.860 |
like these are not problems that you would ever stumble on 01:09:54.820 |
when working in a purely simulated kind of environment. 01:09:59.740 |
when we try to actually run these things in the real world. 01:10:03.220 |
- To throw a brief wrench into our discussion, let me ask, 01:10:09.900 |
- Do you think that's a useful thing to even think about, 01:10:12.440 |
about the fundamental physics nature of reality? 01:10:23.300 |
is to think about how difficult is it to create 01:10:29.580 |
sort of a virtual reality game type situation 01:10:32.940 |
that will be sufficiently convincing to us humans, 01:10:36.500 |
or sufficiently enjoyable that we wouldn't wanna leave. 01:10:40.140 |
I mean, that's actually a practical engineering challenge. 01:10:43.420 |
And I personally really enjoy virtual reality, 01:10:46.220 |
but it's quite far away, but I kind of think about 01:10:49.180 |
what would it take for me to wanna spend more time 01:11:06.700 |
where a majority of the population lives in a virtual reality 01:11:09.100 |
and that's how we create the simulation, right? 01:11:11.380 |
You don't need to actually simulate the quantum gravity 01:11:23.260 |
is if we wanna make sufficiently realistic simulations 01:11:31.920 |
thereby just some of the things we've been talking about, 01:11:37.680 |
if we can create actually interesting, rich simulations. 01:11:43.640 |
casts your previous question in a very interesting light, 01:11:49.720 |
well, the more kind of practical version of this, 01:11:53.760 |
like, can we build simulators that are good enough 01:11:56.100 |
to train essentially AI systems that will work in the world? 01:12:00.600 |
And it's kind of interesting to think about this, 01:12:06.300 |
it kind of implies that it's easier to create the universe 01:12:09.980 |
And that seems like, put this way, it seems kind of weird. 01:12:14.300 |
- The aspect of the simulation most interesting to me 01:12:30.240 |
agrees with that notion, just as a quick aside, 01:12:33.600 |
what are your thoughts about when the human enters 01:12:39.800 |
How does that change the reinforcement learning problem, 01:12:45.040 |
- Yeah, I think that's a, it's a kind of a complex question. 01:12:48.560 |
And I guess my hope for a while had been that 01:12:56.880 |
that are multitask, that utilize lots of prior data 01:13:03.120 |
the bit where they have to interact with people 01:13:08.760 |
So if they have prior experience of interacting with people 01:13:13.200 |
of interacting with people for this new task, 01:13:20.520 |
And there's quite a bit of research in that area. 01:13:28.540 |
the ability to understand that other beings in the world 01:13:33.360 |
have their own goals, intentions, and thoughts, and so on, 01:13:36.240 |
whether that kind of understanding can emerge automatically 01:13:40.000 |
from simply learning to do things with and maximize utility. 01:13:49.260 |
that you don't need to explicitly inject anything 01:13:53.480 |
into the system that can be learned from the data. 01:14:00.820 |
What are the limits of what we can learn from data? 01:14:13.340 |
do you really think we can learn gravity from just data? 01:14:19.820 |
- So something that I think is a common kind of pitfall 01:14:23.720 |
when thinking about prior knowledge and learning 01:14:27.040 |
is to assume that just because we know something, 01:14:32.040 |
then that it's better to tell the machine about that 01:14:34.680 |
rather than have it figure it out on its own. 01:14:46.660 |
Like, if things, if every time you drop something, 01:14:49.340 |
it falls down, like, yeah, you might not get the, 01:14:54.220 |
not Einstein's version, but it'll be pretty good. 01:15:03.220 |
So things that are readily apparent from the data, 01:15:10.220 |
- It just feels like that there might be a space 01:15:12.440 |
of many local minima in terms of theories of this world 01:15:21.600 |
- That Newtonian mechanics is not necessarily 01:15:27.600 |
- Yeah, and well, in fact, in some fields of science, 01:15:34.080 |
So for example, if you think about how people 01:15:43.300 |
the kind of principles that serve us very well 01:15:45.680 |
in our day-to-day lives actually serve us very poorly 01:15:50.120 |
We had kind of very superstitious and weird ideas 01:15:57.920 |
So that does seem to be a failing of this approach, 01:16:01.000 |
but it's also a failing of human intelligence, arguably. 01:16:09.080 |
in reinforcement learning, sort of these competitive, 01:16:11.440 |
creating a competitive context in which agents 01:16:19.040 |
and thereby increasing each other's skill level. 01:16:21.040 |
It seems to be this kind of self-improving mechanism 01:16:34.560 |
and also can be generalized to other contexts 01:16:50.440 |
to actually generalizing it to the robotic setting 01:17:07.900 |
they can play with each other, they can play with people, 01:17:17.860 |
You have to interact with a natural environment that. 01:17:24.580 |
So the reason that self-play works for a board game 01:17:33.660 |
So the kind of intelligent behavior that will emerge 01:18:15.540 |
this question has kind of been treated as a non-issue 01:18:19.320 |
that you sort of treat the reward as this external thing 01:18:22.400 |
that comes from some other bit of your biology 01:18:30.140 |
a little bit of a mistake that we should worry about it. 01:18:33.240 |
And we can approach it in a few different ways. 01:18:36.860 |
by thinking of reward as a communication medium. 01:18:39.020 |
We can say, well, how does a person communicate 01:18:50.380 |
kind of a general objective that leads to good capability? 01:18:55.120 |
Like, for example, can you write down some objectives 01:18:56.800 |
such that even in the absence of any other task, 01:19:02.640 |
This is something that has sometimes been called 01:19:07.020 |
which I think is a really fascinating area of research, 01:19:14.840 |
we can have some notion of unsupervised reinforcement 01:19:19.840 |
learning by means of information theoretic quantities, 01:19:23.440 |
like for instance, minimizing a Bayesian measure of surprise. 01:19:34.360 |
that you can actually learn pretty interesting skills 01:19:36.920 |
by essentially behaving in a way that allows you 01:19:40.480 |
to make accurate predictions about the world. 01:19:42.440 |
It seems a little circular, like do the things 01:19:44.200 |
that will lead to you getting the right answer 01:19:48.740 |
But you can, by doing this, you can sort of discover 01:19:53.040 |
You can discover that if you're playing Tetris, 01:19:55.480 |
then correctly clearing the rows will let you play Tetris 01:19:58.880 |
for longer and keep the board nice and clean, 01:20:00.600 |
which sort of satisfies some desire for order in the world. 01:20:08.760 |
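[A tiny sketch of this kind of unsupervised objective: with no task reward at all, the agent is rewarded for landing in states its own dynamics model predicts accurately, i.e., for minimizing surprise. The mean-squared-error form here is an illustrative stand-in for a proper Bayesian surprise measure.]

```python
import numpy as np

def surprise_minimizing_reward(predicted_next_obs, actual_next_obs):
    """Intrinsic reward: negative prediction error of the agent's own world model."""
    surprise = float(np.mean((predicted_next_obs - actual_next_obs) ** 2))
    return -surprise  # higher reward when the world behaved as predicted
```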
- Is there a role for a human notion of curiosity 01:20:12.560 |
in itself being the reward, sort of discovering new things 01:20:18.480 |
- So one of the things that I'm pretty interested in 01:20:27.800 |
of some other objective that quantifies capability. 01:20:33.160 |
maybe it might not by itself be the right answer, 01:20:45.840 |
but I don't have a clear answer for you there yet. 01:20:52.000 |
to see sort of creative patterns of curiosity 01:21:03.880 |
- Is there ways to understand or anticipate unexpected, 01:21:09.800 |
unintended consequences of particular reward functions, 01:21:21.920 |
and try to avoid highly detrimental strategies? 01:21:28.640 |
that has been pretty hard in reinforcement learning 01:21:40.200 |
One way to mitigate it is to actually define an objective 01:21:46.920 |
You can say just like, don't enter situations 01:21:49.360 |
that have low probability under the distribution of states 01:21:54.640 |
It turns out that that's actually one very good way 01:21:56.400 |
to do off-policy reinforcement learning actually. 01:22:01.200 |
- If we slowly venture in speaking about reward functions 01:22:07.000 |
into greater and greater levels of intelligence, 01:22:09.240 |
there's, I mean, Stuart Russell thinks about this, 01:22:18.040 |
So how do we ensure that AGI systems align with us humans? 01:22:34.680 |
with the broader intended success interest of human beings. 01:22:43.360 |
of where reinforcement learning fits into this? 01:22:45.240 |
Or are you really focused on the current moment 01:22:53.120 |
but, you know, and I do think that this is a problem 01:22:59.320 |
For my part, I'm actually a bit more concerned 01:23:01.800 |
about the other side of this equation that, you know, 01:23:21.160 |
when we, for instance, try to use these techniques 01:23:23.960 |
for safety critical systems like cars and aircraft and so on. 01:23:34.480 |
to face the issue of them not being optimized well enough. 01:23:37.000 |
- But you don't think unintended consequences can arise 01:23:49.400 |
for improving reliability, safety, and things like that 01:23:52.840 |
is more with systems that like need to work better, 01:23:56.480 |
that need to optimize their objective better. 01:23:58.920 |
- Do you have thoughts, concerns about existential threats 01:24:04.720 |
Sort of, if we put on our hat of looking in 10, 20, 100, 01:24:15.640 |
- I think there are absolutely existential threats 01:24:28.720 |
will come down to people with nefarious intent. 01:24:38.800 |
And some of them will, of course, come down to AI systems 01:24:51.920 |
and principally the one with nefarious humans, 01:24:55.840 |
actually it's the nefarious humans that have been 01:25:01.360 |
And I think that right now the best that I can do 01:25:09.760 |
to promote responsible use of that technology. 01:25:12.120 |
- Do you think RL systems has something to teach us, humans? 01:25:18.720 |
You said nefarious humans getting us in trouble. 01:25:21.120 |
I mean, machine learning systems have in some ways 01:25:23.840 |
have revealed to us the ethical flaws in our data. 01:25:28.200 |
In that same kind of way, can reinforcement learning 01:25:41.080 |
- I'm not sure what I've learned about myself, 01:25:44.680 |
but maybe part of the answer to your question 01:25:54.520 |
of reinforcement learning for decision-making support 01:26:03.360 |
And I think we will see some interesting stuff emerge there. 01:26:06.680 |
We will see, for instance, what kind of behaviors 01:26:21.720 |
we'll see some interesting stuff come out in that area. 01:26:25.360 |
'cause the exciting space where this could be observed 01:26:28.880 |
is sort of large companies that deal with large data. 01:26:36.720 |
when I look at social networks and just online 01:26:45.080 |
And that'd be interesting from a research perspective 01:26:57.880 |
about the behavior of these AI systems in the real world. 01:27:21.680 |
So basically don't try to do any kind of fancy algorithms, 01:27:31.160 |
I think the high level idea makes a lot of sense. 01:27:50.600 |
the acquisition of experience in the real world 01:27:58.640 |
So if the claim is that automated general methods 01:28:06.440 |
then it makes sense that we should build general methods 01:28:09.840 |
that we can deploy and get them to go out there 01:28:11.560 |
and like collect their experience autonomously. 01:28:23.480 |
which is easy to do in a simulated board game, 01:28:27.720 |
- Yeah, it keeps coming back to this one problem, right? 01:28:30.520 |
So your mind is focused there now in this real world. 01:28:35.760 |
It just seems scary, this step of collecting the data. 01:28:40.480 |
And it seems unclear to me how we can do it effectively. 01:28:45.200 |
- Yeah, well, you know, seven billion people in the world, 01:28:48.280 |
each of them have to do that at some point in their lives. 01:28:54.840 |
We should be able to try to collect that kind of data. 01:29:05.280 |
would book or books, technical or fiction or philosophical, 01:29:10.280 |
had a big impact on the way you saw the world, 01:29:22.120 |
would you recommend people consider reading on their own? 01:29:31.520 |
- I don't know if this is like a scientifically, 01:29:39.280 |
but like the honest answer is that I actually found 01:29:50.840 |
- You don't think it had a ripple effect in your life? 01:29:55.160 |
But yeah, I think that a vision of a future where, 01:30:07.080 |
artificial robotic systems have kind of a big place, 01:30:12.480 |
And where we try to imagine the sort of the limiting case 01:30:17.480 |
of technological advancement and how that might play out 01:30:23.680 |
But yeah, I think that that was in some way influential. 01:30:28.680 |
I don't really know how, but I would recommend it. 01:30:33.040 |
I mean, if nothing else, you'd be well entertained. 01:30:35.440 |
- When did you first yourself like fall in love 01:31:00.760 |
it just wasn't really high on my priority list 01:31:05.640 |
where we're going to see very substantial advances 01:31:14.360 |
the time when I really decided I wanted to work on this 01:31:29.040 |
But one of the things that really resonated with me 01:31:33.640 |
well, he used to have graduate students come to him 01:31:39.280 |
and give them some math problem to deal with. 01:31:41.360 |
But now he's actually thinking that this is an area 01:31:55.320 |
when someone who had been working on that kind of stuff 01:32:18.440 |
in machine learning or reinforcement learning, 01:32:21.120 |
what advice would you give to maybe an undergraduate student 01:32:27.800 |
and further on, what are the steps to take on that journey? 01:32:32.680 |
- So something that I think is important to do 01:32:42.960 |
imagining the kind of outcome that you might like to see. 01:32:46.160 |
So, you know, one outcome might be a successful career, 01:32:50.920 |
or state-of-the-art results on some benchmark, 01:32:54.760 |
that's like the main driving force for somebody. 01:33:03.040 |
like takes a little while, sits down and thinks like, 01:33:11.040 |
what I want to see a natural language system? 01:33:18.200 |
or like something that you'd like to see in the world, 01:33:21.080 |
and then actually sit down and think about the steps 01:33:24.960 |
And hopefully that thing is not a better number 01:33:28.760 |
It's like, it's probably like an actual thing 01:33:30.600 |
that we can't do today that would be really awesome. 01:33:36.120 |
a really awesome healthcare decision-making support system, 01:34:00.480 |
and you just give an advice on looking forward 01:34:04.280 |
what kind of change you would like to make in the world. 01:34:13.280 |
What gives you fulfillment, purpose, happiness, and meaning? 01:34:21.800 |
- What's the reward function under which you're operating? 01:34:27.520 |
- Yeah, I think one thing that does give, you know, 01:34:35.120 |
that I'm working on a problem that really matters. 01:34:41.680 |
but it's quite nice to take things to spend my time on 01:34:54.640 |
but if you're successful, what does that look like? 01:35:01.880 |
Now, of course, success is built on top of success, 01:35:05.620 |
and you keep going forever, but what is the dream? 01:35:15.760 |
is to see machines that actually get better and better 01:35:26.880 |
that we have today, but I think we really don't. 01:35:37.880 |
all of the machines that we've been able to build 01:35:40.160 |
don't sort of improve up to the limit of that complexity. 01:35:45.520 |
Maybe they hit a wall because they're in a simulator 01:35:52.240 |
or they hit a wall because they rely on a labeled dataset, 01:36:02.440 |
that can go as far as possible in that regard. 01:36:09.400 |
- Well, I don't think there's a better way to end it, Sergey. 01:36:16.240 |
that you have to publish in the education space 01:36:27.280 |
with Sergey Levine, and thank you to our sponsors, 01:36:35.520 |
by downloading Cash App and using code LexPodcast 01:36:52.280 |
If you enjoy this thing, subscribe on YouTube, 01:36:57.060 |
support on Patreon, or connect with me on Twitter 01:37:04.160 |
without using the letter E, just F-R-I-D-M-A-N. 01:37:12.520 |
Intelligence without ambition is a bird without wings. 01:37:17.660 |
Thank you for listening, and hope to see you next time.