Drago Anguelov (Waymo) - MIT Self-Driving Cars
Chapters
0:00 Introduction
0:47 Background
1:31 Waymo story (2009 to today)
4:31 Long tail of events
8:55 Perception, prediction, and planning
14:54 Machine learning at scale
26:43 Addressing the limits of machine learning
29:38 Large-scale testing
50:51 Scaling to dozens and hundreds of cities
54:35 Q&A
00:00:10.080 |
Aside from having the coolest name in autonomous driving, 00:00:16.400 |
in developing and applying machine learning methods 00:00:20.320 |
and more generally in computer vision and robotics. 00:00:22.440 |
He's now helping Waymo lead the world in autonomous driving. 00:00:26.320 |
10 plus million miles achieved autonomously to date, 00:00:34.040 |
So it's exciting to have Drago here with us to speak. 00:00:54.400 |
"Taming the Long Tail of Autonomous Driving Challenges." 00:01:04.120 |
and worked closely with one of the pioneers in the space, 00:01:09.080 |
I spent eight years at Google doing research on perception, 00:01:19.320 |
I was heading the 3D perception team at Zoox. 00:01:22.520 |
We built another perception system for autonomous driving, 00:01:25.840 |
and I've been leading the research team at Waymo 00:01:30.400 |
So I want to tell you a little bit about Waymo when we start. 00:01:34.880 |
Waymo actually this month has its 10-year anniversary. 00:02:10.120 |
in what fully driverless mobility would be like. 00:02:38.560 |
we launched a fleet of fully self-driving vehicles 00:02:57.320 |
for what a fully driverless experience is like. 00:03:39.920 |
Last year, we launched our first commercial service 00:03:48.920 |
It can come pick them up and help them with errands 00:03:52.760 |
And we've been already learning a lot from these customers 00:03:55.680 |
and we are looking to grow and expand the service 00:04:03.440 |
we have driven 10 million miles on public roads, 00:04:07.440 |
driverlessly, and many more with human drivers 00:04:16.360 |
And we've driven in all kinds of scenarios, cities, 00:04:39.800 |
And I guess all the problems that come with this 00:04:44.680 |
how Waymo has been thinking about these issues. 00:04:57.880 |
And so when you think about self-driving vehicles, 00:05:05.520 |
It needs to be able to handle the entire task of driving. 00:05:10.680 |
and remove the human operator from the vehicle. 00:05:26.160 |
the question is, well, how many of these capabilities 00:05:28.360 |
and how many scenarios do you really need to handle? 00:05:30.920 |
Well, it turns out, well, the world is quite diverse 00:05:35.440 |
and complicated and there are a lot of rare situations 00:05:48.440 |
It's one type of effort to get your self-driving working 00:05:52.000 |
for the common cases, and then it's another effort 00:05:55.000 |
to tame the rest, and they really, really matter. 00:06:03.080 |
For example, this is us driving in the street 00:06:06.280 |
and let's see if you can tell what is unusual in this video. 00:06:14.880 |
So there's a bicyclist and he's carrying a stop sign. 00:06:25.120 |
but it's certainly not a stop sign we need to stop for, 00:06:34.280 |
This is another case where we are happily staying there 00:06:38.520 |
and then the vehicle stops and a big pile of poles 00:07:05.040 |
and this is our vehicle correctly identifying 00:07:08.760 |
between all of these cones and successfully executing it. 00:07:14.920 |
And this is something that happens fairly often 00:07:28.420 |
I think you can understand what happened here. 00:07:33.060 |
And you can notice actually, so we hear the siren. 00:07:42.260 |
we hear it and stop and some guys are much later than us 00:07:49.540 |
And here's another scenario potentially I want to show you. 00:07:56.740 |
Let's see if you can understand what happened. 00:08:10.820 |
we're about to go and someone goes at high speed 00:08:16.820 |
Right, and we successfully stop and prevent issues. 00:08:21.820 |
Right, and so sometimes you have the rules of the road 00:08:25.820 |
and you have your road and people don't always abide by them 00:08:29.340 |
and that's something that you don't want to just 00:08:34.420 |
So hopefully with this I convinced you that the situations 00:08:43.780 |
And I want to take you a little bit on the tour 00:08:45.740 |
of what makes this challenging and then tell you 00:08:51.100 |
And so to do this, we're gonna delve a little bit more 00:08:56.900 |
which is perception, prediction, and planning. 00:08:59.860 |
And so I'll tell you a little bit about those. 00:09:02.420 |
Right, and perception, these are the core AI aspects 00:09:08.380 |
These tasks, there's others, we can talk about others 00:09:10.780 |
as well in a little bit, but let's focus on these first. 00:09:15.380 |
and potentially prior knowledge of the environment 00:09:19.380 |
And that scene representation can contain objects, 00:09:27.060 |
you can learn about object relationships and so on. 00:09:29.620 |
And perception, the space of things you need to handle 00:09:34.700 |
in perception is fairly hard, it's a complex mapping. 00:09:54.500 |
there are a bunch of people dressed as dinosaurs 00:09:56.280 |
in this case, people generally are fairly creative 00:10:04.220 |
people come in different poses, and we have seen it all. 00:10:09.860 |
There's different environments that these objects appear in. 00:10:14.560 |
So there are times of day, seasons, day, night, 00:10:25.520 |
And then there's a different variability axis, 00:10:28.520 |
and this is a little more, slightly more abstract. 00:10:30.840 |
There are different objects that can come in this environment 00:10:50.200 |
Because I just want to show you the space, right? 00:10:55.920 |
in most environments in most reasonable configurations, 00:11:00.520 |
from the sensor inputs to a representation that makes sense, 00:11:13.640 |
you need to be able to anticipate and predict 00:11:16.200 |
what some of the actors in the world are going to do, 00:11:20.200 |
and people are honestly what makes driving quite challenging. 00:11:26.520 |
it's, you know, a vehicle needs to be out there 00:11:28.840 |
and be a full-fledged traffic scene participant. 00:11:35.920 |
so sometimes when you want to make a decision, 00:11:40.120 |
it does not interfere with what anyone else is going to do, 00:11:43.080 |
and it can go from one second to maybe 10 seconds or more, 00:12:13.360 |
And of course there's subtle appearance cues, 00:12:18.000 |
so for example, if a person's watching our vehicle 00:12:21.520 |
we can be fairly confident they're paying attention 00:12:23.840 |
and not going to do anything particularly dangerous. 00:12:28.840 |
If someone's not paying attention or being distracted, 00:12:31.040 |
or there is a person in the car waving at us, 00:12:35.760 |
various gestures, cues, the blinkers on the vehicles, 00:12:47.160 |
even when you predict how other agents behave, 00:13:08.080 |
So here's a case, our Waymo vehicle is driving, 00:13:21.280 |
that as they bike, they will go around the car, 00:13:29.560 |
This is the prediction, our most likely prediction 00:13:46.960 |
typically ends up in control commands to the vehicle, 00:13:54.240 |
that ultimately has several properties to it, 00:14:19.480 |
So you need to trade all of these in a reasonable way. 00:14:29.080 |
This is a complex, I think, school gathering. 00:14:37.000 |
a bunch of pedestrians, and we need to make progress, 00:14:50.240 |
to the dense urban environments, being able to do this. 00:14:57.800 |
I think when you have these complicated models and systems, 00:15:01.100 |
machine learning is a really great tool to model 00:15:05.360 |
complex actions, complex mapping functions, features. 00:15:11.520 |
Right, and so we're going to learn our system. 00:15:17.080 |
So obviously, this is now a machine learning revolution. 00:15:37.640 |
Right, and I'll tell you a little more on this how. 00:15:42.080 |
So I have this allegory about machine learning 00:15:52.960 |
Early machine learning systems also can be a bit classical. 00:15:57.640 |
You have your tools and you need to build this product. 00:16:03.520 |
And it can fairly quickly get something reasonable. 00:16:06.420 |
But then it's harder to change, it's harder to evolve. 00:16:08.840 |
If you learn new things, now you need to go back 00:16:14.920 |
and it starts becoming, the more complicated the product 00:16:19.700 |
And machine learning, modern machine learning, 00:16:24.200 |
Right, so machine learning, you build the factory, 00:16:28.600 |
which is the machine learning infrastructure. 00:16:34.660 |
and get nice models that solve your problems, right? 00:16:37.480 |
And so, kind of infrastructure is at the heart 00:16:49.180 |
Just keep the right data, keep feeding the machine, 00:16:53.080 |
So what is an ML factory for self-driving models? 00:17:03.720 |
We have a software release, we put it on the vehicle, 00:17:07.280 |
We drive, we collect data, we collect it and we store it. 00:17:19.480 |
that we find interesting and that's a knowledge 00:17:31.200 |
And then what we're going to do is we're gonna train 00:17:36.500 |
After we have the models, we will do testing and validation, 00:17:39.300 |
validate that they're good to put on our vehicles. 00:17:42.140 |
And once they're good to put on our vehicles, 00:17:45.260 |
And then the process starts going again and again. 00:17:48.780 |
So you collect more data, now you select new data 00:17:52.960 |
You add it to your data set, you keep training the model. 00:17:56.920 |
And iterate, iterate, iterate, it's a nice scalable setup. 00:18:07.360 |
And at Waymo, we have the beautiful advantage 00:18:15.180 |
And I'll tell you a bit about its ingredients 00:18:20.340 |
So ingredient one is compute and software infrastructure 00:18:25.020 |
and we are able to, first of all, leverage TensorFlow, 00:18:31.300 |
We have access to the experts that wrote TensorFlow 00:18:35.660 |
We have data centers to run large-scale parallel compute 00:18:40.700 |
We have specialized hardware for training models, 00:18:42.880 |
which make it cheaper and more affordable and faster 00:18:51.960 |
We have the scale to collect and store hundreds 00:18:55.540 |
of thousands of miles, up to millions of miles. 00:18:58.420 |
And just collecting and storing 10 million miles 00:19:07.180 |
because there is a decreasing utility to the data. 00:19:10.900 |
So most of the data comes from common scenarios 00:19:17.020 |
So it's really important how you select the data. 00:19:20.260 |
And so this is the important part of this pipeline. 00:19:22.180 |
So while you're running a release on the vehicle, 00:19:25.740 |
we have a bunch of understanding about the world, 00:19:41.740 |
we need to be very careful how to select data. 00:19:45.100 |
that are interesting in some way and complement, 00:19:49.780 |
that we potentially may not be doing so well on. 00:20:02.020 |
look for parts of your system which are uncertain 00:20:04.420 |
or inconsistent over time and go and label those cases. 00:20:08.640 |
Last but not least, we also produce auto labels. 00:20:13.780 |
Well, when you collect data, you also see the future 00:20:19.500 |
And so because of that, now knowing the past and the future, 00:20:30.140 |
And so you need to do all of this as part of the system. 00:20:34.580 |
Ingredient number three, high quality models. 00:20:37.340 |
We're part of larger Alphabet and Google and DeepMind 00:20:49.020 |
I happened to have the chance to be there at the time. 00:20:51.660 |
It was 2013 when I got on to do deep learning 00:20:57.340 |
and we were there working on it earlier than most people. 00:21:06.500 |
we invented neural net architectures like Inception, 00:21:14.500 |
and in object detection, a fast object detector called SSD. 00:21:21.860 |
Google and DeepMind are leaders in perception 00:21:32.260 |
The object detection of course goes without saying. 00:21:34.420 |
And so we collaborate with Google and DeepMind 00:21:39.120 |
And so this is my factory for self-driving models 00:22:01.500 |
and adjusting architectures of neural networks. 00:22:08.260 |
So there is a team at Google working on AutoML, 00:22:14.200 |
And usually networks themselves have complex architecture. 00:22:21.020 |
And sometimes we have very high latency constraints 00:22:23.600 |
in the models, we have some compute constraints. 00:22:27.860 |
It often takes people months to find the right architecture 00:22:30.980 |
that's most performant, low latency and so on. 00:22:33.440 |
And so there's a way to offload this work to the machines. 00:22:43.360 |
that's both low latency and high performance. 00:22:50.060 |
and as we keep collecting data and finding new cities 00:22:52.860 |
or new examples, the architectures may change 00:22:55.740 |
and we want to easily find that and keep evolving that 00:23:01.940 |
and they had a strong work where they invented, 00:23:05.340 |
well, they developed a system that searched the space 00:23:09.060 |
of architectures and found a set of components 00:23:24.580 |
And they discovered in a small vision data set, 00:23:36.040 |
So the first thing we did is we took some problems 00:23:44.540 |
So you have a map representation and some lidar points 00:23:48.940 |
and you essentially segment the lidar points. 00:24:01.620 |
we explored several hundred NAS cell combinations 00:24:17.420 |
One of them is we can find models with similar quality 00:24:24.220 |
And then there are models of a bit higher quality 00:24:31.880 |
And similar results were obtained for other problems, 00:24:41.100 |
Of course, you can also do end-to-end architecture search. 00:24:44.300 |
So there's no reason why what was found on CIFAR-10 00:24:47.740 |
is best suited for our more specialized problems. 00:24:52.180 |
And so we went about this more from the ground up. 00:24:55.660 |
So let's do a deeper search over a much larger space, 00:25:02.820 |
And so the way to do this is because our networks 00:25:36.700 |
now we train the large networks on those models 00:25:43.260 |
And so this way we can explore a much larger space 00:25:48.660 |
So on the left, this is 4,000 different models 00:26:01.700 |
than the transfer, which already leveraged their insight. 00:26:04.860 |
So then we took the learnings and the best models 00:26:07.420 |
from this search and did the second round of search, 00:26:10.220 |
which was in yellow, which allowed us to beat it. 00:26:21.980 |
And that one was able to significantly improve 00:26:24.940 |
on the red dot, which also significantly improves 00:26:48.220 |
And for some situations, we have fairly few examples as well. 00:26:52.220 |
And so there are cases where the models are uncertain 00:27:02.060 |
well, our networks just don't handle some case 00:27:05.220 |
and so we have designed our system to be robust, 00:27:12.580 |
So one part is, of course, you want redundant 00:27:19.060 |
on our vehicles, both in camera, LiDAR, and radar. 00:27:24.140 |
First of all, an object is seen in all of them. 00:27:26.740 |
Second of all, they all have different strengths 00:27:38.040 |
Also, we've designed our system to be a hybrid system. 00:28:04.180 |
with very few examples with the current state of the art. 00:28:06.500 |
And so the state of the art keeps improving, of course. 00:28:08.620 |
So there is zero-shot and one-shot learning. 00:28:15.940 |
we can also leverage expert domain knowledge. 00:28:19.860 |
So humans can help develop the right input representations. 00:28:28.260 |
to fewer parameters that already describe the task. 00:28:30.860 |
And then with that bias, it is easier to learn models 00:28:46.140 |
And so an example of what that looks like for perception is, 00:28:54.800 |
where the machine learning system may be not confident, 00:29:00.520 |
and we make sure that we drive relative to those safely. 00:29:14.480 |
and our models become more powerful, of course, improve, 00:29:23.360 |
And the set of cases that you can handle with it increases. 00:29:46.480 |
and also in getting the vehicles on the road. 00:29:48.720 |
So how do you normally develop a self-driving algorithm? 00:29:58.840 |
and you would put it on the vehicle and drive a bunch 00:30:13.980 |
And so if you do this, you're gonna wait a long time. 00:30:17.420 |
Furthermore, you don't just want to take your code 00:30:24.580 |
like you want very strongly tested code on public streets. 00:30:47.300 |
So you can select and deliberately stage safely 00:30:53.620 |
Now again, you cannot do this for all situations. 00:31:09.140 |
So we simulate the equivalent of 25,000 cars, 00:31:39.540 |
And furthermore, it goes all the way bottom up. 00:31:45.620 |
for example, slightly different segmentation or detection, 00:32:16.500 |
Of course, you can do it manually, you can create them. 00:32:20.420 |
Well, you want to leverage your driving data. 00:32:29.540 |
So you can pick interesting situations from your logs. 00:32:36.540 |
and you create variations of these situations 00:32:48.260 |
This is what happened in the real world the first time. 00:32:53.180 |
we mostly stayed in the middle lane and stopped. 00:33:48.980 |
but it's no longer safe because we changed what we did. 00:34:06.300 |
the realistic driver and pedestrian behavior. 00:34:08.540 |
So, you know, you could think of a simple model. 00:34:12.740 |
Well, what is a good proxy or what's a good approximation 00:34:19.700 |
So you just say, well, there is some normal way 00:34:24.220 |
You know, I have a reaction time and braking profile 00:34:28.500 |
so if an agent sees someone in front of them, 00:34:32.380 |
Right, so hopefully I convinced you that behavior 00:34:35.580 |
can be fairly complicated and this will not always produce 00:34:42.020 |
interactive cases such as merges, lane changes, 00:34:49.980 |
You could learn an agent from real demonstrations. 00:34:55.500 |
Well, you went and collected all this data in the world, 00:34:57.620 |
you have a bunch of information of how vehicles, 00:35:00.940 |
pedestrians behave, you can learn a model and use that. 00:35:17.340 |
And it develops a policy, it develops a reaction, 00:35:21.980 |
it's a driver agent and applies acceleration and steering, 00:35:25.460 |
then gets new sensor information, new map information, 00:35:32.020 |
And if it's our own vehicle, then you also have a router 00:35:37.340 |
well, the passenger wants you to go over there, 00:35:43.860 |
And this is an agent, it could be in simulation, 00:35:46.660 |
it could be in the real world, roughly this is the picture. 00:35:52.340 |
To its best approximation, if you learn a good policy 00:35:59.500 |
And so I'm gonna tell you a little bit about work 00:36:04.060 |
So we put a paper on arXiv about a month ago, I believe. 00:36:12.500 |
and we tried to see how well we can imitate it 00:36:26.660 |
Well, we have a good perception system at Waymo, 00:36:28.780 |
so why don't we use its products for that agent? 00:36:33.020 |
Also can simplify the input representation a bit, 00:36:41.420 |
so no need to worry about acceleration and torques, 00:36:45.980 |
Now, if you want to see in a little more detail 00:37:03.540 |
we can generate a little bit of rotation to the image 00:37:07.580 |
just so we don't over-bias the orientation a specific way. 00:37:13.860 |
so we roughly see about 60 meters in front of us 00:37:23.540 |
which is the map, like which lanes you're allowed 00:37:32.180 |
and how the traffic lights permit it or do not permit it. 00:37:39.140 |
the objects, result of your perception system, 00:37:42.380 |
you render your current vehicle where it believes it is, 00:37:50.460 |
So you give an image of where the agent's been 00:37:58.420 |
you render the intent, so the intent is where you want to go. 00:38:01.900 |
So it's conditioned on this intent and this input, 00:38:04.780 |
you want to predict the future waypoints for this vehicle. 00:38:08.440 |
And you can phrase it as a supervised learning problem. 00:38:11.240 |
Right, just learn to, learn a policy with this network 00:38:15.900 |
that approximates what you've seen in the world, 00:38:19.260 |
Of course, learning agents, there is a well-known problem, 00:38:27.780 |
by Stéphane Ross, who is actually at Waymo now, 00:38:35.500 |
so even though in each step, if you do a relatively 00:38:38.140 |
good estimate, if you string 10 steps together, 00:38:40.140 |
you can end up very different from where agents 00:38:44.160 |
Right, and there are techniques to handle this. 00:38:47.900 |
One thing we did was synthesize perturbations. 00:38:51.080 |
So you have your trajectory, and we synthetically 00:38:54.540 |
deform the trajectory and force the vehicle to learn 00:39:02.540 |
Now, if you just have direct imitation based on supervision, 00:39:06.900 |
we are trying to pass a vehicle in the street, 00:39:27.340 |
which essentially takes the past and creates memory 00:39:37.540 |
So it predicts the trajectory piecemeal in the future. 00:39:46.620 |
So we augment the network, and now the network 00:39:58.660 |
You say, hey, if you drive or generate motions 00:40:01.100 |
that take you outside the road, that's probably not good. 00:40:06.020 |
where your perception network, which takes the other object 00:40:10.300 |
and predicts their motion, so predict here our motion, 00:40:13.640 |
where the road is, and the other agent's motion 00:40:19.060 |
there's no collisions and that we stay on the road. 00:40:26.860 |
So it's not just limited to what it's explicitly seeing, 00:40:38.660 |
And you can see that we're predicting the future 00:40:46.340 |
Actually handles a lot of scenarios very well. 00:40:49.340 |
If you're interested, I welcome you to go read the paper. 00:40:52.140 |
It handles most of the simple situations fine. 00:41:11.020 |
at the stop sign happily, which is the red line over there, 00:41:16.060 |
And what we did beyond this is, we took the system, 00:41:19.280 |
as learned on imitation data, and we actually drove 00:41:23.900 |
So we took it to Castle, the Air Force Base staging grounds, 00:41:27.260 |
and this is it driving a road it's never seen before 00:41:33.500 |
We could use it also in agent simulation world, 00:41:36.120 |
and we could drive a car with it, but it has some issues. 00:41:40.900 |
So here it is driving, and then it was driving too fast, 00:41:45.900 |
so because our range is limited, it didn't know 00:41:49.300 |
it had to make a turn, and it overran the turn. 00:41:54.500 |
So, you know, one area of improvement, more range. 00:42:00.780 |
So yellow is, by the way, what we did in the real world, 00:42:05.380 |
and green is what we do in the simulation, in that example. 00:42:08.660 |
And here, we're trying to execute a complex maneuver, 00:42:13.180 |
a U-turn, we're sitting there, and we're gonna try to do it, 00:42:25.620 |
When they get really complex, this network also 00:42:32.700 |
Well, long tail came again in testing, right? 00:42:48.000 |
You want to test in the scenarios where someone 00:42:49.900 |
is obnoxious and adversarial and does something 00:43:04.620 |
It could be aggressive and conservative, right? 00:43:24.860 |
it could, in theory, learn any policy, right? 00:43:38.780 |
This is images that are 80 by 80 with multiple channels. 00:43:42.980 |
The model can have tens of millions of parameters. 00:43:45.500 |
Now, if you have an example, if you have a case 00:43:56.020 |
And so it's really good when you have a lot of examples. 00:44:11.460 |
This is, there is a lot of room to keep evolving this. 00:44:16.100 |
And then this area will keep expanding, right? 00:44:20.220 |
There is a lot of interesting questions how to do that, 00:44:24.940 |
Hopefully I get to share with you another time. 00:44:27.020 |
Something else you can do, if you remember from my slide 00:44:29.340 |
about the hybrid system, when you go to the long tail, 00:44:35.580 |
which is a simpler, biased, expert-designed input distribution 00:44:39.380 |
that is much easier to learn with few examples. 00:44:41.660 |
You can also, of course, use expert-designed models. 00:44:48.460 |
something reasonable by inputting this human knowledge. 00:44:55.280 |
You could just tune to various aspects of this distribution. 00:44:58.640 |
You can have little models for all the aspects 00:45:10.840 |
So we take inspiration from motion control theory, 00:45:14.120 |
and we want to plan a good trajectory for the vehicle, 00:45:18.360 |
the agent vehicle, and that satisfies a bunch 00:45:23.880 |
And so one insight to this is that we already know 00:45:29.760 |
what the agent did in the environment last time. 00:45:32.840 |
So you have fairly strong idea about the intent. 00:45:35.680 |
And that helps you when you specify the preferences. 00:45:38.440 |
'Cause you can say, okay, well, give me a trajectory 00:45:51.360 |
you can add these attractor potentials saying, 00:45:53.720 |
well, try to go where you used to be before, for example. 00:46:01.600 |
And of course, you can have repeller potential. 00:46:05.000 |
Don't hit things, don't run into vehicles, right? 00:46:26.700 |
Typically, we're talking a few dozen parameters or less. 00:46:41.960 |
to the trajectories you've observed in the real world. 00:46:50.840 |
And then you want to generate reasonable trajectories, 00:46:53.280 |
continuous, feasible, that satisfy this, right? 00:47:01.100 |
And so here's some agents I want to show you. 00:47:07.520 |
Two vehicles, but you can see on the left is, 00:47:29.380 |
And they induce very different reactions in our vehicle. 00:47:40.700 |
In the other case, when you have a conservative driver, 00:47:43.040 |
we are in front of them and they're not bugging us 00:47:47.200 |
We can switch into the right lane where we want to go. 00:47:50.400 |
All right, so this is agents that can test your system well. 00:47:52.680 |
Now you have different scenarios in this case, 00:48:15.000 |
It slows down for a known slow vehicle in front, 00:48:15.000 |
And you can generate multiple futures with this agent. 00:48:30.700 |
Right, and on the left was the more conservative person. 00:48:36.060 |
The aggressive guy found a gap between the two vehicles 00:48:53.780 |
So I guess what's my takeaway from this story 00:48:59.420 |
You need a menagerie of agents at the moment, right? 00:49:14.240 |
The task of modeling agent behavior is complex 00:49:23.640 |
and you can hand design trajectories for agents 00:49:26.600 |
to, for this reaction, do this, for that reaction, do that. 00:49:31.440 |
that mostly if there's someone in front of an agent 00:49:36.100 |
Trajectory optimization, which I just showed. 00:49:58.140 |
And so one other takeaway I wanted to tell you 00:50:00.980 |
is smart agents are critical for autonomy at scale. 00:50:04.620 |
This is something I truly believe working in the space. 00:50:12.700 |
that there's still a lot of interesting progress to be made. 00:50:18.480 |
When you have accurate models of human behavior, 00:50:23.900 |
first, you will make better decisions when you drive yourself. 00:50:27.140 |
You'll be able to anticipate what others will do better 00:50:30.980 |
Second, you can develop a robust simulation environment 00:50:41.660 |
It's an agent we have more control over than the others 00:50:45.820 |
And so this is very exciting and interesting. 00:50:57.560 |
that is tackling a complex AI challenge like self-driving, 00:51:01.160 |
what are the good properties of the system to have 00:51:07.920 |
We want to grow and handle and bring our service 00:51:10.680 |
to more and more environments, more and more cities. 00:51:13.120 |
How do you scale to dozens or hundreds of cities? 00:51:18.520 |
each new environment can bring new challenges. 00:51:20.860 |
And they can be complex intersections in cities like Paris. 00:51:25.100 |
There's our Lombard Street in San Francisco, I'm from there. 00:51:30.320 |
There is all kinds of, the long tail keeps coming. 00:51:36.000 |
in Pittsburgh people drive the famous Pittsburgh left. 00:51:44.600 |
all of this needs to be accounted for as you expand. 00:51:47.640 |
And this makes your system potentially more complex 00:51:49.800 |
or harder to tune to all environments. 00:51:55.960 |
So how do you, what should a scalable process do? 00:52:04.920 |
I mean, this very much parallels the factory analogy. 00:52:09.700 |
You take your vehicles, we put a bunch of Waymo cars 00:52:12.680 |
and we drive a long time in that environment with drivers. 00:52:15.680 |
Maybe 30 days, maybe more, at least that long. 00:52:22.040 |
And then your system should be able to improve a lot 00:52:34.560 |
to train the system too much in the real world 00:52:38.640 |
after you've collected data about the environment. 00:52:42.280 |
So it needs to be trainable on collected data. 00:52:44.580 |
It's very important for a system to be able to quantify 00:52:51.640 |
whether it's incorrect or not confident, right? 00:53:00.300 |
people should think of when they design systems. 00:53:06.860 |
You can ask questions to raters, that's fairly legit. 00:53:10.220 |
Typically active learning is a bit like this, right? 00:53:21.040 |
And even better, the system could potentially 00:53:24.620 |
directly update itself, and this is an interesting question. 00:53:27.980 |
How do systems update themselves in light of new knowledge? 00:53:30.600 |
And we have a system that clearly does this, right? 00:53:45.580 |
It is one answer, there is possibly others, right? 00:53:49.140 |
But one way is you can check and enforce consistency 00:53:51.700 |
of your beliefs, and you can look for explanations 00:54:05.060 |
It can improve itself on just collected data. 00:54:08.320 |
And I think it's interesting to think of systems 00:54:10.920 |
where you can do reasoning and the representations 00:54:15.340 |
And last and not least, we need scalable training 00:54:22.880 |
This is part of the fact that I was talking about. 00:54:24.940 |
I'm very lucky at Waymo to have wonderful infrastructure. 00:54:42.060 |
Thank you so much for the talk, I really appreciate it. 00:54:43.620 |
So if you were to train off of image and LiDAR data, 00:54:49.020 |
would you weight the synthetic data differently 00:54:52.900 |
than real-world data when training your models? 00:54:56.140 |
- So there's actually a lot of interesting research 00:55:04.740 |
that make simulator data look like real data. 00:55:10.320 |
So you're essentially, you're trying to build consistency, 00:55:14.520 |
or at least you're training on simulator scenarios, 00:55:16.980 |
but if you learn a mapping from simulator scenes 00:55:19.780 |
to real scenes, right, you could potentially train 00:55:27.220 |
There's many ways to do this, ultimately, right? 00:55:48.660 |
like neural network when you're not quite sure 00:55:51.460 |
what they would do, and rules where you're sure 00:55:56.620 |
- I mean, through lots and lots of testing and analysis, 00:56:15.340 |
And it's a natural process of evolution, right? 00:56:23.100 |
as the capabilities in the data sets grow, right? 00:56:26.340 |
- So you stressed at the end of both the first half 00:56:34.300 |
and the predictions that your models are making. 00:56:42.460 |
or are you using some probabilistic graphical models 00:56:46.540 |
- I mean, so a lot of the models are neural nets. 00:56:59.020 |
I think, first of all, there are techniques in neural nets 00:57:07.300 |
for certain products, or use ensembles of networks 00:57:18.260 |
Another is to leverage constraints in the environment. 00:57:24.700 |
You don't want, for example, objects to appear or disappear, 00:57:27.980 |
or generally unreasonable changes in the environment, 00:58:00.220 |
I find the simulator work really, really exciting. 00:58:02.900 |
And I was wondering if you could either talk more about, 00:58:06.820 |
or maybe provide some insights into simulating pedestrians. 00:58:12.580 |
I feel like my behavior is a lot less constrained 00:58:16.500 |
- And I imagine, I mean, there's an advantage 00:58:20.060 |
and you kind of know, your sensors are for like 00:58:21.860 |
first person from a vehicle, but not from a pedestrian. 00:58:25.620 |
I mean, so if you want to simulate pedestrians 00:58:31.500 |
and you want to simulate them at very high resolution, 00:58:36.100 |
you may not have the detailed data on that pedestrian. 00:58:39.820 |
At the same time, the subtle cues for that pedestrian 00:58:50.940 |
of what fidelity do you need to simulate things? 00:59:08.620 |
Since you, you know, titled and talked about it, 00:59:18.860 |
Do you think, well, we're gonna have this figured out 00:59:33.580 |
we've really worked out everything necessary? 00:59:39.060 |
- It's a bit hard to, that's a good question. 00:59:48.100 |
I think one thing I would say is it will take a while 00:59:50.620 |
for self-driving cars to roll out at scale, right? 00:59:56.860 |
it turn the crank and appears everywhere, right? 01:00:03.020 |
to make sure it's really safe in the various environments. 01:00:12.420 |
and saying if a person or if someone is looking at us, 01:00:15.340 |
we can assume that they will behave differently 01:00:17.300 |
than if they're not paying attention to what we're doing. 01:00:22.340 |
Do you take into consideration if pedestrians 01:00:29.660 |
- So I can't comment on our model designs too much, 01:00:35.260 |
one needs to pay attention to, they're very significant. 01:00:37.700 |
I mean, you know, even when people drive, for example, 01:00:40.980 |
there's someone sitting in the vehicle next to you waving, 01:00:44.620 |
And these are natural interactions in the environment. 01:00:47.260 |
That, you know, is something you need to think about. 01:00:56.660 |
In one of your last slides, you talked about resolving 01:01:00.140 |
certain uncertainties by the means of establishing 01:01:11.580 |
is underexplored in deep learning and what it means, right? 01:01:16.300 |
So if you read Tversky-Kahneman, Type I, Type II reasoning, 01:01:20.700 |
we're really good at the instinctive mapping type of tasks, 01:01:25.700 |
right, so like some low to mid to maybe high-level perception 01:01:32.940 |
up to a point, but the reasoning part with neural networks, 01:01:54.940 |
in connection with the models you guys are working with, 01:01:58.420 |
- So I'll give an example from current work, right? 01:02:01.060 |
And there's a lot of work on weakly supervised learning. 01:02:05.740 |
- And that's kind of been a big topic in 2018, 01:02:07.980 |
and there were a lot of really strong papers, 01:02:13.700 |
and essentially, if you used to read the books 01:02:16.940 |
about 3D reconstruction and geometry and so on, right, 01:02:26.180 |
So when you have video, and when you have 3D outputs 01:02:29.300 |
in your models, there is certain amount of consistency. 01:02:31.820 |
One example is ego motion versus depth estimation. 01:02:46.700 |
You know this about the environment, you expect it. 01:02:51.260 |
And so more of this type of reasoning may be interesting. 01:03:08.780 |
to tackling the challenges of autonomous driving? 01:03:11.900 |
- Could you say one more time how important is, 01:03:17.500 |
Every now and then, you just, you sprinkle in, 01:03:19.980 |
like, here we can try expert designed algorithms, 01:03:22.540 |
because we actually understand some parts of the problem, 01:03:24.900 |
and I was wondering, like, what is really important 01:03:33.100 |
- I mean, generally, you want, the problem is, 01:03:37.180 |
That makes it such that you don't want to make errors 01:03:40.580 |
in perception, prediction, and planning, right? 01:03:44.460 |
And the state of machine learning is not at the point 01:03:47.580 |
where it never makes errors, provided the scope 01:04:02.020 |
and I think machine learning, as it improves, 01:04:05.140 |
I think there'll be less and less need to do it. 01:04:11.300 |
especially in an evolving system, to do that, 01:04:15.140 |
But right now, I think this is the main thing 01:04:18.020 |
that keeps you able to do complex behaviors in some cases, 01:04:29.940 |
So the way I view it, I'm a machine learning person, 01:04:34.080 |
That said, we're not religious, it should not be. 01:04:38.500 |
and right now, the right mix is a hybrid system,