
Drago Anguelov (Waymo) - MIT Self-Driving Cars


Chapters

0:00 Introduction
0:47 Background
1:31 Waymo story (2009 to today)
4:31 Long tail of events
8:55 Perception, prediction, and planning
14:54 Machine learning at scale
26:43 Addressing the limits of machine learning
29:38 Large-scale testing
50:51 Scaling to dozens and hundreds of cities
54:35 Q&A


00:00:00.000 | All right, welcome back to 6.094,
00:00:03.000 | Deep Learning for Self-Driving Cars.
00:00:04.920 | Today we have Drago Anguelov,
00:00:07.880 | Principal Scientist at Waymo.
00:00:10.080 | Aside from having the coolest name in autonomous driving,
00:00:13.840 | Drago has done a lot of excellent work
00:00:16.400 | in developing and applying machine learning methods
00:00:18.360 | to autonomous vehicle perception
00:00:20.320 | and more generally in computer vision and robotics.
00:00:22.440 | He's now helping Waymo lead the world in autonomous driving.
00:00:26.320 | 10 plus million miles achieved autonomously to date,
00:00:31.320 | which is an incredible accomplishment.
00:00:34.040 | So it's exciting to have Drago here with us to speak.
00:00:37.640 | Please give him a big hand.
00:00:39.440 | (audience applauding)
00:00:42.600 | - Hi, thanks for having me.
00:00:46.600 | I will tell you a bit about our work
00:00:48.320 | and the exciting nature of self-driving
00:00:50.480 | and the problem and our solutions.
00:00:52.200 | So my talk is called
00:00:54.400 | "Taming the Long Tail of Autonomous Driving Challenges."
00:00:57.720 | My background is in perception and robotics.
00:01:00.040 | I did a PhD at Stanford with Daphne Koller
00:01:04.120 | and worked closely with one of the pioneers in the space,
00:01:07.160 | Professor Sebastian Thrun.
00:01:09.080 | I spent eight years at Google doing research on perception,
00:01:11.720 | also working on Street View
00:01:13.080 | and developing deep models for detection,
00:01:15.920 | neural net architectures.
00:01:17.840 | I was briefly at Zoox.
00:01:19.320 | I was heading the 3D perception team at Zoox.
00:01:22.520 | We built another perception system for autonomous driving,
00:01:25.840 | and I've been leading the research team at Waymo
00:01:28.520 | most recently.
00:01:30.400 | So I want to tell you a little bit about Waymo when we start.
00:01:34.880 | Waymo actually this month has its 10-year anniversary.
00:01:39.160 | It started when Sebastian Thrun
00:01:42.320 | convinced the Google leadership
00:01:43.480 | to try an exciting new moonshot.
00:01:45.440 | And the goal that they set for themselves
00:01:49.400 | was to drive 10 different segments
00:01:51.760 | that were 100 miles long.
00:01:54.280 | And later that year, they succeeded
00:01:56.000 | and drove an order of magnitude
00:01:57.280 | more than anyone had ever driven.
00:01:58.960 | In 2015, we brought this car to the road.
00:02:06.480 | It was built ground up as a study
00:02:10.120 | in what fully driverless mobility would be like.
00:02:13.680 | In 2015, we put this vehicle in Austin
00:02:19.640 | and it completed the world's first
00:02:21.960 | fully autonomous ride on public roads.
00:02:25.000 | And the person inside this car
00:02:26.520 | is a fan of the project and he is blind.
00:02:28.800 | So we did not want this to be
00:02:33.560 | just a demo fully driverless experience.
00:02:36.520 | We worked hard and in 2017,
00:02:38.560 | we launched a fleet of fully self-driving vehicles
00:02:41.800 | on the streets in Phoenix metro area.
00:02:47.040 | And we have been doing driverless,
00:02:49.960 | fully driverless operations ever since.
00:02:53.000 | So I wanted to give you a feel
00:02:57.320 | for what fully driverless experience is like.
00:02:59.960 | (people chattering)
00:03:02.960 | (upbeat music)
00:03:05.560 | (people chattering)
00:03:08.640 | (people chattering)
00:03:38.000 | And so we continued.
00:03:39.920 | Last year, we launched our first commercial service
00:03:42.760 | in the metro area of Phoenix.
00:03:44.480 | There, people can call a Waymo on their phone.
00:03:48.920 | It can come pick them up and help them with errands
00:03:51.160 | or go to school.
00:03:52.760 | And we've been already learning a lot from these customers
00:03:55.680 | and we are looking to grow and expand the service
00:03:58.480 | and bring it to more people.
00:04:00.640 | So in the process of growing the service,
00:04:03.440 | we have driven 10 million miles on public road,
00:04:05.760 | as Lex said.
00:04:07.440 | Both driverlessly and, even more, with human drivers
00:04:12.440 | to collect data.
00:04:16.360 | And we've driven all kinds of scenarios, cities,
00:04:20.040 | capturing a diverse set of conditions
00:04:22.760 | and a diverse set of situations
00:04:24.640 | in which we develop our systems.
00:04:28.280 | And so in this talk, I want to tell you,
00:04:31.480 | I mean, about the long tail of events.
00:04:34.200 | This is all the things we need to handle
00:04:36.520 | to enable a truly driverless future.
00:04:39.800 | And I guess all the problems that come with this
00:04:42.240 | and offer some solutions and show you
00:04:44.680 | how Waymo has been thinking about these issues.
00:04:47.840 | So as we drove 10 million miles,
00:04:50.080 | of course, we still find scenarios,
00:04:54.240 | new ones that we have not seen before
00:04:55.960 | and we still keep collecting them.
00:04:57.880 | And so when you think about self-driving vehicles,
00:05:01.360 | they need to have the following properties.
00:05:02.840 | First, the vehicle needs to be capable.
00:05:05.520 | It needs to be able to handle the entire task of driving.
00:05:09.280 | So it cannot do just a subset
00:05:10.680 | if you want to remove the human operator from the vehicle.
00:05:14.440 | And it also obviously needs
00:05:16.120 | to do all of these tasks well and safely.
00:05:18.640 | And that is the requirement
00:05:20.800 | to achieving self-driving at scale.
00:05:24.400 | And when you think about this now,
00:05:26.160 | the question is, well, how many of these capabilities
00:05:28.360 | and how many scenarios do you really need to handle?
00:05:30.920 | Well, it turns out, well, the world is quite diverse
00:05:35.440 | and complicated and there are a lot of rare situations
00:05:40.760 | and all of them need to be handled well.
00:05:42.960 | And they call this the long tail.
00:05:45.840 | The long tail of situations.
00:05:48.440 | It's one type of effort to get self-driving working
00:05:52.000 | for the common cases and then it's another effort
00:05:55.000 | to tame the rest and they really, really matter.
00:06:00.040 | And so I'll show you some.
00:06:03.080 | For example, this is us driving in the street
00:06:06.280 | and let's see if you can tell what is unusual in this video.
00:06:10.640 | (audience laughing)
00:06:12.600 | You see, so I can play it one more time.
00:06:14.880 | So there's a bicyclist and he's carrying a stop sign.
00:06:21.600 | And I don't know where he picked it up
00:06:25.120 | but it's certainly not a stop sign we need to stop for,
00:06:28.400 | unlike others, right?
00:06:29.800 | And so you need to understand that.
00:06:31.560 | Let me show you another scenario.
00:06:34.280 | This is another case where we are happily staying there
00:06:38.520 | and then the vehicle stops and a big pile of poles
00:06:41.800 | comes our way, right?
00:06:42.720 | And you need to potentially understand that
00:06:44.360 | and learn to avoid it.
00:06:45.440 | Generally, well, different types of objects
00:06:49.560 | can fall on the road, it's not just poles.
00:06:52.040 | Here's another interesting scenario.
00:06:53.640 | This happens a lot, it's called construction
00:06:56.260 | and there's various aspects of it.
00:06:57.880 | One of them is someone closed the lane,
00:07:01.960 | put a bunch of cones and we learn,
00:07:05.040 | and this is our vehicle correctly identifying
00:07:07.500 | where it's supposed to be driving
00:07:08.760 | between all of these cones and successfully executing it.
00:07:11.760 | So yeah, we drive for a while.
00:07:14.920 | And this is something that happens fairly often
00:07:19.000 | if you drive a lot.
00:07:20.000 | Another case is this one.
00:07:28.420 | I think you can understand what happened here.
00:07:33.060 | And you can notice actually, so we hear the siren.
00:07:36.320 | So we have the ability to understand
00:07:40.460 | sirens of special vehicles and you can see
00:07:42.260 | we hear it and stop, while some drivers brake much later than us,
00:07:45.980 | at the last moment,
00:07:47.500 | letting the emergency vehicle pass.
00:07:49.540 | And here's another scenario potentially I want to show you.
00:07:56.740 | Let's see if you can understand what happened.
00:07:59.040 | So let me play one more time.
00:08:05.740 | Did you guys see?
00:08:07.820 | So we stopped at, there's a green light,
00:08:10.820 | we're about to go and someone goes at high speed
00:08:14.180 | running a red light without any remorse.
00:08:16.820 | Right, and we successfully stop and prevent issues.
00:08:21.820 | Right, and so sometimes you have the rules of the road
00:08:25.820 | and the right of way, and people don't always abide by them,
00:08:29.340 | and so you don't want to just
00:08:31.340 | go directly in front of that person
00:08:32.780 | even if they're breaking the law.
00:08:34.420 | So hopefully with this I convinced you that the situations
00:08:39.340 | that can occur are diverse and challenging
00:08:42.600 | and there's quite a few of them.
00:08:43.780 | And I want to take you a little bit on the tour
00:08:45.740 | of what makes this challenging and then tell you
00:08:47.780 | some ways in which we think about it
00:08:49.800 | and how we're handling it.
00:08:51.100 | And so to do this, we're gonna delve a little bit more
00:08:55.020 | into the main tasks for self-driving
00:08:56.900 | which is perception, prediction, and planning.
00:08:59.860 | And so I'll tell you a little bit about those.
00:09:02.420 | Right, and these are usually the core AI aspects
00:09:07.420 | of the car.
00:09:08.380 | There are other tasks, which we can talk about
00:09:10.780 | as well in a little bit, but let's focus on these first.
00:09:13.220 | So perception is mapping from sensory inputs
00:09:15.380 | and potentially prior knowledge of the environment
00:09:18.060 | to a scene representation.
00:09:19.380 | And that scene representation can contain objects,
00:09:22.100 | it can contain scene semantics,
00:09:23.980 | potentially you can reconstruct a map,
00:09:27.060 | you can learn about object relationships and so on.
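A minimal sketch of what such a scene representation might look like as a data structure. These types and field names are purely illustrative assumptions, not Waymo's actual perception interfaces; they only show the kind of objects, semantics, and relationships the talk describes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative, hypothetical types: objects, scene semantics, and relationships.

@dataclass
class TrackedObject:
    object_id: int
    object_type: str               # e.g. "vehicle", "pedestrian", "cyclist"
    position: Tuple[float, float]  # (x, y) in a vehicle-centric frame, meters
    heading: float                 # radians
    velocity: Tuple[float, float]

@dataclass
class SceneRepresentation:
    objects: List[TrackedObject] = field(default_factory=list)
    semantics: dict = field(default_factory=dict)  # e.g. {"traffic_light": "green"}
    # Pairwise relationships, e.g. (cyclist_id, parked_car_id, "passes_around")
    relations: List[Tuple[int, int, str]] = field(default_factory=list)
```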
00:09:29.620 | And perception, the space of things you need to handle
00:09:34.700 | in perception is fairly hard, it's a complex mapping.
00:09:37.680 | Right, so you have sensors, the pixels come,
00:09:39.820 | LiDAR points come, or radar scans come,
00:09:42.620 | and you have multiple axes of variability
00:09:45.620 | in the environment.
00:09:46.460 | So obviously there's a lot of objects.
00:09:48.820 | They have different types, appearance, pose.
00:09:53.340 | I don't know if you see this well,
00:09:54.500 | there are a bunch of people dressed as dinosaurs
00:09:56.280 | in this case, people generally are fairly creative
00:09:59.260 | in how they dress.
00:10:00.460 | Vehicles can also be different types,
00:10:04.220 | people come in different poses, and we have seen it all.
00:10:07.700 | All right, so that's one aspect.
00:10:09.860 | There's different environments that these objects appear in.
00:10:14.560 | So there are times of day, seasons, day, night,
00:10:20.320 | different, for example, highway environment,
00:10:22.960 | suburban street and so on.
00:10:25.520 | And then there's a different variability axis,
00:10:28.520 | and this is a little more, slightly more abstract.
00:10:30.840 | There are different objects can come in this environment
00:10:32.960 | in different configurations
00:10:34.040 | and can have different relationships.
00:10:36.280 | And so things like occlusion,
00:10:38.240 | there's a guy carrying a big board,
00:10:41.400 | there are reflections, there are, you know,
00:10:44.720 | people riding on horses and so on.
00:10:47.300 | And so why am I showing this?
00:10:50.200 | Because I just want to show you the space, right?
00:10:52.280 | So in most cases you care about most objects
00:10:55.920 | in most environments in most reasonable configurations,
00:10:58.880 | and that's a space that you need to map
00:11:00.520 | from the sensor inputs to a representation that makes sense,
00:11:04.000 | and you need to learn this mapping function
00:11:06.120 | or represent it somehow.
00:11:07.320 | All right, and so let's go to the next step,
00:11:10.200 | which is prediction.
00:11:11.040 | So apart from just understanding
00:11:12.360 | what's happening in the world,
00:11:13.640 | you need to be able to anticipate and predict
00:11:16.200 | what some of the actors in the world are going to do,
00:11:18.600 | the actors being mostly people,
00:11:20.200 | and people are honestly what makes driving quite challenging.
00:11:24.680 | This is one of the aspects that makes it so;
00:11:26.520 | you know, a vehicle needs to be out there
00:11:28.840 | and be a full-fledged traffic scene participant.
00:11:31.600 | And this anticipation of agent behavior
00:11:34.400 | sometimes needs to be fairly long term,
00:11:35.920 | so sometimes when you want to make a decision,
00:11:37.560 | you want to validate or convince yourself
00:11:40.120 | it does not interfere with what anyone else is going to do,
00:11:43.080 | and it can go from one second to maybe 10 seconds or more,
00:11:45.800 | you need to anticipate the future.
00:11:48.160 | So what goes into anticipating the future?
00:11:51.160 | Well, you can watch past behavior:
00:11:52.760 | I'm going this way,
00:11:53.880 | maybe I will continue going there,
00:11:55.480 | or maybe I'm very aggressively walking,
00:11:57.680 | and maybe I'm more likely to do
00:11:59.360 | aggressive motions in the future.
00:12:01.320 | High-level scene semantics,
00:12:04.120 | well, I'm in a presentation room,
00:12:05.800 | I'm sitting here at front giving a talk,
00:12:07.480 | I'll probably stay here and continue,
00:12:11.040 | even though stranger things have happened.
00:12:13.360 | And of course there's subtle appearance cues,
00:12:18.000 | so for example, if a person's watching our vehicle
00:12:20.640 | and moving towards them,
00:12:21.520 | we can be fairly confident they're paying attention
00:12:23.840 | and not going to do anything particularly dangerous.
00:12:28.840 | If someone's not paying attention or being distracted,
00:12:31.040 | or there is a person in the car waving at us,
00:12:35.760 | various gestures, cues, the blinkers on the vehicles,
00:12:38.440 | these are all signals and subtle signals
00:12:41.840 | that we need to understand
00:12:43.160 | in order to be able to behave well.
00:12:45.360 | And last but not least,
00:12:47.160 | even when you predict how other agents behave,
00:12:49.880 | agents also are affected by the other agents
00:12:53.120 | in the environment as well.
00:12:54.160 | So everyone can affect everyone else,
00:12:56.080 | and you need to be mindful of this.
00:12:58.280 | So I'll show you an example of this.
00:13:00.200 | I think this is one of the issues
00:13:03.040 | that really needs to be thought about.
00:13:04.840 | We are all interacting with each other.
00:13:08.080 | So here's a case, our Waymo vehicle is driving,
00:13:10.720 | and there are two bicyclists in red
00:13:15.520 | going around a parked car.
00:13:17.720 | And what happens is we correctly anticipate
00:13:21.280 | that as they bike, they will go around the car,
00:13:23.440 | and we slow down and let them pass.
00:13:25.320 | So we're reasoning that they will interact
00:13:28.160 | with the parked car.
00:13:29.560 | This is the prediction, our most likely prediction
00:13:32.160 | for the rear bicyclists.
00:13:34.600 | We anticipate that they will do this,
00:13:36.480 | and we correctly handle this case.
00:13:38.640 | So this illustrates prediction.
00:13:40.480 | And we have planning.
00:13:42.200 | This is our decision-making machine.
00:13:44.480 | It produces vehicle behavior,
00:13:46.960 | typically ends up in control commands to the vehicle,
00:13:49.720 | accelerate, slow down, steer the wheel.
00:13:52.960 | And you need to generate behavior
00:13:54.240 | that ultimately has several properties to it,
00:13:56.520 | and it's important to think of them,
00:13:57.760 | which is safe, safety comes first,
00:14:00.520 | comfortable for the passengers,
00:14:04.720 | and also sends the right signals
00:14:07.960 | to the other traffic participants,
00:14:09.660 | because they can interact with you,
00:14:12.320 | and they will react to your actions,
00:14:14.000 | so you need to be mindful.
00:14:15.280 | And you need to, of course, make progress.
00:14:18.200 | You need to deliver your passengers.
00:14:19.480 | So you need to trade all of these in a reasonable way.
00:14:22.080 | And it can be fairly sophisticated reasoning
00:14:26.600 | in complex environments.
00:14:27.640 | I'll show you just one scene.
00:14:29.080 | This is a complex, I think, school gathering.
00:14:33.400 | There are bicyclists trailing us,
00:14:34.920 | vehicles hemmed in really close around us,
00:14:37.000 | a bunch of pedestrians, and we need to make progress,
00:14:39.740 | and here is us.
00:14:40.700 | We're driving reasonably well.
00:14:44.440 | In crowded scenes.
00:14:45.520 | And that is part of the prerequisite
00:14:47.640 | of bringing this technology
00:14:50.240 | to the dense urban environments, being able to do this.
00:14:53.480 | So how are we gonna do it?
00:14:54.320 | Well, I gave it away.
00:14:55.600 | I'm a machine learning person.
00:14:57.800 | I think when you have these complicated models and systems,
00:15:01.100 | machine learning is a really great tool to model
00:15:05.360 | complex actions, complex mapping functions, features.
00:15:11.520 | Right, and so we're going to learn our system.
00:15:14.840 | And we've been doing this.
00:15:15.880 | I mean, we're not the only ones.
00:15:17.080 | So obviously, this is now a machine learning revolution.
00:15:20.360 | And machine learning is permeating
00:15:23.160 | all parts of the Waymo stack.
00:15:24.840 | All of these systems that I'm talking about,
00:15:26.860 | it helps us perceive the world.
00:15:28.120 | It helps us make decisions about
00:15:30.600 | what others are going to do.
00:15:31.880 | It helps us make our own decisions.
00:15:33.640 | Right, and machine learning is a tool
00:15:36.400 | to handle the long tail.
00:15:37.640 | Right, and I'll tell you a little more on this how.
00:15:42.080 | So I have this allegory about machine learning
00:15:44.480 | that I like to think about.
00:15:45.660 | So there is a classical system
00:15:47.020 | and there is a machine learning system.
00:15:49.120 | And to me, a classical system,
00:15:50.480 | and I've been there, I've done well.
00:15:52.960 | Early machine learning systems also can be a bit classical.
00:15:56.120 | You're the artisan, you're the expert.
00:15:57.640 | You have your tools and you need to build this product.
00:16:00.280 | And you have your craft and you go
00:16:01.640 | and take your tools and build it, right?
00:16:03.520 | And you can fairly quickly get something reasonable.
00:16:06.420 | But then it's harder to change, it's harder to evolve.
00:16:08.840 | If you learn new things, now you need to go back
00:16:11.160 | and maybe the tools don't quite fit
00:16:12.680 | and you essentially need to keep tweaking it,
00:16:14.920 | and the more complicated the product
00:16:17.360 | becomes, the harder it is to do.
00:16:19.700 | And machine learning, modern machine learning,
00:16:22.000 | is like a factory.
00:16:24.200 | Right, so machine learning, you build the factory,
00:16:28.600 | which is the machine learning infrastructure.
00:16:30.800 | And then you feed data in this factory
00:16:34.660 | and get nice models that solve your problems, right?
00:16:37.480 | And so, kind of infrastructure is at the heart
00:16:40.280 | of this new paradigm.
00:16:41.480 | You need to build a factory.
00:16:43.960 | Right, once you do it, now you can iterate,
00:16:47.480 | it's scalable, right?
00:16:49.180 | Just keep feeding the right data into the machine,
00:16:51.640 | and it keeps giving you good models.
00:16:53.080 | So what is an ML factory for self-driving models?
00:16:58.740 | Well, roughly it goes like this.
00:17:03.720 | We have a software release, we put it on the vehicle,
00:17:06.360 | we're able to drive.
00:17:07.280 | We drive, we collect data, we collect it and we store it.
00:17:11.560 | And then we select some parts of this data
00:17:15.440 | and we send it to labelers.
00:17:17.640 | And the labelers label parts of the data
00:17:19.480 | that we find interesting, and that's the knowledge
00:17:22.200 | that we want to extract from the data.
00:17:23.720 | These are the labels, the annotations,
00:17:25.120 | the results we want for our models.
00:17:26.880 | Right, there it is.
00:17:31.200 | And then what we're going to do is we're gonna train
00:17:33.640 | machine learning models on this data.
00:17:36.500 | After we have the models, we will do testing and validation,
00:17:39.300 | validate that they're good to put on our vehicles.
00:17:42.140 | And once they're good to put on our vehicles,
00:17:43.860 | we go and collect more data.
00:17:45.260 | And then the process starts going again and again.
00:17:48.780 | So you collect more data, now you select new data
00:17:51.260 | that you have not selected before.
00:17:52.960 | You add it to your data set, you keep training the model.
00:17:56.920 | And iterate, iterate, iterate, it's a nice scalable setup.
00:18:00.780 | Of course, this needs to be automated.
00:18:03.520 | It needs to be scalable itself.
00:18:05.280 | It's a game of infrastructure.
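A toy sketch of the drive, select, label, train, validate loop just described. Every function here is a made-up stand-in for Waymo-scale infrastructure, not a real API; the sketch only shows how each release feeds the next round of data and training.

```python
# Hypothetical sketch of the "ML factory" loop. All helpers are toy stubs.

def drive_fleet_and_collect(release):
    return [{"scenario": i, "release": release} for i in range(100)]

def select_interesting_data(logs):
    # Keep only the (rare) logs the current models are unsure about.
    return [log for log in logs if log["scenario"] % 17 == 0]

def label(examples):
    return [{**e, "label": "annotated"} for e in examples]

def train(dataset):
    return {"model": "new", "train_size": len(dataset)}

def validate(model):
    return model["train_size"] > 0

def ml_factory(release="r0", dataset=None, iterations=3):
    dataset = dataset or []
    for i in range(iterations):
        logs = drive_fleet_and_collect(release)          # drive and store data
        dataset += label(select_interesting_data(logs))  # grow the training set
        model = train(dataset)                           # retrain at scale
        if validate(model):                              # test before it ships
            release = f"r{i + 1}"                        # new release, loop again
    return release, dataset

if __name__ == "__main__":
    print(ml_factory())
```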
00:18:07.360 | And at Waymo, we have the beautiful advantage
00:18:10.760 | to be really well set up with regards
00:18:13.380 | to the machine learning infrastructure.
00:18:15.180 | And I'll tell you a bit about its ingredients
00:18:17.280 | and how we go about it.
00:18:20.340 | So ingredient one is computing software infrastructure
00:18:23.580 | and we're part of Alphabet, Google,
00:18:25.020 | and we are able to, first of all, leverage TensorFlow,
00:18:29.620 | the deep learning framework.
00:18:31.300 | We have access to the experts that wrote TensorFlow
00:18:33.780 | and know it in depth.
00:18:35.660 | We have data centers to run large-scale parallel compute
00:18:38.660 | and also train models.
00:18:40.700 | We have specialized hardware for training models,
00:18:42.880 | which make it cheaper and more affordable and faster
00:18:47.040 | so you can iterate better.
00:18:48.340 | Ingredient two is high-quality labeled data.
00:18:51.960 | We have the scale to collect and store hundreds
00:18:55.540 | of thousands of miles and more, up to millions of miles.
00:18:58.420 | And just collecting and storing 10 million miles
00:19:02.180 | is not necessarily the best thing you can do
00:19:07.180 | because there is a decreasing utility to the data.
00:19:10.900 | So most of the data comes from common scenarios
00:19:13.420 | and maybe you're already good at them
00:19:14.700 | and that's where the long tail comes.
00:19:17.020 | So it's really important how you select the data.
00:19:20.260 | And so this is the important part of this pipeline.
00:19:22.180 | So while you're running a release on the vehicle,
00:19:24.540 | we have a bunch of models,
00:19:25.740 | we have a bunch of understanding about the world,
00:19:28.180 | and we annotate the data as we go
00:19:30.500 | and we can use this knowledge to decide
00:19:32.700 | what data is interesting, how to store it,
00:19:35.300 | which data we can potentially even ignore.
00:19:38.160 | So then once we do that, again,
00:19:41.740 | we need to be very careful how to select data.
00:19:43.660 | We want to select data, for example,
00:19:45.100 | that are interesting in some way and complement,
00:19:47.900 | capture these long tail cases
00:19:49.780 | that we potentially may not be doing so well on.
00:19:52.420 | And so for this, we have active learning
00:19:56.620 | and data mining pipelines.
00:19:59.580 | Given exemplars, find the rare examples,
00:20:02.020 | look for parts of your system which are uncertain
00:20:04.420 | or inconsistent over time and go and label those cases.
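A minimal sketch of the data-selection idea just described: mine logged frames where the on-board model was uncertain, or where its predictions flicker over time, and send only those for labeling. The scoring rules and thresholds below are illustrative assumptions, not Waymo's actual mining pipeline.

```python
import numpy as np

def entropy(probs):
    probs = np.clip(probs, 1e-9, 1.0)
    return float(-(probs * np.log(probs)).sum())

def select_for_labeling(frames, entropy_threshold=0.9, flicker_threshold=0.5):
    """frames: per-frame class probabilities for one tracked object."""
    selected, prev = [], None
    for frame in frames:
        probs = np.asarray(frame["class_probs"])
        uncertain = entropy(probs) > entropy_threshold
        # "Inconsistent over time": the predicted distribution jumps between frames.
        flickering = prev is not None and np.abs(probs - prev).sum() > flicker_threshold
        if uncertain or flickering:
            selected.append(frame)
        prev = probs
    return selected

frames = [
    {"t": 0, "class_probs": [0.9, 0.05, 0.05]},
    {"t": 1, "class_probs": [0.4, 0.35, 0.25]},   # uncertain -> label it
    {"t": 2, "class_probs": [0.05, 0.9, 0.05]},   # class flipped -> label it
]
print([f["t"] for f in select_for_labeling(frames)])  # -> [1, 2]
```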
00:20:08.640 | Last but not least, we also produce auto labels.
00:20:12.540 | So how can you do that?
00:20:13.780 | Well, when you collect data, you also see the future
00:20:17.140 | for many of the objects, what they did.
00:20:19.500 | And so because of that, now knowing the past and the future,
00:20:22.960 | you can annotate your data better
00:20:24.740 | and then go back to your model
00:20:26.700 | that does not know the future
00:20:27.980 | and try to replicate that with that model.
00:20:30.140 | And so you need to do all of this as part of the system.
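A toy sketch of the auto-labeling idea: an offline labeler sees the whole log, including each object's future, so it can smooth noisy online detections into cleaner tracks and use those as training targets for the online model. The centered moving average below is a stand-in for whatever smoother is actually used; it is illustration only.

```python
import numpy as np

def auto_label_track(online_positions, window=2):
    """online_positions: (T, 2) noisy per-frame (x, y) detections for one object."""
    online_positions = np.asarray(online_positions, dtype=float)
    smoothed = np.empty_like(online_positions)
    T = len(online_positions)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)  # uses past AND future
        smoothed[t] = online_positions[lo:hi].mean(axis=0)
    return smoothed  # serves as the "auto label" target for the online model

noisy = [[0.0, 0.0], [1.2, 0.1], [1.9, -0.2], [3.1, 0.1], [4.0, 0.0]]
print(auto_label_track(noisy))
```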
00:20:34.580 | Ingredient number three, high quality models.
00:20:37.340 | We're part of the larger Alphabet, alongside Google and DeepMind,
00:20:40.660 | and generally Alphabet is the leader in AI.
00:20:44.300 | When I was at Google, we were very early
00:20:47.180 | on the deep learning revolution.
00:20:49.020 | I happened to have the chance to be there at the time.
00:20:51.660 | It was 2013 when I got on to do deep learning
00:20:55.860 | and a lot of things were not understood
00:20:57.340 | and we were there working on it earlier than most people.
00:21:00.140 | And so through that, we had the opportunity
00:21:02.300 | and the chance to develop some of the,
00:21:04.620 | in my time, the team I managed,
00:21:06.500 | we invented neural net architectures like Inception,
00:21:10.300 | which became popular later.
00:21:12.260 | We invented, at the time, the state-of-the-art
00:21:14.500 | fast object detector called SSD.
00:21:18.060 | And we won the ImageNet 2014.
00:21:20.420 | And now if you go to the conferences,
00:21:21.860 | Google and DeepMind are leaders in perception
00:21:23.900 | and reinforcement learning and smart agents.
00:21:26.520 | And there are state-of-the-art,
00:21:29.180 | say semantic segmentation networks,
00:21:30.860 | pose estimation and so on.
00:21:32.260 | The object detection of course goes without saying.
00:21:34.420 | And so we collaborate with Google and DeepMind
00:21:36.980 | on projects improving our models.
00:21:39.120 | And so this is my factory for self-driving models
00:21:43.100 | and I want to tell you something
00:21:45.720 | that kind of captures all of these ideas,
00:21:48.900 | infrastructure, data and models in one.
00:21:51.260 | This is a project we did recently
00:21:54.900 | and today we put online in our blog
00:21:57.340 | about automatic machine learning for tuning
00:22:01.500 | and adjusting architectures of neural networks.
00:22:05.500 | So what did we do?
00:22:08.260 | So there is a team at Google working on AutoML,
00:22:12.380 | automatic machine learning.
00:22:14.200 | And usually networks themselves have complex architecture.
00:22:17.300 | They're crafted by practitioners to artisans
00:22:19.260 | of networks in some way.
00:22:21.020 | And sometimes we have very high latency constraints
00:22:23.600 | in the models, we have some compute constraints.
00:22:26.480 | The networks are specialized.
00:22:27.860 | It takes often people months to find the right architecture
00:22:30.980 | that's most performant, low latency and so on.
00:22:33.440 | And so there's a way to offload this work to the machines.
00:22:36.900 | You can have machines themselves,
00:22:40.280 | once you've posed the problem,
00:22:41.500 | go and find your good network architecture
00:22:43.360 | that's both low latency and high performance.
00:22:45.760 | Right, and so that's what we do.
00:22:48.540 | And we drive in a lot of scenarios
00:22:50.060 | and as we keep collecting data and finding new cities
00:22:52.860 | or new examples, the architectures may change
00:22:55.740 | and we want to easily find that and keep evolving that
00:22:58.220 | without too much effort, right?
00:22:59.580 | So we worked with the Google researchers
00:23:01.940 | and they had a strong work where they invented,
00:23:05.340 | well, they developed a system that searched the space
00:23:09.060 | of architectures and found a set of components
00:23:12.820 | of neural networks.
00:23:14.360 | It's a small sub-network called a NAS cell
00:23:17.260 | and this is a diagram of a NAS cell.
00:23:18.940 | It's a set of layers put together
00:23:21.540 | that you can then replicate in the network
00:23:23.220 | to build a larger network.
00:23:24.580 | And they discovered it on a small vision data set
00:23:26.740 | called CIFAR-10.
00:23:28.060 | It's from the early days of deep learning,
00:23:30.880 | it was a very popular data set
00:23:32.540 | and you can quickly train models
00:23:33.860 | and explore the large search space.
00:23:36.040 | So the first thing we did is we took some problems
00:23:40.620 | that we have in our stack,
00:23:42.940 | one of them being LiDAR segmentation.
00:23:44.540 | So you have a map representation and some LiDAR points
00:23:48.940 | and you essentially segment the LiDAR points.
00:23:51.580 | You say this point is part of a vehicle,
00:23:53.840 | that point is part of vegetation and so on.
00:23:56.540 | This is a standard problem.
00:23:58.440 | So what we first did at Waymo is
00:24:01.620 | we explored several hundred NAS cell combinations
00:24:07.580 | to see what performs better on this task.
00:24:10.740 | And we saw that one of two things happened
00:24:15.460 | for the various versions that we found.
00:24:17.420 | One of them is we can find models with similar quality
00:24:20.460 | but much lower latency and less compute.
00:24:24.220 | And then there is models of a bit higher quality
00:24:27.740 | at the same latency.
00:24:28.580 | So essentially we found better models
00:24:30.060 | than the human engineers did.
00:24:31.880 | And similar results were obtained for other problems,
00:24:36.500 | lane detection as well
00:24:38.940 | with this transfer learning approach.
00:24:41.100 | Of course, you can also do end-to-end architecture search.
00:24:44.300 | So there's no reason why what was found on CIFAR-10
00:24:47.740 | is best suited for our more specialized problems.
00:24:52.180 | And so we went about this more from the ground up.
00:24:55.660 | So let's do a deeper search, over a much larger space,
00:24:59.580 | not limited to the NAS cells themselves.
00:25:02.820 | And so the way to do this is because our networks
00:25:05.620 | are trained on quite a lot of data
00:25:08.100 | and take quite a while to converge
00:25:10.260 | and it takes some compute,
00:25:11.540 | we went and defined a proxy task.
00:25:13.220 | This is a smaller task, simplified,
00:25:16.140 | but correlates with the larger task.
00:25:18.540 | And we do this by some experimentation
00:25:20.420 | of what would be a proxy task.
00:25:21.780 | And once we establish a proxy task,
00:25:23.940 | now we execute the search algorithms
00:25:26.260 | developed by the Google researchers.
00:25:28.300 | And so we train up to 10,000 architectures
00:25:31.300 | with different topology and capacity.
00:25:34.620 | And once we find the top 100 models,
00:25:36.700 | now we train the large networks on those models
00:25:39.780 | all the way and pick the best ones.
00:25:42.220 | Right?
00:25:43.260 | And so this way we can explore a much larger space
00:25:46.140 | of network architectures.
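A toy sketch of the two-stage search strategy just described: score many candidate architectures on a cheap proxy task, keep the best few, and only train those at full scale. Random search stands in for the actual search and reinforcement-learning algorithms, and the "training" functions are stubs; nothing here is the real system.

```python
import random

random.seed(0)

def sample_architecture():
    return {"depth": random.randint(4, 24), "width": random.choice([32, 64, 128])}

def train_on_proxy_task(arch):
    # Stub: pretend deeper/wider models score a bit better on the small proxy task.
    return arch["depth"] * 0.01 + arch["width"] * 0.001 + random.gauss(0, 0.02)

def train_full_scale(arch):
    # Stub for the expensive full training run, used only on the finalists.
    return arch["depth"] * 0.01 + arch["width"] * 0.001

candidates = [sample_architecture() for _ in range(10_000)]            # cheap stage
proxy_scores = [(train_on_proxy_task(a), a) for a in candidates]
finalists = [a for _, a in sorted(proxy_scores, key=lambda sa: sa[0],
                                  reverse=True)[:100]]
best = max(finalists, key=train_full_scale)                            # expensive stage
print("selected architecture:", best)
```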
00:25:47.820 | So what happened?
00:25:48.660 | So on the left, this is 4,000 different models
00:25:52.620 | spanning the scale of latency and quality.
00:25:55.860 | And in red was the transfer model.
00:25:58.140 | So after the first round of search,
00:25:59.940 | we actually did not produce a better model
00:26:01.700 | than the transfer, which already leveraged their insight.
00:26:04.860 | So then we took the learnings and the best models
00:26:07.420 | from this search and did the second round of search,
00:26:10.220 | which was in yellow, which allowed us to beat it.
00:26:12.260 | And third, we also executed
00:26:15.100 | a reinforcement learning algorithm
00:26:17.940 | developed by the AI researchers
00:26:19.620 | on 6,000 different architectures.
00:26:21.980 | And that one was able to significantly improve
00:26:24.940 | on the red dot, which itself significantly improves
00:26:28.380 | on the in-house algorithm.
00:26:31.260 | So that's one example where infrastructure,
00:26:36.380 | data, and models combine and shows
00:26:40.460 | how you can keep automating the factory.
00:26:42.980 | That is all good, but we keep finding
00:26:46.660 | new examples in the world.
00:26:48.220 | And for some situations, we have fairly few examples as well.
00:26:52.220 | And so there are cases where the models are uncertain
00:26:55.460 | or potentially can make mistakes.
00:26:57.180 | And you need to be robust to those.
00:27:00.540 | I mean, you cannot ship the product and say,
00:27:02.060 | well, our networks just don't handle some case.
00:27:05.220 | So we have designed our system to be robust,
00:27:09.620 | even when ML is not particularly confident.
00:27:11.740 | And how do you do this?
00:27:12.580 | So one part is, of course, you want redundant
00:27:15.140 | and complementary sensors.
00:27:17.180 | So we have a 360-degree field of view
00:27:19.060 | on our vehicles, both in camera, LiDAR, and radar.
00:27:21.900 | And they're complementary modalities.
00:27:24.140 | First of all, an object is seen in all of them.
00:27:26.740 | Second of all, they all have different strengths
00:27:29.020 | and different modes of failure.
00:27:31.020 | And so whenever one of them tends to fail,
00:27:32.620 | the others usually work fine.
00:27:34.580 | And so that helps a lot,
00:27:36.340 | make sure we do not miss anything.
00:27:38.040 | Also, we've designed our system to be a hybrid system.
00:27:43.620 | And this is a point I want to make.
00:27:45.300 | So, I mean, some of these mapping problems
00:27:49.100 | or problems in which we apply our models
00:27:52.420 | are very complicated.
00:27:53.260 | They're high dimensional.
00:27:54.140 | The image has a lot of pixels.
00:27:56.100 | LiDAR has a lot of LiDAR points.
00:27:58.620 | The networks can end up pretty big.
00:28:00.980 | And it may not be so easy to train
00:28:04.180 | with very few examples with the current state of the art.
00:28:06.500 | And so the state of the art keeps improving, of course.
00:28:08.620 | So there is zero-shot and one-shot learning.
00:28:11.680 | But we can also, while the state of the art
00:28:14.460 | is improving in the models,
00:28:15.940 | we can also leverage expert domain knowledge.
00:28:18.180 | And so what does that do?
00:28:19.860 | So humans can help develop the right input representations.
00:28:24.860 | They can put in expert bias
00:28:26.500 | that constrains the representation
00:28:28.260 | to fewer parameters that already describe the task.
00:28:30.860 | And then with that bias, it is easier to learn models
00:28:34.540 | with fewer examples.
00:28:36.500 | And there is also, of course,
00:28:39.500 | experts can put in their knowledge
00:28:41.000 | in terms of designing the algorithm
00:28:42.560 | which incorporates it as well.
00:28:44.500 | And so our system is this hybrid.
00:28:46.140 | And so an example of what that looks like for perception is,
00:28:50.900 | well, even if there are cases
00:28:54.800 | where the machine learning system may not be confident,
00:28:57.660 | we still have tracks and obstacles
00:28:59.180 | from LiDAR and radar scans,
00:29:00.520 | and we make sure that we drive relative to those safely.
00:29:05.520 | And in prediction and planning,
00:29:06.880 | if we are not confident in our predictions,
00:29:10.660 | we can drive more conservatively.
00:29:12.620 | And over time, as the factory is running
00:29:14.480 | and our models become more powerful and improve,
00:29:16.980 | and we get more data for all the cases,
00:29:19.620 | the scope of ML grows, right?
00:29:23.360 | And the set of cases that you can handle with it increases.
00:29:29.740 | And so there's two ways to attack the tail.
00:29:32.240 | You both protect against it,
00:29:33.500 | but you also keep growing ML
00:29:34.840 | and making your system more performant.
00:29:36.800 | I'm going to tell you now
00:29:38.920 | how we deal with large-scale testing,
00:29:41.340 | which is another key problem.
00:29:43.420 | It's very important in the pipeline
00:29:46.480 | and also in getting the vehicles on the road.
00:29:48.720 | So how do you normally develop a self-driving algorithm?
00:29:54.120 | Well, the ideal thing you're gonna do
00:29:55.560 | is you make your algorithm change,
00:29:58.840 | and you would put it on the vehicle and drive a bunch
00:30:01.260 | and say, "Now it looks great."
00:30:03.720 | All right, let's make the next one.
00:30:05.580 | The problem is, I mean, we have a big fleet,
00:30:08.220 | we have a lot of data,
00:30:09.860 | but some of the conditions and situations
00:30:12.340 | occur very, very rarely.
00:30:13.980 | And so if you do this, you're gonna wait a long time.
00:30:17.420 | Furthermore, you don't just want to take your code
00:30:20.380 | and put it on the vehicle.
00:30:21.380 | You need to test it even before that.
00:30:23.180 | You don't want to,
00:30:24.580 | like you want very strongly tested code on public streets.
00:30:28.780 | So you can do structure testing.
00:30:30.280 | We have a 90-acre Air Force Base place
00:30:34.720 | where we can test very important situations
00:30:38.540 | and situations that occur rarely.
00:30:41.140 | It's an example of such a situation.
00:30:44.580 | And so you can do this as well.
00:30:47.300 | So you can select and deliberately stage safely
00:30:50.780 | conditions you care about.
00:30:53.620 | Now again, you cannot do this for all situations.
00:30:58.380 | So what do you do?
00:30:59.820 | A simulator.
00:31:00.900 | Right?
00:31:03.540 | And so how much do you need to simulate?
00:31:07.300 | Well, we simulate a lot.
00:31:09.140 | So we simulate the equivalent of 25,000 cars,
00:31:12.220 | virtual cars, driving.
00:31:13.980 | 10 million miles a day.
00:31:17.860 | And over seven billion miles simulated.
00:31:22.980 | It's a key part of our release process.
00:31:27.260 | So why do you need to simulate this much?
00:31:29.260 | Right, well, hopefully I convinced you
00:31:31.700 | there is a variety of cases to worry about
00:31:35.340 | and that you need to test, right, so far.
00:31:39.540 | And furthermore, it goes all the way bottom up.
00:31:43.380 | So as you change perception,
00:31:45.620 | for example, slightly different segmentation or detection,
00:31:48.340 | the changes can go through the system
00:31:51.540 | and the results can change significantly
00:31:54.220 | and you need to be robust to this.
00:31:55.340 | You need to test all the way.
00:31:57.780 | So what to simulate?
00:32:02.780 | One thing you can do is,
00:32:05.140 | we can create unique scenarios from scratch,
00:32:08.900 | working with safety experts,
00:32:10.300 | NHTSA, and analyzing what conditions
00:32:12.980 | typically lead to accidents.
00:32:15.660 | So you can do that.
00:32:16.500 | Of course, you can do it manually, you can create them.
00:32:19.260 | What else could you do?
00:32:20.420 | Well, you want to leverage your driving data.
00:32:25.420 | You have all your logs,
00:32:26.700 | you have a bunch of situations there, right?
00:32:29.540 | So you can pick interesting situations from your logs.
00:32:33.980 | And furthermore, what you can do is,
00:32:35.380 | you take all these situations
00:32:36.540 | and you create variations of these situations
00:32:39.140 | so you get even more scenarios.
00:32:41.220 | So here's an example of a log simulation.
00:32:45.260 | I'll play it twice.
00:32:46.180 | First time, look at the image.
00:32:48.260 | This is what happened in the real world the first time.
00:32:51.460 | So in the real world,
00:32:53.180 | we mostly stayed in the middle lane and stopped.
00:32:56.540 | If you see what happened in simulation,
00:33:00.420 | our algorithm decided this time
00:33:02.740 | to merge to the left lane and stop.
00:33:06.140 | And everything was fine, things were safe,
00:33:08.780 | things were happy.
00:33:09.740 | What can go wrong in simulation from logs?
00:33:14.680 | Well, let's say this is another scenario,
00:33:19.100 | slightly different visualization.
00:33:20.860 | Our vehicle, when it drove the real world,
00:33:23.140 | was where the green vehicle is.
00:33:24.700 | Now in simulation, we drove differently
00:33:27.980 | and we have the blue vehicle, right?
00:33:30.540 | And so we're driving.
00:33:36.620 | What happened?
00:33:39.020 | Well, there is a purple agent over there,
00:33:41.700 | pesky purple agent, who in the real world
00:33:43.780 | saw that we passed them safely.
00:33:45.900 | And so it was safe for them to go,
00:33:48.980 | but it's no longer safe because we changed what we did.
00:33:51.900 | So the insight is, in simulation,
00:33:53.860 | our actions affect the environment
00:33:55.700 | and that needs to be accounted for.
00:33:57.340 | So what does that mean?
00:33:59.860 | If you want to have effective simulations
00:34:04.140 | on a large scale, you need to simulate
00:34:06.300 | the realistic driver and pedestrian behavior.
00:34:08.540 | So, you know, you could think of a simple model.
00:34:12.740 | Well, what is a good proxy or what's a good approximation
00:34:16.460 | of a realistic behavior?
00:34:17.540 | Well, you can do a break and swerve model.
00:34:19.700 | So you just say, well, there is some normal way
00:34:23.180 | reactions happen.
00:34:24.220 | You know, I have a reaction time and breaking profile
00:34:27.340 | and maybe a swerving profile,
00:34:28.500 | so if an agent sees someone in front of them,
00:34:30.340 | maybe they just apply it as an algorithm.
00:34:32.380 | Right, so hopefully I convinced you that behavior
00:34:35.580 | can be fairly complicated and this will not always produce
00:34:38.780 | a believable reaction, especially in complex
00:34:42.020 | interactive cases such as merges, lane changes,
00:34:45.060 | intersections, and so on.
00:34:47.540 | Right, so what could you do?
00:34:49.980 | You could learn an agent from real demonstrations.
00:34:55.500 | Well, you went and collected all this data in the world,
00:34:57.620 | you have a bunch of information of how vehicles,
00:35:00.940 | pedestrians behave, you can learn a model and use that.
00:35:05.940 | Okay, so what is an agent?
00:35:08.460 | Let's look a little bit.
00:35:09.660 | An agent receives sensor information
00:35:14.340 | and maybe context about the environment.
00:35:17.340 | And it develops a policy, it develops a reaction,
00:35:21.980 | it's a driver agent, so it applies acceleration and steering,
00:35:25.460 | then gets new sensor information, new map information,
00:35:30.300 | its place in the map, and it continues.
00:35:32.020 | And if it's our own vehicle, then you also have a router
00:35:34.420 | that's an explicit intent generator which says,
00:35:37.340 | well, the passenger wants you to go over there,
00:35:39.820 | why don't we try to make a right turn now?
00:35:41.580 | So you also get an intent.
00:35:43.860 | And this is an agent, it could be in simulation,
00:35:46.660 | it could be in the real world, roughly this is the picture.
00:35:48.820 | And this is an end-to-end agent,
00:35:50.140 | end-to-end learning is popular, right?
00:35:52.340 | To its best approximation, if you learn a good policy
00:35:55.780 | this way, you can apply it and have
00:35:57.780 | very believable agent reactions.
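A hypothetical sketch of that sense, decide, act loop for a driver agent, with a router supplying intent for our own vehicle. None of these classes correspond to real Waymo components; the placeholder policy and world dynamics are assumptions purely to show the loop structure.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    ego_speed: float
    distance_to_lead_vehicle: float

@dataclass
class Action:
    acceleration: float
    steering: float

def router_intent(step):
    return "turn_right" if step > 50 else "go_straight"   # explicit intent generator

def policy(obs: Observation, intent: str) -> Action:
    # Placeholder policy: keep a gap to the lead vehicle, steer for the intent.
    accel = 1.0 if obs.distance_to_lead_vehicle > 30 else -2.0
    steer = 0.2 if intent == "turn_right" else 0.0
    return Action(accel, steer)

def step_world(obs: Observation, act: Action, dt=0.1) -> Observation:
    speed = max(0.0, obs.ego_speed + act.acceleration * dt)
    gap = obs.distance_to_lead_vehicle - (speed - 10.0) * dt  # lead car at 10 m/s
    return Observation(speed, gap)

obs = Observation(ego_speed=12.0, distance_to_lead_vehicle=50.0)
for step in range(100):
    obs = step_world(obs, policy(obs, router_intent(step)))
print(obs)
```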
00:35:59.500 | And so I'm gonna tell you a little bit about work
00:36:02.980 | we did in this direction.
00:36:04.060 | So we put a paper on arXiv about a month ago, I believe.
00:36:07.140 | We took 60 hours of footage of driving
00:36:12.500 | and we tried to see how well we can imitate it
00:36:16.220 | using a deep neural network.
00:36:18.140 | And so one option is to do exactly the same
00:36:21.260 | to end-to-end agent policy,
00:36:22.340 | but we wanted to make our task easier.
00:36:26.660 | Well, we have a good perception system at Waymo,
00:36:28.780 | so why don't we use its products for that agent?
00:36:33.020 | Also can simplify the input representation a bit,
00:36:35.580 | that is good, the task becomes easier.
00:36:37.740 | Controllers are well understood,
00:36:39.980 | we can use an existing controller,
00:36:41.420 | so no need to worry about acceleration and torques,
00:36:43.900 | we can generate trajectories.
00:36:45.980 | Now, if you want to see in a little more detail
00:36:50.980 | to understand the representation,
00:36:52.660 | so here we have our agent vehicle,
00:36:55.100 | which is the self-driving vehicle in this case,
00:36:56.860 | but could be a simulation agent.
00:36:59.100 | And we render an image with it at the center
00:37:01.300 | and potentially we augment it with some,
00:37:03.540 | we can generate a little bit of rotation to the image
00:37:07.580 | just so we don't over-bias the orientation a specific way.
00:37:12.220 | All right, and it's an 80 by 80 box,
00:37:13.860 | so we roughly see about 60 meters in front of us
00:37:16.100 | and 40 meters to the side in this setup.
00:37:18.460 | And now we render a road map in this box,
00:37:23.540 | which is the map, like which lanes you're allowed
00:37:25.780 | to drive on, there's traffic lights,
00:37:28.140 | and generally at intersections we render
00:37:30.260 | what lanes are allowed to go in what lanes
00:37:32.180 | and how the traffic lights permit it or do not permit it.
00:37:35.020 | Then you can render speed limits,
00:37:39.140 | the objects, result of your perception system,
00:37:42.380 | you render your current vehicle where it believes it is,
00:37:46.580 | and you render the past pose history.
00:37:50.460 | So you give an image of where the agent's been
00:37:53.260 | in the last few steps.
00:37:54.660 | And, last but not least,
00:37:58.420 | you render the intent, so the intent is where you want to go.
00:38:01.900 | So it's conditioned on this intent and this input,
00:38:04.780 | you want to predict the future waypoints for this vehicle.
00:38:07.220 | Right, so that's the task.
00:38:08.440 | And you can phrase it as a supervised learning problem.
00:38:11.240 | Right, just learn to, learn a policy with this network
00:38:15.900 | that approximates what you've seen in the world,
00:38:18.060 | with 60 hours of data.
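A toy sketch of that mid-level representation: the input is an 80 by 80 top-down image with one channel per rendered element (road map, objects, past agent poses, intent), and the target is a short sequence of future waypoints used for supervised imitation. The channel layout, shapes, and helper names are illustrative assumptions, not the paper's exact format.

```python
import numpy as np

GRID = 80  # 80x80 cells covering roughly 60 m ahead and 40 m to each side

def render_input(roadmap, objects, past_poses, intent_goal):
    """Each argument is a list of (row, col) grid cells; returns (4, 80, 80)."""
    channels = np.zeros((4, GRID, GRID), dtype=np.float32)
    for ch, cells in enumerate([roadmap, objects, past_poses, [intent_goal]]):
        for r, c in cells:
            channels[ch, r, c] = 1.0
    return channels

def imitation_example(log_entry):
    """A logged frame becomes (rendered input, future waypoints) for supervision."""
    x = render_input(log_entry["roadmap"], log_entry["objects"],
                     log_entry["past_poses"], log_entry["intent"])
    y = np.asarray(log_entry["future_waypoints"], dtype=np.float32)  # (N, 2)
    return x, y

frame = {
    "roadmap": [(40, c) for c in range(20, 60)],
    "objects": [(35, 45)],
    "past_poses": [(40, 38), (40, 39), (40, 40)],
    "intent": (40, 70),
    "future_waypoints": [(40, 42), (40, 45), (40, 48)],
}
x, y = imitation_example(frame)
print(x.shape, y.shape)  # (4, 80, 80) (3, 2)
```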
00:38:19.260 | Of course, with learning agents there is a well-known problem,
00:38:24.940 | identified in the DAgger paper
00:38:27.780 | by Stéphane Ross, who is actually at Waymo now,
00:38:31.620 | and Andrew Bagnell.
00:38:33.140 | So it's easy to make small errors over time,
00:38:35.500 | so even if at each step you make a relatively
00:38:38.140 | good estimate, if you string 10 steps together,
00:38:40.140 | you can end up very different from where agents
00:38:43.220 | have been before.
00:38:44.160 | Right, and there is techniques to handle this.
00:38:47.900 | One thing we did was synthesize perturbations.
00:38:51.080 | So you have your trajectory, and we synthesize,
00:38:54.540 | deform the trajectory and force the vehicle to learn
00:38:56.580 | to come back to the middle of the lane.
00:38:58.880 | So that's something you can do.
00:39:00.340 | That's reasonable.
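A toy sketch of that perturbation idea: take a logged trajectory, push one point off to the side, and blend the offset back to zero so the agent sees examples of drifting off and recovering toward the original path. The linear blending scheme below is an illustrative assumption, not the paper's exact method.

```python
import numpy as np

def perturb_trajectory(traj, perturb_index, lateral_offset):
    """traj: (T, 2) array of (x, y) waypoints along the lane center."""
    traj = np.asarray(traj, dtype=float).copy()
    T = len(traj)
    for t in range(T):
        # Offset is largest at perturb_index and decays linearly to zero within
        # T/2 steps, so the path rejoins the original (the recovery being taught).
        weight = max(0.0, 1.0 - abs(t - perturb_index) / (T / 2))
        traj[t, 1] += lateral_offset * weight
    return traj

straight = np.stack([np.arange(10.0), np.zeros(10)], axis=1)
print(perturb_trajectory(straight, perturb_index=3, lateral_offset=1.5).round(2))
```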
00:39:02.540 | Now, if you just have direct imitation based on supervision,
00:39:06.900 | we are trying to pass a parked vehicle in the street,
00:39:09.420 | and the agent stops and never continues.
00:39:11.700 | So now we did perturbations, and well,
00:39:16.700 | it kind of ran through the vehicle.
00:39:19.160 | Right, so that's not enough.
00:39:21.640 | So we need more, right?
00:39:22.820 | It's not actually an easy problem.
00:39:24.860 | So in addition to having this agent RNN,
00:39:27.340 | which essentially takes the past and creates memory
00:39:32.020 | of its past decisions and keeps iterating,
00:39:35.300 | predicting multiple points in the future.
00:39:37.540 | So it predicts the trajectory piecemeal in the future.
00:39:41.020 | How about we also learn about collisions
00:39:45.220 | and staying on the road and so on.
00:39:46.620 | So we augment the network, and now the network
00:39:48.700 | starts also predicting a mask for the road.
00:39:51.860 | And now we have a loss here.
00:39:54.060 | I don't know if I can point.
00:39:55.420 | So here you have a road mask loss.
00:39:58.660 | You say, hey, if you drive or generate motions
00:40:01.100 | that take you outside the road, that's probably not good.
00:40:04.100 | Hey, if you ever cause collisions
00:40:06.020 | with the other objects, whose motion the network
00:40:10.300 | also predicts. So we predict here our own motion,
00:40:13.640 | where the road is, and the other agents' motion
00:40:17.200 | in the future, and we're trying to make sure
00:40:19.060 | there are no collisions and that we stay on the road.
00:40:20.780 | So you add this structural knowledge.
00:40:22.580 | That adds a lot more constraints
00:40:24.660 | to the system as it trains.
00:40:26.860 | So it's not limited to just what it has explicitly seen;
00:40:29.740 | it allows it to reason about things
00:40:31.220 | it has not explicitly seen as well.
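A toy sketch of those extra structural losses: besides imitating the logged waypoints, penalize predicted positions that leave the road mask or land on cells occupied by other agents' predicted motion. The loss forms and weights below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def total_loss(pred_waypoints, logged_waypoints, road_mask, other_agent_mask,
               w_imitation=1.0, w_road=1.0, w_collision=10.0):
    pred = np.asarray(pred_waypoints)          # (T, 2) integer grid cells
    logged = np.asarray(logged_waypoints)
    imitation = float(np.mean(np.sum((pred - logged) ** 2, axis=1)))

    rows, cols = pred[:, 0], pred[:, 1]
    off_road = float(np.mean(1.0 - road_mask[rows, cols]))       # on-road -> 0 loss
    collision = float(np.mean(other_agent_mask[rows, cols]))     # overlap -> loss

    return w_imitation * imitation + w_road * off_road + w_collision * collision

road = np.zeros((80, 80)); road[38:43, :] = 1.0         # a horizontal lane
others = np.zeros((80, 80)); others[40, 50:55] = 1.0    # another agent ahead
pred = [(40, 44), (40, 47), (40, 51)]                   # last point hits the agent
logged = [(40, 44), (40, 47), (40, 50)]
print(round(total_loss(pred, logged, road, others), 3))
```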
00:40:34.100 | And so now, here's an example of us driving
00:40:36.780 | with this network.
00:40:38.660 | And you can see that we're predicting the future
00:40:42.180 | with yellow boxes, and we're driving safely
00:40:44.300 | through intersections and complex scenarios.
00:40:46.340 | Actually handles a lot of scenarios very well.
00:40:49.340 | If you're interested, I welcome you to go read the paper.
00:40:52.140 | It handles most of the simple situations fine.
00:40:55.380 | So now recall our past two approaches
00:40:58.140 | to passing a parked car.
00:41:00.220 | One of them stops and never restarts.
00:41:01.860 | The other one hits the car.
00:41:04.220 | Now it actually handles it fine.
00:41:05.820 | And beyond that, afterwards, we can stop
00:41:11.020 | at the stop sign happily, which is the red line over there,
00:41:13.900 | and it does all of these operations.
00:41:16.060 | And what we did beyond this is, we took the system,
00:41:19.280 | as learned on imitation data, and we actually drove
00:41:21.860 | our real Waymo car with it.
00:41:23.900 | So we took it to Castle, the Air Force Base staging grounds,
00:41:27.260 | and this is it driving a road it's never seen before
00:41:30.460 | and stopping at stop signs and so on.
00:41:32.260 | So that's all great.
00:41:33.500 | We could use it also in agent simulation world,
00:41:36.120 | and we could drive a car with it, but it has some issues.
00:41:38.960 | So let's look on the left.
00:41:40.900 | So here it is driving, and then it was driving too fast,
00:41:45.900 | so because our range is limited, it didn't know
00:41:49.300 | it had to make a turn, and it overran the turn.
00:41:51.500 | So it just drove off the road.
00:41:52.900 | That's one thing that can happen.
00:41:54.500 | So, you know, one area of improvement, more range.
00:41:59.420 | Here's just another time.
00:42:00.780 | So yellow is, by the way, what we did in the real world,
00:42:05.380 | and green is what we do in the simulation, in that example.
00:42:08.660 | And here, we're trying to execute a complex maneuver,
00:42:13.180 | a U-turn, we're sitting there, and we're gonna try to do it,
00:42:17.060 | and we almost do it, but not quite,
00:42:19.660 | and at least we end up in the driveway.
00:42:21.660 | And there's other interactive situations.
00:42:25.620 | When they get really complex, this network also
00:42:28.180 | does not do too well, right?
00:42:29.740 | And so what does that tell us?
00:42:32.700 | Well, long tail came again in testing, right?
00:42:37.500 | There's, again, you can learn a policy
00:42:39.900 | for a lot of the common situations,
00:42:42.980 | but actually in testing, some of the things
00:42:44.620 | you really care about is the long tail.
00:42:46.340 | You want to test to the corner cases.
00:42:48.000 | You want to test in the scenarios where someone
00:42:49.900 | is obnoxious and adversarial and does something
00:42:52.200 | not too kosher, right?
00:42:55.660 | So one way to think of it is this, right?
00:43:00.540 | This is the distribution of human behavior,
00:43:02.380 | and of course, it goes in multiple axes.
00:43:04.620 | It could be aggressive and conservative, right?
00:43:09.620 | And then somewhere in between, you could be
00:43:12.220 | super expert driver and super inexperienced
00:43:14.460 | and somewhere in between, and so on.
00:43:16.260 | So our end-to-end model, it's fairly,
00:43:21.740 | it's an unbiased representation, meaning
00:43:24.860 | it could, in theory, learn any policy, right?
00:43:26.700 | I mean, you see everything you want to know
00:43:28.500 | about the environment, by and large.
00:43:30.920 | But it's complex, and this is similar a bit
00:43:33.260 | to the models as well, some of the models
00:43:35.440 | we talked about before.
00:43:36.420 | You can end up with complex model
00:43:37.700 | if you have complex input.
00:43:38.780 | This is images that are 80 by 80 with multiple channels.
00:43:41.860 | It's a large input space.
00:43:42.980 | The model can have tens of millions of parameters.
00:43:45.500 | Now, if you have an example, if you have a case
00:43:47.360 | where you have two or three examples
00:43:48.700 | in your whole 60 hours of driving,
00:43:50.940 | there's no guarantee that your 10 million
00:43:52.980 | parameter model will learn it well, right?
00:43:56.020 | And so it's really good when you have a lot of examples.
00:43:58.620 | It's really trying to do well in those.
00:44:02.420 | And then you have the long tail.
00:44:04.020 | So what do you do?
00:44:07.300 | Well, we can improve the representation.
00:44:10.300 | We can improve our model.
00:44:11.460 | This is, there is a lot of room to keep evolving this.
00:44:16.100 | And then this area will keep expanding, right?
00:44:18.940 | And that's one good direction.
00:44:20.220 | There is a lot of interesting questions how to do that,
00:44:22.260 | and we're working on a lot of them.
00:44:23.460 | There's actually some exciting work.
00:44:24.940 | Hopefully I get to share with you another time.
00:44:27.020 | Something else you can do, if you remember from my slide
00:44:29.340 | about the hybrid system, when you go to the long tail,
00:44:32.300 | you can do essentially a similar thing,
00:44:35.580 | which is a simpler, biased, expert-designed input representation
00:44:39.380 | that is much easier to learn with few examples.
00:44:41.660 | You can also, of course, use expert-designed models.
00:44:45.500 | And so in this case, you still will produce
00:44:48.460 | something reasonable by inputting this human knowledge.
00:44:52.400 | And you could have many models.
00:44:53.880 | I mean, there's not one.
00:44:55.280 | You could just tune to various aspects of this distribution.
00:44:58.640 | You can have little models for all the aspects
00:45:00.480 | you care about.
00:45:01.320 | You can mix and match, right?
00:45:02.840 | So that's another way to do it.
00:45:05.220 | So let me tell you about one such model.
00:45:07.320 | So trajectory optimization agent.
00:45:10.840 | So we take inspiration from motion control theory,
00:45:14.120 | and we want to plan a good trajectory for the vehicle,
00:45:18.360 | the agent vehicle, one that satisfies a bunch
00:45:21.400 | of constraints and preferences, right?
00:45:23.880 | And so one insight to this is that we already know
00:45:29.760 | what the agent did in the environment last time.
00:45:32.840 | So you have a fairly strong idea about the intent.
00:45:35.680 | And that helps you when you specify the preferences.
00:45:38.440 | 'Cause you can say, okay, well, give me a trajectory
00:45:41.860 | that minimizes some set of costs,
00:45:43.840 | which are preferences on the trajectory,
00:45:45.680 | typically called potentials.
00:45:48.120 | What is a potential?
00:45:49.040 | Well, at different parts of the trajectory,
00:45:51.360 | you can add these attractor potentials saying,
00:45:53.720 | well, try to go where you used to be before, for example.
00:45:56.640 | And that's the benefit of, in simulation,
00:45:58.840 | you have observed what was done.
00:46:00.320 | So this is a bit simpler.
00:46:01.600 | And of course, you can have repeller potential.
00:46:05.000 | Don't hit things, don't run into vehicles, right?
00:46:07.720 | So to a first approximation,
00:46:09.080 | that's what it roughly looks like.
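
To make the potential idea concrete, here is a minimal sketch, my illustration rather than Waymo's implementation, of a trajectory cost built from an attractor potential that pulls toward a target (e.g., where the agent drove in the log) and a repeller potential that penalizes getting too close to other agents. The function names, potential shapes, and parameter names are assumptions.

```python
import numpy as np

def attractor_potential(traj, target_traj, steepness):
    """Pull the planned trajectory toward a target, e.g. where the agent drove in the log."""
    # traj, target_traj: (T, 2) arrays of x/y waypoints
    return steepness * np.sum((traj - target_traj) ** 2)

def repeller_potential(traj, other_traj, safe_dist, steepness):
    """Penalize coming closer than safe_dist to another agent's trajectory."""
    dists = np.linalg.norm(traj - other_traj, axis=1)
    violation = np.maximum(0.0, safe_dist - dists)   # hinge: only active when too close
    return steepness * np.sum(violation ** 2)

def trajectory_cost(traj, target_traj, other_trajs, params):
    """Total cost: one attractor plus one repeller per other agent."""
    cost = attractor_potential(traj, target_traj, params["attract"])
    for other in other_trajs:
        cost += repeller_potential(traj, other, params["safe_dist"], params["repel"])
    return cost
```

The point of the sketch is that the whole cost is governed by a handful of scalar parameters, which is what makes it learnable from relatively few examples.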
00:46:10.800 | And so now, where is the learning, right?
00:46:14.960 | Well, it's still a machine learning model.
00:46:16.840 | There is a representation.
00:46:17.840 | These potentials have parameters.
00:46:19.320 | It's the steepness of this curve.
00:46:22.480 | Sometimes they're multidimensional, right?
00:46:25.560 | There's a few parameters.
00:46:26.700 | Typically, we're talking a few dozen parameters or less.
00:46:30.760 | All right, and you can learn them too.
00:46:33.320 | So there is a technique called
00:46:35.640 | inverse reinforcement learning.
00:46:37.240 | You want to learn these parameters
00:46:39.880 | that produce trajectories that come close
00:46:41.960 | to the trajectories you've observed in the real world.
00:46:44.360 | So if you pick a bunch of trajectories
00:46:46.280 | that represent a certain type of behavior,
00:46:47.880 | you want to tune the parameters
00:46:49.400 | so the model behaves like it.
00:46:50.840 | And then you want to generate reasonable trajectories,
00:46:53.280 | continuous, feasible, that satisfy this, right?
00:46:57.000 | And this is part of this optimization.
00:46:58.400 | You can solve this, actually.
00:46:59.240 | And so then you can tune these agents.
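
Continuing the sketch above, here is a toy version of the inverse reinforcement learning step: search over the handful of potential parameters for the combination whose optimized trajectory comes closest to observed driving. The plain grid search and the generic scipy optimizer are simplifications for illustration, and the data fields (`route`, `others`, `observed`) are hypothetical.

```python
import numpy as np
from itertools import product
from scipy.optimize import minimize

def plan(route, others, params):
    """Optimize a trajectory under the potentials, warm-started from the reference route."""
    T = route.shape[0]
    res = minimize(lambda x: trajectory_cost(x.reshape(T, 2), route, others, params),
                   route.ravel())
    return res.x.reshape(T, 2)

def fit_params(demos, attract_vals, repel_vals, safe_vals):
    """Grid-search the few potential parameters against observed trajectories."""
    best, best_err = None, float("inf")
    for a, r, s in product(attract_vals, repel_vals, safe_vals):
        params = {"attract": a, "repel": r, "safe_dist": s}
        err = sum(np.sum((plan(d["route"], d["others"], params) - d["observed"]) ** 2)
                  for d in demos)
        if err < best_err:
            best, best_err = params, err
    return best
```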
00:47:01.100 | And so here's some agents I want to show you.
00:47:04.720 | So this is a complex interactive scenario.
00:47:07.520 | Two vehicles, but you can see
00:47:10.880 | on the right is the aggressive guy.
00:47:14.020 | Blue is the agent.
00:47:15.080 | Red is our vehicle.
00:47:16.040 | We're testing in simulation.
00:47:18.120 | And so let me play one more time.
00:47:20.840 | Once this ends, essentially on the left
00:47:24.280 | is the conservative driver.
00:47:25.480 | On the right is the aggressive driver.
00:47:27.760 | And they pass us, right?
00:47:29.380 | And they induce very different reactions in our vehicle.
00:47:33.200 | So the aggressive guy went and passed us
00:47:36.800 | and pushed us further into that lane
00:47:39.000 | and we merge much later.
00:47:40.700 | In the other case, when you have a conservative driver,
00:47:43.040 | we are in front of them and they're not bugging us
00:47:45.380 | and we execute much earlier.
00:47:47.200 | We can switch into the right lane where we want to go.
00:47:50.400 | All right, so this is agents that can test your system well.
00:47:52.680 | Now you have different scenarios in this case,
00:47:57.680 | depending on what agent you put in.
00:47:59.800 | And I'll show you a few more scenarios.
00:48:02.880 | So it's not just a two-agent game.
00:48:04.600 | I mean, we can do things like merging
00:48:07.800 | from one side of the highway to the next.
00:48:10.000 | And this type of agent can generate
00:48:13.400 | fairly reasonable behaviors.
00:48:15.000 | It slows down for a slow vehicle in front,
00:48:17.700 | lets the vehicles on the side pass,
00:48:19.100 | and still completes the mission.
00:48:21.460 | And you can generate multiple futures with this agent.
00:48:26.460 | So here's an example again.
00:48:28.740 | On the right will be an aggressive guy.
00:48:30.700 | Right, and on the left was the more conservative person.
00:48:36.060 | The aggressive guy found a gap between the two vehicles
00:48:38.660 | and just went for it, right?
00:48:40.820 | And you can test your stack this way.
00:48:43.120 | And one more I wanted to show you
00:48:44.580 | is an aggressive motorcycle driving.
00:48:47.620 | So you can have an agent with which
00:48:50.340 | you can test your reaction to motorcycles
00:48:51.980 | that are weaving in the lane, right?
00:48:53.780 | So I guess what's my takeaway from this story
00:48:57.140 | about testing and the long tail?
00:48:59.420 | You need a menagerie of agents at the moment, right?
00:49:02.140 | So if you think of it,
00:49:05.880 | learning from demonstration is key.
00:49:10.740 | You can encode some simple models by hand,
00:49:12.580 | but ultimately it's much better.
00:49:14.240 | The task of modeling agent behavior is complex
00:49:16.940 | and it's much better learned.
00:49:19.080 | And so here's the space of models.
00:49:20.680 | So you can have non-learned agents:
00:49:21.800 | you can just replay the log like I showed,
00:49:23.640 | and you can hand-design trajectories for agents,
00:49:26.600 | for this situation do this, for that situation do that.
00:49:29.760 | Then you can have the brake-and-swerve model
00:49:31.440 | that mostly, if there's someone in front of an agent,
00:49:33.680 | just does a deterministic brake.
00:49:36.100 | Trajectory optimization, which I just showed.
00:49:39.640 | Now our mid-to-mid model and potentially
00:49:41.720 | an end-to-end top-down model, top-down meaning
00:49:43.740 | you have a top view of the environment.
00:49:45.380 | There's many other representations possible.
00:49:47.540 | This is a very interesting space.
00:49:49.000 | Ultimately I wanted to show you
00:49:51.300 | there's many possible agents
00:49:52.900 | and they have different utility
00:49:54.980 | and they need different numbers of examples
00:49:56.740 | to train them with.
00:49:58.140 | And so one other takeaway I wanted to tell you
00:50:00.980 | is smart agents are critical for autonomy at scale.
00:50:04.620 | This is something I truly believe working in the space.
00:50:07.980 | And this line of direction is exciting
00:50:09.840 | and ultimately one of the exciting problems
00:50:12.700 | that there's still a lot of interesting progress to be made.
00:50:17.140 | And why?
00:50:18.480 | Well you have accurate models of human behavior
00:50:20.480 | of drivers and pedestrians
00:50:21.820 | and they help you achieve several things.
00:50:23.900 | First, you will make better decisions when you drive yourself.
00:50:27.140 | You'll be able to better anticipate what others will do,
00:50:29.300 | and that will be helpful.
00:50:30.980 | Second, you can develop a robust simulation environment
00:50:34.980 | with those insights, also very important.
00:50:38.220 | Third, well our vehicle is also
00:50:40.320 | one more agent in the environment.
00:50:41.660 | It's an agent we have more control over than the others,
00:50:44.120 | but a lot of these insights apply.
00:50:45.820 | And so this is very exciting and interesting.
00:50:49.620 | So I wanted to finish the talk
00:50:52.120 | just maybe as a mental exercise.
00:50:55.400 | When you think of a system
00:50:57.560 | that is tackling a complex AI challenge like self-driving,
00:51:01.160 | what are the good properties for the system to have
00:51:03.120 | and how do you think of a scalable system?
00:51:05.920 | And to me there's this mental test.
00:51:07.920 | We want to grow and handle and bring our service
00:51:10.680 | to more and more environments, more and more cities.
00:51:13.120 | How do you scale to dozens or hundreds of cities?
00:51:15.580 | So as we talked about the long tail,
00:51:18.520 | each new environment can bring new challenges.
00:51:20.860 | And they can be complex intersections in cities like Paris.
00:51:25.100 | There's Lombard Street in San Francisco; I'm from there.
00:51:28.380 | There are narrow streets in European towns.
00:51:30.320 | There are all kinds of things; the long tail keeps coming.
00:51:34.160 | As you keep driving new environments,
00:51:36.000 | in Pittsburgh people drive the famous Pittsburgh left.
00:51:38.640 | They take different precedence than usual.
00:51:42.520 | The local customs of driving, of behaving,
00:51:44.600 | all of this needs to be accounted for as you expand.
00:51:47.640 | And this makes your system potentially more complex
00:51:49.800 | and harder to tune to all environments.
00:51:52.920 | But it's important because ultimately
00:51:54.600 | that's the only way you can scale.
00:51:55.960 | So how do you, what should a scalable process do?
00:52:00.680 | So in my mind, let's say you have
00:52:02.480 | a very good self-driving system.
00:52:04.920 | I mean, this very much parallels the factory analogy.
00:52:07.520 | I'm just going to repeat it one more time.
00:52:09.700 | You take your vehicles, we put in a bunch of Waymo cars,
00:52:12.680 | and we drive a long time in that environment with drivers.
00:52:15.680 | Maybe 30 days, maybe more, at least that long.
00:52:18.040 | And you collect all the data, right?
00:52:22.040 | And then your system should be able to improve a lot
00:52:25.160 | on the data you have collected, right?
00:52:29.160 | So drive a bunch, obviously you don't want
00:52:34.560 | to train the system too much in the real world
00:52:36.360 | while it's driving, but you want to train it
00:52:38.640 | after you've collected data about the environment.
00:52:42.280 | So it needs to be trainable on collected data.
00:52:44.580 | It's very important for a system to be able to quantify,
00:52:49.220 | or have a notion you can elicit from it,
00:52:51.640 | whether it's incorrect or not confident, right?
00:52:56.120 | Because then you can take action.
00:53:00.300 | And this is an important property that I think
00:53:00.300 | people should think of when they design systems.
00:53:02.360 | How do you elicit this?
00:53:04.620 | Then you can take an action.
00:53:06.860 | You can ask questions to raters, that's fairly legit.
00:53:10.220 | Typically active learning is a bit like this, right?
00:53:12.220 | So, and it's usually based on some amount
00:53:14.900 | of low confidence or surprise.
00:53:18.500 | That's the examples you want to send.
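
As a rough illustration of the active-learning idea just described, a selection step might look like the following sketch. The `predict_proba` model API, the example fields, and the threshold are assumptions for illustration, not anything Waymo-specific.

```python
import numpy as np

def select_for_rating(examples, model, conf_threshold=0.7):
    """Pick examples worth sending to human raters: low confidence or surprising."""
    to_rate = []
    for ex in examples:
        probs = np.asarray(model.predict_proba(ex["features"]))  # hypothetical model API
        low_confidence = probs.max() < conf_threshold
        # "Surprise": the new prediction disagrees with an earlier label or model output.
        surprise = ex.get("prior_label") is not None and probs.argmax() != ex["prior_label"]
        if low_confidence or surprise:
            to_rate.append(ex)
    return to_rate
```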
00:53:21.040 | And even better, the system could potentially
00:53:24.620 | directly update itself, and this is an interesting question.
00:53:27.980 | How do systems update themselves in light of new knowledge?
00:53:30.600 | And we have a system that clearly does this, right?
00:53:33.380 | And typically you do it with reasoning.
00:53:37.900 | And what is reasoning, right?
00:53:41.860 | So, I have an answer.
00:53:45.580 | It is one answer, there is possibly others, right?
00:53:49.140 | But one way is you can check and enforce consistency
00:53:51.700 | of your beliefs, and you can look for explanations
00:53:53.820 | of the world that are consistent.
00:53:55.420 | And so if you have a mechanism in the system
00:53:57.880 | that can do this, this allows the system
00:53:59.880 | to improve itself without necessarily
00:54:02.220 | being fed purely labeled data.
00:54:05.060 | It can improve itself on just collected data.
00:54:08.320 | And I think it's interesting to think of systems
00:54:10.920 | where you can do reasoning and the representations
00:54:13.780 | that these models need to have.
00:54:15.340 | And last but not least, we need scalable training
00:54:20.140 | and testing infrastructure, right?
00:54:22.880 | This is part of the factory that I was talking about.
00:54:24.940 | I'm very lucky at Waymo to have wonderful infrastructure.
00:54:29.100 | And it allows this virtuous cycle to happen.
00:54:34.100 | Thank you.
00:54:35.180 | (audience applauding)
00:54:38.340 | - Up here, Kieran Strobel.
00:54:42.060 | Thank you so much for the talk, I really appreciate it.
00:54:43.620 | So if you were to train off of image and LiDAR data,
00:54:47.420 | synthetic image and LiDAR data,
00:54:49.020 | would you weight the synthetic data differently
00:54:52.900 | than real-world data when training your models?
00:54:56.140 | - So there's actually a lot of interesting research
00:54:58.020 | in the field.
00:54:59.260 | There are people who train on simulators,
00:55:01.680 | but also train adaptation models
00:55:04.740 | that make simulator data look like real data.
00:55:07.920 | Right?
00:55:10.320 | So you're essentially, you're trying to build consistency,
00:55:14.520 | or at least you're training on simulator scenarios,
00:55:16.980 | but if you learn a mapping from simulator scenes
00:55:19.780 | to real scenes, right, you could potentially train
00:55:22.860 | on the simulator data that's already been
00:55:25.340 | transformed with other models.
00:55:27.220 | There's many ways to do this, ultimately, right?
00:55:29.540 | So achieving realism in simulator
00:55:31.660 | is an open research problem, right?
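
As a hedged sketch of the two options this answer mentions, one could run simulated batches through a learned (and here frozen) sim-to-real adaptation model before training, and also down-weight the synthetic loss term. The `sim2real` module, the batch fields, and the weight value are placeholders, not a description of Waymo's pipeline.

```python
import torch
import torch.nn.functional as F

def training_step(model, real_batch, sim_batch, sim2real, optimizer, sim_weight=0.5):
    """One step mixing real data with adapted synthetic data, down-weighting the latter."""
    sim_images = sim2real(sim_batch["image"]).detach()   # frozen sim-to-real adaptation model
    loss_real = F.cross_entropy(model(real_batch["image"]), real_batch["label"])
    loss_sim = F.cross_entropy(model(sim_images), sim_batch["label"])
    loss = loss_real + sim_weight * loss_sim             # synthetic data weighted lower
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```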
00:55:34.420 | - I assume there are a lot of rules
00:55:37.060 | that you have to put into a system
00:55:39.340 | to be able to trust it, you know?
00:55:44.020 | And so how do you find the balance
00:55:45.900 | between these automatic models,
00:55:48.660 | like neural networks, where you're not quite sure
00:55:51.460 | what they would do, and rules, where you're sure
00:55:54.180 | what they do but that are not scalable?
00:55:56.620 | - I mean, through lots and lots of testing and analysis,
00:55:59.540 | right, so you keep track
00:56:02.940 | of the performance of your models
00:56:04.460 | and you see where they come short, right?
00:56:07.220 | And then those are the areas you most need
00:56:10.420 | expert to complement, right?
00:56:13.660 | But the balance can change over time, right?
00:56:15.340 | And it's a natural process of evolution, right?
00:56:18.820 | So evolving your system as you go.
00:56:20.300 | I mean, generally, you know, the ML part grows
00:56:23.100 | as the capabilities and the data sets grow, right?
00:56:26.340 | - So you stressed at the end of both the first half
00:56:29.860 | and the second half of your talk,
00:56:31.660 | the importance of quantifying uncertainty
00:56:34.300 | in the predictions that your models are making.
00:56:37.220 | So have you developed techniques
00:56:40.500 | for doing that with neural nets,
00:56:42.460 | or are you using some probabilistic graphical models
00:56:45.300 | or something?
00:56:46.540 | - I mean, so a lot of the models are neural nets.
00:56:49.660 | There's many ways to capture this, actually.
00:56:53.580 | I'm just going to give a general answer.
00:56:55.340 | I'm not commenting specifically on
00:56:56.900 | what Waymo is doing.
00:56:59.020 | I think, first of all, there are techniques in neural nets
00:57:00.900 | where they can predict
00:57:02.620 | their own uncertainty fairly well, right?
00:57:05.380 | You can either directly regress the uncertainty
00:57:07.300 | for certain outputs, or use ensembles of networks
00:57:10.260 | or dropout or techniques like this
00:57:12.100 | that also provide a measure of uncertainty.
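
For concreteness, here is a generic sketch of the two techniques named here, ensembles and Monte Carlo dropout, used as rough uncertainty measures. This is standard practice from the literature, not a statement about Waymo's models.

```python
import torch

def ensemble_uncertainty(models, x):
    """Prediction spread across independently trained models as an uncertainty proxy."""
    preds = torch.stack([m(x) for m in models])   # (n_models, batch, ...)
    return preds.mean(dim=0), preds.std(dim=0)

def mc_dropout_uncertainty(model, x, n_samples=20):
    """Monte Carlo dropout: keep dropout active at inference and sample repeatedly."""
    model.train()                                  # leaves dropout layers enabled
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)
```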
00:57:16.740 | Another way of doing uncertainty
00:57:18.260 | is to leverage constraints in the environment.
00:57:20.260 | So if you have temporal sequences, right,
00:57:24.700 | you don't want, for example, objects to appear or disappear.
00:57:27.980 | Generally, unreasonable changes in the environment
00:57:31.860 | or inconsistent predictions in your models
00:57:34.260 | are good areas to look.
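
A toy sketch of using temporal consistency as such a signal: tracks that flicker in or out between adjacent frames get flagged for a closer look. The frame and track-ID structure is assumed purely for illustration.

```python
def flag_flickering_tracks(frames):
    """Flag track IDs that appear or disappear between adjacent frames."""
    flagged = set()
    for prev, curr in zip(frames, frames[1:]):
        prev_ids, curr_ids = set(prev["track_ids"]), set(curr["track_ids"])
        flagged |= prev_ids - curr_ids   # vanished from one frame to the next
        flagged |= curr_ids - prev_ids   # appeared out of nowhere
    return flagged
```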
00:57:35.980 | - I'm just wondering, do you guys train
00:57:38.700 | and deploy different models depending on
00:57:41.020 | where the car is driving, like what city,
00:57:44.060 | or do you train and deploy a single model
00:57:47.060 | that adapts to most scenarios?
00:57:49.900 | - Well, ideally, you would have one model
00:57:53.580 | that adapts to most scenarios,
00:57:55.060 | and then you complement as needed.
00:57:58.020 | - Yeah, so first off, thanks for your talk.
00:58:00.220 | I find the simulator work really, really exciting.
00:58:02.900 | And I was wondering if you could either talk more about,
00:58:06.820 | or maybe provide some insights into simulating pedestrians.
00:58:11.540 | 'Cause as a pedestrian myself,
00:58:12.580 | I feel like my behavior is a lot less constrained
00:58:14.580 | than a vehicle.
00:58:15.660 | - Right.
00:58:16.500 | - And I imagine, I mean, there's an advantage
00:58:18.620 | in that you're sensing from a vehicle,
00:58:20.060 | and you kind of know, your sensors are for like
00:58:21.860 | first person from a vehicle, but not from a pedestrian.
00:58:24.780 | - And that's correct.
00:58:25.620 | I mean, so if you want to simulate pedestrians
00:58:28.620 | far away in an environment, right,
00:58:31.500 | and you want to simulate them at very high resolution,
00:58:34.540 | right, and you've collected log data,
00:58:36.100 | you may not have the detailed data on that pedestrian.
00:58:38.980 | Right?
00:58:39.820 | At the same time, the subtle cues for that pedestrian
00:58:43.020 | matter less at that distance as well,
00:58:45.220 | because it's not like you observed them
00:58:47.100 | or reacted to them in the first place.
00:58:49.380 | So there is an interesting question
00:58:50.940 | of what fidelity do you need to simulate things?
00:58:53.700 | Right?
00:58:55.540 | And there is levels of realism in simulation
00:58:59.580 | that at some level need to parallel
00:59:02.460 | what your models are paying attention to.
00:59:04.460 | - Thank you for the talk.
00:59:06.980 | It was very interesting.
00:59:08.620 | Since you, you know, titled and talked about it,
00:59:11.980 | long tail, it makes me wonder,
00:59:14.820 | is the bulk of the problem solved?
00:59:18.860 | Do you think, well, we're gonna have this figured out
00:59:22.420 | and within the next couple of years,
00:59:25.180 | there can be self-driving cars everywhere,
00:59:27.340 | or do you think it's closer to, you know,
00:59:30.460 | actually, it could be decades before
00:59:33.580 | we've really worked out everything necessary?
00:59:37.060 | What are your thoughts about the future?
00:59:39.060 | - It's a bit hard to, that's a good question.
00:59:41.620 | It's a bit hard to give this prognosis.
00:59:43.380 | I think, I mean, I'm not completely sure.
00:59:48.100 | I think one thing I would say is it will take a while
00:59:50.620 | for self-driving cars to roll out at scale, right?
00:59:54.540 | So this is not a technology where you just
00:59:56.860 | turn the crank and it appears everywhere, right?
00:59:59.580 | There's logistics and algorithms
01:00:01.340 | and all this tuning and testing needed
01:00:03.020 | to make sure it's really safe in the various environments.
01:00:06.300 | So it will take some time.
01:00:07.620 | - When you were talking about prediction,
01:00:10.540 | you mentioned looking at a context
01:00:12.420 | and saying if a person or if someone is looking at us,
01:00:15.340 | we can assume that they will behave differently
01:00:17.300 | than if they're not paying attention to what we're doing.
01:00:19.740 | - Potentially.
01:00:20.580 | - Is that something you're actively doing?
01:00:22.340 | Do you take into consideration if pedestrians
01:00:24.860 | or other participants in traffic
01:00:26.980 | are paying attention to your vehicles?
01:00:29.660 | - So I can't comment on our model designs too much,
01:00:33.420 | but I think there are general cues
01:00:35.260 | one needs to pay attention to; they're very significant.
01:00:37.700 | I mean, you know, even when people drive, for example,
01:00:40.980 | there's someone sitting in the vehicle next to you waving,
01:00:43.780 | keep going, right?
01:00:44.620 | And these are natural interactions in the environment.
01:00:47.260 | That, you know, is something you need to think about.
01:00:52.260 | - In one of your, first of all, thank you,
01:00:55.060 | it's a really cool talk.
01:00:56.660 | In one of your last slides, you talked about resolving
01:01:00.140 | certain uncertainties by the means of establishing
01:01:03.540 | a set of beliefs and checking to see
01:01:05.300 | if they were consistent in the--
01:01:07.020 | - That's my own theory, by the way, right?
01:01:08.940 | But I feel that the concept of reasoning
01:01:11.580 | is underexplored in deep learning and what it means, right?
01:01:16.300 | So if you read Tversky and Kahneman on Type I and Type II reasoning,
01:01:20.700 | we're really good at the instinctive mapping type of tasks,
01:01:25.700 | right, so like some low to mid to maybe high-level perception
01:01:32.940 | up to a point, but the reasoning part with neural networks,
01:01:38.020 | right, and generally with models,
01:01:40.300 | that's a bit less explored.
01:01:43.540 | I think it's, long-term, it's fruitful.
01:01:47.020 | That's my personal opinion, right?
01:01:49.220 | - I guess the question I was gonna ask
01:01:52.460 | is if you could elaborate on that concept
01:01:54.940 | in connection with the models you guys are working with,
01:01:57.220 | but I guess that's--
01:01:58.420 | - So I'll give an example from current work, right?
01:02:01.060 | And there's a lot of work on weakly supervised learning.
01:02:04.900 | - Sure.
01:02:05.740 | - And that's kind of been a big topic in 2018,
01:02:07.980 | and there were a lot of really strong papers,
01:02:09.740 | including by Google Brain and Anelia Angelova
01:02:11.740 | and her team and so on,
01:02:13.700 | and essentially, if you used to read the books
01:02:16.940 | about 3D reconstruction and geometry and so on, right,
01:02:21.340 | there's a bunch of rules that can encode
01:02:24.140 | geometric expectations about the world.
01:02:26.180 | So when you have video, and when you have 3D outputs
01:02:29.300 | in your models, there is certain amount of consistency.
01:02:31.820 | One example is ego motion versus depth estimation.
01:02:35.860 | There is a very strong constraint
01:02:37.500 | that if you predict the depth,
01:02:39.340 | and you predict the ego motion correctly,
01:02:41.020 | then you can reproject certain things,
01:02:42.740 | and they will look good, right?
01:02:44.180 | And that's a very strong constraint.
01:02:45.860 | That's a consistency.
01:02:46.700 | You know this about the environment, you expect it.
01:02:48.980 | This can help train your model, right?
01:02:51.260 | And so more of this type of reasoning may be interesting.
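
As a concrete example of the depth versus ego-motion consistency constraint described here, the standard self-supervised formulation in the literature penalizes photometric error after reprojecting one frame into the other using the predicted depth and motion. The differentiable `warp` helper and the camera intrinsics `K` are assumed inputs in this sketch.

```python
import torch

def reprojection_consistency_loss(img_t, img_t1, depth_t, motion_t_to_t1, K, warp):
    """
    img_t, img_t1:    (B, 3, H, W) consecutive camera frames
    depth_t:          (B, 1, H, W) predicted depth for frame t
    motion_t_to_t1:   (B, 4, 4) predicted ego-motion from t to t+1
    K:                camera intrinsics
    warp:             assumed differentiable helper that reprojects img_t1 into
                      frame t's view using depth, ego-motion, and intrinsics
    """
    img_t1_warped = warp(img_t1, depth_t, motion_t_to_t1, K)
    return torch.abs(img_t - img_t1_warped).mean()   # photometric (L1) consistency
```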
01:02:54.220 | - You mentioned expert-designed algorithms,
01:02:56.700 | and I was wondering, from your perspective,
01:02:59.460 | from also from Waymo's perspective,
01:03:02.180 | how important are those, say,
01:03:04.100 | non-machine learning type algorithms,
01:03:06.780 | or non-machine learning type approaches
01:03:08.780 | to tackling the challenges of autonomous driving?
01:03:11.900 | - Could you say one more time how important is,
01:03:13.900 | which aspect of them, Darrell?
01:03:15.180 | - Of expert designed algorithms.
01:03:17.500 | Every now and then, you just, you sprinkle in,
01:03:19.980 | like, here we can try expert designed algorithms,
01:03:22.540 | because we actually understand some parts of the problem,
01:03:24.900 | and I was wondering, like, what is really important
01:03:27.540 | for the challenges in autonomous driving
01:03:30.140 | outside of the field of machine learning?
01:03:33.100 | - I mean, generally, the problem is,
01:03:35.260 | you want to be safe in the environment.
01:03:37.180 | That makes it such that you don't want to make errors
01:03:40.580 | in perception, prediction, and planning, right?
01:03:44.460 | And the state of machine learning is not at the point
01:03:47.580 | where it never makes errors, given the scope
01:03:51.300 | that we're currently addressing.
01:03:53.940 | And so throughout your stack,
01:03:55.740 | with the current state of machine learning,
01:03:57.460 | it needs to be complemented, right?
01:04:00.180 | And so we've carefully done it,
01:04:02.020 | and I think machine learning, as it improves,
01:04:05.140 | I think there'll be less and less need to do it.
01:04:08.180 | It's somewhat effort-intensive,
01:04:11.300 | especially in an evolving system, to do that,
01:04:13.500 | to have a hybrid system.
01:04:15.140 | But right now, I think this is the main thing
01:04:18.020 | that keeps you able to do complex behaviors in some cases,
01:04:23.020 | for which it's very hard to collect data,
01:04:25.620 | and you still need to handle.
01:04:27.100 | Then it's the right thing to do, right?
01:04:29.940 | So the way I view it, I'm a machine learning person,
01:04:32.340 | I like to do better and better.
01:04:34.080 | That said, we're not religious about it.
01:04:37.100 | We just need to solve the problem,
01:04:38.500 | and right now, the right mix is a hybrid system,
01:04:40.540 | is my belief.
01:04:41.380 | - Well, we're really excited to see
01:04:44.340 | what Waymo has in store for us in '19.
01:04:46.820 | So please give Drago a big hand.
01:04:48.580 | (audience applauding)
01:04:51.740 | (audience cheering)