Drago Anguelov (Waymo) - MIT Self-Driving Cars
Chapters
0:00 Introduction
0:47 Background
1:31 Waymo story (2009 to today)
4:31 Long tail of events
8:55 Perception, prediction, and planning
14:54 Machine learning at scale
26:43 Addressing the limits of machine learning
29:38 Large-scale testing
50:51 Scaling to dozens and hundreds of cities
54:35 Q&A
00:00:10.080 |
Aside from having the coolest name in autonomous driving, 00:00:16.400 |
in developing and applying machine learning methods 00:00:20.320 |
and more generally in computer vision and robotics. 00:00:22.440 |
He's now helping Waymo lead the world in autonomous driving. 00:00:26.320 |
10 plus million miles achieved autonomously to date, 00:00:34.040 |
So it's exciting to have Drago here with us to speak. 00:00:54.400 |
"Taming the Long Tail of Autonomous Driving Challenges." 00:01:04.120 |
and worked closely with one of the pioneers in the space, 00:01:09.080 |
I spent eight years at Google doing research on perception, 00:01:19.320 |
I was heading the 3D perception team at Zoox. 00:01:22.520 |
We built another perception system for autonomous driving, 00:01:25.840 |
and I've been leading the research team at Waymo 00:01:30.400 |
So I want to tell you a little bit about Waymo when we start. 00:01:34.880 |
Waymo actually this month has its 10-year anniversary. 00:02:10.120 |
in what fully driverless mobility would be like. 00:02:38.560 |
we launched a fleet of fully self-driving vehicles 00:02:57.320 |
for what a fully driverless experience is like. 00:03:39.920 |
Last year, we launched our first commercial service 00:03:48.920 |
It can come pick them up and help them with errands 00:03:52.760 |
And we've been already learning a lot from these customers 00:03:55.680 |
and we are looking to grow and expand the service 00:04:03.440 |
we have driven 10 million miles on public roads, 00:04:07.440 |
driverlessly, and many more with human drivers 00:04:16.360 |
And we've driven in all kinds of scenarios, cities, 00:04:39.800 |
And I guess all the problems that come with this 00:04:44.680 |
how Waymo has been thinking about these issues. 00:04:57.880 |
And so when you think about self-driving vehicles, 00:05:05.520 |
It needs to be able to handle the entire task of driving. 00:05:10.680 |
and remove the human operator from the vehicle. 00:05:26.160 |
the question is, well, how many of these capabilities 00:05:28.360 |
and how many scenarios do you really need to handle? 00:05:30.920 |
Well, it turns out, well, the world is quite diverse 00:05:35.440 |
and complicated and there are a lot of rare situations 00:05:48.440 |
It's one type of effort to get your self-driving working 00:05:52.000 |
for the common cases, and then it's another effort 00:05:55.000 |
to tame the rest, and they really, really matter. 00:06:03.080 |
For example, this is us driving in the street 00:06:06.280 |
and let's see if you can tell what is unusual in this video. 00:06:14.880 |
So there's a bicyclist and he's carrying a stop sign. 00:06:25.120 |
but it's certainly not a stop sign we need to stop for, 00:06:34.280 |
This is another case where we are happily staying there 00:06:38.520 |
and then the vehicle stops and a big pile of poles 00:07:05.040 |
and this is our vehicle correctly identifying 00:07:08.760 |
between all of these cones and successfully executing it. 00:07:14.920 |
And this is something that happens fairly often 00:07:28.420 |
I think you can understand what happened here. 00:07:33.060 |
And you can notice actually, so we hear the siren. 00:07:42.260 |
we hear it and stop and some guys are much later than us 00:07:49.540 |
And here's another scenario potentially I want to show you. 00:07:56.740 |
Let's see if you can understand what happened. 00:08:10.820 |
we're about to go and someone goes at high speed 00:08:16.820 |
Right, and we successfully stop and prevent issues. 00:08:21.820 |
Right, and so sometimes you have the rules of the road 00:08:25.820 |
and you have your road and people don't always abide by them 00:08:29.340 |
and that's something that you don't want to just 00:08:34.420 |
So hopefully with this I convinced you that the situations 00:08:43.780 |
And I want to take you a little bit on the tour 00:08:45.740 |
of what makes this challenging and then tell you 00:08:51.100 |
And so to do this, we're gonna delve a little bit more 00:08:56.900 |
which is perception, prediction, and planning. 00:08:59.860 |
And so I'll tell you a little bit about those. 00:09:02.420 |
Right, and perception, these are the core AI aspects 00:09:08.380 |
These tasks, there's others, we can talk about others 00:09:10.780 |
as well in a little bit, but let's focus on these first. 00:09:15.380 |
and potentially prior knowledge of the environment 00:09:19.380 |
And that scene representation can contain objects, 00:09:27.060 |
you can learn about object relationships and so on. 00:09:29.620 |
And perception, the space of things you need to handle 00:09:34.700 |
in perception is fairly hard, it's a complex mapping. 00:09:54.500 |
there are a bunch of people dressed as dinosaurs 00:09:56.280 |
in this case, people generally are fairly creative 00:10:04.220 |
people come in different poses, and we have seen it all. 00:10:09.860 |
There's different environments that these objects appear in. 00:10:14.560 |
So there are times of day, seasons, day, night, 00:10:25.520 |
And then there's a different variability axis, 00:10:28.520 |
and this is a little more, slightly more abstract. 00:10:30.840 |
There are different objects that can come in this environment 00:10:50.200 |
Because I just want to show you the space, right? 00:10:55.920 |
in most environments in most reasonable configurations, 00:11:00.520 |
from the sensor inputs to a representation that makes sense, 00:11:13.640 |
you need to be able to anticipate and predict 00:11:16.200 |
what some of the actors in the world are going to do, 00:11:20.200 |
and people are honestly what makes driving quite challenging. 00:11:26.520 |
it's, you know, a vehicle needs to be out there 00:11:28.840 |
and be a full-fledged traffic scene participant. 00:11:35.920 |
so sometimes when you want to make a decision, 00:11:40.120 |
it does not interfere with what anyone else is going to do, 00:11:43.080 |
and it can go from one second to maybe 10 seconds or more, 00:12:13.360 |
And of course there's subtle appearance cues, 00:12:18.000 |
so for example, if a person's watching our vehicle 00:12:21.520 |
we can be fairly confident they're paying attention 00:12:23.840 |
and not going to do anything particularly dangerous. 00:12:28.840 |
If someone's not paying attention or being distracted, 00:12:31.040 |
or there is a person in the car waving at us, 00:12:35.760 |
various gestures, cues, the blinkers on the vehicles, 00:12:47.160 |
even when you predict how other agents behave, 00:13:08.080 |
So here's a case, our Waymo vehicle is driving, 00:13:21.280 |
that as they bike, they will go around the car, 00:13:29.560 |
This is the prediction, our most likely prediction 00:13:46.960 |
typically ends up in control commands to the vehicle, 00:13:54.240 |
that ultimately has several properties to it, 00:14:19.480 |
So you need to trade all of these in a reasonable way. 00:14:29.080 |
This is a complex, I think, school gathering. 00:14:37.000 |
a bunch of pedestrians, and we need to make progress, 00:14:50.240 |
to the dense urban environments, being able to do this. 00:14:57.800 |
I think when you have these complicated models and systems, 00:15:01.100 |
machine learning is a really great tool to model 00:15:05.360 |
complex actions, complex mapping functions, features. 00:15:11.520 |
Right, and so we're going to learn our system. 00:15:17.080 |
So obviously, this is now a machine learning revolution. 00:15:37.640 |
Right, and I'll tell you a little more on this how. 00:15:42.080 |
So I have this allegory about machine learning 00:15:52.960 |
Early machine learning systems also can be a bit classical. 00:15:57.640 |
You have your tools and you need to build this product. 00:16:03.520 |
And it can fairly quickly get something reasonable. 00:16:06.420 |
But then it's harder to change, it's harder to evolve. 00:16:08.840 |
If you learn new things, now you need to go back 00:16:14.920 |
and it starts becoming, the more complicated the product 00:16:19.700 |
And machine learning, modern machine learning, 00:16:24.200 |
Right, so machine learning, you build the factory, 00:16:28.600 |
which is the machine learning infrastructure. 00:16:34.660 |
and get nice models that solve your problems, right? 00:16:37.480 |
And so, kind of infrastructure is at the heart 00:16:49.180 |
Just keep the right data, keep feeding the machine, 00:16:53.080 |
So what is an ML factory for self-driving models? 00:17:03.720 |
We have a software release, we put it on the vehicle, 00:17:07.280 |
We drive, we collect data, we collect it and we store it. 00:17:19.480 |
that we find interesting and that's a knowledge 00:17:31.200 |
And then what we're going to do is we're gonna train 00:17:36.500 |
After we have the models, we will do testing and validation, 00:17:39.300 |
validate that they're good to put on our vehicles. 00:17:42.140 |
And once they're good to put on our vehicles, 00:17:45.260 |
And then the process starts going again and again. 00:17:48.780 |
So you collect more data, now you select new data 00:17:52.960 |
You add it to your data set, you keep training the model. 00:17:56.920 |
And iterate, iterate, iterate, it's a nice scalable setup. 00:18:07.360 |
And at Waymo, we have the beautiful advantage 00:18:15.180 |
And I'll tell you a bit about its ingredients 00:18:20.340 |
So ingredient one is compute and software infrastructure 00:18:25.020 |
and we are able to, first of all, leverage TensorFlow, 00:18:31.300 |
We have access to the experts that wrote TensorFlow 00:18:35.660 |
We have data centers to run large-scale parallel compute 00:18:40.700 |
We have specialized hardware for training models, 00:18:42.880 |
which make it cheaper and more affordable and faster 00:18:51.960 |
We have the scale to collect and store hundreds 00:18:55.540 |
of thousands of miles, up to millions of miles. 00:18:58.420 |
And just collecting and storing 10 million miles 00:19:07.180 |
because there is a decreasing utility to the data. 00:19:10.900 |
So most of the data comes from common scenarios 00:19:17.020 |
So it's really important how you select the data. 00:19:20.260 |
And so this is the important part of this pipeline. 00:19:22.180 |
So while you're running a release on the vehicle, 00:19:25.740 |
we have a bunch of understanding about the world, 00:19:41.740 |
we need to be very careful how to select data. 00:19:45.100 |
that are interesting in some way and complement, 00:19:49.780 |
that we potentially may not be doing so well on. 00:20:02.020 |
look for parts of your system which are uncertain 00:20:04.420 |
or inconsistent over time and go and label those cases. 00:20:08.640 |
Last but not least, we also produce auto labels. 00:20:13.780 |
Well, when you collect data, you also see the future 00:20:19.500 |
And so because of that, now knowing the past and the future, 00:20:30.140 |
And so you need to do all of this as part of the system. 00:20:34.580 |
Ingredient number three, high quality models. 00:20:37.340 |
We're part of larger Alphabet and Google and DeepMind 00:20:49.020 |
I happened to have the chance to be there at the time. 00:20:51.660 |
It was 2013 when I got on to do deep learning 00:20:57.340 |
and we were there working on it earlier than most people. 00:21:06.500 |
we invented neural net architectures like Inception, 00:21:14.500 |
and in object detection, a fast object detector called SSD. 00:21:21.860 |
Google and DeepMind are leaders in perception 00:21:32.260 |
The object detection of course goes without saying. 00:21:34.420 |
And so we collaborate with Google and DeepMind 00:21:39.120 |
And so this is my factory for self-driving models 00:22:01.500 |
and adjusting architectures of neural networks. 00:22:08.260 |
So there is a team at Google working on AutoML, 00:22:14.200 |
And usually networks themselves have complex architecture. 00:22:21.020 |
And sometimes we have very high latency constraints 00:22:23.600 |
in the models, we have some compute constraints. 00:22:27.860 |
It often takes people months to find the right architecture 00:22:30.980 |
that's most performant, low latency and so on. 00:22:33.440 |
And so there's a way to offload this work to the machines. 00:22:43.360 |
that's both low latency and high performance. 00:22:50.060 |
and as we keep collecting data and finding new cities 00:22:52.860 |
or new examples, the architectures may change 00:22:55.740 |
and we want to easily find that and keep evolving that 00:23:01.940 |
and they had a strong work where they invented, 00:23:05.340 |
well, they developed a system that searched the space 00:23:09.060 |
of architectures and found a set of components 00:23:24.580 |
And they discovered in a small vision data set, 00:23:36.040 |
So the first thing we did is we took some problems 00:23:44.540 |
So you have a map representation and some lidar points 00:23:48.940 |
and you essentially segment the lidar points. 00:24:01.620 |
we explored several hundred NAS cell combinations 00:24:17.420 |
One of them is we can find models with similar quality 00:24:24.220 |
And then there are models of a bit higher quality 00:24:31.880 |
And similar results were obtained for other problems, 00:24:41.100 |
Of course, you can also do end-to-end architecture search. 00:24:44.300 |
So there's no reason why what was found on CIFAR-10 00:24:47.740 |
is best suited for our more specialized problems. 00:24:52.180 |
And so we went about this more from the ground up. 00:24:55.660 |
So let's do a deeper search over a much larger space, 00:25:02.820 |
And so the way to do this is because our networks 00:25:36.700 |
now we train the large networks on those models 00:25:43.260 |
And so this way we can explore a much larger space 00:25:48.660 |
So on the left, this is 4,000 different models 00:26:01.700 |
than the transfer, which already leveraged their insight. 00:26:04.860 |
So then we took the learnings and the best models 00:26:07.420 |
from this search and did the second round of search, 00:26:10.220 |
which was in yellow, which allowed us to beat it. 00:26:21.980 |
And that one was able to significantly improve 00:26:24.940 |
on the red dot, which also significantly improves 00:26:48.220 |
And for some situations, we have fairly few examples as well. 00:26:52.220 |
And so there are cases where the models are uncertain 00:27:02.060 |
well, our networks just don't handle some case 00:27:05.220 |
and so we have designed our system to be robust, 00:27:12.580 |
So one part is, of course, you want redundant 00:27:19.060 |
on our vehicles, both in camera, LiDAR, and radar. 00:27:24.140 |
First of all, an object is seen in all of them. 00:27:26.740 |
Second of all, they all have different strengths 00:27:38.040 |
Also, we've designed our system to be a hybrid system. 00:28:04.180 |
with very few examples with the current state of the art. 00:28:06.500 |
And so the state of the art keeps improving, of course. 00:28:08.620 |
So there is zero-shot and one-shot learning. 00:28:15.940 |
we can also leverage expert domain knowledge. 00:28:19.860 |
So humans can help develop the right input representations. 00:28:28.260 |
to fewer parameters that already describe the task. 00:28:30.860 |
And then with that bias, it is easier to learn models 00:28:46.140 |
And so an example of what that looks like for perception is, 00:28:54.800 |
where the machine learning system may be not confident, 00:29:00.520 |
and we make sure that we drive relative to those safely. 00:29:14.480 |
and our models become more powerful, of course, improve, 00:29:23.360 |
And the set of cases that you can handle with it increases. 00:29:46.480 |
and also in getting the vehicles on the road. 00:29:48.720 |
So how do you normally develop a self-driving algorithm? 00:29:58.840 |
and you would put it on the vehicle and drive a bunch 00:30:13.980 |
And so if you do this, you're gonna wait a long time. 00:30:17.420 |
Furthermore, you don't just want to take your code 00:30:24.580 |
like you want very strongly tested code on public streets. 00:30:47.300 |
So you can select and deliberately stage safely 00:30:53.620 |
Now again, you cannot do this for all situations. 00:31:09.140 |
So we simulate the equivalent of 25,000 cars, 00:31:39.540 |
And furthermore, it goes all the way bottom up. 00:31:45.620 |
for example, slightly different segmentation or detection, 00:32:16.500 |
Of course, you can do it manually, you can create them. 00:32:20.420 |
Well, you want to leverage your driving data. 00:32:29.540 |
So you can pick interesting situations from your logs. 00:32:36.540 |
and you create variations of these situations 00:32:48.260 |
This is what happened in the real world the first time. 00:32:53.180 |
we mostly stayed in the middle lane and stopped. 00:33:48.980 |
but it's no longer safe because we changed what we did. 00:34:06.300 |
the realistic driver and pedestrian behavior. 00:34:08.540 |
So, you know, you could think of a simple model. 00:34:12.740 |
Well, what is a good proxy or what's a good approximation 00:34:19.700 |
So you just say, well, there is some normal way 00:34:24.220 |
You know, I have a reaction time and braking profile 00:34:28.500 |
so if an agent sees someone in front of them, 00:34:32.380 |
Right, so hopefully I convinced you that behavior 00:34:35.580 |
can be fairly complicated and this will not always produce 00:34:42.020 |
interactive cases such as merges, lane changes, 00:34:49.980 |
You could learn an agent from real demonstrations. 00:34:55.500 |
Well, you went and collected all this data in the world, 00:34:57.620 |
you have a bunch of information of how vehicles, 00:35:00.940 |
pedestrians behave, you can learn a model and use that. 00:35:17.340 |
And it develops a policy, it develops a reaction, 00:35:21.980 |
it's a driver agent and applies acceleration and steering, 00:35:25.460 |
then gets new sensor information, new map information, 00:35:32.020 |
And if it's our own vehicle, then you also have a router 00:35:37.340 |
well, the passenger wants you to go over there, 00:35:43.860 |
And this is an agent, it could be in simulation, 00:35:46.660 |
it could be in the real world, roughly this is the picture. 00:35:52.340 |
To its best approximation, if you learn a good policy 00:35:59.500 |
And so I'm gonna tell you a little bit about work 00:36:04.060 |
So we put a paper on arXiv about a month ago, I believe. 00:36:12.500 |
and we tried to see how well we can imitate it 00:36:26.660 |
Well, we have a good perception system at Waymo, 00:36:28.780 |
so why don't we use its products for that agent? 00:36:33.020 |
Also can simplify the input representation a bit, 00:36:41.420 |
so no need to worry about acceleration and torques, 00:36:45.980 |
Now, if you want to see in a little more detail 00:37:03.540 |
we can generate a little bit of rotation to the image 00:37:07.580 |
just so we don't over-bias the orientation a specific way. 00:37:13.860 |
so we roughly see about 60 meters in front of us 00:37:23.540 |
which is the map, like which lanes you're allowed 00:37:32.180 |
and how the traffic lights permit it or do not permit it. 00:37:39.140 |
the objects, result of your perception system, 00:37:42.380 |
you render your current vehicle where it believes it is, 00:37:50.460 |
So you give an image of where the agent's been 00:37:58.420 |
you render the intent, so the intent is where you want to go. 00:38:01.900 |
So it's conditioned on this intent and this input, 00:38:04.780 |
you want to predict the future waypoints for this vehicle. 00:38:08.440 |
And you can phrase it as a supervised learning problem. 00:38:11.240 |
Right, just learn to, learn a policy with this network 00:38:15.900 |
that approximates what you've seen in the world, 00:38:19.260 |
Of course, learning agents, there is a well-known problem, 00:38:27.780 |
by Stéphane Ross, who is actually at Waymo now, 00:38:35.500 |
so even though in each step, if you do a relatively 00:38:38.140 |
good estimate, if you string 10 steps together, 00:38:40.140 |
you can end up very different from where agents 00:38:44.160 |
Right, and there are techniques to handle this. 00:38:47.900 |
One thing we did was synthesize perturbations. 00:38:51.080 |
So you have your trajectory, and we synthetically 00:38:54.540 |
deform the trajectory and force the vehicle to learn 00:39:02.540 |
Now, if you just have direct imitation based on supervision, 00:39:06.900 |
we are trying to pass a vehicle in the street, 00:39:27.340 |
which essentially takes the past and creates memory 00:39:37.540 |
So it predicts the trajectory piecemeal in the future. 00:39:46.620 |
So we augment the network, and now the network 00:39:58.660 |
You say, hey, if you drive or generate motions 00:40:01.100 |
that take you outside the road, that's probably not good. 00:40:06.020 |
where your perception network, which takes the other object 00:40:10.300 |
and predicts their motion, so predict here our motion, 00:40:13.640 |
where the road is, and the other agent's motion 00:40:19.060 |
there's no collisions and that we stay on the road. 00:40:26.860 |
So it's not just limited to what it's explicitly seeing, 00:40:38.660 |
And you can see that we're predicting the future 00:40:46.340 |
Actually handles a lot of scenarios very well. 00:40:49.340 |
If you're interested, I welcome you to go read the paper. 00:40:52.140 |
It handles most of the simple situations fine. 00:41:11.020 |
at the stop sign happily, which is the red line over there, 00:41:16.060 |
And what we did beyond this is, we took the system, 00:41:19.280 |
as learned on imitation data, and we actually drove 00:41:23.900 |
So we took it to Castle, the Air Force Base staging grounds, 00:41:27.260 |
and this is it driving a road it's never seen before 00:41:33.500 |
We could use it also in agent simulation world, 00:41:36.120 |
and we could drive a car with it, but it has some issues. 00:41:40.900 |
So here it is driving, and then it was driving too fast, 00:41:45.900 |
so because our range is limited, it didn't know 00:41:49.300 |
it had to make a turn, and it overran the turn. 00:41:54.500 |
So, you know, one area of improvement, more range. 00:42:00.780 |
So yellow is, by the way, what we did in the real world, 00:42:05.380 |
and green is what we do in the simulation, in that example. 00:42:08.660 |
And here, we're trying to execute a complex maneuver, 00:42:13.180 |
a U-turn, we're sitting there, and we're gonna try to do it, 00:42:25.620 |
When they get really complex, this network also 00:42:32.700 |
Well, long tail came again in testing, right? 00:42:48.000 |
You want to test in the scenarios where someone 00:42:49.900 |
is obnoxious and adversarial and does something 00:43:04.620 |
It could be aggressive and conservative, right? 00:43:24.860 |
it could, in theory, learn any policy, right? 00:43:38.780 |
This is images that are 80 by 80 with multiple channels. 00:43:42.980 |
The model can have tens of millions of parameters. 00:43:45.500 |
Now, if you have an example, if you have a case 00:43:56.020 |
And so it's really good when you have a lot of examples. 00:44:11.460 |
This is, there is a lot of room to keep evolving this. 00:44:16.100 |
And then this area will keep expanding, right? 00:44:20.220 |
There is a lot of interesting questions how to do that, 00:44:24.940 |
Hopefully I get to share with you another time. 00:44:27.020 |
Something else you can do, if you remember from my slide 00:44:29.340 |
about the hybrid system, when you go to the long tail, 00:44:35.580 |
which is a simpler, biased, expert-designed input distribution 00:44:39.380 |
that is much easier to learn with few examples. 00:44:41.660 |
You can also, of course, use expert-designed models. 00:44:48.460 |
something reasonable by inputting this human knowledge. 00:44:55.280 |
You could just tune to various aspects of this distribution. 00:44:58.640 |
You can have little models for all the aspects 00:45:10.840 |
So we take inspiration from motion control theory, 00:45:14.120 |
and we want to plan a good trajectory for the vehicle, 00:45:18.360 |
the agent vehicle, and that satisfies a bunch 00:45:23.880 |
And so one insight to this is that we already know 00:45:29.760 |
what the agent did in the environment last time. 00:45:32.840 |
So you have fairly strong idea about the intent. 00:45:35.680 |
And that helps you when you specify the preferences. 00:45:38.440 |
'Cause you can say, okay, well, give me a trajectory 00:45:51.360 |
you can add these attractor potentials saying, 00:45:53.720 |
well, try to go where you used to be before, for example. 00:46:01.600 |
And of course, you can have repeller potential. 00:46:05.000 |
Don't hit things, don't run into vehicles, right? 00:46:26.700 |
Typically, we're talking a few dozen parameters or less. 00:46:41.960 |
to the trajectories you've observed in the real world. 00:46:50.840 |
And then you want to generate reasonable trajectories, 00:46:53.280 |
continuous, feasible, that satisfy this, right? 00:47:01.100 |
And so here's some agents I want to show you. 00:47:07.520 |
Two vehicles, but you can see on the left is, 00:47:29.380 |
And they induce very different reactions in our vehicle. 00:47:40.700 |
In the other case, when you have a conservative driver, 00:47:43.040 |
we are in front of them and they're not bugging us 00:47:47.200 |
We can switch into the right lane where we want to go. 00:47:50.400 |
All right, so this is agents that can test your system well. 00:47:52.680 |
Now you have different scenarios in this case, 00:48:15.000 |
It slows down for a known slow vehicle in front, 00:48:15.000 |
And you can generate multiple futures with this agent. 00:48:30.700 |
Right, and on the left was the more conservative person. 00:48:36.060 |
The aggressive guy found a gap between the two vehicles 00:48:53.780 |
So I guess what's my takeaway from this story 00:48:59.420 |
You need a menagerie of agents at the moment, right? 00:49:14.240 |
The task of modeling agent behavior is complex 00:49:23.640 |
and you can hand design trajectories for agents 00:49:26.600 |
to, for this reaction, do this, for that reaction, do that. 00:49:31.440 |
that mostly if there's someone in front of an agent 00:49:36.100 |
Trajectory optimization, which I just showed. 00:49:58.140 |
And so one other takeaway I wanted to tell you 00:50:00.980 |
is smart agents are critical for autonomy at scale. 00:50:04.620 |
This is something I truly believe working in the space. 00:50:12.700 |
that there's still a lot of interesting progress to be made. 00:50:18.480 |
When you have accurate models of human behavior, 00:50:23.900 |
first, you will make better decisions when you drive yourself. 00:50:27.140 |
You'll be able to anticipate what others will do better 00:50:30.980 |
Second, you can develop a robust simulation environment 00:50:41.660 |
It's an agent we have more control over than the others 00:50:45.820 |
And so this is very exciting and interesting. 00:50:57.560 |
that is tackling a complex AI challenge like self-driving, 00:51:01.160 |
what are the good properties of the system to have 00:51:07.920 |
We want to grow and handle and bring our service 00:51:10.680 |
to more and more environments, more and more cities. 00:51:13.120 |
How do you scale to dozens or hundreds of cities? 00:51:18.520 |
each new environment can bring new challenges. 00:51:20.860 |
And they can be complex intersections in cities like Paris. 00:51:25.100 |
There's our Lombard Street in San Francisco, I'm from there. 00:51:30.320 |
There is all kinds of, the long tail keeps coming. 00:51:36.000 |
in Pittsburgh people drive the famous Pittsburgh left. 00:51:44.600 |
all of this needs to be accounted for as you expand. 00:51:47.640 |
And this makes your system potentially more complex 00:51:49.800 |
or harder to tune to all environments. 00:51:55.960 |
So how do you, what should a scalable process do? 00:52:04.920 |
I mean, this very much parallels the factory analogy. 00:52:09.700 |
You take your vehicles, we put a bunch of Waymo cars 00:52:12.680 |
and we drive a long time in that environment with drivers. 00:52:15.680 |
Maybe 30 days, maybe more, at least that long. 00:52:22.040 |
And then your system should be able to improve a lot 00:52:34.560 |
to train the system too much in the real world 00:52:38.640 |
after you've collected data about the environment. 00:52:42.280 |
So it needs to be trainable on collected data. 00:52:44.580 |
It's very important for a system to be able to quantify 00:52:51.640 |
whether it's incorrect or not confident, right? 00:53:00.300 |
people should think of when they design systems. 00:53:06.860 |
You can ask questions to raters, that's fairly legit. 00:53:10.220 |
Typically active learning is a bit like this, right? 00:53:21.040 |
And even better, the system could potentially 00:53:24.620 |
directly update itself, and this is an interesting question. 00:53:27.980 |
How do systems update themselves in light of new knowledge? 00:53:30.600 |
And we have a system that clearly does this, right? 00:53:45.580 |
It is one answer, there is possibly others, right? 00:53:49.140 |
But one way is you can check and enforce consistency 00:53:51.700 |
of your beliefs, and you can look for explanations 00:54:05.060 |
It can improve itself on just collected data. 00:54:08.320 |
And I think it's interesting to think of systems 00:54:10.920 |
where you can do reasoning and the representations 00:54:15.340 |
And last and not least, we need scalable training 00:54:22.880 |
This is part of the fact that I was talking about. 00:54:24.940 |
I'm very lucky at Waymo to have wonderful infrastructure. 00:54:42.060 |
Thank you so much for the talk, I really appreciate it. 00:54:43.620 |
So if you were to train off of image and LiDAR data, 00:54:49.020 |
would you weight the synthetic data differently 00:54:52.900 |
than real-world data when training your models? 00:54:56.140 |
- So there's actually a lot of interesting research 00:55:04.740 |
that make simulator data look like real data. 00:55:10.320 |
So you're essentially, you're trying to build consistency, 00:55:14.520 |
or at least you're training on simulator scenarios, 00:55:16.980 |
but if you learn a mapping from simulator scenes 00:55:19.780 |
to real scenes, right, you could potentially train 00:55:27.220 |
There's many ways to do this, ultimately, right? 00:55:48.660 |
like neural network when you're not quite sure 00:55:51.460 |
what they would do, and rules where you're sure 00:55:56.620 |
- I mean, through lots and lots of testing and analysis, 00:56:15.340 |
And it's a natural process of evolution, right? 00:56:23.100 |
as the capabilities in the data sets grow, right? 00:56:26.340 |
- So you stressed at the end of both the first half 00:56:34.300 |
and the predictions that your models are making. 00:56:42.460 |
or are you using some probabilistic graphical models 00:56:46.540 |
- I mean, so a lot of the models are neural nets. 00:56:59.020 |
I think, first of all, there are techniques in neural nets 00:57:07.300 |
for certain products, or use ensembles of networks 00:57:18.260 |
Another is to leverage constraints in the environment. 00:57:24.700 |
You don't want, for example, objects to appear or disappear, 00:57:27.980 |
or generally unreasonable changes in the environment, 00:58:00.220 |
I find the simulator work really, really exciting. 00:58:02.900 |
And I was wondering if you could either talk more about, 00:58:06.820 |
or maybe provide some insights into simulating pedestrians. 00:58:12.580 |
I feel like my behavior is a lot less constrained 00:58:16.500 |
- And I imagine, I mean, there's an advantage 00:58:20.060 |
and you kind of know, your sensors are for like 00:58:21.860 |
first person from a vehicle, but not from a pedestrian. 00:58:25.620 |
I mean, so if you want to simulate pedestrians 00:58:31.500 |
and you want to simulate them at very high resolution, 00:58:36.100 |
you may not have the detailed data on that pedestrian. 00:58:39.820 |
At the same time, the subtle cues for that pedestrian 00:58:50.940 |
of what fidelity do you need to simulate things? 00:59:08.620 |
Since you, you know, titled and talked about it, 00:59:18.860 |
Do you think, well, we're gonna have this figured out 00:59:33.580 |
we've really worked out everything necessary? 00:59:39.060 |
- It's a bit hard to, that's a good question. 00:59:48.100 |
I think one thing I would say is it will take a while 00:59:50.620 |
for self-driving cars to roll out at scale, right? 00:59:56.860 |
it turn the crank and appears everywhere, right? 01:00:03.020 |
to make sure it's really safe in the various environments. 01:00:12.420 |
and saying if a person or if someone is looking at us, 01:00:15.340 |
we can assume that they will behave differently 01:00:17.300 |
than if they're not paying attention to what we're doing. 01:00:22.340 |
Do you take into consideration if pedestrians 01:00:29.660 |
- So I can't comment on our model designs too much, 01:00:35.260 |
one needs to pay attention to, they're very significant. 01:00:37.700 |
I mean, you know, even when people drive, for example, 01:00:40.980 |
there's someone sitting in the vehicle next to you waving, 01:00:44.620 |
And these are natural interactions in the environment. 01:00:47.260 |
That, you know, is something you need to think about. 01:00:56.660 |
In one of your last slides, you talked about resolving 01:01:00.140 |
certain uncertainties by the means of establishing 01:01:11.580 |
is underexplored in deep learning and what it means, right? 01:01:16.300 |
So if you read Tversky-Kahneman, Type I, Type II reasoning, 01:01:20.700 |
we're really good at the instinctive mapping type of tasks, 01:01:25.700 |
right, so like some low to mid to maybe high-level perception 01:01:32.940 |
up to a point, but the reasoning part with neural networks, 01:01:54.940 |
in connection with the models you guys are working with, 01:01:58.420 |
- So I'll give an example from current work, right? 01:02:01.060 |
And there's a lot of work on weakly supervised learning. 01:02:05.740 |
- And that's kind of been a big topic in 2018, 01:02:07.980 |
and there were a lot of really strong papers, 01:02:13.700 |
and essentially, if you used to read the books 01:02:16.940 |
about 3D reconstruction and geometry and so on, right, 01:02:26.180 |
So when you have video, and when you have 3D outputs 01:02:29.300 |
in your models, there is certain amount of consistency. 01:02:31.820 |
One example is ego motion versus depth estimation. 01:02:46.700 |
You know this about the environment, you expect it. 01:02:51.260 |
And so more of this type of reasoning may be interesting. 01:03:08.780 |
to tackling the challenges of autonomous driving? 01:03:11.900 |
- Could you say one more time how important is, 01:03:17.500 |
Every now and then, you just, you sprinkle in, 01:03:19.980 |
like, here we can try expert designed algorithms, 01:03:22.540 |
because we actually understand some parts of the problem, 01:03:24.900 |
and I was wondering, like, what is really important 01:03:33.100 |
- I mean, generally, you want, the problem is, 01:03:37.180 |
That makes it such that you don't want to make errors 01:03:40.580 |
in perception, prediction, and planning, right? 01:03:44.460 |
And the state of machine learning is not at the point 01:03:47.580 |
where it never makes errors, provided the scope 01:04:02.020 |
and I think machine learning, as it improves, 01:04:05.140 |
I think there'll be less and less need to do it. 01:04:11.300 |
especially in an evolving system, to do that, 01:04:15.140 |
But right now, I think this is the main thing 01:04:18.020 |
that keeps you able to do complex behaviors in some cases, 01:04:29.940 |
So the way I view it, I'm a machine learning person, 01:04:34.080 |
That said, we're not religious, it should not be. 01:04:38.500 |
and right now, the right mix is a hybrid system,