
Sacha Arnoud, Director of Engineering, Waymo - MIT Self-Driving Cars


Chapters

0:00
2:50 Self-driving cars?
13:27 The rise of Deep Learning
30:57 Deep learning techniques
32:31 Reflections
37:05 Semantic segmentation example: snow
38:58 End-to-end architecture predicting from raw data
40:15 Computer vision based single shot with cones
40:49 Embeddings for finer classification
43:47 Pedestrians
49:03 Dataset size and quality circa 2013
52:52 TensorFlow, a new programming paradigm
58:12 Large scale simulation


00:00:00.000 | Today we have the Director of Engineering, Head of Perception at Waymo,
00:00:05.000 | a company that's recently driven over 4 million miles autonomously,
00:00:11.000 | and in so doing inspired the world in what artificial intelligence and good engineering can do.
00:00:18.000 | So please give a warm welcome to Sacha Arnoud.
00:00:28.000 | Thanks a lot, Lex, for the introduction.
00:00:31.000 | Wow, it's a pretty packed house.
00:00:34.000 | Thanks a lot. I'm really excited.
00:00:36.000 | Thanks a lot for giving me the opportunity to be able to come and share my passion with self-driving cars
00:00:44.000 | and be able to share with you all the great work we've been doing at Waymo over the last 10 years
00:00:50.000 | and give you more details on the recent milestones we've reached.
00:00:55.000 | So as you'll see, we'll cover a lot of different topics,
00:01:00.000 | some more technical, some more about context.
00:01:05.000 | But over the content, I have three main objectives that I'd like to convey today.
00:01:13.000 | So keep that in mind as we go through the presentation.
00:01:17.000 | My first one is to give you some background around the self-driving space
00:01:25.000 | and what's happening there and what it takes to build self-driving cars,
00:01:29.000 | but also give you some behind-the-scenes views and tidbits on the history of machine learning, deep learning,
00:01:38.000 | and how it all came together within the big alphabet family from Google to Waymo.
00:01:45.000 | Another piece, obviously, another objective I have is to give you some technical meat
00:01:52.000 | around the techniques that are working today on our self-driving cars.
00:01:56.000 | So I think during the class you'll hear a lot, you've heard a lot about different deep learning techniques,
00:02:03.000 | models, architectures, algorithms, and I'll try to put that into a coherent whole
00:02:09.000 | so that you can see how those pieces fit together to build the system we have today.
00:02:14.000 | And last but not least, I think as Lex mentioned, it takes a lot more, actually, than algorithms
00:02:21.000 | to build a sophisticated system such as our self-driving cars.
00:02:26.000 | And fundamentally, it takes a full industrial project to make that happen.
00:02:32.000 | And I'll try to give you some color, which hopefully is a little different from what you've heard during the week.
00:02:38.000 | I'll try to give you some color on what it takes to actually carry out such an industrial project in real life
00:02:45.000 | and essentially productionize machine learning.
00:02:51.000 | So we hear a lot of talk. We hear a lot about self-driving cars.
00:02:56.000 | It's a very hot topic, and for very good reasons.
00:03:00.000 | I can tell you for sure that 2017 has been a great year for Waymo.
00:03:05.000 | Actually, only a year ago, in January 2017, Waymo became its own company.
00:03:11.000 | So that was a major milestone and a testimony to the robustness of the solution
00:03:16.000 | so that we could move to a productization phase.
00:03:20.000 | So what you see on the picture here is our latest generation self-driving vehicle.
00:03:28.000 | So it is based on the Chrysler Pacifica.
00:03:31.000 | You can already see a bunch of sensors. I'll come back to that and give you more insights on what they do and how they operate.
00:03:39.000 | But that's the latest and greatest.
00:03:42.000 | So self-driving indeed draws a lot of attention, and for very good reasons.
00:03:49.000 | I personally believe, and I think you will agree with me, that self-driving really has the potential to deeply change
00:03:57.000 | the way we think about mobility and the way we move people and things around.
00:04:03.000 | So only to cover a few aspects here, obviously, and I don't want to go into too many details,
00:04:09.000 | but safety is one of the main motivations.
00:04:15.000 | 94% of US crashes today involve human errors.
00:04:19.000 | A lot of those errors are around distraction and things that could be avoided.
00:04:23.000 | So safety is a big piece of it.
00:04:27.000 | Accessibility and access to mobility is also a big motivation of ours.
00:04:35.000 | So obviously, the self-driving technology has the potential to make it very available and cheaper for more people to be able to move around.
00:04:44.000 | And last but not least is efficiency, collective efficiency.
00:04:50.000 | Not only do we spend a lot of time in our cars, in long commute hours.
00:04:56.000 | I personally spend a lot of time in long commute hours.
00:05:00.000 | And that time we spend in traffic probably could be better spent doing something else
00:05:04.000 | than having to drive the car in complicated situations.
00:05:08.000 | Beyond traffic, obviously, self-driving technology has the potential to deeply change the way we think about traffic,
00:05:17.000 | parking spots, urban environments, city design.
00:05:23.000 | So that's why it's a very exciting topic.
00:05:27.000 | So that's why we made it our mission at Waymo,
00:05:30.000 | is fundamentally to make it safe and easy to move people and things around.
00:05:36.000 | So that's a nice mission, and we've been on it for a very long time.
00:05:43.000 | So actually, the whole adventure started close to 10 years ago in 2009.
00:05:50.000 | And at the time, that started under the umbrella of a Google project that you may have heard of called Chauffeur.
00:06:00.000 | And back in those days, so remember, we were before the deep learning days, at least in the industry.
00:06:06.000 | And so really back in those days, the first objective of the project was to try and assemble a first prototype vehicle,
00:06:14.000 | take off-the-shelf sensors, assemble them together, and try to go and decide if self-driving is even a possibility.
00:06:23.000 | It's one thing to have some prototype somewhere, but is that even a thing that is worth pursuing?
00:06:29.000 | Which is a very common way for Google to tackle problems.
00:06:33.000 | So the genesis for that work was to come up with a pretty aggressive objective.
00:06:40.000 | So the team, the first milestone for the team, was to essentially assemble 10 100-mile loops
00:06:48.000 | in Northern California, around Mountain View, and try and figure out, so for a total of 1,000 miles,
00:06:55.000 | and try and see if they could build a first system that would be able to go and drive those loops autonomously.
00:07:04.000 | So they were not afraid. So the team was not afraid. So those loops went through some very aggressive patterns.
00:07:13.000 | So you see that some of those loops go through the Santa Cruz Mountains, which is an area in California that,
00:07:20.000 | as you see, I'll show you a video, that has very small roads and two-way traffic and cliffs,
00:07:26.000 | with negative obstacles and complicated patterns like that.
00:07:30.000 | Some of those paths were going on highways, and one of the busiest highways.
00:07:39.000 | Some of those routes were going around Lake Tahoe, which is in the Sierras in California,
00:07:46.000 | where you can encounter different kinds of weather, and again, different kinds of road conditions.
00:07:51.000 | Those routes were going around bridges, and the Bay Area has quite a few bridges to go through.
00:07:58.000 | Some of them were even going through a dense urban area.
00:08:02.000 | So you can see San Francisco being driven. You can see Monterey, some of the Monterey centers being driven.
00:08:10.000 | And as you'll see on the video, those truly bring dense urban area challenges.
00:08:19.000 | So since I promised it, so here you're going to see some pictures of the driving. It's kind of working.
00:08:27.000 | So here, with better quality, so here you see the roads I was talking about on the Santa Cruz Mountains,
00:08:34.000 | driving in the night, animals crossing the street, freeway driving, going through pay tolls.
00:08:40.000 | That's the Monterey area that is fairly dense. There's an aquarium there, a pretty popular one.
00:08:46.000 | That's the famous Lombard Street in San Francisco that you may have heard of,
00:08:51.000 | which in San Francisco always brings its unique set of challenges between fog and slopes,
00:08:56.000 | and in that case, even sharp turns.
00:09:00.000 | So that was all the way back in 2010. So those 10 loops were successfully completed 100% autonomously back in 2010.
00:09:12.000 | So that's more than eight years ago.
00:09:17.000 | So on the heels of that success, the team decided, and Google decided, that self-driving was worth pursuing,
00:09:26.000 | and moved forward with the development of the technology and testing.
00:09:33.000 | So we've been at it for all those years, and have been working very hard on it.
00:09:38.000 | Historically, Waymo and I think all the other companies out there have been relying on what we call safety drivers
00:09:46.000 | to still sit behind the wheels, even if the car is driving autonomously.
00:09:51.000 | We still have a safety driver who is able to take over at any time and make sure that we have very safe operations.
00:09:58.000 | And we've been accumulating miles and knowledge and developing the system, many iterations of the system,
00:10:03.000 | along all those years.
00:10:07.000 | We reached a major milestone, as Lex mentioned, back in November,
00:10:13.000 | where for the first time we reached a level of confidence and maturity in a system
00:10:18.000 | that we felt confident and proved to ourselves that it was safe to remove the safety driver.
00:10:25.000 | As you can imagine, that's a major milestone, because it takes a very high level of confidence
00:10:32.000 | to not have that backup solution of a safety driver to take over were something to arise.
00:10:38.000 | So here I'm going to show you a small video, a quick capture of that event.
00:10:44.000 | So the video is from one of the first times we did that.
00:10:48.000 | Since then we've been continuously operating driverless cars, self-driving cars, in the Phoenix area in Arizona
00:10:57.000 | to expand our testing.
00:11:00.000 | So here you can see our Chrysler Pacifica.
00:11:07.000 | So here we have members of the team who are acting as passengers, getting on a back seat.
00:11:12.000 | You can notice that there is no driver on the driver's seat.
00:11:17.000 | So here we are running a car-hailing kind of service.
00:11:21.000 | So the passengers simply press a button, the application knows where they want to go, and the car goes.
00:11:28.000 | No one on the driver's seat.
00:11:30.000 | So we started with a fairly constrained geographical area in Chandler, close to Phoenix, Arizona.
00:11:40.000 | And we have been working hard to expand the testing and the scope of our operating area since then.
00:11:52.000 | So that goes well beyond a single car, a single day.
00:11:56.000 | Not only do we do that continuously, but we also have a growing fleet of self-driving cars
00:12:01.000 | that we are deploying there all the way and looking for a product launch pretty quickly.
00:12:12.000 | So I talked about 2010, and we are in 2018, and we are getting there.
00:12:18.000 | But it took quite a bit of time.
00:12:21.000 | So I think one of the key ideas that I'd like to convey here today,
00:12:26.000 | and that I will go back to during the presentation, is how much work it takes to really take a demo
00:12:35.000 | or something that's working in the lab into something that you feel safe to put on the roads,
00:12:40.000 | and get all the way to that depth of understanding, that depth of perfection in your technology,
00:12:46.000 | that you operate safely.
00:12:49.000 | So one way to say that is that when you're 90% done, you still have 90% to go.
00:12:53.000 | So the first 90% of the technology takes only 10% of the time.
00:12:58.000 | In other words, you need to 10x.
00:13:01.000 | You need to 10x the capabilities of your technology.
00:13:06.000 | You need to 10x your team size and find ways for more engineers and more researchers to collaborate together.
00:13:12.000 | You need to 10x the capabilities of your sensors.
00:13:15.000 | You need to 10x fundamentally the overall quality of the system,
00:13:19.000 | and your testing practices, as we'll see, and a lot of the aspects of the program.
00:13:24.000 | And that's what we've been working on.
00:13:28.000 | So, beyond the context of self-driving cars, I want to spend a little bit of time
00:13:35.000 | to give you kind of an insider view of the rise of deep learning.
00:13:41.000 | So remember I mentioned that back in 2009, 2010, deep learning was not really available yet
00:13:48.000 | in full capacity in the industry.
00:13:50.000 | And so over those years, actually, it took a lot of breakthroughs to be able to reach that stage.
00:13:58.000 | And one of them was the algorithm breakthrough that deep learning gave us.
00:14:02.000 | And I'll give you a little bit of a backstage view on what happened at Google during those years.
00:14:10.000 | So as you know, Google has committed itself to machine learning and deep learning very early on.
00:14:16.000 | You may have heard of the Google Brain, what we call internally the Google Brain Team,
00:14:21.000 | which is a team fundamentally hard at work to lead the bleeding edge of research, which is well known,
00:14:29.000 | but also leading the development of the tools and infrastructure of the whole machine learning ecosystem
00:14:37.000 | at Google and Waymo, to essentially allow many teams to develop machine learning at scale
00:14:44.000 | all the way to successful products.
00:14:47.000 | So they've been working and pushing, the deep learning technology has been pushing the field
00:14:52.000 | in many directions, from computer vision to speech understanding to NLP,
00:15:00.000 | and all those directions are things that you can see in Google products today.
00:15:03.000 | So whether you're talking Google Assistant or Google Photos, speech recognition, or even Google Maps,
00:15:10.000 | you can see the impact of deep learning in all those areas.
00:15:15.000 | And actually, many years ago, I myself was part of the Street View team,
00:15:21.000 | and I was leading an internal program, an internal project that we called Street Smart.
00:15:29.000 | And the goal we had at Street Smart was to use deep learning and machine learning techniques
00:15:37.000 | to go and analyze street view imagery, and as you know, that's a fairly big and varied corpus,
00:15:43.000 | so that we could extract elements that are core to our mapping strategy,
00:15:47.000 | and that way build better Google Maps.
00:15:51.000 | So for instance, in that picture, that's a piece of a panorama from street view imagery,
00:15:58.000 | and you can see that there are a lot of pieces in there that if you could find and properly localize,
00:16:06.000 | would drastically help you build better maps.
00:16:08.000 | So street numbers, obviously, that are really useful to map addresses,
00:16:12.000 | street names that when combined even on similar techniques from our views,
00:16:18.000 | will help you properly draw all the routes and give a name to them.
00:16:22.000 | And those two combined actually allow you to do very high-quality address lookups,
00:16:28.000 | which is a common query on Google Maps.
00:16:31.000 | Text in general, and more specifically text on business facades,
00:16:35.000 | that allow you to not only maybe localize business listings that you may have gotten by other means
00:16:41.000 | to actual physical locations, but also build some of those local listings directly from scratch.
00:16:48.000 | And more traffic-oriented patterns, whether it's traffic lights, traffic signs,
00:16:53.000 | that can be used then for ETA, navigation ETA predictions, and stuff like that.
00:16:59.000 | So that was our mission.
00:17:02.000 | One of the, as I mentioned, one of the hard pieces to do is to map addresses at scale.
00:17:09.000 | And so you can imagine that we had the breakthrough when we first were able to properly find
00:17:17.000 | those street numbers out of the street view imagery and out of the facade.
00:17:22.000 | Solving that problem actually requires a lot of pieces.
00:17:25.000 | Not only you need to find where the street number is on the facade,
00:17:31.000 | which is, if you think about it, a fairly hard semantic problem.
00:17:35.000 | What's the difference between a street number versus another kind of number versus other text?
00:17:41.000 | But then obviously read it, because there's no point having pixels if you cannot understand
00:17:46.000 | the number that's on the facade, all the way to properly geolocalizing it,
00:17:51.000 | so that you can then put it on Google Maps.
00:17:55.000 | And so the first deep learning application that succeeded in production,
00:18:00.000 | and that's all the way back to 2012, that we had the first system in production,
00:18:06.000 | was really the first breakthrough that we had across Alphabet
00:18:11.000 | on our ability to properly understand real scene situations.
00:18:18.000 | So here I'm going to show you a video that kind of sums it up.
00:18:22.000 | So look, every one of those segments is actually a view, starting from the car,
00:18:28.000 | going to the physical number of all those house numbers that we've been able to detect and transcribe.
00:18:35.000 | So here that's in Sao Paulo, and where you can see that when all that data is put together,
00:18:40.000 | it gives you a very consistent view of the addressing scheme.
00:18:46.000 | So that's another example. Similar things, obviously we have more, that's in Paris,
00:18:52.000 | where we have even more imagery, so more views of those physical numbers,
00:18:57.000 | that if you're able to triangulate, you're able to localize them very accurately,
00:19:02.000 | and have very accurate maps.
00:19:04.000 | So the last example I'm going to show is in Cape Town in South Africa,
00:19:10.000 | where again, the impact of that deep learning work has been huge in terms of quality.
00:19:16.000 | So many countries today actually have more than 95% of addresses mapped that way.
00:19:26.000 | So doing similar things. So obviously you can see a lot of parallelism
00:19:29.000 | between that work on street view imagery and doing the same on the real scene on the car.
00:19:36.000 | But obviously doing that on the car is even harder.
00:19:40.000 | It's even harder because you need to do that real-time and very quickly, with low latency.
00:19:48.000 | And you also need to do that in an embedded system.
00:19:51.000 | So the cars have to be entirely autonomous.
00:19:57.000 | You cannot rely on a connection to a Google data center,
00:20:00.000 | and first you don't have the time in terms of latency to bring data back and forth.
00:20:05.000 | But also you cannot rely on a connection for the safe operation of your system.
00:20:09.000 | So you need to do the processing within the car.
00:20:13.000 | So that's a paper that you can read that dates all the way back to 2014,
00:20:20.000 | where for the first time, by using slightly different techniques,
00:20:24.000 | we were able to put deep learning at work inside that constrained real-time environment,
00:20:31.000 | and start to have impact, and in that case, around pedestrian detection.
00:20:38.000 | So as I said, there are a lot of analogies.
00:20:42.000 | You can see that to properly drive that scene, like street view,
00:20:46.000 | you need to see the traffic light.
00:20:48.000 | You need to understand if the light is red or green.
00:20:51.000 | And that's what essentially will allow you to do that processing.
00:20:55.000 | Obviously driving is even more challenging beyond the real-time.
00:20:58.000 | I don't know if you saw the cyclist going through.
00:21:01.000 | So we have real stuff happening on the scene that you need to detect
00:21:04.000 | and properly understand, interpret, and predict.
00:21:07.000 | And at the same time, here I explicitly took a night driving example
00:21:13.000 | to show you that while you can choose when you take pictures of street view
00:21:17.000 | and do it in daytime and in perfect conditions,
00:21:21.000 | driving requires you to take the conditions as they are,
00:21:24.000 | and you have to deal with it.
00:21:27.000 | So there has been, from the very early beginning,
00:21:31.000 | there has been a lot of cross-pollination across the real-scene work.
00:21:36.000 | So here I took a few papers that we did in street view,
00:21:40.000 | that obviously if you read them, you'll see directly apply
00:21:43.000 | to some of the stuff we do on the cars.
00:21:45.000 | But obviously that collaboration between Google Research and Waymo
00:21:50.000 | historically went well beyond street view only and across all the research groups.
00:21:55.000 | And that still is a very strong collaboration going on
00:21:58.000 | that enables us to stay on the bleeding edge of what we can do.
00:22:04.000 | So now that we looked a little bit at how things happened,
00:22:08.000 | I want to spend more time and go into more of the details
00:22:12.000 | of what's going on in the cars today,
00:22:14.000 | and how deep learning is actually impacting our current system.
00:22:20.000 | So I think during the—if I looked at the cursors properly,
00:22:25.000 | I think during the week you went through the major pieces
00:22:28.000 | that you need to master to make a self-driving car.
00:22:31.000 | So I'm sure you heard about mapping, localization,
00:22:35.000 | so putting the car within those maps and understanding where you are
00:22:38.000 | with pretty good accuracy, perception, scene understanding,
00:22:42.000 | which is a higher level semantic understanding of what's going on in the scene,
00:22:46.000 | starting to predict what the agents are going to do around you
00:22:50.000 | so that you can do better motion planning.
00:22:53.000 | Obviously there's a whole robotics aspect at the end of the day.
00:22:57.000 | The car in many ways acts like a robot,
00:23:00.000 | whether it's around the sensor data or even the control interfaces to the car.
00:23:05.000 | And for everyone who has dealt with hardware and robotics,
00:23:09.000 | you will agree with me that it's not a perfect world,
00:23:13.000 | and you need to deal with those errors.
00:23:17.000 | Other pieces that you may have talked about is around simulation
00:23:21.000 | and essentially validation of whatever system you put together.
00:23:26.000 | So obviously machine learning and deep learning have been having a deep impact
00:23:31.000 | in a growing set of those areas,
00:23:35.000 | but for the next minutes here I'm going to focus more on the perception piece,
00:23:40.000 | which is a core element of what a self-driving car needs to do.
00:23:45.000 | So what is perception?
00:23:48.000 | So fundamentally, perception is a system in the car
00:23:52.000 | that needs to build an understanding of the world around it.
00:23:57.000 | And it does that using two major inputs.
00:24:01.000 | The first one is prior on the scene.
00:24:04.000 | So for instance, to give you an example,
00:24:07.000 | it would be a little silly to have to recompute the actual location of the road,
00:24:12.000 | the actual interconnectivity of the intersections,
00:24:16.000 | of every intersection once you get on the scene,
00:24:19.000 | because those things you can pre-compute.
00:24:21.000 | You can pre-compute in advance and save your on-board computing
00:24:25.000 | for other tasks that are more critical.
00:24:28.000 | So really, that's often referred to as the mapping exercise,
00:24:32.000 | but really it's about reducing the computation
00:24:35.000 | you're going to have to do on the car once it drives.
00:24:40.000 | The other big input, obviously,
00:24:43.000 | is what sensors are going to give you once you get on the spot.
00:24:47.000 | So sensor data is the signal that's going to tell you
00:24:52.000 | what is not like what you mapped,
00:24:54.000 | and the things, is the traffic light red or green?
00:24:57.000 | Where are the pedestrians? Where are the cars? What are they doing?
00:25:02.000 | So as we saw on the initial picture,
00:25:05.000 | we have quite a set of sensors on our self-driving cars.
00:25:11.000 | So they go from vision systems, radar, and LIDAR,
00:25:16.000 | are the three big families of sensors we have.
00:25:20.000 | One point to note here is that they are designed to be complementary.
00:25:26.000 | So they are designed to be complementary first in their localization on the car,
00:25:31.000 | so we don't put them in the same spot,
00:25:33.000 | because obviously blind spots are major issues,
00:25:36.000 | and you want to have good coverage of the field of view.
00:25:42.000 | The other piece is that they are complementary in their capabilities.
00:25:46.000 | So for instance, to give you an example,
00:25:48.000 | cameras are going to be very good to give you a dense representation.
00:25:53.000 | It's a very dense set of information.
00:25:57.000 | It contains a lot of semantic information.
00:26:00.000 | You can really see a large number of details,
00:26:07.000 | but for instance, they are not really good to give you depth.
00:26:10.000 | It's much harder, and computationally expensive,
00:26:14.000 | to get depth information out of camera systems.
00:26:18.000 | So systems like LIDAR, for instance,
00:26:20.000 | when you hit objects, will give you a very good depth estimation,
00:26:25.000 | but obviously they're going to lack a lot of the semantic information
00:26:28.000 | that you will find on camera systems.
00:26:30.000 | So all those sensors are designed to be complementary
00:26:34.000 | in terms of their capabilities.
00:26:37.000 | It goes without saying that the better your sensors are,
00:26:41.000 | the better your perception system is going to be.
00:26:45.000 | So that's why at Waymo we took the path of designing our own sensors in-house
00:26:51.000 | and enhancing what's available off the shelf today,
00:26:58.000 | because it's important for us to go all the way to be able to build
00:27:03.000 | a self-driving system that we could believe in.
00:27:09.000 | And so that's what perception does.
00:27:12.000 | It takes those two inputs and builds a representation of the scene.
00:27:17.000 | So at the end of the day, you have to realize that in nature,
00:27:23.000 | that work of perception is really what differentiates, deeply differentiates,
00:27:28.000 | what you need to do in a self-driving system,
00:27:30.000 | as opposed to a lower-level driving assistance system.
00:27:37.000 | In many cases, for instance, if you do speed cruise,
00:27:40.000 | or if you do a lot of lower-level driver assistance,
00:27:45.000 | a lot of the strategies can be around not bumping into things.
00:27:50.000 | If you see things moving around, you group them, you segment them appropriately
00:27:54.000 | in blocks of moving things, and you don't hit them,
00:27:57.000 | you're good enough in most cases.
00:28:00.000 | When you don't have a driver on the driver's seat,
00:28:02.000 | obviously the challenge totally changes scale.
00:28:05.000 | So to give you an example, for instance, if you're on a lane
00:28:09.000 | and you see a bicyclist going more slowly on the lane right of you,
00:28:14.000 | and there's a car next to you, you need to understand that there's a chance
00:28:19.000 | that that car is going to want to avoid that bicyclist,
00:28:22.000 | it's going to swerve, and you need to anticipate that behavior
00:28:25.000 | so that you can properly decide whether you want to slow down,
00:28:29.000 | give space for the car, or speed up and have the car go behind you.
00:28:33.000 | Those are the kinds of behaviors that go well beyond not bumping into things,
00:28:38.000 | and that require a much deeper understanding of the world that's going on around you.
00:28:45.000 | So let me put it in picture, and we'll come back to that example in a couple of cases.
00:28:49.000 | So here is a typical scene that we encountered, at least.
00:28:54.000 | So here, obviously, you have a police car pulled over,
00:28:59.000 | probably pulled over someone there.
00:29:01.000 | You have a cyclist on the road moving forward,
00:29:05.000 | and we need to drive through that situation.
00:29:10.000 | So the first thing you can do, you have to do, obviously, is the basics.
00:29:14.000 | So out of your sensor data, understand that a set of point clouds and pixels belong to the cyclist.
00:29:22.000 | Find that you have two cars on the scene, the police car and the car parked in front of it.
00:29:27.000 | Understand the policeman as a pedestrian.
00:29:31.000 | So basic level of understanding.
00:29:34.000 | Obviously, you need a little more than that.
00:29:36.000 | You need to go deeper in your semantics.
00:29:40.000 | Obviously, you need, if you understand that the flashing lights are on,
00:29:45.000 | you understand that the police car is an active emergency vehicle (EV),
00:29:49.000 | and is performing something on the scene.
00:29:53.000 | If you understand that this car is parked,
00:29:55.000 | obviously that's a valuable piece of information that's going to tell you whether you can pass it or not.
00:30:00.000 | Something you may have not noticed is that there are also cones.
00:30:03.000 | So there are cones here on the scene that would prevent you, for instance,
00:30:07.000 | to go and drive that pathway if you wanted to.
00:30:11.000 | Next level of getting closer to behavior prediction.
00:30:15.000 | Obviously, if you also understand that actually the police car has an open door,
00:30:21.000 | then all of a sudden you can start to expect a behavior where someone is going to get out of that car.
00:30:25.000 | And the way you swerve, even if you were to decide to swerve,
00:30:28.000 | or the way someone getting out of that car would impact the trajectory of the cyclist,
00:30:34.000 | is something you need to understand in order to properly and safely drive.
00:30:40.000 | And only then, only when you have that depth of understanding,
00:30:43.000 | you can start to come up with realistic behavior predictions
00:30:48.000 | and trajectory predictions for all those agents on the scene,
00:30:52.000 | and you can come up with a proper strategy for your planning control.
00:30:58.000 | So how is deep learning playing into that whole space?
00:31:02.000 | And how is deep learning being used to solve many of those problems?
00:31:10.000 | So remember when I said when you're 90% done, you still have 90% to go?
00:31:17.000 | So I think this is where that starts to kick in.
00:31:20.000 | I also talked about how robotics and having sensors in real life is not a perfect world.
00:31:27.000 | So actually it is a big piece of the puzzle.
00:31:30.000 | So I wish sensors would give us perfect data all the time,
00:31:34.000 | and would give us a perfect picture that we can reliably use to do deep learning.
00:31:39.000 | But unfortunately, that's not how it works.
00:31:42.000 | So here, for instance, you see an example where you have a pickup truck.
00:31:48.000 | So the imagery doesn't show it, but you have smoke coming out of the exhaust,
00:31:54.000 | and you have an exhaust that's triggering laser points.
00:32:00.000 | Not very relevant for any behavior prediction or for your driving behavior.
00:32:05.000 | So those points, obviously, and it's safe to go and drive through them.
00:32:10.000 | So those are very safe to ignore in terms of scene understanding.
00:32:16.000 | So filtering the whole bunch of data coming off your sensors is a very important task,
00:32:24.000 | because that reduces the computation you're going to have to do,
00:32:27.000 | but also key to operate safely.
00:32:30.000 | A more subtle one, but an important one, are around reflections.
00:32:36.000 | So here we are driving a scene.
00:32:39.000 | There's a car here. On the camera picture, the car is reflected in a bus.
00:32:44.000 | And if you just do a naïve detection, especially if the bus moves along with you,
00:32:50.000 | which is very typical, and everything moves,
00:32:53.000 | then all of a sudden you're going to have two cars on the scene.
00:32:56.000 | And if you take that car too seriously, all the way to impacting your behavior,
00:33:01.000 | obviously you're going to make mistakes.
00:33:03.000 | So here I showed you an example of reflections on the visual range,
00:33:10.000 | but obviously that affects all sensors in slightly different manners.
00:33:13.000 | But you could have the same effect, for instance, with LiDAR data,
00:33:17.000 | where, for instance, you drive a freeway, and you have a road sign on top of the freeway
00:33:22.000 | that will reflect in the back window of the car in front of you,
00:33:26.000 | and then showing a reflected sign on the road.
00:33:30.000 | You better understand that the thing you see on the road is actually your reflection,
00:33:35.000 | and not try to swerve around it, trying to avoid that thing at 65 miles per hour.
00:33:42.000 | So that's a big, complicated challenge.
00:33:48.000 | But assume we are able to get to proper sensor data
00:33:54.000 | that we can start to process with our machine learning.
00:33:58.000 | So by the way, for a lot of the signal processing pieces
00:34:03.000 | we already use machine learning and deep learning,
00:34:06.000 | because as you can see, for instance, in the reflection space,
00:34:08.000 | you need to, at the end of the day, you can do some tricks
00:34:12.000 | to understand the difference in the signal, but at the end of the day,
00:34:14.000 | at some point, for some of them, you're going to have to understand,
00:34:16.000 | to have a higher level of understanding of the scene,
00:34:18.000 | and realize it's not possible that the car is hiding behind the bus,
00:34:22.000 | given my field of view, for instance.
00:34:24.000 | But assuming you have good sensor data, filtered out sensor data,
00:34:28.000 | the very next thing you're going to want to do is, typically,
00:34:32.000 | is apply some kind of convolution layers on top of that imagery.
00:34:41.000 | So, if you're not familiar with convolution layers,
00:34:45.000 | so that's a very popular way to do computer vision,
00:34:50.000 | because it relies on connecting neurons with kernels
00:34:55.000 | that are going to learn, layer after layer, features of the imagery.
00:35:02.000 | So those kernels typically work locally on the sub-region of the image,
00:35:06.000 | and they're going to understand lines, they're going to understand contours,
00:35:12.000 | and as you build up layers, they're going to understand
00:35:15.000 | higher and higher levels of feature representations
00:35:18.000 | that ultimately will tell you what's happening on the image.
00:35:21.000 | So that's a very common technique, and much more efficient, obviously,
00:35:25.000 | than fully connected layers, for instance, that wouldn't work.
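
To make the idea concrete, here is a minimal sketch in TensorFlow (purely illustrative, not the network described in the talk): a few stacked 2D convolution layers, where early kernels respond to local patterns like edges and contours and deeper layers build up higher-level features before a final classifier.

```python
# Minimal illustrative convnet (not Waymo's architecture): stacked 2D
# convolutions learn progressively higher-level features of the image.
import tensorflow as tf

def tiny_convnet(num_classes=10, input_shape=(64, 64, 3)):
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, 3, activation="relu", padding="same"),  # edges, contours
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),  # parts, textures
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),  # object-level features
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes),                                # class logits
    ])

model = tiny_convnet()
model.summary()
```
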
00:35:28.000 | But unfortunately, a lot of the state-of-the-art is actually in 2D convolutions.
00:35:34.000 | So again, they've been developed on imagery,
00:35:38.000 | and typically they require a fairly dense input.
00:35:42.000 | So, for imagery, it's great because pixels are very dense.
00:35:46.000 | You always have a pixel next to the next one.
00:35:48.000 | There's not a lot of void.
00:35:50.000 | If you were, for instance, to think if you were to do plane convolutions
00:35:55.000 | on a very sparse laser point, for instance,
00:35:57.000 | then you would have a lot of holes, and those don't work nearly as well.
00:36:01.000 | So typically, what we do is to first project sensor data into 2D planes,
00:36:07.000 | and do processing on those.
00:36:09.000 | So two very typical views that we use, the first one is top-down,
00:36:14.000 | so bird's-eye views, which is going to give you a Google Maps kind of view of the scene.
00:36:18.000 | So it's great, for instance, to map cars and objects moving along the scene.
00:36:25.000 | But it's harder to put imagery, pixels, imagery you saw from the car,
00:36:31.000 | into those top-down views.
00:36:33.000 | So there's another famous one, a common one, that is the driver view,
00:36:38.000 | so projection onto the plane from the driver's perspective,
00:36:42.000 | that are much better at utilizing imagery,
00:36:47.000 | because essentially that's how imagery got captured.
00:36:50.000 | We didn't use drones.
00:36:52.000 | So here, for instance, you're going to see how you can,
00:36:54.000 | if your sensors are properly registered,
00:36:57.000 | how you can use both LiDAR and imagery signals together
00:37:02.000 | to better understand the scene.
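
As a rough illustration of that projection step (the grid extent and resolution below are made-up numbers, and this is not Waymo's pipeline), sparse LiDAR points can be rasterized into a dense top-down image that 2D convolutions can then consume:

```python
# Illustrative sketch: rasterize sparse LiDAR points (x, y, z in meters,
# vehicle at the origin) into a dense top-down "bird's-eye" occupancy grid.
import numpy as np

def lidar_to_birdseye(points, extent=40.0, cell=0.2):
    """points: (N, 3) array of x, y, z. Returns an (H, W) occupancy grid."""
    size = int(2 * extent / cell)
    grid = np.zeros((size, size), dtype=np.float32)
    # Keep points inside the square [-extent, extent] around the car.
    mask = (np.abs(points[:, 0]) < extent) & (np.abs(points[:, 1]) < extent)
    xy = points[mask, :2]
    cols = ((xy[:, 0] + extent) / cell).astype(int)
    rows = ((xy[:, 1] + extent) / cell).astype(int)
    grid[rows, cols] = 1.0  # mark occupied cells (could also store max height)
    return grid

grid = lidar_to_birdseye(np.random.uniform(-50, 50, size=(1000, 3)))
print(grid.shape)  # (400, 400)
```
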
00:37:06.000 | So the first kind of processing you can do is what is called segmentation.
00:37:15.000 | So once you have pixels or laser points,
00:37:18.000 | you need to group them together into objects
00:37:22.000 | that you can then use for better understanding and processing.
00:37:27.000 | So unfortunately, a lot of the objects you encounter while driving
00:37:31.000 | don't have a predefined shape.
00:37:33.000 | So here I do the example of snow, but if you think about vegetation,
00:37:37.000 | or if you think about trash bags, for instance,
00:37:39.000 | you can't come up with a prior understanding
00:37:45.000 | on how they're going to look like.
00:37:47.000 | And so you have to be ready to have any shape of those objects.
00:37:51.000 | So one of the techniques that works pretty well
00:37:55.000 | is to build a smaller convolution network
00:37:59.000 | that you're going to slide across the projection of your sensor data.
00:38:04.000 | So that's the sliding window approach.
00:38:07.000 | So here, for instance, if you have a pixel-accurate snow detector
00:38:12.000 | that you slide across the image,
00:38:14.000 | then you'll be able to build a representation of those patches of snow
00:38:18.000 | and drive appropriately around them.
00:38:21.000 | So that works pretty well, but as you can imagine,
00:38:24.000 | it's a little expensive computationally,
00:38:27.000 | because it's like, if you remember,
00:38:30.000 | I don't know if you've seen them, actually,
00:38:32.000 | it's like the old dot matrix printers.
00:38:35.000 | The print head had to go "choo-choo" and print a page, point by point.
00:38:40.000 | So it works pretty well, but it's pretty slow.
00:38:43.000 | So obviously, it's very analogous to that.
00:38:47.000 | But it works pretty well.
00:38:49.000 | So that works pretty well, but you need to be very conscious
00:38:52.000 | on which area of the scene you want to apply it to, to stay efficient.
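
A minimal sketch of the sliding-window idea described above, with a hypothetical `patch_classifier` standing in for the small convolutional network; the point is that the classifier has to be run once per window position, which is where the cost comes from:

```python
# Sliding-window sketch: apply a small patch classifier
# (`patch_classifier(patch) -> score`) at every stride position to build a
# coarse per-location map, e.g. of snow vs. not-snow.
import numpy as np

def sliding_window_map(image, patch_classifier, patch=32, stride=16):
    h, w = image.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    scores = np.zeros((rows, cols), dtype=np.float32)
    for i in range(rows):
        for j in range(cols):
            window = image[i * stride:i * stride + patch,
                           j * stride:j * stride + patch]
            scores[i, j] = patch_classifier(window)  # expensive: one call per window
    return scores

# Example with a dummy classifier (mean brightness as a stand-in):
dummy = lambda patch: float(patch.mean() > 0.5)
print(sliding_window_map(np.random.rand(128, 128), dummy).shape)  # (7, 7)
```
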
00:38:59.000 | Fortunately, many of the objects you care about have predefined priors.
00:39:05.000 | So for instance, if you take a car from the top-down view,
00:39:09.000 | from the bird's view, it's going to be a rectangle.
00:39:12.000 | You can take that shape prior into consideration.
00:39:16.000 | In most cases, even, on the driving lanes,
00:39:20.000 | they're going to go in similar directions,
00:39:22.000 | whether they go forward or they come the other way.
00:39:25.000 | They're going to go in the direction of the lanes.
00:39:28.000 | Same for adjacent streets.
00:39:30.000 | So you can use those priors to actually do some more efficient deep learning
00:39:36.000 | that in the literature is conveyed under the ideas of single-shot multi-box, for instance.
00:39:43.000 | So here, again, you would start with convolution towers,
00:39:46.000 | but you do only one pass of convolution.
00:39:49.000 | It's the same difference between a dot matrix printer and a press, right?
00:39:54.000 | That would print a page at once.
00:39:57.000 | It's only an analogy, but I think that conveys the idea pretty well.
00:40:01.000 | So here you would train a deep net that would directly take the whole projection of your sensor data
00:40:07.000 | and output boxes that encode the priors you have.
00:40:13.000 | So here, for instance, I can show you how such a thing would work for cone detection.
00:40:18.000 | So you can see that we don't have all the fidelity of the per-pixel cone detection,
00:40:23.000 | but we don't really care about that.
00:40:24.000 | We just need to know there is a cone somewhere, and we take a box prior.
00:40:28.000 | And obviously what that image is also meant to show is that,
00:40:33.000 | since it's a lot cheaper computationally,
00:40:37.000 | you can obviously run that on a pretty wide range of space.
00:40:40.000 | And even if you have a lot of them, that still is going to be a very efficient way to get that data.
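
For the single-shot approach, here is an illustrative sketch (SSD-flavored, not the actual model): a single convolutional pass over the whole input emits, at every grid cell and for every prior ("anchor") box, class scores plus box offsets, with no per-window loop.

```python
# Illustrative single-shot detection head: one pass predicts, per grid cell
# and per prior box, class scores and box offsets.
import tensorflow as tf

def single_shot_head(num_classes=2, num_priors=4, input_shape=(128, 128, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (16, 32, 64):                       # small backbone
        x = tf.keras.layers.Conv2D(filters, 3, strides=2,
                                   padding="same", activation="relu")(x)
    cls = tf.keras.layers.Conv2D(num_priors * num_classes, 3, padding="same")(x)
    box = tf.keras.layers.Conv2D(num_priors * 4, 3, padding="same")(x)
    return tf.keras.Model(inputs, [cls, box])

model = single_shot_head()
cls_out, box_out = model.outputs
print(cls_out.shape, box_out.shape)  # (None, 16, 16, 8) (None, 16, 16, 16)
```
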
00:40:50.000 | So we talked about, remember, the flashing lights on top of the police car.
00:40:56.000 | So even if you properly detect and segment cars, let's say, on the road,
00:41:03.000 | many cars have very special semantics.
00:41:06.000 | So here on that slide I'm showing you many examples of EV, emergency vehicles,
00:41:11.000 | that you need obviously to understand.
00:41:13.000 | You need to understand, first, that it is an EV, and two, whether the EV is active or not.
00:41:18.000 | School buses are not really emergency vehicles, but obviously whether the bus has lights on,
00:41:22.000 | or the bus has a stop sign open on the side, carry heavy semantics that you need to understand.
00:41:29.000 | So how do you deal with that?
00:41:31.000 | Back to the deep learning techniques.
00:41:33.000 | One thing you could do is take that patch, build a new convolution tower,
00:41:40.000 | and build a classifier on top of that,
00:41:43.000 | and essentially build a school bus classifier, a school bus with lights on classifier,
00:41:47.000 | a school bus with stop sign open classifier.
00:41:50.000 | I'm pretty sure that would work pretty well, but obviously that would be a lot of work,
00:41:54.000 | and pretty expensive to run on the car, because you would need to ...
00:41:58.000 | And convolution layers typically are the most expensive pieces of a neural net.
00:42:03.000 | So one better thing to do is to use embeddings.
00:42:08.000 | So if you're not familiar with it, embeddings essentially are vector representations
00:42:13.000 | of objects that you can learn with deep nets that will carry some semantic meaning of those objects.
00:42:21.000 | So for instance, given a vehicle, you can build a vector that's going to carry the information
00:42:29.000 | that that vehicle is a school bus, whether the lights are on, whether the stop sign is open,
00:42:35.000 | and then you're back into a vector space, which is much smaller, much more efficient,
00:42:39.000 | that you can operate in to do further processing.
00:42:44.000 | So historically, those embeddings have been more closely associated with word embeddings.
00:42:49.000 | So in a typical text, if you were able to build those vectors out of words,
00:42:55.000 | so out of every word in a piece of text, you build a vector that represents the meaning of that word.
00:43:00.000 | And then if you look at the sequence of those words and operate in the vector space,
00:43:04.000 | you start to understand the semantics of those sentences.
00:43:08.000 | So one of the early projects that you can look at is called word2vec,
00:43:13.000 | which was done in the NLP group at Google, where they were able to build such things.
00:43:19.000 | And they discovered that that embedding space actually carried some interesting vector space properties,
00:43:25.000 | such as if you took the vector for king minus the vector for man plus the vector for woman,
00:43:31.000 | actually you ended up with a vector where the closest word to that vector would be queen, essentially.
00:43:36.000 | So that's to show you how those vector representations can be very powerful
00:43:41.000 | in the amount of information they can contain.
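
That word analogy can be reproduced on a toy scale. The tiny embedding table below is made up so the property holds by construction, but it shows the kind of vector arithmetic and nearest-neighbor lookup involved:

```python
# Toy word2vec-style analogy: the nearest vector to king - man + woman
# is queen (embeddings here are invented [royalty, gender] coordinates).
import numpy as np

embeddings = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "bus":   np.array([-1.0, 0.0]),
}

query = embeddings["king"] - embeddings["man"] + embeddings["woman"]
nearest = min(
    (w for w in embeddings if w not in ("king", "man", "woman")),
    key=lambda w: np.linalg.norm(embeddings[w] - query),
)
print(nearest)  # queen
```
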
00:43:45.000 | Let's talk about pedestrians.
00:43:50.000 | So we talked about semantic segmentation.
00:43:55.000 | Remember, so the ability to go pixel by pixel for things that don't really have a shape.
00:44:02.000 | We talked about using shape priors.
00:44:05.000 | But pedestrians actually combine the complexity of those two approaches for many reasons.
00:44:13.000 | One is that they obviously are deformable, and pedestrians come with many shapes and poses.
00:44:21.000 | As you can see here, I think here you have a guy or someone on a skateboard,
00:44:28.000 | crouching, more unusual poses that you need to understand.
00:44:33.000 | And the recall you need to have on pedestrians is very high.
00:44:36.000 | And pedestrians show up in many different situations.
00:44:39.000 | So for instance here, you have occluded pedestrians that you need to see,
00:44:43.000 | because there's a good chance when you do your behavior prediction
00:44:46.000 | that that person here is going to jump out of the car, and you need to be ready for that.
00:44:51.000 | So last but not least, predicting the behavior of pedestrians is really hard,
00:44:58.000 | because they move in any direction.
00:45:00.000 | A car moving in that direction, you can safely bet it's not going to drastically change angle at a moment's notice.
00:45:07.000 | But if you take children, for instance, it's a little more complicated.
00:45:11.000 | So they may not pay attention, they may jump in any direction, and you need to be ready for that.
00:45:16.000 | So it's harder in terms of shape prior, it's harder in terms of recall,
00:45:20.000 | and it's also harder in terms of prediction.
00:45:23.000 | And you need to have a fine understanding of the semantics to understand that.
00:45:26.000 | Another example here that we encountered is you get to an intersection,
00:45:32.000 | and you have a visually impaired person that's jaywalking on the intersection.
00:45:38.000 | And you obviously need to understand all of that to know that you need to yield to that person, pretty clearly.
00:45:44.000 | So, person on the road, maybe you should yield to them.
00:45:50.000 | Not easy. So for instance here, so there is actually, I don't know if it's a real person or a mannequin or something.
00:45:59.000 | So, but here we go. Something that frankly really looks like a pedestrian, that you should probably classify as a pedestrian,
00:46:05.000 | but lying on the bed of a pickup truck.
00:46:09.000 | So, and obviously you shouldn't yield to that person, right, because if you were to,
00:46:15.000 | and yielding to a pedestrian at 35 miles per hour, for instance, is hitting the brakes pretty hard, right,
00:46:21.000 | with the risk of a rear collision.
00:46:24.000 | So obviously you need to understand that that person is traveling with a truck,
00:46:30.000 | and he's not actually on the road, and it's okay to not yield to him.
00:46:36.000 | So those are examples of the richness of the semantics you need to understand.
00:46:41.000 | Obviously one way to do that is to start and understand the behavior of things over time.
00:46:47.000 | Everything we talked about up until now in how we use deep learning to solve some of those problems
00:46:52.000 | was on a pure frame basis.
00:46:54.000 | But understanding that that person is moving with a truck versus the jaywalker in the middle of the intersection,
00:46:59.000 | obviously that kind of information you can get to if you observe the behavior over time.
00:47:06.000 | Back to the embeddings.
00:47:08.000 | So if you have vector representations of those objects, you can start and track them over time.
00:47:14.000 | So a common technique that you can use to get there is to use recurrent neural networks,
00:47:19.000 | that essentially are networks that will build a state that gets better and better
00:47:24.000 | as it gets more observations, sequential observations of a real pattern.
00:47:28.000 | So for instance, coming back to the words example I gave earlier,
00:47:33.000 | you have one word, you see its vector representation, another word in a sentence,
00:47:38.000 | so you understand more about what the author is trying to say.
00:47:41.000 | Third word, fourth word, at the end of the sentence you have a good understanding,
00:47:45.000 | and you can start to translate, for instance.
00:47:48.000 | So here's a similar idea.
00:47:50.000 | If you have a semantic representation encoded in an embedding for the pedestrian and the car under it,
00:47:57.000 | and track that over time and build a state that gets more and more meaning as time goes by,
00:48:04.000 | you're going to get closer and closer to a good understanding of what's going on in the scene.
00:48:09.000 | So my point here is, those vector representations combined with recurrent neural networks
00:48:15.000 | is a common technique that can help you figure that out.
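
As a sketch of that combination (illustrative only; the embedding size and behavior classes are assumptions, not Waymo's system), a recurrent network can consume a track of per-frame object embeddings and accumulate a state that is used to classify behavior over time:

```python
# Recurrent net over a track of per-frame object embeddings: the LSTM state
# improves with each sequential observation before a behavior classification.
import tensorflow as tf

embedding_dim = 64      # per-frame object embedding size (assumed)
num_behaviors = 4       # hypothetical behavior classes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, embedding_dim)),   # variable-length track
    tf.keras.layers.LSTM(128),                     # state accumulates over time
    tf.keras.layers.Dense(num_behaviors),          # behavior logits
])

# One track of 10 consecutive frames for a single object:
track = tf.random.normal([1, 10, embedding_dim])
print(model(track).shape)  # (1, 4)
```
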
00:48:23.000 | Back to the point.
00:48:26.000 | When you're 90% done, you still have 90% to go.
00:48:30.000 | And so to get to the last leg of my talk here today,
00:48:35.000 | I want to give you some appreciation for what it takes to truly build a machine learning system at scale
00:48:43.000 | and industrialize it.
00:48:46.000 | So up till now we talked a lot about algorithms.
00:48:48.000 | As I said earlier, algorithms have been a breakthrough,
00:48:51.000 | and the efficiency of those algorithms has been a breakthrough for us to succeed at the self-driving task.
00:48:57.000 | But it takes a lot more than algorithms to actually get there.
00:49:04.000 | The first piece that you need to 10x is around the labeling efforts.
00:49:12.000 | So a lot of the algorithms we talked about are supervised,
00:49:16.000 | meaning that even if you have a strong network architecture and you come up with the right one,
00:49:21.000 | they are supervised in the sense that you need to give, in order to train that network,
00:49:26.000 | you need to come up with a representative set, a high-quality set of labeled data
00:49:30.000 | that's going to map some input to predict the outputs you wanted to predict.
00:49:35.000 | So that's a pedestrian, that's a car. That's a pedestrian, that's a car.
00:49:38.000 | And the network will learn in a supervised way how to build the right representations.
00:49:45.000 | So there's a lot. Obviously the unsupervised space is a very active domain of research.
00:49:52.000 | Our own team of research at Waymo, in collaboration with Google, is around that domain.
00:49:58.000 | But today a lot of it still is supervised.
00:50:01.000 | So to give you orders of magnitude, so here I represented in a logarithmic scale
00:50:07.000 | the size of a couple of data sets.
00:50:09.000 | So you may be familiar with ImageNet, which I think is in the 15-million-labels range.
00:50:16.000 | That guy jumping represents a number of seconds from birth to college graduation, hopefully coming soon.
00:50:25.000 | And so that's more of a historical tidbit.
00:50:29.000 | But the first, remember the find the house number, the street number on the facade problem?
00:50:36.000 | So back in those days, it took us a multi-billion label data set to actually teach the network.
00:50:43.000 | So those were very early days. Today we do a lot better, obviously.
00:50:47.000 | But that's to give you an idea of scale.
00:50:50.000 | So being able to have labeling operations that produce large and high-quality label data sets
00:50:57.000 | is key for your success. And that's a big piece of the puzzle you need to solve.
00:51:02.000 | So obviously today we do a lot better. Not only do we require less data,
00:51:07.000 | but we also can generate those data sets much more efficiently.
00:51:12.000 | You can use machine learning itself to come up with labels, and use operators,
00:51:17.000 | and more importantly use hybrid models where you use labelers to more and more fix the discrepancies
00:51:23.000 | or the mistakes, and not have to label the whole thing from scratch.
00:51:26.000 | So combining, so that's a whole space of active learning and stuff like that.
00:51:30.000 | So combining those techniques together, obviously you can get to completion faster.
00:51:35.000 | It's still very common to need samples in the millions range to train a robust solution.
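
A minimal sketch of that hybrid labeling loop, where `model` and `human_label` are stand-ins for a pre-labeling model and a human labeling step: machine labels are accepted when the model is confident, and only the uncertain cases are escalated to humans.

```python
# Hybrid labeling sketch: pre-label with a model, send only low-confidence
# examples to human labelers instead of labeling everything from scratch.
def hybrid_label(examples, model, human_label, threshold=0.9):
    labeled = []
    for x in examples:
        probs = model(x)                              # list of per-class probabilities (stand-in)
        conf = max(probs)
        if conf >= threshold:
            labeled.append((x, probs.index(conf)))    # accept the machine label
        else:
            labeled.append((x, human_label(x)))       # escalate to a human
    return labeled
```
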
00:51:43.000 | Another piece is around computation, computing power.
00:51:48.000 | So again, that's kind of a historical tidbit.
00:51:53.000 | Around the street number models, so here it's the detection model, and here is the transcriber model.
00:51:59.000 | So obviously the comparison is only worth what it's worth here.
00:52:04.000 | But if you look at the number of neurons, or number of connections per neuron,
00:52:08.000 | which are two important parameters of any neural net, that gives you an idea of scale.
00:52:15.000 | So obviously it's many orders of magnitude away from what the human brain can do,
00:52:20.000 | but you start to be competitive, and even in some cases in the mammal space.
00:52:26.000 | So again, historical data, but the main point here is that you need a lot of computation,
00:52:32.000 | and you need to have access to a lot of computing to either train and infer those trained models on real-time on the scene.
00:52:43.000 | And that requires a lot of very robust engineering and infrastructure development to get to those scales.
00:52:53.000 | But Google is pretty good at that, and obviously we at Waymo have access to the Google infrastructure and tools to essentially get there.
00:53:02.000 | So I don't know if you heard, so the way it's happening at Google is around TensorFlow.
00:53:08.000 | So maybe you've heard about it as more of a programming language to program machine learning,
00:53:14.000 | and encode network architectures.
00:53:20.000 | But actually, TensorFlow is also becoming, or is actually, the whole ecosystem that can combine all those pieces together
00:53:29.000 | and do machine learning at scale at Google and Waymo.
00:53:33.000 | So as I said, it's a language that allows teams to collaborate and work together.
00:53:40.000 | That's a data representation in which you can represent your label data sets, for instance, or your training batches.
00:53:48.000 | That's a runtime that you can deploy onto Google data centers, and it's good that we have access to that computing power.
00:53:58.000 | Another piece is accelerators.
00:54:01.000 | So back in the early days when we had CPUs to run deep learning models at scale, which is less efficient,
00:54:08.000 | over time GPUs came into the mix, and Google is pretty active into developing a very advanced set of hardware accelerators.
00:54:18.000 | So you may have heard about TPUs, Tensor Processing Units,
00:54:22.000 | which are proprietary chipsets that Google deploys in its data centers
00:54:28.000 | that allow you to train and infer more efficiently those deep learning models.
00:54:32.000 | And TensorFlow is the glue that allows you to deploy at scale across those pieces.
00:54:38.000 | Very important piece to get there.
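
To give a flavor of what that glue looks like in practice (a generic example, not Waymo's training code), the same Keras model definition can be placed under a TensorFlow distribution strategy that targets whatever accelerators are available; a TPUStrategy would follow the same pattern.

```python
# The same model code runs under a distribution strategy that schedules work
# onto the available accelerators (local GPUs here, CPU fallback otherwise).
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():                       # variables created per replica
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Training data flows through the same tf.data representation regardless of
# which hardware the strategy runs it on.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 32]), tf.random.normal([256, 1]))).batch(32)
model.fit(dataset, epochs=1)
```
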
00:54:43.000 | So it's nice. You're smart. We built a smart algorithm.
00:54:48.000 | We were able to collect enough data to train it. Great! Ship it!
00:54:54.000 | Well, the self-driving system is pretty sophisticated, and that's a complex system to understand,
00:55:01.000 | and that's a complex system that requires extensive testing.
00:55:07.000 | And I think the last leg that you need to cover to do machine learning at scale
00:55:12.000 | and with a high safety bar is around your testing program.
00:55:17.000 | So we have three legs that we use to make sure that our machine learning is ready for production.
00:55:26.000 | One is around real-world driving, another one is around simulation, and the last one is around structured testing.
00:55:31.000 | So I'll come back to that.
00:55:33.000 | In terms of real-world driving, obviously there is no way around it.
00:55:38.000 | If you want to encounter situations and see and understand how you behave, you need to drive.
00:55:44.000 | So as you can see, the driving at Waymo has been accelerating over time.
00:55:48.000 | It's still accelerating. So we crossed 3 million miles driven back in May 2017,
00:55:55.000 | and only six months later, back in November, we reached 4 million.
00:56:00.000 | So that's an accelerating pace.
00:56:03.000 | Obviously, not every mile is equal, and what you care about are the miles that carry new situations and important situations.
00:56:10.000 | So what we do, obviously, is drive in many different situations.
00:56:14.000 | So those miles got acquired across 20 cities, many weather conditions, and many environments.
00:56:23.000 | Is 4 million a lot? To give you a rough order of magnitude, that's about 160 times around the globe.
00:56:29.000 | Even more importantly, it's hard to estimate, but it's probably the equivalent of around 300 years of human driving.
00:56:40.000 | So in that data set, potentially, you have 300 years of experience that your machine learning can tap into to learn what to do.
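For the curious, the two comparisons roughly check out. The Earth's circumference and the average annual mileage per driver used below are my own assumed round numbers, not figures from the talk.

```python
# Rough sanity check of the two comparisons above (assumed round numbers).
driven_miles = 4_000_000
earth_circumference_miles = 24_900
avg_miles_per_driver_per_year = 13_000

print(driven_miles / earth_circumference_miles)      # ~160 trips around the globe
print(driven_miles / avg_miles_per_driver_per_year)  # ~300 years of human driving
```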
00:56:52.000 | Even more important is your ability to simulate.
00:56:58.000 | Obviously, the software changes regularly. So if for each new revision of the software, you need to go and re-drive 4 million miles,
00:57:06.000 | it's not really practical, and it's going to take a lot of time.
00:57:09.000 | So the ability to have a good enough simulation that you can replay all those miles that you've driven
00:57:15.000 | in any new iteration of the software is key for you to decide if the new version is ready or not.
00:57:20.000 | Even more important is your ability to make those miles even more efficient and tweak them.
00:57:29.000 | So here is a screenshot of an internal tool that we call CarCraft,
00:57:34.000 | that essentially gives us the ability to fuzz or change the parameters of the actual scene we've driven.
00:57:41.000 | So what if the cars were driving at a slightly different speed?
00:57:44.000 | What if there was an extra car that was on the scene?
00:57:48.000 | What if a pedestrian crossed in front of the car?
00:57:51.000 | So you can use the actual driven miles as a base, and then augment them into new situations
00:57:57.000 | that you can test your self-driving system against.
00:58:02.000 | So that's a very powerful way to actually drastically multiply the impact of any mile you drive.
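To illustrate the idea of fuzzing a logged scene, here is a toy sketch. The Scenario and Agent structures and the perturbation ranges are invented for illustration and have nothing to do with CarCraft's actual internals.

```python
# Toy scenario fuzzing: jitter agent speeds, sometimes add an extra vehicle or
# a crossing pedestrian, to multiply one logged scene into many variants.
import copy
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    kind: str          # "vehicle" or "pedestrian"
    position_m: float  # distance along the road, simplified to 1-D
    speed_mps: float

@dataclass
class Scenario:
    agents: list = field(default_factory=list)

def fuzz(base: Scenario, n_variants: int, seed: int = 0) -> list:
    """Produce perturbed copies of a logged scene."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        s = copy.deepcopy(base)
        for agent in s.agents:
            agent.speed_mps *= rng.uniform(0.8, 1.2)   # slightly faster / slower
        if rng.random() < 0.3:                          # sometimes an extra car
            s.agents.append(Agent("vehicle", rng.uniform(10, 50), rng.uniform(5, 15)))
        if rng.random() < 0.2:                          # sometimes a crossing pedestrian
            s.agents.append(Agent("pedestrian", rng.uniform(5, 30), rng.uniform(0.5, 2.0)))
        variants.append(s)
    return variants

logged = Scenario(agents=[Agent("vehicle", 20.0, 10.0)])
for variant in fuzz(logged, n_variants=3):
    print(variant)
```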
00:58:09.000 | And simulation is another of those massive-scale projects that you need to cover.
00:58:15.000 | So a couple of orders of magnitude here.
00:58:18.000 | So using Google's infrastructure, we have the ability to run a virtual fleet of 25,000 cars 24/7 in data centers.
00:58:27.000 | So those are software stacks that emulate the driving across either raw miles that we've driven
00:58:33.000 | or modified miles that help us understand the behavior of the software.
00:58:38.000 | So to give you another magnitude, last year alone we drove 2.5 billion of those miles in data centers.
00:58:47.000 | So remember: 4 million real driven miles total, versus 2.5 billion simulated.
00:58:51.000 | So that's roughly three orders of magnitude of expansion in your ability to truly understand how the system behaves.
00:59:00.000 | But there's still a long tail.
00:59:02.000 | There's a whole long tail of situations that will happen very rarely.
00:59:08.000 | So the way we decided to tackle those is to set up our own testing facility
00:59:15.000 | that is a mock of a city and driving situation.
00:59:18.000 | So we do that in a 90-acre testing facility on a former Air Force base in Central California
00:59:26.000 | that we set up with traffic lights, railroad crossings,
00:59:31.000 | I mean, truly trying to reproduce a real-life situation,
00:59:35.000 | and where we set up very specific scenarios that we haven't necessarily encountered during regular driving
00:59:41.000 | but that we want to test.
00:59:43.000 | And then we feed that back into the simulation, re-augment it using the same augmentation strategies,
00:59:48.000 | and inject it into those 2.5 billion simulated miles.
00:59:51.000 | So here I'm going to show you two quick examples of such tests.
00:59:55.000 | So here, we just have a car back up as the self-driving car gets close, see what happens,
01:00:02.000 | and use all that sensor data to re-inject into simulation.
01:00:06.000 | Another example is going to be around people dropping boxes.
01:00:11.000 | So remember, try to imagine the kind of understanding and segmentation you need to do
01:00:18.000 | to understand what's happening there, and the semantic understanding you have to build.
01:00:22.000 | And to make it even more interesting,
01:00:25.000 | note that a car has been placed on the other side,
01:00:27.000 | so that swerving is not an option, right, without hitting that car.
01:00:32.000 | So we drive complex situations that exercise the whole stack, from perception to motion planning,
01:00:36.000 | and make sure that we are reliable, even in those long-tail examples.
01:00:44.000 | Are we done?
01:00:46.000 | It looks like a lot of work.
01:00:48.000 | I wish, but no.
01:00:50.000 | Actually, we still have a lot of very interesting work coming,
01:00:54.000 | so I don't have much time to go into too many of those details,
01:00:56.000 | but I'm just going to give you two big directions.
01:00:59.000 | The first one is around growing our, what we call, ODD,
01:01:03.000 | our operational design domain.
01:01:09.000 | So extending our fleet of self-driving cars, not only geographically,
01:01:15.000 | geographically meaning deploying into urban cores,
01:01:21.000 | but also deploying into different weather conditions.
01:01:24.000 | So just as of this morning or yesterday,
01:01:27.000 | we announced that we're going to grow our testing in San Francisco, for instance,
with way more cars, which brings urban environments, slopes, fog, as I said.
01:01:38.000 | And so that's obviously a very, very important direction that we need to go into,
01:01:44.000 | and where machine learning is going to keep playing a very important role.
01:01:48.000 | Another area is around semantic understanding.
01:01:52.000 | So in case you haven't noticed yet, I am from France.
01:01:57.000 | That's a famous roundabout in Paris, Place de l'Etoile,
01:02:03.000 | which seems pretty chaotic, but I've driven it many times without any issues, touching wood.
01:02:11.000 | But I know that it took a lot of semantics and understanding for me to do it safely.
01:02:17.000 | I had a lot of expectations on what people do,
01:02:21.000 | a lot of communication, visual, gestures, to essentially get through that thing safely.
01:02:28.000 | And those require a lot of deeper semantic understanding of the scene around
01:02:35.000 | for self-driving cars to get through.
01:02:37.000 | So that's an example of a direction.
01:02:41.000 | So back to my objectives.
01:02:44.000 | I hope I covered many of those, or at least you have directions for further reading and investigations.
01:02:50.000 | On those three objectives I had today,
01:02:54.000 | the first one was around context, context of the space, context of the history at Google and Waymo,
01:03:00.000 | and how deep the roots go back in time.
01:03:07.000 | My second objective was to tie some of the technical, algorithmic solutions
01:03:13.000 | that you may have talked about during this class into the practical cases we need to solve
01:03:18.000 | in a production system.
01:03:20.000 | And last but not least, really emphasize the scale and the engineering infrastructure work
01:03:27.000 | that needs to happen to really bring such a project to fruition in a production system.
01:03:35.000 | Last tweet.
01:03:37.000 | That's a scene with kids jumping on bags, like Frogger, across the street.
01:03:44.000 | And I think we have time for a few questions.
01:03:47.000 | So I'll hand it over to Hannah.
01:03:49.000 | Thank you.
01:03:55.000 | I was wondering, you showed your CarCraft simulation a little bit.
01:03:58.000 | So from a robotics background, usually the systems tend to fail at the intersection between perception and planning.
01:04:03.000 | So your planner might assume something about a perfect world that perception cannot deliver.
01:04:07.000 | So I was wondering if you use the simulation environment also to induce these perception failures,
01:04:12.000 | or whether that's really specific for scenario testing,
01:04:15.000 | and whether you have other validation arguments for the perception side.
01:04:19.000 | Very good question.
01:04:22.000 | So one thing I didn't mention is that the simulator obviously enables you to simulate many different layers in a stack.
01:04:29.000 | And one of the hard-core engineering problems is to actually properly design your stack
01:04:34.000 | so that you can isolate and test independently.
01:04:36.000 | Like any robust piece of software, you need to have good APIs and layers.
01:04:41.000 | So we have such a layer in our system between perception and planning.
01:04:47.000 | And the way you would test perception is more by measuring the performance of your perception system
01:04:53.000 | across the real miles,
01:04:56.000 | and then using and tweaking the output of the perception system, with its mistakes.
01:05:02.000 | So having a good understanding of the mistakes it makes,
01:05:04.000 | and reproducing those mistakes realistically in the new scenarios you come up with
01:05:08.000 | as part of your simulator, to realistically test the planning side of the house.
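A minimal sketch of that idea, assuming a made-up object format and error rates; in practice the miss rate and noise would come from measuring perception performance on real miles.

```python
# Inject a measured perception error model between ground truth and the
# planner, so planning is tested against realistic perception mistakes.
import random

def perception_with_errors(true_objects, miss_rate=0.05, position_sigma_m=0.3,
                           rng=random.Random(0)):
    """Simulate perception output: occasionally drop an object and add
    Gaussian noise to positions, with rates taken from measured performance."""
    detected = []
    for obj in true_objects:
        if rng.random() < miss_rate:
            continue                      # missed detection
        noisy = dict(obj)
        noisy["x_m"] = obj["x_m"] + rng.gauss(0.0, position_sigma_m)
        noisy["y_m"] = obj["y_m"] + rng.gauss(0.0, position_sigma_m)
        detected.append(noisy)
    return detected

ground_truth = [{"kind": "pedestrian", "x_m": 12.0, "y_m": -1.5},
                {"kind": "vehicle", "x_m": 30.0, "y_m": 0.0}]
print(perception_with_errors(ground_truth))
```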
01:05:13.000 | Thanks very much.
01:05:16.000 | You talked about the car as being a complex system,
01:05:20.000 | and it has to be an industrial product that is being conceived at scale and produced at scale.
01:05:25.000 | Do you have a systematic way of creating the architectures of the embedded system?
01:05:31.000 | You have so many choices for sensors, algorithms,
01:05:34.000 | each problem you showed has many different solutions.
01:05:38.000 | That's going to create different interfaces between each element.
01:05:41.000 | So how do you choose which architecture you put in a car?
01:05:45.000 | That's true for any complex software stack.
01:05:49.000 | So there's a combination of different things.
01:05:52.000 | So the first thing, obviously, that I didn't talk too much about here,
01:05:55.000 | but is around the vast amount of research that we do at Waymo,
01:06:00.000 | but also we do in collaboration with Google Teams,
01:06:04.000 | to actually understand even what building blocks we have at our disposal
01:06:10.000 | to even play with and come up with those production systems.
01:06:15.000 | The other piece is obviously the one you decide to take all the way to production.
01:06:21.000 | So you're right.
01:06:23.000 | So the two big elements here, I would say, the first one,
01:06:26.000 | I mean the main element, frankly, is your ability to explore that space,
01:06:31.000 | and that search actually takes a lot of people to get through.
01:06:38.000 | So something I try to say is that
01:06:43.000 | part of the second 90% is your ability to grow your team
01:06:48.000 | and essentially grow the number of people who will be able
01:06:51.000 | to productively participate in your engineering project.
01:06:55.000 | And that's where the robustness we need to bring
01:06:58.000 | into our development environment, our testing,
01:07:03.000 | is really key to be able to grow that team at a bigger scale
01:07:08.000 | and essentially explore all those paths and come up with the best one.
01:07:11.000 | And at the end of the day, the robustness of testing is the judge.
01:07:16.000 | That's what tells you whether an approach works or not.
01:07:19.000 | It's not a philosophical debate.
01:07:23.000 | Thank you for your talk.
01:07:25.000 | So the car is making a decision at every single time step,
01:07:28.000 | on direction and speed.
01:07:31.000 | And part of the reason why you have this simulation
01:07:33.000 | is so that you can test those decisions in every possible scenario.
01:07:38.000 | So once self-driving cars become production-ready and out on the streets,
01:07:43.000 | do you expect that the decision will be made based on prior understanding
01:07:48.000 | of every single situation which is possible?
01:07:51.000 | Or can the car make a new decision in real time
01:07:55.000 | based on its scene understanding and everything around it?
01:07:59.000 | So at the end of the day,
01:08:03.000 | the goal of the system is not to build a library of events
01:08:09.000 | that you can reproduce one by one and make sure you encode each of them.
01:08:14.000 | The analogy in machine learning would be overfitting.
01:08:17.000 | It's like if you encountered five situations,
01:08:20.000 | I'm pretty sure you can hard-code the perfect thing you need to do
01:08:24.000 | in those five situations.
01:08:25.000 | But the sixth one that happens, if you don't generalize,
01:08:27.000 | actually is going to fall through.
01:08:30.000 | So really the complexity of what you need to do
01:08:34.000 | is extract the core principles that make you safely drive.
01:08:40.000 | And have the algorithms learn those principles
01:08:44.000 | rather than the specifics of any situation.
01:08:47.000 | Because as you said, the parameter space of a real scene is infinite.
01:08:53.000 | So we try to fuzz that a little bit with a simulator.
01:08:58.000 | What if the cars went a little faster, a little slower?
01:09:00.000 | But the goal is not to enumerate all possibilities
01:09:03.000 | and make sure we do well on those.
01:09:05.000 | But the goal is to bring more diversity
01:09:07.000 | to the learning of those general principles
01:09:09.000 | that will be learned by the system
01:09:11.000 | or will be coded in the system
01:09:13.000 | for the car to behave properly and generalize
01:09:16.000 | when a new situation occurs.
01:09:20.000 | Maybe a couple more questions is okay?
01:09:22.000 | Okay.
01:09:24.000 | Fantastic talk.
01:09:26.000 | One of the questions I had was,
01:09:27.000 | you mentioned the difficulty of identifying snow
01:09:30.000 | because it could come in many different shapes.
01:09:33.000 | One of the things that I immediately thought of was,
01:09:36.000 | I know it was just an urban legend,
01:09:37.000 | but it was that urban legend about the Inuit
01:09:40.000 | having 150 different words for snow.
01:09:42.000 | And you mentioned embeddings of objects.
01:09:46.000 | Do you think one possible approach might be
01:09:49.000 | to create a much wider array of object embeddings
01:09:53.000 | for things like snow?
01:09:55.000 | I mean, if you're...
01:09:58.000 | Many different types of snow could actually have
01:09:59.000 | pretty different impacts on driving,
01:10:02.000 | whether it be just like a flurry,
or the kind of really heavy blizzard
01:10:07.000 | like we just had.
01:10:10.000 | Yeah, I think if you look at it
01:10:12.000 | from an algorithmic point of view,
01:10:16.000 | that may make sense.
01:10:19.000 | But maybe something I'd like to emphasize a little more
01:10:23.000 | is that the very hard line to walk
01:10:26.000 | is the line between what's algorithmically possible
01:10:30.000 | and what's computationally feasible in the car.
01:10:36.000 | I think...
01:10:38.000 | So, two points on your remark.
01:10:41.000 | So, if we had the processing power to process every point
01:10:47.000 | up to a large level of understanding,
01:10:50.000 | and had the computing power to do that,
01:10:52.000 | maybe that would be an approach.
01:10:53.000 | But that would be very expensive,
01:10:55.000 | and that's a hard thing to do.
01:10:57.000 | Even more importantly,
01:11:00.000 | for instance, it wouldn't make sense
01:11:02.000 | to run behavior prediction on every snowflake
01:11:05.000 | you see on the side of the road, right?
01:11:07.000 | And you need to group...
01:11:09.000 | That's the whole point of segmentation.
01:11:10.000 | You need to group what you see into semantic objects
01:11:15.000 | that are likely to exhibit a behavior as a whole,
01:11:19.000 | and reason at that level of abstraction
01:11:21.000 | to have a meaningful semantic understanding
01:11:23.000 | that you need to drive, essentially.
01:11:25.000 | So, yeah, it's an in-between.
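As a toy illustration of that grouping idea, here is a minimal sketch that clusters nearby 1-D points into objects; real segmentation works on full 3-D point clouds and images, so this only shows the level-of-abstraction point.

```python
# Rather than reasoning about every point (or snowflake) individually, group
# nearby points into objects and reason at that level. Simple distance-based
# clustering, purely illustrative.
def cluster_points(points, max_gap=1.0):
    """Group sorted 1-D points into clusters when consecutive points are within
    max_gap of each other; each cluster stands in for one semantic object."""
    clusters = []
    for p in sorted(points):
        if clusters and p - clusters[-1][-1] <= max_gap:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters

# Many returns collapse into a handful of objects to predict behavior for.
lidar_like_points = [0.1, 0.3, 0.5, 10.2, 10.4, 25.0]
print(cluster_points(lidar_like_points))   # [[0.1, 0.3, 0.5], [10.2, 10.4], [25.0]]
```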
01:11:30.000 | Last question. Make it a good one.
01:11:32.000 | Thanks for the talk.
01:11:34.000 | So, if you're using perception for your scene understanding,
01:11:37.000 | are you worried about, like, adversarial examples
01:11:39.000 | or things that have been demonstrated?
01:11:42.000 | Or do you not believe that this is, like, a real-world attack
01:11:45.000 | that could be used for perception-based systems?
01:11:48.000 | So, generally speaking,
01:11:50.000 | I think even before adversarial attacks,
01:11:56.000 | errors... I mean, errors can happen, right?
01:11:58.000 | And errors happen in every module.
01:12:01.000 | So I think a prime example of that which is not adversarial
01:12:04.000 | is the reflection case.
01:12:06.000 | It's like, yeah, you could as well have put a sticker
01:12:08.000 | on the car, on the bus, and say, "Ah, you're confused.
01:12:10.000 | "You think it's a car. It's not a car."
01:12:12.000 | But you don't need to put a sticker on the bus.
01:12:14.000 | It's like, real life already brings a lot of those examples.
01:12:18.000 | So really, there are two ways out.
01:12:21.000 | The first one is to have sensors that complement each other.
01:12:27.000 | So I try to emphasize that.
01:12:29.000 | Really, different sensors or different systems
01:12:33.000 | are not going to make the same mistakes,
01:12:35.000 | and so they're going to complement each other.
01:12:37.000 | And that's a very important piece of redundancy
01:12:39.000 | that we build into the system.
01:12:41.000 | The other one is also, even in the reflection case,
01:12:45.000 | is in the understanding.
01:12:49.000 | So the way you as a human wouldn't be fooled
01:12:52.000 | is because you understand and you know
01:12:54.000 | it's not a thing that can happen.
01:12:57.000 | The same way, you know that when a car is reflected in a bus,
01:13:01.000 | there's no way you could be seeing through the bus
01:13:03.000 | to a real car behind it.
01:13:05.000 | So that level of semantic understanding
01:13:07.000 | is what is going to tell you what is true and what is not,
01:13:11.000 | or what is a mistake, an error in your stack.
01:13:14.000 | And so similar patterns apply.
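A toy sketch of that complementarity idea: simple majority voting across sensors that are unlikely to fail in the same way. The real system's cross-checks are of course far richer than this, so treat it only as an illustration of the principle.

```python
# Only accept an object when enough independent sensors agree; a reflection
# might fool the camera, but lidar and radar report nothing at that location.
def fuse_detections(camera_sees, lidar_sees, radar_sees, min_votes=2):
    """Simple voting scheme over per-sensor boolean detections of one object."""
    votes = sum([camera_sees, lidar_sees, radar_sees])
    return votes >= min_votes

print(fuse_detections(camera_sees=True, lidar_sees=False, radar_sees=False))  # False
print(fuse_detections(camera_sees=True, lidar_sees=True, radar_sees=False))   # True
```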
01:13:17.000 | We'd like to thank you very much, Sacha Arnoud,
01:13:19.000 | for coming to MIT.