Sacha Arnoud, Director of Engineering, Waymo - MIT Self-Driving Cars
Chapters
0:00
2:50 self-driving cars?
13:27 The rise of Deep Learning
30:57 Deep learning techniques
32:31 Reflections
37:05 Semantic segmentation example: snow
38:58 End-to-end architecture predicting from raw data
40:15 Computer vision based single shot with cones
40:49 Embeddings for finer classification
43:47 Pedestrians
49:03 Dataset size and quality circa 2013
52:52 TensorFlow, a new programming paradigm
58:12 Large scale simulation
Today we have the Director of Engineering, Head of Perception at Waymo, 00:00:05.000 |
a company that's recently driven over 4 million miles autonomously, 00:00:11.000 |
and in so doing inspired the world in what artificial intelligence and good engineering can do. 00:00:18.000 |
So please give a warm welcome to Sacha Arnoud. 00:00:36.000 |
Thanks a lot for giving me the opportunity to be able to come and share my passion for self-driving cars 00:00:44.000 |
and be able to share with you all the great work we've been doing at Waymo over the last 10 years 00:00:50.000 |
and give you more details on the recent milestones we've reached. 00:00:55.000 |
So as you'll see, we'll cover a lot of different topics, 00:01:00.000 |
some more technical, some more about context. 00:01:05.000 |
But over the content, I have three main objectives that I'd like to convey today. 00:01:13.000 |
So keep that in mind as we go through the presentation. 00:01:17.000 |
My first one is to give you some background around the self-driving space 00:01:25.000 |
and what's happening there and what it takes to build self-driving cars, 00:01:29.000 |
but also give you some behind-the-scenes views and tidbits on the history of machine learning, deep learning, 00:01:38.000 |
and how it all came together within the big alphabet family from Google to Waymo. 00:01:45.000 |
Another piece, obviously, another objective I have is to give you some technical meat 00:01:52.000 |
around the techniques that are working today on our self-driving cars. 00:01:56.000 |
So I think during the class you'll hear a lot, you've heard a lot about different deep learning techniques, 00:02:03.000 |
models, architectures, algorithms, and I'll try to put that into a coherent whole 00:02:09.000 |
so that you can see how those pieces fit together to build the system we have today. 00:02:14.000 |
And last but not least, I think as Lex mentioned, it takes a lot more, actually, than algorithms 00:02:21.000 |
to build a sophisticated system such as our self-driving cars. 00:02:26.000 |
And fundamentally, it takes a full industrial project to make that happen. 00:02:32.000 |
And I'll try to give you some color, which hopefully is a little different from what you've heard during the week. 00:02:38.000 |
I'll try to give you some color on what it takes to actually carry out such an industrial project in real life 00:02:45.000 |
and essentially productionize machine learning. 00:02:51.000 |
So we hear a lot of talk. We hear a lot about self-driving cars. 00:02:56.000 |
It's a very hot topic, and for very good reasons. 00:03:00.000 |
I can tell you for sure that 2017 has been a great year for Waymo. 00:03:05.000 |
Actually, only a year ago, in January 2017, Waymo became its own company. 00:03:11.000 |
So that was a major milestone and a testimony to the robustness of the solution 00:03:16.000 |
so that we could move to a productization phase. 00:03:20.000 |
So what you see on the picture here is our latest generation self-driving vehicle. 00:03:31.000 |
You can already see a bunch of sensors. I'll come back to that and give you more insights on what they do and how they operate. 00:03:42.000 |
So self-driving indeed draws a lot of attention, and for very good reasons. 00:03:49.000 |
I personally believe, and I think you will agree with me, that self-driving really has the potential to deeply change 00:03:57.000 |
the way we think about mobility and the way we move people and things around. 00:04:03.000 |
So only to cover a few aspects here, obviously, and I don't want to go into too many details, 00:04:15.000 |
94% of US crashes today involve human errors. 00:04:19.000 |
A lot of those errors are around distraction and things that could be avoided. 00:04:27.000 |
Accessibility and access to mobility is also a big motivation of ours. 00:04:35.000 |
So obviously, the self-driving technology has the potential to make it very available and cheaper for more people to be able to move around. 00:04:44.000 |
And last but not least is efficiency, collective efficiency. 00:04:50.000 |
We spend a lot of time in our cars, in long commute hours. 00:04:56.000 |
I personally spend a lot of time in long commute hours. 00:05:00.000 |
And that time we spend in traffic probably could be better spent doing something else 00:05:04.000 |
than having to drive the car in complicated situations. 00:05:08.000 |
Beyond traffic, obviously, self-driving technology has the potential to deeply change the way we think about traffic, 00:05:17.000 |
parking spots, urban environments, city design. 00:05:27.000 |
So that's why the mission we set ourselves at Waymo 00:05:30.000 |
is fundamentally to make it safe and easy to move people and things around. 00:05:36.000 |
So that's a nice mission, and we've been on it for a very long time. 00:05:43.000 |
So actually, the whole adventure started close to 10 years ago in 2009. 00:05:50.000 |
And at the time, that started under the umbrella of a Google project that you may have heard of called Chauffeur. 00:06:00.000 |
And back in those days, so remember, we were before the deep learning days, at least in the industry. 00:06:06.000 |
And so really back in those days, the first objective of the project was to try and assemble a first prototype vehicle, 00:06:14.000 |
take off-the-shelf sensors, assemble them together, and try to go and decide if self-driving is even a possibility. 00:06:23.000 |
It's one thing to have some prototype somewhere, but is that even a thing that is worth pursuing? 00:06:29.000 |
Which is a very common way for Google to tackle problems. 00:06:33.000 |
So the genesis for that work was to come up with a pretty aggressive objective. 00:06:40.000 |
So the first milestone for the team was to essentially assemble ten 100-mile loops 00:06:48.000 |
in Northern California, around Mountain View, for a total of 1,000 miles, 00:06:55.000 |
and try and see if they could build a first system that would be able to go and drive those loops autonomously. 00:07:04.000 |
So the team was not afraid: those loops went through some very aggressive routes. 00:07:13.000 |
So you see that some of those loops go through the Santa Cruz Mountains, which is an area in California that, 00:07:20.000 |
as you see, I'll show you a video, that has very small roads and two-way traffic and cliffs, 00:07:26.000 |
with negative obstacles and complicated patterns like that. 00:07:30.000 |
Some of those routes were going on highways, including some of the busiest highways. 00:07:39.000 |
Some of those routes were going around Lake Tahoe, which is in the Sierras in California, 00:07:46.000 |
where you can encounter different kinds of weather, and again, different kinds of road conditions. 00:07:51.000 |
Those routes were going around bridges, and the Bay Area has quite a few bridges to go through. 00:07:58.000 |
Some of them were even going through a dense urban area. 00:08:02.000 |
So you can see San Francisco being driven. You can see Monterey, some of the Monterey centers being driven. 00:08:10.000 |
And as you'll see on the video, those truly bring dense urban area challenges. 00:08:19.000 |
So since I promised it, so here you're going to see some pictures of the driving. It's kind of working. 00:08:27.000 |
So here, with better quality, so here you see the roads I was talking about on the Santa Cruz Mountains, 00:08:34.000 |
driving in the night, animals crossing the street, freeway driving, going through toll booths. 00:08:40.000 |
That's the Monterey area that is fairly dense. There's an aquarium there, a pretty popular one. 00:08:46.000 |
That's the famous Lombard Street in San Francisco that you may have heard of, 00:08:51.000 |
which in San Francisco always brings its unique set of challenges between fog and slopes, 00:09:00.000 |
So that was all the way back in 2010. So those 10 loops were successfully completed 100% autonomously back in 2010. 00:09:17.000 |
So on the heels of that success, the team decided, and Google decided, that self-driving was worth pursuing, 00:09:26.000 |
and moved forward with the development of the technology and testing. 00:09:33.000 |
So we've been at it for all those years, and have been working very hard on it. 00:09:38.000 |
Historically, Waymo and I think all the other companies out there have been relying on what we call safety drivers 00:09:46.000 |
to still sit behind the wheel, even if the car is driving autonomously. 00:09:51.000 |
We still have a safety driver who is able to take over at any time and make sure that we have very safe operations. 00:09:58.000 |
And we've been accumulating miles and knowledge and developing the system, many iterations of the system, 00:10:07.000 |
We reached a major milestone, as Lex mentioned, back in November, 00:10:13.000 |
where for the first time we reached a level of confidence and maturity in a system 00:10:18.000 |
that we felt confident and proved to ourselves that it was safe to remove the safety driver. 00:10:25.000 |
As you can imagine, that's a major milestone, because it takes a very high level of confidence 00:10:32.000 |
to not have that backup solution of a safety driver to take over were something to arise. 00:10:38.000 |
So here I'm going to show you a small video, a quick capture of that event. 00:10:44.000 |
So the video is from one of the first times we did that. 00:10:48.000 |
Since then we've been continuously operating driverless cars, self-driving cars, in the Phoenix area in Arizona 00:11:07.000 |
So here we have members of the team who are acting as passengers, getting on a back seat. 00:11:12.000 |
You can notice that there is no driver on the driver's seat. 00:11:17.000 |
So here we are running a car-hailing kind of service. 00:11:21.000 |
So the passengers simply press a button, the application knows where they want to go, and the car goes. 00:11:30.000 |
So we started with a fairly constrained geographical area in Chandler, close to Phoenix, Arizona. 00:11:40.000 |
And we have been working hard to expand the testing and the scope of our operating area since then. 00:11:52.000 |
So that goes well beyond a single car, a single day. 00:11:56.000 |
Not only do we do that continuously, but we also have a growing fleet of self-driving cars 00:12:01.000 |
that we are deploying there as we work toward a product launch pretty soon. 00:12:12.000 |
So I talked about 2010, and we are in 2018, and we are getting there. 00:12:21.000 |
So I think one of the key ideas that I'd like to convey here today, 00:12:26.000 |
and that I will go back to during the presentation, is how much work it takes to really take a demo 00:12:35.000 |
or something that's working in the lab into something that you feel safe to put on the roads, 00:12:40.000 |
and get all the way to that depth of understanding, that depth of perfection in your technology, 00:12:49.000 |
So one way to say that is that when you're 90% done, you still have 90% to go. 00:12:53.000 |
So 90% of the technology takes only 10% of the time. 00:13:01.000 |
You need to 10x the capabilities of your technology. 00:13:06.000 |
You need to 10x your team size and find ways for more engineers and more researchers to collaborate together. 00:13:12.000 |
You need to 10x the capabilities of your sensors. 00:13:15.000 |
You need to 10x fundamentally the overall quality of the system, 00:13:19.000 |
and your testing practices, as we'll see, and a lot of the aspects of the program. 00:13:28.000 |
So, beyond the context of self-driving cars, I want to spend a little bit of time 00:13:35.000 |
to give you kind of an insider view of the rise of deep learning. 00:13:41.000 |
So remember I mentioned that back in 2009, 2010, deep learning was not really available yet 00:13:50.000 |
And so over those years, actually, it took a lot of breakthroughs to be able to reach that stage. 00:13:58.000 |
And one of them was the algorithm breakthrough that deep learning gave us. 00:14:02.000 |
And I'll give you a little bit of a backstage view on what happened at Google during those years. 00:14:10.000 |
So as you know, Google has committed itself to machine learning and deep learning very early on. 00:14:16.000 |
You may have heard of the Google Brain, what we call internally the Google Brain Team, 00:14:21.000 |
which is a team fundamentally hard at work at the bleeding edge of research, which is well known, 00:14:29.000 |
but also leading the development of the tools and infrastructure of the whole machine learning ecosystem 00:14:37.000 |
at Google and Waymo, to essentially allow many teams to develop machine learning at scale. 00:14:47.000 |
So they've been working and pushing the field 00:14:52.000 |
in many directions, from computer vision to speech understanding to NLP, 00:15:00.000 |
and all those directions are things that you can see in Google products today. 00:15:03.000 |
So whether you're talking Google Assistant or Google Photos, speech recognition, or even Google Maps, 00:15:10.000 |
you can see the impact of deep learning in all those areas. 00:15:15.000 |
And actually, many years ago, I myself was part of the Street View team, 00:15:21.000 |
and I was leading an internal program, an internal project that we called Street Smart. 00:15:29.000 |
And the goal we had at Street Smart was to use deep learning and machine learning techniques 00:15:37.000 |
to go and analyze street view imagery, and as you know, that's a fairly big and varied corpus, 00:15:43.000 |
so that we could extract elements that are core to our mapping strategy, 00:15:51.000 |
So for instance, in that picture, that's a piece of a panorama from street view imagery, 00:15:58.000 |
and you can see that there are a lot of pieces in there that if you could find and properly localize, 00:16:06.000 |
would drastically help you build better maps. 00:16:08.000 |
So street numbers, obviously, that are really useful to map addresses, 00:16:12.000 |
street names that, when combined with similar techniques from other views, 00:16:18.000 |
will help you properly draw all the roads and give a name to them. 00:16:22.000 |
And those two combined actually allow you to do very high-quality address lookups. 00:16:31.000 |
Then there's text, and more specifically text on business facades, 00:16:35.000 |
that allow you to not only maybe localize business listings that you may have gotten by other means 00:16:41.000 |
to actual physical locations, but also build some of those local listings directly from scratch. 00:16:48.000 |
And more traffic-oriented patterns, whether it's traffic lights, traffic signs, 00:16:53.000 |
that can be used then for ETA, navigation ETA predictions, and stuff like that. 00:17:02.000 |
One of the, as I mentioned, one of the hard pieces to do is to map addresses at scale. 00:17:09.000 |
And so you can imagine that we had the breakthrough when we first were able to properly find 00:17:17.000 |
those street numbers out of the street view imagery and out of the facade. 00:17:22.000 |
Solving that problem actually requires a lot of pieces. 00:17:25.000 |
Not only do you need to find where the street number is on the facade, 00:17:31.000 |
which is, if you think about it, a fairly hard semantic problem. 00:17:35.000 |
What's the difference between a street number versus another kind of number versus other text? 00:17:41.000 |
But then obviously read it, because there's no point having pixels if you cannot understand 00:17:46.000 |
the number that's on the facade, all the way to properly geolocalizing it, 00:17:55.000 |
And so the first deep learning application that succeeded in production, 00:18:00.000 |
and that's all the way back to 2012, that we had the first system in production, 00:18:06.000 |
was really the first breakthrough that we had across Alphabet 00:18:11.000 |
on our ability to properly understand real scene situations. 00:18:18.000 |
So here I'm going to show you a video that kind of sums it up. 00:18:22.000 |
So look, every one of those segments is actually a view, starting from the car, 00:18:28.000 |
going to the physical number of all those house numbers that we've been able to detect and transcribe. 00:18:35.000 |
So here that's in Sao Paulo, and where you can see that when all that data is put together, 00:18:40.000 |
it gives you a very consistent view of the addressing scheme. 00:18:46.000 |
So that's another example. Similar things, obviously we have more, that's in Paris, 00:18:52.000 |
where we have even more imagery, so more views of those physical numbers, 00:18:57.000 |
that if you're able to triangulate, you're able to localize them very accurately, 00:19:04.000 |
So the last example I'm going to show is in Cape Town in South Africa, 00:19:10.000 |
where again, the impact of that deep learning work has been huge in terms of quality. 00:19:16.000 |
So many countries today actually have 95% or more of their addresses mapped that way. 00:19:26.000 |
So, doing similar things: obviously you can see a lot of parallels 00:19:29.000 |
between that work on Street View imagery and doing the same on the live scene from the car. 00:19:36.000 |
But obviously doing that on the car is even harder. 00:19:40.000 |
It's even harder because you need to do that real-time and very quickly, with low latency. 00:19:48.000 |
And you also need to do that in an embedded system. 00:19:57.000 |
You cannot rely on a connection to a Google data center, 00:20:00.000 |
and first you don't have the time in terms of latency to bring data back and forth. 00:20:05.000 |
But also you cannot rely on a connection for the safe operation of your system. 00:20:09.000 |
So you need to do the processing within the car. 00:20:13.000 |
So that's a paper that you can read that dates all the way back to 2014, 00:20:20.000 |
where for the first time, by using slightly different techniques, 00:20:24.000 |
we were able to put deep learning at work inside that constrained real-time environment, 00:20:31.000 |
and start to have impact, and in that case, around pedestrian detection. 00:20:42.000 |
You can see that, to properly drive that scene, like with Street View, 00:20:48.000 |
you need to understand if the light is red or green. 00:20:51.000 |
And that's what essentially will allow you to do that processing. 00:20:55.000 |
Obviously driving is even more challenging beyond the real-time. 00:20:58.000 |
I don't know if you saw the cyclist going through. 00:21:01.000 |
So we have real stuff happening on the scene that you need to detect 00:21:04.000 |
and properly understand, interpret, and predict. 00:21:07.000 |
And at the same time, here I explicitly took a night driving example 00:21:13.000 |
to show you that while you can choose when you take pictures of street view 00:21:17.000 |
and do it in daytime and in perfect conditions, 00:21:21.000 |
driving requires you to take the conditions as they are, 00:21:27.000 |
So there has been, from the very early beginning, 00:21:31.000 |
a lot of cross-pollination between that Street View work and the real-scene work. 00:21:36.000 |
So here I took a few papers that we did in Street View 00:21:40.000 |
that, obviously, if you read them, you'll see directly apply to the self-driving problem. 00:21:45.000 |
But obviously that collaboration between Google Research and Waymo 00:21:50.000 |
historically went well beyond street view only and across all the research groups. 00:21:55.000 |
And that still is a very strong collaboration going on 00:21:58.000 |
that enables us to stay on the bleeding edge of what we can do. 00:22:04.000 |
So now that we looked a little bit at how things happened, 00:22:08.000 |
I want to spend more time and go into more of the details 00:22:14.000 |
and how deep learning is actually impacting our current system. 00:22:20.000 |
So I think, if I looked at the curriculum properly, 00:22:25.000 |
I think during the week you went through the major pieces 00:22:28.000 |
that you need to master to make a self-driving car. 00:22:31.000 |
So I'm sure you heard about mapping, localization, 00:22:35.000 |
so putting the car within those maps and understanding where you are 00:22:38.000 |
with pretty good accuracy, perception, scene understanding, 00:22:42.000 |
which is a higher level semantic understanding of what's going on in the scene, 00:22:46.000 |
starting to predict what the agents are going to do around you 00:22:53.000 |
Obviously there's a whole robotics aspect at the end of the day, 00:23:00.000 |
whether it's around the sensor data or even the control interfaces to the car. 00:23:05.000 |
And for everyone who has dealt with hardware and robotics, 00:23:09.000 |
you will agree with me that it's not a perfect world, 00:23:17.000 |
Other pieces that you may have talked about is around simulation 00:23:21.000 |
and essentially validation of whatever system you put together. 00:23:26.000 |
So obviously machine learning and deep learning have been having a deep impact across all of those pieces, 00:23:35.000 |
but for the next minutes here I'm going to focus more on the perception piece, 00:23:40.000 |
which is a core element of what a self-driving car needs to do. 00:23:48.000 |
So fundamentally, perception is a system in the car 00:23:52.000 |
that needs to build an understanding of the world around it. 00:24:07.000 |
it would be a little silly to have to recompute the actual location of the road, 00:24:12.000 |
the actual interconnectivity of every intersection, 00:24:16.000 |
once you get on the scene. 00:24:21.000 |
You can pre-compute that in advance and save your on-board computing. 00:24:28.000 |
So really, that's often referred to as the mapping exercise, 00:24:32.000 |
but really it's about reducing the computation 00:24:35.000 |
you're going to have to do on the car once it drives. 00:24:43.000 |
is what sensors are going to give you once you get on the spot. 00:24:47.000 |
So sensor data is the signal that's going to tell you about the people 00:24:54.000 |
and the things: is the traffic light red or green? 00:24:57.000 |
Where are the pedestrians? Where are the cars? What are they doing? 00:25:05.000 |
To do that, we have quite a set of sensors on our self-driving cars. 00:25:11.000 |
So vision systems, radar, and LiDAR 00:25:16.000 |
are the three big families of sensors we have. 00:25:20.000 |
One point to note here is that they are designed to be complementary. 00:25:26.000 |
So they are designed to be complementary first in their localization on the car, 00:25:33.000 |
because obviously blind spots are major issues, 00:25:36.000 |
and you want to have good coverage of the field of view. 00:25:42.000 |
The other piece is that they are complementary in their capabilities. 00:25:48.000 |
So cameras, for instance, are going to be very good at giving you a dense representation. 00:26:00.000 |
You can really see a large number of details, 00:26:07.000 |
but they are not really good at giving you depth. 00:26:10.000 |
It's much harder, and computationally expensive, 00:26:14.000 |
to get depth information out of camera systems. 00:26:20.000 |
Lasers, on the other hand, when they hit objects, will give you a very good depth estimation, 00:26:25.000 |
but obviously they're going to lack a lot of the semantic information. 00:26:30.000 |
So all those sensors are designed to be complementary 00:26:37.000 |
It goes without saying that the better your sensors are, 00:26:41.000 |
the better your perception system is going to be. 00:26:45.000 |
So that's why at Waymo we took the path of designing our own sensors in-house 00:26:51.000 |
and enhancing what's available off the shelf today, 00:26:58.000 |
because it's important for us to go all the way to be able to build 00:27:03.000 |
a self-driving system that we could believe in. 00:27:12.000 |
So that's what perception does: it takes those two inputs and builds a representation of the scene. 00:27:17.000 |
So at the end of the day, you have to realize that the nature of 00:27:23.000 |
that perception work is really what differentiates, deeply differentiates, 00:27:28.000 |
what you need to do in a self-driving system, 00:27:30.000 |
as opposed to a lower-level driver assistance system. 00:27:37.000 |
In many cases, for instance, if you do cruise control, 00:27:40.000 |
or if you do a lot of lower-level driver assistance, 00:27:45.000 |
a lot of the strategies can be around not bumping into things. 00:27:50.000 |
If you see things moving around, you group them, you segment them appropriately 00:27:54.000 |
in blocks of moving things, and you don't hit them, 00:28:00.000 |
When you don't have a driver on the driver's seat, 00:28:02.000 |
obviously the challenge totally changes scale. 00:28:05.000 |
So to give you an example, for instance, if you're on a lane 00:28:09.000 |
and you see a bicyclist going more slowly on the lane right of you, 00:28:14.000 |
and there's a car next to you, you need to understand that there's a chance 00:28:19.000 |
that that car is going to want to avoid that bicyclist, 00:28:22.000 |
it's going to swerve, and you need to anticipate that behavior 00:28:25.000 |
so that you can properly decide whether you want to slow down, 00:28:29.000 |
give space for the car, or speed up and have the car go behind you. 00:28:33.000 |
Those are the kinds of behaviors that go well beyond not bumping into things, 00:28:38.000 |
and that require a much deeper understanding of the world that's going on around you. 00:28:45.000 |
So let me put it in picture, and we'll come back to that example in a couple of cases. 00:28:49.000 |
So here is a typical scene that we encountered, at least. 00:28:54.000 |
So here, obviously, you have a police car pulled over, 00:29:01.000 |
You have a cyclist on the road moving forward, 00:29:10.000 |
So the first thing you can do, you have to do, obviously, is the basics. 00:29:14.000 |
So out of your sensor data, understand that a set of point clouds and pixels belong to the cyclist. 00:29:22.000 |
Find that you have two cars on the scene, the police car and the car parked in front of it. 00:29:40.000 |
Obviously, if you understand that the flashing lights are on, 00:29:45.000 |
you understand that the police car is an active EV, 00:29:55.000 |
and that's a valuable piece of information that's going to tell you whether you can pass it or not. 00:30:00.000 |
Something you may have not noticed is that there are also cones. 00:30:03.000 |
So there are cones here on the scene that would prevent you, for instance, 00:30:07.000 |
from going and driving that pathway if you wanted to. 00:30:11.000 |
The next level is getting closer to behavior prediction. 00:30:15.000 |
Obviously, if you also understand that actually the police car has an open door, 00:30:21.000 |
then all of a sudden you can start to expect a behavior where someone is going to get out of that car. 00:30:25.000 |
And the way you swerve, even if you were to decide to swerve, 00:30:28.000 |
or the way someone getting out of that car would impact the trajectory of the cyclist, 00:30:34.000 |
is something you need to understand in order to properly and safely drive. 00:30:40.000 |
And only then, only when you have that depth of understanding, 00:30:43.000 |
you can start to come up with realistic behavior predictions 00:30:48.000 |
and trajectory predictions for all those agents on the scene, 00:30:52.000 |
and you can come up with a proper strategy for your planning control. 00:30:58.000 |
So how is deep learning playing into that whole space? 00:31:02.000 |
And how is deep learning being used to solve many of those problems? 00:31:10.000 |
So remember when I said when you're 90% done, you still have 90% to go? 00:31:20.000 |
I also talked about how robotics and having sensors in real life is not a perfect world. 00:31:30.000 |
So I wish sensors would give us perfect data all the time, 00:31:34.000 |
and would give us a perfect picture that we can reliably use to do deep learning. 00:31:42.000 |
So here, for instance, you see an example where you have a pickup truck. 00:31:48.000 |
So the imagery doesn't show it, but you have smoke coming out of the exhaust, 00:31:54.000 |
and that exhaust smoke is triggering laser points. 00:32:00.000 |
Those points are not very relevant for any behavior prediction or for your driving behavior, 00:32:05.000 |
and obviously it's safe to go and drive through them. 00:32:10.000 |
So those are very safe to ignore in terms of scene understanding. 00:32:16.000 |
So filtering the whole bunch of data coming off your sensors is a very important task, 00:32:24.000 |
because that reduces the computation you're going to have to do, 00:32:30.000 |
A more subtle one, but an important one, are around reflections. 00:32:39.000 |
There's a car here. On the camera picture, the car is reflected in a bus. 00:32:44.000 |
And if you just do a naïve detection, especially if the bus moves along with you, 00:32:53.000 |
then all of a sudden you're going to have two cars on the scene. 00:32:56.000 |
And if you take that car too seriously, all the way to impacting your behavior, 00:33:03.000 |
So here I showed you an example of reflections on the visual range, 00:33:10.000 |
but obviously that affects all sensors in slightly different manners. 00:33:13.000 |
But you could have the same effect, for instance, with LiDAR data, 00:33:17.000 |
where, for instance, you drive a freeway, and you have a road sign on top of the freeway 00:33:22.000 |
that will reflect in the back window of the car in front of you, 00:33:26.000 |
and then show up as a reflected sign on the road. 00:33:30.000 |
You'd better understand that the thing you see on the road is actually a reflection, 00:33:35.000 |
and not try to swerve around it, trying to avoid that thing at 65 miles per hour. 00:33:42.000 |
So that's a big, complicated challenge. 00:33:48.000 |
But assume we are able to get to proper sensor data 00:33:54.000 |
that we can start to process with our machine learning. 00:33:58.000 |
So by the way, for a lot of those signal processing pieces 00:34:03.000 |
we already use machine learning and deep learning, 00:34:06.000 |
because, as you can see, for instance, in the reflection space, 00:34:08.000 |
you can do some tricks 00:34:12.000 |
to understand the difference in the signal, but at the end of the day, 00:34:14.000 |
at some point, for some of them, you're going to need 00:34:16.000 |
a higher level of understanding of the scene, 00:34:18.000 |
and realize it's not possible that the car is hiding behind the bus. 00:34:24.000 |
But assuming you have good sensor data, filtered out sensor data, 00:34:28.000 |
the very next thing you're going to want to do is, typically, 00:34:32.000 |
is apply some kind of convolution layers on top of that imagery. 00:34:41.000 |
So, if you're not familiar with convolution layers, 00:34:45.000 |
so that's a very popular way to do computer vision, 00:34:50.000 |
because it relies on connecting neurons with kernels 00:34:55.000 |
that are going to learn, layer after layer, features of the imagery. 00:35:02.000 |
So those kernels typically work locally on the sub-region of the image, 00:35:06.000 |
and they're going to understand lines, they're going to understand contours, 00:35:12.000 |
and as you build up layers, they're going to understand 00:35:15.000 |
higher and higher levels of feature representations 00:35:18.000 |
that ultimately will tell you what's happening on the image. 00:35:21.000 |
So that's a very common technique, and much more efficient, obviously, 00:35:25.000 |
than fully connected layers, for instance, that wouldn't work. 00:35:28.000 |
But unfortunately, a lot of the state of the art is actually in 2D convolutions, 00:35:38.000 |
and typically they require a fairly dense input. 00:35:42.000 |
So for imagery it's great, because pixels are very dense. 00:35:46.000 |
You always have a pixel next to the next one. 00:35:50.000 |
If you were, for instance, to do plain convolutions directly on sparse LiDAR points, 00:35:57.000 |
then you would have a lot of holes, and those don't work nearly as well. 00:36:01.000 |
So typically, what we do is to first project sensor data into 2D planes, 00:36:09.000 |
So two very typical views that we use, the first one is top-down, 00:36:14.000 |
so bird views, which is going to give you a Google Maps kind of view of the scene. 00:36:18.000 |
So it's great, for instance, to map cars and objects moving along the scene. 00:36:25.000 |
But it's harder to map imagery, the pixels you saw from the car, onto that top-down view. 00:36:33.000 |
So there's another famous one, a common one, that is the driver view, 00:36:38.000 |
so projection onto the plane from the driver's perspective, 00:36:47.000 |
because essentially that's how imagery got captured. 00:36:52.000 |
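To make the projection idea concrete, here is a minimal NumPy sketch of turning a LiDAR point cloud into a dense, image-like top-down grid that 2D convolutions can consume. The grid extent, cell size, and channels are arbitrary assumptions for illustration, not Waymo's actual pipeline.

```python
import numpy as np

# Toy sketch (not Waymo's pipeline): project a LiDAR point cloud onto a
# top-down ("bird's-eye view") grid so standard 2D convolutions can run on it.
def lidar_to_bev(points, x_range=(0.0, 80.0), y_range=(-40.0, 40.0), cell=0.25):
    """points: (N, 3) array of x, y, z coordinates in the vehicle frame (meters)."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((nx, ny, 2), dtype=np.float32)  # channels: occupancy, max height

    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    for i, j, z in zip(ix[keep], iy[keep], points[keep, 2]):
        bev[i, j, 0] = 1.0                      # this cell is occupied
        bev[i, j, 1] = max(bev[i, j, 1], z)     # tallest return in the cell
    return bev

# Random points stand in for a real scan.
cloud = np.random.uniform([0, -40, 0], [80, 40, 3], size=(10000, 3))
print(lidar_to_bev(cloud).shape)  # (320, 320, 2): a dense, image-like tensor
```

The output behaves like an image, which is exactly what makes the 2D convolution machinery applicable to sparse range data.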
So here, for instance, you're going to see 00:36:57.000 |
how you can use both LiDAR and imagery signals together. 00:37:06.000 |
So the first kind of processing you can do is what is called segmentation. 00:37:22.000 |
that you can then use for better understanding and processing. 00:37:27.000 |
So unfortunately, a lot of the objects you encounter while driving don't have a well-defined shape. 00:37:33.000 |
So here I take the example of snow, but if you think about vegetation, 00:37:37.000 |
or if you think about trash bags, for instance, it's the same thing. 00:37:47.000 |
And so you have to be ready to have any shape of those objects. 00:37:51.000 |
So one of the techniques that works pretty well is to build a pixel-level classifier 00:37:59.000 |
that you're going to slide across the projection of your sensor data. 00:38:07.000 |
So here, for instance, if you have a pixel-accurate snow detector 00:38:14.000 |
then you'll be able to build a representation of those patches of snow. 00:38:21.000 |
So that works pretty well, but as you can imagine, it's a bit like a dot matrix printer, if you've ever seen one. 00:38:35.000 |
The print head had to go "choo-choo" and print the page, point by point. 00:38:40.000 |
So it works pretty well, but it's pretty slow. 00:38:49.000 |
So that works pretty well, but you need to be very conscious 00:38:52.000 |
of which area of the scene you want to apply it to, to stay efficient. 00:38:59.000 |
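Here is a minimal Python sketch of that sliding-window, "dot matrix" style of dense classification. The per-patch classifier is a stand-in threshold rule, not a real snow detector; the point is to show why the approach is accurate per cell but slow, which is why you only run it on selected regions.

```python
import numpy as np

def toy_patch_classifier(patch):
    # Placeholder for a learned per-patch model (e.g. a small conv net).
    return float(patch.mean() > 0.5)

def sliding_window_segment(grid, window=9, stride=1):
    half = window // 2
    out = np.zeros_like(grid, dtype=np.float32)
    # Two nested loops over the grid: per-cell accuracy, but slow, which is
    # exactly why you restrict this to the areas of the scene you care about.
    for i in range(half, grid.shape[0] - half, stride):
        for j in range(half, grid.shape[1] - half, stride):
            patch = grid[i - half:i + half + 1, j - half:j + half + 1]
            out[i, j] = toy_patch_classifier(patch)
    return out

grid = np.random.rand(128, 128)       # stand-in for a projected sensor view
mask = sliding_window_segment(grid)   # per-cell "snow" mask
print(mask.shape, mask.sum())
```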
Fortunately, many of the objects you care about have predefined priors. 00:39:05.000 |
So for instance, if you take a car from the top-down view, 00:39:09.000 |
from the bird's view, it's going to be a rectangle. 00:39:12.000 |
You can take that shape prior into consideration. 00:39:22.000 |
Cars also mostly follow the road: whether they go forward or come the other way, 00:39:25.000 |
they're going to go in the direction of the lanes. 00:39:30.000 |
So you can use those priors to actually do some more efficient deep learning 00:39:36.000 |
that in the literature is conveyed under the ideas of single-shot multi-box, for instance. 00:39:43.000 |
So here, again, you would start with convolution towers, but predict everything in a single shot instead of sliding a window. 00:39:49.000 |
It's the same as the difference between a dot matrix printer and a printing press, right? 00:39:57.000 |
It's only an analogy, but I think that conveys the idea pretty well. 00:40:01.000 |
So here you would train a deep net that would directly take the whole projection of your sensor data 00:40:07.000 |
and output boxes that encode the priors you have. 00:40:13.000 |
So here, for instance, I can show you how such a thing would work for cone detection. 00:40:18.000 |
So you can see that we don't have all the fidelity of the per-pixel cone detection, but we don't need it. 00:40:24.000 |
We just need to know there is a cone somewhere, and we take a box prior. 00:40:28.000 |
And obviously what that image is also meant to show is that 00:40:37.000 |
you can run that over a pretty wide area of the scene. 00:40:40.000 |
And even if you have a lot of them, that still is going to be a very efficient way to get that data. 00:40:50.000 |
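To make the single-shot idea concrete, here is a minimal, hypothetical TensorFlow sketch of such a detection head: a small convolution tower followed by per-cell predictions of class scores and box offsets for a few assumed anchor priors. The input size, anchor count, and classes are illustrative assumptions, not Waymo's actual model.

```python
import tensorflow as tf

NUM_ANCHORS = 3   # box shape priors per feature-map cell (assumed)
NUM_CLASSES = 2   # e.g. cone vs background (assumed)

def single_shot_head(grid_size=128, channels=3):
    # Convolution tower: downsample the projected sensor grid into a coarse feature map.
    inputs = tf.keras.Input(shape=(grid_size, grid_size, channels))
    x = inputs
    for filters in (32, 64, 128):
        x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding="same",
                                   activation="relu")(x)
    # Single shot: every cell predicts class scores and box offsets for each
    # anchor prior, instead of sliding a window pixel by pixel.
    cls = tf.keras.layers.Conv2D(NUM_ANCHORS * NUM_CLASSES, 3, padding="same")(x)
    box = tf.keras.layers.Conv2D(NUM_ANCHORS * 4, 3, padding="same")(x)  # dx, dy, dw, dh
    return tf.keras.Model(inputs, [cls, box])

model = single_shot_head()
model.summary()
```

One forward pass covers the whole projected scene, which is the "printing press" side of the analogy above.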
So we talked about, remember, the flashing lights on top of the police car. 00:40:56.000 |
So even if you properly detect and segment cars, let's say, on the road, that's not enough. 00:41:06.000 |
So here on that slide I'm showing you many examples of EV, emergency vehicles, 00:41:13.000 |
You need to understand, first, that it is an EV, and two, whether the EV is active or not. 00:41:18.000 |
School buses are not really emergency vehicles, but obviously whether the bus has lights on, 00:41:22.000 |
or the bus has a stop sign open on the side, carry heavy semantics that you need to understand. 00:41:33.000 |
One thing you could do is take that patch, build a new convolution tower, 00:41:43.000 |
and essentially build a school bus classifier, a school-bus-with-lights-on classifier, and so on. 00:41:50.000 |
I'm pretty sure that would work pretty well, but obviously that would be a lot of work, 00:41:54.000 |
and pretty expensive to run on the car, because you would need to ... 00:41:58.000 |
And convolution layers typically are the most expensive pieces of a neural net. 00:42:03.000 |
So one better thing to do is to use embeddings. 00:42:08.000 |
So if you're not familiar with it, embeddings essentially are vector representations 00:42:13.000 |
of objects that you can learn with deep nets that will carry some semantic meaning of those objects. 00:42:21.000 |
So for instance, given a vehicle, you can build a vector that's going to carry the information 00:42:29.000 |
that that vehicle is a school bus, whether the lights are on, whether the stop sign is open, 00:42:35.000 |
and then you're back into a vector space, which is much smaller, much more efficient, 00:42:39.000 |
that you can operate in to do further processing. 00:42:44.000 |
So historically, those embeddings have actually been more closely associated with word embeddings. 00:42:49.000 |
So in a typical text, if you were able to build those vectors with words, out of words, 00:42:55.000 |
so out of every word in a piece of text, you build a vector that represents the meaning of that word. 00:43:00.000 |
And then if you look at the sequence of those words and operate in the vector space, 00:43:04.000 |
you start to understand the semantics of those sentences. 00:43:08.000 |
So one of the early projects that you can look at is called Word2Vec, 00:43:13.000 |
which was done in the NLP group at Google, where they were able to build such things. 00:43:19.000 |
And they discovered that that embedding space actually carried some interesting vector space properties, 00:43:25.000 |
such as if you took the vector for king minus the vector for man plus the vector for woman, 00:43:31.000 |
actually you ended up with a vector where the closest word to that vector would be queen, essentially. 00:43:36.000 |
So that's to show you how those vector representations can be very powerful 00:43:41.000 |
in the amount of information they can contain. 00:43:55.000 |
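To make that concrete, here is a toy Python example of the vector arithmetic: hand-made 2-D vectors stand in for learned embeddings, and a cosine-similarity nearest-neighbor lookup recovers "queen" from king - man + woman. Real Word2Vec embeddings are learned from data and live in hundreds of dimensions; this only illustrates the mechanics.

```python
import numpy as np

# Toy embeddings, hand-made so the arithmetic works: axis 0 is "royalty",
# axis 1 is a gender direction. Real embeddings are learned, not designed.
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def nearest(vec, vocab):
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max(vocab, key=lambda w: cosine(vec, vocab[w]))

query = emb["king"] - emb["man"] + emb["woman"]   # = [1, -1]
print(nearest(query, emb))                        # -> "queen"
```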
Remember the ability to go pixel by pixel for things that don't really have a shape, and the box priors for things that do. 00:44:05.000 |
But pedestrians actually combine the complexity of those two approaches for many reasons. 00:44:13.000 |
One is that they obviously are deformable, and pedestrians come with many shapes and poses. 00:44:21.000 |
As you can see here, I think here you have a guy or someone on a skateboard, 00:44:28.000 |
crouching, more unusual poses that you need to understand. 00:44:33.000 |
And the recall you need to have on pedestrians is very high. 00:44:36.000 |
And pedestrians show up in many different situations. 00:44:39.000 |
So for instance here, you have occluded pedestrians that you need to see, 00:44:43.000 |
because there's a good chance when you do your behavior prediction 00:44:46.000 |
that that person here is going to jump out of the car, and you need to be ready for that. 00:44:51.000 |
So last but not least, predicting the behavior of pedestrians is really hard, compared to other agents. 00:45:00.000 |
A car moving in a given direction, you can safely bet it's not going to drastically change its angle at a moment's notice. 00:45:07.000 |
But if you take children, for instance, it's a little more complicated. 00:45:11.000 |
So they may not pay attention, they may jump in any direction, and you need to be ready for that. 00:45:16.000 |
So it's harder in terms of shape priors, it's harder in terms of recall, and it's harder in terms of behavior prediction. 00:46:23.000 |
And you need to have a fine understanding of the semantics to understand that. 00:45:26.000 |
Another example here that we encountered is you get to an intersection, 00:45:32.000 |
and you have a visually impaired person that's jaywalking on the intersection. 00:45:38.000 |
And you obviously need to understand all of that to know that you need to yield to that person, pretty clearly. 00:45:44.000 |
So, a person on the road, maybe you should yield to them. 00:45:50.000 |
Not easy. So for instance here, so there is actually, I don't know if it's a real person or a mannequin or something. 00:45:59.000 |
So, but here we go. Something that frankly really looks like a pedestrian, that you should probably classify as a pedestrian, but who is riding on the back of a truck. 00:46:09.000 |
So, and obviously you shouldn't yield to that person, right, because if you were to, 00:46:15.000 |
and yielding to a pedestrian at 35 miles per hour, for instance, is hitting the brakes pretty hard, right, 00:46:24.000 |
So obviously you need to understand that that person is traveling with a truck, 00:46:30.000 |
and he's not actually on the road, and it's okay to not yield to him. 00:46:36.000 |
So those are examples of the richness of the semantics you need to understand. 00:46:41.000 |
Obviously one way to do that is to start and understand the behavior of things over time. 00:46:47.000 |
Everything we talked about up until now, in how we use deep learning to solve some of those problems, has been about a single snapshot in time. 00:46:54.000 |
But understanding that that person is moving with a truck versus the jaywalker in the middle of the intersection, 00:46:59.000 |
obviously that kind of information you can get to if you observe the behavior over time. 00:47:08.000 |
So if you have vector representations of those objects, you can start and track them over time. 00:47:14.000 |
So a common technique that you can use to get there is to use recurrent neural networks, 00:47:19.000 |
that essentially are networks that will build a state that gets better and better 00:47:24.000 |
as it gets more observations, sequential observations of a real pattern. 00:47:28.000 |
So for instance, coming back to the words example I gave earlier, 00:47:33.000 |
you have one word, you see its vector representation, another word in a sentence, 00:47:38.000 |
so you understand more about what the author is trying to say. 00:47:41.000 |
Third word, fourth word, at the end of the sentence you have a good understanding, 00:47:45.000 |
and you can start to translate, for instance. 00:47:50.000 |
If you have a semantic representation encoded in an embedding for the pedestrian and the vehicle under them, 00:47:57.000 |
and track that over time and build a state that gets more and more meaning as time goes by, 00:48:04.000 |
you're going to get closer and closer to a good understanding of what's going on in the scene. 00:48:09.000 |
So my point here is, those vector representations combined with recurrent neural networks 00:48:15.000 |
is a common technique that can help you figure that out. 00:48:26.000 |
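As an illustration of that idea, here is a minimal TensorFlow sketch: a sequence of per-frame object embeddings goes through an LSTM whose state accumulates evidence over time, then predicts a hypothetical two-way behavior label (say, "moving with a vehicle" vs "crossing on foot"). The embedding size, sequence length, and label are invented for the example; this is the general pattern, not Waymo's tracker.

```python
import tensorflow as tf

EMBED_DIM = 32   # size of the per-frame object embedding (assumed)
SEQ_LEN = 10     # number of consecutive observations of the same object (assumed)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN, EMBED_DIM)),
    tf.keras.layers.LSTM(64),                        # state improves with each observation
    tf.keras.layers.Dense(2, activation="softmax"),  # toy 2-way behavior prediction
])

# Random stand-in for a batch of embedding tracks.
tracks = tf.random.normal((4, SEQ_LEN, EMBED_DIM))
print(model(tracks).shape)   # (4, 2)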
When you're 90% done, you still have 90% to go. 00:48:30.000 |
And so to get to the last leg of my talk here today, 00:48:35.000 |
I want to give you some appreciation for what it takes to truly build a machine learning system at scale 00:48:46.000 |
So up till now we talked a lot about algorithms. 00:48:48.000 |
As I said earlier, algorithms have been a breakthrough, 00:48:51.000 |
and the efficiency of those algorithms has been a breakthrough for us to succeed at the self-driving task. 00:48:57.000 |
But it takes a lot more than algorithms to actually get there. 00:49:04.000 |
The first piece that you need to 10x is around the labeling efforts. 00:49:12.000 |
So a lot of the algorithms we talked about are supervised, 00:49:16.000 |
meaning that even if you have a strong network architecture and you come up with the right one, 00:49:21.000 |
they are supervised in the sense that you need to give, in order to train that network, 00:49:26.000 |
you need to come up with a representative set, a high-quality set of labeled data 00:49:30.000 |
that's going to map some input to predict the outputs you wanted to predict. 00:49:35.000 |
So that's a pedestrian, that's a car. That's a pedestrian, that's a car. 00:49:38.000 |
And the network will learn in a supervised way how to build the right representations. 00:49:45.000 |
So there's a lot there. Obviously the unsupervised space is a very active domain of research. 00:49:52.000 |
Our own research team at Waymo, in collaboration with Google, is active in that domain. 00:50:01.000 |
So to give you orders of magnitude, here I represented on a logarithmic scale the sizes of some labeled data sets. 00:50:09.000 |
So you may be familiar with ImageNet, which I think is in the 15 million labels range. 00:50:16.000 |
That guy jumping represents the number of seconds from birth to college graduation, hopefully coming soon. 00:50:29.000 |
But the first, remember the find the house number, the street number on the facade problem? 00:50:36.000 |
So back in those days, it took us a multi-billion label data set to actually teach the network. 00:50:43.000 |
So those were very early days. Today we do a lot better, obviously. 00:50:50.000 |
So being able to have labeling operations that produce large and high-quality label data sets 00:50:57.000 |
is key for your success. And that's a big piece of the puzzle you need to solve. 00:51:02.000 |
So obviously today we do a lot better. Not only do we require less data, 00:51:07.000 |
but we also can generate those data sets much more efficiently. 00:51:12.000 |
You can use machine learning itself to come up with labels, and use human operators, 00:51:17.000 |
and more importantly use hybrid models where labelers increasingly just fix the discrepancies 00:51:23.000 |
or the mistakes, and don't have to label the whole thing from scratch. 00:51:26.000 |
That's the whole space of active learning and techniques like that. 00:51:30.000 |
So combining those techniques together, obviously you can get to completion faster. 00:51:35.000 |
It's still very common to need samples in the millions range to train a robust solution. 00:51:43.000 |
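As a rough illustration of that hybrid labeling idea, here is a small Python sketch of the loop: the current model proposes a label with a confidence, and only low-confidence samples are routed to a human. `model_predict` and `ask_human` are placeholders invented for the example, not real APIs.

```python
import random

def model_predict(sample):
    # Placeholder: return (label, confidence) from the current trained model.
    return "pedestrian", random.random()

def ask_human(sample):
    # Placeholder: route the sample to a human labeling operator.
    return "pedestrian"

def label_dataset(samples, confidence_threshold=0.9):
    labeled = []
    for sample in samples:
        label, confidence = model_predict(sample)
        if confidence < confidence_threshold:
            label = ask_human(sample)   # humans fix only the uncertain cases
        labeled.append((sample, label))
    return labeled

dataset = label_dataset(range(1000))
print(len(dataset), "samples labeled")
```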
Another piece is around computation, computing power. 00:51:48.000 |
So again, that's kind of a historical tidbit. 00:51:53.000 |
Around the street number models, here is the detection model, and here is the transcriber model. 00:51:59.000 |
So obviously the comparison is only worth what it's worth here. 00:52:04.000 |
But if you look at the number of neurons, or number of connections per neuron, 00:52:08.000 |
which are two important parameters of any neural net, that gives you an idea of scale. 00:52:15.000 |
So obviously it's many orders of magnitude away from what the human brain can do, 00:52:20.000 |
but you start to be competitive, in some cases, with what you see in the mammal space. 00:52:26.000 |
So again, historical data, but the main point here is that you need a lot of computation, 00:52:32.000 |
and you need to have access to a lot of computing, both to train those models and to run inference in real time on the scene. 00:52:43.000 |
And that requires a lot of very robust engineering and infrastructure development to get to those scales. 00:52:53.000 |
But Google is pretty good at that, and obviously we at Waymo have access to the Google infrastructure and tools to essentially get there. 00:53:02.000 |
So, I don't know if you've heard, but the way it's happening at Google is around TensorFlow. 00:53:08.000 |
So maybe you've heard about it as more of a programming language to program machine learning. 00:53:20.000 |
But actually, TensorFlow is also becoming, or is actually, the whole ecosystem that can combine all those pieces together 00:53:29.000 |
and do machine learning at scale at Google and Waymo. 00:53:33.000 |
So as I said, it's a language that allows teams to collaborate and work together. 00:53:40.000 |
It's a data representation in which you can represent your labeled data sets, for instance, or your training batches. 00:53:48.000 |
It's a runtime that you can deploy onto Google data centers, and it's good that we have access to that computing power. 00:54:01.000 |
So back in the early days we had CPUs to run deep learning models at scale, which is less efficient; 00:54:08.000 |
over time GPUs came into the mix, and Google is pretty active in developing a very advanced set of hardware accelerators. 00:54:18.000 |
So you may have heard about TPUs, Tensor Processing Units, 00:54:22.000 |
which are proprietary chipsets that Google deploys in its data centers 00:54:28.000 |
that allow you to train and infer more efficiently those deep learning models. 00:54:32.000 |
And TensorFlow is the glue that allows you to deploy at scale across those pieces. 00:54:43.000 |
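To illustrate those pieces fitting together, here is a tiny, self-contained TensorFlow sketch with synthetic data: a tf.data input pipeline, a small Keras model, training, and inference. It only shows the shape of the ecosystem described above, nothing production-like.

```python
import tensorflow as tf

# Synthetic data standing in for a labeled data set.
images = tf.random.normal((256, 32, 32, 3))
labels = tf.random.uniform((256,), maxval=2, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).shuffle(256).batch(32)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(dataset, epochs=1)       # training step, which the runtime can place on GPUs or TPUs
print(model.predict(images[:4]))   # inference on a few samples
```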
So it's nice. We're smart. We built a smart algorithm. 00:54:48.000 |
We were able to collect enough data to train it. Great! Ship it! 00:54:54.000 |
Well, the self-driving system is pretty sophisticated, and that's a complex system to understand, 00:55:01.000 |
and that's a complex system that requires extensive testing. 00:55:07.000 |
And I think the last leg that you need to cover to do machine learning at scale 00:55:12.000 |
and with a high safety bar is around your testing program. 00:55:17.000 |
So we have three legs that we use to make sure that our machine learning is ready for production. 00:55:26.000 |
One is around real-world driving, another one is around simulation, and the last one is around structured testing. 00:55:33.000 |
In terms of real-world driving, obviously there is no way around it. 00:55:38.000 |
If you want to encounter situations and see and understand how you behave, you need to drive. 00:55:44.000 |
So as you can see, the driving at Waymo has been accelerating over time. 00:55:48.000 |
It's still accelerating. So we crossed 3 million miles driven back in May 2017, 00:55:55.000 |
and only six months later, back in November, we reached 4 million. 00:56:03.000 |
Obviously, not every mile is equal, and what you care about are the miles that carry new situations and important situations. 00:56:10.000 |
So what we do, obviously, is drive in many different situations. 00:56:14.000 |
So those miles got acquired across 20 cities, many weather conditions, and many environments. 00:56:23.000 |
Is 4 million a lot? To give you another rough magnitude, that's 160 times around the globe. 00:56:29.000 |
Even more importantly, it's hard to estimate, but it's probably the equivalent of around 300 years of human driving. 00:56:40.000 |
So in that data set, potentially, you have 300 years of experience that your machine learning can tap into to learn what to do. 00:56:52.000 |
Even more importantly is your ability to simulate. 00:56:58.000 |
Obviously, the software changes regularly. So if for each new revision of the software, you need to go and re-drive 4 million miles, 00:57:06.000 |
it's not really practical, and it's going to take a lot of time. 00:57:09.000 |
So the ability to have a good enough simulation that you can replay all those miles that you've driven 00:57:15.000 |
in any new iteration of the software is key for you to decide if the new version is ready or not. 00:57:20.000 |
Even more important is your ability to make those miles even more efficient and tweak them. 00:57:29.000 |
So here is a screenshot of an internal tool that we call CarCraft, 00:57:34.000 |
that essentially gives us the ability to fuzz or change the parameters of the actual scene we've driven. 00:57:41.000 |
So what if the cars were moving at a slightly different speed? 00:57:44.000 |
What if there was an extra car that was on the scene? 00:57:48.000 |
What if a pedestrian crossed in front of the car? 00:57:51.000 |
So you can use the actual driven miles as a base, and then augment them into new situations 00:57:57.000 |
that you can test your self-driving system against. 00:58:02.000 |
So that's a very powerful way to actually drastically multiply the impact of any mile you drive. 00:58:09.000 |
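To make the fuzzing idea concrete, here is a conceptual Python sketch in the spirit of what was just described, not the actual CarCraft tool: start from a logged scene and generate variants by perturbing agent speeds or injecting a pedestrian. The scene structure and parameters are invented for illustration.

```python
import copy
import random

base_scene = {
    "agents": [
        {"type": "car", "position": (12.0, 0.0), "speed_mps": 8.0},
        {"type": "cyclist", "position": (6.0, 3.5), "speed_mps": 4.0},
    ]
}

def fuzz_scene(scene, n_variants=5, seed=0):
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        variant = copy.deepcopy(scene)
        for agent in variant["agents"]:
            # What if everyone drove a little faster or slower?
            agent["speed_mps"] *= rng.uniform(0.8, 1.2)
        if rng.random() < 0.5:
            # What if a pedestrian crossed in front of the car?
            variant["agents"].append(
                {"type": "pedestrian", "position": (15.0, -1.0), "speed_mps": 1.5})
        variants.append(variant)
    return variants

for v in fuzz_scene(base_scene):
    print(len(v["agents"]), "agents,", round(v["agents"][0]["speed_mps"], 2), "m/s")
```

Each variant can then be replayed against the driving stack, which is how one logged mile turns into many tested situations.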
And simulation is another of those massive-scale projects that you need to cover. 00:58:18.000 |
So using Google's infrastructure, we have the ability to run a virtual fleet of 25,000 cars 24/7 in data centers. 00:58:27.000 |
So those are software stacks that emulate the driving across either raw miles that we've driven 00:58:33.000 |
or modified miles that help us understand the behavior of a software. 00:58:38.000 |
So to give you another magnitude, last year alone we drove 2.5 billion of those miles in data centers. 00:58:47.000 |
So remember, 4 million driven miles total, all the way to 2.5 billion simulated. 00:58:51.000 |
So that's three orders of magnitude of expansion in your ability to truly understand how the system behaves. 00:59:02.000 |
There's a whole tail, or a long tail, of situations that will happen very rarely. 00:59:08.000 |
So the way we decided to tackle those is to set up our own testing facility 00:59:15.000 |
that is a mock of a city and driving situation. 00:59:18.000 |
So we do that in a 90-acre testing facility on the former Air Force Base in central California 00:59:26.000 |
that we set up with traffic lights, railroad crossings, 00:59:31.000 |
I mean, truly trying to reproduce a real-life situation, 00:59:35.000 |
and where we set up very specific scenarios that we haven't necessarily encountered during regular driving 00:59:43.000 |
And then feed back into the simulation, re-augment using the same augmentation strategies, 00:59:48.000 |
and inject into our 2.5 billion miles driven. 00:59:51.000 |
So here I'm going to show you two quick examples of such tests. 00:59:55.000 |
So here, just have a car back up as the self-driving car gets close and see what happens, 01:00:02.000 |
and use all those sensor data to re-inject them into simulation. 01:00:06.000 |
Another example is going to be around people dropping boxes. 01:00:11.000 |
So remember, try to imagine the kind of understanding and segmentation you need to do 01:00:18.000 |
to understand what's happening there, and the semantic understanding you need to have. 01:00:25.000 |
Also note the car that has been put on the other side, 01:00:27.000 |
so that swerving is not an option, right, without hitting that car. 01:00:32.000 |
So we drive complex situations that go from perception to motion planning, the whole stack, 01:00:36.000 |
and make sure that we are reliable, even in those long-tail examples. 01:00:50.000 |
Actually, we still have a lot of very interesting work coming, 01:00:54.000 |
so I don't have much time to go into too many of those details, 01:00:56.000 |
but I'm just going to give you two big directions. 01:00:59.000 |
The first one is around growing what we call our ODD, 01:01:03.000 |
our operational design domain. 01:01:09.000 |
So extending our fleet of self-driving cars, not only geographically, 01:01:15.000 |
so geographically meaning deploying into urban cores. 01:01:27.000 |
For instance, we announced that we're going to grow our testing in San Francisco, 01:01:33.000 |
with way more cars that bring urban environments, slopes, fog, as I said. 01:01:38.000 |
And so that's obviously a very, very important direction that we need to go into, 01:01:44.000 |
and where machine learning is going to keep playing a very important role. 01:01:48.000 |
Another area is around semantic understanding. 01:01:52.000 |
So in case you haven't noticed yet, I am from France. 01:01:57.000 |
That's a famous roundabout in Paris, Place de l'Etoile, 01:02:03.000 |
which seems pretty chaotic, but I've driven it many times without any issues, touching wood. 01:02:11.000 |
But I know that it took a lot of semantics and understanding for me to do it safely. 01:02:17.000 |
I had a lot of expectations on what people do, 01:02:21.000 |
a lot of communication, visual, gestures, to essentially get through that thing safely. 01:02:28.000 |
And those require a lot of deeper semantic understanding of the scene around you. 01:02:44.000 |
I hope I covered many of those, or at least gave you directions for further reading and investigation. 01:02:54.000 |
Remember my three objectives: the first one was around context, context of the space, context of the history at Google and Waymo, 01:03:00.000 |
and how deep the roots are on the way back in time. 01:03:07.000 |
My second objective was to tie some of the technical, algorithmic solutions 01:03:13.000 |
that you may have talked about during this class into the practical cases we need to solve. 01:03:20.000 |
And last but not least, to really emphasize the scale and the engineering infrastructure work 01:03:27.000 |
that needs to happen to really bring such a project to fruition in a production system. 01:03:37.000 |
That's a scene with kids jumping on bags, playing frogger across the street. 01:03:44.000 |
And I think we have time for a few questions. 01:03:55.000 |
I was wondering, you showed your Carcraft simulation a little bit. 01:03:58.000 |
So from a robotics background, usually systems tend to fail at the intersection between perception and planning. 01:04:03.000 |
So your planner might assume something about a perfect world that perception cannot deliver. 01:04:07.000 |
So I was wondering if you use the simulation environment also to induce these perception failures, 01:04:12.000 |
or whether that's really specific for scenario testing, 01:04:15.000 |
and whether you have other validation arguments for the perception side. 01:04:22.000 |
So one thing I didn't mention is that the simulator obviously enables you to simulate many different layers in a stack. 01:04:29.000 |
And one of the hard-core engineering problems is to actually properly design your stack 01:04:34.000 |
so that you can isolate and test independently. 01:04:36.000 |
Like any robust piece of software, you need to have good APIs and layers. 01:04:41.000 |
So we have such a layer in our system between perception and planning. 01:04:47.000 |
And the way you would test perception is more by measuring the performance of your perception system, 01:04:56.000 |
and then using and tweaking the output of the perception system, including its mistakes. 01:05:02.000 |
So it's about having a good understanding of the mistakes it makes, 01:05:04.000 |
and reproducing those mistakes realistically in the new scenarios you would come up with 01:05:08.000 |
as part of your simulator, to realistically test the planning side of the house. 01:05:16.000 |
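(A minimal sketch of testing the planner through the perception interface with realistic perception mistakes injected; the object-list format, noise model, and `planner` callable are hypothetical, not the actual stack.)

```python
import random

def perturb_perception(objects, rng, drop_rate=0.02, pos_sigma_m=0.3):
    """Inject plausible perception mistakes (missed detections, position noise)
    into simulated ground-truth objects before handing them to the planner."""
    noisy = []
    for obj in objects:
        if rng.random() < drop_rate:  # occasional missed detection
            continue
        jittered = dict(obj)
        jittered["x"] = obj["x"] + rng.gauss(0.0, pos_sigma_m)
        jittered["y"] = obj["y"] + rng.gauss(0.0, pos_sigma_m)
        noisy.append(jittered)
    return noisy

def run_planning_test(planner, simulated_objects, seed=0):
    """Exercise the planner behind the same interface real perception uses,
    but with degraded inputs that mimic known perception error modes."""
    rng = random.Random(seed)
    return planner(perturb_perception(simulated_objects, rng))
```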
You talked about the car as being a complex system, 01:05:20.000 |
and it has to be an industrial product that is being conceived at scale and produced at scale. 01:05:25.000 |
Do you have a systematic way of creating the architectures of the embedded system? 01:05:31.000 |
You have so many choices for sensors, algorithms, 01:05:34.000 |
each problem you showed has many different solutions. 01:05:38.000 |
That's going to create different interfaces between each element. 01:05:41.000 |
So how do you choose which architecture you put in a car? 01:05:49.000 |
So there's a combination of different things. 01:05:52.000 |
So the first thing, obviously, that I didn't talk too much about here, 01:05:55.000 |
but is around the vast amount of research that we do at Waymo, 01:06:00.000 |
but also we do in collaboration with Google Teams, 01:06:04.000 |
to actually understand even what building blocks we have at our disposal 01:06:10.000 |
to even play with and come up with those production systems. 01:06:15.000 |
The other piece is obviously the one you decide to take all the way to production. 01:06:23.000 |
So there are two big elements here, I would say. 01:06:23.000 |
The main element, frankly, is your ability to carry out that search, 01:06:26.000 |
and that search actually takes a lot of people. 01:06:31.000 |
So something I try to say is that 01:06:38.000 |
part of the second 90% is your ability to grow your team 01:06:43.000 |
and essentially grow the number of people who will be able 01:06:51.000 |
to productively participate in your engineering project. 01:06:55.000 |
And that's where the robustness we need to bring 01:06:58.000 |
into our development environment, our testing, 01:07:03.000 |
is really key to be able to grow that team at a bigger scale 01:07:08.000 |
and essentially explore all those paths and come up with the best one. 01:07:11.000 |
And at the end of the day, the robustness of testing is the judge. 01:07:16.000 |
That's what tells you whether an approach works or not. 01:07:25.000 |
So the car is making a decision at every single time step, 01:07:25.000 |
And part of the reason why you have this simulation 01:07:33.000 |
is so that you can test those decisions in every possible scenario. 01:07:38.000 |
So once self-driving cars become production-ready and out on the streets, 01:07:43.000 |
do you expect that decisions will be made based on prior understanding? 01:07:43.000 |
Or can the car make a new decision in real time 01:07:51.000 |
based on its scene understanding and everything around it? 01:07:55.000 |
So the goal of the system is not to build a library of events 01:08:03.000 |
that you reproduce one by one and make sure that you encode a response to. 01:08:09.000 |
The analogy in machine learning would be overfitting. 01:08:17.000 |
It's like if you encountered five situations, 01:08:20.000 |
I'm pretty sure you can hard-code the perfect thing you need to do. 01:08:25.000 |
But for the sixth one that happens, if you don't generalize, you're stuck. 01:08:30.000 |
So really the complexity of what you need to do 01:08:34.000 |
is to extract the core principles that make you drive safely, 01:08:40.000 |
and have the algorithms learn those principles. 01:08:47.000 |
Because as you said, the parameter space of a real scene is infinite. 01:08:53.000 |
So we try to fuzz that a little bit with a simulator. 01:08:58.000 |
What if the cars went a little faster, a little slower? 01:09:00.000 |
But the goal is not to enumerate all possibilities; 01:09:13.000 |
the goal is for the car to behave properly and generalize. 01:09:27.000 |
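(To make the generalization point concrete, here is a minimal sketch of checking a driving policy on held-out scenarios it was never tuned on, rather than enumerating every case; the `policy` interface and safety metric are hypothetical.)

```python
def pass_rate(policy, scenarios):
    """Fraction of scenarios the policy completes without a safety violation."""
    return sum(1 for s in scenarios if policy(s).is_safe) / len(scenarios)

def generalization_gap(policy, tuned_scenarios, held_out_scenarios):
    """A policy hard-coded to the tuned scenarios scores well on the first set
    and poorly on the held-out set; a policy that has learned the underlying
    driving principles should score comparably on both."""
    return pass_rate(policy, tuned_scenarios) - pass_rate(policy, held_out_scenarios)
```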
You mentioned the difficulty of identifying snow 01:09:30.000 |
because it could come in many different shapes. 01:09:33.000 |
One of the things that I immediately thought of was 01:09:49.000 |
to create a much wider array of object embeddings. 01:09:58.000 |
Many different types of snow could actually have their own embeddings, 01:10:03.000 |
for instance in the case of a really heavy blizzard. 01:10:19.000 |
But maybe something I'd like to emphasize a little more 01:10:26.000 |
is that we have to walk the line between what's algorithmically possible 01:10:30.000 |
and what's computationally feasible in the car. 01:10:41.000 |
So, even if we had the processing power to process every point 01:10:47.000 |
to a large level of understanding, 01:11:00.000 |
it wouldn't make sense, for instance, 01:11:02.000 |
to have a behavior prediction on every snowflake 01:11:05.000 |
or on every thing you see on the side of the road, right? 01:11:10.000 |
You need to group what you see into semantic objects 01:11:15.000 |
that are likely to exhibit a behavior as a whole. 01:11:34.000 |
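(To illustrate the grouping idea, here is a toy sketch that clusters raw 2D points into candidate objects before any per-object reasoning; the greedy distance threshold and point format are made up for illustration.)

```python
import math

def cluster_points(points, max_gap_m=0.5):
    """Greedy distance-based grouping of 2D points into candidate objects.
    Downstream models (classification, behavior prediction) then run once per
    cluster rather than once per raw point."""
    clusters = []
    for p in points:
        for cluster in clusters:
            if any(math.dist(p, q) <= max_gap_m for q in cluster):
                cluster.append(p)
                break
        else:  # no existing cluster was close enough: start a new object
            clusters.append([p])
    return clusters

# Example: five lidar-like returns collapse into three candidate objects.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (9.0, 1.0)]
print(len(cluster_points(points)))  # 3
```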
So, if you're using perception for your scene understanding, 01:11:37.000 |
are you worried about adversarial examples? 01:11:42.000 |
Or do you not believe that this is a real-world attack 01:11:45.000 |
that could be used against perception-based systems? 01:12:01.000 |
So I think a prime example of that, which is not adversarial, is reflections. 01:12:06.000 |
It's like, yeah, you could as well have put a sticker 01:12:08.000 |
on the car, on the bus, and say, "Ah, you're confused." 01:12:12.000 |
But you don't need to put a sticker on the bus. 01:12:14.000 |
Real life already brings a lot of those examples. 01:12:21.000 |
The first one is to have sensors that complement each other. 01:12:29.000 |
Really, different sensors or different systems have different failure modes, 01:12:35.000 |
and so they're going to complement each other. 01:12:37.000 |
And that's a very important piece of redundancy. 01:12:41.000 |
The other one, even in the reflection case, is cross-checking. 01:12:57.000 |
The same way you know that a car reflecting in a bus is not a real car, 01:13:07.000 |
that kind of cross-checking is what is going to tell you what is true and what is not, 01:13:11.000 |
or what is a mistake, an error in your stack. 01:13:17.000 |
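(A minimal sketch of cross-checking complementary sensors in the spirit described here; the detection format and agreement threshold are illustrative assumptions.)

```python
import math

def cross_validate(camera_dets, lidar_dets, max_dist_m=1.0):
    """Trust a camera detection only if a lidar detection agrees on its position.
    A 'car' seen as a reflection on the side of a bus has no matching lidar
    return at that location, so it is flagged as suspect rather than trusted."""
    confirmed, suspect = [], []
    for cam in camera_dets:
        match = any(
            math.dist((cam["x"], cam["y"]), (lid["x"], lid["y"])) <= max_dist_m
            for lid in lidar_dets
        )
        (confirmed if match else suspect).append(cam)
    return confirmed, suspect
```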
We'd like to thank you very much, Sacha Arnoud.