Back to Index

Sacha Arnoud, Director of Engineering, Waymo - MIT Self-Driving Cars


Chapters

0:00
2:50 self-driving cars?
13:27 The rise of Deep Learning
30:57 Deep learning techniques
32:31 Reflections
37:05 Semantic segmentation example: snow
38:58 End-to-end architecture predicting from raw data
40:15 Computer vision based single shot with cones
40:49 Embeddings for finer classification
43:47 Pedestrians
49:03 Dataset size and quality circa 2013
52:52 TensorFlow: a new programming paradigm
58:12 Large scale simulation

Transcript

Today we have the Director of Engineering, Head of Perception at Waymo, a company that's recently driven over 4 million miles autonomously, and in so doing inspired the world in what artificial intelligence and good engineering can do. So please give a warm welcome to Sacha Arnoud. Thanks a lot, Lex, for the introduction.

Wow, it's a pretty packed house. Thanks a lot. I'm really excited. Thanks a lot for giving me the opportunity to be able to come and share my passion with self-driving cars and be able to share with you all the great work we've been doing at Waymo over the last 10 years and give you more details on the recent milestones we've reached.

So as you'll see, we'll cover a lot of different topics, some more technical, some more about context. But over the content, I have three main objectives that I'd like to convey today. So keep that in mind as we go through the presentation. My first one is to give you some background around the self-driving space and what's happening there and what it takes to build self-driving cars, but also give you some behind-the-scenes views and tidbits on the history of machine learning, deep learning, and how it all came together within the big alphabet family from Google to Waymo.

Another piece, obviously, another objective I have is to give you some technical meat around the techniques that are working today on our self-driving cars. So during the class you've heard a lot about different deep learning techniques, models, architectures, algorithms, and I'll try to put that into a coherent whole so that you can see how those pieces fit together to build the system we have today.

And last but not least, I think as Lex mentioned, it takes a lot more, actually, than algorithms to build a sophisticated system such as our self-driving cars. And fundamentally, it takes a full industrial project to make that happen. And I'll try to give you some color, which hopefully is a little different from what you've heard during the week.

I'll try to give you some color on what it takes to actually carry out such an industrial project in real life and essentially productionize machine learning. So we hear a lot of talk about self-driving cars. It's a very hot topic, and for very good reasons.

I can tell you for sure that 2017 has been a great year for Waymo. Actually, only a year ago, in January 2017, Waymo became its own company. So that was a major milestone and a testimony to the robustness of the solution so that we could move to a productization phase.

So what you see on the picture here is our latest generation self-driving vehicle. So it is based on the Chrysler Pacifica. You can already see a bunch of sensors. I'll come back to that and give you more insights on what they do and how they operate. But that's the latest and greatest.

So self-driving indeed draws a lot of attention, and for very good reasons. I personally believe, and I think you will agree with me, that self-driving really has the potential to deeply change the way we think about mobility and the way we move people and things around. So I'll only cover a few aspects here, and I don't want to go into too many details, but safety is one of the main motivations.

94% of US crashes today involve human errors. A lot of those errors are around distraction and things that could be avoided. So safety is a big piece of it. Accessibility and access to mobility is also a big motivation of ours. So obviously, the self-driving technology has the potential to make it very available and cheaper for more people to be able to move around.

And last but not least is efficiency, collective efficiency. Not only do we spend a lot of time in our cars, in long commute hours. I personally spend a lot of time in long commute hours. And that time we spend in traffic probably could be better spent doing something else than having to drive the car in complicated situations.

Beyond traffic, obviously, self-driving technology has the potential to deeply change the way we think about traffic, parking spots, urban environments, city design. So that's why it's a very exciting topic. So that's why we made it our mission at Waymo, is fundamentally to make it safe and easy to move people and things around.

So that's a nice mission, and we've been on it for a very long time. So actually, the whole adventure started close to 10 years ago in 2009. And at the time, that started under the umbrella of a Google project that you may have heard of called Chauffeur. And back in those days, so remember, we were before the deep learning days, at least in the industry.

And so really back in those days, the first objective of the project was to try and assemble a first prototype vehicle, take off-the-shelf sensors, assemble them together, and try to go and decide if self-driving is even a possibility. It's one thing to have some prototype somewhere, but is that even a thing that is worth pursuing?

Which is a very common way for Google to tackle problems. So the genesis for that work was to come up with a pretty aggressive objective. The first milestone for the team was to essentially assemble ten 100-mile loops in Northern California, around Mountain View, for a total of 1,000 miles, and try and see if they could build a first system that would be able to go and drive those loops autonomously.

So the team was not afraid. Those loops went through some very aggressive patterns. You see that some of those loops go through the Santa Cruz Mountains, which is an area in California that, as I'll show you in a video, has very small roads, two-way traffic, cliffs with negative obstacles, and complicated patterns like that.

Some of those paths were going on highways, including some of the busiest highways. Some of those routes were going around Lake Tahoe, which is in the Sierras in California, where you can encounter different kinds of weather, and again, different kinds of road conditions. Those routes were going over bridges, and the Bay Area has quite a few bridges to go through.

Some of them were even going through a dense urban area. So you can see San Francisco being driven. You can see Monterey, some of the Monterey centers being driven. And as you'll see on the video, those truly bring dense urban area challenges. So since I promised it, so here you're going to see some pictures of the driving.

It's kind of working. So here, with better quality, so here you see the roads I was talking about on the Santa Cruz Mountains, driving in the night, animals crossing the street, freeway driving, going through pay tolls. That's the Monterey area that is fairly dense. There's an aquarium there, a pretty popular one.

That's the famous Lombard Street in San Francisco that you may have heard of, which in San Francisco always brings its unique set of challenges between fog and slopes, and in that case, even sharp turns. So that was all the way back in 2010. So those 10 loops were successfully completed 100% autonomously back in 2010.

So that's more than eight years ago. So on the heels of that success, the team decided, and Google decided, that self-driving was worth pursuing, and moved forward with the development of the technology and testing. So we've been at it for all those years, and have been working very hard on it.

Historically, Waymo and I think all the other companies out there have been relying on what we call safety drivers to still sit behind the wheel, even if the car is driving autonomously. We still have a safety driver who is able to take over at any time and make sure that we have very safe operations.

And we've been accumulating miles and knowledge and developing the system, many iterations of the system, along all those years. We reached a major milestone, as Lex mentioned, back in November, where for the first time we reached a level of confidence and maturity in a system that we felt confident and proved to ourselves that it was safe to remove the safety driver.

As you can imagine, that's a major milestone, because it takes a very high level of confidence to not have that backup solution of a safety driver to take over were something to arise. So here I'm going to show you a small video, a quick capture of that event. So the video is from one of the first times we did that.

Since then we've been continuously operating driverless cars, self-driving cars, in the Phoenix area in Arizona to expand our testing. So here you can see our Chrysler Pacifica. So here we have members of the team who are acting as passengers, getting into the back seat. You can notice that there is no driver in the driver's seat.

So here we are running a car-hailing kind of service. So the passengers simply press a button, the application knows where they want to go, and the car goes. No one in the driver's seat. So we started with a fairly constrained geographical area in Chandler, close to Phoenix, Arizona. And we have been working hard to expand the testing and the scope of our operating area since then.

So that goes well beyond a single car, a single day. Not only do we do that continuously, but we also have a growing fleet of self-driving cars that we are deploying there as we work toward a product launch pretty soon. So I talked about 2010, and we are in 2018, and we are getting there.

But it took quite a bit of time. So I think one of the key ideas that I'd like to convey here today, and that I will go back to during the presentation, is how much work it takes to really take a demo or something that's working in the lab into something that you feel safe to put on the roads, and get all the way to that depth of understanding, that depth of perfection in your technology, that you operate safely.

So one way to say that is that when you're 90% done, you still have 90% to go. The first 90% of the technology takes only 10% of the time. In other words, you need to 10x. You need to 10x the capabilities of your technology. You need to 10x your team size and find ways for more engineers and more researchers to collaborate together.

You need to 10x the capabilities of your sensors. You need to 10x fundamentally the overall quality of the system, and your testing practices, as we'll see, and a lot of the aspects of the program. And that's what we've been working on. So, beyond the context of self-driving cars, I want to spend a little bit of time to give you kind of an insider view of the rise of deep learning.

So remember I mentioned that back in 2009, 2010, deep learning was not really available yet in full capacity in the industry. And so over those years, actually, it took a lot of breakthroughs to be able to reach that stage. And one of them was the algorithm breakthrough that deep learning gave us.

And I'll give you a little bit of a backstage view on what happened at Google during those years. So as you know, Google committed itself to machine learning and deep learning very early on. You may have heard of what we call internally the Google Brain team, which is a team fundamentally hard at work on the bleeding edge of research, which is well known, but also leading the development of the tools and infrastructure of the whole machine learning ecosystem at Google and Waymo, to essentially allow many teams to develop machine learning at scale, all the way to successful products.

So they've been pushing the deep learning technology and the field in many directions, from computer vision to speech understanding to NLP, and all those directions are things that you can see in Google products today. So whether you're talking Google Assistant or Google Photos, speech recognition, or even Google Maps, you can see the impact of deep learning in all those areas.

And actually, many years ago, I myself was part of the Street View team, and I was leading an internal program, an internal project that we called Street Smart. And the goal we had at Street Smart was to use deep learning and machine learning techniques to go and analyze street view imagery, and as you know, that's a fairly big and varied corpus, so that we could extract elements that are core to our mapping strategy, and that way build better Google Maps.

So for instance, in that picture, that's a piece of a panorama from street view imagery, and you can see that there are a lot of pieces in there that, if you could find and properly localize them, would drastically help you build better maps. Street numbers, obviously, that are really useful to map addresses; street names that, when combined using similar techniques across multiple views, will help you properly draw all the roads and give a name to them.

And those two combined actually allow you to do very high-quality address lookups, which is a common query on Google Maps. Then there's text in general, and more specifically text on business facades, that allows you not only to localize business listings that you may have gotten by other means to actual physical locations, but also to build some of those local listings directly from scratch.

And more traffic-oriented patterns, whether it's traffic lights or traffic signs, that can then be used for navigation and ETA predictions, and things like that. So that was our mission. As I mentioned, one of the hard pieces is to map addresses at scale. And so you can imagine that we had a breakthrough when we first were able to properly find those street numbers in the street view imagery and on the facades.

Solving that problem actually requires a lot of pieces. Not only do you need to find where the street number is on the facade, which is, if you think about it, a fairly hard semantic problem. What's the difference between a street number versus another kind of number versus other text? But then you obviously need to read it, because there's no point having pixels if you cannot understand the number that's on the facade, all the way to properly geolocalizing it, so that you can then put it on Google Maps.

And so that first deep learning application that succeeded in production, all the way back in 2012 when we had the first system in production, was really the first breakthrough that we had across Alphabet in our ability to properly understand real-scene situations. So here I'm going to show you a video that kind of sums it up.

So look, every one of those segments is actually a view, starting from the car, going to the physical location of all those house numbers that we've been able to detect and transcribe. So here that's in São Paulo, where you can see that when all that data is put together, it gives you a very consistent view of the addressing scheme.

So that's another example of similar things. That's in Paris, where we have even more imagery, so more views of those physical numbers, and if you're able to triangulate them, you're able to localize them very accurately and have very accurate maps. So the last example I'm going to show is in Cape Town in South Africa, where again, the impact of that deep learning work has been huge in terms of quality.

Many countries today actually have more than 95% of their addresses mapped that way. So, doing similar things: obviously you can see a lot of parallels between that work on street view imagery and doing the same on the real scene from the car. But obviously doing that on the car is even harder.

It's even harder because you need to do that real-time and very quickly, with low latency. And you also need to do that in an embedded system. So the cars have to be entirely autonomous. You cannot rely on a connection to a Google data center, and first you don't have the time in terms of latency to bring data back and forth.

But also you cannot rely on a connection for the safe operation of your system. So you need to do the processing within the car. So that's a paper that you can read that dates all the way back to 2014, where for the first time, by using slightly different techniques, we were able to put deep learning at work inside that constrained real-time environment, and start to have impact, and in that case, around pedestrian detection.

So as I said, there are a lot of analogies. You can see that to properly drive that scene, like street view, you need to see the traffic light. You need to understand if the light is red or green. And that's what essentially will allow you to do that processing.

Obviously driving is even more challenging beyond the real-time. I don't know if you saw the cyclist going through. So we have real stuff happening on the scene that you need to detect and properly understand, interpret, and predict. And at the same time, here I explicitly took a night driving example to show you that while you can choose when you take pictures of street view and do it in daytime and in perfect conditions, driving requires you to take the conditions as they are, and you have to deal with it.

So from the very beginning there has been a lot of cross-pollination between the street view work and the real-scene work. Here I took a few papers that we did in street view that, if you read them, you'll see directly apply to some of the stuff we do on the cars.

But obviously that collaboration between Google Research and Waymo historically went well beyond street view only and across all the research groups. And that still is a very strong collaboration going on that enables us to stay on the bleeding edge of what we can do. So now that we looked a little bit at how things happened, I want to spend more time and go into more of the details of what's going on in the cars today, and how deep learning is actually impacting our current system.

So, if I followed the course schedule properly, I think during the week you went through the major pieces that you need to master to make a self-driving car. So I'm sure you heard about mapping; localization, so putting the car within those maps and understanding where you are with pretty good accuracy; perception; and scene understanding, which is a higher-level semantic understanding of what's going on in the scene, starting to predict what the agents are going to do around you so that you can do better motion planning.

Obviously there's a whole robotics aspect at the end of the day. The car in many ways acts like a robot, whether it's around the sensor data or even the control interfaces to the car. And for everyone who has dealt with hardware and robotics, you will agree with me that it's not a perfect world, and you need to deal with those errors.

Other pieces that you may have talked about is around simulation and essentially validation of whatever system you put together. So obviously machine learning and deep learning have been having a deep impact in a growing set of those areas, but for the next minutes here I'm going to focus more on the perception piece, which is a core element of what a self-driving car needs to do.

So what is perception? Fundamentally, perception is the system in the car that needs to build an understanding of the world around it. And it does that using two major inputs. The first one is priors on the scene. So to give you an example, it would be a little silly to have to recompute the actual location of the road, or the actual connectivity of every intersection, once you get on the scene, because those things you can pre-compute.

You can pre-compute in advance and save your on-board computing for other tasks that are more critical. So really, that's often referred to as the mapping exercise, but really it's about reducing the computation you're going to have to do on the car once it drives. The other big input, obviously, is what sensors are going to give you once you get on the spot.

So sensor data is the signal that's going to tell you what differs from what you mapped, and the live things: is the traffic light red or green? Where are the pedestrians? Where are the cars? What are they doing? So as we saw on the initial picture, we have quite a set of sensors on our self-driving cars.

Vision systems, radar, and lidar are the three big families of sensors we have. One point to note here is that they are designed to be complementary. They are designed to be complementary first in their placement on the car, so we don't put them in the same spot, because obviously blind spots are major issues, and you want to have good coverage of the field of view.

The other piece is that they are complementary in their capabilities. So for instance, to give you an example, cameras are going to be very good to give you a dense representation. It's a very dense set of information. It contains a lot of semantic information. You can really see a large number of details, but for instance, they are not really good to give you depth.

It's much harder, and computationally expensive, to get depth information out of camera systems. So systems like LIDAR, for instance, when you hit objects, will give you a very good depth estimation, but obviously they're going to lack a lot of the semantic information that you will find on camera systems.

So all those sensors are designed to be complementary in terms of their capabilities. It goes without saying that the better your sensors are, the better your perception system is going to be. So that's why at Waymo we took the path of designing our own sensors in-house and enhancing what's available off the shelf today, because it's important for us to go all the way to be able to build a self-driving system that we could believe in.

And so that's what perception does. It takes those two inputs and builds a representation of the scene. So at the end of the day, you have to realize that the nature of that perception work is really what deeply differentiates what you need to do in a self-driving system, as opposed to a lower-level driver-assistance system.

In many cases, for instance, if you do cruise control, or if you do a lot of lower-level driver assistance, a lot of the strategies can be around not bumping into things. If you see things moving around, you group them, you segment them appropriately into blocks of moving things, and you don't hit them, and you're good enough in most cases.

When you don't have a driver on the driver's seat, obviously the challenge totally changes scale. So to give you an example, for instance, if you're on a lane and you see a bicyclist going more slowly on the lane right of you, and there's a car next to you, you need to understand that there's a chance that that car is going to want to avoid that bicyclist, it's going to swerve, and you need to anticipate that behavior so that you can properly decide whether you want to slow down, give space for the car, or speed up and have the car go behind you.

Those are the kinds of behaviors that go well beyond not bumping into things, and that require a much deeper understanding of the world that's going on around you. So let me put it in picture, and we'll come back to that example in a couple of cases. So here is a typical scene that we encountered, at least.

So here, obviously, you have a police car pulled over, probably pulled over someone there. You have a cyclist on the road moving forward, and we need to drive through that situation. So the first thing you can do, you have to do, obviously, is the basics. So out of your sensor data, understand that a set of point clouds and pixels belong to the cyclist.

Find that you have two cars on the scene, the police car and the car parked in front of it. Understand the policeman as a pedestrian. So that's a basic level of understanding. Obviously, you need a little more than that. You need to go deeper in your semantics. If you understand that the flashing lights are on, you understand that the police car is an active emergency vehicle, and is performing something on the scene.

If you understand that this car is parked, obviously that's a valuable piece of information that's going to tell you whether you can pass it or not. Something you may not have noticed is that there are also cones. So there are cones here on the scene that would prevent you, for instance, from going and driving that pathway if you wanted to.

Next level of getting closer to behavior prediction. Obviously, if you also understand that actually the police car has an open door, then all of a sudden you can start to expect a behavior where someone is going to get out of that car. And the way you swerve, even if you were to decide to swerve, or the way someone getting out of that car would impact the trajectory of the cyclist, is something you need to understand in order to properly and safely drive.

And only then, only when you have that depth of understanding, you can start to come up with realistic behavior predictions and trajectory predictions for all those agents on the scene, and you can come up with a proper strategy for your planning control. So how is deep learning playing into that whole space?

And how is deep learning being used to solve many of those problems? So remember when I said that when you're 90% done, you still have 90% to go? This is where that starts to bite. I also talked about how robotics and having sensors in real life is not a perfect world.

So actually it is a big piece of the puzzle. I wish sensors would give us perfect data all the time, and would give us a perfect picture that we could reliably use to do deep learning. But unfortunately, that's not how it works. So here, for instance, you see an example where you have a pickup truck.

The imagery doesn't show it, but you have smoke coming out of the exhaust, and that exhaust smoke is triggering laser points. It's not very relevant for any behavior prediction or for your driving behavior, and it's safe to go and drive through it. So those points are very safe to ignore in terms of scene understanding.

So filtering the whole mass of data coming off your sensors is a very important task, because it reduces the computation you're going to have to do, but it's also key to operating safely. A more subtle issue, but an important one, is reflections. So here we are driving a scene.

There's a car here. On the camera picture, the car is reflected in a bus. And if you just do a naïve detection, especially if the bus moves along with you, which is very typical, and everything moves, then all of a sudden you're going to have two cars on the scene.

And if you take that car too seriously, all the way to impacting your behavior, obviously you're going to make mistakes. So here I showed you an example of reflections on the visual range, but obviously that affects all sensors in slightly different manners. But you could have the same effect, for instance, with LiDAR data, where, for instance, you drive a freeway, and you have a road sign on top of the freeway that will reflect in the back window of the car in front of you, and then showing a reflected sign on the road.

You'd better understand that the thing you see on the road is actually a reflection, and not try to swerve around it at 65 miles per hour. So that's a big, complicated challenge. But assume we are able to get to proper sensor data that we can start to process with our machine learning.

By the way, for a lot of the signal processing pieces we already use machine learning and deep learning, because, as you can see in the reflection case, you can do some tricks to understand the difference in the signal, but at the end of the day, for some of them, you're going to need a higher level of understanding of the scene, and realize that it's not possible for a car to be hiding behind the bus, given your field of view, for instance.

But assuming you have good, filtered sensor data, the very next thing you're typically going to want to do is apply some kind of convolution layers on top of that imagery. If you're not familiar with convolution layers, they're a very popular way to do computer vision, because they rely on connecting neurons with kernels that are going to learn, layer after layer, features of the imagery.

So those kernels typically work locally on the sub-region of the image, and they're going to understand lines, they're going to understand contours, and as you build up layers, they're going to understand higher and higher levels of feature representations that ultimately will tell you what's happening on the image. So that's a very common technique, and much more efficient, obviously, than fully connected layers, for instance, that wouldn't work.
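
To make that kernel idea concrete, here is a minimal sketch in plain NumPy, not Waymo code: a single hand-written 3x3 kernel slid across a small image. In a real convolutional network, many such kernels are learned and stacked layer after layer.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a single kernel over a 2D image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value looks only at a local patch of the input.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-written vertical-edge kernel; in a deep net these weights are learned.
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])

image = np.zeros((8, 8))
image[:, 4:] = 1.0            # right half bright, left half dark
features = conv2d(image, edge_kernel)
print(features)               # large-magnitude responses along the vertical edge
```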

But unfortunately, a lot of the state-of-the-art is actually in 2D convolutions. So again, they've been developed on imagery, and typically they require a fairly dense input. So for imagery it's great, because pixels are very dense. You always have a pixel next to the next one. There's not a lot of void.

If you were, for instance, to do plain convolutions on a very sparse laser point cloud, then you would have a lot of holes, and those don't work nearly as well. So typically, what we do is to first project sensor data into 2D planes, and do processing on those.

So there are two very typical views that we use. The first one is top-down, or bird's-eye view, which is going to give you a Google Maps kind of view of the scene. So it's great, for instance, to map cars and objects moving along the scene. But it's harder to put imagery, the pixels you saw from the car, into those top-down views.

So there's another famous one, a common one, that is the driver view, a projection onto the plane from the driver's perspective, which is much better at utilizing imagery, because essentially that's how the imagery got captured. We didn't use drones. So here, for instance, you're going to see how, if your sensors are properly registered, you can use both lidar and imagery signals together to better understand the scene.
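
Here is a minimal, purely illustrative sketch of that projection step, assuming a simple occupancy-plus-height encoding; the grid extent, cell size, and channels below are made up for the example, not Waymo's. Sparse lidar points are binned into a dense top-down grid that ordinary 2D convolutions can then consume.

```python
import numpy as np

def points_to_bev(points, x_range=(0., 60.), y_range=(-30., 30.), cell=0.25):
    """Bin (x, y, z) lidar points into a top-down occupancy/height grid."""
    xs = np.arange(x_range[0], x_range[1], cell)
    ys = np.arange(y_range[0], y_range[1], cell)
    grid = np.zeros((len(xs), len(ys), 2))    # channels: occupancy, max height

    for x, y, z in points:
        if x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]:
            i = int((x - x_range[0]) / cell)
            j = int((y - y_range[0]) / cell)
            grid[i, j, 0] = 1.0                      # cell is occupied
            grid[i, j, 1] = max(grid[i, j, 1], z)    # tallest return in the cell
    return grid

# Toy point cloud: a few returns from a car-sized object ahead and to the left.
cloud = np.array([[10.2, -1.5, 0.4], [10.4, -1.3, 0.9], [10.6, -1.1, 1.4]])
bev = points_to_bev(cloud)
print(bev.shape)   # (240, 240, 2): a dense, image-like tensor ready for 2D convs
```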

So the first kind of processing you can do is what is called segmentation. So once you have pixels or laser points, you need to group them together into objects that you can then use for better understanding and processing. So unfortunately, a lot of the objects you encounter while driving don't have a predefined shape.

So here I take the example of snow, but if you think about vegetation, or if you think about trash bags, for instance, you can't come up with a prior understanding of what they're going to look like. And so you have to be ready to handle any shape for those objects.

So one of the techniques that works pretty well is to build a smaller convolution network that you're going to slide across the projection of your sensor data. So that's the sliding window approach. So here, for instance, if you have a pixel-accurate snow detector that you slide across the image, then you'll be able to build a representation of those patches of snow and drive appropriately around them.
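
Here is a minimal sketch of that sliding-window idea; the `snow_score` function below is a hypothetical stand-in for a small trained network, not a real model. The classifier is evaluated at every window position to build a dense, pixel-level mask.

```python
import numpy as np

def snow_score(patch):
    """Hypothetical stand-in for a small trained conv net scoring one patch."""
    return float(patch.mean() > 0.8)    # pretend bright patches are snow

def sliding_window_segment(image, win=5, stride=1):
    """Run the patch classifier at every location to build a dense mask."""
    h, w = image.shape
    mask = np.zeros((h, w))
    for i in range(0, h - win + 1, stride):
        for j in range(0, w - win + 1, stride):
            # One network evaluation per window position: accurate but slow.
            mask[i + win // 2, j + win // 2] = snow_score(image[i:i + win, j:j + win])
    return mask

image = np.random.rand(32, 32)
image[8:20, 8:20] = 0.95           # a bright "snow patch"
print(sliding_window_segment(image).sum())   # nonzero only around the patch
```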

So that works pretty well, but as you can imagine, it's a little expensive computationally, because it's like, I don't know if you've seen them, the old dot matrix printers. The printer had to go back and forth and print the page point by point.

It works pretty well, but it's pretty slow, and the sliding window approach is very analogous to that. It works well, but you need to be very conscious of which areas of the scene you want to apply it to, to stay efficient. Fortunately, many of the objects you care about have predefined priors.

So for instance, if you take a car from the top-down view, from the bird's view, it's going to be a rectangle. You can take that shape prior into consideration. In most cases, even, on the driving lanes, they're going to go in similar directions, whether they go forward or they come the other way.

They're going to go in the direction of the lanes. Same for adjacent streets. So you can use those priors to actually do some more efficient deep learning that in the literature is conveyed under the ideas of single-shot multi-box, for instance. So here, again, you would start with convolution towers, but you do only one pass of convolution.

It's the same difference as between a dot matrix printer and a printing press that prints the whole page at once. It's only an analogy, but I think it conveys the idea pretty well. So here you would train a deep net that would directly take the whole projection of your sensor data and output boxes that encode the priors you have.

So here, for instance, I can show you how such a thing would work for cone detection. So you can see that we don't have all the fidelity of the per-pixel cone detection, but we don't really care about that. We just need to know there is a cone somewhere, and we take a box prior.

And obviously what that image is also meant to show is that, since it's a lot cheaper computationally, you can obviously run that on a pretty wide range of space. And even if you have a lot of them, that still is going to be a very efficient way to get that data.
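
Here is a minimal sketch of the single-shot idea under illustrative assumptions: the grid size, the cone-sized box prior, and the offset encoding below are made up for the example, not Waymo's. The network produces one dense output grid in a single pass, and each cell's prediction is decoded relative to a predefined prior box.

```python
import numpy as np

GRID, CELL = 8, 4.0          # 8x8 output grid over a 32m x 32m bird's-eye view
PRIOR_W, PRIOR_H = 0.5, 0.5  # shape prior: cones are roughly 0.5m x 0.5m boxes

def decode_detections(net_output, score_thresh=0.5):
    """Turn a dense (GRID, GRID, 5) network output into boxes.

    Channels per cell: [score, dx, dy, dw, dh], offsets relative to the prior.
    """
    boxes = []
    for i in range(GRID):
        for j in range(GRID):
            score, dx, dy, dw, dh = net_output[i, j]
            if score < score_thresh:
                continue
            cx = (j + 0.5) * CELL + dx         # cell center plus predicted offset
            cy = (i + 0.5) * CELL + dy
            w = PRIOR_W * np.exp(dw)           # prior scaled by predicted factor
            h = PRIOR_H * np.exp(dh)
            boxes.append((cx, cy, w, h, score))
    return boxes

# Fake network output with one confident cone detection in cell (2, 3).
out = np.zeros((GRID, GRID, 5))
out[2, 3] = [0.9, 0.3, -0.2, 0.1, 0.0]
print(decode_detections(out))   # one box near x=14.3, y=9.8, from a single pass
```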

So we talked about, remember, the flashing lights on top of the police car. So even if you properly detect and segment cars, let's say, on the road, many cars have very special semantics. So here on that slide I'm showing you many examples of EV, emergency vehicles, that you need obviously to understand.

You need to understand, first, that it is an EV, and two, whether the EV is active or not. School buses are not really emergency vehicles, but obviously whether the bus has lights on, or the bus has a stop sign open on the side, carry heavy semantics that you need to understand.

So how do you deal with that? Back to the deep learning techniques. One thing you could do is take that patch, build a new convolution tower, and build a classifier on top of that, and essentially build a school bus classifier, a school bus with lights on classifier, a school bus with stop sign open classifier.

I'm pretty sure that would work pretty well, but obviously that would be a lot of work, and pretty expensive to run on the car, because you would need to ... And convolution layers typically are the most expensive pieces of a neural net. So one better thing to do is to use embeddings.

So if you're not familiar with it, embeddings essentially are vector representations of objects that you can learn with deep nets that will carry some semantic meaning of those objects. So for instance, given a vehicle, you can build a vector that's going to carry the information that that vehicle is a school bus, whether the lights are on, whether the stop sign is open, and then you're back into a vector space, which is much smaller, much more efficient, that you can operate in to do further processing.
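
Here is a minimal sketch of that design, with random placeholder weights rather than trained models: one shared tower produces an embedding per detected vehicle, and each semantic question is answered by a tiny, cheap head operating in that vector space instead of its own convolution tower.

```python
import numpy as np
rng = np.random.default_rng(0)

EMBED_DIM = 64

def embed_vehicle(patch):
    """Stand-in for a shared conv tower: image patch -> 64-d embedding."""
    flat = patch.reshape(-1)
    W = rng.normal(size=(EMBED_DIM, flat.size)) / np.sqrt(flat.size)
    return W @ flat            # in reality, learned conv layers produce this

def linear_head(name, dim=EMBED_DIM):
    """Each semantic question gets a tiny (and cheap) classifier head."""
    w, b = rng.normal(size=dim), 0.0
    return lambda e: (name, 1.0 / (1.0 + np.exp(-(w @ e + b))))   # sigmoid score

heads = [linear_head("is_school_bus"),
         linear_head("lights_flashing"),
         linear_head("stop_sign_open")]

patch = rng.random((32, 32, 3))             # a cropped vehicle patch
embedding = embed_vehicle(patch)            # expensive part, computed once
print([head(embedding) for head in heads])  # cheap part, one dot product each
```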

Historically, those embeddings have been more closely associated with word embeddings. So in a typical text, out of every word in a piece of text, you build a vector that represents the meaning of that word.

And then if you look at the sequence of those words and operate in the vector space, you start to understand the semantics of those sentences. So one of the early projects that you can look at is called word2vec, which was done in the NLP group at Google, where they were able to build such things.

And they discovered that that embedding space actually carried some interesting vector space properties, such as if you took the vector for king minus the vector for man plus the vector for woman, actually you ended up with a vector where the closest word to that vector would be queen, essentially.
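
Here is a minimal sketch of that vector arithmetic with tiny hand-made vectors; real word2vec embeddings are learned from large text corpora and have hundreds of dimensions, so the numbers below are purely illustrative.

```python
import numpy as np

# Toy hand-made vectors along (royalty, gender, person-ness) axes.
vectors = {
    "king":  np.array([0.9,  0.9, 1.0]),
    "queen": np.array([0.9, -0.9, 1.0]),
    "man":   np.array([0.1,  0.9, 1.0]),
    "woman": np.array([0.1, -0.9, 1.0]),
    "truck": np.array([0.0,  0.0, 0.2]),
}

def closest(query, exclude):
    """Return the vocabulary word whose vector is nearest (cosine) to `query`."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cosine(vectors[w], query))

query = vectors["king"] - vectors["man"] + vectors["woman"]
print(closest(query, exclude={"king", "man", "woman"}))   # -> "queen"
```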

So that's to show you how those vector representations can be very powerful in the amount of information they can contain. Let's talk about pedestrians. So we talked about semantic segmentation. Remember, so the ability to go pixel by pixel for things that don't really have a shape. We talked about using shape priors.

But pedestrians actually combine the complexity of those two approaches for many reasons. One is that they obviously are deformable, and pedestrians come with many shapes and poses. As you can see here, I think here you have a guy or someone on a skateboard, crouching, more unusual poses that you need to understand.

And the recall you need to have on pedestrians is very high. And pedestrians show up in many different situations. So for instance here, you have occluded pedestrians that you need to see, because there's a good chance when you do your behavior prediction that that person here is going to jump out of the car, and you need to be ready for that.

So last but not least, predicting the behavior of pedestrians is really hard, because they move in any direction. A car moving in that direction, you can safely bet it's not going to drastically change angle in a moment's notice. But if you take children, for instance, it's a little more complicated.

So they may not pay attention, they may jump in any direction, and you need to be ready for that. So it's harder in terms of shape prior, it's harder in terms of recall, and it's also harder in terms of prediction. And you need to have a fine understanding of the semantics to understand that.

Another example here that we encountered is you get to an intersection, and you have a visually impaired person that's jaywalking on the intersection. And you obviously need to understand all of that to know that you need to yield to that person, pretty clearly. So, person on the road, maybe you should yield to it, to him.

Not easy. So for instance here, there is actually, I don't know if it's a real person or a mannequin or something, but here we go: something that frankly really looks like a pedestrian, that you should probably classify as a pedestrian, but it's lying on the bed of a pickup truck.

And obviously you shouldn't yield to that person, because yielding to a pedestrian at 35 miles per hour, for instance, means hitting the brakes pretty hard, with the risk of a rear collision. So obviously you need to understand that that person is traveling with the truck, is not actually on the road, and it's okay not to yield to him.

So those are examples of the richness of the semantics you need to understand. Obviously one way to do that is to start and understand the behavior of things over time. Everything we talked about up until now in how we use deep learning to solve some of those problems was on a pure frame basis.

But understanding that that person is moving with a truck versus the jaywalker in the middle of the intersection, obviously that kind of information you can get to if you observe the behavior over time. Back to the embeddings. So if you have vector representations of those objects, you can start and track them over time.

So a common technique that you can use to get there is to use recurrent neural networks, that essentially are networks that will build a state that gets better and better as it gets more observations, sequential observations of a real pattern. So for instance, coming back to the words example I gave earlier, you have one word, you see its vector representation, another word in a sentence, so you understand more about what the author is trying to say.

Third word, fourth word, at the end of the sentence you have a good understanding, and you can start to translate, for instance. So here's a similar idea. If you have a semantic representation encoded in an embedding for the pedestrian and the car under it, and track that over time and build a state that gets more and more meaning as time goes by, you're going to get closer and closer to a good understanding of what's going on in the scene.
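
Here is a minimal sketch of that idea with a plain recurrent cell and random placeholder weights; a production tracker would use learned, likely gated, cells. One embedding per frame is folded into a running state that downstream classifiers and behavior predictors can read.

```python
import numpy as np
rng = np.random.default_rng(1)

EMBED_DIM, STATE_DIM = 16, 8

# A plain (Elman-style) recurrent cell; real systems might use LSTMs or GRUs.
W_in = rng.normal(size=(STATE_DIM, EMBED_DIM)) * 0.1
W_state = rng.normal(size=(STATE_DIM, STATE_DIM)) * 0.1

def rnn_step(state, frame_embedding):
    """Fold one new per-frame observation into the running state."""
    return np.tanh(W_state @ state + W_in @ frame_embedding)

# One embedding per frame for a tracked object (e.g. the person on the truck bed).
frames = [rng.normal(size=EMBED_DIM) for _ in range(5)]

state = np.zeros(STATE_DIM)
for t, emb in enumerate(frames):
    state = rnn_step(state, emb)
    # Downstream predictors read `state`, which summarizes frames 0..t so far.
    print(t, np.round(state[:3], 2))
```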

So my point here is, those vector representations combined with recurrent neural networks is a common technique that can help you figure that out. Back to the point. When you're 90% done, you still have 90% to go. And so to get to the last leg of my talk here today, I want to give you some appreciation for what it takes to truly build a machine learning system at scale and industrialize it.

So up till now we talked a lot about algorithms. As I said earlier, algorithms have been a breakthrough, and the efficiency of those algorithms has been a breakthrough for us to succeed at the self-driving task. But it takes a lot more than algorithms to actually get there. The first piece that you need to 10x is around the labeling efforts.

So a lot of the algorithms we talked about are supervised, meaning that even if you have a strong network architecture and you come up with the right one, in order to train that network you need to come up with a representative, high-quality set of labeled data that maps some input to the outputs you want to predict.

So that's a pedestrian, that's a car. That's a pedestrian, that's a car. And the network will learn in a supervised way how to build the right representations. Obviously the unsupervised space is a very active domain of research, and our own research team at Waymo, in collaboration with Google, works in that domain.

But today a lot of it still is supervised. So to give you orders of magnitude, here I represented on a logarithmic scale the size of a couple of data sets. So you may be familiar with ImageNet, which I think is in the 15-million-label range. That guy jumping represents the number of seconds from birth to college graduation, hopefully coming soon.

And so, that's more of a historical tidbit, but remember the problem of finding the house number, the street number, on the facade? Back in those days, it took us a multi-billion-label data set to actually teach the network. So those were very early days. Today we do a lot better, obviously.

But that's to give you an idea of scale. So being able to have labeling operations that produce large and high-quality labeled data sets is key for your success. And that's a big piece of the puzzle you need to solve. So obviously today we do a lot better. Not only do we require less data, but we also can generate those data sets much more efficiently.

You can use machine learning itself to come up with labels, and use human operators, and more importantly use hybrid models where labelers increasingly just fix the discrepancies or the mistakes, and don't have to label the whole thing from scratch. That's the whole space of active learning and related techniques.

So combining those techniques together, obviously you can get to completion faster. It's very common to still need samples in the millions range to train a robust solution. Another piece is around computation, computing power. So again, that's kind of a historical tidbit. Around the street number models, here is the detection model, and here is the transcriber model.

So obviously the comparison is only worth what it's worth here. But if you look at the number of neurons, or the number of connections per neuron, which are two important parameters of any neural net, that gives you an idea of scale. So obviously it's many orders of magnitude away from what the human brain can do, but you start to be competitive, in some cases, with what you find in the mammal space.

So again, historical data, but the main point here is that you need a lot of computation, and you need to have access to a lot of computing to both train those models and run inference on them in real time on the scene. And that requires a lot of very robust engineering and infrastructure development to get to those scales.

But Google is pretty good at that, and obviously we at Waymo have access to the Google infrastructure and tools to essentially get there. So I don't know if you've heard, but the way it's happening at Google is around TensorFlow. Maybe you've heard about it more as a programming language to program machine learning and encode network architectures.

But actually, TensorFlow is also becoming, or is actually, the whole ecosystem that can combine all those pieces together and do machine learning at scale at Google and Waymo. So as I said, it's a language that allows teams to collaborate and work together. That's a data representation in which you can represent your label data sets, for instance, or your training batches.

That's a runtime that you can deploy onto Google data centers, and it's good that we have access to that computing power. Another piece is accelerators. Back in the early days we had CPUs to run deep learning models at scale, which is less efficient; over time GPUs came into the mix, and Google is pretty active in developing a very advanced set of hardware accelerators.

So you may have heard about TPUs, Tensor Processing Units, which are proprietary chipsets that Google deploys in its data centers that allow you to train and run inference on those deep learning models more efficiently. And TensorFlow is the glue that allows you to deploy at scale across those pieces. A very important piece to get there.
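
To give a feel for the programming model, here is a minimal TensorFlow/Keras sketch; the architecture and sizes are illustrative only and have nothing to do with Waymo's actual networks. The point is simply that one model definition can be trained and run on CPUs, GPUs, or TPUs.

```python
import tensorflow as tf

# A tiny image classifier expressed in TensorFlow/Keras; layer sizes are toy values.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4),    # e.g. 4 coarse object classes
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.summary()
# model.fit(train_dataset, epochs=...) would run the same way on a workstation
# GPU or, with a tf.distribute strategy, across many accelerators in a data center.
```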

So it's nice. You're smart. We built a smart algorithm. We were able to collect enough data to train it. Great! Ship it! Well, the self-driving system is pretty sophisticated; it's a complex system to understand, and it's a complex system that requires extensive testing. And I think the last leg that you need to cover to do machine learning at scale and with a high safety bar is around your testing program.

So we have three legs that we use to make sure that our machine learning is ready for production. One is around real-world driving, another one is around simulation, and the last one is around structured testing. So I'll come back to that. In terms of real-world driving, obviously there is no way around it.

If you want to encounter situations and see and understand how you behave, you need to drive. So as you can see, the driving at Waymo has been accelerating over time. It's still accelerating. So we crossed 3 million miles driven back in May 2017, and only six months later, back in November, we reached 4 million.

So that's an accelerating pace. Obviously, not every mile is equal, and what you care about are the miles that carry new situations and important situations. So what we do, obviously, is drive in many different situations. So those miles got acquired across 20 cities, many weather conditions, and many environments.

Is 4 million a lot? To give you a rough order of magnitude, that's 160 times around the globe. Even more importantly, it's hard to estimate, but it's probably the equivalent of around 300 years of human driving. So in that data set, potentially, you have 300 years of experience that your machine learning can tap into to learn what to do.

Even more importantly is your ability to simulate. Obviously, the software changes regularly. So if for each new revision of the software, you need to go and re-drive 4 million miles, it's not really practical, and it's going to take a lot of time. So the ability to have a good enough simulation that you can replay all those miles that you've driven in any new iteration of the software is key for you to decide if the new version is ready or not.

Even more important is your ability to make those miles even more efficient and tweak them. So here is a screenshot of an internal tool that we call CarCraft, that essentially gives us the ability to fuzz or change the parameters of the actual scene we've driven. So what if the cars were going at a slightly different speed?

What if there was an extra car that was on the scene? What if a pedestrian crossed in front of the car? So you can use the actual driven miles as a base, and then augment them into new situations that you can test your self-driving system against. So that's a very powerful way to actually drastically multiply the impact of any mile you drive.
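
Here is a purely conceptual sketch of that fuzzing idea; none of the names below correspond to CarCraft's actual APIs. A logged scene is perturbed into many variants, each of which can then be replayed against the driving software in simulation.

```python
import random
from dataclasses import dataclass, replace

@dataclass
class Agent:
    kind: str          # "car", "cyclist", "pedestrian", ...
    position: float    # distance along its path, in meters
    speed: float       # meters per second

def fuzz_scene(agents, n_variants=100, seed=0):
    """Generate perturbed copies of a logged scene for simulation."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        variant = [replace(a,
                           position=a.position + rng.uniform(-2.0, 2.0),
                           speed=max(0.0, a.speed * rng.uniform(0.8, 1.2)))
                   for a in agents]
        # Occasionally inject an extra agent, e.g. a pedestrian crossing ahead.
        if rng.random() < 0.2:
            variant.append(Agent("pedestrian", position=15.0, speed=1.4))
        variants.append(variant)
    return variants

logged_scene = [Agent("car", 30.0, 12.0), Agent("cyclist", 10.0, 4.0)]
for scene in fuzz_scene(logged_scene, n_variants=3):
    print(scene)       # each variant would be replayed against the driving stack
```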

And simulation is another of those massive-scale projects that you need to cover. So a couple of orders of magnitude here. So using Google's infrastructure, we have the ability to run a virtual fleet of 25,000 cars 24/7 in data centers. So those are software stacks that emulate the driving across either raw miles that we've driven or modified miles that help us understand the behavior of a software.

So to give you another magnitude, last year alone we drove 2.5 billion of those miles in data centers. So remember, from 4 million driven miles total, all the way to 2.5 billion. So that's three orders of magnitude of expansion in your ability to truly understand how the system behaves. But there's still a long tail.

There's a long tail of situations that will happen very rarely. So the way we decided to tackle those is to set up our own testing facility that is a mock of a city and driving situations. We do that in a 90-acre testing facility on a former Air Force base in central California that we set up with traffic lights, railroad crossings, truly trying to reproduce real-life situations, and where we set up very specific scenarios that we haven't necessarily encountered during regular driving but that we want to test.

And then feed back into the simulation, re-augment using the same augmentation strategies, and inject into our 2.5 billion miles driven. So here I'm going to show you two quick examples of such tests. So here, just have a car back up as the self-driving car gets close and see what happens, and use all those sensor data to re-inject them into simulation.

Another example is going to be around people dropping boxes. So remember, try to imagine the kind of segmentation you need to do to understand what's happening there, and the semantic understanding you need to have. And to make it even more interesting, note that a car has been placed on the other side, so that swerving is not an option without hitting that car.

So we drive complex situations that go from perception to motion planning, the whole stack, and make sure that we are reliable, even in those long-tail examples. Are we done? It looks like a lot of work. I wish, but no. Actually, we still have a lot of very interesting work coming, so I don't have much time to go into too many of those details, but I'm just going to give you two big directions.

The first one is around growing what we call our ODD, our operational design domain. So extending our fleet of self-driving cars, not only geographically, meaning deploying into urban cores, but also deploying into different weather conditions. So just as of this morning or yesterday, we announced that we're going to grow our testing in San Francisco, for instance, with way more cars, which brings urban environments, slopes, fog, as I said.

And so that's obviously a very, very important direction that we need to go into, and where machine learning is going to keep playing a very important role. Another area is around semantic understanding. So in case you haven't noticed yet, I am from France. That's a famous roundabout in Paris, Place de l'Etoile, which seems pretty chaotic, but I've driven it many times without any issues, touching wood.

But I know that it took a lot of semantics and understanding for me to do it safely. I had a lot of expectations on what people do, a lot of communication, visual, gestures, to essentially get through that thing safely. And those require a lot of deeper semantic understanding of the scene around for self-driving cars to get through.

So that's an example of a direction. So back to my objectives. I hope I covered many of those, or at least that you have directions for further reading and investigation. Of those three objectives I had today, the first one was around context: the context of the space, the context of the history at Google and Waymo, and how deep the roots go back in time.

My second objective was to tie some of the technical algorithmic solutions that you may have talked about during the class into the practical cases we need to solve in a production system. And last but not least, to really emphasize the scale and the engineering and infrastructure work that needs to happen to really bring such a project to fruition in a production system.

One last tidbit: that's a scene with kids jumping on bags, crossing the street like Frogger. And I think we have time for a few questions. So I'll hand it over to Hannah. Thank you. I was wondering, you showed your CarCraft simulation a little bit. So from a robotics background, usually the systems tend to fail at this intersection between perception and planning.

So your planner might assume something about a perfect world that perception cannot deliver. So I was wondering if you use the simulation environment also to induce these perception failures, or whether that's really specific for scenario testing, and whether you have other validation arguments for the perception side. Very good question.

So one thing I didn't mention is that the simulator obviously enables you to simulate many different layers in a stack. And one of the hard-core engineering problems is to actually properly design your stack so that you can isolate and test independently. Like any robust piece of software, you need to have good APIs and layers.

So we have such a layer in our system between perception and planning. And the way you would test perception is more by measuring the performance of your perception system across more of the real miles, and use and tweak the output of the perception system with its mistakes. So having a good understanding of the mistakes it makes, and reproduce those mistakes realistically in the new scenarios you would come up with, part of your simulator, to realistically test the planning side of the house.

Thanks very much. You talked about the car as being a complex system, and it has to be an industrial product that is being conceived at scale and produced at scale. Do you have a systematic way of creating the architectures of the embedded system? You have so many choices for sensors, algorithms, each problem you showed has many different solutions.

That's going to create different interfaces between each element. So how do you choose which architecture you put in a car? That's true for any complex software stack. So there's a combination of different things. So the first thing, obviously, that I didn't talk too much about here, but is around the vast amount of research that we do at Waymo, but also we do in collaboration with Google Teams, to actually understand even what building blocks we have at our disposal to even play with and come up with those production systems.

The other piece is obviously the one you decide to take all the way to production. So you're right. So the two big elements here, I would say, the first one, I mean the main element, frankly, is in your ability to-- so that search actually takes a lot of people to get to.

So something I try to say is that to really-- part of the second 90% is your ability to grow your team and essentially grow the number of people who will be able to productively participate in your engineering project. And that's where the robustness we need to bring into our development environment, our testing, is really key to be able to grow that team at a bigger scale and essentially explore all those paths and come up with the best one.

And at the end of the day, the robustness of testing is the judge. That's what tells you whether an approach works or not. It's not a philosophical debate. Thank you for your talk. So the car is making a decision at every single step time, on direction and speed. And part of the reason why you have this simulation is so that you can test those decisions in every possible scenario.

So once self-driving cars become production-ready and out on the streets, do you expect that the decision will be made based on prior understanding of every single situation which is possible? Or can the car make a new decision in real time based on its seen understanding and everything around it?

So at the end of the day, the goal of the system is not to build a library of events that you can reproduce one by one and make sure that you encode. The analogy in machine learning would be overfitting. It's like if you encountered five situations, I'm pretty sure you can hard-code the perfect thing you need to do in those five situations.

But the sixth one that happens, if you don't generalize, actually is going to fall through. So really the complexity of what you need to do is extract the core principles that make you safely drive. And have the algorithms learn those principles rather than the specifics of any situation. Because as you said, the parameter space of a real scene is infinite.

So we try to fuzz that a little bit with a simulator. What if the cars went a little faster, a little slower? But the goal is not to enumerate all possibilities and make sure we do well on those. But the goal is to bring more diversity to the learning of those general principles that will be learned by the system or will be coded in the system for the car to behave properly and generalize when a new situation occurs.

Maybe a couple more questions is okay? Okay. Fantastic talk. One of the questions I had was, you mentioned the difficulty of identifying snow because it could come in many different shapes. One of the things that I immediately thought of was, I know it was just an urban legend, but it was that urban legend about the Inuit having 150 different words for snow.

And you mentioned embeddings of objects. Do you think one possible approach might be to create a much wider array of object embeddings for things like snow? I mean, if you're... Many different types of snow could actually have pretty different impacts on driving, whether it be just like a flurry, or if it were to be the kind of like a really heavy blizzard like we just had.

Yeah, I think if you look at it from an algorithmic point of view, that may make sense. But maybe something I'd like to emphasize a little more is the very hard line to walk is to walk the line of what's algorithmically possible, but also what's computationally feasible in the car.

I think... So, two points on your remarks. So, if we had the processing power to process every point, or every... to a large level of understanding, and had the computing power to do that, maybe that would be an approach. But that would be very expensive, and that's a hard thing to do.

Even more importantly, having... for instance, it wouldn't make sense to have a behavior prediction on every snowflake of the things you see on the side of the road, right? And you need to group... That's the whole point of segmentation. You need to group what you see into semantic objects that are likely to exhibit a behavior as a whole, and reason at that level of abstraction to have a meaningful semantic understanding that you need to drive, essentially.

So, yeah, it's an in-between. Last question. Make it a good one. Thanks for the talk. So, if you're using perception for your scene understanding, are you worried about, like, adversarial examples or things that have been demonstrated? Or do you not believe that this is, like, a real-world attack that could be used for perception-based systems?

So, generally speaking, I think even before adversarial attacks, errors... I mean, errors can happen, right? And errors happen in every module. So I think a prime example of that which is not adversarial is the reflection case. It's like, yeah, you could as well have put a sticker on the car, on the bus, and say, "Ah, you're confused.

"You think it's a car. It's not a car." But you don't need to put a sticker on the bus. It's like, real life already brings a lot of those examples. So really, the way out is two ways. The first one is to have sensors that complement each other. So I try to emphasize that.

Really, different sensors or different systems are not going to make the same mistakes, and so they're going to complement each other. And that's a very important piece of redundancy that we build into the system. The other one, even in the reflection case, is in the understanding. So the way you as a human wouldn't be fooled is because you understand, and you know it's not a thing that can happen.

The same way, you know that when a car is reflected in a bus, there's no way you can see through the bus to a real car behind it. So that level of semantic understanding is what is going to tell you what is true and what is not, or what is a mistake, an error in your stack.

And so similar patterns apply. We'd like to thank you very much, Sacha Arnoud, for coming to MIT.