
Drago Anguelov (Waymo) - MIT Self-Driving Cars


Chapters

0:00 Introduction
0:47 Background
1:31 Waymo story (2009 to today)
4:31 Long tail of events
8:55 Perception, prediction, and planning
14:54 Machine learning at scale
26:43 Addressing the limits of machine learning
29:38 Large-scale testing
50:51 Scaling to dozens and hundreds of cities
54:35 Q&A

Transcript

All right, welcome back to 6.S094, Deep Learning for Self-Driving Cars. Today we have Drago Anguelov, Principal Scientist at Waymo. Aside from having the coolest name in autonomous driving, Drago has done a lot of excellent work in developing and applying machine learning methods to autonomous vehicle perception, and more generally in computer vision and robotics.

He's now helping Waymo lead the world in autonomous driving. 10 plus million miles achieved autonomously to date, which is an incredible accomplishment. So it's exciting to have Drago here with us to speak. Please give him a big hand. (audience applauding) - Hi, thanks for having me. I will tell you a bit about our work and the exciting nature of self-driving and the problem and our solutions.

So my talk is called "Taming the Long Tail of Autonomous Driving Challenges." My background is in perception and robotics. I did a PhD at Stanford with Daphne Koller and worked closely with one of the pioneers in the space, Professor Sebastian Thrun. I spent eight years at Google doing research on perception, also working on Street View and developing deep models for detection, neural net architectures.

I was briefly at Zoox, where I was heading the 3D perception team. We built another perception system for autonomous driving, and I've been leading the research team at Waymo most recently. So I want to tell you a little bit about Waymo to start. Waymo actually this month has its 10-year anniversary.

It started when Sebastian Thrun convinced the Google leadership to try an exciting new moonshot. And the goal that they set for themselves was to drive 10 different segments that were 100 miles long. And later that year, they succeeded and drove an order of magnitude more than anyone has ever driven.

In 2015, we brought this car to the road. It was built ground up as a study in what fully driverless mobility would be like. In 2015, we put this vehicle in Austin and it completed the world's first fully autonomous ride on public roads. And the person inside this car is a fan of the project and he is blind.

So we did not want this to be just a one-off demo of a fully driverless experience. We worked hard, and in 2017 we launched a fleet of fully self-driving vehicles on the streets in the Phoenix metro area. And we have been doing fully driverless operations ever since. So I wanted to give you a feel for what the fully driverless experience is like.

(people chattering) (upbeat music) And so we continued. Last year, we launched our first commercial service in the metro area of Phoenix. There, people can call a Waymo on their phone. It can come pick them up and help them with errands or getting to school. And we've already been learning a lot from these customers, and we are looking to grow and expand the service and bring it to more people.

So in the process of growing the service, we have driven 10 million miles on public roads, as Lex said, driverlessly and also with human drivers to collect data. And we've driven in all kinds of scenarios and cities, capturing a diverse set of conditions and a diverse set of situations in which we develop our systems.

And so in this talk, I want to tell you about the long tail of events. This is all the things we need to handle to enable a truly driverless future, and all the problems that come with it, and I'll offer some solutions and show you how Waymo has been thinking about these issues.

So as we drove 10 million miles, of course, we still find scenarios, new ones that we have not seen before and we still keep collecting them. And so when you think about self-driving vehicles, they need to have the following properties. First, the vehicle needs to be capable. It needs to be able to handle the entire task of driving.

So it cannot just do a subset and still remove the human operator from the vehicle. And it obviously needs to do all of these tasks well and safely. That is the requirement for achieving self-driving at scale. And when you think about this, the question is, well, how many of these capabilities and how many scenarios do you really need to handle?

Well, it turns out the world is quite diverse and complicated, and there are a lot of rare situations, and all of them need to be handled well. This is called the long tail, the long tail of situations. It's one type of effort to get self-driving working for the common cases, and it's another effort to tame the rest, and the rest really, really matters.

And so I'll show you some. For example, this is us driving in the street and let's see if you can tell what is unusual in this video. (audience laughing) You see, so I can play it one more time. So there's a bicyclist and he's carrying a stop sign. And I don't know where he picked it up but it's certainly not a stop sign we need to stop for, unlike others, right?

And so you need to understand that. Let me show you another scenario. This is another case where we are driving along, and then the vehicle stops and a big pile of poles comes our way, right? And you need to potentially understand that and learn to avoid it. Generally, different types of objects can fall on the road; it's not just poles.

Here's another interesting scenario. This happens a lot, it's called construction, and there are various aspects of it. One of them is someone closed the lane and put down a bunch of cones, and this is our vehicle correctly identifying where it's supposed to be driving between all of these cones and successfully executing it.

So yeah, we drive for a while. And this is something that happens fairly often if you drive a lot. Another case is this one. I think you can understand what happened here. You can notice, actually, that we hear the siren. So we have the ability to understand the sirens of special vehicles, and you can see we hear it and stop, while some other drivers brake much later, at the last moment, letting the emergency vehicle pass.

And here's another scenario potentially I want to show you. Let's see if you can understand what happened. So let me play one more time. Did you guys see? So we stopped at, there's a green light, we're about to go and someone goes at high speed running a red light without any remorse.

Right, and we successfully stop and prevent issues. So sometimes you have the right of way on your road, and people don't always abide by it, and you don't want to just drive directly in front of that person even if they're breaking the law.

So hopefully with this I convinced you that the situations that can occur are diverse and challenging and there's quite a few of them. And I want to take you a little bit on the tour of what makes this challenging and then tell you some ways in which we think about it and how we're handling it.

And so to do this, we're gonna delve a little bit more into the main tasks for self-driving, which are perception, prediction, and planning. And so I'll tell you a little bit about those. These are usually the core AI aspects of the car. There are other tasks, and we can talk about those as well in a little bit, but let's focus on these first.

So perception is mapping from sensory inputs and potentially prior knowledge of the environment to a scene representation. And that scene representation can contain objects, it can contain scene semantics, potentially you can reconstruct a map, you can learn about object relationships and so on. And perception, the space of things you need to handle in perception is fairly hard, it's a complex mapping.

Right, so you have sensors, the pixels come, LiDAR points come, or radar scans come, and you have multiple axes of variability in the environment. So obviously there are a lot of objects. They have different types, appearance, pose. I don't know if you can see this well, but there are a bunch of people dressed as dinosaurs in this case; people generally are fairly creative in how they dress.

Vehicles can also be of different types, people come in different poses, and we have seen it all. All right, so that's one aspect. Then there are different environments that these objects appear in: times of day, seasons, day and night, and different settings, for example a highway environment, a suburban street and so on. And then there's a different variability axis, and this one is slightly more abstract.

Different objects can come into these environments in different configurations and can have different relationships. And so there are things like occlusion, there's a guy carrying a big board, there are reflections, there are, you know, people riding on horses and so on. And so why am I showing this? Because I just want to show you the space, right?

So in most cases you care about most objects in most environments in most reasonable configurations, and that's a space that you need to map from the sensor inputs to a representation that makes sense, and you need to learn this mapping function or represent it somehow. All right, and so let's go to the next step, which is prediction.

So apart from just understanding what's happening in the world, you need to be able to anticipate and predict what some of the actors in the world are going to do, the actors being mostly people, and people are honestly what makes driving quite challenging. This is one of the aspects that makes it so: the vehicle needs to be out there and be a full-fledged traffic scene participant.

And this anticipation of agent behavior sometimes needs to be fairly long term, so sometimes when you want to make a decision, you want to validate or convince yourself it does not interfere with what anyone else is going to do, and it can go from one second to maybe 10 seconds or more, you need to anticipate the future.

So what goes into anticipating the future? Well, you can watch past behavior: someone is going this way, maybe they will continue going there; or maybe they're walking very aggressively, and maybe they're more likely to make aggressive motions in the future. Then there are high-level scene semantics: well, I'm in a presentation room, I'm sitting here at the front giving a talk, I'll probably stay here and continue, even though stranger things have happened.

And of course there are subtle appearance cues. For example, if a person is watching our vehicle as we move towards them, we can be fairly confident they're paying attention and not going to do anything particularly dangerous. If someone's not paying attention or is distracted, or there is a person in a car waving at us: various gestures, cues, the blinkers on the vehicles, these are all subtle signals that we need to understand in order to be able to behave well.

And last but not least, even when you predict how other agents behave, agents also are affected by the other agents in the environment as well. So everyone can affect everyone else, and you need to be mindful of this. So I'll show you an example of this. I think this is one of the issues that really needs to be thought about.

We are all interacting with each other. So here's a case: our Waymo vehicle is driving, and there are two bicyclists in red going around a parked car. And what happens is we correctly anticipate that as they bike, they will go around the car, and we slow down and let them pass.

So we're reasoning that they will interact with the parked car. This is our most likely prediction for the rear bicyclist. We anticipate that they will do this, and we correctly handle this case. So this illustrates prediction. And then we have planning. This is our decision-making machine. It produces vehicle behavior, which typically ends up in control commands to the vehicle: accelerate, slow down, steer the wheel.

And you need to generate behavior that ultimately has several properties to it, and it's important to think of them, which is safe, safety comes first, comfortable for the passengers, and also sends the right signals to the other traffic participants, because they can interact with you, and they will react to your actions, so you need to be mindful.

And you need to, of course, make progress. You need to deliver your passengers. So you need to trade off all of these in a reasonable way. And it can be fairly sophisticated reasoning in complex environments. I'll show you just one scene. This is a complex scene, I think a school gathering. There are bicyclists trailing us, vehicles hemmed in really close to us, a bunch of pedestrians, and we need to make progress, and here is us.

We're driving reasonably well in crowded scenes. And that is part of the prerequisite of bringing this technology to dense urban environments, being able to do this. So how are we gonna do it? Well, I gave it away already: I'm a machine learning person. I think when you have these complicated models and systems, machine learning is a really great tool to model complex actions, complex mapping functions, features.

Right, and so we're going to learn our system. And we've been doing this. I mean, we're not the only ones; obviously, this is now a machine learning revolution. And machine learning is permeating all parts of the Waymo stack. All of these systems that I'm talking about: it helps us perceive the world.

It helps us make decisions about what others are going to do. It helps us make our own decisions. Right, and machine learning is a tool to handle the long tail. Right, and I'll tell you a little more on this how. So I have this allegory about machine learning that I like to think about.

So there is a classical system and there is a machine learning system. And to me, a classical system, and I've been there, I've done it; early machine learning systems can also be a bit classical. You're the artisan, you're the expert. You have your tools and you need to build this product.

And you have your craft and you go and take your tools and build it, right? And you can fairly quickly get something reasonable. But then it's harder to change, it's harder to evolve. If you learn new things, now you need to go back and maybe the tools don't quite fit, and you need to essentially keep tweaking it, and the more complicated the product becomes, the harder it is to do.

And machine learning, modern machine learning, is like a factory. Right, so machine learning, you build the factory, which is the machine learning infrastructure. And then you feed data in this factory and get nice models that solve your problems, right? And so, kind of infrastructure is at the heart of this new paradigm.

You need to build a factory. Right, once you do it, now you can iterate, it's scalable, right? Just keep collecting the right data, keep feeding the machine, and it keeps giving you good models. So what is an ML factory for self-driving models? Well, roughly it goes like this. We have a software release, we put it on the vehicle, and we're able to drive.

We drive, we collect data, we collect it and we store it. And then we select some parts of this data and we send it to labelers. And the labelers label parts of the data that we find interesting and that's a knowledge that we want to extract from the data.

These are the labels, the annotations, the results we want for our models. Right, there it is. And then what we're going to do is we're gonna train machine learning models on this data. After we have the models, we will do testing and validation, validate that they're good to put on our vehicles.

And once they're good to put on our vehicles, we go and collect more data. And then the process starts going again and again. So you collect more data, now you select new data that you have not selected before. You add it to your data set, you keep training the model.

And iterate, iterate, iterate; it's a nice scalable setup. Of course, this needs to be automated. It needs to be scalable itself. It's a game of infrastructure. And at Waymo, we have the advantage of being really well set up with regards to machine learning infrastructure. And I'll tell you a bit about its ingredients and how we go about it.
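
To make this loop concrete, here is a toy, runnable sketch of such a data-engine cycle; every name and the trivial running-mean "model" are illustrative placeholders, not Waymo's actual pipeline.

```python
# A toy, runnable sketch of the "factory" loop described above: drive, select
# interesting data, "label" it, retrain, and go drive again. The logging and the
# running-mean "model" are placeholders, not Waymo's actual infrastructure.
import random

def drive_and_log(n_events=1000):
    # Pretend driving: each logged event is a number; rare events are far from typical.
    return [random.gauss(0, 1) if random.random() > 0.01 else random.gauss(8, 1)
            for _ in range(n_events)]

def select_interesting(logs, model_mean, k=20):
    # Data-mining step: keep the events the current model finds most surprising.
    return sorted(logs, key=lambda x: abs(x - model_mean), reverse=True)[:k]

dataset, model_mean = [], 0.0
for iteration in range(5):                            # the loop keeps running as the fleet drives
    logs = drive_and_log()
    dataset += select_interesting(logs, model_mean)   # "send to labelers" in the real pipeline
    model_mean = sum(dataset) / len(dataset)          # "train" a new model on the labeled data
    print(f"iteration {iteration}: {len(dataset)} labeled examples, model = {model_mean:.2f}")
```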

So ingredient one is computing software infrastructure and we're part of Alphabet, Google, and we are able to, first of all, leverage TensorFlow, the deep learning framework. We have access to the experts that wrote TensorFlow and know it in depth. We have data centers to run large-scale parallel compute and also train models.

We have specialized hardware for training models, which makes it cheaper, more affordable and faster, so you can iterate better. Ingredient two is high-quality labeled data. We have the scale to collect and store hundreds of thousands and even millions of miles. But just collecting and storing 10 million miles is not necessarily the best thing you can do, because there is decreasing utility to the data.

So most of the data comes from common scenarios and maybe you're already good at them and that's where the long tail comes. So it's really important how you select the data. And so this is the important part of this pipeline. So while you're running a release on the vehicle, we have a bunch of models, we have a bunch of understanding about the world, and we annotate the data as we go and we can use this knowledge to decide what data is interesting, how to store it, which data we can potentially even ignore.

So then once we do that, again, we need to be very careful how to select data. We want to select data, for example, that are interesting in some way and complement, capture these long tail cases that we potentially may not be doing so well on. And so for this, we have active learning and data mining pipelines.

Given exemplars, find the rare examples, look for parts of your system which are uncertain or inconsistent over time and go and label those cases. Last but not least, we also produce auto labels. So how can you do that? Well, when you collect data, you also see the future for many of the objects, what they did.

And so because of that, now knowing the past and the future, you can annotate your data better and then go back to your model that does not know the future and try to replicate that with that model. And so you need to do all of this as part of the system.
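
To illustrate these two ideas, here is a small self-contained sketch; the confidence thresholds, the smoothing trick standing in for a real offline tracker, and all names are illustrative assumptions, not Waymo's actual data-mining or auto-labeling system.

```python
# Illustrative sketch only: (1) mine frames where the model's confidence is ambiguous,
# (2) "auto-label" an object track by smoothing it with knowledge of past AND future,
# which the offline system has but the online model does not.
import numpy as np

def mine_uncertain_frames(per_frame_confidence, low=0.3, high=0.7):
    # Active-learning style selection: send the ambiguous frames to human labelers.
    scores = np.asarray(per_frame_confidence)
    return np.where((scores > low) & (scores < high))[0]

def auto_label_track(noisy_positions, window=5):
    # Offline smoothing: every timestep can look at both earlier and later observations.
    x = np.asarray(noisy_positions, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")   # a stand-in for a real offline tracker

print(mine_uncertain_frames([0.95, 0.50, 0.10, 0.65]))     # frames 1 and 3 look ambiguous
print(auto_label_track([0.0, 1.1, 1.9, 3.2, 3.9, 5.1]))    # smoothed, "future-aware" positions
```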

Ingredient number three, high-quality models. We're part of the larger Alphabet, with Google and DeepMind, and generally Alphabet is a leader in AI. When I was at Google, we were very early in the deep learning revolution. I happened to have the chance to be there at the time. It was 2013 when I got into deep learning, and a lot of things were not understood, and we were there working on it earlier than most people.

And so through that, we had the opportunity and the chance to develop some of these things. In my time, the team I managed invented neural net architectures like Inception, which became popular later. We invented what was at the time the state-of-the-art fast object detector, called SSD.

And we won ImageNet in 2014. And now if you go to the conferences, Google and DeepMind are leaders in perception and reinforcement learning and smart agents, with state-of-the-art semantic segmentation networks, pose estimation and so on. Object detection, of course, goes without saying.

And so we collaborate with Google and DeepMind on projects improving our models. So this is my factory for self-driving models, and I want to tell you about something that captures all of these ideas, infrastructure, data, and models, in one. This is a project we did recently, and today we put a post on our blog about automatic machine learning for tuning and adjusting architectures of neural networks.

So what did we do? There is a team at Google working on AutoML, automatic machine learning. Usually the networks themselves have complex architectures. They're crafted by practitioners, artisans of networks in some way. And sometimes we have very tight latency constraints on the models, and we have some compute constraints.

The networks are specialized. It takes often people months to find the right architecture that's most performant, low latency and so on. And so there's a way to offload this work to the machines. You can have machines themselves, once you've posed the problem, go and find your good network architecture that's both low latency and high performance.

Right, and so that's what we do. And we drive in a lot of scenarios and as we keep collecting data and finding new cities or new examples, the architectures may change and we want to easily find that and keep evolving that without too much effort, right? So we worked with the Google researchers and they had a strong work where they invented, well, they developed a system that searched the space of architectures and found a set of components of neural networks.

It's a small sub-network called a NAS cell, and this is a diagram of a NAS cell. It's a set of layers put together that you can then replicate in the network to build a larger network. And they discovered it on a small vision data set called CIFAR-10. It's from the early days of deep learning, it was a very popular data set, and on it you can quickly train models and explore the large search space.

So the first thing we did is we took some problems that we have in our stack, one of them being LiDAR segmentation. So you have a map representation and some LiDAR points, and you essentially segment the LiDAR points. You say this point is part of a vehicle, that point is part of vegetation and so on.

This is a standard problem. So what we first did at Waymo is we explored several hundred NAS cell combinations to see what performs better on this task. And we saw that one of two things happened for the various versions that we found. One of them is we can find models with similar quality but much lower latency and less compute.

And then there are models of a bit higher quality at the same latency. So essentially we found better models than the human engineers did. And similar results were obtained for other problems, lane detection as well, with this transfer learning approach. Of course, you can also do end-to-end architecture search.

So there's no reason why what was found on CIFAR-10 is best suited for our more specialized problems. And so we went about this more from the ground up: let's do a deeper search over a much larger space, not limited to the NAS cells themselves. And the way to do this, because our networks are trained on quite a lot of data, take quite a while to converge, and take some compute, is to define a proxy task.

This is a smaller task, simplified, but correlates with the larger task. And we do this by some experimentation of what would be a proxy task. And once we establish a proxy task, now we execute the search algorithms developed by the Google researchers. And so we train up to 10,000 architectures with different topology and capacity.

And once we find the top 100 models, we then train the large networks for those architectures all the way and pick the best ones. Right? And so this way we can explore a much larger space of network architectures. So what happened? On the left, these are 4,000 different models spanning the space of latency and quality.

And in red was the transfer model. So after the first round of search, we actually did not produce a better model than the transfer, which already leveraged their insight. So then we took the learnings and the best models from this search and did the second round of search, which was in yellow, which allowed us to beat it.

And third, we also executed a reinforcement learning search algorithm developed by the AI researchers on 6,000 different architectures. And that one was able to significantly improve on the red dot, which itself significantly improved on the in-house baseline. So that's one example where infrastructure, data, and models combine, and it shows how you can keep automating the factory.
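
Here is a toy sketch of that proxy-task search recipe: score many sampled architectures cheaply, fully train only a shortlist, and keep the latency/quality Pareto set. The architecture encoding and scoring functions are invented for illustration and are not the actual AutoML system.

```python
# Toy sketch of proxy-task architecture search: sample many candidates, score them on a
# cheap proxy, fully train only the top few, keep the Pareto-optimal ones. All made up.
import random

def sample_architecture():
    return {"depth": random.randint(4, 40), "width": random.choice([32, 64, 128, 256])}

def proxy_score(arch):          # cheap, correlated stand-in for full training
    return 1.0 - 1.0 / arch["depth"] - random.uniform(0, 0.05)

def full_train(arch):           # expensive: only run on the shortlist
    quality = 1.0 - 1.0 / (arch["depth"] * arch["width"] ** 0.25)
    latency = arch["depth"] * arch["width"] * 1e-3
    return quality, latency

candidates = [sample_architecture() for _ in range(10_000)]
shortlist = sorted(candidates, key=proxy_score, reverse=True)[:100]
results = [(full_train(a), a) for a in shortlist]

# Keep architectures not dominated in both quality and latency (the Pareto set).
pareto = [((q, l), a) for (q, l), a in results
          if not any(q2 >= q and l2 <= l and (q2, l2) != (q, l) for (q2, l2), _ in results)]
print(f"{len(pareto)} Pareto-optimal architectures out of {len(shortlist)} fully trained")
```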

That is all good, but we keep finding new examples in the world. And for some situations, we have fairly few examples as well. And so there are cases where the models are uncertain or potentially can make mistakes, and you need to be robust to those. I mean, you cannot ship the product and say, well, our networks just don't handle some case. So we have designed our system to be robust, even when ML is not particularly confident.

And how do you do this? So one part is, of course, you want redundant and complementary sensors. We have a 360-degree field of view on our vehicles in camera, LiDAR, and radar. And they're complementary modalities. First of all, an object is seen in all of them. Second of all, they all have different strengths and different modes of failure.

And so whenever one of them tends to fail, the others usually work fine. And so that helps a lot, make sure we do not miss anything. Also, we've designed our system to be a hybrid system. And this is a point I want to make. So, I mean, some of these mapping problems or problems in which we apply our models are very complicated.

They're high dimensional. The image has a lot of pixels. LiDAR has a lot of LiDAR points. The networks can end up pretty big. And it may not be so easy to train with very few examples with the current state of the art. And so the state of the art keeps improving, of course.

There is, for example, zero-shot and one-shot learning. But while the state of the art in the models is improving, we can also leverage expert domain knowledge. And so what does that do? Humans can help develop the right input representations. They can put in expert bias that constrains the representation to fewer parameters that already describe the task.

And then with that bias, it is easier to learn models with fewer examples. And of course, experts can also put their knowledge into designing the algorithm that incorporates it. And so our system is this hybrid. An example of what that looks like for perception is: even in cases where the machine learning system may not be confident, we still have tracks and obstacles from LiDAR and radar scans, and we make sure that we drive safely relative to those.

And in prediction and planning, if we are not confident in our predictions, we can drive more conservatively. And over time, as the factory is running and our models become more powerful, of course, improve, and we get more data of all the cases, the scope of ML grows, right? And the set of cases that you can handle with it increases.
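
A minimal sketch of that hybrid idea, assuming a made-up detection structure and threshold: trust the learned semantics when confidence is high, otherwise treat the LiDAR/radar track as a generic obstacle and behave more conservatively.

```python
# Illustrative only: confidence-gated fallback from learned semantics to raw obstacle tracks.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str          # e.g. "pedestrian", "vehicle", "unknown"
    confidence: float   # classifier confidence in [0, 1]
    distance_m: float   # range from the LiDAR/radar track

def plan_speed(detections, nominal_mps=12.0, conf_threshold=0.8):
    speed = nominal_mps
    for det in detections:
        if det.confidence >= conf_threshold:
            # Trust the semantic class: yield strongly to people, normally to vehicles.
            margin = 1.0 if det.label == "pedestrian" else 0.5
        else:
            # Low confidence: it is still an obstacle from LiDAR/radar, so be extra conservative.
            margin = 1.5
        speed = min(speed, max(0.0, det.distance_m / (2.0 + margin)))
    return speed

print(plan_speed([Detection("pedestrian", 0.95, 30.0), Detection("unknown", 0.4, 15.0)]))
```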

And so there's two ways to attack the tail. You both protect against it, but you also keep growing ML and making your system more performant. I'm going to tell you now how we deal with large-scale testing, which is another key problem. It's very important in the pipeline and also in getting the vehicles on the road.

So how do you normally develop a self-driving algorithm? Well, the ideal thing you're gonna do is you make your algorithm change, and you would put it on the vehicle and drive a bunch and say, "Now it looks great." All right, let's make the next one. The problem is, I mean, we have a big fleet, we have a lot of data, but some of the conditions and situations occur very, very rarely.

And so if you do this, you're gonna wait a long time. Furthermore, you don't just want to take your code and put it on the vehicle; you need to test it even before that. You want very strongly tested code on public streets. So you can do structured testing.

We have a 90-acre facility at a former Air Force base where we can test very important situations and situations that occur rarely. Here is an example of such a situation. And so you can do this as well: you can select and deliberately stage, safely, conditions you care about. Now again, you cannot do this for all situations.

So what do you do? A simulator. Right? And so how much do you need to simulate? Well, we simulate a lot. We simulate the equivalent of 25,000 virtual cars driving 10 million miles a day, and we have simulated over seven billion miles. It's a key part of our release process.

So why do you need to simulate this much? Right, well, hopefully I convinced you there is a variety of cases to worry about and that you need to test, right, so far. And furthermore, it goes all the way bottom up. So as you change perception, for example, slightly different segmentation or detection, the changes can go through the system and the results can change significantly and you need to be robust to this.

You need to test all the way through. So what do you simulate? One thing we can do is create unique scenarios from scratch, working with safety experts and NHTSA, and analyzing what conditions typically lead to accidents. So you can do that. Of course, you can do it manually, you can create them.

What else could you do? Well, you want to leverage your driving data. You have all your logs, you have a bunch of situations there, right? So you can pick interesting situations from your logs. And furthermore, what you can do is, you take all these situations and you create variations of these situations so you get even more scenarios.

So here's an example of a log simulation. I'll play it twice. The first time, look at the image. This is what happened in the real world: we mostly stayed in the middle lane and stopped. If you look at what happened in simulation, our algorithm decided this time to merge to the left lane and stop.

And everything was fine, things were safe, things were happy. What can go wrong in simulation from logs? Well, let's say this is another scenario, slightly different visualization. Our vehicle, when it drove the real world, was where the green vehicle is. Now in simulation, we drove differently and we have the blue vehicle, right?

And so we're driving. Bam. What happened? Well, there is a purple agent over there, a pesky purple agent, who in the real world saw that we passed them safely. And so it was safe for them to go, but it's no longer safe because we changed what we did. So the insight is, in simulation, our actions affect the environment, and that needs to be accounted for.

So what does that mean? If you want to have effective simulations on a large scale, you need to simulate the realistic driver and pedestrian behavior. So, you know, you could think of a simple model. Well, what is a good proxy or what's a good approximation of a realistic behavior?

Well, you can do a brake-and-swerve model. You just say, well, there is some normal way reactions happen: I have a reaction time and a braking profile and maybe a swerving profile, so if an agent sees someone in front of them, maybe they just apply that as an algorithm.
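
For illustration, a minimal sketch of such a brake-reaction agent, with made-up reaction-time and deceleration numbers rather than a validated driver model:

```python
# Toy reactive agent: wait out a reaction time, then brake if the gap ahead is too small.
def brake_reaction_agent(gap_m, speed_mps, time_since_cut_in_s,
                         reaction_time_s=1.0, decel_mps2=4.0, dt=0.1):
    """Return the agent's speed after one simulation step of dt seconds."""
    if time_since_cut_in_s < reaction_time_s:
        return speed_mps                                      # hasn't reacted yet
    if gap_m < 0.5 * speed_mps ** 2 / decel_mps2 + 5.0:       # inside its comfort stopping distance
        return max(0.0, speed_mps - decel_mps2 * dt)          # apply the braking profile
    return speed_mps

# Example: a vehicle cut in 10 m ahead of an agent doing 10 m/s, 1.2 s ago.
print(brake_reaction_agent(gap_m=10.0, speed_mps=10.0, time_since_cut_in_s=1.2))
```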

Right, so hopefully I convinced you that behavior can be fairly complicated and this will not always produce a believable reaction, especially in complex interactive cases such as merges, lane changes, intersections, and so on. Right, so what could you do? You could learn an agent from real demonstrations. Well, you went and collected all this data in the world, you have a bunch of information of how vehicles, pedestrians behave, you can learn a model and use that.

Okay, so what is an agent? Let's look a little bit. An agent receives sensor information and maybe context about the environment. And it has a policy, it develops a reaction; it's a driver agent, so it applies acceleration and steering, then gets new sensor information, new map information, its place in the map, and it continues.

And if it's our own vehicle, then you also have a router, which is an explicit intent generator that says, well, the passenger wants to go over there, why don't we try to make a right turn now? So you also get an intent. And this is an agent; it could be in simulation, it could be in the real world, roughly this is the picture.

And this is an end-to-end agent; end-to-end learning is popular, right? To a first approximation, if you learn a good policy this way, you can apply it and have very believable agent reactions. And so I'm gonna tell you a little bit about work we did in this direction. We put a paper on arXiv about a month ago, I believe.

We took 60 hours of footage of driving and we tried to see how well we can imitate it using a deep neural network. And so one option is to do exactly the same to end-to-end agent policy, but we wanted to make our task easier. How? Well, we have a good perception system at Waymo, so why don't we use its products for that agent?

We can also simplify the input representation a bit; that is good, the task becomes easier. Controllers are well understood, so we can use an existing controller; no need to worry about accelerations and torques, we can just generate trajectories. Now, to understand the representation in a little more detail: this is our agent vehicle, which is the self-driving vehicle in this case, but it could be a simulation agent.

And we render an image with it at the center, and potentially we augment it by generating a little bit of rotation of the image, just so we don't over-bias the orientation a specific way. It's an 80-by-80-meter box, so we roughly see about 60 meters in front of us and 40 meters to the side in this setup.

And now we render a road map in this box, which is the map: which lanes you're allowed to drive in, where the traffic lights are, and generally at intersections we render which lanes are allowed to go into which lanes and how the traffic lights permit it or do not permit it.

Then you can render speed limits, the objects, the result of your perception system; you render your current vehicle where it believes it is, and you render the past history. So you give an image of where the agent has been in the last few steps. And last but not least, you render the intent, and the intent is where you want to go.

So, conditioned on this intent and this input, you want to predict the future waypoints for this vehicle. Right, so that's the task. And you can phrase it as a supervised learning problem: just learn a policy with this network that approximates what you've seen in the world, with 60 hours of data.
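
A rough sketch of this kind of rendered representation and supervised waypoint loss, in the spirit of the setup described here; the grid size, resolution, and channel set are simplified assumptions, not the exact Waymo inputs.

```python
# Illustrative top-down raster input and plain imitation loss, not the actual system.
import numpy as np

GRID = 80        # 80 x 80 cells
RES_M = 1.0      # 1 meter per cell, so roughly an 80 m x 80 m box around the agent

def render_scene(lane_cells, object_cells, past_poses, intent_cells):
    """Each channel is a binary raster here; a real system renders much richer features."""
    channels = {name: np.zeros((GRID, GRID), dtype=np.float32)
                for name in ("roadmap", "objects", "past_agent_poses", "intent")}
    for (r, c) in lane_cells:    channels["roadmap"][r, c] = 1.0
    for (r, c) in object_cells:  channels["objects"][r, c] = 1.0
    for (r, c) in past_poses:    channels["past_agent_poses"][r, c] = 1.0
    for (r, c) in intent_cells:  channels["intent"][r, c] = 1.0
    return np.stack(list(channels.values()))        # shape: (4, 80, 80) network input

def waypoint_loss(predicted_xy, logged_xy):
    """Plain supervised imitation: match the waypoints actually driven in the log."""
    return float(np.mean((np.asarray(predicted_xy) - np.asarray(logged_xy)) ** 2))

x = render_scene(lane_cells=[(40, c) for c in range(80)], object_cells=[(38, 55)],
                 past_poses=[(40, 18), (40, 19), (40, 20)], intent_cells=[(40, 79)])
print(x.shape, waypoint_loss([[1.0, 0.0], [2.1, 0.0]], [[1.0, 0.0], [2.0, 0.0]]))
```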

Of course, with learning agents, there is a well-known problem. It's identified in the DAgger paper, by Stéphane Ross, who is actually at Waymo now, and Andrew Bagnell. It's easy for small errors to compound over time: even if at each step you make a relatively good estimate, if you string 10 steps together, you can end up very far from where agents have been before.

Right, and there are techniques to handle this. One thing we did was synthesize perturbations. So you have your trajectory, and we deform the trajectory and force the vehicle to learn to come back to the middle of the lane. So that's something you can do; that's reasonable. Now, if you just have direct imitation based on supervision, here we are trying to pass a vehicle in the street, and it stops and never continues.
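
As a brief aside, here is a minimal sketch of the trajectory-perturbation idea just mentioned: laterally displace the start of a logged trajectory and blend back, so the model sees examples of recovering toward the lane center. The blending schedule is an illustrative choice, not the paper's exact scheme.

```python
# Illustrative perturbation augmentation for imitation learning.
import numpy as np

def perturb_trajectory(xy, lateral_offset_m=0.5, horizon=10):
    """xy: (N, 2) logged waypoints. Returns a perturbed copy that rejoins the original."""
    xy = np.asarray(xy, dtype=float).copy()
    n = min(horizon, len(xy))
    decay = np.linspace(1.0, 0.0, n)           # full offset at the start, zero at rejoin
    xy[:n, 1] += lateral_offset_m * decay      # shift the lateral (y) coordinate
    return xy

straight = np.stack([np.arange(12, dtype=float), np.zeros(12)], axis=1)
print(perturb_trajectory(straight)[:4])        # perturbed start, converging back to y = 0
```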

So now we did perturbations, and, well, it kind of ran through the vehicle. Right, so that's not enough. We need more; it's not actually an easy problem. So in addition to having this agent RNN, which essentially takes the past, creates a memory of its past decisions and keeps iterating, predicting multiple points in the future.

So it predicts the trajectory piecemeal in the future. How about we also learn about collisions and staying on the road and so on. So we augment the network, and now the network starts also predicting a mask for the road. And now we have a loss here. I don't know if I can point.

So here you have a road mask loss. You say, hey, if you generate motions that take you outside the road, that's probably not good. And there is a collision loss: the network also predicts where the road is and what the other agents will do in the future, and we try to make sure there are no collisions and that we stay on the road.
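
A toy sketch of what such augmented imitation losses could look like; the distance-based collision penalty and the unit loss weights are simplified stand-ins for the actual overlap losses in the paper.

```python
# Illustrative combined loss: imitation + off-road penalty + collision penalty.
import numpy as np

def imitation_loss(pred_xy, logged_xy):
    return np.mean(np.sum((pred_xy - logged_xy) ** 2, axis=-1))

def off_road_loss(pred_xy, on_road_fn):
    # on_road_fn(x, y) -> True if the point is on the drivable road mask.
    return np.mean([0.0 if on_road_fn(x, y) else 1.0 for x, y in pred_xy])

def collision_loss(pred_xy, other_agents_xy, radius_m=2.0):
    # Penalize predicted waypoints closer than radius_m to another agent's predicted position.
    d = np.linalg.norm(pred_xy[:, None, :] - other_agents_xy[None, :, :], axis=-1)
    return np.mean(np.maximum(0.0, radius_m - d.min(axis=1)))

pred = np.array([[1.0, 0.0], [2.0, 0.2], [3.0, 0.4]])
logged = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
others = np.array([[2.0, 0.5]])
total = (imitation_loss(pred, logged)
         + 1.0 * off_road_loss(pred, lambda x, y: abs(y) < 2.0)
         + 1.0 * collision_loss(pred, others))
print(round(float(total), 3))
```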

So you add this structural knowledge. That adds a lot more constraints to the system as it trains. So it's not just limited, but what it's explicitly seeing, it allows it to reason about things it has not explicitly seen as well. And so now, here's an example of us driving with this network.

And you can see that we're predicting the future with yellow boxes, and we're driving safely through intersections and complex scenarios. It actually handles a lot of scenarios very well. If you're interested, I welcome you to go read the paper. It handles most of the simple situations fine. So now, remember our past two approaches to passing a parked car.

One of them stops and never restarts. The other one hits the car. Now it actually handles it fine. And beyond that, afterwards, we can stop at the stop sign happily, which is the red line over there, and it does all of these operations. And what we did beyond this is, we took the system, as learned on imitation data, and we actually drove our real Waymo car with it.

So we took it to Castle, the Air Force Base staging grounds, and this is it driving a road it's never seen before and stopping at stop signs and so on. So that's all great. We could use it also in agent simulation world, and we could drive a car with it, but it has some issues.

So let's look on the left. So here it is driving, and then it was driving too fast, so because our range is limited, it didn't know it had to make a turn, and it overran the turn. So it just drove off the road. That's one thing that can happen.

So, you know, one area of improvement, more range. Here's just another time. So yellow is, by the way, what we did in the real world, and green is what we do in the simulation, in that example. And here, we're trying to execute a complex maneuver, a U-turn, we're sitting there, and we're gonna try to do it, and we almost do it, but not quite, and at least we end up in the driveway.

And there's other interactive situations. When they get really complex, this network also does not do too well, right? And so what does that tell us? Well, long tail came again in testing, right? There's, again, you can learn a policy for a lot of the common situations, but actually in testing, some of the things you really care about is the long tail.

You want to test the corner cases. You want to test the scenarios where someone is obnoxious and adversarial and does something not too kosher, right? So one way to think of it is this: this is the distribution of human behavior, and of course, it goes along multiple axes.

It could be aggressive and conservative, right? And then somewhere in between, you could be super expert driver and super inexperienced and somewhere in between, and so on. So our end-to-end model, it's fairly, it's an unbiased representation, meaning it could, in theory, learn any policy, right? I mean, you see everything you want to know about the environment, by and large.

But it's complex, and this is similar a bit to the models as well, some of the models we talked about before. You can end up with complex model if you have complex input. This is images that are 80 by 80 with multiple channels. It's a large input space. The model can have tens of millions of parameters.

Now, if you have an example, if you have a case where you have two or three examples in your whole 60 hours of driving, there's no guarantee that your 10 million parameter model will learn it well, right? And so it's really good when you have a lot of examples.

It's really trying to do well in those. And then you have the long tail. So what do you do? Well, we can improve the representation. We can improve our model. This is, there is a lot of room to keep evolving this. And then this area will keep expanding, right?

And that's one good direction. There is a lot of interesting questions how to do that, and we're working on a lot of them. There's actually some exciting work. Hopefully I get to share with you another time. Something else you can do, if you remember from my slide about the hybrid system, when you go to the long tail, you can do essentially a similar thing, which is simpler, biased, expert design input distribution that is much easier to learn with few examples.

You can also, of course, use expert design models. And so in this case, you still will produce something reasonable by inputting this human knowledge. And you could have many models. I mean, there's not one. You could just tune to various aspects of this distribution. You can have little models for all the aspects you care about.

You can mix and match, right? So that's another way to do it. So let me tell you about one such a model. So trajectory optimization agent. So we take inspiration from motion control theory, and we want to plan a good trajectory for the vehicle, the agent vehicle, and that satisfies a bunch of constraints and preferences, right?

And so one insight to this is that we already know what the agent did in the environment last time. So you have fairly strong idea about the intent. And that helps you when you specify the preferences. 'Cause you can say, okay, well, give me a trajectory that minimizes some set of costs, which are preferences on the trajectory, typically called potentials.

What is a potential? Well, at different parts of the trajectory, you can add these attractor potentials saying, well, try to go where you used to be before, for example. And that's the benefit of, in simulation, you have observed what was done. So this is a bit simpler. And of course, you can have repeller potential.

Don't hit things, don't run into vehicles, right? So to a first approximation, that's what it roughly looks like. And so now, where is the learning? Well, it's still a machine learning model. There is a parameterization: these potentials have parameters, like the steepness of this curve. Sometimes they're multidimensional, right?

There's a few parameters. Typically, we're talking a few dozen parameters or less. All right, and you can learn them too. So there is a technique called inverse reinforcement learning. You want to learn these parameters that produce trajectories that come close to the trajectories you've observed in the real world.

So if you pick a bunch of trajectories that represent a certain type of behavior, you want to tune the parameters so the model behaves like it. And then you want to generate reasonable trajectories, continuous, feasible, that satisfy this, right? And this is part of the optimization. You can actually solve this.
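
An illustrative sketch of such a trajectory-optimization agent: a cost built from an attractor potential toward the previously observed path and a repeller potential away from objects, with a couple of learnable weights. In the talk these weights are fit with inverse reinforcement learning; the brute-force grid fit below is only a crude stand-in for that idea.

```python
# Illustrative only: attractor/repeller trajectory cost and a crude weight fit to a demo.
import numpy as np

def trajectory_cost(traj, observed_traj, obstacles, w_attract, w_repel):
    attract = np.sum((traj - observed_traj) ** 2)                 # pull toward prior behavior
    dists = np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=-1)
    repel = np.sum(np.exp(-dists))                                # push away from objects
    return w_attract * attract + w_repel * repel

def fit_weights(demo, alternatives, observed, obstacles):
    # Pick the weights under which the demonstration looks best relative to alternatives.
    best, best_margin = None, -np.inf
    for w_a in (0.1, 1.0, 10.0):
        for w_r in (0.1, 1.0, 10.0):
            alt_costs = [trajectory_cost(t, observed, obstacles, w_a, w_r) for t in alternatives]
            margin = min(alt_costs) - trajectory_cost(demo, observed, obstacles, w_a, w_r)
            if margin > best_margin:
                best, best_margin = (w_a, w_r), margin
    return best

observed = np.stack([np.arange(5.0), np.zeros(5)], axis=1)   # previously observed straight path
demo = observed + np.array([0.0, 0.3])                       # demonstrator swings slightly wide
obstacles = np.array([[2.0, -0.5]])                          # parked object near the path
alternatives = [observed, observed + np.array([0.0, 2.0])]   # stay on path vs. swerve far out
# Expect a large repeller weight: that best explains why the demo moves away from the object.
print(fit_weights(demo, alternatives, observed, obstacles))
```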

And so then you can tune these agents. And here are some agents I want to show you. So this is a complex interactive scenario with two vehicles; you'll see two versions, on the left and on the right, and on the right is the aggressive guy. Blue is the agent. Red is our vehicle. We're testing in simulation.

And so let me play one more time. Once this ends, essentially on the left is the conservative driver. On the right is the aggressive driver. And they pass us, right? And they induce very different reactions in our vehicle. So the aggressive guy went and passed us and pushed us further into that lane and we merge much later.

In the other case, when you have a conservative driver, we are in front of them and they're not bugging us and we execute much earlier. We can switch into the right lane where we want to go. All right, so this is agents that can test your system well. Now you have different scenarios in this case, depending what agent you put in.

And I'll show you a few more scenarios. So it's not just a two-agent game. I mean, we can do things like merging from one side of the highway to the other. And this type of agent can generate fairly reasonable behaviors: it slows down for a known slow vehicle in front, lets the vehicles on the side pass, and still completes the mission.

And you can generate multiple futures with this agent. So here's an example again. On the right will be an aggressive guy. Right, and on the left was the more conservative person. The aggressive guy found a gap between the two vehicles and just went for it, right? And you can test your stack this way.

And one more I wanted to show you is aggressive motorcycle driving. So you can have an agent that tests your reaction to motorcycles that are weaving in the lane, right? So I guess, what's my takeaway from this story about testing and the long tail? You need a menagerie of agents at the moment, right?

So if you think of it, learning from demonstration is key. You can encode some simple models by hand, but the task of modeling agent behavior is complex and it's ultimately much better learned. And so here's the space of models. You can have non-learned ones: you can just replay the log like I showed, and you can hand-design trajectories for agents, for this reaction do this, for that reaction do that.

Then you can have the brake-and-swerve model, which mostly, if there's someone in front of an agent, just does a deterministic brake. Then trajectory optimization, which I just showed. Then our mid-to-mid model, and potentially an end-to-end top-down model, top-down meaning you have a top view of the environment.

There's many other representations possible. This is a very interesting space. Ultimately I wanted to show you there's many possible agents and they have different utility and they have different number of examples you need to train them with. And so one other takeaway I wanted to tell you is smart agents are critical for autonomy at scale.

This is something I truly believe working in the space. And this line of direction is exciting and ultimately one of the exciting problems that there's still a lot of interesting progress to be made. And why? Well you have accurate models of human behavior of drivers and pedestrians and they help you achieve several things.

First, you will make better decisions when you drive yourself. You'll be able to better anticipate what others will do, and that will be helpful. Second, you can develop a robust simulation environment with those insights, also very important. Third, well, our vehicle is also one more agent in the environment.

It's an agent we have more control over than the others, but a lot of these insights apply. And so this is very exciting and interesting. So I wanted to finish the talk with maybe a mental exercise. When you think of a system that is tackling a complex AI challenge like self-driving, what are the good properties for the system to have, and how do you think of a scalable system?

And to me there's this mental test. We want to grow and handle and bring our service to more and more environments, more and more cities. How do you scale to dozens or hundreds of cities? So as we talked about the long tail, each new environment can bring new challenges.

And they can be complex intersections in cities like Paris. There's our Lombard Street in San Francisco, I'm from there. There are narrow streets in European towns. There are all kinds of things; the long tail keeps coming as you keep driving in new environments. In Pittsburgh, people drive the famous Pittsburgh left; they take different precedence than usual.

The local customs of driving and behaving, all of this needs to be accounted for as you expand. And this makes your system potentially more complex and harder to tune to all environments. But it's important, because ultimately that's the only way you can scale. So what should a scalable process do?

So in my mind, let's say you have a very good self-driving system. I mean, this very much parallels the factory analogy; I'm just going to repeat it one more time. You take your vehicles, you put a bunch of Waymo cars out, and you drive a long time in that environment with drivers.

Maybe 30 days, maybe more, at least that long. And you collect all the data, right? And then your system should be able to improve a lot on the data you have collected, right? So drive a bunch, obviously you don't want to train the system too much in the real world while it's driving, but you want to train it after you've collected data about the environment.

So it needs to be trainable on collected data. It's very important for a system to be able to quantify, or for you to be able to elicit from it, whether it's incorrect or not confident, right? Because then you can take action. And this is an important property that I think people should think about when they design systems.

How do you elicit this? Then you can take an action. You can ask questions to raters, that's fairly legit. Typically active learning is a bit like this, right? So, and it's usually based on some amount of low confidence or surprise. That's the examples you want to send. And even better, the system could potentially directly update itself, and this is an interesting question.

How do systems update themselves in light of new knowledge? And we have a system that clearly does this, right? And typically you do it with reasoning. And what is reasoning, right? So, I have an answer. It is one answer, there is possibly others, right? But one way is you can check and enforce consistency of your beliefs, and you can look for explanations of the world that are consistent.

And so if you have a mechanism in the system that can do this, this allows the system to improve itself without necessarily being fed purely labeled data. It can improve itself on just collected data. And I think it's interesting to think of systems where you can do reasoning and the representations that these models need to have.

And last but not least, we need scalable training and testing infrastructure, right? This is part of the factory that I was talking about. I'm very lucky at Waymo to have wonderful infrastructure, and it allows this virtuous cycle to happen. Thank you. (audience applauding) - Up here, Kieran Strobel. Thank you so much for the talk, I really appreciate it.

So if you were to train off of image and LiDAR data, synthetic image and LiDAR data, would you weight the synthetic data differently than real-world data when training your models? - So there's actually a lot of interesting research in the field. There are people who train on simulators, but also train adaptation models that make simulator data look like real data.

Right? So you're essentially, you're trying to build consistency, or at least you're training on simulator scenarios, but if you learn a mapping from simulator scenes to real scenes, right, you could potentially train on the transformed simulator data already that's transformed with other models. There's many ways to do this, ultimately, right?

So achieving realism in a simulator is an open research problem, right? - I assume there are a lot of rules that you have to put into a system to be able to trust it, you know? And so how do you find the balance between these automatic models, like neural networks, where you're not quite sure what they would do, and rules, where you're sure, but it's not scalable?

- I mean, through lots and lots of testing and analysis, right? So you keep track of the performance of your models and you see where they come up short, right? And then those are the areas where you most need experts to complement, right? But the balance can change over time, right?

And it's a natural process of evolution, right? So evolving your system as you go. I mean, generally, you know, the ML part grows as the capabilities and the data sets grow, right? - So you stressed, at the end of both the first half and the second half of your talk, the importance of quantifying uncertainty in the predictions that your models are making.

So have you developed techniques for doing that with neural nets, or are you using some probabilistic graphical models or something? - I mean, a lot of the models are neural nets. There are many ways to capture this, actually. I'm just going to give a general answer; I'm not commenting specifically on what Waymo is doing.

I think, first of all, there are techniques in neural nets where they can predict their own uncertainty fairly well, right? Either directly regress the uncertainty for certain outputs, or use ensembles of networks or dropout or techniques like this that also provide a measure of uncertainty. Another way of doing uncertainty is to leverage constraints in the environment.

So if you have temporal sequences, right, you don't want, for example, objects to appear or disappear; generally unreasonable changes in the environment, or inconsistent predictions in your models, are good areas to look at. - I'm just wondering, do you guys train and deploy different models depending on where the car is driving, like what city, or do you train and deploy a single model that adapts to most scenarios?

- Well, ideally, you would have one model that adapts to most scenarios, and then complement as needed. - Yeah, so first off, thanks for your talk. I find the simulator work really, really exciting. And I was wondering if you could either talk more about, or maybe provide some insights into, simulating pedestrians.

'Cause as a pedestrian myself, I feel like my behavior is a lot less constrained than a vehicle. - Right. - And I imagine, I mean, there's an advantage in that you're sensing from a vehicle, and you kind of know, your sensors are for like first person from a vehicle, but not from a pedestrian.

- And that's correct. I mean, so if you want to simulate pedestrians far away in an environment, right, and you want to simulate them at very high resolution, right, and you've collected log data, you may not have the detailed data on that pedestrian. Right? At the same time, the subtle cues for that pedestrian matter less at that distance as well, because it's not like you observed them or reacted to them in the first place.

So there is an interesting question of what fidelity you need to simulate things at, right? And there are levels of realism in simulation that at some level need to parallel what your models are paying attention to. - Thank you for the talk. It was very interesting. Since you titled the talk around the long tail and talked about it, it makes me wonder, is the bulk of the problem solved?

Do you think, well, we're gonna have this figured out and within the next couple of years, there can be self-driving cars everywhere, or do you think it's closer to, you know, actually, it could be decades before we've really worked out everything necessary? What are your thoughts about the future?

- It's a bit hard to, that's a good question. It's a bit hard to give this prognosis. I think, I mean, I'm not completely sure. One thing I would say is it will take a while for self-driving cars to roll out at scale, right? So this is not a technology where you just turn the crank and it appears everywhere, right?

There's logistics and algorithms and all this tuning and testing needed to make sure it's really safe in the various environments. So it will take some time. - When you were talking about prediction, you mentioned looking at a context and saying if a person or if someone is looking at us, we can assume that they will behave differently than if they're not paying attention to what we're doing.

- Potentially. - Is that something you're actively doing? Do you take into consideration whether pedestrians or other participants in traffic are paying attention to your vehicles? - So I can't comment on our model designs too much, but I think there are generally cues one needs to pay attention to, and they're very significant.

I mean, you know, even when people drive, for example, there's someone sitting in the vehicle next to you waving, keep going, right? And these are natural interactions in the environment. That, you know, is something you need to think about. - In one of your, first of all, thank you, it's a really cool talk.

In one of your last slides, you talked about resolving certain uncertainties by the means of establishing a set of beliefs and checking to see if they were consistent in the-- - That's my own theory, by the way, right? But I feel that the concept of reasoning is underexplored in deep learning and what it means, right?

So if you read Tversky and Kahneman, Type 1 and Type 2 reasoning: we're really good at the instinctive mapping type of tasks, right, like some low- to mid- to maybe high-level perception up to a point, but the reasoning part with neural networks, and generally with models, that's a bit less explored.

I think it's, long-term, it's fruitful. That's my personal opinion, right? - I guess the question I was gonna ask is if you could elaborate on that concept in connection with the models you guys are working with, but I guess that's-- - So I'll give an example from current work, right?

And there's a lot of work on weakly supervised learning. - Sure. - And that's kind of been a big topic in 2018, and there were a lot of really strong papers, including by Google Brain and Anelia Angelova and her team and so on. And essentially, if you used to read the books about 3D reconstruction and geometry and so on, there are a bunch of rules that encode geometric expectations about the world.

So when you have video, and when you have 3D outputs in your models, there is certain amount of consistency. One example is ego motion versus depth estimation. There is a very strong constraint that if you predict the depth, and you predict the ego motion correctly, then you can reproject certain things, and they will look good, right?

And that's a very strong constraint. That's a consistency. You know this about the environment, you expect it. This can help train your model, right? And so more of this type of reasoning may be interesting. - You mentioned expert designed algorithms, and I was wondering, from your perspective, and also from Waymo's perspective, how important are those, say, non-machine-learning type algorithms, or non-machine-learning type approaches, to tackling the challenges of autonomous driving?

- Could you say one more time how important is, which aspect of them, Darrell? - Of expert designed algorithms. Every now and then, you just, you sprinkle in, like, here we can try expert designed algorithms, because we actually understand some parts of the problem, and I was wondering, like, what is really important for the challenges in autonomous driving outside of the field of machine learning?

- I mean, generally, the problem is you want to be safe in the environment. That makes it such that you don't want to make errors in perception, prediction, and planning, right? And the state of machine learning is not at the point where it never makes errors, given the scope that we're currently addressing.

And so throughout your stack, with the current state of machine learning, it needs to be complemented, right? And so we've carefully done that, and I think as machine learning improves, there will be less and less need to do it. It's somewhat effort-intensive, especially in an evolving system, to do that, to have a hybrid system.

But right now, I think this is the main thing that keeps you able to do complex behaviors in some cases, for which it's very hard to collect data, and you still need to handle. Then it's the right thing to do, right? So the way I view it, I'm a machine learning person, I like to do better and better.

That said, we're not religious about it; it should not be. We just need to solve the problem, and right now, the right mix is a hybrid system, that's my belief. - Well, we're really excited to see what Waymo has in store for us in '19. So please give Drago a big hand.

(audience applauding) (audience cheering)