
Karl Iagnemma & Oscar Beijbom (Aptiv Autonomous Mobility) - MIT Self-Driving Cars


Chapters

0:00 Introduction to Karl Iagnemma and Oscar Beijbom
1:00 Karl - Aptiv Background
10:18 Dimensions of Safety for AVs
12:47 Trusting neural networks behind the wheel
15:07 Validation of black-box systems
17:50 Trusting the data
19:27 Trusting the algorithms
22:27 Safety architecture for neural networks
25:20 Engineering is inching closer to the natural sciences
25:57 Oscar - DL for 3D Detection
30:06 PointPillars
39:51 nuScenes - a dataset for multimodal 3D object detection
43:17 Q&A

Transcript

All right, welcome back to 6.S094, Deep Learning for Self-Driving Cars. Today we have Karl Iagnemma and Oscar Beijbom from Aptiv. Karl is the president of Aptiv Autonomous Mobility, where Oscar is the machine learning lead. Karl founded nuTonomy, as many of you know, in 2013. It's a Boston-based autonomous vehicle company, and nuTonomy was acquired by Aptiv in 2017, and now is part of Aptiv.

Karl and team are one of the leaders in autonomous vehicle development and deployment, with cars on roads all over the United States, several sites. But most importantly, Karl is MIT through and through, as also some of you may know, getting his PhD here. He led a robotics group here as a research scientist for many years.

So it's really a pleasure to have both Karl and Oscar with us today. Please give them a warm welcome. (audience applauding) - All right, thanks, Lex. Yeah, very glad to be back at MIT. Very impressed that you guys are here during IAP. My course load during IAP was usually ice skating, and sometimes there was a wine tasting course.

This was now almost 20 years ago, and that was pretty much it. That's where the academic work stopped. So you guys are here to learn something, so I'm gonna do my best and try something radical, actually. Since I'm president now of Aptiv Autonomous Driving, I'm not allowed to talk about anything technical or interesting.

I'm gonna flout that a little bit and raise some topics that we think about that I think are interesting questions to keep in the back of your mind as you're thinking about deep learning and autonomous driving. So I'll raise some of those questions. And then Oscar will actually present some real-life technology and some of the work that he has been doing.

Oscar's our machine learning lead. Some of the work that he and his outstanding team have been doing around machine learning-based detectors for the perception problem. So let me first introduce Aptiv a little bit, 'cause people usually ask me, like, what's an Aptiv when I say I work for Aptiv?

Aptiv's actually been around for a long time, but in a different form. Aptiv was previously Delphi, which was previously part of General Motors. So everybody's heard of General Motors. Some of you may have heard of Delphi; Aptiv spun from Delphi about 14 months ago. And so Aptiv's a tier one supplier.

They're an automotive company that industrializes technology. Essentially, they take software and hardware, they industrialize it and put it on cars so it can run for many, many hundreds of thousands of miles without failing, which is a useful thing when we think about autonomous driving. So the themes for Aptiv, they develop what they say is safer, greener, and more connected solutions.

Safer means safety systems, active safety, autonomous driving systems of the type that we're building. Greener, systems to enable electrification and kind of green vehicles. And then more connected connectivity solutions, both within the vehicle, transmitting data around the vehicle, and then externally, wireless communication. All of these things, as you can imagine, feed very, very nicely into the future transportation systems that the software will actually only be a part of.

So Aptiv is in a really interesting spot when you think about the future of autonomous driving. And to give you a sense of scale, still kind of amazes me. The biggest my research group ever got at MIT was like 18 people. Aptiv is 156,000 employees, so a significant-sized organization, about a $13 billion company by revenue, in about 50 countries around the world.

My group's about 700 people, so of which Oscar is one very important person. We're about 700 working on autonomous driving. We've got about 120 cars on the road in different countries, and I'll show you some examples of that. But first, let me take a trip down memory lane and show you a couple of snapshots about where we were not too long ago kind of as a community, but also me personally.

And this will either inspire or horrify you, I'm not sure which. The fact is 2007, there were groups driving around with cars like running blade servers in the trunk that were generating so much heat, you had to install another air conditioner, which then was drawing so much power, you had to add another alternator, and then kind of rinse and repeat.

So it wasn't a great situation. But people did enough algorithmically, computationally, to enable these cars, and this is the DARPA Urban Challenge for those of you that may be familiar, to enable these cars to do something useful and interesting on a closed course. And it kind of convinced enough people that given enough devotion of thought and resources that this might actually become a real thing someday.

So I was one of those people that got convinced. 2010, this is now, I'm gonna crib from my co-founder Emilio, who was a former MIT faculty member in AeroAstro. Emilio started up an operation in Singapore through SMART, the Singapore-MIT Alliance for Research and Technology, who some of you have probably worked with. So this is some folks from SMART.

That's James, who looks really young in that picture. He was one of Emilio's students who was basically taking a golf cart and turning it into an autonomous shuttle. It turned out to work pretty well, and it got people in Singapore excited, which in turn got us further excited. 2014, they did a demo where they let people of Singapore come and ride around in these carts in a garden, and that worked great over the course of a weekend.

Around this time, we'd started nuTonomy. We'd actually started a commercial enterprise. We'd kind of stepped at least partly away from MIT at that point. 2015, we had cars on the road. This is a Mitsubishi i-MiEV electric vehicle. When we had all of our equipment in it, the front seat was pushed forward so far that me, I'm about six foot three, actually couldn't sit in the front seat, so I couldn't actually accompany people on rides.

It wasn't very practical. We ended up switching cars to a Renault Zoe platform, which is the one you see here, which had a little more leg room. We were giving, at that point, open to the public rides in our cars in Singapore in the part of the city that we were allowed to operate in.

It was a quick transition. As you can see, just even visually, the evolution of these systems has come a long way in a short time, and we're just a point example of this phenomenon, which is kind of, broadly speaking, similar across the industry. But 2017, we joined Aptiv, and we were excited by that because we, as primarily scientists and technologists, didn't have a great idea how we were gonna industrialize this technology and actually bring it to market and make it reliable and robust and make it safe, which is what I'm gonna talk about a little bit here today.

So we joined Aptiv with its global footprint. Today, we're primarily in Pittsburgh, Boston, Singapore, and Vegas, and we've got connectivity to Aptiv's other sites in Shanghai and Wolfsburg. Let me tell you a little bit about what's happening in Vegas. I think people were here, when was Luc talking? Couple days ago, yesterday.

So Luc from Lyft, Luc Vincent, probably talked a little bit about Vegas. Vegas is really an interesting place for us. We've got a big operation there, 130,000 square foot garage. We've got about 75 cars. We've got 30 of those cars on the Lyft network. So Aptiv technology, but connecting to the customer through Lyft.

So if you go to Vegas and you open your Lyft app, it'll ask you, do you wanna take a ride in an autonomous car? You can opt in, you can opt out, it's up to you. If you opt in, there's a reasonable chance one of our cars will pick you up if you call for a ride.

So anybody can do this, competitors, innocent bystanders, totally up to you, we have nothing to hide. Our cars are on the road 20 hours a day, seven days a week. If you take a ride, when you get out of the car, just like any Lyft ride, you gotta give us a star rating, one through five.

And that, to us, is actually really interesting because it's a scalar, it's not too rich, but that star rating, to me, says something about the ride quality, meaning the comfort of the trip, the safety that you felt, and the efficiency of getting to where you wanted to go. Our star rating today is 4.95, which is pretty good.

Key numbers, we've given, at this point, over 30,000 rides to more than 50,000 passengers. We've driven over a million miles in Vegas and a little bit additional, but primarily there. And as I mentioned, the 4.95. So what does it look like on the road? I'll show just one video today.

I think Oscar has a few more. This one's actually in Singapore, but it's all kind of morally equivalent. You'll see a sped up, slightly sped up view of a run from, this is now probably six, seven months old, on the road in Singapore, but it's got some interesting stuff in a fairly typical run.

Some of you may recognize these roads. We're on the wrong side of the road, remember, 'cause we're in Singapore. But to give you an example of some of the types of problems we have to solve on a daily basis. So let me run this thing. And you'll see as this car is cruising down the road, you have obstacles that we have to avoid, sometimes in the face of oncoming traffic.

We've got to deal with sometimes situations where other road users are maybe not perfectly behaving by the rules. We've got to manage that in a natural way. Construction in Singapore, like everywhere else, is pretty ubiquitous. And so you have to navigate through these less structured environments. People who are sometimes doing things or indicating some future action, which you have to make inferences about, that can be tricky to navigate.

So typical day, a route that any one of us as humans would drive through without batting an eye, no problem, actually presents some really, really complex problems for autonomous vehicles. But it's the table stakes these days. These are the things you have to do if you want to be on the road, and certainly if you want to drive millions of miles with very few accidents, which is what we're doing.

So that's an introduction to Aptiv and a little bit of background. So let me talk about, we're gonna talk about learning and how we think about learning in the context of autonomous driving. So there was a period a few years ago where I think as a community, people thought that we would be able to go from pixels to actuator commands with a single learned architecture, a single black box.

I'll say, generally speaking, we no longer believe that's true. And I shouldn't include we in that. I didn't believe that was ever true. But some of us maybe thought that was true. And I'll tell you part of the reason why, in part of this talk, a big part of it comes down to safety.

And the question of safety, convincing ourselves that that system, that black box, even if we could train it to accurately approximate this massively complex underlying function that we're trying to approximate, can we convince ourselves that it's safe? And it's very, very hard to answer that question affirmatively.

And I'll raise some of the issues around why that is. This is not to say that learning methods are not incredibly useful for autonomous driving because they absolutely are. And Oscar will show you examples of why that is and how Aptiv is using some learning methods today. But this safety dimension is tricky because there's actually two axes here.

One is the actual technical safety of the system, which is to say, can we build a system that's safe, that's provably in some sense safe, that we can validate, which we can convince ourselves achieves the intended functionality in our operational design domain, that adheres to whatever regulatory requirements might be imposed in the jurisdictions that we're operating in.

And there's a whole longer list related to technical safety. But these are technical problems primarily. But there's another dimension, which up here is called perceived safety, which is to say, when you ride in a car, even if it's safe, do you believe that it's safe? And therefore, will you wanna take another trip?

Which sounds kinda squishy, and as engineers, we're typically uncomfortable with that kind of stuff, but it turns out to be really important and probably harder to solve because it's a little bit squishy. And quite obviously, we gotta sit up here, right? We gotta be in this upper right-hand corner where we have not only a very safe car from a technical perspective, but one that feels safe, that inspires confidence in riders, in regulators, and in everybody else.

So how do we get there in the context of elements of this system that may be black boxes, for lack of a better word? What's required is trust. You know, how do we get to this point where we can trust neural networks in the context of safety-critical systems, which is what an autonomous vehicle is?

It really comes down to this question of, how do we convince ourselves that we can validate these systems? Again, validating the system, ensuring that it can meet the requirements, the operational requirements in the domain of interest that are imposed by the user, all right? There's three dimensions to this key question of understanding how to validate, and I'm gonna just briefly introduce some topics of interest around each of these.

But the first one, trusting the data. Trusting the data. Do we actually have confidence about what goes into this algorithm? I mean, everybody knows garbage in, garbage out. There's various ways that we can make this garbage. We can have data which is insufficiently covering our domain, not representative of the domain.

We can have data that's poorly annotated by our third-party trusted partners who we've trusted to label certain things of interest. So do we trust the data that's going in to the algorithm itself? Do we trust the implementation? We've got a beautiful algorithm, super descriptive, super robust, not brittle at all, well-trained, and we're running it on poor hardware.

We've coded it poorly. We've got buffer overruns right and left. Do we trust the implementation to actually execute in a safe manner? And do we trust the algorithm? Again, generally speaking, we're trying to approximate really complicated functions. I don't think we typically use neural networks to approximate linear systems.

So this is a gnarly, nasty function which has problems of critical interest which are really rare. In fact, they're the only ones of interest. So there's these events that happen very, very infrequently that we absolutely have to get right. It's a hard problem to convince ourselves that the algorithm is gonna perform properly in these unexpected and rare situations.

So these are the sorts of things that we think about and that we have to answer in an intelligent way to convince ourselves that we have a validated neural network-based system. Okay, let me just step through each of these topics really quickly. So the topic of validation, what do we mean by that and why it is hard?

There's a number of different dimensions here. The first is that we don't have insight into the nature of the function that we're trying to approximate. The underlying phenomenon is really complicated. Again, if it weren't, we'd probably be modeling it using different techniques. We'd write a closed-form equation to describe it.

So that's a problem. Second, again, the accidents, the actual crashes on the road (we should say crashes and not accidents), these are rare. Luckily, they're very rare. But it makes the statistical argument around these accidents and being able to avoid these accidents really, really difficult. If you believe RAND, and they're pretty smart folks, they say you gotta drive 275 million miles without accident, without a crash, to claim a lower fatality rate than a human with 95% confidence.
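
To get a feel for where a number like that comes from, here is the usual back-of-the-envelope Poisson argument (an assumption on my part; the full RAND methodology is more careful): with zero fatalities observed over n miles, you need exp(-rate * n) to fall below 5% to beat a human fatality rate of roughly 1.09 per 100 million miles with 95% confidence.

```python
import math

# Rough sanity check of the figure quoted above; the rate and the
# zero-failure bound are assumptions, not the RAND report itself.
human_fatality_rate = 1.09e-8        # ~1.09 fatalities per 100 million miles
confidence = 0.95

# With zero fatalities over n miles, require exp(-rate * n) <= 1 - confidence.
required_miles = -math.log(1 - confidence) / human_fatality_rate
print(f"{required_miles / 1e6:.0f} million miles")  # prints roughly 275 million
```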

Well, how are we gonna do that? Can we think about using some correlated incident, maybe some kind of close call, as a proxy for accidents, which may be more frequent, and maybe back in that way? There's a lot of questions here, which I won't say we don't have any answers to, 'cause I wouldn't go that far, but they're hard questions.

They're not questions with obvious answers. So this is one of them, this issue of rare events. The regulatory dimension is one of these known unknowns. How do we validate a system if the requirements that may be imposed upon us from outside regulatory bodies are still to be written? That's difficult.

So there's a lack of consensus on what the safety target should be for these systems. This is obviously evolving. Smart people are thinking about this. But today, it's not at all clear. If you're driving in Las Vegas, if you're driving in Singapore, if you're driving in San Francisco, or anywhere in between, what this target needs to be.

And then lastly, and this is a really interesting one, we can get through a validation process for a build of code. Let's assume we can do that. Well, what happens when we wanna update the code? 'Cause obviously we will. Does that mean we have to start that validation process again from scratch, which will unavoidably be expensive and lengthy?

Well, what if we only change a little bit of the code? What if we only change one line? But what if that one line is the most important line of code in the whole code base? This is one that I can tell you keeps a lot of people up at night, this question of revalidation.

And then not even, again, keep that code base fixed. What if we move from one city to the next? And let's say that city is quite similar to your previous city, but not exactly the same. How do we think about validation in the context of new environments? So this continuous development issue is a challenge.

All right, let me move on to talking about the data. There's probably people in this room who are doing active research in this area 'cause it's a really interesting one. But there's a couple of obvious questions, I would say, that we think about when we think about data. We can have a great algorithm, and if we're training it on poor data for one reason or another, we won't have a great output.

So one thing we think about is the sufficiency, the completeness of the data, and the bias that may be inherent in the data for our operational domain. If we wanna operate 24 hours a day, and we only train on data collected during daytime, we're probably gonna have an issue.

Annotating the data is another dimension of the problem. We can collect raw data that's sufficient, that covers our space, but when we annotate it, when we hand it off to a third party, 'cause it's typically a third party, to mark up the interesting aspects of it, we provide them some specifications, but we put a lot of trust in that third party, and trust that they're gonna do a good job annotating the interesting parts, and not the uninteresting parts, that they're gonna catch all the interesting parts that we've asked them to catch, et cetera.

So this annotation part, which seems very mundane, very easy to manage, and kind of like low-hanging fruit, is in fact another key aspect of ensuring that we can trust the data. Okay, and this reference just kind of points to the fact that there are, again, smart people thinking about this problem, which rears its head in many domains beyond autonomous driving.

Now what about the algorithms themselves? So moving on from the data to the actual algorithm, how do we convince ourselves that that algorithm, that like any kind of learning-based algorithm, we've trained on a training set, is gonna do well on some unknown test set? Well, there's a couple kind of properties of the algorithm that we can look at, that we can kind of interrogate, and kind of poke at to convince ourselves that that algorithm will perform well.

You know, one is invariance, and the other one, we can say, is stability. If we make small perturbations to this function, does it behave well? Given kind of, let's say, a bounded input, do we see a bounded output? Or do we see some wild response? You know, I'm sure you've all heard of examples of adversarial images that can confuse learning-based classifiers.

So it's a turtle. You show it a turtle, it says, "Well, that's a turtle." And then you show it a turtle that's maybe fuzzed with a little bit of noise that the human eye can't perceive. So it still looks like a turtle, and it tells you it's a machine gun.

Obviously, for us in the driving domain, we want a stop sign to be correctly identified as a stop sign 100 times out of 100. We don't want that stop sign, if somebody goes up and puts a piece of duct tape in the lower right-hand corner, to be interpreted as a yield sign, for example.
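
One way to make that stability notion concrete is to empirically probe a trained classifier with small bounded perturbations and measure how often the predicted label flips. This is a minimal sketch, assuming a PyTorch image classifier; `model` and `image` are placeholders, not any specific Aptiv component, and random noise only probes average-case stability, whereas adversarial examples are found by searching for the worst case.

```python
import torch

def label_flip_rate(model, image, epsilon=2/255, trials=100):
    """Fraction of small L-infinity perturbations that change the predicted
    label. `model` returns class logits; `image` is a (1, C, H, W) tensor
    with values in [0, 1]. Purely an illustrative stability probe."""
    model.eval()
    flips = 0
    with torch.no_grad():
        baseline = model(image).argmax(dim=1)
        for _ in range(trials):
            noise = torch.empty_like(image).uniform_(-epsilon, epsilon)
            perturbed = (image + noise).clamp(0.0, 1.0)
            if model(perturbed).argmax(dim=1).item() != baseline.item():
                flips += 1
    return flips / trials
```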

So this question of the properties of the algorithm, its invariance, its stability, is something of high interest. And then lastly, to add one more point to this, this notion of interpretability. So interpretability, understanding why an algorithm made a decision that it made. This is the sort of thing that may not be a nice-to-have, may actually be a requirement, and would likely be a requirement from the regulatory groups that I was referring to a minute ago.

So let's say, imagine the case of a crash, where the system that was governing your trajectory generator was a data-driven system, was a deep-learning-based trajectory generator. Well, you may need to explain to someone exactly why that particular trajectory was generated at that particular moment. And this may be a hard thing to do, if the generator was a data-driven model.

Now, obviously, there are people working and doing active research into this specific question of interpretable learning methods, but it's a thorny one. It's a very, very difficult topic, and it's not at all clear to me when and if we'll get to the stage where we can, to even a technical audience, but beyond that, to a lay jury, be able to explain why algorithm X made decision Y.

Okay, so with all that in mind, let me talk a little bit about safety. That all maybe sounds pretty bleak. You think, well, man, why are we taking this course with Lex, 'cause we're never gonna really use this stuff. But in fact, we can. We can and will, as a community.

There's a lot of tools we can bring to bear to think about neural networks, and they're, generally speaking, within the context of a broader safety argument. I think that's the key. We tend not to think about using a neural network as a holistic system to drive a car, but we'll think about it as a submodule that we can build other systems around, systems about which we can, generally speaking, make more rigorous claims about their performance and their underlying properties, and then therefore make a convincing, holistic safety argument that this end-to-end system is safe.

We have tools, functional safety is, maybe familiar to some of you. It's something we think about a lot in the automotive domain. And SOTIF, which stands for Safety of the Intended Functionality, we're basically asking ourselves the question, is this overall function doing what it's intended to do? Is it operating safely?

And is it meeting its specifications? There's kind of an analogy here to validation and verification, if you will. And we have to answer these questions around functional safety and SOTIF affirmatively, even when we have neural network-based elements in order to eventually put this car on the road. All right, so I mentioned that we need to do some embedding.

This is an example of what it might look like. We refer to this as, sometimes we call this caging the learning. So we put the learning in a box. It's this powerful animal we wanna control. And in this case, it's up there at the top in red. That might be that trajectory proposer I was talking about.

So let's say we've got a powerful trajectory proposer. We wanna use this thing. We've got it on what we call our performance compute, our high-powered compute. It's maybe not automotive grade. It's got some potential failure modes, but it's generally speaking, good performance. Let's go there. And we've got our neural network-based generator on it, which we can say some things about, but maybe not everything we'd like to.

Well, we make the argument that if we can surround that, so if we can cage it, kind of underpin it with a safety system that we can say very rigorous things about its performance, then generally speaking, we may be okay. There may be a path to using neural networks on autonomous vehicles if we can wrap them in a safety architecture that we can say a lot of good things about.
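
As a rough sketch of that caging pattern, under the assumption that the learned component proposes trajectories and a simpler, auditable layer gets the final say; all names here (the checker, the fallback planner, the thresholds) are hypothetical, not Aptiv's actual interfaces.

```python
def trajectory_is_safe(world_state, trajectory,
                       min_clearance_m=1.0, max_speed_mps=15.0):
    """Rule-based checks we can reason about rigorously, independent of
    how the proposal was generated. Thresholds are illustrative."""
    for point in trajectory:
        if point.speed > max_speed_mps:
            return False
        if world_state.min_obstacle_distance(point.position) < min_clearance_m:
            return False
    return True

def select_trajectory(world_state, neural_proposer, fallback_planner):
    candidate = neural_proposer(world_state)        # black-box proposal
    if trajectory_is_safe(world_state, candidate):  # the "cage"
        return candidate
    return fallback_planner(world_state)            # conservative, verified behavior
```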

And this is exactly what this represents. So I'm gonna conclude my part of the talk here, hand it over to Oscar, with kind of a quote, an assertion that one of my engineers insisted I show today. The argument is the following. Engineering is inching closer to the natural sciences. I won't say how much closer, but closer.

We're creating things that we don't fully understand, and then we're investigating the properties of our creation. We're not writing down closed-form functions. That would be too easy. We're generating these immensely complex function approximators, and then we're just poking at 'em in different ways and saying, boy, well, what does this thing do under these situations?

And I'll leave you with one image, which I'll present without comment, and then hand it over to Oscar. All right, thank you. (audience applauding) - So thanks a lot, Karl. Thanks, Lex, for the invite. Yes, my name is Oscar. I run the machine learning team at Aptiv nuTonomy. So let me begin with this slide.

You know, not long ago, image classification was, you know, quite literally a joke. So this is an actual comic. How many have seen this before? Okay, well, I was doing my PhD in this era where, you know, building a bird classifier was like a PhD project, right? And it was, you know, it's funny 'cause it's true.

And then, of course, as you well know, the deep learning revolution happened, and Lex's, you know, previous introductory slides give a great overview. I don't wanna redo that. I just wanna draw sort of a straight line from what I consider the breakthrough paper by Krizhevsky et al. to the work I'll be talking about today. I'll start with these three.

So you had the, you know, deep learning, end-to-end learning for ImageNet classification by Krizhevsky et al. That paper's been cited 35,000 times. I checked yesterday. Then, 2014, Ross Girshick et al. at Berkeley basically showed how to, you know, repurpose the deep learning architecture to do detection in images.

And that was the first time when the visual community really started seeing, okay, so classification is more general. You can classify anything, an image, an audio signal, whatever, right? But detection in images was very intimate to the computer vision community. We thought we were best in the world, right?

So when this paper came out, that was sort of the final argument for like, okay, we all need to do deep learning now. Right, and then 2016, this paper came out, the single-shot multi-box detector, which I think is a great paper by Liu et al. So if you haven't looked at this paper, by all means, read it carefully.

So as a result, you know, performance is no longer a joke, right? So this is a network that we developed in my group. So it's a joint image classification segmentation network. This thing, we can run this at 200 hertz on a single GPU. And in this video, in this rendering, there's no tracking applied.

There's no temporal smoothing. Every single frame is analyzed independently from the other one. And you can see that we can model several different classes, you know, both boxes and the surfaces at the same time. Here's my cartoon drawing of a perception system on an autonomous vehicle. So you have the three different main sensor modalities.

Typically have some module that does detection and tracking. You know, there's tons of variations of this, of course, but you have some sort of sensor pipelines, and then in the end, you have a tracking and fusion step. So what I showed you in the previous video is basically this part.

So like I said, there was no tracking, but it's like going from the camera to detections. And if you look, you know, when I started, so I come straight from the computer science, machine learning community, so when I started looking at this pipeline, I'm like, why are there so many steps?

Why aren't we optimizing things end to end? So obviously, there's a real temptation to just wrap everything in a kernel. It's a very well-defined input/output function. And like Karl alluded to, it's one that can be verified quite well, assuming you have the right data. I'm not gonna be talking about this.

I am gonna talk about this, namely building a deep learning kernel for the LiDAR pipeline. And the LiDAR pipeline is arguably the backbone of the perception system for most autonomous driving systems. So what we're gonna do is, so this is basically gonna be the goal here. So we're gonna have a point cloud as input, and we're gonna have a neural network that takes that as input and then generates 3D bounding boxes that are in a world coordinate system.

So it's like 20 meters that way, it's two meters wide, so long, with this rotation and this orientation and so on. So yeah, so that's what this talk is about. So I'm gonna talk about PointPillars, which is a new method we developed for this, and nuScenes, which is a benchmark dataset that we released.
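
To make that output format concrete, here is one way such a 3D box could be represented; the field names are purely illustrative, not a specific library's format.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """Illustrative parameterization of a detected 3D box in a world frame."""
    x: float       # center position, meters
    y: float
    z: float
    length: float  # box extents, meters
    width: float
    height: float
    yaw: float     # rotation about the vertical axis, radians
    label: str     # e.g. "car", "pedestrian"
    score: float   # detector confidence
```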

Okay, so what is PointPillars? Well, it's a novel point cloud encoder. So what we do is we learn a representation that is suitable for downstream detection. It's almost like a, the main innovation is the translation from the point cloud to a canvas that can then be processed by a similar architecture that you would use on an image.

And I'll show you how it outperforms, you know, all published methods on KITTI by a large margin, especially with respect to inference speed. And there's a preprint out and some code available if you guys wanna play around with it. So the architecture that we're gonna use looks something like this.

And I should say, most papers in this space use this architecture. So it's kind of a natural design, right? So you have the point cloud at the top, you have this encoder, and that's where we introduce the point pillars, but you can have, I'll show you guys, you can have various types of encoders.

And then after that, that feeds into a backbone, which is now a standard convolutional 2D backbone. You have a detection head, and you might have, you may or may not have a segmentation head on that. The point is that after the encoder, everything looks just like, the architecture's very, very similar to the SSD architecture or the RCNN architecture.

So let's go into a little bit more detail, right? So the range, so what you're given here is a range of D meters, so you wanna model, you know, 40 meters, a 40 meter circle around the vehicle, for example. You have certain resolution of your bins, and then a number of output channels, right?

So your input is a set of pillars, where a pillar here is a vertical column, right? So you have some number M of those that are non-empty in the space. And you say a pillar P contains all the points, each of which is given by x, y, z, and intensity.

And there are N_m points in each pillar, indexed by m, just to say that it varies, right? So it could be one single point at a particular location, it could be 200 points. And then it's centered around this bin. And the goal here is to produce a tensor of a fixed size.

So it's height, which is, you know, range over resolution, width, which is again range over resolution, and then this parameter C. C is the number of channels, so in an image, C will be three. We don't necessarily care about that. We call it a pseudo-image, but it's the same thing.

It's a fixed number of channels that the backbone can then operate on. Yeah, so here's the same thing without math, right? So you have a lot of points, and then you have this space where you just grid it up in these pillars, right? Some are empty, some are not empty.
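
A minimal sketch of that gridding step, assuming a NumPy point cloud of shape (N, 4) holding x, y, z, intensity; the range and resolution values are illustrative, not the paper's exact settings.

```python
import numpy as np

def assign_pillars(points, x_range=(-40.0, 40.0), y_range=(-40.0, 40.0),
                   resolution=0.25):
    """Group LiDAR points into pillars (vertical columns of the x-y grid).
    Returns a dict mapping (row, col) pseudo-image indices to the list of
    points falling in that pillar; only non-empty pillars appear."""
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    ix = ((pts[:, 0] - x_range[0]) / resolution).astype(np.int32)
    iy = ((pts[:, 1] - y_range[0]) / resolution).astype(np.int32)
    pillars = {}
    for point, i, j in zip(pts, ix, iy):
        pillars.setdefault((int(i), int(j)), []).append(point)
    return pillars
```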

So in this sort of, with this notation, let me give a little bit of a literature review. What people tend to do is you take each pillar, and you divide it into voxels, right? So now you have a 3D voxel grid, right? And then you say, I'm gonna extract some sort of features for each voxel.

For example, how many points are in this voxel? Or what is the maximum intensity of all the points in this voxel? Then you extract features for the whole pillar, right? What is the max intensity across all the points in the whole pillar, right? All of these are hand-engineered functions that generate a fixed-length output.

So what you can do is you can now concatenate them, and the output is this x, y, z tensor. So then, VoxelNet came around, I'd say, a year or so ago. Maybe a little bit more by now. So they do the first, the first step is similar, right? So you divide each pillar into voxels, and then you take, you map the points in each voxel.

And the novel thing here is that they got rid of the feature engineering. So they said, we'll map it from a voxel to features using a PointNet. And I'm not gonna get into the details of a PointNet, but it's basically a network architecture that allows you to take a point cloud and map it to, again, a fixed length representation.

So it's a series of 1D convolutions and max pooling layers. It's a very neat paper, right? So what they did is they, okay, we say we apply that to each voxel, but now I end up with this awkward four-dimensional tensor 'cause I still have XYZ from the voxels, and then I have this C-dimensional output from the PointNet.

So then they have to consolidate the Z dimension through a 3D convolution, right? And now you achieve your XYZ tensor. So now you're ready to go. So it's very nice in the sense that it's end-to-end method. They showed good performance, but at the end of the day, it was very slow.

They got like five hertz runtime. And the culprit here is this last step, so the 3D convolution. It's much, much slower than a standard 2D convolution. All right, so here's what we did. We basically said, let's just forget about voxels. We'll take all the points in the pillar and we'll put it straight through PointNet.

That's it. So just that single change gave a 10- to 100-fold speedup from VoxelNet. And then we simplified the PointNet. So now, instead of having, so PointNet can have several layers and several modules inside it. So we simplified it to a single 1D convolution and max pooling layer. And then we showed you can get a really fast implementation by taking all your pillars that are not empty, stack them together into a nice, dense tensor with a little bit of padding here and there.

And then you can run the forward pass with a single, you can pose it as a 2D convolution with a one-by-one kernel. So the final encoder runtime is now 1.3 milliseconds, which is really, really fast. So the full method looks like this. So you have the point cloud, you have this pillar feature net, which is the encoder.

So the different steps there, that feeds straight into the backbone and your detection heads. And there you go. So it's still a multi-stage architecture, but of course the key is that none of the steps are, all the steps are fully parameterized. And we can back propagate through the whole thing and learn it.
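
Here is a hedged sketch of that simplified pillar encoder in PyTorch: a single linear layer (equivalent to a 1x1 convolution) plus a max over the points in each pillar, followed by a scatter back onto the pseudo-image. The per-point feature decoration and exact dimensions from the paper are omitted, so treat this as an illustration rather than the published implementation.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Simplified PointPillars-style encoder: linear + max per pillar,
    then scatter to a C x H x W pseudo-image for a standard 2D backbone."""

    def __init__(self, in_features=4, out_channels=64):
        super().__init__()
        self.linear = nn.Linear(in_features, out_channels)
        self.bn = nn.BatchNorm1d(out_channels)

    def forward(self, pillar_points, pillar_mask, pillar_index, grid_hw):
        # pillar_points: (P, N, D) padded points of the P non-empty pillars
        # pillar_mask:   (P, N) 1 for real points, 0 for padding
        # pillar_index:  (P, 2) integer (row, col) of each pillar in the grid
        x = self.linear(pillar_points)                    # (P, N, C)
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)    # normalize channels
        x = torch.relu(x)
        x = x.masked_fill(pillar_mask.unsqueeze(-1) == 0, float("-inf"))
        x = x.max(dim=1).values                           # (P, C) pillar features
        H, W = grid_hw
        canvas = x.new_zeros(x.shape[1], H, W)            # C x H x W pseudo-image
        canvas[:, pillar_index[:, 0], pillar_index[:, 1]] = x.t()
        return canvas
```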

So putting these things together, these were the results we got on the KITTI benchmark. So if you look at the car class, right, we actually got the highest performance, so this is I think the bird's eye view metric. And we even outperformed the methods that relied on LiDAR and vision.

And we did that running at a little bit over 60 hertz. And this is, like I said, this is in terms of bird's eye view; we can also measure on the 3D benchmark and we get very similar performance. Yeah, so, you know, car did well, cyclist did well; for pedestrian there were one or two methods, fusion methods, that did a little bit better.

But then in aggregate on the top left, we ended up on top. So, yeah. And I put a little asterisk here, this is compared to published methods at the time of submission. And so many things happening so quickly. So there's tons of, you know, submissions on the KITTI leaderboard that are completely anonymous, so we don't even know, you know, what was the input, what data did they use.

So we only compare to published methods. So here's some qualitative results. You have the, you know, just for visualization you can project them into the image. So you see the gray boxes are the ground truth and the colored ones are the predictions. And yeah, some challenging ones, it's so small here.

So we have, for example, the person on the right there, that's a person with a little stand that got interpreted as a bicycle. We have this man on the ladder, which is an actual annotation error. So we detected it as a person, but it wasn't annotated in the data. Here's a child on a bicycle that didn't get detected.

So that's a, you know, that's a bummer. Okay, so that was KITTI, and then I just wanted to show you guys, of course we can run this on our vehicles. So this is a rendering. We just deploy the network at two hertz on the full 360 sensor suite. Input is still LiDAR, you know, a few LiDAR sweeps, but just projected into the images for visualization.

And again, no tracking or smoothing applied here. So it's every single frame is analyzed independently. See those arrows sticking out? That's the velocity estimate. So we actually show how you can, yeah, you can actually accumulate multiple point clouds into this method, and now you can start reasoning about velocity as well.

(no audio) So the second part I want to talk about is nuScenes, which is a new benchmark dataset that we have published. So what is nuScenes? So it's 1,000 twenty-second scenes that we collected with our development platform. So it's a full, it's the same platform that Karl showed, or a sort of previous generation platform, the Zoe vehicle.

So it's full, you know, the full automotive sensor suite, data is registered and synced in 360 degree view. And it's also fully annotated with 3D bounding boxes. I think there's over one million 3D bounding boxes. And we actually make this freely available for research. So you can go to nuscenes.org right now and download a teaser release, which is 100 scenes, the full release will be in about a month.

And of course the motivation is straightforward, right? So, you know, the whole field is driven by benchmarks, and you know, without ImageNet, it may be the case that none of us would be here, right? Because they may never have been able to write that first paper and sort of start this whole thing going.

And when I started looking at 3D, I looked at the KITTI benchmark, which is truly groundbreaking. I don't want to take anything away, but it was becoming outdated. They don't have full 3D view, they don't have any radar. So I think this offers an opportunity to sort of push the field forward a little bit.

Right, and just as a comparison, this is sort of the most similar benchmark. And really the only one that you can really compare to is KITTI. But so there's other data sets that have maybe LiDAR only, tons of data sets that have image only, of course. But it's quite a big step up from KITTI.

Yeah, some details. So you see the layout with the radars along the edge, all the cameras on the roof and the top LIDAR, and some of the receptive fields. And this data is all on the website. The taxonomy, so we model several different subcategories of pedestrians, several types of vehicles, some static objects, barrier cones.

And then in addition, a bunch of attributes on the vehicles and on the pedestrians. All right, so without further ado, let's just look at some data. So this is one of the thousand scenes, right? So all I'm showing here is just playing the frames one by one of all the images.

And again, the annotations live in the world coordinate system, right? So they are full 3D boxes. I've just projected them into the image. And that's what's so neat. So we're not really annotating the LIDAR or the camera or the radar. We're annotating the actual objects and put them in a world coordinate system and give all the transformations so you guys can play around with it how you like.

So just to show that, so I can, because everything is ready, so I can now take the LIDAR sweeps and I can just project them into the images at the same time. So here I'm showing just colored by distance. So now you have some sort of sparse density measurement on the images, distance measurement, sorry.
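
The projection being described is the standard rigid-transform-plus-pinhole-camera operation. Here is a minimal sketch of it (generic math, not the nuScenes devkit API), assuming a 4x4 world-to-camera transform and a 3x3 intrinsic matrix.

```python
import numpy as np

def project_to_image(points_world, T_cam_from_world, K):
    """Project (N, 3) world-frame points into pixel coordinates and return
    their depths, which can be used to color the points by distance."""
    pts_h = np.hstack([points_world, np.ones((points_world.shape[0], 1))])
    pts_cam = (T_cam_from_world @ pts_h.T)[:3]   # (3, N) in the camera frame
    in_front = pts_cam[2] > 0.1                  # keep points in front of the camera
    pts_cam = pts_cam[:, in_front]
    pix = K @ pts_cam
    pix = pix[:2] / pix[2]                       # perspective divide -> (u, v)
    return pix.T, pts_cam[2]                     # pixel coordinates and depth per point
```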

So that's all I wanted to talk about. Thank you. (audience applauding) - Hi, I was really, really interested in your discussion around validation and particularly continuous development and that sort of thing. And so my question was basically is this new scenes dataset, is this enough to guarantee that your model is going to generalize to unseen data and not hit pedestrians and that stuff or do you have other validation that you need to do?

- No, no, no, I mean, so the nuScenes effort is purely an academic effort. So we wanna share our data with the academic community to drive the field forward. We're not making any claims that this is somehow a sufficient dataset for any safety case. It's a small subset of our data.

Yeah, I would say, obviously, my background is in the academic world. One of the hardest things was always collecting data because it's difficult and expensive. And so having access to a dataset like that, which was expensive to collect and annotate, but which we thought we would make available because, well, we hope that it would spark academic interests and smart people like the people in this room coming up with new and better algorithms, which could benefit the whole community and then maybe some of you would even wanna come work with us at Aptiv.

So not totally, a little bit of self-interest there. Wasn't intended to be for validation, it was more for research. To give you a sense of the scale of validation, there was one quote there at RAND saying you gotta drive 275 million miles or more, depending on the certainty you wanna impose.

But to date as an industry, we've driven about 12 to 14 million miles in sum, all participants in autonomous mode, across hundreds of different builds of code and many different environments. So this would now be saying you're supposed to drive hundreds of millions of miles in a particular environment on a single build of code, a single platform.

Now obviously we're probably not gonna do that. What we'll end up doing is supplementing the driving with quite a lot of simulation and then other methodologies to convince ourselves that we can make a statistical, ultimately a statistical argument for safety. So there'll be use of data sets like this.

We'll be doing lots of regression testing on supersized version of data sets like that and other kind of morally equivalent versions to test different parts of the systems. Now not just classification, but different aspects of the system. Our motion planning, decision making, localization, all aspects of the system. And then augment that with on-road driving and augment that with simulation.

So the safety case is really quite a bit broader, unfortunately, than any single data set would allow you to kind of speak to. - From an industrial perspective, what do you think can 5G offer for autonomous vehicles? - 5G, yeah, it's an interesting one. Well, these vehicles are connected.

You know, that's a requirement. Certainly when you think about operating them as a fleet. When the day comes when you have an autonomous vehicle that is personally owned, and that day will come at some point in the future, it will almost certainly be connected then too.

But when you have a fleet of vehicles and you wanna coordinate the activity of that fleet in a way to maximize the efficiency of that network, that transportation network, they're certainly connected. The requirements of that connectivity is fairly relaxed if you're talking about just passing back and forth the position of the car and maybe some status indicators.

You know, are you in autonomous mode, manual mode, are all systems go, or do you have a fault code, and what is it? Now, there's some interesting requirements that become a little bit more stringent if you think about what we call teleoperation or remote operation of the car. The case where if the car encounters a situation it doesn't recognize, can't figure out, gets stuck or confused, you may kind of phone a human operator who's sitting remotely to intervene.

And in that case, you know, that human operator will wanna have some situational awareness. There may be a demand of high bandwidth, low latency, high reliability of the sort that maybe 5G is better suited to than 4G. Or LTE or whatever you've got. Broadly speaking, we see it as a very nice to have, but like any infrastructure, we understand that it's gonna arrive on a timeline of its own and be maintained by someone who's not us.

So it's very much outside our control. And so for that reason, we design a system such that we don't rely on kind of the coming 5G wave, but we'll certainly welcome it when it arrives. - So you said you have presence in 45 countries. So did you observe any interesting patterns from that?

Like your car, your same self-driving car model that is deployed in Vegas as well as Singapore was able to perform equally well in both Vegas and Singapore, or the model was able to perform very well in Singapore compared to Vegas? - To speak to your question about like country to country variation, you know, we touched on that for a moment in the validation discussion.

But obviously driving in Singapore and driving in Vegas is pretty different. I mean, you're on the other side of the road for starters, but different traffic rules and it's sort of underappreciated people drive differently. There's slightly different traffic norms. So one of the things that, well, if anyone was in this class last year, my co-founder Emilio gave a talk about something we call rule books, which is a structure that we've designed around what we call the driving policy or the decision-making engine, which tries to admit in a general and fairly flexible way the ability to reprioritize rules, reassign rules, change weights on rules to enable us to drive in one community and then another in a fairly seamless manner.
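
As a loose sketch of that rule-book idea, under the assumption that rules can be expressed as weighted violation scores over candidate trajectories; the rule names, weights, and scoring hook here are purely illustrative, not the actual rule books Emilio described.

```python
# Illustrative only: switching jurisdictions means swapping weights or priorities,
# not rewriting a hand-built state machine.
RULES_SINGAPORE = [
    ("avoid_collision",        1e6),
    ("stay_in_drivable_area",  1e3),
    ("keep_left",              10.0),  # left-hand traffic
    ("ride_comfort",            1.0),
]

RULES_LAS_VEGAS = [
    ("avoid_collision",        1e6),
    ("stay_in_drivable_area",  1e3),
    ("keep_right",             10.0),  # right-hand traffic
    ("ride_comfort",            1.0),
]

def best_trajectory(candidates, rulebook, violation_cost):
    """Pick the candidate with the lowest weighted violation cost.
    `violation_cost(rule_name, trajectory)` is a hypothetical scoring hook."""
    def total_cost(trajectory):
        return sum(w * violation_cost(rule, trajectory) for rule, w in rulebook)
    return min(candidates, key=total_cost)
```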

So to give you an example, when we wanted to get on the road in Singapore, if you can imagine you've got a, so let's say you're an autonomy engineer who was tasked with writing the decision-making engine and you decide, I'm gonna do a finite state architecture, I'm gonna write down some transition rules, I'm gonna do them by hand, it's gonna be great.

And then you did that for the right-hand driving and your boss came in and said, "Oh yeah, next Monday we're gonna be left-hand driving, "so just flip all that and get it ready to go." That could be a huge pain, pain to do, 'cause it's generally speaking you're doing it manually and then very difficult to validate, to ensure that the outputs are correct across the entire spectrum of possibilities.

So we wanted to avoid that. And so the long story short, we actually quite carefully designed the system such that we can scale to different cities and countries. And one of the ways you do that is by thinking carefully around the architectural design of the decision-making engine. But it's quite different.

There's four cities I mentioned which are our primary sites, Boston, Pittsburgh, Vegas, and Singapore, spans a wide spectrum of driving conditions. I mean, everybody knows Boston, which is pretty bad. Vegas is warm weather, mid-density urban, but it's Vegas. So I mean, all kinds of stuff. And then Singapore is interesting, perfect infrastructure, good weather, flat.

People, generally speaking, obey the rules, so it's kind of close to the ideal case. So that exposure to this different spectrum of data, I think, I'll speak for Oscar, maybe, is pretty valuable. I know for other parts of the development team, quite valuable. - Singapore is ideal except there are, there's constant construction zones.

So every time we drive out, there's a new construction zone. So we focused, have a lot of work on construction zone detection in Singapore. - And the torrential rain. - Yeah, and the jaywalkers. - And the jaywalkers, right. Yeah, they do jaywalk. People don't break the rules, but they jaywalk.

Other than that, it's perfect. - So which country's fully equipped? - That's a really good question, yeah. Well, it's interesting because there's other dimensions. So when we look at which countries are interesting to us to be in as a market, there's the infrastructure conditions, there's the driving patterns and properties, the density, is it Times Square at rush hour or is it Dubuque, Iowa?

There is the regulatory environment, which is incredibly important. You may have a perfectly well-suited city from a technical perspective and they may not allow you to drive there. So it's really all of these things put together. And so we kind of have a matrix. We analyze which cities check these boxes and assign them scores and then try to understand then also the economics of that market.

Maybe that city checks all these boxes, but there's no one using mobility services there. There's no opportunity to actually generate revenue from the service. So you factor in all of those things. - Yeah, and I think, I mean, one thing to keep in mind, and it's always the first thing I tell candidates when I interview them.

There's a huge advantage to the business model we're proposing, right? The ride-hailing service. So we can choose, even if we commit to a certain city, we can still select the routes that we feel comfortable with and we can roll it out sort of piece by piece. We can say, okay, we don't feel comfortable driving at night in the city yet.

So we just won't accept any rides, right? So there's like that decision space as well. - Hi, thank you very much for coming and giving us this talk today. It was very, very interesting. I have a question which might reveal more about how naive I am than anything else.

I was comparing your PointPillars approach to the earlier approach, which is the voxel-based approach to interpreting the LiDAR results. And in the voxels, you had a four-dimensional tensor that you were starting with, and in your point pillars, you only have three dimensions. You're throwing away the Z, as I understood it.

So when you do that, are you concerned that you're losing information about potential occlusions or transparencies or semi-occlusions? Is this a concern? - I see. So I may have been a little bit sloppy there. So we're certainly not throwing away the Z. All we're saying is that we're learning the embedding in the Z dimension jointly with everything else.

So VoxelNet, if you want, sort of felt, when I first saw that paper, like they felt the need to spoon-feed the network a little bit and say, let's learn everything stratified in this height dimension. And then we'll have a second step where we learn to consolidate that into a single vector.

We just said, why not just learn those things together? So, yeah. - Thanks for your talk. I have a question for Karl. You mentioned that if people make changes to the code, do we need another validation or not? So I work in the industry of nuclear power. So we do nuclear power simulations.

So when we make any change to our simulation code, and to make it commercialized, we need to submit a request to the NRC, which is the Nuclear Regulatory Commission. So in your opinion, do you think for self-driving, we need another third-party validation committee or not? Or should that be a third party, or is it just self-check?

- Yeah, that's a really good question. So I don't know the answer. I wouldn't be surprised, let me put it this way. I would not be surprised either way if the automotive industry ended up with third-party regulatory oversight, or it didn't. And I'll tell you why. There's great precedent for what you just described.

Nuclear, aerospace, there's external bodies who have deep technical competence, who can come in, they can do investigations, they can impose strict regulation, or advise regulation, and they can partner or define requirements for certification of various types. The automotive industry has largely been self-certifying. There's an argument, which is certainly not unreasonable, that you have a real alignment of incentive within the industry and with the public to be as safe as possible.

Simply put, the cost of a crash is enormous, economically, socially, everything else. But whether it continues along that path, I couldn't tell you. It's an interesting space because it's one where the federal government is actually moving very, very quickly. I mean, I would say carefully, too, not overstepping and not trying to impose too much regulation around an industry that has never generated a dollar of revenue.

It's still quite nascent. But if you would have told me a few years ago that there would have been very thoughtfully defined draft regulatory guidelines or advice, I mean, let's say, it's not firm regulation, around this industry, I probably wouldn't have believed you. But in fact, that exists. There's a third version that was released this summer by the Department of Transportation.

So there's intense interest on the regulatory side. In terms of how far the process goes in terms of formation of an external body, I think really remains to be seen. I don't know the answer. - Thanks for your insightful talk. Looking at this slide, I'm wondering how easy and effective your trained models are to transfer across different weather conditions and whether you need, for example, if it is snowing, do you need specific training specifically for your LiDARs to work effectively, or do you not see any issues in that regard?

- No, I mean, I think the same rules apply to this method as any other machine learning-based method. You wanna have support in your training data for the situation you wanna deploy in. So if we have no snow in our training data, I wouldn't go and deploy this in snow.

I do like, one thing I like after having worked so much with vision though is that the lidar point cloud is really easy to augment and play around with. So for example, if you wanna say, you wanna be robust in really rare events, right? So let's say there's a piano on the road.

I really wanna detect that. But it's hard because I have very few examples of pianos on the road, right? Now if you think about augmenting your visual dataset with that data, it's actually quite tricky. It's not that easy to have a photorealistic piano in your training data. But it is quite easy to do that in your LiDAR data, right?

So you have a 3D model of your piano, you have the model for your lidar and you can get a pretty accurate, fairly realistic point cloud return from that, right? So I like that part about working with lidar. You can augment, you can play around with it. In fact, one of the things we do when we train this model is that we copy and paste samples from, or like objects from different samples.

So you can take a car that I saw yesterday, take the point returns on that car, you can just paste it into your current LiDAR sweep. You have to be a little bit careful, right? And this was actually proposed by another, by a previous paper. And we found that that was really useful.
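
A hedged sketch of that copy-paste idea, assuming object points are stored as an (N, 4) array of x, y, z, intensity; a real implementation would also check that the pasted object does not collide with existing boxes, which is omitted here.

```python
import numpy as np

def paste_object(sweep_points, object_points, translation, yaw):
    """Rigidly place previously recorded object returns into the current
    LiDAR sweep (ground-truth sampling augmentation). Illustrative only."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    moved = object_points.copy()
    moved[:, :3] = object_points[:, :3] @ R.T + np.asarray(translation)
    return np.vstack([sweep_points, moved])
```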

It sounds absurd, but it actually works. And it speaks to the ability to do that with the LiDAR point cloud. - Okay, great. Please give Karl and Oscar a big hand. Thank you so much. (audience applauding) - Excellent. (upbeat music)