Our group at MIT is studying semi-autonomous vehicles. That includes both inward-facing sensors for driver state sensing and outward-facing sensors for scene perception and the control and motion planning tasks. Today we'll look at the second part of that: the perception and control of the vehicle. On the dashboard of the Tesla, there's a Jetson TX2 with a camera sitting on top of it.
We have an end-to-end neural network running on the Jetson that takes in the forward roadway as a sequence of images and produces steering commands. We also have a Tesla with a perception control system on it in the form of Autopilot, which uses a monocular camera. This is hardware version one.
It's making decisions based on this single video stream and producing steering commands. So we'll look at two systems arguing today, Autopilot arguing against a neural network, and we'll see what comes out. In this concept, Tesla Autopilot is the primary AI system and the end-to-end neural network is the secondary AI system.
And the disagreement between the two is used to detect challenging situations and seek human driver supervision. It is important to clarify that this is not a criticism of Autopilot. Of the two, it is by far the superior perception control system. The question is whether the argument between the two systems can create transparency that leverages the human driver as a supervisor in challenging driving scenarios, scenarios that may not have otherwise been caught by Autopilot alone.
This is a general framework for the supervision of black-box AI systems that we hope can help save human lives. In the paper accompanying this video, we show that we can predict driver-initiated disengagement of Autopilot with a simple threshold on the disagreement between the two systems' steering decisions. We believe this is a surprising and powerful result that may be useful for human supervision of any kind of AI system that operates in the real world and makes decisions where errors may result in the loss of human life.
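To make that concrete, here is a minimal Python sketch of the thresholding idea, assuming both systems report steering angles in degrees; the threshold value and all names here are hypothetical illustrations, not the implementation from the paper.

```python
# Minimal sketch of disagreement thresholding, assuming both systems
# report steering angles in degrees. The threshold value and the names
# are hypothetical, not taken from the paper.

DISAGREEMENT_THRESHOLD_DEG = 5.0  # hypothetical tuning parameter


def disagreement(autopilot_deg: float, network_deg: float) -> float:
    """Magnitude of disagreement between the two steering decisions."""
    return abs(autopilot_deg - network_deg)


def seek_human_supervision(autopilot_deg: float, network_deg: float) -> bool:
    """True when the disagreement is large enough to alert the driver."""
    return disagreement(autopilot_deg, network_deg) > DISAGREEMENT_THRESHOLD_DEG
```

In practice one would likely smooth the disagreement over a short time window before thresholding, so a single noisy frame doesn't trigger an alert.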
A quick note: we use the intensity of the red color of the "disagreement detected" text as the visualization of disagreement magnitude. In retrospect, this is not an effective visualization, because visually it looks like the two systems are constantly disagreeing. They are not. The intent of the on-road demo is to show successful real-time operation of the Arguing Machines framework.
The accompanying paper, on the other hand, is where we show the predictive power of the approach on large-scale naturalistic data. Inside the car, we have a screen over the center stack and a Jetson TX2 with a camera on top of it. The camera is feeding a video stream into the Jetson.
On the Jetson is a neural network that's predicting the steering command, taking in the video stream of the forward roadway end-to-end and giving a steering command as output. That's shown in pink on this display: the pink line is the steering suggested by the neural network.
The cyan line is the steering of the car, of the Tesla, which we're getting from the CAN bus. When I move the steering wheel around, we see that live, in real time, mapped on this graphic here, showing in cyan the steering position of the car. Up top, whenever the two disagree significantly, a red "disagreement detected" sign appears, showing that there's a disagreement.
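As a rough illustration of the display logic (not the actual demo code), here is a hedged sketch of drawing the two steering readouts and scaling the red intensity of the warning text with the disagreement magnitude; the colors, geometry, and thresholds are assumptions.

```python
import cv2
import numpy as np

# Hypothetical sketch of the on-screen overlay, not the actual demo code:
# draw the neural network's steering in pink, the car's steering in cyan,
# and a red warning whose intensity grows with the disagreement.

PINK, CYAN = (203, 192, 255), (255, 255, 0)  # BGR colors


def draw_overlay(frame, network_deg, car_deg, max_deg=15.0):
    h, w = frame.shape[:2]
    cx, cy = w // 2, h - 30  # base of the steering indicator
    for angle, color in ((network_deg, PINK), (car_deg, CYAN)):
        rad = np.deg2rad(90.0 - angle)  # map steering angle to a screen direction
        tip = (int(cx + 100 * np.cos(rad)), int(cy - 100 * np.sin(rad)))
        cv2.line(frame, (cx, cy), tip, color, 3)
    intensity = min(abs(network_deg - car_deg) / max_deg, 1.0)
    if intensity > 0.2:  # hypothetical significance threshold
        cv2.putText(frame, "DISAGREEMENT DETECTED", (20, 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0,
                    (0, 0, int(255 * intensity)), 2)
    return frame
```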
And I'll demonstrate that on the road. We're now driving on the highway, with the Tesla being controlled by Autopilot, and the Jetson TX2 on the dashboard, with a camera plugged in, has an end-to-end neural network running on it. The input to the neural network is a sequence of images and the output is steering commands.
Now, there are two perception control systems working here. One is Autopilot, the other is an end-to-end neural network. The steering commands from both are being visualized on the center stack here: in pink is the output from the neural network, in cyan is the output from Autopilot. And whenever there is disagreement, up on top there's a "disagreement detected" text that becomes more intensely red the greater the disagreement.
At the bottom of the screen is the input to the neural network: a sequence of images subtracted from each other, capturing the temporal dynamics of the scene.
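Here is a minimal sketch, under stated assumptions, of what that frame-differencing input and an end-to-end steering regressor could look like; the layer sizes and tensor shapes are illustrative only, not the architecture from the "Arguing Machines" paper.

```python
import torch
import torch.nn as nn

# Illustrative only: the input is a stack of frame differences (capturing
# temporal dynamics) and the output is a single steering value. Layer
# sizes are assumptions, not the architecture from the paper.


def frame_differences(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, H, W) grayscale sequence -> (T-1, H, W) differences."""
    return frames[1:] - frames[:-1]


class TinySteeringNet(nn.Module):
    def __init__(self, in_channels: int = 3):  # e.g. 4 frames -> 3 differences
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)  # regress a single steering command

    def forward(self, x):  # x: (B, T-1, H, W) stacked frame differences
        return self.head(self.features(x).flatten(1))
```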
All right, so why is this interesting? Because we have two perception control systems, two AI systems, taking in the external world through a monocular camera and producing steering commands to control the vehicle. Now, whenever those two systems disagree, that's interesting for many reasons. First, the disagreement is an indicator that, from a visual perspective, from a perception perspective, the situation is challenging for those systems. Therefore you might want to bring the driver's attention to the situation so they take back control of the vehicle.
It's also interesting for validating systems. If you propose a new perception control system, you can imagine putting it into a car alongside Autopilot, or other similar systems, to see when that new system agrees with Autopilot and when it doesn't. And from a computer vision perspective, the disagreement is also really interesting for detecting edge cases.
The challenging thing about building autonomous vehicles is that most driving is really boring; the interesting bits happen rarely. So one of the ways to detect those interesting bits, the edge cases, is to look at the disagreement between these perception systems, at the cases when the two systems diverge, a sign that they are struggling with that situation.
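For instance, a hypothetical offline pass over logged drives could flag the moments where the two systems' steering decisions diverge; the field names and parameters below are assumptions for illustration, not a pipeline from the paper.

```python
# Hypothetical offline edge-case mining over logged drives: flag frames
# where the two systems' steering decisions diverge. The threshold and
# the spacing parameter are assumptions for illustration.


def mine_edge_cases(log, threshold_deg=5.0, min_gap_frames=30):
    """log: iterable of (frame_index, autopilot_deg, network_deg) tuples.
    Returns frame indices worth human review, spaced min_gap_frames apart."""
    flagged, last = [], -min_gap_frames
    for idx, autopilot_deg, network_deg in log:
        if abs(autopilot_deg - network_deg) > threshold_deg and idx - last >= min_gap_frames:
            flagged.append(idx)
            last = idx
    return flagged
```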
Finally, when the driver takes control of the vehicle, which I am doing now, and my steering decisions, my turning of the steering wheel, are such that the neural network disagrees, it perhaps means that I am either distracted or the situation is visually challenging, and therefore I should pay extra attention.
So it makes sense for the system to warn you about that situation. Now, the interesting thing about Tesla and the Autopilot system is that we can instrument a lot of these vehicles, as we have: we've instrumented 20 Teslas as part of the MIT Autonomous Vehicle Study and are collecting data, month after month, year after year now, video in and video out.
We can use that data to train better systems: perception systems, control, motion planning, and the end-to-end network that we're showing today. We have the large-scale data to train the learning-based perception and control algorithms. Now, an important thing to mention is that these systems were designed to work on the highway, at highway speeds.
So the kind of disagreement the system is trained to detect is disagreement between Autopilot and the neural network in highway situations: visual characteristics like deteriorating lane markings, construction zones, and so on. If you're interested in more details, they can be found in the paper titled "Arguing Machines."