
MIT-AVT: Data Collection Device (for Large-Scale Semi-Autonomous Driving)


Chapters

0:00
1:22 Components
1:37 Sensor Integration
3:17 Logitech C920 Webcam
7:30 Synchronization
12:30 Next Steps

Transcript

The MIT Autonomous Vehicle Technology Study is all about collecting large amounts of naturalistic driving data. Behind that data collection is this box right here that Dan has termed "Rider". Dan is behind a lot of the hardware work we do, embedded systems, and Michael is behind a lot of the software, the data pipeline, as well as just offloading the data from the device.

We'd like to tell you some of the details behind Rider and behind the sensors. We have three cameras in the car, and the wires run back into the trunk, which is where Rider sits. There are a lot of design specifications to make the system work month after month, reliably, across multiple vehicles, across multiple weather conditions, and so on.

At the end of the day, we have multiple sensor streams. We have the three cameras coming in, we have IMU, GPS, and all of the raw CAN messages coming from the vehicle itself. And all of that has to be collected reliably, synchronized, and post-processed once we offload the data.

First, we have a single-board computer here running a custom version of Linux that we wrote specifically for this application. This single-board computer integrates all of the cameras, all the sensors, GPS, CAN, IMU, and offloads it all onto the solid-state hard drive that we have on board. There are some extra components here for cellular communication, as well as power management throughout the device.

Here we have our single-board computer, as well as sensor integration and our power system. This is our solid-state drive that connects directly to our single-board computer. On our single-board computer, we have a sensor integration board on top here. You'll be able to see our real-time clock, as well as its battery backup and CAN transceiver.

On the reverse side of this board, we have our GPS receiver and IMU. This is our CAN-controlled power board, which monitors CAN throughout the car and determines whether or not the system should be on or off. When the system is on, this sends power through a buck converter to reduce the 12 volts from the vehicle down to 5 volts to operate the single-board computer.

We also have a 4G wireless connection on board to monitor the health of Rider and determine things like free capacity left on our drive, as well as temperature and power usage information. The cameras connect to Rider through this USB hub right here. We needed the box to do at least three things.

One was to record from at least three cameras; another was to record CAN vehicle telemetry data; and lastly, to be able to store all this data on board for a long period of time, so that people could drive around for months without us having to offload the data from their vehicles. When we're talking about hundreds of thousands of miles of data, every 100,000 miles of uncompressed video comes to about 100 petabytes.

One of the other key requirements was how to store all this data on the device and then offload it successfully onto thousands of machines, to be processed with the computer vision and deep learning algorithms that we're using. One of the essential elements for that was to do compression on board.

These are Logitech C920 webcams. They can do up to 1080p at 30 frames a second. The major reason we went with these is that they do onboard H.264 compression of the video. That allows us to offload all of that processing from our single-board computer onto the individual cameras, letting us use a very slim, pared-down, lightweight single-board computer to run all of these sensors.
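As a rough illustration of what onboard compression buys you on the host side, here is a minimal sketch, assuming a Linux machine with ffmpeg and a UVC camera at /dev/video0, of pulling the camera's already-compressed H.264 stream and writing it to disk without re-encoding. The device path, resolution, and file names are assumptions, not the actual Rider capture code.

```python
# Minimal sketch: capture the C920's hardware-encoded H.264 stream on Linux
# by wrapping ffmpeg; nothing is re-encoded on the host CPU.
import subprocess

def record_camera(device: str = "/dev/video0", out: str = "cam0.mkv") -> subprocess.Popen:
    cmd = [
        "ffmpeg",
        "-f", "v4l2",                # Video4Linux2 capture
        "-input_format", "h264",     # ask the camera for its compressed stream
        "-video_size", "1920x1080",
        "-framerate", "30",
        "-i", device,
        "-c:v", "copy",              # pass the bitstream through untouched
        out,
    ]
    return subprocess.Popen(cmd)

# proc = record_camera()            # one process per camera; stop with proc.terminate()
```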

This is the original Logitech C920 that you would buy at a store. These are two of the same Logitech C920s, but put into a custom-made camera case just for this application. That allows us to add our own CS-mount lenses, so we can have a zoom lens as well as a fisheye lens inside the car, giving us a greater range of fields of view inside the vehicle.

So this is the fisheye lens, and this is the zoom lens. Besides the CS mount, there's also the C mount; these are the standard lens mounts used on industrial cameras, the kind often used for autonomous vehicle applications. We tested these cameras to see what would happen to them if placed inside a hot car on a summer day.

We wanted to see whether these cameras would hold up to the summer heat and still function as needed. We put these cameras in a toaster. A scientific toaster. And this is the temperature it went up to. We cycled these cameras between 58 and 75 degrees Celsius, around the 150-degree Fahrenheit maximum that a car interior reaches in the summer.

We also cranked it up to 127 degrees Celsius just to see what would happen to these cameras after prolonged high heat. In fact, these cameras continued to work perfectly fine after that. Creating a system that would intelligently and autonomously turn on and off to start and end recording was also a key aspect of this device.

Since people were just going to be driving their normal cars, we couldn't necessarily rely on them to start and end recording. So this device, Rider, intelligently figures out when the car is running and when it's off, to start and stop recording automatically. So how does Rider specifically know when to turn on?

So we use CAN to determine when the system should turn off and on. When CAN is active, the car is running, and we should turn the system on. When CAN is inactive, we should turn the system off and end recording. This also gives us the ability to trigger on certain CAN messages.

So, for instance, if we want to start recording as soon as they approach the car and unlock the door, we can do that. Or when they turn the car on, put it into drive, and so on.
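As a hedged sketch of that kind of logic, assuming a Linux SocketCAN interface and the python-can library, the loop below starts recording when a hypothetical door-unlock arbitration ID appears and stops once the bus has gone quiet. Real message IDs are vehicle-specific, and this is not Rider's actual firmware.

```python
# Minimal sketch: CAN-activity-based start/stop with an optional message trigger.
import can
import time

UNLOCK_ARBITRATION_ID = 0x3B3   # hypothetical door-unlock ID; differs per vehicle
BUS_IDLE_TIMEOUT_S = 30         # no CAN traffic for this long -> treat the car as off

def wait_for_trigger(channel: str = "can0") -> None:
    bus = can.interface.Bus(channel=channel, bustype="socketcan")
    last_traffic = time.monotonic()
    recording = False
    while True:
        msg = bus.recv(timeout=1.0)                 # returns None if the bus is quiet
        if msg is not None:
            last_traffic = time.monotonic()
            if not recording and msg.arbitration_id == UNLOCK_ARBITRATION_ID:
                recording = True
                print("unlock seen -> start recording")   # placeholder for starting capture
        elif recording and time.monotonic() - last_traffic > BUS_IDLE_TIMEOUT_S:
            recording = False
            print("bus idle -> stop recording")           # placeholder for stopping capture
```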

The cost of the car the system resides in is about a thousand times more than the system itself. These are $100,000-plus cars, so we have to make sure we design the system and run the wires in such a way that no damage is done to the vehicles. What kinds of things fail, when they fail? The biggest issue we've had with this system is camera cables becoming unplugged.

When a camera cable becomes unplugged, the system will try to restart that subsystem multiple times, and if it's unable to, it shuts off recording completely. As long as that cable is still unplugged, Rider will not start up the next time. So one issue we've seen is that an unplugged cable causes us to lose the chance to record some data.

And that was one of the requirements of the system from the very beginning: all the video streams are always recorded, completely and in sync. Now, if any of the subsystems fails to record from its sensors, we retry, restarting it again and again, and if it's still not working, the system shuts down.
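A minimal sketch of that retry-then-shut-down behavior might look like the following, assuming a hypothetical start_subsystem() callable per sensor; the names and retry count are illustrative, not Rider's actual supervisor code.

```python
# Minimal sketch: bring every subsystem up, or refuse to record at all.
import time

MAX_RETRIES = 3

def supervise(subsystems: dict) -> bool:
    """Try to start every subsystem; if any one can't be started, give up."""
    for name, start_subsystem in subsystems.items():
        for _attempt in range(MAX_RETRIES):
            try:
                start_subsystem()      # hypothetical callable that raises on failure
                break
            except RuntimeError:
                time.sleep(2)          # brief pause before restarting the subsystem
        else:
            print(f"{name} failed {MAX_RETRIES} times -> shutting down recording")
            return False               # one dead camera means no recording at all
    return True
```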

The video is essential in order to understand what drivers are doing in these systems. So if one of the cameras is not working, the system as a whole is not working. The other crucial aspect of a data collection system taking in multiple streams is that those streams have to be synchronized perfectly.

Synchronization was the highest priority from the very beginning of Rider's design. We have a real-time clock onboard Rider that gives us time-stamping accuracy down to two parts per million. This means that over an hour of driving, the time stamps issued to each of the different subsystems may drift by up to seven or so milliseconds.
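The drift figure follows directly from the clock's rated accuracy; a quick back-of-the-envelope check:

```python
# Worst-case drift of a 2 ppm real-time clock over an hour of driving.
rtc_error_ppm = 2
drive_seconds = 60 * 60
drift_ms = rtc_error_ppm * 1e-6 * drive_seconds * 1000
print(drift_ms)   # 7.2 ms
```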

Relatively speaking, this is extremely small compared to most clocks on computers today. Once the data is offloaded, the very first thing we do is make sure it was time-stamped correctly so that we can synchronize it; synchronizing the data is the first step of the data pipeline.

That means assigning the time stamp that came from the real-time clock to every single piece of sensor data, and using those time stamps to align the data together. For video, that means 30 frames a second perfectly aligned with the GPS signal and so on. There are other sensors, like the IMU and the CAN messages coming from the car, that arrive much more frequently than 30 hertz, the video frame rate.

So we have a different synchronization scheme there. But overall, synchronization, from the very beginning of the hardware design to the very end of the software pipeline, is crucial, because we want to be able to analyze what people are doing in these semi-autonomous vehicles and how they're interacting with the technology.
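To make the alignment step concrete, here is a minimal sketch, assuming (hypothetically) that each stream has already been parsed into a pandas DataFrame with a datetime timestamp column derived from the real-time clock. The column and file names are illustrative, not the actual MIT-AVT pipeline schema.

```python
# Minimal sketch: attach, to every 30 Hz video frame, the nearest sample
# from a lower-rate stream such as GPS.
import pandas as pd

def align_to_video(video_frames: pd.DataFrame, sensor: pd.DataFrame,
                   tolerance_ms: int = 50) -> pd.DataFrame:
    """Nearest-in-time join keyed on the shared 'timestamp' column (datetime64)."""
    video_frames = video_frames.sort_values("timestamp")
    sensor = sensor.sort_values("timestamp")
    return pd.merge_asof(
        video_frames, sensor,
        on="timestamp",
        direction="nearest",                            # closest sample before or after the frame
        tolerance=pd.Timedelta(milliseconds=tolerance_ms),
    )

# Hypothetical usage; higher-rate streams (IMU, CAN) would instead be
# windowed or aggregated per frame rather than matched one-to-one.
# frames = pd.read_parquet("face_camera_frames.parquet")
# gps    = pd.read_parquet("gps.parquet")
# synced = align_to_video(frames, gps)
```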

That means using data from the face camera, the body camera, and the forward view, synchronized together with the GPS, the IMU, and all the vehicle telemetry messages coming over CAN. The video stream compression, which is a very CPU- or GPU-intensive operation, is performed onboard the camera.

There are other CPU-intensive operations performed on Rider, like the sensor fusion for the IMU. But for the most part, there are sufficient CPU cycles left for the actual data collection not to have any skips or drifts in the sensor streams.
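The transcript doesn't say which fusion algorithm Rider runs, so the following is only a minimal complementary-filter sketch of the kind of IMU computation involved, estimating pitch from accelerometer and gyroscope samples; the blend factor and axis conventions are assumptions.

```python
# Minimal sketch: complementary filter blending integrated gyro rate
# (smooth but drifting) with the accelerometer tilt estimate (noisy but drift-free).
import math

def fuse_pitch(pitch_prev: float, gyro_rate_y: float, accel_x: float,
               accel_z: float, dt: float, alpha: float = 0.98) -> float:
    pitch_gyro = pitch_prev + gyro_rate_y * dt      # integrate angular rate over the sample period
    pitch_accel = math.atan2(-accel_x, accel_z)     # tilt implied by the gravity vector
    return alpha * pitch_gyro + (1 - alpha) * pitch_accel
```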

One of the questions we get is: how do we get the data from this box to our computers, and then to the cluster that's doing the compute? When we receive a hard drive from one of these Rider boxes that we're swapping, we connect it locally to our computers and then do a remote copy to a server that contains all of our data. We then check the data for consistency and perform any fixes on the raw data in preparation for the synchronization step.
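As a sketch of what a post-copy consistency check could look like, assuming the local drive and the server copy are both visible as mounted directories; the paths are hypothetical, and the pipeline's actual checks and fixes aren't specified here.

```python
# Minimal sketch: flag any file whose server copy is missing or differs from the drive.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while data := f.read(chunk):
            h.update(data)
    return h.hexdigest()

def verify_copy(src_root: Path, dst_root: Path) -> list[Path]:
    """Return relative paths that are missing or mismatched on the server side."""
    bad = []
    for src in src_root.rglob("*"):
        if src.is_file():
            dst = dst_root / src.relative_to(src_root)
            if not dst.exists() or sha256_of(src) != sha256_of(dst):
                bad.append(src.relative_to(src_root))
    return bad
```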

So we're not doing any remote offloading of data. The data lives on Rider until the subjects, the drivers, the owners of the car, come back to us. We take the hard drive, swap it out, and offload the data from it. Can you tell me the journey that a pixel takes on its way from the camera to our cluster?

Well, first the camera records the raw image data based on the settings we've configured from the Rider box, and that raw image data is compressed on the camera itself into H.264 and then transmitted over the USB cable to the single-board computer in the Rider box.

Then it's recorded onto the solid-state drive as a video file, where it stays until we do an offload, after about six months for our NDS subjects and one month for our FT subjects. After that, the drive is connected to a local computer, the data is copied to a remote server, and it's then processed with initial cleaning algorithms to remove any corrupt data or to fix the subject data in the configuration files for that particular trip.

After the initial cleaning is taken care of, the data is synchronized at 30 frames per second and can then be used for the different detection algorithms or for manual annotation. So the important hard work behind the magic that deep learning and computer vision unlock is the synchronization and the cleaning of the messy data, making sure anything that's at all weird in the data gets taken out, so that at the end of the pipeline we have a clean data set of multiple sensor streams, perfectly synchronized, that we can use both for analysis and for annotation, so that we can improve the neural network models used for the various detection tasks.

So Rider has done an amazing job, across over 30 vehicles, of collecting hundreds of thousands of miles' worth of data, billions of video frames. We're talking about an incredible amount of data, all compressed with H.264, close to 300 terabytes' worth. But, of course, it can always improve.

So what are our next steps? One huge improvement for Rider would be transitioning to another single-board computer, in particular a Jetson TX2. There's a lot more capability for added sensors, as well as much more compute power and even the possibility of developing some real-time systems with a Jetson.

One of the critical things you realize when you're collecting huge amounts of driving data is that most of driving is quite boring; nothing interesting happens in terms of understanding driver behavior or training computer vision models on edge cases. So one of the future steps we're taking is based on what we've found in the data so far: we know which parts are interesting and which are not.

And so we want to design onboard algorithms that process that video data in real time to determine, "Is this the kind of data I want to keep right now? And if not, throw it out." That means we can collect more efficiently, keeping just the bits that are interesting for edge-case neural network model training or for understanding human behavior.
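Purely as an illustration of that idea, the sketch below keeps a segment only if its accompanying CAN telemetry shows hard braking or a large steering change; the signal names and thresholds are hypothetical, not findings from the MIT-AVT data.

```python
# Minimal sketch: decide onboard whether a recorded segment is worth keeping,
# based on per-sample telemetry dictionaries (hypothetical keys).
def is_interesting(segment_telemetry: list[dict]) -> bool:
    HARD_BRAKE_MPS2 = 3.0          # hypothetical deceleration threshold
    STEERING_DELTA_DEG = 15.0      # hypothetical steering-change threshold
    for prev, cur in zip(segment_telemetry, segment_telemetry[1:]):
        if cur["decel_mps2"] > HARD_BRAKE_MPS2:
            return True
        if abs(cur["steering_deg"] - prev["steering_deg"]) > STEERING_DELTA_DEG:
            return True
    return False                   # boring segment -> candidate for discarding onboard
```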

Now, this is a wide-open, largely unexplored area, because we really don't understand what people do in semi-autonomous vehicles, when the car is driving itself and when the human is driving. So the initial stages of the study were to keep all the data so we could do the analysis: body pose, glance allocation, activity, smartphone usage, all the various sudden decelerations, Autopilot usage, where it's used, how it's used, by geography, weather, time of day, and so on.

But as we start to understand where the fundamental insights come from, we can be more and more selective about which epochs of data we want to collect. Now, that requires real-time processing of the data. As Dan said, that's where the Jetson TX2, the power that the Jetson TX2 brings, becomes more and more useful.

Now, all of this work is part of the MIT Autonomous Vehicle Technology Study. We've collected over 320,000 miles so far and are adding 500 to 1,000 miles every day. So we're always growing, adding new vehicles. We're looking at adding a Tesla Model 3, a Cadillac CT6 with the Super Cruise system, and others.

One of the driving principles behind our work is that the kind of data collection we need in order to design safe semi-autonomous and autonomous vehicles can't record just the forward roadway, or only sensor data on the external environment. We need rich sensor information about the internal environment: what the driver is doing, everything about their face, their glance, their cognitive load, their body pose, everything about their activity.

We truly believe that autonomy, autonomous vehicles, require an understanding of how human supervisors of those systems behave, how we can keep them attentive, keep their glance on the road, keep them as effective, efficient supervisors of those systems. Thanks for watching.