This video accompanies our paper presented at IJCAI, the International Joint Conference on Artificial Intelligence, where we propose a system for detecting driver frustration from the fusion of two data streams, first the audio of the driver's voice and second the video of the driver's face. Let's ask an illustrative question.
These are video snapshots of two drivers using the in-car voice-based navigation system. Which one of them looks more frustrated with the interaction? To help answer that question, let's take a look at an example interaction involving the driver on the right. Our proposed approach uses the audio of the driver's voice when the "human" is speaking and the video of the driver's face when he is listening to the machine speak.
What you are seeing and hearing is the driver attempting to instruct the car's voice-based navigation system to navigate to 177 Massachusetts Ave, Cambridge, Massachusetts. "177 Massachusetts Ave, Cambridge, Massachusetts." "None of the above." "177 Massachusetts Ave, Cambridge, Massachusetts." "None of the above." "None of the above." "Cambridge, Massachusetts." So there is your answer.
On a scale of 1 to 10, with 1 being completely satisfied and 10 being completely frustrated, the smiling driver reported his frustration level with this interaction to be a 9. We use the self-reported level of frustration as the ground truth for the binary classification of satisfied versus frustrated. When the driver is speaking, we extract the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) features from their voice, which measure basic physiological changes in voice production.
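As a rough illustration of this step (and not necessarily the authors' exact toolchain), GeMAPS functionals can be computed from a speaking epoch with the openSMILE Python bindings; the file name and feature-set choice below are assumptions made for the sketch.

```python
# Illustrative sketch: extract GeMAPS acoustic functionals from one speaking
# epoch using the openSMILE Python package (assumed tooling, not confirmed by
# the paper). "driver_speaking.wav" is a hypothetical clip of the driver
# addressing the navigation system.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,        # Geneva Minimalistic Acoustic Parameter Set
    feature_level=opensmile.FeatureLevel.Functionals,   # one summary feature vector per clip
)

features = smile.process_file("driver_speaking.wav")    # DataFrame with one row of GeMAPS functionals
print(features.shape)
```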
When the driver is listening, we extract 14 facial actions using the AFFDEX system from the video of the driver's face. The classifier decisions are fused together to produce an accuracy of 88.5% on an on-road data set of 20 subjects. There are two takeaways from this work that may go beyond just detecting driver frustration.
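The fusion step can be sketched as decision-level fusion of two per-modality classifiers, one trained on the voice features from speaking epochs and one on the facial-action features from listening epochs. The classifier type (an SVM here), the probability-averaging rule, and all variable names are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of decision-level fusion, assuming per-epoch feature vectors
# (GeMAPS functionals for speaking epochs, facial-action statistics for
# listening epochs) have already been extracted. Labels: 1 = frustrated, 0 = satisfied.
import numpy as np
from sklearn.svm import SVC

def train_modality_classifiers(voice_X, voice_y, face_X, face_y):
    """Train one classifier per modality; fusion happens only at decision time."""
    voice_clf = SVC(kernel="rbf", probability=True).fit(voice_X, voice_y)
    face_clf = SVC(kernel="rbf", probability=True).fit(face_X, face_y)
    return voice_clf, face_clf

def predict_frustration(voice_clf, face_clf, voice_x, face_x):
    """Fuse the two per-modality probabilities (simple average, an assumption)."""
    p_voice = voice_clf.predict_proba(voice_x.reshape(1, -1))[0, 1]
    p_face = face_clf.predict_proba(face_x.reshape(1, -1))[0, 1]
    p_frustrated = 0.5 * (p_voice + p_face)
    return int(p_frustrated >= 0.5), p_frustrated
```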
First, a self-reported emotional state may be very different from one assigned by a group of external annotators, so we have to be careful when using such annotations as the ground truth for other affective computing experiments. Second, detection of emotion may require considering not just facial actions or voice acoustics, but also the context of the interaction and the target of the affective communication.
For more information or to contact the authors, please visit the following website. Thank you.