
Detecting Driver Frustration from Audio and Video (IJCAI 2016)



00:00:00.000 | This video accompanies our paper presented at IJCAI, the International Joint Conference
00:00:05.720 | on Artificial Intelligence, where we propose a system for detecting driver frustration
00:00:12.120 | from the fusion of two data streams: first, the audio of the driver's voice and second,
00:00:18.000 | the video of the driver's face.
00:00:22.160 | Let's ask an illustrative question.
00:00:25.000 | These are video snapshots of two drivers using the in-car voice-based navigation system.
00:00:31.840 | Which one of them looks more frustrated with the interaction?
00:00:36.760 | To help answer that question, let's take a look at an example interaction involving
00:00:41.200 | the driver on the right.
00:00:43.900 | Our proposed approach uses the audio of the driver's voice when the human is speaking
00:00:50.120 | and the video of the driver's face when he is listening to the machine speak.
00:00:57.640 | What you are seeing and hearing is the driver attempting to instruct the car's voice-based
00:01:01.760 | navigation system to navigate to 177 Massachusetts Ave, Cambridge, Massachusetts.
00:01:09.800 | 177 Massachusetts Ave, Cambridge, Massachusetts.
00:01:24.840 | None of the above.
00:01:31.840 | 177 Massachusetts Ave, Cambridge, Massachusetts.
00:01:55.840 | None of the above.
00:02:02.840 | None of the above.
00:02:08.840 | Cambridge, Massachusetts.
00:02:20.840 | So there is your answer.
00:02:22.680 | On a scale of 1 to 10, with 1 being completely satisfied and 10 being completely frustrated,
00:02:28.680 | the smiling driver reported his frustration level with this interaction to be a 9.
00:02:35.540 | We use self-reported level of frustration as the ground truth for the binary classification
00:02:40.300 | of satisfied versus frustrated.
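
As a minimal sketch of how such self-reports could be turned into binary labels (the split point below is an assumption; the transcript does not state the exact threshold used):

```python
def binarize_frustration(self_report, threshold=5):
    """Map a 1-10 self-reported frustration rating to a binary label.

    0 = satisfied, 1 = frustrated. The threshold of 5 is purely
    illustrative; the paper's exact split point is not given here.
    """
    return 1 if self_report > threshold else 0
```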
00:02:45.320 | When the driver is speaking, we extract the Geneva Minimalistic Acoustic Parameter Set
00:02:50.600 | (GeMAPS) features from their voice, which measure basic physiological changes in voice production.
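
One way to compute such acoustic functionals in practice is a sketch like the following, assuming the openSMILE toolkit's Python wrapper and its GeMAPSv01b feature set as a stand-in for the parameter set described above; the file name is a placeholder:

```python
import opensmile

# Configure openSMILE to compute GeMAPS functionals over a whole utterance.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# One row of acoustic functionals for the driver's spoken command.
voice_features = smile.process_file("driver_utterance.wav")
```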
00:02:58.520 | When the driver is listening, we extract 14 facial actions using the Affdex system from
00:03:03.920 | the video of the driver's face.
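
The Affdex SDK is proprietary, so as a rough sketch we assume per-frame action-unit activations are already available from some detector and only show how they might be summarized over a listening epoch; the mean/std statistics are illustrative assumptions, not necessarily the paper's exact features:

```python
import numpy as np

def summarize_action_units(au_frames):
    """Summarize per-frame facial-action activations for one listening epoch.

    au_frames: (n_frames, 14) array of activations for the 14 facial actions,
    produced by whatever AU detector is available. The mean/std functionals
    here are illustrative choices.
    """
    return np.concatenate([au_frames.mean(axis=0), au_frames.std(axis=0)])
```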
00:03:06.840 | The classifier decisions are fused together to produce an accuracy of 88.5% on an on-road
00:03:13.380 | data set of 20 subjects.
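
A plausible decision-level fusion under these assumptions looks like the sketch below: one classifier per stream, with their frustration probabilities averaged and thresholded. The classifiers, weights, and variable names are assumptions, not the paper's exact scheme:

```python
import numpy as np
from sklearn.svm import SVC

def fused_prediction(X_voice, X_face, y, Xv_test, Xf_test, w_voice=0.5):
    """Train one classifier per stream and fuse their decisions.

    X_voice / X_face: per-interaction voice and face feature matrices,
    y: binary satisfied/frustrated labels. All names are illustrative.
    """
    voice_clf = SVC(probability=True).fit(X_voice, y)
    face_clf = SVC(probability=True).fit(X_face, y)

    p_voice = voice_clf.predict_proba(Xv_test)[:, 1]
    p_face = face_clf.predict_proba(Xf_test)[:, 1]

    # Weighted average of the two streams' frustration probabilities,
    # thresholded at 0.5 to give the fused binary decision.
    p_fused = w_voice * p_voice + (1.0 - w_voice) * p_face
    return (p_fused >= 0.5).astype(int)
```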
00:03:18.080 | There are two takeaways from this work that may go beyond just detecting driver frustration.
00:03:23.320 | First, self-reported emotional state may be very different from one assigned by a group
00:03:29.480 | of external annotators, so we have to be careful when using such annotations as the ground
00:03:35.160 | truth for other affective computing experiments.
00:03:39.680 | Second, detection of emotion may require considering not just facial actions or voice acoustics,
00:03:46.720 | but also the context of the interaction and the target of the affective communication.
00:03:54.040 | For more information or to contact the authors, please visit the following website.
00:03:58.680 | Thank you.
00:03:59.720 | [END]