This paper investigates the audio-visual correlates and the detection of word prominence in a scenario where subjects interacted with a computer in a small cartoon game. I set up a Wizard of Oz experiment in which subjects were asked to correct a misunderstanding by the system. As only one word was misunderstood, this evoked a narrow focus condition, rendering the corrected word highly prominent. I made audio-visual recordings with a distant microphone and without visual markers. With this setup I expected to elicit natural reactions from the subjects in a human-machine interaction task. As acoustic features I extracted duration, intensity, fundamental frequency, and spectral emphasis. From the visual channel I extracted head movements, based on the movements of the nose, and image-transformation-based features from the mouth region. First, I show that the extracted features differ significantly between the two focus conditions (broad and narrow). Based on classification results, I demonstrate that the two conditions can be discriminated without knowledge of the word identity. Furthermore, I show that the visual channel by itself yields accuracies comparable to those of the acoustic features, and that a combination of both modalities increases performance.
Index Terms: prosody, prominence, visual, audio-visual, spectral emphasis, lip movement, head movement