ISCA Archive Interspeech 2006

How auditory and visual prosody is used in end-of-utterance detection

Pashiera Barkhuysen, Emiel Krahmer, Marc Swerts

In this paper, we describe a series of perception studies on visual and auditory cues to end-of-utterance. Fragments were taken from a recorded interview session and consisted of the parts in which speakers provided answers. Final and non-final parts of these fragments, varying in length, were used. Based on these fragments, subjects had to assess whether the speaker had finished his or her turn. The fragments were presented in three conditions: bimodally (both auditory and visual), auditory only, or visual only. Results show that the audio-visual condition evoked the highest proportion of correct classifications and the auditory condition the lowest. Thus, the combination of modalities clearly works best. In addition, non-final fragments are classified better than final ones, and longer fragments are classified better than short ones. Furthermore, the effect of these factors appears to differ across modalities: longer fragments are better classified in the auditory modality, while for short fragments the visual modality works better. This suggests that people may make more use of global cues in the auditory modality, while for the visual modality local cues are sufficient.