ISCA Archive AVSP 2008

Effect of audio-visual asynchrony between time-expanded speech and a moving image of a talker's face on detection and tolerance thresholds

Shuichi Sakamoto, Akihiro Tanaka, Shun Numahata, Atsushi Imai, Tohru Takagi, Yôiti Suzuki

In this study, we measured detection and tolerance thresholds of auditory-visual asynchrony between time-expanded speech and a moving image of the talker's face. During the experiments, words were presented under two conditions: asynchrony introduced by time-expanded speech (expansion condition: EXP) and a simple timing shift (asynchronous condition: ASYN). We used 16 shorter Japanese words (four morae) and 20 longer Japanese words (seven or eight morae). All auditory speech was presented in pink noise to avoid a ceiling effect; the SNRs for shorter and longer words were set to -10 dB and -3.5 dB, respectively. For EXP, the auditory speech signals were analyzed and resynthesized using STRAIGHT to change the words' duration (Kawahara et al., 1998). The resynthesized auditory signals were combined with the visual signals so that the onset of the utterance was synchronous. For ASYN, the auditory speech signal was simply delayed relative to the visual speech signal. Results showed that detection and tolerance thresholds for longer words were higher than those for shorter words. However, when the thresholds were recalculated as a function of the ratio of the expansion rate to word duration, these differences were not observed. These results suggest that detection and tolerance thresholds for auditory-visual asynchrony between time-expanded speech and a moving image of the talker's face might depend on the ratio of the expansion rate to word duration.