ISCA Archive Interspeech 2015

Viseme comparison based on phonetic cues for varying speech accents

Chitralekha Bhat, Sunil Kopparapu

Human interaction through speech is a multisensory activity, wherein spoken audio is perceived using both auditory and visual cues. In the absence of an auditory stimulus, however, speech content can still be perceived through lip reading, aided by the dynamics of the social context. In our earlier work [1], we presented a tool that enables the hearing impaired to understand speech in videos through lip reading. During evaluation, it was found that a hearing-impaired person trained to lip read Indian English was unable to lip read speech in other accents of English. We hypothesize that this difficulty can be attributed to differences in viseme formation arising from underlying phonetic characteristics. In this paper, we present a comparison between the auditory and visual spaces for the same speech utterance in English, as spoken by an Indian and a Croatian national. Results show a clear correlation between distances in the visual and auditory domains at the viseme level. We then evaluate the feasibility of building visual subtitles through viseme adaptation from an unknown accent to a known accent.
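The core measurement the abstract describes, comparing the same viseme across two accents in both the auditory and visual domains and then correlating the resulting distances, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: the feature choices (MFCC frames for the auditory space, lip-landmark trajectories for the visual space), the DTW distance, the viseme inventory, and all data arrays are placeholders.

```python
import numpy as np
from scipy.stats import pearsonr

def dtw_distance(a, b):
    """Length-normalised DTW distance between two feature sequences
    (frames x dims), using Euclidean frame-to-frame cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

# Placeholder per-viseme features for the two accents. In a real
# experiment these would be extracted from aligned audio (e.g. MFCCs)
# and video (e.g. lip-landmark coordinates); random arrays are used
# here only so the sketch runs, so the resulting r is meaningless.
rng = np.random.default_rng(0)
visemes = ["p", "f", "t", "s", "k", "a", "e", "o"]  # hypothetical inventory
aud_dist, vis_dist = [], []
for v in visemes:
    mfcc_accent1 = rng.normal(size=(30, 13))  # e.g. Indian-accent MFCC frames
    mfcc_accent2 = rng.normal(size=(34, 13))  # e.g. Croatian-accent MFCC frames
    lips_accent1 = rng.normal(size=(30, 20))  # lip-landmark trajectory
    lips_accent2 = rng.normal(size=(34, 20))
    aud_dist.append(dtw_distance(mfcc_accent1, mfcc_accent2))
    vis_dist.append(dtw_distance(lips_accent1, lips_accent2))

# Correlate per-viseme cross-accent distances across the two domains.
r, p = pearsonr(aud_dist, vis_dist)
print(f"viseme-level auditory/visual distance correlation: r={r:.3f} (p={p:.3f})")
```

A high correlation under such a setup would support the paper's finding that cross-accent differences visible on the lips track the underlying phonetic differences, which is what makes viseme adaptation from an unknown to a known accent plausible.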