ISCA Archive AVSP 1998

Two- and Three-Dimensional Audio-Visual Speech Synthesis

N. Michael Brooke, Simon D. Scott

An audio-visual speech synthesiser has been built that generates animated computer-graphics displays of high-resolution colour images of a speaker's mouth area. Given a text input, the visual displays can simulate the movements of the lower face of a talker for any spoken sentence of British English. The synthesiser is based on a data-driven technique. It uses encoded video-recorded images and sounds of a real speaker to 'train' hidden Markov models (HMMs), that is, to find optimal parameter values for them, so that the models capture both the sounds and the facial gestures for each of the speech sounds of British English. To synthesise an utterance, the trained HMMs associated with its speech sounds are invoked in sequence to produce outputs that can be decoded into an image and sound sequence. Whilst the basic image syntheses are two-dimensional, they can be pasted onto a three-dimensional wireframe model of the lower part of a head which, when the jaw outline is adjusted, produces a plausible three-dimensional visual speech animation.
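The idea of invoking trained HMMs in sequence can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how outputs are drawn from each model, so the sketch assumes a toy left-to-right HMM that emits each state's mean vector for that state's expected duration. The names `PhoneHMM`, `synthesise` and `MODELS`, the two-element vectors and all numeric values are hypothetical placeholders for the encoded audio-visual parameters.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

Vector = List[float]


@dataclass
class PhoneHMM:
    """Toy left-to-right HMM for one speech sound (hypothetical structure).

    means[i] is the mean encoded audio-visual vector of state i;
    stay_probs[i] is that state's self-transition probability (must be < 1).
    """
    means: List[Vector]
    stay_probs: List[float]

    def generate(self) -> List[Vector]:
        frames: List[Vector] = []
        for mean, p_stay in zip(self.means, self.stay_probs):
            # Assumption: hold each state for its expected (geometric)
            # duration, 1 / (1 - p_stay), and emit the state mean each frame.
            duration = max(1, round(1.0 / (1.0 - p_stay)))
            frames.extend(list(mean) for _ in range(duration))
        return frames


def synthesise(phones: Sequence[str], models: Dict[str, PhoneHMM]) -> List[Vector]:
    """Invoke the model for each speech sound in order and concatenate outputs."""
    frames: List[Vector] = []
    for phone in phones:
        frames.extend(models[phone].generate())
    return frames


# Made-up models for two speech sounds; real models would be trained on
# encoded video and audio recordings of a speaker.
MODELS = {
    "b": PhoneHMM(means=[[0.0, 0.1], [0.2, 0.4]], stay_probs=[0.5, 0.5]),
    "a": PhoneHMM(means=[[0.9, 0.8]], stay_probs=[0.75]),
}

trajectory = synthesise(["b", "a"], MODELS)
```

In a full system, each vector in `trajectory` would then be decoded into an image frame and a segment of sound; that decoding step depends on the encoding used and is left out here.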