ISCA Archive Eurospeech 2003

An expandable web-based audiovisual text-to-speech synthesis system

Sascha Fagel, Walter F. Sendlmeier

The authors propose a framework for audiovisual speech synthesis systems [1] and present a first implementation of it [2], called MASSY (Modular Audiovisual Speech SYnthesizer). This paper describes how the audiovisual speech synthesis system, the 'talking head', works, how it can be integrated into web applications, and why it is worth using. The presented applications use the wrapped audio synthesis, the phonetic and visual articulation modules, and a face module. Of the two visual articulation models implemented so far, the one based on a dominance model for coarticulation is used. The face is a 3D model described in VRML 97. The facial animation is described by a motion parameter model that is capable of realizing the most important visible articulation gestures [3][4]. MASSY follows the client-server paradigm. The server is easy to set up and needs no special or high-performance hardware. The required bandwidth is low, and the client is an ordinary web browser with a freely available standard plug-in. The system is used for the evaluation of measured and predicted articulation models and is also suitable for enhancing human-computer interfaces in applications such as virtual tutors in e-learning environments, speech training, video conferencing, computer games, audiovisual information systems, and virtual agents.
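
The abstract names a dominance model for coarticulation but does not give its formulas. As an illustration only, the following minimal sketch assumes one common formulation of such models: each speech segment has a target value for a motion parameter and a dominance that decays exponentially with temporal distance from the segment center, and the parameter value at any time is the dominance-weighted average of all targets. The `Segment` fields, the exponential shape, and all numeric values are hypothetical and are not taken from MASSY.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """One speech segment for a single motion parameter (all fields hypothetical)."""
    center: float     # time of the segment center in seconds
    target: float     # target value of the parameter, e.g. normalized lip opening
    magnitude: float  # peak dominance at the segment center (must be > 0)
    rate: float       # exponential decay rate of the dominance, per second


def dominance(seg: Segment, t: float) -> float:
    """Assumed dominance function: exponential decay around the segment center."""
    return seg.magnitude * math.exp(-seg.rate * abs(t - seg.center))


def blend(segments: list[Segment], t: float) -> float:
    """Parameter value at time t as the dominance-weighted average of all targets."""
    weights = [dominance(s, t) for s in segments]
    total = sum(weights)
    return sum(w * s.target for w, s in zip(weights, segments)) / total


# Two neighboring segments: closed lips (target 0.0) followed by open lips (target 1.0).
closed = Segment(center=0.0, target=0.0, magnitude=1.0, rate=20.0)
opened = Segment(center=0.2, target=1.0, magnitude=1.0, rate=20.0)

midpoint = blend([closed, opened], 0.1)  # both segments influence the value equally
at_first = blend([closed, opened], 0.0)  # the first segment dominates here
```

In this sketch, a segment with a larger `magnitude` or a smaller `rate` would pull neighboring segments toward its own target more strongly, which is the coarticulation effect such a model is meant to capture.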