ISCA Archive SpeechProsody 2008

Dialog speech acts and prosody: considerations for TTS

Ann K. Syrdal, Yeon-Jun Kim

As natural language dialog systems involving both speech recognition and text-to-speech (TTS) synthesis become more sophisticated, the limitations of general-purpose TTS for human-computer dialogs have become more apparent. Much of the subtlety and complexity of meaning in natural language dialogs is conveyed by prosody; how something is said is often as important as what words are spoken. At the same time, advances such as unit selection synthesis have greatly improved the naturalness of synthetic speech because much less signal processing is required, resulting in less distortion. However, this improved naturalness has been achieved at the cost of the more precise prosodic control provided by earlier, more robotic-sounding synthesizers.

With the goal of providing more prosodic and expressive control over unit selection TTS for dialog applications, while retaining naturalness, we have focused on speech acts, that is, the communicative functions of utterances. The current working set of speech acts includes:

Imperative: directive, request, wait, repeat, warning
Interrogative: question-wh, question-yes/no, question-multiple choice
Assertive: informative-general, informative-detail
Affective: apology, exclamation-positive, exclamation-negative, greeting, good-bye, thanks
Others: confirmation, disconfirmation, back-channel, cue phrase
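
As an illustration only, the working set above could be held in a simple lookup structure so that each utterance label can be validated and mapped back to its category. The names `SPEECH_ACTS`, `CATEGORY_OF`, and `category_of` are hypothetical and are not taken from the authors' system.

```python
# Working set of speech acts, grouped by category as listed in the text.
SPEECH_ACTS = {
    "Imperative": ["directive", "request", "wait", "repeat", "warning"],
    "Interrogative": ["question-wh", "question-yes/no", "question-multiple choice"],
    "Assertive": ["informative-general", "informative-detail"],
    "Affective": ["apology", "exclamation-positive", "exclamation-negative",
                  "greeting", "good-bye", "thanks"],
    "Others": ["confirmation", "disconfirmation", "back-channel", "cue phrase"],
}

# Reverse index from speech act to category (hypothetical helper).
CATEGORY_OF = {act: cat for cat, acts in SPEECH_ACTS.items() for act in acts}


def category_of(act: str) -> str:
    """Return the category of a speech act label, or raise ValueError if unknown."""
    try:
        return CATEGORY_OF[act]
    except KeyError:
        raise ValueError(f"unknown speech act: {act!r}") from None
```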

Our work is practically focused, but also involves some observations of more general interest. We use a relatively small set of speech acts both to classify utterances in a speech corpus according to their communicative function and to preferentially select speech act-appropriate units to match the desired speech act of the utterance to be synthesized. The corpus is composed of speech read (primarily from interactive dialogs of various kinds) by a female US English speaker (a voice talent used to build one of our TTS voices). We examine prosodic differences of a more "global" nature (mean f0, f0 range, speaking rate, energy level) for the entire set of speech acts. A portion of the database has also been ToBI-labeled and analyzed for systematic differences. There are several significant prosodic differences among the various speech acts.
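
The global measures named above can be sketched as follows. This is a minimal illustration, not the authors' analysis: the `Utterance` fields, the definition of f0 range as maximum minus minimum in Hz, and speaking rate in syllables per second are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Utterance:
    """Hypothetical per-utterance record; field names are assumptions."""
    speech_act: str
    f0_hz: list[float]   # f0 samples from voiced frames only
    n_syllables: int
    duration_s: float
    rms_energy: float


def global_features(u: Utterance) -> dict[str, float]:
    """Compute the four global prosodic measures for one utterance."""
    return {
        "mean_f0": mean(u.f0_hz),
        "f0_range": max(u.f0_hz) - min(u.f0_hz),      # assumed: max minus min, in Hz
        "speaking_rate": u.n_syllables / u.duration_s,  # assumed: syllables per second
        "energy": u.rms_energy,
    }


def summarize_by_speech_act(utts: list[Utterance]) -> dict[str, dict[str, float]]:
    """Average each global measure over all utterances sharing a speech act."""
    grouped = defaultdict(list)
    for u in utts:
        grouped[u.speech_act].append(global_features(u))
    return {
        act: {name: mean(f[name] for f in feats) for name in feats[0]}
        for act, feats in grouped.items()
    }
```

Per-speech-act averages of this kind are the sort of summary on which differences between speech acts could then be tested statistically.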

In our current TTS implementation, speech acts are being used as another feature to select speech units for concatenation, but results from analyzing prosodic features of the various speech acts will also be used to better predict the prosodic features desired. Results thus far are promising and examples will be demonstrated.
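
One way to picture speech acts as an additional selection feature is a soft penalty added to a candidate unit's cost when its corpus label differs from the desired speech act. The sketch below is a simplified assumption, not the implementation described here: it scores units one at a time, whereas unit selection normally searches over whole unit sequences, and `Unit`, `base_cost`, and the penalty weight are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical candidate unit from the speech corpus."""
    unit_id: str
    speech_act: str   # speech act label of the source utterance
    base_cost: float  # stand-in for all other target and join costs


# Hypothetical weight; a real system would tune this against its other costs.
SPEECH_ACT_MISMATCH_PENALTY = 1.0


def unit_cost(unit: Unit, desired_act: str,
              penalty: float = SPEECH_ACT_MISMATCH_PENALTY) -> float:
    """Base cost plus a penalty when the unit's speech act does not match."""
    mismatch = 0.0 if unit.speech_act == desired_act else penalty
    return unit.base_cost + mismatch


def select_unit(candidates: list[Unit], desired_act: str,
                penalty: float = SPEECH_ACT_MISMATCH_PENALTY) -> Unit:
    """Pick the lowest-cost candidate; matching units are preferred, not required."""
    if not candidates:
        raise ValueError("no candidate units")
    return min(candidates, key=lambda u: unit_cost(u, desired_act, penalty))
```

Because the penalty is additive rather than a hard filter, a mismatched unit can still be chosen when it is much better on the other costs, which is consistent with selecting speech act-appropriate units preferentially rather than exclusively.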