ISCA Archive Eurospeech 1993

English speech synthesis based on multi-layered context oriented clustering; towards multi-lingual speech synthesis

Shin'ya Nakajima

In this paper, we propose a new synthesis unit learning method aimed at multi-lingual speech synthesis and describe its application to English speech synthesis. The method, termed Multi-Layered Context Oriented Clustering (ML-COC), is a generalized framework of the COC method, which has been applied to Japanese speech synthesis. The conventional COC method produces a set of phonetic-context-dependent units through a cluster splitting process. In ML-COC, the notion of context is generalized, and factors other than phonetic context, such as stress and syntactic boundaries, are taken into account to capture the richer phoneme variations of English. A synthesis unit generation experiment shows that ML-COC produces about three times as many synthesis units as the conventional COC (Single-Layered COC: SL-COC) method, that the average inner-cluster variance of ML-COC units is 20% lower than that of SL-COC units, and that each ML-COC unit covers about twice as many contexts as each SL-COC unit on average. These results suggest that the ML-COC synthesis units reflect the phonological structure of English much more faithfully than do the SL-COC units.

Keywords: speech synthesis, multi-lingual speech synthesis, context dependent unit, phonetic context.
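
The cluster splitting process mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the splitting criterion or the feature representation, so everything below is an assumption made for illustration: each segment carries a single scalar feature, a split is chosen greedily by the reduction in within-cluster sum of squared errors, and the factor names `left` and `stress`, the threshold `min_gain`, and all function names are hypothetical. Restricting the factors to phonetic context mimics a single-layered setup, while adding a stress factor mimics a multi-layered one.

```python
from statistics import pvariance

def sse(values):
    """Sum of squared deviations from the mean (0 for fewer than two values)."""
    if len(values) < 2:
        return 0.0
    return pvariance(values) * len(values)

def best_split(cluster, factors):
    """Return (gain, factor, value) for the question 'context[factor] == value'
    that most reduces within-cluster SSE, or None if no question splits the cluster."""
    base = sse([x for _, x in cluster])
    best = None
    for f in factors:
        for v in sorted({ctx[f] for ctx, _ in cluster}):
            yes = [x for ctx, x in cluster if ctx[f] == v]
            no = [x for ctx, x in cluster if ctx[f] != v]
            if not yes or not no:
                continue  # this question does not separate anything
            gain = base - sse(yes) - sse(no)
            if best is None or gain > best[0]:
                best = (gain, f, v)
    return best

def split_clusters(segments, factors, min_gain=0.1):
    """Greedy top-down splitting: keep splitting while the SSE gain exceeds min_gain.
    (Assumed stopping rule; not taken from the paper.)"""
    pending, done = [list(segments)], []
    while pending:
        cluster = pending.pop()
        choice = best_split(cluster, factors)
        if choice is None or choice[0] <= min_gain:
            done.append(cluster)
            continue
        _, f, v = choice
        pending.append([s for s in cluster if s[0][f] == v])
        pending.append([s for s in cluster if s[0][f] != v])
    return done

def mean_variance(clusters):
    """Average of the per-cluster (population) variances."""
    return sum(pvariance([x for _, x in c]) for c in clusters) / len(clusters)

# Toy segments of one phoneme: (context, scalar feature). Values are invented.
segments = [
    ({"left": "s", "stress": True}, 1.0),
    ({"left": "t", "stress": True}, 1.1),
    ({"left": "s", "stress": False}, 3.0),
    ({"left": "t", "stress": False}, 3.1),
]

sl_units = split_clusters(segments, factors=["left"])            # phonetic context only
ml_units = split_clusters(segments, factors=["left", "stress"])  # phonetic context plus stress

print(len(sl_units), mean_variance(sl_units))
print(len(ml_units), mean_variance(ml_units))
```

In this toy data the feature varies mainly with stress, so splitting on phonetic context alone finds no worthwhile question, whereas allowing the stress factor yields more units with lower within-unit variance. The numbers are invented and are not meant to reproduce the figures reported in the abstract.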