ISCA Archive Interspeech 2006

Conditional random fields for hierarchical segment selection in text-to-speech synthesis

Christian Weiss, Wolfgang Hess

In this paper we present a statistically motivated conditional random field (CRF) approach to concatenative TTS. We use context-dependent CRFs to select speech segments, which are then concatenated into an acoustic speech waveform. The CRF approach is used in our corpus-based TTS system AVISS. The acoustic synthesis module consists of context-dependent CRF models trained on a multi-level acoustic unit inventory, over which a hierarchical top-down search selects appropriate segments. The acoustic synthesis is easily adaptable to other languages: only a language-specific module for text and symbolic preprocessing is needed, together with duration and F0 prediction, which can be performed by a prosodic module. The system produces good results in the generated speech waveforms. The CRF approach can be used for selecting acoustic units as well as for parametric synthesis, where the speech parameters are generated by CRFs and the speech waveform is produced by a synthesis filter.
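
The hierarchical top-down search over a multi-level inventory can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not name the inventory levels, so a word > syllable > phone hierarchy is assumed, and the names INVENTORY, WORD_TO_SYLLABLES, SYLLABLE_TO_PHONES and select_top_down are hypothetical. A plain membership lookup stands in for the trained CRF models, which in the described system would decide which candidate segment is appropriate in context.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical three-level inventory (word > syllable > phone). The abstract
# does not name the levels; these are illustrative assumptions only.
INVENTORY: Dict[str, Set[str]] = {
    "word": {"hello"},
    "syllable": {"wor"},
    "phone": {"h", "e", "l", "o", "w", "r", "d"},
}

# Hypothetical decomposition tables; a real system would derive these from
# the text and symbolic preprocessing module.
WORD_TO_SYLLABLES: Dict[str, List[str]] = {
    "hello": ["hel", "lo"],
    "world": ["wor", "ld"],
}
SYLLABLE_TO_PHONES: Dict[str, List[str]] = {
    "hel": ["h", "e", "l"],
    "lo": ["l", "o"],
    "wor": ["w", "o", "r"],
    "ld": ["l", "d"],
}


def select_top_down(word: str) -> List[Tuple[str, str]]:
    """Return (level, unit) pairs, preferring the largest available unit.

    The inventory lookup is a stand-in for CRF-based selection: it only
    checks availability and does not score candidates in context.
    """
    # Top level: use the whole word if the inventory contains it.
    if word in INVENTORY["word"]:
        return [("word", word)]

    selected: List[Tuple[str, str]] = []
    for syllable in WORD_TO_SYLLABLES[word]:
        # Middle level: use the syllable if available.
        if syllable in INVENTORY["syllable"]:
            selected.append(("syllable", syllable))
            continue
        # Bottom level: fall back to the individual phones.
        for phone in SYLLABLE_TO_PHONES[syllable]:
            if phone not in INVENTORY["phone"]:
                raise KeyError(f"no unit available for phone {phone!r}")
            selected.append(("phone", phone))
    return selected


if __name__ == "__main__":
    print(select_top_down("hello"))
    print(select_top_down("world"))
```

In this sketch a unit is taken at the highest level at which it exists and the search descends only for the parts that are missing, so longer segments are preferred and fewer concatenation points are introduced.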