Audiobooks are a freely available source of rich, expressive speech data. To generate speech of this form accurately, expressiveness must be incorporated into the synthesis system. This paper investigates two parts of this process: the representation of expressive information in a statistical parametric speech synthesis system; and whether discrete expressive state labels can sufficiently represent the full diversity of expressive speech. Initially, a discrete form of expressive information was used. A new form of expressive representation, where each condition maps to a point in an expressive speech space, is described. This cluster adaptive training (CAT) system is compared to a system that incorporates expressive information into the decision tree construction and to a transform-based system using CMLLR and CSMAPLR. Experimental results indicate that the CAT system outperformed both contrast systems in expressiveness and voice quality. The CAT-style representation yields a continuous expressive speech space. Thus, it is possible to treat utterance-level expressiveness as a point in this continuous space, rather than as one of a set of discrete states. This continuous-space representation outperformed discrete clusters, indicating the limitations of discrete expressive labels for audiobook data.
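As a minimal sketch of the idea (using the standard CAT formulation from the general literature; the notation here is assumed, not taken from this paper), the mean of Gaussian component m under expressive condition e is an interpolation over P cluster means,

\mu_m^{(e)} = \sum_{p=1}^{P} \lambda_p^{(e)} \, \mu_{m,p} = \mathbf{M}_m \, \boldsymbol{\lambda}^{(e)},

so the weight vector \boldsymbol{\lambda}^{(e)} locates the condition as a point in a continuous expressive space: a discrete expressive label corresponds to one fixed weight vector, whereas the continuous-space representation allows any point in the space to be used for an utterance.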
Index Terms: expressive speech synthesis, hidden Markov model, cluster adaptive training, audiobook