The performance of current high-quality concatenative text-to-speech (TTS) systems is limited in noisy environments. This paper investigates whether the intelligibility of synthesized speech in noise can be improved by emphasizing the prosody. Additionally, the paper presents a method that can effectively emphasize the prosody of units in existing TTS databases. The circular linear prediction (CLP) model is combined with the constant-pitch transform (CPT) to modify the pitch and duration of concatenative TTS units with little impact on subjective quality. Test utterances are generated using the method and compared with reference utterances synthesized by a high-quality TTS engine. The subjective test results demonstrate a preference for emphasized prosody in the majority of the test cases.
Index Terms. TTS, speech synthesis, linear prediction, prosody, noisy speech