This paper describes our efforts towards creating corporate synthetic voices from low-quality speech data of the kind typically found on many Interactive Voice Response (IVR) units. We first touch on several normalization techniques that aim to better support a highly automated voice construction process. We then describe methods for creating enriched corporate voices, which integrate speech recordings from different speakers in order to overcome problems arising from limited-domain training data.
We describe experiments that demonstrate the feasibility of the approach by comparing it to a less flexible solution that uses pre-recorded prompts in combination with a large-footprint standard concatenative synthesizer. Results show that the enriched voices clearly outperform voices built solely from IVR data, while achieving almost the same overall rating as the pre-recorded prompts solution.