ISCA Archive Interspeech 2014

A comparison of open-source segmentation architectures for dealing with imperfect data from the media in speech synthesis

A. Gallardo-Antolín, J. M. Montero, Simon King

Traditional Text-To-Speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. To develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used to train new voices covering a much wider range of speaking styles. However, to process this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music, and noise) by filtering out low-quality audio segments and creating mono-speaker clusters. In this paper we compare several architectures for combining speaker diarization with music and noise detection in order to improve the precision and overall quality of the segmentation.
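
As a rough illustration of what combining speaker diarization with music and noise detection can look like, the following minimal sketch removes the detected music/noise regions from each diarized speaker turn and groups the remaining clean pieces by speaker. It is not one of the architectures compared in the paper: the function names (`subtract_intervals`, `build_speaker_clusters`), the `min_duration` threshold, and the example timings are all hypothetical, and the diarization and music/noise detector outputs are assumed to be already available as lists of time intervals in seconds.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[float, float]             # (start, end) in seconds
SpeakerSegment = Tuple[float, float, str]  # (start, end, speaker label)


def subtract_intervals(segment: Interval, masks: List[Interval]) -> List[Interval]:
    """Return the parts of `segment` not covered by any interval in `masks`."""
    start, end = segment
    clean: List[Interval] = []
    cursor = start
    for m_start, m_end in sorted(masks):
        # Skip masks that lie entirely before the cursor or after the segment.
        if m_end <= cursor or m_start >= end:
            continue
        # Keep the clean stretch between the cursor and the mask start.
        if m_start > cursor:
            clean.append((cursor, m_start))
        cursor = max(cursor, m_end)
        if cursor >= end:
            break
    if cursor < end:
        clean.append((cursor, end))
    return clean


def build_speaker_clusters(
    diarization: List[SpeakerSegment],
    noisy: List[Interval],
    min_duration: float = 1.0,  # hypothetical threshold, not from the paper
) -> Dict[str, List[Interval]]:
    """Group clean (music/noise-free) pieces of audio by speaker label."""
    clusters: Dict[str, List[Interval]] = defaultdict(list)
    for start, end, speaker in diarization:
        for piece in subtract_intervals((start, end), noisy):
            # Discard fragments too short to be useful for voice building.
            if piece[1] - piece[0] >= min_duration:
                clusters[speaker].append(piece)
    return dict(clusters)


# Hypothetical detector outputs for a 25-second recording.
diarization = [(0.0, 10.0, "spk1"), (10.0, 18.0, "spk2"), (18.0, 25.0, "spk1")]
music_or_noise = [(4.0, 6.0), (17.5, 19.0)]

clusters = build_speaker_clusters(diarization, music_or_noise)
```

In this sketch the music/noise mask is applied after diarization. Running the detectors in the opposite order, or jointly, would be a different way of combining them and is the kind of design choice an architecture comparison would need to weigh.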