ISCA Archive Interspeech 2014

A comparison of open-source segmentation architectures for dealing with imperfect data from the media in speech synthesis

A. Gallardo-Antolín, J. M. Montero, Simon King

Traditional Text-To-Speech (TTS) systems have been developed using specially designed, non-expressive scripted recordings. To develop a new generation of expressive TTS systems in the Simple4All project, real recordings from the media should be used to train new voices with a whole new range of speaking styles. However, to process this more spontaneous material, the new systems must be able to deal with imperfect data (multi-speaker recordings, background and foreground music and noise), filtering out low-quality audio segments and creating mono-speaker clusters. In this paper we compare several architectures that combine speaker diarization with music and noise detection and that improve the precision and overall quality of the segmentation.


doi: 10.21437/Interspeech.2014-515

Cite as: Gallardo-Antolín, A., Montero, J.M., King, S. (2014) A comparison of open-source segmentation architectures for dealing with imperfect data from the media in speech synthesis. Proc. Interspeech 2014, 2370-2374, doi: 10.21437/Interspeech.2014-515

@inproceedings{gallardoantolin14_interspeech,
  author={A. Gallardo-Antolín and J. M. Montero and Simon King},
  title={{A comparison of open-source segmentation architectures for dealing with imperfect data from the media in speech synthesis}},
  year=2014,
  booktitle={Proc. Interspeech 2014},
  pages={2370--2374},
  doi={10.21437/Interspeech.2014-515},
  issn={2308-457X}
}