This work aims at creating expressive voices from audiobooks using
semantic selection. First, an acoustic feature vector is extracted for
each utterance of the audiobook, including iVectors built on MFCC and
F0 features. Then, the transcription is projected into a semantic vector
space. A seed utterance is projected to the semantic vector space and
the N nearest neighbors are selected. The selection is then filtered
to keep only acoustically similar data. The proposed technique
can be used to train emotional voices by using emotional keywords or
phrases as seeds, obtaining training data semantically similar to the
seed. It can also be used to read larger texts in an expressive manner,
creating specific voices for each sentence. The latter application
is compared to a DNN predictor, which predicts acoustic features from
semantic features. The selected data is used to adapt statistical speech
synthesis models. The performance of the technique is evaluated objectively
and in a perceptual experiment. In the first part of the experiment,
subjects clearly prefer particular expressive voices for synthesizing
semantically expressive utterances. In the second part,
the proposed method is shown to achieve performance similar to or better
than that of the DNN-based prediction.
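
The two-stage selection described above (N nearest neighbors in the semantic space, followed by an acoustic filter) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the work itself: cosine similarity as the distance measure, the centroid of the selected neighbors as the acoustic reference, and the hypothetical names `select_utterances`, `n_neighbors` and `ac_threshold`. The toy vectors stand in for real semantic embeddings and iVectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_utterances(seed_sem, sem_vecs, ac_vecs, n_neighbors, ac_threshold):
    """Return indices of utterances near the seed semantically and acoustically coherent.

    Assumption: acoustic similarity is measured against the centroid of the
    semantically selected utterances; the source does not specify the reference.
    """
    # Stage 1: rank all utterances by semantic similarity to the seed, keep the N best.
    ranked = sorted(
        range(len(sem_vecs)),
        key=lambda i: cosine(seed_sem, sem_vecs[i]),
        reverse=True,
    )
    neighbors = ranked[:n_neighbors]
    if not neighbors:
        return []

    # Stage 2: keep only neighbors whose acoustic vector is close to the centroid.
    dim = len(ac_vecs[0])
    centroid = [
        sum(ac_vecs[i][d] for i in neighbors) / len(neighbors) for d in range(dim)
    ]
    return [i for i in neighbors if cosine(ac_vecs[i], centroid) >= ac_threshold]


# Toy data: four utterances with 2-dimensional semantic and acoustic vectors.
sem_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
ac_vecs = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0], [0.0, 1.0]]
seed = [1.0, 0.0]

selected = select_utterances(seed, sem_vecs, ac_vecs, n_neighbors=3, ac_threshold=0.8)
print(selected)
```

In this toy setting, utterance 3 is semantically close to the seed but acoustically unlike the other neighbors, so the filter drops it. The surviving indices would then identify the adaptation data for the synthesis models.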