ISCA Archive Interspeech 2022

Text-to-speech synthesis using spectral modeling based on non-negative autoencoder

Takeru Gorai, Daisuke Saito, Nobuaki Minematsu

This paper proposes a statistical parametric speech synthesis system that uses a non-negative autoencoder (NAE) for spectral modeling. The NAE is a model that extends non-negative matrix factorization (NMF) to a neural network. In the proposed method, we employ the latent variables of the NAE as acoustic features, and the reconstruction of spectral information and the estimation of latent variables are trained simultaneously. The non-negativity of the latent variables in the NAE is expected to yield a dimensionality reduction that preserves the fine structure of the spectral envelopes. Experimental results demonstrate the effectiveness of the proposed framework. We also study multispeaker modeling, in which each NAE corresponds to a single speaker. In addition, a neural source-filter (NSF) model was applied to waveform generation. When a neural vocoder is trained with natural acoustic features and tested with synthesized features, quality degradation occurs due to the mismatch between training and test data. To mitigate this mismatch, the proposed system trains the vocoder on features obtained by reconstructing natural speech with the NAE. Experimental results show that reconstructed features are similar to synthesized features, and as a result, the quality of the synthesized speech is improved.
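The core idea, an autoencoder whose latents and decoder mimic NMF's non-negative factorization V ≈ WH, can be sketched as follows. This is a minimal hypothetical illustration using NumPy, not the authors' implementation: non-negativity of the weights is enforced via softplus and non-negativity of the latents via ReLU (the specific constraint mechanism is an assumption).

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    """Smooth map to positive values, used to keep weights non-negative."""
    return np.log1p(np.exp(x))

class NonNegativeAutoencoder:
    """Minimal single-layer NAE sketch (hypothetical, for illustration).

    The ReLU-clipped latents play the role of NMF activations H, and the
    softplus-constrained decoder weights play the role of the basis W, so
    the reconstruction h @ W_d stays non-negative like a magnitude spectrum.
    """

    def __init__(self, dim_in, dim_latent):
        # Unconstrained parameters; non-negativity is imposed at use time.
        self.We = rng.normal(scale=0.1, size=(dim_in, dim_latent))
        self.Wd = rng.normal(scale=0.1, size=(dim_latent, dim_in))

    def encode(self, v):
        # Non-negative latent variables, usable as compact acoustic features.
        return np.maximum(v @ softplus(self.We), 0.0)

    def decode(self, h):
        # Non-negative decoder weights keep the output spectrum non-negative.
        return h @ softplus(self.Wd)

    def reconstruct(self, v):
        return self.decode(self.encode(v))
```

Training would minimize a reconstruction loss (e.g. a divergence between input and reconstructed spectra) over both weight matrices, so feature extraction and spectral reconstruction are learned jointly, as the abstract describes.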