ISCA Archive Interspeech 2014

Statistical parametric speech synthesis using weighted multi-distribution deep belief network

Shiyin Kang, Helen Meng

This paper presents a weighted multi-distribution deep belief network (wMD-DBN) for context-dependent statistical parametric speech synthesis. We previously proposed the MD-DBN for speech synthesis. It simultaneously models both the spectrum and the fundamental frequency (F0), and it has demonstrated the potential to generate high-quality, high-dimensional spectra and to produce natural synthesized speech. However, the model showed only mediocre performance on low-dimensional data such as F0 and the voiced/unvoiced (V/UV) flag, which resulted in a vibrating pitch contour in the synthesized voice. To address this problem, this paper investigates the use of an extra weighting vector on the acoustic output layer of the MD-DBN. The weighting vector reduces the dimensional imbalance between the spectrum and pitch parameters by assigning different weighting coefficients to the spectrum, F0 and the V/UV flag during training. Experimental results show that the wMD-DBN generates smoother pitch contours and improves the naturalness of the synthesized speech.
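The dimensional imbalance described in the abstract can be illustrated with a small sketch. It is not the paper's training procedure: it assumes a plain squared-error objective over the concatenated acoustic output vector and hypothetical stream sizes (40 spectral dimensions, one F0 value, one V/UV flag). The helper names (`build_weight_vector`, `stream_contributions`) and the inverse-dimension coefficients are hypothetical choices for illustration only, not the coefficients used by the authors. The sketch shows that, with uniform weights, the spectrum dominates the error simply because it has more dimensions, while a per-stream weighting vector can rebalance the contributions.

```python
from typing import Dict, List

# Hypothetical stream sizes; the paper's actual feature dimensions may differ.
STREAM_DIMS: Dict[str, int] = {"spectrum": 40, "f0": 1, "vuv": 1}


def build_weight_vector(dims: Dict[str, int], coeffs: Dict[str, float]) -> List[float]:
    """Expand one coefficient per stream into a per-dimension weighting vector."""
    weights: List[float] = []
    for name, dim in dims.items():  # dicts keep insertion order (Python 3.7+)
        weights.extend([coeffs[name]] * dim)
    return weights


def weighted_squared_error(target: List[float], output: List[float],
                           weights: List[float]) -> float:
    """Sum of per-dimension squared errors, each scaled by its weight."""
    return sum(w * (t - o) ** 2 for t, o, w in zip(target, output, weights))


def stream_contributions(target: List[float], output: List[float],
                         dims: Dict[str, int], weights: List[float]) -> Dict[str, float]:
    """Split the weighted error into the share contributed by each stream."""
    result: Dict[str, float] = {}
    start = 0
    for name, dim in dims.items():
        end = start + dim
        result[name] = weighted_squared_error(
            target[start:end], output[start:end], weights[start:end])
        start = end
    return result


# Usage: the same small error (0.1) on every dimension of one frame.
total_dim = sum(STREAM_DIMS.values())
target = [0.0] * total_dim
output = [0.1] * total_dim

# Uniform weights: the 40-dimensional spectrum outweighs F0 and V/UV by 40x.
uniform = stream_contributions(
    target, output, STREAM_DIMS,
    build_weight_vector(STREAM_DIMS, {"spectrum": 1.0, "f0": 1.0, "vuv": 1.0}))

# Hypothetical inverse-dimension weights: each stream contributes equally.
balanced_coeffs = {name: 1.0 / dim for name, dim in STREAM_DIMS.items()}
balanced = stream_contributions(
    target, output, STREAM_DIMS,
    build_weight_vector(STREAM_DIMS, balanced_coeffs))

print(uniform)
print(balanced)
```

Inverse-dimension weighting is only the simplest way to equalize stream contributions; in practice the coefficients would be treated as tunable hyperparameters, and a binary stream such as the V/UV flag would more naturally be modeled with a Bernoulli-type criterion than with squared error.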


doi: 10.21437/Interspeech.2014-442

Cite as: Kang, S., Meng, H. (2014) Statistical parametric speech synthesis using weighted multi-distribution deep belief network. Proc. Interspeech 2014, 1959-1963, doi: 10.21437/Interspeech.2014-442

@inproceedings{kang14_interspeech,
  author={Shiyin Kang and Helen Meng},
  title={{Statistical parametric speech synthesis using weighted multi-distribution deep belief network}},
  year=2014,
  booktitle={Proc. Interspeech 2014},
  pages={1959--1963},
  doi={10.21437/Interspeech.2014-442},
  issn={2308-457X}
}