Conventional text-to-speech (TTS) synthesis requires extensive linguistic
processing to produce high-quality output. The advent of end-to-end (E2E)
systems has shifted the paradigm, yielding better synthesized
voices. However, hidden Markov model (HMM) based systems remain
popular due to their fast synthesis time, robustness to limited training
data, and flexible adaptation of voice characteristics, speaking styles,
and emotions.
This paper proposes a technique that combines the classical parametric
HMM-based TTS framework (HTS) with the neural-network-based WaveGlow
vocoder using histogram equalization (HEQ) in a low-resource setting.
The two paradigms are combined by performing HEQ between mel-spectrograms
extracted from HTS-generated audio and the source spectra of the training data.
During testing, the synthesized mel-spectrograms are mapped to the
source spectrograms using the learned HEQ. Experiments are carried
out on the Hindi male and female datasets of the Indic TTS database. Systems
are evaluated using degradation mean opinion scores (DMOS). Results
indicate that the synthesis quality of the hybrid system is better
than that of the conventional HTS system. These results are promising,
as they pave the way for good-quality TTS systems that need less data
than E2E systems.
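
The HEQ mapping described above can be illustrated with a minimal sketch. It assumes that HEQ is realized as quantile (CDF) matching applied independently to each mel bin; the abstract does not specify this, so the function names `learn_heq` and `apply_heq`, the number of quantiles, and the (frames, mel bins) array layout are all hypothetical choices made for illustration.

```python
import numpy as np


def learn_heq(synth_mels, source_mels, n_quantiles=100):
    """Estimate per-mel-bin quantile tables (assumed form of HEQ).

    synth_mels, source_mels: arrays of shape (frames, n_mels). Only the
    value distributions are used, so the two arrays need not be
    frame-aligned or contain the same number of frames.
    """
    probs = np.linspace(0.0, 1.0, n_quantiles)
    q_synth = np.quantile(synth_mels, probs, axis=0)    # (n_quantiles, n_mels)
    q_source = np.quantile(source_mels, probs, axis=0)  # (n_quantiles, n_mels)
    return q_synth, q_source


def apply_heq(mel, q_synth, q_source):
    """Map a synthesized mel-spectrogram toward the source distribution."""
    out = np.empty_like(mel, dtype=float)
    for b in range(mel.shape[1]):
        # Piecewise-linear map from synthesized quantiles to source quantiles;
        # values outside the learned range are clipped to the end points.
        out[:, b] = np.interp(mel[:, b], q_synth[:, b], q_source[:, b])
    return out
```

Under these assumptions, `learn_heq` would be run once on mel-spectrograms pooled over the training data, and `apply_heq` on each HTS-generated mel-spectrogram at test time before it is passed to the vocoder.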