Voice conversion systems aim to process speech from a source speaker so that it is perceived as spoken by a target speaker. This paper presents a procedure for improving high-resolution voice conversion by modifying the algorithm used for residual estimation. The proposed residual estimation algorithm uses a hidden Markov model (HMM) to exploit the temporal dependencies between residuals in consecutive speech frames. A previous residual estimation technique based on Gaussian mixtures serves as the baseline. Both algorithms are evaluated in listening tests that measure perceived identity conversion and converted speech quality. The proposed algorithm generates converted speech of significantly better quality than the baseline without degrading identity conversion performance, and it works particularly well for female target speakers and cross-gender conversions.
Index Terms: Voice conversion, residual estimation, HMM, MOS test, ABX test