ISCA Archive Interspeech 2006

Novel method for data clustering and mode selection with application in voice conversion

Jani Nurminen, Jilei Tian, Victor Popa

Because the statistical properties of speech signals vary and depend heavily on content, it is hard to design speech processing techniques that perform well on all inputs. For example, in voice conversion, where the aim is to transform speech uttered by a source speaker so that it sounds as if it were spoken by a target speaker, different types of inter-speaker relationships can be found in different types of speech segments. To tackle this problem in a robust manner, we have developed a novel scheme for data clustering and mode selection. When applied to voice conversion, the main idea of the proposed approach is to first cluster the target data so as to minimize intra-cluster variability. Then, a mode selector (a classifier) is trained on aligned source-related data to recognize the target-based clusters. In addition to the source data, auxiliary speech features can be used to enhance classification accuracy. Finally, a separate conversion scheme is trained and used for each cluster. The proposed scheme is fully data-driven and avoids the need for heuristic solutions. The superior performance of the proposed scheme has been verified in a practical voice conversion system.
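
The three steps described in the abstract (cluster the target data, train a mode selector on aligned source data, train one conversion model per cluster) can be illustrated with a minimal toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: features are scalars, the clustering is plain k-means, the mode selector is a nearest-class-mean classifier, each per-cluster conversion is a least-squares line, and auxiliary features are omitted. All function names are hypothetical.

```python
import random


def kmeans_1d(values, k, iters=20, seed=0):
    """Step 1 (assumed k-means): cluster scalar TARGET features."""
    rng = random.Random(seed)
    centroids = rng.sample(list(values), k)
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: (v - centroids[c]) ** 2)
                  for v in values]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


def train_mode_selector(source, labels, k):
    """Step 2 (assumed nearest-class-mean classifier): predict the
    target-derived cluster label from the aligned SOURCE feature."""
    means = []
    for c in range(k):
        members = [s for s, l in zip(source, labels) if l == c]
        # Empty clusters get an infinite mean so they are never selected.
        means.append(sum(members) / len(members) if members else float("inf"))
    return lambda s: min(range(k), key=lambda c: abs(s - means[c]))


def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b (assumed conversion model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var > 0 else 1.0
    return a, my - a * mx


def train_clustered_conversion(source, target, k=2):
    """Step 3: train one conversion model per target-based cluster and
    return a function that selects a mode, then converts."""
    _, labels = kmeans_1d(target, k)
    select = train_mode_selector(source, labels, k)
    models = {}
    for c in range(k):
        pairs = [(s, t) for s, t, l in zip(source, target, labels) if l == c]
        if pairs:
            models[c] = fit_line(*zip(*pairs))

    def convert(s):
        a, b = models[select(s)]
        return a * s + b

    return convert


# Toy aligned data with two different source-to-target relationships.
source = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 5.0, 5.2, 5.4, 5.6, 5.8, 6.0]
target = [2 * s + 10 if s <= 1.0 else -s for s in source]
convert = train_clustered_conversion(source, target, k=2)
print(convert(0.5), convert(5.5))
```

In this toy data a single global line would fit neither regime well, whereas the per-cluster models recover both mappings. Note that the clusters are defined on the target side, while the selector sees only source-side features at conversion time.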