This paper introduces speaker adaptive training techniques to tensor-based arbitrary speaker conversion. In voice conversion (VC) studies, realizing conversion from/to an arbitrary speaker's voice is an important objective. For this purpose, eigenvoice conversion (EVC) based on an eigenvoice Gaussian mixture model (EV-GMM) was proposed. Although EVC can effectively construct a conversion model for an arbitrary target speaker using only a few utterances, its performance does not improve much even when a large amount of adaptation data is available, because of an inherent problem in GMM supervectors. We previously proposed a tensor-based speaker space as a solution to this problem and realized more flexible control of speaker characteristics. In this paper, to further improve VC performance, speaker adaptive training and the tensor-based speaker representation are integrated. The proposed method can construct a flexible and precise conversion model, and experimental results on one-to-many voice conversion demonstrate the effectiveness of the proposed approach.
Index Terms: voice conversion, Gaussian mixture model, eigenvoice, Tucker decomposition, speaker adaptive training