ISCA Archive Interspeech 2016

Noise and Metadata Sensitive Bottleneck Features for Improving Speaker Recognition with Non-Native Speech Input

Yao Qian, Jidong Tao, David Suendermann-Oeft, Keelan Evanini, Alexei V. Ivanov, Vikram Ramanarayanan

Recently, text-independent speaker recognition systems with phonetically-aware DNNs, which allow the comparison among different speakers with “soft-aligned” phonetic content, have significantly outperformed standard i-vector based systems [9–12]. However, when applied to speaker recognition on a non-native spontaneous corpus, DNN-based speaker recognition does not retain this superior performance due to the relatively lower accuracy of phonetic content recognition. In this paper, noise-aware features and multi-task learning are investigated to improve the alignment of speech feature frames into the sub-phonemic “senone” space and to “distill” the L1 (native language) information of the test takers into bottleneck features (BNFs), which we refer to as metadata sensitive BNFs. Experimental results show that the system with metadata sensitive BNFs improves speaker recognition performance by a 23.9% relative reduction in equal error rate (EER) compared to the baseline i-vector system. In addition, since L1 information is used only to train the BNF extractor, it is not required as input for BNF extraction, i-vector extraction, or scoring on the enrollment and evaluation sets, which avoids relying on erroneous L1s claimed by impostors.
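To make the described setup concrete, the following is a minimal sketch of a multi-task bottleneck DNN of the kind the abstract describes: noise-aware acoustic input, a bottleneck layer whose activations serve as BNFs, and joint training against senone and L1 targets. This is not the authors' implementation; all layer sizes, feature dimensions, and the loss weight `l1_weight` are illustrative assumptions.

```python
# Sketch only: multi-task bottleneck DNN with senone and L1 (native language)
# output heads. Dimensions and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn


class MultiTaskBNF(nn.Module):
    def __init__(self, feat_dim=40, noise_dim=40, context=11,
                 hidden=1024, bnf_dim=64, n_senones=3000, n_l1=10):
        super().__init__()
        in_dim = (feat_dim + noise_dim) * context  # noise-aware, spliced input
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, bnf_dim), nn.ReLU(),  # bottleneck layer -> BNFs
        )
        self.senone_head = nn.Linear(bnf_dim, n_senones)  # senone posteriors
        self.l1_head = nn.Linear(bnf_dim, n_l1)           # L1 classification

    def forward(self, x):
        bnf = self.trunk(x)
        return bnf, self.senone_head(bnf), self.l1_head(bnf)


def multitask_loss(senone_logits, l1_logits, senone_tgt, l1_tgt, l1_weight=0.3):
    # Weighted sum of the two task losses; L1 labels are needed only in training.
    ce = nn.CrossEntropyLoss()
    return ce(senone_logits, senone_tgt) + l1_weight * ce(l1_logits, l1_tgt)


# At enrollment/evaluation time only the bottleneck activations are kept as
# frame-level features for i-vector extraction; no L1 label is required.
model = MultiTaskBNF()
frames = torch.randn(8, (40 + 40) * 11)          # dummy noise-aware input batch
bnf, senone_logits, l1_logits = model(frames)
loss = multitask_loss(senone_logits, l1_logits,
                      torch.randint(3000, (8,)), torch.randint(10, (8,)))
loss.backward()
```

Because the L1 head is discarded after training, the extractor can be applied to enrollment and test utterances without any claimed L1 metadata, which is the property the abstract highlights.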