ISCA Archive Odyssey 2018

End-to-End versus Embedding Neural Networks for Language Recognition in Mismatched Conditions

Jesus Antonio Villalba Lopez, Niko Brummer, Najim Dehak

Neural network architectures mapping variable-length speech utterances into fixed-dimensional embeddings have started to outperform state-of-the-art i-vector systems in speaker and language recognition tasks. However, neural networks are prone to over-fitting to the training domain and may be difficult to adapt to new domains with limited development data. A successful solution, used in the recent NIST 2017 language recognition evaluation, consists of training the embedding extractor on out-of-domain data and applying a back-end classifier adapted to the target domain. In this paper, we compare the embedding+back-end approach with end-to-end evaluation of the neural network to obtain language log-likelihoods. With careful adaptation of the networks, we show that the end-to-end approach improves detection cost by 6% relative to the best embedding system. We compared two embedding architectures. First, we evaluated embeddings that use a temporal mean+stddev pooling layer to capture long-term sequence information (a.k.a. x-vectors). Second, we present a novel probabilistic embedding framework in which the embedding is a hidden variable. The network predicts a Gaussian posterior distribution for the embedding given each feature frame. The frame-level posteriors can then be combined in a principled way to obtain sequence-level posteriors, which yields an uncertainty measure about the embedding value. Language scores are obtained by integrating over the embedding posterior distribution. In our experiments, x-vectors outperformed probabilistic embeddings in embedding+back-end systems, but both attained comparable results in end-to-end systems.
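
To make the pooling operation concrete, here is a minimal NumPy sketch of temporal mean+stddev pooling as used in x-vector style architectures. Function and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mean_stddev_pooling(frame_feats):
    """Pool a variable-length sequence of frame-level activations,
    shape (T, D), into a fixed 2*D-dimensional utterance vector."""
    mu = frame_feats.mean(axis=0)       # per-dimension temporal mean
    sigma = frame_feats.std(axis=0)     # per-dimension temporal stddev
    return np.concatenate([mu, sigma])  # fixed-size statistics vector

# Example: a 300-frame utterance with 512-dimensional activations
pooled = mean_stddev_pooling(np.random.randn(300, 512))
assert pooled.shape == (1024,)
```

Because the statistics are computed over the time axis, the output dimension is independent of the utterance length T, which is what allows a fixed-dimensional back-end classifier to follow.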
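One standard, principled way to combine independent per-frame Gaussian posteriors into a sequence-level posterior is a precision-weighted product, assuming diagonal covariances and conditionally independent frames. The sketch below illustrates that idea; it is an assumption about the combination rule, not necessarily the paper's exact formulation.

```python
import numpy as np

def combine_frame_posteriors(means, variances):
    """Combine per-frame Gaussian posteriors N(mean_t, var_t), given as
    (T, D) arrays, into one sequence-level Gaussian via a
    precision-weighted product (diagonal covariances assumed)."""
    prec = 1.0 / variances         # per-frame precisions
    seq_prec = prec.sum(axis=0)    # precisions add under the product
    seq_var = 1.0 / seq_prec       # variance shrinks with more frames
    seq_mean = seq_var * (prec * means).sum(axis=0)
    return seq_mean, seq_var
```

Note that the sequence-level variance shrinks as more frames are observed, which gives the uncertainty measure about the embedding value mentioned in the abstract.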
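Finally, to illustrate how language scores can be obtained by integrating over the embedding posterior, consider a hypothetical Gaussian class-conditional back-end (an assumption for this sketch, not confirmed by the abstract). With embedding posterior q(z | x_{1:T}) = N(z; m, S) and class model p(z | l) = N(z; mu_l, Sigma), the marginal has a closed form via the Gaussian convolution identity:

```latex
\[
  p(x_{1:T} \mid \ell) \propto \int \mathcal{N}(z;\, \mu_\ell, \Sigma)\,
  \mathcal{N}(z;\, m, S)\, dz
  = \mathcal{N}(m;\, \mu_\ell,\, \Sigma + S)
\]
```

Under this assumption, the embedding uncertainty S simply inflates the model covariance, so less reliable (e.g., shorter) utterances are scored with appropriately broader distributions.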