ISCA Archive IberSPEECH 2018

On the use of Phone-based Embeddings for Language Recognition

Christian Salamea, Ricardo de Córdoba, Luis Fernando D'Haro, Rubén San-Segundo, Javier Ferreiros

Language Identification (LID) is the process of automatically identifying the language of a given spoken utterance. We focus on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR); instead of phonemes, however, we use "phone-grams", phonetic units that contain context information. In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as inputs to a classical i-Vector framework to train a multi-class logistic classifier. These NEs incorporate information from the neighboring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both the Skip-Gram and GloVe models. Experiments have been carried out on the KALAKA-3 database, and we have used Cavg as the metric to compare the systems. As a baseline we take the Cavg obtained using the NEs as features in the LID task, 24.69%. Our strategy of incorporating information from the neighboring phone-grams to define the final sequences yields up to a 24.3% relative improvement over the baseline using the Skip-Gram model and up to 32.4% using the GloVe model. Finally, fusing our best system with an MFCC-based acoustic i-Vector system provides up to a 34.1% improvement.
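To make the notion of phone-grams and their neighboring context more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes that a phone-gram is simply a run of n consecutive phones joined into one unit, and that Skip-Gram training examples are (target, context) pairs taken from a symmetric window. The function names `phone_grams` and `skipgram_pairs`, the `_` separator, the window sizes, and the toy phone sequence are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def phone_grams(phones: List[str], n: int = 2) -> List[str]:
    """Join each run of n consecutive phones into one context-bearing unit.

    Assumption: a phone-gram is a plain n-gram over the recognizer output.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    return ["_".join(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def skipgram_pairs(units: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """List the (target, context) pairs within a symmetric window.

    These pairs are the kind of training examples a Skip-Gram model consumes.
    """
    pairs = []
    for i, target in enumerate(units):
        lo = max(0, i - window)
        hi = min(len(units), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a unit is never its own context
                pairs.append((target, units[j]))
    return pairs


# Toy phone sequence standing in for the output of a phone recognizer.
phones = ["o", "l", "a", "m", "u"]
units = phone_grams(phones, n=2)
pairs = skipgram_pairs(units, window=1)
```

In a full system, pairs like these would be used to train the embedding model, and the resulting vectors would replace the raw phone-gram symbols as features; that stage is outside the scope of this sketch.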

doi: 10.21437/IberSPEECH.2018-12

Cite as: Salamea, C., de Córdoba, R., D'Haro, L.F., San-Segundo, R., Ferreiros, J. (2018) On the use of Phone-based Embeddings for Language Recognition. Proc. IberSPEECH 2018, 55-59, doi: 10.21437/IberSPEECH.2018-12

  author={Christian Salamea and Ricardo {de Córdoba} and Luis Fernando D'Haro and Rubén San-Segundo and Javier Ferreiros},
  title={{On the use of Phone-based Embeddings for Language Recognition}},
  booktitle={Proc. IberSPEECH 2018},