ISCA Archive Interspeech 2022
ISCA Archive Interspeech 2022

Generalized Keyword Spotting using ASR embeddings

Kirandevraj R, Vinod Kumar Kurmi, Vinay Namboodiri, C V Jawahar

Keyword Spotting (KWS) detects a set of pre-defined spoken keywords. Building a KWS system for an arbitrary set requires massive training datasets. We propose to use the text transcripts from an Automatic Speech Recognition (ASR) system alongside triplets for KWS training. The intermediate representation from the ASR system trained on a speech corpus is used as acoustic word embeddings for keywords. Triplet loss is added to the Connectionist Temporal Classification (CTC) loss in the ASR while training. This method achieves an Average Precision (AP) of 0.843 over 344 words unseen by the model trained on the TIMIT dataset. In contrast, the Multi-View recurrent method that learns jointly on the text and acoustic embeddings achieves only 0.218 for out-of-vocabulary words. This method is also applied to low-resource languages such as Tamil by converting Tamil characters to English using transliteration. This is a very challenging novel task for which we provide a dataset of transcripts for the keywords. Despite our model not generalizing well, we achieve a benchmark AP of 0.321 on over 38 words unseen by the model on the MSWC Tamil keyword set. The model also produces an accuracy of 96.2% for classification tasks on the Google Speech Commands dataset.