ISCA Archive Interspeech 2024

Optimizing Large-Scale Context Retrieval for End-to-End ASR

Zhiqi Huang, Diamantino Caseiro, Kandarp Joshi, Christopher Li, Pat Rondon, Zelin Wu, Petr Zadrazil, Lillian Zhou

Contextual Automatic Speech Recognition (ASR) requires scalable and accurate retrieval of content relevant to the user’s context. This paper presents a comparative study of two independent context retrieval methods: sequence-level and segment-level scoring. Evaluated on datasets with up to 100k phrases, both methods achieve high retrieval recall; notably, segment-level scoring reaches 75.6% recall over 100k entities. When each method is further integrated with ASR through joint training, significant improvements over non-biased ASR are observed, with WER reductions of up to 36% with 2k entities and 28% with 100k entities. This comparative analysis offers guidance for selecting a context retrieval technique that delivers scalable and accurate performance in contextual ASR applications.
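
To make the retrieval recall metric concrete, the following is a minimal sketch of how such a figure could be computed. It assumes recall is the fraction of reference phrases that appear among the retrieved candidates, pooled over utterances; the abstract does not specify the exact protocol. The function name `retrieval_recall` and the toy phrases are hypothetical and are not taken from the paper.

```python
def retrieval_recall(retrieved, relevant):
    """Pooled recall: share of reference phrases found among retrieved candidates.

    retrieved: one list of candidate phrases per utterance (retriever output).
    relevant:  one list of reference phrases per utterance (ground truth).
    Assumption: exact string match counts as a hit.
    """
    hits = 0
    total = 0
    for candidates, references in zip(retrieved, relevant):
        candidate_set = set(candidates)
        hits += sum(1 for phrase in references if phrase in candidate_set)
        total += len(references)
    # Guard against an empty reference set.
    return hits / total if total else 0.0


# Toy data (hypothetical): three utterances, one reference phrase each.
retrieved = [["call anna", "play jazz"], ["text bob"], ["open maps", "call ann"]]
relevant = [["call anna"], ["text rob"], ["open maps"]]

recall = retrieval_recall(retrieved, relevant)
print(f"retrieval recall: {recall:.3f}")
```

In this toy case two of the three reference phrases are retrieved. The same computation applies unchanged when the candidate pool grows to 100k phrases, since only the retriever's output list changes.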