ISCA Archive Interspeech 2025

Fully End-to-end Streaming Open-vocabulary Keyword Spotting with W-CTC Forced Alignment

Dohyun Kim, Jiwook Hwang

In open-vocabulary keyword spotting, an acoustic encoder pre-trained with the Connectionist Temporal Classification (CTC) loss is typically used to train a text encoder by aligning the audio embedding space with the text embedding space. In previous work, word-aligned datasets were created with forced alignment tools such as the Montreal Forced Aligner (MFA) in order to train the text encoder and verifier models. In this paper, we propose a new training pipeline for open-vocabulary keyword spotting that uses the W-CTC forced alignment algorithm, a simple modification of the practical CTC algorithm. Our approach eliminates the need to create word-aligned datasets, operates in a fully end-to-end manner, and demonstrates superior performance on the LibriPhrase hard dataset.
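For readers unfamiliar with CTC-based forced alignment, the following is a minimal sketch of standard CTC Viterbi alignment, which assigns each audio frame either a blank or one of the target tokens. It is not the W-CTC variant proposed in the paper, whose modification is not described in this abstract. The function name `ctc_forced_align`, the blank index `0`, and the input layout (a list of per-frame log-probability lists) are assumptions made for illustration only.

```python
import math


def ctc_forced_align(log_probs, targets, blank=0):
    """Best frame-level path for `targets` under standard CTC (Viterbi).

    log_probs: list of T frames, each a list of per-token log-probabilities.
    targets:   non-empty list of token ids, without blanks.
    Returns a list of T token ids (blank or a target token).
    Hypothetical helper: this is plain CTC, not the paper's W-CTC.
    """
    T = len(log_probs)
    # Extended label sequence: blank, t1, blank, t2, ..., blank.
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    S = len(ext)

    neg_inf = -math.inf
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]

    # A path may start on the leading blank or on the first token.
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one state, or skip
            # the blank between two different non-blank tokens.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == neg_inf:
                continue  # state s is unreachable at frame t
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best

    # A valid path ends on the final blank or on the final token.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == neg_inf:
        raise ValueError("not enough frames to align the target sequence")

    # Follow back-pointers from the last frame to the first.
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    path.reverse()
    return path
```

Word boundaries can then be read off the returned path as the frame ranges covered by each word's tokens, which is the kind of information that pipelines based on an external aligner such as MFA obtain in a separate preprocessing step.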