Multitask Learning with Fused Attention for Improved ASR and Mispronunciation Detection in Children's Speech Sound Disorders
Selina S. Sung, Seunghee Ha, Tae-Jin Yoon, Jungmin So
This study proposes a multitask learning framework with fused attention to enhance automatic speech recognition (ASR) for pronunciation-based transcription and mispronunciation detection (MPD) in speech sound disorders (SSD). Our approach leverages multitask learning by carrying out the ASR and classification tasks concurrently. To further improve performance, we propose a fused attention mechanism that refines hidden states by weighting features relevant to mispronunciations. The classification head and the attention mechanism work synergistically, jointly optimizing transcription and detection performance. Evaluated on a Korean children's SSD dataset, our approach outperforms the baseline, achieving a lower Character Error Rate (CER) and higher Unweighted Average Recall (UAR), demonstrating the effectiveness of multitask learning with fused attention for mispronunciation detection.
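To make the described architecture concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes a shared encoder producing frame-level hidden states, an additive attention module that re-weights those states before they feed a CTC head (transcription) and a classification head (mispronunciation detection), with a weighted sum of the two losses for joint optimization. All module names, shapes, and the loss weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusedAttention(nn.Module):
    """Scores each frame and re-weights hidden states (assumed single-head additive attention)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, h):                        # h: (batch, time, dim)
        w = torch.softmax(self.score(h), dim=1)  # frame-level attention weights
        refined = h * w                          # refined hidden states for the ASR head
        pooled = refined.sum(dim=1)              # utterance-level vector for the classifier
        return refined, pooled

class MultitaskSSDModel(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int, vocab_size: int, num_classes: int = 2):
        super().__init__()
        self.encoder = encoder                        # assumed to map input to (batch, time, dim)
        self.attn = FusedAttention(dim)
        self.ctc_head = nn.Linear(dim, vocab_size)    # ASR (CTC) head over characters
        self.cls_head = nn.Linear(dim, num_classes)   # mispronunciation detection head

    def forward(self, feats):
        h = self.encoder(feats)                       # (batch, time, dim)
        refined, pooled = self.attn(h)
        return self.ctc_head(refined).log_softmax(-1), self.cls_head(pooled)

def multitask_loss(log_probs, targets, input_lens, target_lens, cls_logits, cls_labels, alpha=0.5):
    """Joint objective: CTC for transcription plus cross-entropy for detection.
    The weight alpha is a placeholder, not a value reported in the paper."""
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)(
        log_probs.transpose(0, 1), targets, input_lens, target_lens)  # CTC expects (T, N, C)
    ce = nn.functional.cross_entropy(cls_logits, cls_labels)
    return ctc + alpha * ce
```

As a usage sketch, `encoder` could be any module returning frame-level features, e.g. `nn.Sequential(nn.Linear(80, 256), nn.ReLU())` over 80-dimensional filterbank frames; in practice a pretrained speech encoder would take this place.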