ISCA Archive Interspeech 2023

Exploiting Cross-Domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

Shujie Hu, Xurong Xie, Mengzhe Geng, Mingyu Cui, Jiajun Deng, Guinan Li, Tianzi Wang, Helen Meng, Xunying Liu

Articulatory features (AFs) are inherently invariant to acoustic signal distortion. However, their practical application to atypical domains such as elderly and disordered speech across languages is limited by data scarcity. This paper presents a cross-domain and cross-lingual Acoustic-to-Articulatory (A2A) inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model training. The A2A model is then adapted to three datasets to produce UTI-based AFs: the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora, and the English TORGO dysarthric speech data. Experiments suggest that systems incorporating the generated AFs consistently outperform the baseline TDNN/Conformer ASR systems using acoustic features only, with statistically significant word/character error rate reductions of up to 4.75%, 2.59% and 2.07% absolute (14.69%, 10.64% and 22.72% relative) after data augmentation, speaker adaptation and cross-system multi-pass decoding.
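The absolute and relative error rate reductions quoted above are linked by simple arithmetic, which the following minimal Python sketch illustrates. The function names are hypothetical, and the baseline error rates it prints are not reported in the abstract: they are only what the quoted absolute/relative pairs imply, so they should be read as a consistency check and not as results from the paper.

```python
# Sketch: how absolute and relative error rate reductions relate.
# Assumption: relative reduction = absolute reduction / baseline error rate.
# The baselines computed here are implied by the quoted figures, not reported.

def implied_baseline(abs_red: float, rel_red: float) -> float:
    """Baseline error rate (%) implied by an absolute and a relative reduction (both in %)."""
    return 100.0 * abs_red / rel_red


def relative_reduction(baseline: float, abs_red: float) -> float:
    """Relative reduction (%) given a baseline error rate and an absolute reduction (both in %)."""
    return 100.0 * abs_red / baseline


# (absolute %, relative %) pairs in the order they are quoted in the abstract.
REPORTED = [(4.75, 14.69), (2.59, 10.64), (2.07, 22.72)]

for abs_red, rel_red in REPORTED:
    base = implied_baseline(abs_red, rel_red)
    print(f"abs {abs_red:.2f}% / rel {rel_red:.2f}% -> implied baseline {base:.2f}%, "
          f"implied improved rate {base - abs_red:.2f}%")
```

A small absolute reduction can correspond to a large relative one when the baseline error rate is already low, which is why the third pair has the smallest absolute but the largest relative figure.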