ISCA Archive Interspeech 2024

Towards Speech Classification from Acoustic and Vocal Tract data in Real-time MRI

Yaoyao Yue, Michael Proctor, Luping Zhou, Rijul Gupta, Tharinda Piyadasa, Amelia Gully, Kirrie Ballard, Craig Jin

Real-time magnetic resonance imaging (rtMRI) data of the upper airway provides a rich source of information about vocal tract shaping that can inform phonemic analysis and classification. We describe a multimodal phonemic classifier that combines articulatory data with speech audio features to improve performance. A deep network processes rtMRI video with ResNet18 and speech audio with a custom CNN, then combines the two data streams in a Transformer layer, whose multi-head self-attention exploits the correlation between the streams for better vowel-consonant-vowel (VCV) classification. The classification accuracies of both the unimodal and multimodal models show substantial improvement over previous work (> 38%). Adding audio features improves the classification accuracy of the multimodal model by 7% compared with the unimodal model using articulatory data alone. We analyze the model and discuss the phonetic implications.
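The fusion step described above can be illustrated with a minimal numpy sketch: embeddings from the two encoders are concatenated along the sequence axis and passed through scaled dot-product self-attention so that articulatory and acoustic tokens can attend to each other before classification. All shapes, dimensions, and the class count below are hypothetical placeholders, not the paper's actual configuration, and the single-head attention stands in for the Transformer layer's multi-head mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
d_model = 64  # hypothetical shared embedding width after projection

# Stand-ins for per-utterance encoder outputs (random here):
# 10 rtMRI frame embeddings (ResNet18 branch) and 20 audio frame
# embeddings (custom CNN branch), both projected to d_model.
video_feats = rng.standard_normal((10, d_model))
audio_feats = rng.standard_normal((20, d_model))

# Fuse by concatenating along the sequence axis so attention can
# relate articulatory and acoustic tokens to one another.
tokens = np.concatenate([video_feats, audio_feats], axis=0)
wq, wk, wv = (rng.standard_normal((d_model, d_model)) * d_model**-0.5
              for _ in range(3))
fused = self_attention(tokens, wq, wk, wv)

# Mean-pool the fused sequence and apply a linear head over a
# hypothetical set of 18 VCV classes.
n_classes = 18
w_out = rng.standard_normal((d_model, n_classes))
logits = fused.mean(axis=0) @ w_out
print(logits.shape)  # (18,)
```

In this sketch the attention weights are computed over the joint sequence, so each video token can draw on audio tokens and vice versa, which is the intuition behind fusing the streams with self-attention rather than simple feature concatenation.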