Recent advances in real-time magnetic resonance imaging (rtMRI) of
the vocal tract provide opportunities for studying human speech. This
modality together with acquired speech may enable the mapping of articulatory
configurations to acoustic features. In this study, we take the first
step by training a deep learning model to classify 27 different phonemes
from midsagittal MR images of the vocal tract.
An American English
database was used to train a convolutional neural network for classifying
vowels (13 classes), consonants (14 classes) and all phonemes (27 classes)
from 17 subjects. The top-1 classification accuracy on the test set for all
phonemes was 57%. Error analysis showed that voiced and unvoiced sounds
were often confused. Moreover, we performed principal component analysis
on the network’s embedding and observed topological similarities
between the network's learned representation and the vowel diagram. Saliency
maps gave insight into the anatomical regions most important for classification
and showed congruence with known regions of articulatory importance.
We demonstrate the feasibility of using deep learning to distinguish
between phonemes from MRI. Network analysis can be used to improve
understanding of normal articulation and speech and, in the future,
impaired speech. This study brings us a step closer to articulatory-to-acoustic
mapping from rtMRI.
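
The two quantitative analyses mentioned above, top-1 accuracy and principal component analysis of the network embedding, can be illustrated with a minimal sketch. This is not the study's implementation: the function names, the 64-dimensional embedding size, and the random placeholder data are assumptions made only for illustration.

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the true label."""
    return float(np.mean(np.argmax(logits, axis=1) == labels))

def pca_project(embeddings, n_components=2):
    """Project embeddings onto their leading principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Placeholder data (hypothetical): 100 images, 27 phoneme classes,
# and an assumed 64-dimensional embedding per image.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 27))
labels = rng.integers(0, 27, size=100)
embeddings = rng.normal(size=(100, 64))

acc = top1_accuracy(logits, labels)
coords = pca_project(embeddings)  # shape (100, 2), ready for a scatter plot
```

With real data, plotting `coords` for the vowel classes would give a 2D layout of the kind that can be compared against the vowel diagram.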