ISCA Archive SSW 2023

Data Augmentation Methods on Ultrasound Tongue Images for Articulation-to-Speech Synthesis

Ibrahim Ibrahimov, Gabor Gosztolya, Tamas Gabor Csapo

Articulation-to-Speech Synthesis (ATS) focuses on converting articulatory biosignal information into audible speech, nowadays mostly using DNNs, with a future target application of a Silent Speech Interface. Ultrasound Tongue Imaging (UTI) is an affordable and non-invasive technique that has become popular for collecting articulatory data. Data augmentation has been shown to improve the generalization ability of DNNs, e.g. to avoid overfitting, introduce variations into the existing dataset, or make the network more robust against various noise types on the input data. In this paper, we compare six different data augmentation methods on the UltraSuite-TaL corpus during UTI-based ATS using CNNs. Validation mean squared error is used to evaluate the performance of the CNNs, while the performance of direct ATS is measured on the synthesized speech samples using MCD and PESQ scores. Although we did not find large differences in the outcome of the various data augmentation techniques, the results of this study suggest that while applying data augmentation techniques on UTI poses some challenges due to the unique nature of the data, it provides benefits in terms of enhancing the robustness of neural networks. In general, articulatory control might be beneficial in TTS as well.
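
As an illustration of the MCD score mentioned above, the following is a minimal sketch of the commonly used mel-cepstral distortion formula, averaged over frames. It is not taken from the paper: the function name `mel_cepstral_distortion`, the `skip_c0` flag, and the assumption that the reference and synthesized frames are already time-aligned lists of mel-cepstral coefficients are all choices made for this example.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames, skip_c0=True):
    """Average mel-cepstral distortion (dB) over time-aligned frames.

    ref_frames, syn_frames: sequences of per-frame mel-cepstral
    coefficient lists of equal length (hypothetical input layout).
    skip_c0: drop the 0th (energy) coefficient, a common convention
    that is assumed here rather than taken from the paper.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need the same non-zero number of frames")

    start = 1 if skip_c0 else 0
    scale = 10.0 / math.log(10.0)  # converts the distance to dB

    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance between the two coefficient vectors
        squared = sum((r - s) ** 2 for r, s in zip(ref[start:], syn[start:]))
        total += scale * math.sqrt(2.0 * squared)

    return total / len(ref_frames)
```

Lower values indicate that the synthesized speech is spectrally closer to the reference; identical inputs give 0 dB. In practice the two utterances usually need to be aligned (for example with dynamic time warping) before a frame-wise comparison like this is meaningful.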