ISCA Archive Interspeech 2024

Highly Intelligible Speaker-Independent Articulatory Synthesis

Charles McGhee, Kate Knill, Mark Gales

An articulatory synthesiser that could accurately map vocal tract features to speech would enable novel evaluation of acoustic-to-articulatory inversion models beyond the small, typically monolingual, articulatory datasets available. However, current deep articulatory synthesisers and physical simulation-based synthesisers struggle to produce consistently intelligible speech, with Word Error Rates (WER) of around 20% for real or hand-crafted articulatory input. Additionally, deep learning methods have often achieved this level of intelligibility only when training and evaluating on the same speaker (speaker-dependent training). In this paper, we create a highly intelligible (WER of 7% for real data and 10% for synthetic data), speaker-independent articulatory synthesiser by training a deep synthesiser on a combination of high-quality real data and synthetic data generated by inversion. We then perform a multilingual evaluation of the joint inversion-synthesis system.
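A minimal sketch of the intelligibility metric used throughout the abstract: synthesised utterances are transcribed with an ASR system and scored by WER against the reference text. The specific ASR model (Whisper) and scoring library (jiwer) below are assumptions for illustration; the abstract does not state which recogniser the authors used.

```python
# Hypothetical sketch (not the authors' code): estimate intelligibility of
# articulatory-synthesiser output by transcribing it with an off-the-shelf
# ASR model and computing WER against the reference transcripts.
import whisper          # openai-whisper (assumed ASR; paper's recogniser unspecified)
from jiwer import wer   # standard word-error-rate computation

def intelligibility_wer(wav_paths, reference_texts, model_name="base.en"):
    """Return corpus-level WER of synthesised utterances against references."""
    asr = whisper.load_model(model_name)
    hypotheses = [asr.transcribe(path)["text"] for path in wav_paths]
    return wer(reference_texts, hypotheses)

# Example usage (illustrative file names):
# score = intelligibility_wer(["utt1.wav", "utt2.wav"],
#                             ["the cat sat", "hello world"])
# print(f"WER: {score:.1%}")
```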