ISCA Archive Interspeech 2022

Glottal inverse filtering based on articulatory synthesis and deep learning

Ingo Langheinrich, Simon Stone, Xinyu Zhang, Peter Birkholz

We propose a new method to estimate the glottal vocal tract excitation from speech signals based on deep learning. To that end, a bidirectional recurrent neural network with long short-term memory units was trained to predict the glottal airflow derivative from the speech signal. Since natural reference data for this task are unobtainable at the required scale, we used the articulatory speech synthesizer VocalTractLab to generate a large dataset of synchronous connected speech and glottal airflow signals for training. The trained model was objectively evaluated on stationary synthetic signals from the OPENGLOT glottal inverse filtering benchmark dataset and on our own dataset of connected synthetic speech. Compared to the state of the art, the proposed model produced more accurate estimates on OPENGLOT's physically synthesized signals but was less accurate on its computationally simulated signals. However, our model was substantially more accurate and plausible on the connected speech signals, especially for sounds with mixed excitation (e.g., fricatives) or sounds with pronounced zeros in their transfer function (e.g., nasals). Future work will introduce more variety into the training data (e.g., regarding pitch and phonation) and focus on estimating features of the glottal flow instead of the entire waveform.
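For context, the conventional baseline that such neural approaches are compared against is linear-prediction-based inverse filtering: an all-pole model of the vocal tract is estimated from the speech signal, and its inverse (an all-zero filter) is applied to recover an approximation of the glottal flow derivative. The sketch below, using NumPy, illustrates this classical pipeline only; it is not the authors' neural method, and the function names and the LPC order are illustrative assumptions.

```python
import numpy as np

def lpc(signal, order):
    # Autocorrelation method: solve the Toeplitz normal equations
    # for the linear-prediction coefficients of an all-pole model.
    n = len(signal)
    r = np.array([signal[: n - k] @ signal[k:] for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])
    # Return the inverse-filter polynomial A(z) = 1 - sum_k a_k z^{-k}.
    return np.concatenate(([1.0], -a))

def inverse_filter(speech, order=12):
    # Applying the all-zero filter A(z) removes the estimated vocal tract
    # resonances; the residual approximates the excitation (for real
    # speech, roughly the glottal flow derivative, since lip radiation
    # acts as a differentiator). Edge transients are kept for simplicity.
    a = lpc(speech, order)
    return np.convolve(speech, a)[: len(speech)]

# Toy demonstration on a synthetic AR(2) "speech" signal excited by an
# impulse train (a crude stand-in for voiced excitation).
excitation = np.zeros(800)
excitation[::80] = 1.0
speech = np.zeros(800)
for i in range(800):
    speech[i] = excitation[i]
    if i >= 1:
        speech[i] += 1.3 * speech[i - 1]
    if i >= 2:
        speech[i] -= 0.8 * speech[i - 2]

residual = inverse_filter(speech, order=2)
```

Note that this baseline breaks down exactly where the abstract reports the largest gains for the neural model: an all-pole fit cannot represent the spectral zeros of nasals, and the residual of mixed-excitation sounds such as fricatives confounds glottal and noise sources.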