ISCA Archive Interspeech 2021

A Prototypical Network Approach for Evaluating Generated Emotional Speech

Alice Baird, Silvan Mertes, Manuel Milling, Lukas Stappen, Thomas Wiest, Elisabeth André, Björn W. Schuller

The collection of emotional speech data is a time-consuming and costly endeavour. Generative networks can be applied to artificially augment the limited audio data. However, it is challenging to evaluate generated audio for its similarity to source data, as current quantitative metrics are not necessarily suited to the audio domain. With this in mind, we explore the use of a prototypical network to evaluate four classes of generated emotional audio. We first extract spectrogram images from WaveGAN-generated audio and from audio produced by other augmentation approaches, and compare their similarity to the class prototype and their diversity within the embedding space. Furthermore, we augment the source training set with each augmentation type and perform a classification to explore the plausibility of the generated audio. Results suggest that quality and diversity can be quantitatively observed with this approach. In the chosen context, we see that WaveGAN-generated data is recognisable as a source data class (F1-score 43.6%), and that the samples add diversity similar to that of unseen source data. This result leads to more plausible data for augmentation of the source training set, achieving up to 63.9% F1, which is a 3.5% improvement over the source data baseline.
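The comparison against a class prototype and the diversity measure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that embeddings are already available as plain vectors, that a prototype is the mean embedding of a class, that similarity is Euclidean distance, and that diversity is the mean pairwise distance between samples. All function names, labels, and toy values are hypothetical.

```python
import math
from itertools import combinations


def prototype(embeddings):
    """Class prototype, assumed here to be the element-wise mean embedding."""
    n = len(embeddings)
    return [sum(dim) / n for dim in zip(*embeddings)]


def euclidean(a, b):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_class(embedding, prototypes):
    """Label of the prototype closest to the given embedding."""
    return min(prototypes, key=lambda label: euclidean(embedding, prototypes[label]))


def mean_pairwise_distance(embeddings):
    """Diversity proxy (an assumption): mean distance over all sample pairs."""
    pairs = list(combinations(embeddings, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


# Toy 2-D embeddings standing in for the output of an embedding network.
source = {
    "angry": [[0.0, 0.0], [2.0, 0.0]],
    "happy": [[10.0, 10.0], [12.0, 10.0]],
}
prototypes = {label: prototype(vecs) for label, vecs in source.items()}

# Hypothetical embeddings of generated samples intended to be "angry".
generated = [[1.0, 1.0], [1.0, -1.0]]
predicted = [nearest_class(g, prototypes) for g in generated]
diversity = mean_pairwise_distance(generated)

print(predicted, diversity)
```

In this toy setting, a generated sample counts as recognisable when its nearest prototype matches its intended class, and the diversity of a generated set can be compared with the same statistic computed on held-out source samples.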