ISCA Archive Interspeech 2023
ISCA Archive Interspeech 2023

MetricAug: A Distortion Metric-Lead Augmentation Strategy for Training Noise-Robust Speech Emotion Recognizer

Ya-Tse Wu, Chi-Chun Lee

Noise-robust speech emotion recognition (SER) systems are important in real world applications. Conventionally, noise robustness is achieved by training on a noise-augmented dataset. In this work, instead of pre-defining noise SNRs to augment the clean set, we propose an augment-while-train strategy while referencing speech distortion metric. This strategy (MetricAug) constructs an augmented set per each training epoch by assessing the effect of different distortion levels have on degrading the SER performances. That is, we augment more of those noisy data that degrade the SER performance the most dynamically at each learning epoch. We evaluate our framework on two databases, MSP-Podcast and MELD. Our framework shows consistent robustness against varying levels and even unseen noise types. Further analysis reveals that by choosing STOI as the metric of noise distortion, it leads the construction of augmented sets better than metrics of PESQ and fwSNRseg.