ISCA Archive Interspeech 2024

Keep, Delete, or Substitute: Frame Selection Strategy for Noise-Robust Speech Emotion Recognition

Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso

A speech emotion recognition (SER) system can use a speech enhancement (SE) model to increase its noise robustness by suppressing background noise. However, SE can also suppress emotionally discriminative features, which degrades emotion prediction. We propose an alternative framework, Keep or Delete (KoD), that preserves the information in the original speech while minimizing the influence of background noise. We train a frame reliability predictor that identifies clean frames to keep and discards the noisy ones. We extend this framework by replacing the dropped frames with the corresponding frames extracted from the enhanced speech, which preserves the lexical information. We refer to this implementation as Keep or Substitute (KoS). Our experiments show that the KoD model improves SER results under noisy conditions without fine-tuning the whole model. Moreover, the KoS framework performs better than enhancing all the frames, indicating the importance of avoiding speech distortion.
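The two selection strategies can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each frame is a feature vector, that the reliability predictor has already produced one score per frame, and that a frame counts as clean when its score reaches a threshold. The function names, the 0.5 threshold, and the toy values are all hypothetical.

```python
from typing import List, Sequence

Frame = List[float]  # one frame-level feature vector (assumed representation)


def _check_lengths(*sequences: Sequence) -> None:
    """Raise if the per-frame sequences do not all have the same length."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("all inputs must contain one entry per frame")


def keep_or_delete(
    noisy_frames: Sequence[Frame],
    reliability: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Frame]:
    """KoD: keep frames judged reliable and drop the remaining ones."""
    _check_lengths(noisy_frames, reliability)
    return [f for f, r in zip(noisy_frames, reliability) if r >= threshold]


def keep_or_substitute(
    noisy_frames: Sequence[Frame],
    enhanced_frames: Sequence[Frame],
    reliability: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Frame]:
    """KoS: keep reliable frames and swap unreliable ones for enhanced frames."""
    _check_lengths(noisy_frames, enhanced_frames, reliability)
    return [
        n if r >= threshold else e
        for n, e, r in zip(noisy_frames, enhanced_frames, reliability)
    ]


# Toy example: three one-dimensional frames, the middle one judged noisy.
noisy = [[1.0], [2.0], [3.0]]
enhanced = [[10.0], [20.0], [30.0]]
scores = [0.9, 0.1, 0.7]

kod_frames = keep_or_delete(noisy, scores)
kos_frames = keep_or_substitute(noisy, enhanced, scores)
```

In this sketch KoD shortens the sequence, whereas KoS preserves its length, which is how the substituted frames can retain the lexical content that would otherwise be lost with the dropped frames.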