ISCA Archive Interspeech 2024

Hierarchical Distribution Adaptation for Unsupervised Cross-corpus Speech Emotion Recognition

Cheng Lu, Yuan Zong, Yan Zhao, Hailun Lian, Tianhua Qi, Björn Schuller, Wenming Zheng

The primary challenge in unsupervised cross-corpus speech emotion recognition (SER) is that the domain shift between training and testing data undermines the SER model's ability to generalize to unseen test corpora. In this paper, we propose a straightforward and effective strategy, called Hierarchical Distribution Adaptation (HDA), to address this domain bias. HDA uses a hierarchical emotion representation module based on nested Transformers to extract speech emotion features at different levels (frame, segment, and utterance), capturing multi-scale emotion correlations in speech. Furthermore, a hierarchical distribution adaptation module, comprising frame-level distribution adaptation (FDA), segment-level distribution adaptation (SDA), and utterance-level distribution adaptation (UDA), is developed to align the hierarchical emotion representations of the training and testing speech samples and thereby effectively eliminate the domain discrepancy. Extensive experimental results demonstrate the superiority of the proposed HDA over other state-of-the-art (SOTA) methods.
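To make the idea of aligning distributions at three levels concrete, here is a minimal sketch of a combined FDA + SDA + UDA objective. The abstract does not state which discrepancy measure HDA uses, so this sketch assumes a Gaussian-kernel maximum mean discrepancy (MMD) at each level. It also replaces the nested Transformers with simple mean pooling over fixed-length frame windows (segments) and over whole utterances. All names (`hierarchical_da_loss`, `seg_len`, `weights`, `sigma`) are hypothetical illustrations, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def gaussian_kernel(x: Sequence[float], y: Sequence[float], sigma: float) -> float:
    # RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(xs: List[Vector], ys: List[Vector], sigma: float = 1.0) -> float:
    # Biased estimate of squared MMD between two sets of feature vectors
    # (assumed discrepancy measure; the abstract does not specify one).
    def mean_k(a: List[Vector], b: List[Vector]) -> float:
        total = sum(gaussian_kernel(p, q, sigma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def mean_pool(vectors: List[Vector]) -> Vector:
    # Average a non-empty list of equal-length vectors dimension-wise.
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def segments(frames: List[Vector], seg_len: int) -> List[Vector]:
    # Hypothetical stand-in for segment-level features: mean-pool
    # non-overlapping windows of seg_len frames (last window may be shorter).
    return [mean_pool(frames[i:i + seg_len]) for i in range(0, len(frames), seg_len)]


def hierarchical_da_loss(
    src_utts: List[List[Vector]],
    tgt_utts: List[List[Vector]],
    seg_len: int = 2,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
    sigma: float = 1.0,
) -> float:
    # Each utterance is a list of frame-level feature vectors.
    frames_s = [f for u in src_utts for f in u]
    frames_t = [f for u in tgt_utts for f in u]
    segs_s = [s for u in src_utts for s in segments(u, seg_len)]
    segs_t = [s for u in tgt_utts for s in segments(u, seg_len)]
    utts_s = [mean_pool(u) for u in src_utts]
    utts_t = [mean_pool(u) for u in tgt_utts]
    w_f, w_s, w_u = weights
    return (
        w_f * mmd2(frames_s, frames_t, sigma)   # frame-level term (FDA)
        + w_s * mmd2(segs_s, segs_t, sigma)     # segment-level term (SDA)
        + w_u * mmd2(utts_s, utts_t, sigma)     # utterance-level term (UDA)
    )


# Usage: toy 2-D "features"; the target corpus is the source shifted by +3.
src = [
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]],
    [[0.2, 0.0], [0.0, 0.2]],
]
tgt = [[[a + 3.0, b + 3.0] for a, b in utt] for utt in src]
loss_shifted = hierarchical_da_loss(src, tgt)
loss_same = hierarchical_da_loss(src, src)
```

In a training loop, a term like this would be added to the emotion classification loss on labeled source data, so that minimizing it pulls unlabeled target features toward the source distribution at every level at once; the per-level weights are an assumed knob for trading off fine-grained (frame) against global (utterance) alignment.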