ISCA Archive Interspeech 2023

Exploring Downstream Transfer of Self-Supervised Features for Speech Emotion Recognition

Yuanbo Fang, Xiaofen Xing, Xiangmin Xu, Weibin Zhang

Huge progress has been made in self-supervised audio representation learning recently, and transformer-based downstream models that use Multi-head Self-Attention and Feed-Forward Network (MSA-FFN) as the basic block have delivered promising transfer performance on downstream speech tasks. However, it is unclear whether the traditional transformer architecture is appropriate for downstream transfer. In this paper, we adopt a block architecture search (BAS) strategy to explore this issue, taking speech emotion recognition as an example. We find that 1) it is crucial to incorporate an FFN-like representation learning module, without an MSA design, in the early stages of the downstream model; and 2) given self-supervised features, a simple FFN is sufficient for the downstream task. This work can serve as a source of inspiration for other downstream speech tasks that utilize self-supervised features.
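
To make the contrast between the two block types concrete, here is a minimal PyTorch sketch of a standard MSA-FFN transformer block, an FFN-only block (the kind the abstract suggests for the early downstream stages), and a simple downstream head over frozen self-supervised features. The dimensions, pooling, class count, and class names are illustrative assumptions, not the paper's actual BAS search space or implementation.

```python
import torch
import torch.nn as nn

class MSAFFNBlock(nn.Module):
    """Standard transformer block: Multi-head Self-Attention + Feed-Forward Network."""
    def __init__(self, dim=768, heads=8, ffn_mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),
            nn.GELU(),
            nn.Linear(ffn_mult * dim, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.msa(h, h, h, need_weights=False)[0]  # residual over attention
        return x + self.ffn(self.norm2(x))                # residual over FFN

class FFNOnlyBlock(nn.Module):
    """FFN-like block without any MSA, per the first finding of the abstract."""
    def __init__(self, dim=768, ffn_mult=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),
            nn.GELU(),
            nn.Linear(ffn_mult * dim, dim),
        )

    def forward(self, x):
        return x + self.ffn(self.norm(x))

class DownstreamSERHead(nn.Module):
    """Hypothetical downstream head over frozen self-supervised features:
    FFN-only blocks, then mean pooling and a linear emotion classifier."""
    def __init__(self, dim=768, num_emotions=4):
        super().__init__()
        self.blocks = nn.Sequential(FFNOnlyBlock(dim), FFNOnlyBlock(dim))
        self.classifier = nn.Linear(dim, num_emotions)

    def forward(self, feats):  # feats: (batch, time, dim)
        x = self.blocks(feats)
        return self.classifier(x.mean(dim=1))  # pool over time, then classify

# Usage: feats stand in for frozen features from a model such as wav2vec 2.0 / HuBERT.
feats = torch.randn(2, 100, 768)
logits = DownstreamSERHead()(feats)
print(logits.shape)  # torch.Size([2, 4])
```

The design choice reflected here follows the abstract's two findings: the downstream head applies FFN-only representation learning before (or instead of) any MSA-FFN block, and with self-supervised features a plain FFN stack may already suffice for emotion classification.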