ISCA Archive Interspeech 2024

Fine-tune Pre-Trained Models with Multi-Level Feature Fusion for Speaker Verification

Shengyu Peng, Wu Guo, Haochen Wu, Zuoliang Li, Jie Zhang

In this paper, we consider speaker verification by fine-tuning a pre-trained model (PTM) with multi-level features, including multi-layer features from the PTM and hand-crafted features. The proposed framework consists of a PTM front-end and a Dual-Branch ECAPA-TDNN (DBE) back-end. For the front-end, we propose an attention-based fusion module (AFM) to merge features from the deep and shallow layers of the PTM. To further enhance performance, we add an auxiliary speaker loss after the last layer of the PTM. In the DBE back-end, the two branches take the aggregated PTM features and the FBank features as inputs, and the AFM is further used to merge the dual-branch features so that they provide complementary information. Experimental results on the VoxCeleb datasets confirm the effectiveness of the proposed method across different PTMs.
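
The abstract does not specify the exact formulation of the attention-based fusion module, so the sketch below is only a minimal illustration of the general idea: learning attention weights over the stacked layer outputs of a PTM and summing them. The class name, the mean-pooled scoring, and all tensor shapes are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of attention-based fusion over multi-layer PTM features.
# Shapes and scoring scheme are assumptions; the paper's AFM may differ.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Weights layer-wise feature maps with learned attention and sums them."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # One attention score per layer, computed from each layer's mean vector.
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (batch, num_layers, time, feat_dim)
        pooled = layer_feats.mean(dim=2)                  # (batch, num_layers, feat_dim)
        weights = torch.softmax(self.score(pooled), dim=1)  # (batch, num_layers, 1)
        # Broadcast the per-layer weights over time and feature dimensions.
        fused = (layer_feats * weights.unsqueeze(2)).sum(dim=1)  # (batch, time, feat_dim)
        return fused


if __name__ == "__main__":
    fusion = AttentionFusion(feat_dim=768)
    dummy = torch.randn(2, 12, 100, 768)  # e.g. 12 transformer layers of a PTM
    print(fusion(dummy).shape)            # torch.Size([2, 100, 768])
```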