ISCA Archive Interspeech 2023

ECAPA++: Fine-grained Deep Embedding Learning for TDNN Based Speaker Verification

Bei Liu, Yanmin Qian

In this paper, we aim to bridge the performance gap between TDNN-based and 2D CNN-based speaker verification systems. Specifically, we propose three architectural enhancements to ECAPA-TDNN: 1) we follow a depth-first design to significantly increase network depth while maintaining its complexity; 2) we introduce recursive convolution to better capture fine-grained speaker information; 3) we propose a pyramid-based multi-path feature enhancement module to yield more discriminative speaker representations. Experiments on VoxCeleb show that our final model, named ECAPA++, achieves 25%, 23% and 24% relative improvements on Vox1-O, Vox1-E and Vox1-H respectively, while using 2.4x fewer parameters and 2.3x fewer FLOPs than the previous best TDNN-based system. It is also comparable in performance to state-of-the-art ResNet-based systems while being more computationally efficient. In addition, further performance gains can be achieved by fusing ECAPA++ with ResNet-based systems.
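The relative figures quoted above can be read with a small calculation. The sketch below assumes the improvements are measured on an error metric where lower is better (for example, equal error rate), which the abstract does not state explicitly. The function names and all numeric inputs are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


def reduction_factor(baseline: float, new: float) -> float:
    """How many times smaller `new` is than `baseline` (e.g. parameters or FLOPs)."""
    if new <= 0:
        raise ValueError("new must be positive")
    return baseline / new


# Hypothetical values for illustration only (NOT taken from the paper).
baseline_error = 1.00   # assumed baseline error rate, in percent
new_error = 0.75        # assumed error rate of the improved model, in percent
baseline_params = 24.0  # assumed baseline size, in millions of parameters
new_params = 10.0       # assumed size of the improved model, in millions

print(f"relative improvement: {relative_improvement(baseline_error, new_error):.0%}")
print(f"parameter reduction: {reduction_factor(baseline_params, new_params):.1f}x")
```

Under these assumed inputs, an error rate falling from 1.00% to 0.75% corresponds to a 25% relative improvement, and a model shrinking from 24M to 10M parameters is 2.4x smaller. The same two formulas apply to each test set and to the FLOPs comparison.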