ISCA Archive Interspeech 2023

Advanced RawNet2 with Attention-based Channel Masking for Synthetic Speech Detection

Jing Li, Yanhua Long, Yijie Li, Dongxing Xu

Automatic speaker verification (ASV) systems are vulnerable to spoofing attacks, particularly previously unseen attacks. Because text-to-speech and voice conversion algorithms are so diverse, improving the generalization ability of synthetic speech detection systems remains a challenging problem. To address it, we propose an advanced RawNet2 (ARawNet2), which improves RawNet2 by introducing an attention-based channel masking (ACM) block with three main components: squeeze-and-excitation, channel masking, and global-local feature aggregation. The effectiveness of the proposed system is evaluated on both the ASVspoof 2019 and ASVspoof 2021 datasets. Specifically, ARawNet2 achieves an equal error rate (EER) of 4.61% on the ASVspoof 2019 logical access (LA) task. On the ASVspoof 2021 LA and speech deepfake (DF) tasks, it achieves EERs of 8.36% and 19.03%, corresponding to relative EER reductions of 12.00% and 14.97% over the RawNet2 baseline, respectively.
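
To make the reported metric concrete, the following is a minimal sketch of how an EER and a relative EER reduction can be computed. It is not the authors' evaluation code. The function names (`compute_eer`, `relative_reduction`, `implied_baseline`), the threshold sweep, and the convention that a higher score means "more likely bona fide" are all assumptions made for illustration. The baseline EERs are not stated in the abstract; the sketch only back-computes the values implied by the reported relative reductions.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate by sweeping over observed scores.

    Assumption: a higher score means "more likely bona fide".
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # Keep the operating point where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction of new_eer with respect to baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer


def implied_baseline(new_eer, rel_reduction):
    """Baseline EER implied by a new EER and a reported relative reduction."""
    return new_eer / (1.0 - rel_reduction)


if __name__ == "__main__":
    # Toy scores (hypothetical, not from the paper).
    print(compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6]))
    # Baselines implied by the abstract's figures (in percent).
    print(round(implied_baseline(8.36, 0.1200), 2))   # ASVspoof 2021 LA
    print(round(implied_baseline(19.03, 0.1497), 2))  # ASVspoof 2021 DF
```

Under these assumptions, the 8.36% and 19.03% EERs with 12.00% and 14.97% relative reductions imply RawNet2 baseline EERs of roughly 9.50% and 22.38% on the ASVspoof 2021 LA and DF tasks.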