ISCA Archive Interspeech 2024

Balance, Multiple Augmentation, and Re-synthesis: A Triad Training Strategy for Enhanced Audio Deepfake Detection

Thien-Phuc Doan, Long Nguyen-Vu, Kihun Hong, Souhwan Jung

The detection of deepfake voices has become increasingly challenging. Finding the boundary that separates real and synthetic voices requires a good training set and an effective strategy. In this study, we introduce a novel training strategy that improves detection performance by actively assembling training mini-batches within the framework of Supervised Contrastive Learning. We argue that model robustness can be enhanced by balancing samples between classes, applying multiple speech augmentation methods, and training on re-synthesized samples. By carefully tuning the mini-batch composition, we surpass the performance and generalization of existing methods on several audio deepfake benchmarks, including the ASVspoof DF evaluation set and the In-the-Wild benchmark, where we achieve Equal Error Rates of 2.17% and 4.51%, respectively. The code for our experiments is available on GitHub.
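To make the triad strategy concrete, the sketch below illustrates one plausible reading of the abstract: mini-batches drawn with equal numbers of bona fide and spoofed clips, each clip expanded into several augmented views (which could include re-synthesized versions), and the views trained with a supervised contrastive loss (Khosla et al., 2020). This is not the authors' released code; the function names, signatures, and augmentation handling are illustrative assumptions.

```python
# Minimal sketch of balanced, multi-augmentation mini-batches trained with a
# supervised contrastive loss. Names such as build_balanced_batch and the
# `augmentations` callables are assumptions, not the paper's actual API.
import random
import torch
import torch.nn.functional as F


def build_balanced_batch(real_clips, fake_clips, n_per_class, augmentations):
    """Sample n_per_class clips from each class and apply every augmentation."""
    clips = random.sample(real_clips, n_per_class) + random.sample(fake_clips, n_per_class)
    labels = [0] * n_per_class + [1] * n_per_class
    views, view_labels = [], []
    for clip, label in zip(clips, labels):
        for aug in augmentations:          # e.g. noise, codec, re-synthesis
            views.append(aug(clip))
            view_labels.append(label)
    return torch.stack(views), torch.tensor(view_labels)


def supervised_contrastive_loss(embeddings, labels, temperature=0.07):
    """SupCon loss: same-label embeddings are positives, all others negatives."""
    z = F.normalize(embeddings, dim=1)
    labels = labels.to(z.device)
    sim = z @ z.t() / temperature                       # pairwise similarities
    # Exclude each sample's comparison with itself.
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: other views/samples sharing the same label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    # Average log-probability over each anchor's positives, then over anchors.
    loss = -pos_log_prob.sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()
```

Under this reading, class balancing keeps the positive and negative sets comparable in size within every batch, while the augmented and re-synthesized views give the contrastive objective varied positives to pull together per class.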