ISCA Archive SLaTE 2023

SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation

Yassine EL Kheir, Shammur Chowdhury, Ahmed Ali, Hamdy Mubarak, Shazia Afzal

The lack of labeled second language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender, a fine-grained data augmentation pipeline that generates mispronunciation errors to overcome this data scarcity. SpeechBlender uses a variety of masks to target different regions of phonetic units and uses mixing factors to linearly interpolate raw speech signals while augmenting pronunciation. The masks facilitate smooth blending of the signals, generating more effective samples than the 'Cut/Paste' method. The proposed technique yields significant improvements at the phoneme level on two L2 datasets. On the publicly available English Speechocean762 test set, we achieve state-of-the-art results for ASR-dependent mispronunciation models, with a notable 5.0% gain in Pearson Correlation Coefficient (PCC). Additionally, we benchmark the Arabic AraVoiceL2 test set and demonstrate a substantial 4.6% increase in F1-score.
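
The mask-and-mix idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `tapered_mask` and `blend_segments`, the raised-cosine mask shape, and the assumption that both segments are already time-aligned and of equal length are all hypothetical choices made for illustration. The sketch only shows how a mask can select a region of a phoneme segment while a mixing factor linearly interpolates two raw waveforms inside that region.

```python
import numpy as np


def tapered_mask(n, start, end, ramp):
    """Hypothetical mask: 0 outside [start, end), 1 inside,
    with raised-cosine ramps of `ramp` samples at both edges."""
    if not (0 <= start < end <= n):
        raise ValueError("need 0 <= start < end <= n")
    if ramp < 0 or 2 * ramp > end - start:
        raise ValueError("ramp does not fit inside the masked region")
    mask = np.zeros(n, dtype=np.float64)
    mask[start:end] = 1.0
    if ramp > 0:
        # Rising half-cosine sampled at bin centres, strictly between 0 and 1.
        fade = 0.5 * (1.0 - np.cos(np.pi * (np.arange(ramp) + 0.5) / ramp))
        mask[start:start + ramp] = fade
        mask[end - ramp:end] = fade[::-1]
    return mask


def blend_segments(good, donor, mask, lam):
    """Linearly interpolate two equal-length waveform segments.

    Where mask == 1 the output is lam * good + (1 - lam) * donor;
    where mask == 0 the output is the unmodified `good` signal.
    Intermediate mask values give a smooth transition between the two.
    """
    good = np.asarray(good, dtype=np.float64)
    donor = np.asarray(donor, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    if not (good.shape == donor.shape == mask.shape):
        raise ValueError("good, donor and mask must have the same shape")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = lam * good + (1.0 - lam) * donor
    return mask * mixed + (1.0 - mask) * good


# Toy usage with synthetic signals standing in for two phoneme segments.
good = np.ones(8)
donor = np.zeros(8)
hard = tapered_mask(8, 2, 6, 0)   # abrupt edges, similar in spirit to cut/paste
soft = tapered_mask(8, 2, 6, 2)   # smoothed edges
out_hard = blend_segments(good, donor, hard, lam=0.25)
out_soft = blend_segments(good, donor, soft, lam=0.25)
```

With `ramp=0` the mask has abrupt edges, which resembles replacing a region outright; a positive `ramp` fades between the two signals at the region boundaries, which is one plausible way to obtain the smoother blending the abstract contrasts with the 'Cut/Paste' method.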