ISCA Archive Interspeech 2025

Hybrid Data Sampling for ASR: Integrating Acoustic Diversity and Transcription Uncertainty

Komei Hiruta, Yosuke Yamano, Hideaki Tamori

Efficiently selecting training data is crucial for improving automatic speech recognition (ASR) models while minimizing annotation costs. This research extends TypiClust, a data sampling method originally validated for images, to ASR by jointly considering acoustic diversity and transcription uncertainty. Our method clusters speech embeddings from Wav2Vec2 and prioritizes typical and low transcription-confidence samples within each cluster, ensuring selection of representative and hard-to-train samples. We evaluate our method on Japanese speech datasets, CSJ and ReazonSpeech, demonstrating that it achieves lower recognition error rates than random or single-criterion-based data selection. These results indicate that selecting data based on both diversity and uncertainty enhances speech recognition model performance while reducing annotation costs.
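As a rough illustration of the selection idea described in the abstract, the following minimal sketch clusters utterance embeddings and picks, from each cluster, the sample with the best combined typicality and uncertainty score. It is not the authors' implementation. The function names (`kmeans`, `typicality`, `hybrid_select`), the weighted-sum scoring rule, the `alpha` weight, and the use of `1 - confidence` as the uncertainty term are all assumptions made for illustration. Typicality is taken here as the inverse mean distance to the nearest neighbors in the same cluster, in the spirit of TypiClust. The embeddings would come from a model such as Wav2Vec2 and the confidences from an ASR model, but here they are plain Python lists.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)

    def assign():
        return [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign()


def typicality(points, idx, members, n_neighbors):
    """Inverse mean distance to the nearest neighbors within the cluster."""
    nearest = sorted(dist(points[idx], points[j]) for j in members if j != idx)
    nearest = nearest[:n_neighbors]
    if not nearest:  # singleton cluster
        return 1.0
    return 1.0 / (sum(nearest) / len(nearest) + 1e-8)


def hybrid_select(embeddings, confidences, n_clusters,
                  alpha=0.5, n_neighbors=5, seed=0):
    """Pick one sample index per cluster.

    Assumed score: alpha * normalized typicality
                   + (1 - alpha) * (1 - confidence).
    """
    labels = kmeans(embeddings, n_clusters, seed=seed)
    selected = []
    for c in range(n_clusters):
        members = [i for i, l in enumerate(labels) if l == c]
        if not members:
            continue
        typ = {i: typicality(embeddings, i, members, n_neighbors)
               for i in members}
        t_max = max(typ.values())
        score = {i: alpha * typ[i] / t_max
                    + (1 - alpha) * (1.0 - confidences[i])
                 for i in members}
        selected.append(max(members, key=score.get))
    return selected


# Toy data: two well-separated groups of 2-D "embeddings".
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
confidences = [0.9, 0.2, 0.8, 0.95, 0.9, 0.1]
picked = hybrid_select(embeddings, confidences, n_clusters=2, alpha=0.5)
```

In this sketch, `alpha=1.0` reduces to a purely typicality-based choice and `alpha=0.0` to a purely uncertainty-based one, which mirrors the single-criterion baselines mentioned in the abstract. How the two criteria are actually combined in the paper may differ.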