ISCA Archive Interspeech 2025

Beyond Traditional Speech Modifications: Utilizing Self-Supervised Features for Enhanced Zero-Shot Children ASR

Abhijit Sinha, Hemant Kumar Kathania, Mikko Kurimo

Zero-shot automatic speech recognition (ASR) for children is challenging due to pronounced acoustic and linguistic mismatches, speaker variability, and limited annotated data. This work utilizes self-supervised learning (SSL) features to address these challenges without requiring child-specific data. We perform a layer-wise analysis of three SSL models (Wav2Vec2, HuBERT, and Data2Vec) to identify optimal representations for zero-shot children ASR. Our results show that features from specific layers (e.g., layer 22 of Wav2Vec2) capture robust, speaker-invariant phonetic information and significantly improve recognition accuracy, reducing the word error rate (WER) from 10.65% to 5.15%, a 51.64% relative improvement over the Wav2Vec2 baseline. Additionally, while conventional acoustic modifications (pitch, speaking rate, formant) enhance performance in traditional systems, they yield minimal gains for SSL-based models, highlighting the intrinsic speaker invariance of SSL representations.
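
The relative improvement quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `relative_wer_reduction` is hypothetical and not part of the authors' code, and it assumes the usual definition of relative WER reduction, (baseline - new) / baseline.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer over baseline_wer, in percent.

    Hypothetical helper for illustration; assumes the standard definition
    100 * (baseline - new) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER values reported in the abstract (in percent).
baseline = 10.65  # Wav2Vec2 baseline
best = 5.15       # best layer-wise SSL features

reduction = relative_wer_reduction(baseline, best)
print(f"{reduction:.2f}% relative WER reduction")  # prints "51.64% relative WER reduction"
```

The absolute reduction is 5.50 percentage points; dividing by the 10.65% baseline gives the 51.64% relative figure.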