ISCA Archive Interspeech 2025

Robust Vocal Intensity Prediction: Overcoming Dataset Bias with Pretrained Deep Models

Quentin Le Tellier, Marc Evrard, Albert Rilliard, Jean-Sylvain Liénard

Vocal intensity prediction has been investigated in previous studies, where machine-learning (ML) models (e.g., linear regression, SVM) trained on basic speech features (such as spectrograms and Mel-spectrograms) have demonstrated good prediction accuracy at the utterance level. In this work, we revisit these methods by evaluating them on two calibrated datasets, including a bilingual corpus containing both French and English speech. Our findings show that prior approaches struggle to generalize across datasets. To address this limitation, we leverage embeddings from the Wav2Vec 2.0 model as input features to classical ML regressors. While previous work used such embeddings for vocal intensity classification, our study demonstrates their effectiveness in addressing cross-dataset generalization issues for the regression task. The results confirm that pretrained embeddings significantly improve generalization and mitigate overfitting issues linked to traditional acoustic features.
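The pipeline described above (utterance-level features fed to a classical regressor, then scored on a different dataset than the one used for training) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the frame embeddings are synthetic stand-ins rather than real Wav2Vec 2.0 outputs, mean pooling over frames and ridge regression are assumed choices, and the names `mean_pool`, `fit_ridge`, and `make_toy_corpus` are hypothetical. The intensity values are arbitrary toy numbers, so the printed errors say nothing about the paper's results.

```python
import numpy as np


def mean_pool(frames):
    """Collapse (n_frames, dim) frame embeddings into one (dim,) utterance vector."""
    return np.asarray(frames, dtype=float).mean(axis=0)


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b


def predict(X, w, b):
    """Apply the fitted linear model to a matrix of utterance vectors."""
    return X @ w + b


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted intensities."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def make_toy_corpus(rng, n_utts, dim, w_true, shift):
    """Synthetic stand-in for a calibrated corpus (assumption, not real data).

    Each utterance has a latent vector; its frames are noisy copies of that
    vector, and its intensity label is a noisy linear function of the latent.
    `shift` moves the latent distribution to mimic a dataset mismatch.
    """
    X, y = [], []
    for _ in range(n_utts):
        latent = rng.normal(loc=shift, scale=1.0, size=dim)
        n_frames = int(rng.integers(20, 60))
        frames = latent + 0.1 * rng.normal(size=(n_frames, dim))
        X.append(mean_pool(frames))
        y.append(70.0 + latent @ w_true + 0.5 * rng.normal())
    return np.vstack(X), np.array(y)


rng = np.random.default_rng(0)
dim = 16  # toy size; real embeddings would be much larger
w_true = rng.normal(size=dim)

# Dataset A is used for training, dataset B only for testing.
X_a, y_a = make_toy_corpus(rng, 200, dim, w_true, shift=0.0)
X_b, y_b = make_toy_corpus(rng, 100, dim, w_true, shift=0.5)

w, b = fit_ridge(X_a, y_a, alpha=1.0)
print("within-dataset MAE:", mae(y_a, predict(X_a, w, b)))
print("cross-dataset MAE: ", mae(y_b, predict(X_b, w, b)))
```

With real data, `make_toy_corpus` would be replaced by a step that extracts frame-level embeddings from a pretrained model for each utterance and pairs the pooled vector with the calibrated intensity label. The comparison of interest is the gap between the within-dataset and cross-dataset errors.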