Lightweight Speech Intelligibility Prediction with Spectro-Temporal Modulation for Hearing-Impaired Listeners
Xiajie Zhou, Candy Olivia Mawalim, Huy Quoc Nguyen, Masashi Unoki
Hearing loss leads to reduced frequency resolution and impaired temporal resolution, making it difficult for listeners to distinguish similar sounds and to perceive speech dynamics in noise. To capture these perceptual degradations, this study proposes a speech intelligibility prediction framework that uses spectro-temporal modulation (STM) representations as input to lightweight convolutional neural network (CNN) models. We design two models: STM-CNN-SE (E020a), which incorporates a squeeze-and-excitation (SE) block, and STM-CNN-ECA (E020b), which uses an efficient channel attention (ECA) block and richer input features. Compared with the Hearing-Aid Speech Perception Index (HASPI) baseline, experiments on the CPC3 development dataset show that E020a and E020b reduce root-mean-square error (RMSE) by 11.2% and 12.6%, respectively. These results demonstrate the effectiveness of STM-based CNN architectures for speech intelligibility prediction under hearing-loss conditions.
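The abstract names two channel-attention mechanisms, SE and ECA, as the distinguishing components of the two models. The sketch below (PyTorch) illustrates how each block reweights the channels of a feature map such as an STM representation; the layer sizes, reduction ratio, and input shape are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of SE and ECA channel-attention blocks, assuming PyTorch.
# Hyperparameters (reduction=16, gamma=2, b=1) follow the original SE/ECA
# papers' defaults and are NOT taken from this work.
import math
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global pooling + two FC layers rescale channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))       # squeeze: (B, C) channel descriptors
        return x * w.view(b, c, 1, 1)         # excite: per-channel rescaling

class ECABlock(nn.Module):
    """Efficient channel attention: a 1-D conv over pooled channel descriptors."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Kernel size adapted to channel count, as in the ECA paper; forced odd.
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 else k + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3)).view(b, 1, c)  # (B, 1, C) pooled descriptors
        w = torch.sigmoid(self.conv(y)).view(b, c, 1, 1)
        return x * w

# Toy usage on an STM-like 4-D input (batch, channels, modulation axes):
feats = torch.randn(4, 32, 16, 64)
print(SEBlock(32)(feats).shape, ECABlock(32)(feats).shape)
```

The design difference the two models exploit: SE learns cross-channel interactions through a bottlenecked fully connected mapping, whereas ECA replaces it with a single lightweight 1-D convolution over neighboring channels, trading capacity for far fewer parameters.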