Non-intrusive Speech Intelligibility Prediction Model for Hearing Aids using Multi-domain Fused Features
Guojian Lin, Fei Chen
Automatic speech intelligibility prediction systems play an important role in the development of hearing aids. Across the first two Clarity Prediction Challenges, speech foundation models (SFMs) have shown remarkable performance on the speech intelligibility prediction task. In this paper, we propose a non-intrusive speech intelligibility assessment model for hearing aids with multi-domain fused SFM embedding representations. The proposed model employs left- and right-ear branches to process the input speech signals, fusing frame-level representations from three pretrained SFMs: HuBERT, Whisper, and M2D-CLAP. Moreover, the model utilizes a Bi-LSTM layer and a multi-head attention layer to process the fused representations in both the frame and feature dimensions, generating frame-level intelligibility scores. Finally, the model outputs the predicted intelligibility score through global average pooling of the frame-level scores. We evaluated the root mean square error and correlation with single SFM representations and multi-domain fused representations on the third Clarity Prediction Challenge dataset. Experimental results demonstrate that multi-domain fused features capture comprehensive speech information effectively, and that their performance relies heavily on the best-performing Whisper representations.
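The fusion-and-pooling pipeline described above can be illustrated with a minimal NumPy sketch for one ear branch. All shapes and the scoring head below are illustrative assumptions, not the paper's actual configuration; the Bi-LSTM and multi-head attention stack is replaced here by a random per-frame linear head just to show the data flow from fused frame embeddings to a single utterance-level score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame-level embeddings from the three pretrained SFMs for one
# ear branch, assumed already aligned to the same number of frames T (in
# practice the models' frame rates differ and would need resampling).
T = 100
hubert = rng.standard_normal((T, 768))    # illustrative HuBERT hidden size
whisper = rng.standard_normal((T, 512))   # illustrative Whisper encoder size
m2d = rng.standard_normal((T, 768))       # illustrative M2D-CLAP size

# Multi-domain fusion via frame-wise concatenation (one simple fusion choice).
fused = np.concatenate([hubert, whisper, m2d], axis=-1)   # shape (T, 2048)

# Stand-in for the Bi-LSTM + multi-head attention stack: a linear head with
# a sigmoid, mapping each frame to a scalar intelligibility score in [0, 1].
w = rng.standard_normal(fused.shape[-1]) / np.sqrt(fused.shape[-1])
frame_scores = 1.0 / (1.0 + np.exp(-(fused @ w)))          # shape (T,)

# Utterance-level prediction via global average pooling of frame scores.
utterance_score = float(frame_scores.mean())
print(fused.shape, frame_scores.shape, utterance_score)
```

In the actual model, the two ear branches would each produce such a score path before the final prediction; this sketch only traces the shape transformations for a single branch.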