Non-Intrusive Speech Intelligibility Prediction Using Whisper ASR and Wavelet Scattering Embeddings for Hearing-Impaired Individuals
Rantu Buragohain, Jejariya Ajaybhai, Aashish Kumar Singh, Karan Nathwani, Sunil Kumar Kopparapu
Hearing loss affects a significant portion of the population worldwide, leading to increased use of hearing aids. The ability to accurately predict the intelligibility of speech, especially in noisy environments, can go a long way toward improving hearing-aid performance. As part of the 3rd Clarity Prediction Challenge (CPC3), we present a deep neural network framework that combines the contextual depth of Whisper-based embeddings with the resilience of Wavelet Scattering Transform (WST) embeddings to enable robust speech intelligibility (SI) prediction. The Whisper-based embeddings are the outputs of the final encoder (1024-dimensional) and final decoder (768-dimensional) layers of a pre-trained encoder-decoder transformer trained on 680k hours of multilingual data, computed from the 80-channel log-Mel spectrogram of the input waveform; the second-order WST-based embeddings, with scattering scale J=6 and Q=8 wavelets per octave, are extracted from the raw waveform and provide deformation-stable time-frequency representations. We propose five systematically designed models: (Model #1) encode-only, leveraging embeddings from the final encoder layer of Whisper-medium; (Model #2) decode-only, utilizing the final decoder layer embeddings of Whisper-small; (Model #3) encode-decode, a fusion model that combines both encoder and decoder embeddings; (Model #4) hybrid, a model that combines encode-decode and WST-based embeddings; and (Model #5) ensemble, an average of Models #1, #2, and #3, with and without post-processing. Each embedding stream is independently processed by bidirectional long short-term memory (Bi-LSTM) layers with attention pooling, followed by fully connected (linear) layers that predict the SI score. Our best-performing ensemble, combining the outputs of the first three models with and without post-processing, achieves root mean square errors (RMSE) of 21.87 and 22.66 on the development set and 25.31 and 25.3 on the evaluation set, respectively.
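The following is a minimal sketch, not the authors' released code, of the kind of pipeline the abstract describes: Whisper final-encoder embeddings and second-order wavelet scattering features feeding a Bi-LSTM with attention pooling and a linear regression head for the SI score. The helper names, hidden sizes, and the single-stream setup are illustrative assumptions; only the standard openai-whisper, Kymatio, and PyTorch APIs are used.

```python
import torch
import torch.nn as nn
import whisper                      # openai-whisper (pip install openai-whisper)
from kymatio.torch import Scattering1D


def whisper_encoder_embedding(audio_16k: torch.Tensor, model) -> torch.Tensor:
    """Final-encoder embeddings (1024-dim for Whisper-medium)."""
    audio = whisper.pad_or_trim(audio_16k)                      # 30 s context window
    mel = whisper.log_mel_spectrogram(audio).to(model.device)   # 80-channel log-Mel
    with torch.no_grad():
        return model.encoder(mel.unsqueeze(0))                  # (1, frames, d_model)


def wst_embedding(audio: torch.Tensor, J: int = 6, Q: int = 8) -> torch.Tensor:
    """Second-order wavelet scattering coefficients from the raw waveform."""
    scattering = Scattering1D(J=J, shape=audio.shape[-1], Q=Q)
    with torch.no_grad():
        coeffs = scattering(audio)                              # (n_coeffs, time)
    return coeffs.transpose(-1, -2).unsqueeze(0)                # (1, time, n_coeffs)


class AttentionPool(nn.Module):
    """Weighted average over time using learned attention scores."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                                       # x: (B, T, dim)
        w = torch.softmax(self.score(x), dim=1)                 # (B, T, 1)
        return (w * x).sum(dim=1)                               # (B, dim)


class SIPredictor(nn.Module):
    """Bi-LSTM + attention pooling + linear layers -> scalar SI score."""
    def __init__(self, in_dim: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.pool = AttentionPool(2 * hidden)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(),
                                  nn.Linear(64, 1))

    def forward(self, x):                                       # x: (B, T, in_dim)
        h, _ = self.lstm(x)
        return self.head(self.pool(h)).squeeze(-1)              # (B,)
```

For an encode-only stream (Model #1), `in_dim` would be 1024 (Whisper-medium encoder); for a decode-only stream (Model #2), 768 (Whisper-small decoder). The decoder extraction, the encode-decode fusion, and the WST hybrid combination are omitted here for brevity.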