Intrusive Intelligibility Prediction with ASR Encoders
Hanlin Yu, Haoshuai Zhou, Boxuan Cao, Changgeng Mo, Linkai Li, Shan Xiang Wang
We present a reference-aware (intrusive) speech intelligibility predictor developed for the 3rd Clarity Prediction Challenge (CPC3). Our system (E025) ranked first on the official leaderboard, with an RMSE of 22.36 and a correlation of 0.83 on the development set, and an RMSE of 24.98 and a correlation of 0.80 on the evaluation set. Previous sentence-level predictors plateaued at RMSEs above 20 on CPC2, which highlights how difficult it is to estimate intelligibility without a clean reference. We therefore ask whether incorporating clean reference signals improves sentence-level predictions for hearing-impaired listeners. Our approach combines mid-depth representations from speech foundation models (SFMs) with multi-scale CNN features. Specifically, we identify and ensemble layers 10–17 of Canary-1B-Flash and Parakeet-TDT-0.6B-V2, fuse them with a CNN front end, and apply cross-attention between the reference and ear streams at both the temporal and layer levels. A severity embedding further conditions predictions on listener profiles. The resulting model generalizes well across datasets and achieves state-of-the-art performance on CPC3. These findings indicate that integrating clean reference signals with carefully selected SFM layers enables more accurate and robust intelligibility prediction for hearing-impaired listeners.
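To make the cross-attention step concrete, the following is a minimal, dependency-free sketch of single-head scaled dot-product cross-attention in which frames of the ear stream (queries) attend to frames of the clean reference stream (keys and values). It is an illustration under stated assumptions, not the submitted system: the function names, the toy feature values, and the omission of learned projections, multiple heads, and the layer-level attention are all simplifications introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of T_q frames, each a list of d floats (ear stream)
    keys:    list of T_k frames, each a list of d floats (reference stream)
    values:  list of T_k frames, each a list of d_v floats (reference stream)
    Returns (outputs, weights): T_q x d_v attended frames, T_q x T_k weights.
    Learned query/key/value projections are omitted (assumption for brevity).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    value_dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this ear frame to every reference frame.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of reference frames.
        outputs.append(
            [sum(w[j] * values[j][c] for j in range(len(values)))
             for c in range(value_dim)]
        )
    return outputs, weights


# Hypothetical toy features: 3 ear-stream frames, 2 reference frames, dim 2.
ear = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref = [[4.0, 0.0], [0.0, 4.0]]
attended, weights = cross_attention(ear, ref, ref)
```

Each ear frame receives a convex combination of reference frames, weighted toward the reference frames it most resembles. The same operation could in principle be applied along the layer axis instead of the time axis by treating per-layer feature vectors as the sequence elements.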