ISCA Archive Interspeech 2022

Transformer Networks for Non-Intrusive Speech Quality Prediction

M K Jayesh, Mukesh Sharma, Praneeth Vonteddu, Mahaboob Ali Basha Shaik, Sriram Ganapathy

This paper presents the details of our speech quality prediction system submitted to the Conferencing Speech-2022 challenge. The challenge involved the task of non-intrusive speech quality assessment for online conferencing applications. We propose two approaches for speech quality prediction. The first approach combines a deep convolutional neural network (CNN) and a long short-term memory (LSTM) network, trained with Kullback-Leibler (KL) and cross-entropy (CE) loss functions, to estimate the mean opinion score (MOS). The second approach uses a transformer-based encoder network followed by attention pooling. We observe that the second method gives significant improvements over both our first method and the baselines provided by the challenge organizers in terms of Pearson correlation coefficient (PCC) and Spearman rank correlation coefficient (SRCC), along with reductions in root mean square error (RMSE). The model is also seen to generalize to unseen data resources in the evaluation dataset.
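The three evaluation metrics named above can be illustrated with a minimal, dependency-free sketch. It is not the challenge's scoring script: the function names are hypothetical, and the assumption here is that each metric is computed directly over paired subjective and predicted MOS values, without any per-condition averaging or score mapping that an official evaluation might apply.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rmse(x: Sequence[float], y: Sequence[float]) -> float:
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


if __name__ == "__main__":
    # Made-up illustrative values, not data from the paper.
    subjective_mos = [1.0, 2.0, 3.0, 4.0, 5.0]
    predicted_mos = [1.5, 1.9, 3.2, 3.8, 4.6]
    print("PCC :", pearson(subjective_mos, predicted_mos))
    print("SRCC:", spearman(subjective_mos, predicted_mos))
    print("RMSE:", rmse(subjective_mos, predicted_mos))
```

PCC and SRCC reward predictions that track the subjective scores linearly and in rank order, respectively, so higher is better, while RMSE measures absolute deviation on the MOS scale, so lower is better.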