ISCA Archive Interspeech 2024

PAM: Prompting Audio-Language Models for Audio Quality Assessment

Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, Huaming Wang

Audio quality is a key performance metric for many audio processing tasks, including generative modeling; however, its objective measurement remains a challenge. Audio-Language Models (ALMs) are pre-trained on millions of audio-text pairs that may contain information about audio quality, such as the presence of artifacts or noise. Given an audio input and a text prompt about quality, an ALM can compute a similarity score between the two. We exploit this capability and introduce PAM, a truly reference-free metric for assessing audio quality across different audio processing tasks. Contrary to other “reference-free” metrics, PAM requires neither computing embeddings on a reference dataset nor training a task-specific model on a costly set of human listening scores. We extensively evaluate PAM against established metrics and newly collected human listening scores on four tasks: text-to-audio (TTA), text-to-music generation (TTM), text-to-speech (TTS), and deep noise suppression (DNS). We perform multiple ablation studies with controlled audio distortions, in-the-wild setups, and prompt choices. Our evaluation shows that, overall, PAM correlates strongly with human listening scores and performs better than existing metrics. These results demonstrate the potential of ALMs for computing a general-purpose audio quality metric. Code and human listening scores will be released.
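As a concrete illustration of the prompting scheme described above, the following Python sketch scores an audio clip by contrasting its similarity to a “clean” quality prompt against a “noisy” one, using a CLAP-style ALM. This is a minimal sketch under stated assumptions: the checkpoint name, the prompt pair, and the file path are illustrative choices, not necessarily the exact configuration used by PAM.

```python
import torch
import librosa
from transformers import ClapModel, ClapProcessor

# Load a CLAP-style audio-language model. The checkpoint is an assumption
# for this sketch; PAM builds on an ALM of this kind, not necessarily this one.
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Illustrative antonym quality prompts; the exact prompt pair is a design
# choice (the paper ablates prompt choices) and is hypothetical here.
prompts = ["the sound is clear and clean",
           "the sound is noisy and distorted"]

# Load audio at 48 kHz, the sampling rate this CLAP checkpoint expects.
audio, sr = librosa.load("example.wav", sr=48000)

inputs = processor(text=prompts, audios=audio, sampling_rate=sr,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_audio  # shape: (1, 2)

# Softmax over the two prompts; the probability assigned to the "clean"
# prompt serves as the quality score. No reference audio, reference
# embeddings, or trained regressor is needed.
score = logits.softmax(dim=-1)[0, 0].item()
print(f"quality score: {score:.3f}")
```

The contrast between an antonym prompt pair, rather than a single prompt's raw similarity, normalizes the score to a bounded probability and makes it comparable across clips; this mirrors the reference-free character of PAM, though the implementation details here are only a sketch.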