ISCA Archive Interspeech 2024

Wav2vec 2.0 Embeddings Are No Swiss Army Knife -- A Case Study for Multiple Sclerosis

Gábor Gosztolya, Mercedes Vetráb, Veronika Svindt, Judit Bóna, Ildikó Hoffmann

In the past few years, self-supervised learning has revolutionized automatic speech recognition. Self-supervised models such as wav2vec 2.0, owing to the generalization ability they acquire from pre-training on huge unannotated audio corpora, have also been claimed to be state-of-the-art feature extractors in paralinguistic and pathological applications. In this study, we test embeddings extracted from a wav2vec 2.0 model fine-tuned on the target language as features on a multiple sclerosis audio corpus, using three speech tasks. After comparing the resulting classification performance with that of traditional features such as ComParE functionals, ECAPA-TDNN embeddings and the activations of a hybrid HMM/DNN acoustic model, we found that, surprisingly, the wav2vec 2.0-based models produced only mediocre classification performance. In contrast, the decade-old ComParE functionals feature set consistently led to high scores. Our results also indicate that the number of features correlates surprisingly well with classification performance.
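The abstract describes using wav2vec 2.0 embeddings as utterance-level features for classification. A common way to obtain such features (an assumption for illustration; the abstract does not specify the pooling strategy or layer choice) is to mean-pool the frame-level hidden states of the model into a single fixed-size vector per utterance. A minimal NumPy sketch of that pooling step, with hypothetical dimensions:

```python
import numpy as np

def pool_embeddings(frame_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool frame-level embeddings of shape (T, D) into one
    utterance-level feature vector of shape (D,).

    In practice, frame_embeddings would be the hidden states of a
    (fine-tuned) wav2vec 2.0 model for one recording; here we use
    random values as a stand-in.
    """
    return frame_embeddings.mean(axis=0)

# Hypothetical utterance: 250 frames of 768-dimensional hidden states
# (768 is the hidden size of the base wav2vec 2.0 architecture).
rng = np.random.default_rng(0)
frames = rng.standard_normal((250, 768))

utt_vector = pool_embeddings(frames)
print(utt_vector.shape)  # (768,)
```

The resulting fixed-size vectors can then be fed to any standard classifier, making them directly comparable to feature sets such as the ComParE functionals, which likewise produce one vector per utterance.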